Another optimization question - can I speed up this 32-bit multiply?

This thread leads me to another question:

If the chip has a hardware multiplier that takes two clock cycles, why doesn't the compiler use it for shift operations instead of generating a loop of single-bit shifts?
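
To make the idea concrete, here is a rough sketch of what I mean, written in portable C rather than for any particular chip. The function names are made up for illustration, and I'm assuming a 16-bit operand width and a multiplier that gives a 32-bit product. A left shift by n is a multiply by 2^n keeping the low word; a logical right shift by n is a multiply by 2^(16-n) keeping the high word. Whether this actually beats the shift loop depends on how much it costs to load the multiplier operands and read the result back, which I haven't measured.

```c
#include <stdint.h>

/* Left shift as a multiply by 2^n, keeping the low 16 bits.
   Hypothetical helper; valid for n in 0..15. */
uint16_t shl_via_multiply(uint16_t x, unsigned n)
{
    uint16_t factor = (uint16_t)(1u << n);    /* 2^n */
    return (uint16_t)((uint32_t)x * factor);  /* low word of the product */
}

/* Logical right shift as a multiply by 2^(16-n), keeping the high
   16 bits of the 32-bit product. Hypothetical helper; n in 1..16. */
uint16_t shr_via_multiply(uint16_t x, unsigned n)
{
    uint32_t product = (uint32_t)x * ((uint32_t)1 << (16u - n));
    return (uint16_t)(product >> 16);         /* high word of the product */
}

/* Reference: the loop of single-bit shifts the compiler emits today. */
uint16_t shl_via_loop(uint16_t x, unsigned n)
{
    while (n--) {
        x = (uint16_t)(x << 1);
    }
    return x;
}
```

For a constant shift count the factor is a compile-time constant, so the multiply version would cost one multiply no matter how large n is, while the loop grows with n.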