In my latest project I've needed to optimize a lot of code, and I learned that a fast way to shift a byte by 4 is to swap its nybbles and mask off the unwanted half. But when I looked at the compiler's assembly output, I noticed that >>4 and <<4 weren't being turned into a swap instruction; instead, the compiler was generating loops of single-bit shifts. Four single shifts in a row would be faster than that, and a single swap followed by a mask would be faster still.
I've had to resort to using inline assembly like this:
asm volatile("swap %0" : "=r" (dl) : "0" (dl)); // SWAP high and low nybbles of dl
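In case it helps anyone see why a swap plus a mask gives the same result as a shift by 4, here is a rough sketch in plain C. The function names (swap_nybbles, shr4_via_swap, shl4_via_swap, self_check) are just ones I made up for illustration, and swap_nybbles is only a portable model of what the SWAP instruction does, not what the compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of AVR SWAP: exchange the high and low nybbles of a byte.
   (Hypothetical helper name, for illustration only.) */
static uint8_t swap_nybbles(uint8_t x) {
    return (uint8_t)((x << 4) | (x >> 4));
}

/* x >> 4: after the swap, the old high nybble sits in the low nybble,
   so keep only the low nybble. */
static uint8_t shr4_via_swap(uint8_t x) {
    return swap_nybbles(x) & 0x0F;
}

/* x << 4: after the swap, the old low nybble sits in the high nybble,
   so keep only the high nybble. */
static uint8_t shl4_via_swap(uint8_t x) {
    return swap_nybbles(x) & 0xF0;
}

/* Check every possible byte value against the ordinary shift operators. */
static void self_check(void) {
    for (unsigned v = 0; v < 256; v++) {
        uint8_t x = (uint8_t)v;
        assert(shr4_via_swap(x) == (uint8_t)(x >> 4));
        assert(shl4_via_swap(x) == (uint8_t)(x << 4));
    }
}
```

On the AVR itself that works out to one swap and one AND with a constant, which is the sequence I was hoping the compiler would generate for >>4 and <<4 on a byte.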
I thought the issue might be that the compiler is set to optimize for code size rather than speed, but since the swap takes fewer bytes than the loop, I don't think that can be it. I'm still curious whether the compiler is set to optimize for speed or for size. It would be nice if we could choose in the IDE.
Anyway, I was talking to some guys over on avrfreaks, and that's where I learned that the latest GCC does produce a swap in those circumstances.