Shift by 12:
(1) 1e9c: 2c e0 ldi r18, 0x0C ; 12
(1) 1e9e: b6 95 lsr r27
(1) 1ea0: a7 95 ror r26
(1) 1ea2: 97 95 ror r25
(1) 1ea4: 87 95 ror r24
(1) 1ea6: 2a 95 dec r18
(1/2) 1ea8: d1 f7 brne .-12 ; 0x1e9e <__vector_13+0x12c>
(1) 1eaa: 28 2f mov r18, r24
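I count 7 cycles per pass through that loop (the brne target 0x1e9e is the lsr, so the ldi only runs once), x12, -1 because the last brne falls through, so 83 cycles, +1 for the ldi and +1 for the mov, so 85.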
And now the shift by 8 then a shift by 4:
(1) 1e9c: 89 2f mov r24, r25
(1) 1e9e: 9a 2f mov r25, r26
(1) 1ea0: ab 2f mov r26, r27
(1) 1ea2: bb 27 eor r27, r27
(1) 1ea4: 9c 01 movw r18, r24
sample = sample >> 4; // For 16-bit sample input. (output 12 bit sample)
(1) 1ea6: 84 e0 ldi r24, 0x04 ; 4
(1) 1ea8: 36 95 lsr r19
(1) 1eaa: 27 95 ror r18
(1) 1eac: 8a 95 dec r24
(1/2) 1eae: e1 f7 brne .-8 ; 0x1ea8 <__vector_13+0x136>
Jeez, I can already tell before I even add it up that that is going to be way less.
Let's see, 5 cycles for the shift by 8, 1 for the ldi, and 5 cycles in that loop x4, -1 because the last brne falls through. (The ldi isn't part of the loop: the brne target 0x1ea8 is the lsr.)
That gives me a total of 25 cycles! Holy crap, that is more than 3x as fast.
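To double-check tallies like these, here is a rough sketch in C that does the arithmetic for both listings. It assumes the usual AVR timing of 1 cycle for each of the instructions shown, with brne costing 2 cycles when taken and 1 when it falls through, and it treats the ldi as sitting outside the loop, as the branch targets in the disassembly comments suggest. The function names are made up for illustration.

```c
#include <assert.h>

/* Cycles for a dec/brne counted loop.
 * body_cycles: everything inside the loop except the brne (dec included).
 * brne is assumed to cost 2 cycles when taken, 1 on the final fall-through. */
unsigned loop_cycles(unsigned body_cycles, unsigned iterations)
{
    assert(iterations > 0);
    return iterations * (body_cycles + 2) - 1;
}

/* ldi, then 12 x (lsr, ror, ror, ror, dec, brne), then mov */
unsigned cycles_shift12_loop(void)
{
    return 1 + loop_cycles(5, 12) + 1;
}

/* mov, mov, mov, eor, movw, then ldi and 4 x (lsr, ror, dec, brne) */
unsigned cycles_shift8_then_4(void)
{
    return 5 + 1 + loop_cycles(3, 4);
}
```

Under those assumptions the single 12-bit loop comes to 85 cycles and the byte shift plus 4-bit loop comes to 25.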
I guess the lesson here is if there's a shift by 8 that can be done, do it explicitly, because the compiler ain't gonna do it for ya. And don't shift by 4 twice if you can do a shift by 8 instead.
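For reference, a minimal sketch of what the two variants might look like at the C level. The function names and the 32-bit input type are my assumptions, not the original source (only the `sample = sample >> 4;` line shows up in the listing). The split version also assumes the value fits in 24 bits, so narrowing to 16 bits after the byte shift loses nothing. Whether the compiler keeps the two steps separate depends on its version and optimisation flags, so check the disassembly.

```c
#include <stdint.h>

/* One 12-bit shift on a 32-bit value: the kind of expression that came out
 * as a 12-pass loop in the first listing. */
uint32_t shift12_direct(uint32_t sample)
{
    return sample >> 12;
}

/* Shift by a whole byte first (can be done with register moves), then
 * shift the narrower 16-bit value by the remaining 4 bits.
 * Assumes sample < 2^24, so the cast drops only zero bits. */
uint16_t shift12_split(uint32_t sample)
{
    uint16_t upper = (uint16_t)(sample >> 8);
    return (uint16_t)(upper >> 4);
}
```

For inputs below 2^24 both functions return the same value, so the split form can be dropped in wherever the result only needs 12 to 16 bits.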