Another optimization question - can I speed up this 32 bit multiply?

Shift by 12:

(1)    1e9c:	2c e0       	ldi	r18, 0x0C	; 12
(1)    1e9e:	b6 95       	lsr	r27
(1)    1ea0:	a7 95       	ror	r26
(1)    1ea2:	97 95       	ror	r25
(1)    1ea4:	87 95       	ror	r24
(1)    1ea6:	2a 95       	dec	r18
(1/2) 1ea8:	d1 f7       	brne	.-12     	; 0x1e9e <__vector_13+0x12c>
(1)    1eaa:	28 2f       	mov	r18, r24

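I count 7 cycles in that loop (4 for the lsr/ror chain, 1 for the dec, 2 for a taken brne), x12, -1 because the last brne falls through, so 83 cycles, +1 for the ldi and +1 for the mov, so 85.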

And now the shift by 8 then a shift by 4:

(1)    1e9c:	89 2f       	mov	r24, r25
(1)    1e9e:	9a 2f       	mov	r25, r26
(1)    1ea0:	ab 2f       	mov	r26, r27
(1)    1ea2:	bb 27       	eor	r27, r27
(1)    1ea4:	9c 01       	movw	r18, r24
	sample = sample >> 4;	// For 16-bit sample input.  (output 12 bit sample)
(1)    1ea6:	84 e0       	ldi	r24, 0x04	; 4
(1)    1ea8:	36 95       	lsr	r19
(1)    1eaa:	27 95       	ror	r18
(1)    1eac:	8a 95       	dec	r24
(1/2) 1eae:	e1 f7       	brne	.-8      	; 0x1ea8 <__vector_13+0x136>

Jeez, I can already tell before I even add it up that that is going to be way less.

Let's see, 5 cycles for the shift by 8, 1 for the ldi, and 5 cycles in that loop x4, -1 because the last brne falls through. (The ldi isn't part of the loop: the disassembly comment on the brne shows it jumping back to 0x1ea8, which is the lsr.)
That gives me a total of.... 25 cycles! Holy crap that is more than 3x as fast.
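
As a sanity check on the counting, here is a small sketch that adds up the cycles from the per-instruction costs annotated in the two listings. It assumes brne costs 2 cycles when taken and 1 when it falls through, and that the ldi runs once, outside the loop. The function names are made up for illustration.

```c
/* Cycles for a dec/brne counted loop.
 * body_cycles covers everything in the loop except the brne.
 * brne costs 2 when taken and 1 on the final fall-through, hence the -1. */
static unsigned loop_cycles(unsigned body_cycles, unsigned iterations)
{
    return iterations * (body_cycles + 2u) - 1u;
}

/* First listing: ldi, 12 passes of (lsr, ror, ror, ror, dec), then mov. */
static unsigned shift12_cycles(void)
{
    return 1u + loop_cycles(5u, 12u) + 1u;
}

/* Second listing: 3 mov + eor + movw, then ldi and 4 passes of (lsr, ror, dec). */
static unsigned shift8_then_4_cycles(void)
{
    return 5u + 1u + loop_cycles(3u, 4u);
}
```

Counted this way, the shift by 12 comes to 85 cycles and the shift by 8 plus shift by 4 comes to 25.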

I guess the lesson here is if there's a shift by 8 that can be done, do it explicitly, because the compiler ain't gonna do it for ya. And don't shift by 4 twice if you can do a shift by 8 instead.
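
For reference, a rough sketch of what the two versions look like at the C level. Only the `sample = sample >> 4;` line is from my actual source; the function names and the `product` parameter are placeholders for illustration. Note the two only give the same answer when the 32-bit value fits in 24 bits, since the split version drops the top byte before the second shift.

```c
#include <stdint.h>

/* One 32-bit shift by 12: this is the form that got compiled
 * into the 12-pass loop over all four bytes. */
static uint16_t shift12_direct(uint32_t product)
{
    return (uint16_t)(product >> 12);
}

/* Shift by 8 first (which can be done with plain byte moves), truncate
 * to 16 bits, then shift the 16-bit value by 4.
 * Assumes product < 2^24, otherwise bits 24..27 are lost. */
static uint16_t shift12_split(uint32_t product)
{
    uint16_t sample = (uint16_t)(product >> 8);
    sample = sample >> 4;	/* For 16-bit sample input.  (output 12 bit sample) */
    return sample;
}
```

Whether the compiler turns the first form into byte moves on its own probably depends on the compiler version and optimization flags, so it's worth checking the disassembly either way.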