Another optimization question - can I speed up this 32-bit multiply?

Just some disassemblies I did while trying to optimize the C code:

Here is the original code, where I did the obvious thing: shift out the unneeded low bits and move the high byte into place for my 12-bit sample:

// Calculate 12-bit sample:
	sample = ((0X80 ^ playpos[1])<<4) | playpos[0]>>4;
    1e64:	81 81       	ldd	r24, Z+1	; 0x01
    1e66:	80 58       	subi	r24, 0x80	; 128
    1e68:	48 2f       	mov	r20, r24
    1e6a:	50 e0       	ldi	r21, 0x00	; 0
    1e6c:	b4 e0       	ldi	r27, 0x04	; 4
    1e6e:	44 0f       	add	r20, r20
    1e70:	55 1f       	adc	r21, r21
    1e72:	ba 95       	dec	r27
    1e74:	e1 f7       	brne	.-8      	; 0x1e6e <__vector_13+0xfc>
    1e76:	80 81       	ld	r24, Z
    1e78:	90 e0       	ldi	r25, 0x00	; 0
    1e7a:	a4 e0       	ldi	r26, 0x04	; 4
    1e7c:	95 95       	asr	r25
    1e7e:	87 95       	ror	r24
    1e80:	aa 95       	dec	r26
    1e82:	e1 f7       	brne	.-8      	; 0x1e7c <__vector_13+0x10a>
    1e84:	48 2b       	or	r20, r24
    1e86:	59 2b       	or	r21, r25

Then I figured that, if I were optimizing this by hand, a shift by 8 would surely be easier to optimize away (it is just a byte move), so I tried that instead:

	sample = 0X80 ^ playpos[1];
    1e64:	81 81       	ldd	r24, Z+1	; 0x01
	sample = sample << 8;
    1e66:	38 2f       	mov	r19, r24
    1e68:	30 58       	subi	r19, 0x80	; 128
    1e6a:	20 e0       	ldi	r18, 0x00	; 0
	sample = sample | playpos[0];
    1e6c:	80 81       	ld	r24, Z
    1e6e:	48 2f       	mov	r20, r24
    1e70:	50 e0       	ldi	r21, 0x00	; 0
    1e72:	42 2b       	or	r20, r18
    1e74:	53 2b       	or	r21, r19
	sample = sample >> 4;
    1e76:	a4 e0       	ldi	r26, 0x04	; 4
    1e78:	56 95       	lsr	r21
    1e7a:	47 95       	ror	r20
    1e7c:	aa 95       	dec	r26
    1e7e:	e1 f7       	brne	.-8      	; 0x1e78 <__vector_13+0x106>

I haven't actually counted the cycles, but the second version has one fewer shift loop (a single brne loop instead of two), so it runs four fewer loop iterations. It also looks simpler.
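To check that the two C versions above really compute the same 12-bit value, here is a minimal host-side sketch. It assumes that sample is a uint16_t and that playpos points at unsigned bytes, with the low byte first. The function names sample_two_shifts, sample_one_shift and count_mismatches are made up for this illustration and are not from the original code.

```c
#include <stdint.h>

/* Variant 1: shift each byte by 4 separately, then OR the halves together. */
static uint16_t sample_two_shifts(const uint8_t *playpos)
{
    return (uint16_t)(((0x80 ^ playpos[1]) << 4) | (playpos[0] >> 4));
}

/* Variant 2: assemble the 16-bit word first (the shift by 8 is only a
 * byte move), then do a single shift right by 4. */
static uint16_t sample_one_shift(const uint8_t *playpos)
{
    uint16_t sample = (uint16_t)(0x80 ^ playpos[1]);
    sample = (uint16_t)(sample << 8);
    sample |= playpos[0];
    return (uint16_t)(sample >> 4);
}

/* Compare both variants over every possible pair of input bytes and
 * return how many inputs give different results. */
static unsigned long count_mismatches(void)
{
    unsigned long mismatches = 0;
    for (unsigned hi = 0; hi < 256; hi++) {
        for (unsigned lo = 0; lo < 256; lo++) {
            const uint8_t bytes[2] = { (uint8_t)lo, (uint8_t)hi };
            if (sample_two_shifts(bytes) != sample_one_shift(bytes))
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main() on the PC should report zero if the two expressions are interchangeable under these assumptions. If sample were a signed type, the final right shift could behave differently, so that case would need its own check.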