Just some disassemblies I did while trying to optimize the C code:
Here is the original code, where I did the obvious thing: shift out the unnecessary low bits and move the high byte into place for my 12-bit sample:
// Calculate 12-bit sample:
sample = ((0X80 ^ playpos[1])<<4) | playpos[0]>>4;
1e64: 81 81 ldd r24, Z+1 ; 0x01
1e66: 80 58 subi r24, 0x80 ; 128
1e68: 48 2f mov r20, r24
1e6a: 50 e0 ldi r21, 0x00 ; 0
1e6c: b4 e0 ldi r27, 0x04 ; 4
1e6e: 44 0f add r20, r20
1e70: 55 1f adc r21, r21
1e72: ba 95 dec r27
1e74: e1 f7 brne .-8 ; 0x1e6e <__vector_13+0xfc>
1e76: 80 81 ld r24, Z
1e78: 90 e0 ldi r25, 0x00 ; 0
1e7a: a4 e0 ldi r26, 0x04 ; 4
1e7c: 95 95 asr r25
1e7e: 87 95 ror r24
1e80: aa 95 dec r26
1e82: e1 f7 brne .-8 ; 0x1e7c <__vector_13+0x10a>
1e84: 48 2b or r20, r24
1e86: 59 2b or r21, r25
Then I figured that if I were optimizing it by hand, surely it would be easier to optimize out a shift by 8 instead, so I tried that:
sample = 0X80 ^ playpos[1];
1e64: 81 81 ldd r24, Z+1 ; 0x01
sample = sample << 8;
1e66: 38 2f mov r19, r24
1e68: 30 58 subi r19, 0x80 ; 128
1e6a: 20 e0 ldi r18, 0x00 ; 0
sample = sample | playpos[0];
1e6c: 80 81 ld r24, Z
1e6e: 48 2f mov r20, r24
1e70: 50 e0 ldi r21, 0x00 ; 0
1e72: 42 2b or r20, r18
1e74: 53 2b or r21, r19
sample = sample >> 4;
1e76: a4 e0 ldi r26, 0x04 ; 4
1e78: 56 95 lsr r21
1e7a: 47 95 ror r20
1e7c: aa 95 dec r26
1e7e: e1 f7 brne .-8 ; 0x1e78 <__vector_13+0x106>
I haven't actually counted the cycles, but there is one fewer shift loop in there (one brne instead of two), which saves four loop iterations. And it looks simpler.