Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

```
// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023. 4095*1023 / 1024 = 4091 max.
1e96: 70 e0 ldi r23, 0x00 ; 0
1e98: 54 e0 ldi r21, 0x04 ; 4
1e9a: 75 95 asr r23
1e9c: 67 95 ror r22
1e9e: 5a 95 dec r21
1ea0: e1 f7 brne .-8 ; 0x1e9a <__vector_13+0x116>
1ea2: 30 e0 ldi r19, 0x00 ; 0
1ea4: 44 e0 ldi r20, 0x04 ; 4
1ea6: 22 0f add r18, r18
1ea8: 33 1f adc r19, r19
1eaa: 4a 95 dec r20
1eac: e1 f7 brne .-8 ; 0x1ea6 <__vector_13+0x122>
1eae: 62 2b or r22, r18
1eb0: 73 2b or r23, r19
1eb2: 88 27 eor r24, r24
1eb4: 77 fd sbrc r23, 7
1eb6: 80 95 com r24
1eb8: 98 2f mov r25, r24
1eba: 2f 85 ldd r18, Y+15 ; 0x0f
1ebc: 38 89 ldd r19, Y+16 ; 0x10
1ebe: 40 e0 ldi r20, 0x00 ; 0
1ec0: 50 e0 ldi r21, 0x00 ; 0
1ec2: 0e 94 56 18 call 0x30ac ; 0x30ac <__mulsi3>
1ec6: 9b 01 movw r18, r22
1ec8: ac 01 movw r20, r24
1eca: 9a e0 ldi r25, 0x0A ; 10
1ecc: 56 95 lsr r21
1ece: 47 95 ror r20
1ed0: 37 95 ror r19
1ed2: 27 95 ror r18
1ed4: 9a 95 dec r25
1ed6: d1 f7 brne .-12 ; 0x1ecc <__vector_13+0x148>
```

This is the pertinent section of code from the wave lib:

```
uint8_t dh, dl;
if (playing->BitsPerSample == 16) {
// 16-bit is signed
dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1. So xoring 101 with 000 gives 101. While xoring 101 with 100 gives 001. I guess fipping this one bit is all that is needed to convert from signed to unsigned!
dl = playpos[0];
playpos += 2;
}
else {
// 8-bit is unsigned
dh = playpos[0];
dl = 0;
playpos++;
}
#if DVOLUME
// dh, dl -> 16 bit tmp:
//uint16_t tmp = (dh << 8) | dl;
//tmp >>= playing->volume;
// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023. 4095*1023 / 1024 = 4091 max.
#endif //DVOLUME
// Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
PORTB &= ~0b10000;
// Send 16 bits to the DAC.
// First 4 bits are configuration bits. 0011 = (DAC A, unbuffered, 1x gain, not muted)
// Last 12 bits specify the output voltage. (MSB first)
//SPI.transfer((tmp >> 8)|0b00110000);
//SPI.transfer(tmp);
#if DVOLUME
SPDR = (tmp >> 8)|0b00110000; // Requires 12-bit tmp. Does the C compiler optimize out the shift here?
while (!(SPSR & _BV(SPIF)));
SPDR = tmp;
while (!(SPSR & _BV(SPIF)));
#else
SPDR = 0b00110000 | (dh >> 4);
while (!(SPSR & _BV(SPIF)));
SPDR = (dh << 4) | (dl >> 4);
while (!(SPSR & _BV(SPIF)));
#endif
// Take the SS pin high to de-select the DAC:
PORTB |= 0b10000;
```

Right now, I’m converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024, the max value of volume.

I believe there are some special commands for multiplying with fixed point numbers which might help, but I’ve also read that it’s difficult to do stuff like multiplying in inline asm because of limitations on what registers you can use.

I’m also thinking maybe since dh is already in a byte, that rather than shift it right and or it with dl to make a 12 bit number, I could leave dh and dl alone, somehow multiply each by the volume adjustment carrying the excess from dl over to dh and then do a 14 bit shift to divide by the max volume and also reduce the 16 bit sample to the 12 bit one I need.

Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that’s done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

One thing I don’t want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it. That would result in too large of steps for the volume. I want to read a pot and have the volume change smoothly.