Go Down

Topic: Another optimization question - can I speed up this 32 bit multiply? (Read 2 times) previous topic - next topic

scswift

Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

Code: [Select]

// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.
    1e96: 70 e0        ldi r23, 0x00 ; 0
    1e98: 54 e0        ldi r21, 0x04 ; 4
    1e9a: 75 95        asr r23
    1e9c: 67 95        ror r22
    1e9e: 5a 95        dec r21
    1ea0: e1 f7        brne .-8      ; 0x1e9a <__vector_13+0x116>
    1ea2: 30 e0        ldi r19, 0x00 ; 0
    1ea4: 44 e0        ldi r20, 0x04 ; 4
    1ea6: 22 0f        add r18, r18
    1ea8: 33 1f        adc r19, r19
    1eaa: 4a 95        dec r20
    1eac: e1 f7        brne .-8      ; 0x1ea6 <__vector_13+0x122>
    1eae: 62 2b        or r22, r18
    1eb0: 73 2b        or r23, r19
    1eb2: 88 27        eor r24, r24
    1eb4: 77 fd        sbrc r23, 7
    1eb6: 80 95        com r24
    1eb8: 98 2f        mov r25, r24
    1eba: 2f 85        ldd r18, Y+15 ; 0x0f
    1ebc: 38 89        ldd r19, Y+16 ; 0x10
    1ebe: 40 e0        ldi r20, 0x00 ; 0
    1ec0: 50 e0        ldi r21, 0x00 ; 0
    1ec2: 0e 94 56 18 call 0x30ac ; 0x30ac <__mulsi3>
    1ec6: 9b 01        movw r18, r22
    1ec8: ac 01        movw r20, r24
    1eca: 9a e0        ldi r25, 0x0A ; 10
    1ecc: 56 95        lsr r21
    1ece: 47 95        ror r20
    1ed0: 37 95        ror r19
    1ed2: 27 95        ror r18
    1ed4: 9a 95        dec r25
    1ed6: d1 f7        brne .-12      ; 0x1ecc <__vector_13+0x148>


This is the pertinent section of code from the wave lib:

Code: [Select]

  uint8_t dh, dl;
  if (playing->BitsPerSample == 16) {
 
    // 16-bit is signed
    dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1.  So xoring 101 with 000 gives 101.  While xoring 101 with 100 gives 001.  I guess fipping this one bit is all that is needed to convert from signed to unsigned!
    dl = playpos[0];
    playpos += 2;
  }
  else {
 
    // 8-bit is unsigned
    dh = playpos[0];
    dl = 0;
    playpos++;
  }
 
#if DVOLUME

// dh, dl -> 16 bit tmp:
//uint16_t tmp = (dh << 8) | dl;
//tmp >>= playing->volume;

// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
      PORTB &= ~0b10000;

     // Send 16 bits to the DAC.
// First 4 bits are configuration bits. 0011 = (DAC A, unbuffered, 1x gain, not muted)
// Last 12 bits specify the output voltage. (MSB first) 

//SPI.transfer((tmp >> 8)|0b00110000);
//SPI.transfer(tmp);

#if DVOLUME

    SPDR = (tmp >> 8)|0b00110000; // Requires 12-bit tmp.  Does the C compiler optimize out the shift here?
    while (!(SPSR & _BV(SPIF)));

    SPDR = tmp;
    while (!(SPSR & _BV(SPIF)));

#else

    SPDR = 0b00110000 | (dh >> 4);
    while (!(SPSR & _BV(SPIF)));

    SPDR = (dh << 4) | (dl >> 4);
    while (!(SPSR & _BV(SPIF)));

#endif

     // Take the SS pin high to de-select the DAC:
      PORTB |= 0b10000;




Right now, I'm converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024, the max value of volume.
I believe there are some special commands for multiplying with fixed point numbers which might help, but I've also read that it's difficult to do stuff like multiplying in inline asm because of limitations on what registers you can use.

I'm also thinking maybe since dh is already in a byte, that rather than shift it right and or it with dl to make a 12 bit number, I could leave dh and dl alone, somehow multiply each by the volume adjustment carrying the excess from dl over to dh and then do a 14 bit shift to divide by the max volume and also reduce the 16 bit sample to the 12 bit one I need.

Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it.  That would result in too large of steps for the volume.  I want to read a pot and have the volume change smoothly. 


scswift

I found some code here, for an 8*16 unsigned multiply with a 24 bit result...  That would work for my needs, but from what some of the posters said after I'm not sure if it's optimal.  But it looks a lot more optimal than what the compiler generated with my code...

http://www.avrfreaks.net/index.php?name=PNphpBB2&file=printview&t=74626&start=0

Code: [Select]

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"mul %A1, %B2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)


I'm not really sure what any of that is doing, but the thing which makes me think it may not be optimal is it has a couple clr commands and some of the posters on avrfreaks said those were bad.  I guess CLR is "clear register" and takes 1 cycle.  And it seems like R16 which is cleared is just moved into a bunch of other registers, to zero them perhaps?  Or maybe something else is going on with it I can't see?

darryl

Quote
One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it.  That would result in too large of steps for the volume.  I want to read a pot and have the volume change smoothly.


but if your reading a pot, via a 10bit ADC, then to set the volume you will only ever have 1024 range ( 0->1023 ) for values to go from min to max, so why not use a shift ? if you need the value to a larger size ?
--
 Darryl

scswift


Quote
One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it.  That would result in too large of steps for the volume.  I want to read a pot and have the volume change smoothly.


but if your reading a pot, via a 10bit ADC, then to set the volume you will only ever have 1024 range ( 0->1023 ) for values to go from min to max, so why not use a shift ? if you need the value to a larger size ?


I'm not sure what you mean.

I have a 12 bit sample.  Let's say the value of it is 4095.
Now let's say I read my pot and the value of it is 512. 
512/1024 = 0.5
So to get the value of my sample after adjusting by my volume, I go 4095 * 0.5, and I get 2047.

But floats are expensive.  So how can I get rid of them?

Well, I can change the order of my operations.  I can instead multiply 4095 by my ADC reading... 512, an integer... and THEN divide by 1024, which I can do by shifting.

But, since my sample is 12 bits, multiplying by a 10 bit number may cause the result to exceed 16 bits.  So I have to multiply it into a 32 bit number.  At least if I do it in C.
OR, I can decide to use only 8 bits of my ADC, do it in assembly, start from a 16 bit sample, multply that into a 24 but number, and then just discard the lowest 8 bits.

scswift

So I was looking at that asm function to multiply 8*16... I've got it partially figured out:

Code: [Select]

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \        /// r16 = 0
"mul %A1, %A2 \n\t" \   //  Multiply a1 and a2 and put the result in r1:r0
"movw %A0, r0 \n\t" \   // Move word from r0 to A0?  Does it move only a byte from r0?  Or does it move r0 and r1?  I assume the latter.
"mov %C0, r16 \n\t" \   // Move r16 into C0.  Isn't r16 just 0?
"mov %D0, r16 \n\t" \   // Move R16 into D0.  Again, isn't r16 0?
"mul %A1, %B2 \n\t" \  // Multiply a1 and b2 and put the result in r1:r0
"add %B0, r0 \n\t" \   // B0 = B0 + r0
"adc %C0, r1 \n\t" \   // C0 = C0 + r1 + carry?  Carry from where?  From the previous add instruction?  I think ADD sets a carry flag if it overflows.
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   Isn't r16 still 0?
"clr r1 \n\t" \        // r1 = 0?  Why? 
: \
"=&r" (longRes) \  // I have no idea what the following does.  Or what %A0 %C0 % D0 are.
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)

Go Up