Pages: [1] 2 3   Go Down
Author Topic: Another optimization question - can I speed up this 32 bit multiply?  (Read 2346 times)
0 Members and 1 Guest are viewing this topic.
Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

Code:
// dh, dl -> 12 bit tmp:
  uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.
    1e96: 70 e0        ldi r23, 0x00 ; 0
    1e98: 54 e0        ldi r21, 0x04 ; 4
    1e9a: 75 95        asr r23
    1e9c: 67 95        ror r22
    1e9e: 5a 95        dec r21
    1ea0: e1 f7        brne .-8      ; 0x1e9a <__vector_13+0x116>
    1ea2: 30 e0        ldi r19, 0x00 ; 0
    1ea4: 44 e0        ldi r20, 0x04 ; 4
    1ea6: 22 0f        add r18, r18
    1ea8: 33 1f        adc r19, r19
    1eaa: 4a 95        dec r20
    1eac: e1 f7        brne .-8      ; 0x1ea6 <__vector_13+0x122>
    1eae: 62 2b        or r22, r18
    1eb0: 73 2b        or r23, r19
    1eb2: 88 27        eor r24, r24
    1eb4: 77 fd        sbrc r23, 7
    1eb6: 80 95        com r24
    1eb8: 98 2f        mov r25, r24
    1eba: 2f 85        ldd r18, Y+15 ; 0x0f
    1ebc: 38 89        ldd r19, Y+16 ; 0x10
    1ebe: 40 e0        ldi r20, 0x00 ; 0
    1ec0: 50 e0        ldi r21, 0x00 ; 0
    1ec2: 0e 94 56 18 call 0x30ac ; 0x30ac <__mulsi3>
    1ec6: 9b 01        movw r18, r22
    1ec8: ac 01        movw r20, r24
    1eca: 9a e0        ldi r25, 0x0A ; 10
    1ecc: 56 95        lsr r21
    1ece: 47 95        ror r20
    1ed0: 37 95        ror r19
    1ed2: 27 95        ror r18
    1ed4: 9a 95        dec r25
    1ed6: d1 f7        brne .-12      ; 0x1ecc <__vector_13+0x148>

This is the pertinent section of code from the wave lib:

Code:
  uint8_t dh, dl;
  if (playing->BitsPerSample == 16) {
 
    // 16-bit is signed
    dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1.  So xoring 101 with 000 gives 101.  While xoring 101 with 100 gives 001.  I guess fipping this one bit is all that is needed to convert from signed to unsigned!
    dl = playpos[0];
    playpos += 2;
  }
  else {
 
    // 8-bit is unsigned
    dh = playpos[0];
    dl = 0;
    playpos++;
  }
 
#if DVOLUME

// dh, dl -> 16 bit tmp:
//uint16_t tmp = (dh << 8) | dl;
//tmp >>= playing->volume;

// dh, dl -> 12 bit tmp:
  uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
      PORTB &= ~0b10000;

     // Send 16 bits to the DAC.
// First 4 bits are configuration bits. 0011 = (DAC A, unbuffered, 1x gain, not muted)
// Last 12 bits specify the output voltage. (MSB first) 

//SPI.transfer((tmp >> 8)|0b00110000);
//SPI.transfer(tmp);

#if DVOLUME

    SPDR = (tmp >> 8)|0b00110000; // Requires 12-bit tmp.  Does the C compiler optimize out the shift here?
    while (!(SPSR & _BV(SPIF)));

    SPDR = tmp;
    while (!(SPSR & _BV(SPIF)));

#else

    SPDR = 0b00110000 | (dh >> 4);
    while (!(SPSR & _BV(SPIF)));

    SPDR = (dh << 4) | (dl >> 4);
    while (!(SPSR & _BV(SPIF)));

#endif

     // Take the SS pin high to de-select the DAC:
      PORTB |= 0b10000;



Right now, I'm converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024, the max value of volume.
I believe there are some special commands for multiplying with fixed point numbers which might help, but I've also read that it's difficult to do stuff like multiplying in inline asm because of limitations on what registers you can use.

I'm also thinking maybe since dh is already in a byte, that rather than shift it right and or it with dl to make a 12 bit number, I could leave dh and dl alone, somehow multiply each by the volume adjustment carrying the excess from dl over to dh and then do a 14 bit shift to divide by the max volume and also reduce the 16 bit sample to the 12 bit one I need.

Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it.  That would result in too large of steps for the volume.  I want to read a pot and have the volume change smoothly. 

Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I found some code here, for an 8*16 unsigned multiply with a 24 bit result...  That would work for my needs, but from what some of the posters said after I'm not sure if it's optimal.  But it looks a lot more optimal than what the compiler generated with my code...

http://www.avrfreaks.net/index.php?name=PNphpBB2&file=printview&t=74626&start=0

Code:
#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"mul %A1, %B2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)

I'm not really sure what any of that is doing, but the thing which makes me think it may not be optimal is it has a couple clr commands and some of the posters on avrfreaks said those were bad.  I guess CLR is "clear register" and takes 1 cycle.  And it seems like R16 which is cleared is just moved into a bunch of other registers, to zero them perhaps?  Or maybe something else is going on with it I can't see?
Logged

U.K
Offline Offline
Jr. Member
**
Karma: 1
Posts: 72
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it.  That would result in too large of steps for the volume.  I want to read a pot and have the volume change smoothly.

but if your reading a pot, via a 10bit ADC, then to set the volume you will only ever have 1024 range ( 0->1023 ) for values to go from min to max, so why not use a shift ? if you need the value to a larger size ?
Logged

--
 Darryl

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it.  That would result in too large of steps for the volume.  I want to read a pot and have the volume change smoothly.

but if your reading a pot, via a 10bit ADC, then to set the volume you will only ever have 1024 range ( 0->1023 ) for values to go from min to max, so why not use a shift ? if you need the value to a larger size ?

I'm not sure what you mean.

I have a 12 bit sample.  Let's say the value of it is 4095.
Now let's say I read my pot and the value of it is 512. 
512/1024 = 0.5
So to get the value of my sample after adjusting by my volume, I go 4095 * 0.5, and I get 2047.

But floats are expensive.  So how can I get rid of them?

Well, I can change the order of my operations.  I can instead multiply 4095 by my ADC reading... 512, an integer... and THEN divide by 1024, which I can do by shifting.

But, since my sample is 12 bits, multiplying by a 10 bit number may cause the result to exceed 16 bits.  So I have to multiply it into a 32 bit number.  At least if I do it in C.
OR, I can decide to use only 8 bits of my ADC, do it in assembly, start from a 16 bit sample, multply that into a 24 but number, and then just discard the lowest 8 bits.
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

So I was looking at that asm function to multiply 8*16... I've got it partially figured out:

Code:
#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \        /// r16 = 0
"mul %A1, %A2 \n\t" \   //  Multiply a1 and a2 and put the result in r1:r0
"movw %A0, r0 \n\t" \   // Move word from r0 to A0?  Does it move only a byte from r0?  Or does it move r0 and r1?  I assume the latter.
"mov %C0, r16 \n\t" \   // Move r16 into C0.  Isn't r16 just 0?
"mov %D0, r16 \n\t" \   // Move R16 into D0.  Again, isn't r16 0?
"mul %A1, %B2 \n\t" \  // Multiply a1 and b2 and put the result in r1:r0
"add %B0, r0 \n\t" \   // B0 = B0 + r0
"adc %C0, r1 \n\t" \   // C0 = C0 + r1 + carry?  Carry from where?  From the previous add instruction?  I think ADD sets a carry flag if it overflows.
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   Isn't r16 still 0?
"clr r1 \n\t" \        // r1 = 0?  Why? 
: \
"=&r" (longRes) \  // I have no idea what the following does.  Or what %A0 %C0 % D0 are.
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I guess r1 is used as a "zero register" by GCC.  I'm not sure why, I guess it's for speed, but that might explain why the code resets it at the end. 
« Last Edit: November 05, 2012, 05:40:20 am by scswift » Logged

Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 216
Posts: 13664
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
// dh, dl -> 12 bit tmp:
    uint32_t tmp = (dh << 4) | (dl >> 4);
   tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.

you could try to keep it 16 bit,

uint16_t tmp = dh >> 2;        // ~~~ ((dh << 4) | (dl >> 4)) >> 10  do the shift 10 first
tmp = tmp * playing -> volume;

it may have less distinct values but maybe it does the work well enough?
it should definitely be faster smiley
 
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
// dh, dl -> 12 bit tmp:
    uint32_t tmp = (dh << 4) | (dl >> 4);
   tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.

you could try to keep it 16 bit,

uint16_t tmp = dh >> 2;        // ~~~ ((dh << 4) | (dl >> 4)) >> 10  do the shift 10 first
tmp = tmp * playing -> volume;

it may have less distinct values but maybe it does the work well enough?
it should definitely be faster smiley

I'm afraid I need a lot better than 6-bit audio quality.  And I really don't want to go lower than 12-bit.
Logged

Valencia, Spain
Offline Offline
Faraday Member
**
Karma: 145
Posts: 5460
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

Code:
// dh, dl -> 12 bit tmp:
  uint32_t tmp = (dh << 4) | (dl >> 4);


Those multiple-bit shifts are very bad for optimization. They get done in a loop, one bit at a time.

Right now, I'm converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024

Maybe better to convert dh and dl to a 16 bit value, avoiding the shifts.

You're doing the right thing though - look at the disassembly...see what the compiler is up to.

Last time I optimized something like this I ended up creating a special struct:
Code:
union ByteInt16 {
  int val;
  struct {
    // Access to the bytes of 'val'
    byte lo, hi;
  } bytes;
  ByteInt16& operator=(int n) { val = n;  }
  operator int() const { return val;      }
};

That way I can address individual bytes of the integers directly and avoid shifting by 8 - the compiler can be really stupid sometimes. If you keep the useful parts of the numbers aligned on 8-bit boundaries you can make a massive difference to the code.

(I also made a similar struct for 32 bit values...)
« Last Edit: November 05, 2012, 03:46:10 pm by fungus » Logged

No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

United Kingdom
Offline Offline
Tesla Member
***
Karma: 224
Posts: 6613
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

Do you mean like this?

Code:
uint16_t vol = playing->volume >> 2;
uint16_t loVal = (uint16_t)dl * vol;
uint16_t hiVal = (uint16_t)dh * vol;
uint32_t val = ((uint32_t)hiVal << 8) | loVal;
uint16_t rslt = (uint16_t)(val >> 8);

You can obviously write this as a single expression if you want to, however the compiler should optimise all the temporary variables away anyway.
Logged

Formal verification of safety-critical software, software development, and electronic design and prototyping. See http://www.eschertech.com. Please do not ask for unpaid help via PM, use the forum.

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

Do you mean like this?

Code:
uint16_t vol = playing->volume >> 2;
uint16_t loVal = (uint16_t)dl * vol;
uint16_t hiVal = (uint16_t)dh * vol;
uint32_t val = ((uint32_t)hiVal << 8) | loVal;
uint16_t rslt = (uint16_t)(val >> 8);

You can obviously write this as a single expression if you want to, however the compiler should optimise all the temporary variables away anyway.

Something like that, but I don't think the compiler is going to produce very optimal code with that.  It will probably convert numbers to 16 bit and 32 bit where it doesn't need to.  (A problem I heard about in threads on avfreaks dealing with assembly multiplication routines.)  Also I don't think "uint32_t val = ((uint32_t)hiVal << smiley-cool | loVal;" is correct.  If I understand what you're doing, you're trying to multiply dl by vol and put it in an integer... (which I believe will cause the compiler to first convert dl to an int) and the overflow is carried into the high byte for that, but I don't think ORing that high byte with dh's low byte will give the right result.  I think you would have to add it.

Also, I would do the comversion of the volume to 8-bit outside the DAC interrupt since I won't be reading and updating that very often compared to how often I need to compute a new sample.
Logged

United Kingdom
Offline Offline
Tesla Member
***
Karma: 224
Posts: 6613
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

You're right, you would have to add the low and high parts (not 'or' them) because they overlap. I don't think this approach will generate such compact code as you could write in assembler, but it may beat the code generated by a uint32_t multiply.
Logged

Formal verification of safety-critical software, software development, and electronic design and prototyping. See http://www.eschertech.com. Please do not ask for unpaid help via PM, use the forum.

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I've figured out most of what the inline assembly for that multiply is doing, though I don't understand the math that's being done yet, or why r1 is reset to 0 at the end:

Code:
#define MultiU8X16to24(longRes, charIn1, intIn2) \  // I think longRes = "long result" which may mean the result is 32 bit.  charIn = 8 bit value input, intIn = 16 bit value input
asm volatile ( \
"clr r16 \n\t" \        /// r16 = 0
"mul %A1, %A2 \n\t" \   //  Multiply A1 and A2 (first byte of charIn and intIn) and put the result in r1:r0
"movw %A0, r0 \n\t" \   // Move word from r1:r0 (first byte of charIn and intIn) to B0:A0 (first two bytes of longRes)
"mov %C0, r16 \n\t" \   // Move r16 into C0.  C0 (longRes) = 0
"mov %D0, r16 \n\t" \   // Move r16 into D0.  D0 (longRes) = 0
"mul %A1, %B2 \n\t" \  // Multiply A1 and B2 (first byte of charIn and second byte of intIn) and put the result in r1:r0
"add %B0, r0 \n\t" \   // B0 = B0 + r0
"adc %C0, r1 \n\t" \   // C0 = C0 + r1 + carry (carry is set if previous add overflows)
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   r16 is still 0, so this is really D0 = D0 + carry
"clr r1 \n\t" \        // r1 = 0  ... Why reset r1 to 0 here? 
: \                  // Output operand list
"=&r" (longRes) \  // "=" = write only, "&" = output only,  "r" = any register
: \                // Input operand list
"a" (charIn1), \ // "a" = simple upper registers, r16 to r23
"a" (intIn2) \
: \                // Clobber list (may be omitted if nothing clobbered)
"r16" \            // R16 got clobbered.  But so did r0 and r1 and they're not here.  Odd.
)

%0 or %A0, %B0 %C0, %D0 refers to longRes
%1 or %A1 refers to charIn1
%2 or %A2, %B2 refers to intIn2



asm(code : output operand list : input operand list [: clobber list]);

If operands do not fit into a single register, the compiler will automatically assign enough registers to hold the entire operand.

In the assembler code you use:

%A0 to refer to the lowest byte of the first operand,
%A1 to the lowest byte of the second operand and so on.

The next byte of the first operand will be %B0, the next byte %C0 and so on.
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1363
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Strange...

Code:
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   r16 is still 0, so this is really D0 = D0 + carry

Since the result should be 24 bits, and D0 is set to 0 earlier in the code, it seems like this line shouldn't be here, and would do nothing.  Carry should never be set, and D0 should never be anything but 0.

Logged

USA
Offline Offline
Sr. Member
****
Karma: 13
Posts: 365
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Here's a link describing a rule about r1, and why you keep finding it set to zero:

http://www.nongnu.org/avr-libc/user-manual/FAQ.html#faq_reg_usage
Logged

Pages: [1] 2 3   Go Up
Jump to: