Go Down

Topic: Another optimization question - can I speed up this 32 bit multiply? (Read 2 times) previous topic - next topic

scswift

I guess r1 is used as a "zero register" by GCC.  I'm not sure why, I guess it's for speed, but that might explain why the code resets it at the end. 

robtillaart

Quote
// dh, dl -> 12 bit tmp:
   uint32_t tmp = (dh << 4) | (dl >> 4);
   tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.


you could try to keep it 16 bit,

uint16_t tmp = dh >> 2;        // ~~~ ((dh << 4) | (dl >> 4)) >> 10  do the shift 10 first
tmp = tmp * playing -> volume;

it may have less distinct values but maybe it does the work well enough?
it should definitely be faster :)

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

scswift


Quote
// dh, dl -> 12 bit tmp:
   uint32_t tmp = (dh << 4) | (dl >> 4);
   tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.


you could try to keep it 16 bit,

uint16_t tmp = dh >> 2;        // ~~~ ((dh << 4) | (dl >> 4)) >> 10  do the shift 10 first
tmp = tmp * playing -> volume;

it may have less distinct values but maybe it does the work well enough?
it should definitely be faster :)


I'm afraid I need a lot better than 6-bit audio quality.  And I really don't want to go lower than 12-bit.

fungus


Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

Code: [Select]

// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);




Those multiple-bit shifts are very bad for optimization. They get done in a loop, one bit at a time.


Right now, I'm converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024


Maybe better to convert dh and dl to a 16 bit value, avoiding the shifts.

You're doing the right thing though - look at the disassembly...see what the compiler is up to.

Last time I optimized something like this I ended up creating a special struct:
Code: [Select]

union ByteInt16 {
 int val;
 struct {
   // Access to the bytes of 'val'
   byte lo, hi;
 } bytes;
 ByteInt16& operator=(int n) { val = n;  }
 operator int() const { return val;      }
};


That way I can address individual bytes of the integers directly and avoid shifting by 8 - the compiler can be really stupid sometimes. If you keep the useful parts of the numbers aligned on 8-bit boundaries you can make a massive difference to the code.

(I also made a similar struct for 32 bit values...)
No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

dc42


Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.


Do you mean like this?

Code: [Select]

uint16_t vol = playing->volume >> 2;
uint16_t loVal = (uint16_t)dl * vol;
uint16_t hiVal = (uint16_t)dh * vol;
uint32_t val = ((uint32_t)hiVal << 8) | loVal;
uint16_t rslt = (uint16_t)(val >> 8);


You can obviously write this as a single expression if you want to, however the compiler should optimise all the temporary variables away anyway.
Formal verification of safety-critical software, software development, and electronic design and prototyping. See http://www.eschertech.com. Please do not ask for unpaid help via PM, use the forum.

Go Up