Go Down

Topic: Another optimization question - can I speed up this 32 bit multiply? (Read 2 times) previous topic - next topic

scswift



Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.


Do you mean like this?

Code: [Select]

uint16_t vol = playing->volume >> 2;
uint16_t loVal = (uint16_t)dl * vol;
uint16_t hiVal = (uint16_t)dh * vol;
uint32_t val = ((uint32_t)hiVal << 8) | loVal;
uint16_t rslt = (uint16_t)(val >> 8);


You can obviously write this as a single expression if you want to, however the compiler should optimise all the temporary variables away anyway.


Something like that, but I don't think the compiler is going to produce very optimal code with that.  It will probably convert numbers to 16 bit and 32 bit where it doesn't need to.  (A problem I heard about in threads on avfreaks dealing with assembly multiplication routines.)  Also I don't think "uint32_t val = ((uint32_t)hiVal << 8) | loVal;" is correct.  If I understand what you're doing, you're trying to multiply dl by vol and put it in an integer... (which I believe will cause the compiler to first convert dl to an int) and the overflow is carried into the high byte for that, but I don't think ORing that high byte with dh's low byte will give the right result.  I think you would have to add it.

Also, I would do the comversion of the volume to 8-bit outside the DAC interrupt since I won't be reading and updating that very often compared to how often I need to compute a new sample.

dc42

You're right, you would have to add the low and high parts (not 'or' them) because they overlap. I don't think this approach will generate such compact code as you could write in assembler, but it may beat the code generated by a uint32_t multiply.
Formal verification of safety-critical software, software development, and electronic design and prototyping. See http://www.eschertech.com. Please do not ask for unpaid help via PM, use the forum.

scswift

I've figured out most of what the inline assembly for that multiply is doing, though I don't understand the math that's being done yet, or why r1 is reset to 0 at the end:

Code: [Select]

#define MultiU8X16to24(longRes, charIn1, intIn2) \  // I think longRes = "long result" which may mean the result is 32 bit.  charIn = 8 bit value input, intIn = 16 bit value input
asm volatile ( \
"clr r16 \n\t" \        /// r16 = 0
"mul %A1, %A2 \n\t" \   //  Multiply A1 and A2 (first byte of charIn and intIn) and put the result in r1:r0
"movw %A0, r0 \n\t" \   // Move word from r1:r0 (first byte of charIn and intIn) to B0:A0 (first two bytes of longRes)
"mov %C0, r16 \n\t" \   // Move r16 into C0.  C0 (longRes) = 0
"mov %D0, r16 \n\t" \   // Move r16 into D0.  D0 (longRes) = 0
"mul %A1, %B2 \n\t" \  // Multiply A1 and B2 (first byte of charIn and second byte of intIn) and put the result in r1:r0
"add %B0, r0 \n\t" \   // B0 = B0 + r0
"adc %C0, r1 \n\t" \   // C0 = C0 + r1 + carry (carry is set if previous add overflows)
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   r16 is still 0, so this is really D0 = D0 + carry
"clr r1 \n\t" \        // r1 = 0  ... Why reset r1 to 0 here? 
: \                  // Output operand list
"=&r" (longRes) \  // "=" = write only, "&" = output only,  "r" = any register
: \                // Input operand list
"a" (charIn1), \ // "a" = simple upper registers, r16 to r23
"a" (intIn2) \
: \                // Clobber list (may be omitted if nothing clobbered)
"r16" \            // R16 got clobbered.  But so did r0 and r1 and they're not here.  Odd.
)

%0 or %A0, %B0 %C0, %D0 refers to longRes
%1 or %A1 refers to charIn1
%2 or %A2, %B2 refers to intIn2



asm(code : output operand list : input operand list [: clobber list]);

If operands do not fit into a single register, the compiler will automatically assign enough registers to hold the entire operand.

In the assembler code you use:

%A0 to refer to the lowest byte of the first operand,
%A1 to the lowest byte of the second operand and so on.

The next byte of the first operand will be %B0, the next byte %C0 and so on.

scswift

Strange...

Code: [Select]
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   r16 is still 0, so this is really D0 = D0 + carry

Since the result should be 24 bits, and D0 is set to 0 earlier in the code, it seems like this line shouldn't be here, and would do nothing.  Carry should never be set, and D0 should never be anything but 0.


tmd3


Go Up