Another optimization question - can I speed up this 32 bit multiply?

Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

	// dh, dl -> 12 bit tmp:
 	uint32_t tmp = (dh << 4) | (dl >> 4);
	tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max. 
    1e96:	70 e0       	ldi	r23, 0x00	; 0
    1e98:	54 e0       	ldi	r21, 0x04	; 4
    1e9a:	75 95       	asr	r23
    1e9c:	67 95       	ror	r22
    1e9e:	5a 95       	dec	r21
    1ea0:	e1 f7       	brne	.-8      	; 0x1e9a <__vector_13+0x116>
    1ea2:	30 e0       	ldi	r19, 0x00	; 0
    1ea4:	44 e0       	ldi	r20, 0x04	; 4
    1ea6:	22 0f       	add	r18, r18
    1ea8:	33 1f       	adc	r19, r19
    1eaa:	4a 95       	dec	r20
    1eac:	e1 f7       	brne	.-8      	; 0x1ea6 <__vector_13+0x122>
    1eae:	62 2b       	or	r22, r18
    1eb0:	73 2b       	or	r23, r19
    1eb2:	88 27       	eor	r24, r24
    1eb4:	77 fd       	sbrc	r23, 7
    1eb6:	80 95       	com	r24
    1eb8:	98 2f       	mov	r25, r24
    1eba:	2f 85       	ldd	r18, Y+15	; 0x0f
    1ebc:	38 89       	ldd	r19, Y+16	; 0x10
    1ebe:	40 e0       	ldi	r20, 0x00	; 0
    1ec0:	50 e0       	ldi	r21, 0x00	; 0
    1ec2:	0e 94 56 18 	call	0x30ac	; 0x30ac <__mulsi3>
    1ec6:	9b 01       	movw	r18, r22
    1ec8:	ac 01       	movw	r20, r24
    1eca:	9a e0       	ldi	r25, 0x0A	; 10
    1ecc:	56 95       	lsr	r21
    1ece:	47 95       	ror	r20
    1ed0:	37 95       	ror	r19
    1ed2:	27 95       	ror	r18
    1ed4:	9a 95       	dec	r25
    1ed6:	d1 f7       	brne	.-12     	; 0x1ecc <__vector_13+0x148>

This is the pertinent section of code from the wave lib:

  uint8_t dh, dl;
  if (playing->BitsPerSample == 16) {
  
    // 16-bit is signed
    dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1.  So xoring 101 with 000 gives 101.  While xoring 101 with 100 gives 001.  I guess fipping this one bit is all that is needed to convert from signed to unsigned! 
    dl = playpos[0];
    playpos += 2;
  }
  else {
  
    // 8-bit is unsigned
    dh = playpos[0];
    dl = 0;
    playpos++;
  }
  
#if DVOLUME

	// dh, dl -> 16 bit tmp:
	//uint16_t tmp = (dh << 8) | dl;
	//tmp >>= playing->volume;

	// dh, dl -> 12 bit tmp:
 	uint32_t tmp = (dh << 4) | (dl >> 4);
	tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max. 

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
     		PORTB &= ~0b10000; 

     // Send 16 bits to the DAC.
	// First 4 bits are configuration bits. 0011 = (DAC A, unbuffered, 1x gain, not muted)
	// Last 12 bits specify the output voltage. (MSB first)  
	
		//SPI.transfer((tmp >> 8)|0b00110000);
		//SPI.transfer(tmp);

#if DVOLUME 
		
    		SPDR = (tmp >> 8)|0b00110000; // Requires 12-bit tmp.  Does the C compiler optimize out the shift here?
    		while (!(SPSR & _BV(SPIF)));
		
    		SPDR = tmp;
    		while (!(SPSR & _BV(SPIF)));

#else

    		SPDR = 0b00110000 | (dh >> 4);
    		while (!(SPSR & _BV(SPIF)));
		
    		SPDR = (dh << 4) | (dl >> 4);
    		while (!(SPSR & _BV(SPIF)));

#endif

     // Take the SS pin high to de-select the DAC:
      	PORTB |= 0b10000;

Right now, I'm converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024, the max value of volume.
I believe there are some special commands for multiplying with fixed point numbers which might help, but I've also read that it's difficult to do stuff like multiplying in inline asm because of limitations on what registers you can use.

I'm also thinking maybe since dh is already in a byte, that rather than shift it right and or it with dl to make a 12 bit number, I could leave dh and dl alone, somehow multiply each by the volume adjustment carrying the excess from dl over to dh and then do a 14 bit shift to divide by the max volume and also reduce the 16 bit sample to the 12 bit one I need.

Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it. That would result in too large of steps for the volume. I want to read a pot and have the volume change smoothly.

I found some code here, for an 8*16 unsigned multiply with a 24 bit result... That would work for my needs, but from what some of the posters said after I'm not sure if it's optimal. But it looks a lot more optimal than what the compiler generated with my code...

http://www.avrfreaks.net/index.php?name=PNphpBB2&file=printview&t=74626&start=0

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"mul %A1, %B2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)

I'm not really sure what any of that is doing, but the thing which makes me think it may not be optimal is it has a couple clr commands and some of the posters on avrfreaks said those were bad. I guess CLR is "clear register" and takes 1 cycle. And it seems like R16 which is cleared is just moved into a bunch of other registers, to zero them perhaps? Or maybe something else is going on with it I can't see?

One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it. That would result in too large of steps for the volume. I want to read a pot and have the volume change smoothly.

but if your reading a pot, via a 10bit ADC, then to set the volume you will only ever have 1024 range ( 0->1023 ) for values to go from min to max, so why not use a shift ? if you need the value to a larger size ?

darryl:

One thing I don't want to do is use shift for volume, or try to squeeze everything into 16 volume levels so I only need to use a 16 bit number to hold the sample when I multiply it. That would result in too large of steps for the volume. I want to read a pot and have the volume change smoothly.

but if your reading a pot, via a 10bit ADC, then to set the volume you will only ever have 1024 range ( 0->1023 ) for values to go from min to max, so why not use a shift ? if you need the value to a larger size ?

I'm not sure what you mean.

I have a 12 bit sample. Let's say the value of it is 4095.
Now let's say I read my pot and the value of it is 512.
512/1024 = 0.5
So to get the value of my sample after adjusting by my volume, I go 4095 * 0.5, and I get 2047.

But floats are expensive. So how can I get rid of them?

Well, I can change the order of my operations. I can instead multiply 4095 by my ADC reading... 512, an integer... and THEN divide by 1024, which I can do by shifting.

But, since my sample is 12 bits, multiplying by a 10 bit number may cause the result to exceed 16 bits. So I have to multiply it into a 32 bit number. At least if I do it in C.
OR, I can decide to use only 8 bits of my ADC, do it in assembly, start from a 16 bit sample, multply that into a 24 but number, and then just discard the lowest 8 bits.

So I was looking at that asm function to multiply 8*16... I've got it partially figured out:

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \        /// r16 = 0
"mul %A1, %A2 \n\t" \   //  Multiply a1 and a2 and put the result in r1:r0
"movw %A0, r0 \n\t" \   // Move word from r0 to A0?  Does it move only a byte from r0?  Or does it move r0 and r1?  I assume the latter.
"mov %C0, r16 \n\t" \   // Move r16 into C0.  Isn't r16 just 0?
"mov %D0, r16 \n\t" \   // Move R16 into D0.  Again, isn't r16 0?
"mul %A1, %B2 \n\t" \  // Multiply a1 and b2 and put the result in r1:r0
"add %B0, r0 \n\t" \   // B0 = B0 + r0
"adc %C0, r1 \n\t" \   // C0 = C0 + r1 + carry?  Carry from where?  From the previous add instruction?  I think ADD sets a carry flag if it overflows.
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   Isn't r16 still 0?
"clr r1 \n\t" \        // r1 = 0?  Why?  
: \
"=&r" (longRes) \  // I have no idea what the following does.  Or what %A0 %C0 % D0 are.
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)

I guess r1 is used as a "zero register" by GCC. I'm not sure why, I guess it's for speed, but that might explain why the code resets it at the end.

// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023. 4095*1023 / 1024 = 4091 max.

you could try to keep it 16 bit,

uint16_t tmp = dh >> 2; // ~~~ ((dh << 4) | (dl >> 4)) >> 10 do the shift 10 first
tmp = tmp * playing -> volume;

it may have less distinct values but maybe it does the work well enough?
it should definitely be faster :slight_smile:

robtillaart:

// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);
tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023. 4095*1023 / 1024 = 4091 max.

you could try to keep it 16 bit,

uint16_t tmp = dh >> 2; // ~~~ ((dh << 4) | (dl >> 4)) >> 10 do the shift 10 first
tmp = tmp * playing -> volume;

it may have less distinct values but maybe it does the work well enough?
it should definitely be faster :slight_smile:

I'm afraid I need a lot better than 6-bit audio quality. And I really don't want to go lower than 12-bit.

scswift:
Hey guys,

I want to add fine volume control to the WaveHC lib and I want to optimize it as much as possible.

This is what the compiler spat out when I did what I thought would become a few multiply and shift instructions:

	// dh, dl -> 12 bit tmp:
uint32_t tmp = (dh << 4) | (dl >> 4);

Those multiple-bit shifts are very bad for optimization. They get done in a loop, one bit at a time.

scswift:
Right now, I'm converting dh and dl to a 12 bit value since that is what the dac can handle, then I multiply that by volume, and divide by 1024

Maybe better to convert dh and dl to a 16 bit value, avoiding the shifts.

You're doing the right thing though - look at the disassembly...see what the compiler is up to.

Last time I optimized something like this I ended up creating a special struct:

union ByteInt16 {
  int val;
  struct {
    // Access to the bytes of 'val'
    byte lo, hi;
  } bytes;
  ByteInt16& operator=(int n) { val = n;  }
  operator int() const { return val;      }
};

That way I can address individual bytes of the integers directly and avoid shifting by 8 - the compiler can be really stupid sometimes. If you keep the useful parts of the numbers aligned on 8-bit boundaries you can make a massive difference to the code.

(I also made a similar struct for 32 bit values...)

scswift:
Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

Do you mean like this?

uint16_t vol = playing->volume >> 2;
uint16_t loVal = (uint16_t)dl * vol;
uint16_t hiVal = (uint16_t)dh * vol;
uint32_t val = ((uint32_t)hiVal << 8) | loVal;
uint16_t rslt = (uint16_t)(val >> 8);

You can obviously write this as a single expression if you want to, however the compiler should optimise all the temporary variables away anyway.

dc42:

scswift:
Or, I could forget about using all 10 bits of the ADC, use an 8 bit value for volume, and somehow multiply dl by that carrying the excess to dh, and if dh overflows carry that on into dhl, and then multiply dh by volume carrying into dhl, and once that's done, I can divide by 256 by simply dropping dl and using dh and dhl as my low and high bytes, perhaps shifting right by 4 to convert down to a 12 bit value.

Do you mean like this?

uint16_t vol = playing->volume >> 2;

uint16_t loVal = (uint16_t)dl * vol;
uint16_t hiVal = (uint16_t)dh * vol;
uint32_t val = ((uint32_t)hiVal << 8) | loVal;
uint16_t rslt = (uint16_t)(val >> 8);




You can obviously write this as a single expression if you want to, however the compiler should optimise all the temporary variables away anyway.

Something like that, but I don't think the compiler is going to produce very optimal code with that. It will probably convert numbers to 16 bit and 32 bit where it doesn't need to. (A problem I heard about in threads on avfreaks dealing with assembly multiplication routines.) Also I don't think "uint32_t val = ((uint32_t)hiVal << 8) | loVal;" is correct. If I understand what you're doing, you're trying to multiply dl by vol and put it in an integer... (which I believe will cause the compiler to first convert dl to an int) and the overflow is carried into the high byte for that, but I don't think ORing that high byte with dh's low byte will give the right result. I think you would have to add it.

Also, I would do the comversion of the volume to 8-bit outside the DAC interrupt since I won't be reading and updating that very often compared to how often I need to compute a new sample.

You're right, you would have to add the low and high parts (not 'or' them) because they overlap. I don't think this approach will generate such compact code as you could write in assembler, but it may beat the code generated by a uint32_t multiply.

I've figured out most of what the inline assembly for that multiply is doing, though I don't understand the math that's being done yet, or why r1 is reset to 0 at the end:

#define MultiU8X16to24(longRes, charIn1, intIn2) \  // I think longRes = "long result" which may mean the result is 32 bit.  charIn = 8 bit value input, intIn = 16 bit value input
asm volatile ( \
"clr r16 \n\t" \        /// r16 = 0
"mul %A1, %A2 \n\t" \   //  Multiply A1 and A2 (first byte of charIn and intIn) and put the result in r1:r0
"movw %A0, r0 \n\t" \   // Move word from r1:r0 (first byte of charIn and intIn) to B0:A0 (first two bytes of longRes)
"mov %C0, r16 \n\t" \   // Move r16 into C0.  C0 (longRes) = 0 
"mov %D0, r16 \n\t" \   // Move r16 into D0.  D0 (longRes) = 0
"mul %A1, %B2 \n\t" \  // Multiply A1 and B2 (first byte of charIn and second byte of intIn) and put the result in r1:r0
"add %B0, r0 \n\t" \   // B0 = B0 + r0
"adc %C0, r1 \n\t" \   // C0 = C0 + r1 + carry (carry is set if previous add overflows)
"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   r16 is still 0, so this is really D0 = D0 + carry
"clr r1 \n\t" \        // r1 = 0  ... Why reset r1 to 0 here?  
: \                  // Output operand list
"=&r" (longRes) \  // "=" = write only, "&" = output only,  "r" = any register
: \                // Input operand list
"a" (charIn1), \ // "a" = simple upper registers, r16 to r23
"a" (intIn2) \
: \                // Clobber list (may be omitted if nothing clobbered)
"r16" \            // R16 got clobbered.  But so did r0 and r1 and they're not here.  Odd.
)

%0 or %A0, %B0 %C0, %D0 refers to longRes
%1 or %A1 refers to charIn1
%2 or %A2, %B2 refers to intIn2



asm(code : output operand list : input operand list [: clobber list]);

If operands do not fit into a single register, the compiler will automatically assign enough registers to hold the entire operand. 

In the assembler code you use: 

%A0 to refer to the lowest byte of the first operand, 
%A1 to the lowest byte of the second operand and so on. 

The next byte of the first operand will be %B0, the next byte %C0 and so on.

Strange...

"adc %D0, r16 \n\t" \  // D0 = D0 + r16 + carry   r16 is still 0, so this is really D0 = D0 + carry

Since the result should be 24 bits, and D0 is set to 0 earlier in the code, it seems like this line shouldn't be here, and would do nothing. Carry should never be set, and D0 should never be anything but 0.

Here's a link describing a rule about r1, and why you keep finding it set to zero:

http://www.nongnu.org/avr-libc/user-manual/FAQ.html#faq_reg_usage

What the hell is going on here?

#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"mul %B1, %A2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" \
) 



#define MultiU8X16to24(longRes, charIn1, intIn2) \ 
asm volatile ( \
"clr r16 \n\t" \       
"mul %A1, %A2 \n\t" \   
"movw %A0, r0 \n\t" \   
"mov %C0, r16 \n\t" \   
"mov %D0, r16 \n\t" \   
"mul %A1, %B2 \n\t" \  
"add %B0, r0 \n\t" \   
"adc %C0, r1 \n\t" \   
"adc %D0, r16 \n\t" \  
"clr r1 \n\t" \
: \                  
"=&r" (longRes) \  
: \                
"a" (charIn1), \ 
"a" (intIn2) \
: \                
"r16" \            
)

I was getting an error in the second define, "expected ) before : token", so I copied another example assembly function which is nearly identical, pasted that before the one that's broken, and I get no errors on that one, but of course the second one still errors.

Now I'm getting "expected unqualified ID before string constant" on:
"mul %A1, %A2 \n\t" \

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \      
"mul %A1, %A2 \n\t" \  
"movw %A0, r0 \n\t" \  
"mov %C0, r16 \n\t" \  
"mov %D0, r16 \n\t" \   
"mul %A1, %B2 \n\t" \ 
"add %B0, r0 \n\t" \   
"adc %C0, r1 \n\t" \   
"adc %D0, r16 \n\t" \ 
"clr r1 \n\t" \        
: \                  
"=&r" (longRes) \  
: \                
"a" (charIn1), \ 
"a" (intIn2) \
: \                
"r16" \            
)

I removed the first backslash which seems to have fixed that error, but I'm back to to getting an error about expecitng ) before : ...

"clr r1 \n\t" \        
: \

Right there, on the : \ line.

What the hell!

I'm still getting that stupid error no matter what I do!

I copied example code from here:
http://www.nongnu.org/avr-libc/user-manual/inline_asm.html

asm volatile("mov __tmp_reg__, %A0" "\n\t"
             "mov %A0, %B0"         "\n\t"
             "mov %B0, __tmp_reg__" "\n\t"
             : "=r" (value)
             : "0" (value)
            );

And I get THE SAME ERROR!

I've been copying examples from everywhere into new sketches and NONE of them are working. Is there some kind of bug in the IDE?

uint8_t n=10;
__asm(" inc #0\n"::"r");