Another optimization question - can I speed up this 32 bit multiply?

THIS one works. Why does THIS one work?

#define MultiU16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"mov r27, r1 \n\t" \
"mul %B1, %B2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"mul %B1, %A2 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (intRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" , "r27" \
)

You have GOT to be kidding me.

The problem was a warning I saw a few times and ignored: "blackslash and newline seperatedby space". I didn't know what that meant and it looked like the newlines in my assembly code were all fine, and it was just a warning so I looked elsewhere for the problem.

But apparently that was the problem. This damn thing cares if there are INVISIBLE SPACES after that frigging backslash.

So apparently all the code examples I pasted that weren't working had some of these invisible spaces in there, and even though what I could SEE looked virtually identical, it wouldn't compiled because of those damn spaces.

A backslash at the end of a line has a special meaning (line folding). But a backslash elsewhere has another meaning, eg.

Serial.print ("foo\n");

In the context of line-oriented text, especially source code for some programming languages, it is often used at the end of a line to indicate that the trailing newline character should be ignored, so that the following line is treated as if it were part of the current line. In this context it may be called a "continuation". The GNU make manual says, "We split each long line into two lines using backslash-newline; this is like using one long line, but is easier to read."

So backslash+newline has a special meaning. Not backslash+space+newline. Sorry.

If it helps, a lot of us have fallen for that. Plus it's another reason not to use defines, especially ones that go over multiple lines. Those trailing spaces can be hard to see.

I understand, but I don't think there could be ANY good reason the compiler does not ignore spaces after that trailing backslash.

I mean "blackslash followed by spces and then a linefeed" doesn't seem like a more complicated rule than just looking for an immediate linefeed, nor can I think of any good examples where such a rule would trip things up.

Plus it's another reason not to use defines

What do you mean by that anyway? I shouldn't be using defines? How else can I use inline assembly?

What do you mean? Without using defines? Defines are just a text pre-preprocessing thing. An example from optiboot.c:

  __asm__ __volatile__ (
    "   com %[ch]\n" // ones complement, carry set
    "   sec\n"
    "1: brcc 2f\n"
    "   cbi %[uartPort],%[uartBit]\n"
    "   rjmp 3f\n"
    "2: sbi %[uartPort],%[uartBit]\n"
    "   nop\n"
    "3: rcall uartDelay\n"
    "   rcall uartDelay\n"
    "   lsr %[ch]\n"
    "   dec %[bitcnt]\n"
    "   brne 1b\n"
    :
    :
      [bitcnt] "d" (10),
      [ch] "r" (ch),
      [uartPort] "I" (_SFR_IO_ADDR(UART_PORT)),
      [uartBit] "I" (UART_TX_BIT)
    :
      "r25"
  );

No define there. No backslashes (apart from the \n inside the assembler string).

Note that there is no \t in optiboot.c.

scswift:
I shouldn't be using defines? How else can I use inline assembly?

Inline functions are a good choice...
http://code.google.com/p/arduino-tiny/source/browse/trunk/hardware/tiny/cores/tiny/TinyDebugSerial.h#54

Does that mean the \t isn't needed? Some seem to use it some don't. The code I'm using had them already.

As for defines, I know what they do, but what I'm saying is without them I'd have to drop that assembly in everywhere I want to use it. I can't use it like a function, but without the overhead of a function call. Unless I can use an inline function and the compiler knows not to do all kinds of crazy stuff with the function parameters.

scswift:
Does that mean the \t isn't needed?

Being after the newline, effectively it adds whitespace at the start of the next line. The optiboot example had spaces before the opcodes. I presume that if you have something before a space it is a label. Personally I like to see my lines in one place. In other words, if you want leading spaces you have them on that line, not as a trailing space on the previous line.

Makes sense to me.

Just some disassemblies I did while trying to optimize the C code:

Here was the original code, where I did the obvious thing and shifted out the unnecessary bits and moved the high byte into place for my 12-bit sample:

// Calculate 12-bit sample:
	sample = ((0X80 ^ playpos[1])<<4) | playpos[0]>>4;
    1e64:	81 81       	ldd	r24, Z+1	; 0x01
    1e66:	80 58       	subi	r24, 0x80	; 128
    1e68:	48 2f       	mov	r20, r24
    1e6a:	50 e0       	ldi	r21, 0x00	; 0
    1e6c:	b4 e0       	ldi	r27, 0x04	; 4
    1e6e:	44 0f       	add	r20, r20
    1e70:	55 1f       	adc	r21, r21
    1e72:	ba 95       	dec	r27
    1e74:	e1 f7       	brne	.-8      	; 0x1e6e <__vector_13+0xfc>
    1e76:	80 81       	ld	r24, Z
    1e78:	90 e0       	ldi	r25, 0x00	; 0
    1e7a:	a4 e0       	ldi	r26, 0x04	; 4
    1e7c:	95 95       	asr	r25
    1e7e:	87 95       	ror	r24
    1e80:	aa 95       	dec	r26
    1e82:	e1 f7       	brne	.-8      	; 0x1e7c <__vector_13+0x10a>
    1e84:	48 2b       	or	r20, r24
    1e86:	59 2b       	or	r21, r25

Then I figured that if I were optimizing it by hand, surely it would be easier to optimize out a shift by 8 instead, so I tried that:

	sample = 0X80 ^ playpos[1];
    1e64:	81 81       	ldd	r24, Z+1	; 0x01
	sample = sample << 8;
    1e66:	38 2f       	mov	r19, r24
    1e68:	30 58       	subi	r19, 0x80	; 128
    1e6a:	20 e0       	ldi	r18, 0x00	; 0
	sample = sample | playpos[0];
    1e6c:	80 81       	ld	r24, Z
    1e6e:	48 2f       	mov	r20, r24
    1e70:	50 e0       	ldi	r21, 0x00	; 0
    1e72:	42 2b       	or	r20, r18
    1e74:	53 2b       	or	r21, r19
	sample = sample >> 4;
    1e76:	a4 e0       	ldi	r26, 0x04	; 4
    1e78:	56 95       	lsr	r21
    1e7a:	47 95       	ror	r20
    1e7c:	aa 95       	dec	r26
    1e7e:	e1 f7       	brne	.-8      	; 0x1e78 <__vector_13+0x106>

I haven't actually counted the cycles, but I see one less jump in there and I think four fewer loops as a result. And it looks simpler.

And here's the code if I generate a 16 bit sample instead:

	sample = 0X80 ^ playpos[1];
    1e64:	81 81       	ldd	r24, Z+1	; 0x01
	sample = sample << 8;
    1e66:	38 2f       	mov	r19, r24
    1e68:	30 58       	subi	r19, 0x80	; 128
    1e6a:	20 e0       	ldi	r18, 0x00	; 0
	sample = sample | playpos[0];
    1e6c:	80 81       	ld	r24, Z
    1e6e:	48 2f       	mov	r20, r24
    1e70:	50 e0       	ldi	r21, 0x00	; 0
    1e72:	42 2b       	or	r20, r18
    1e74:	53 2b       	or	r21, r19

Now why generate a 16 bit sample if I need a 12 bit sample? Well, I need to do a shift later for the volume anyway, and the 8x16->24 multiply routine can handle a 16 bit sample without overflowing... So maybe if I shift by 12...

	uint32_t tmp;
	MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
    1e86:	2f 85       	ldd	r18, Y+15	; 0x0f
    1e88:	00 27       	eor	r16, r16
    1e8a:	24 9f       	mul	r18, r20
    1e8c:	c0 01       	movw	r24, r0
    1e8e:	a0 2f       	mov	r26, r16
    1e90:	b0 2f       	mov	r27, r16
    1e92:	25 9f       	mul	r18, r21
    1e94:	90 0d       	add	r25, r0
    1e96:	a1 1d       	adc	r26, r1
    1e98:	b0 1f       	adc	r27, r16
    1e9a:	11 24       	eor	r1, r1
	
	//sample = tmp >> 8; // For 12-bit sample input.  This should be optimized to just copying some bytes.		
	sample = tmp >> 12; // For 16-bit sample input.  (output 12 bit sample)
    1e9c:	2c e0       	ldi	r18, 0x0C	; 12
    1e9e:	b6 95       	lsr	r27
    1ea0:	a7 95       	ror	r26
    1ea2:	97 95       	ror	r25
    1ea4:	87 95       	ror	r24
    1ea6:	2a 95       	dec	r18
    1ea8:	d1 f7       	brne	.-12     	; 0x1e9e <__vector_13+0x12c>
    1eaa:	28 2f       	mov	r18, r24

Hm, well how does that compare to when I shifted before I did the volume adjustment?

	// Adjust 12-bit sample by volume:
	uint32_t tmp;
	MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
    1e9c:	2f 85       	ldd	r18, Y+15	; 0x0f
    1e9e:	00 27       	eor	r16, r16
    1ea0:	24 9f       	mul	r18, r20
    1ea2:	c0 01       	movw	r24, r0
    1ea4:	a0 2f       	mov	r26, r16
    1ea6:	b0 2f       	mov	r27, r16
    1ea8:	25 9f       	mul	r18, r21
    1eaa:	90 0d       	add	r25, r0
    1eac:	a1 1d       	adc	r26, r1
    1eae:	b0 1f       	adc	r27, r16
    1eb0:	11 24       	eor	r1, r1
	sample = tmp >> 8; // This should be optimized to just copying some bytes.
    1eb2:	89 2f       	mov	r24, r25
    1eb4:	9a 2f       	mov	r25, r26
    1eb6:	ab 2f       	mov	r26, r27
    1eb8:	bb 27       	eor	r27, r27
    1eba:	28 2f       	mov	r18, r24

Hm...

I wonder how the shift by 12 compares to doing a shift by 8 and then a shift by 4?

	// Adjust sample by volume:
	uint32_t tmp;
	MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
    1e86:	2f 85       	ldd	r18, Y+15	; 0x0f
    1e88:	00 27       	eor	r16, r16
    1e8a:	24 9f       	mul	r18, r20
    1e8c:	c0 01       	movw	r24, r0
    1e8e:	a0 2f       	mov	r26, r16
    1e90:	b0 2f       	mov	r27, r16
    1e92:	25 9f       	mul	r18, r21
    1e94:	90 0d       	add	r25, r0
    1e96:	a1 1d       	adc	r26, r1
    1e98:	b0 1f       	adc	r27, r16
    1e9a:	11 24       	eor	r1, r1
	
	sample = tmp >> 8; // For 12-bit sample input.  This should be optimized to just copying some bytes.		
    1e9c:	89 2f       	mov	r24, r25
    1e9e:	9a 2f       	mov	r25, r26
    1ea0:	ab 2f       	mov	r26, r27
    1ea2:	bb 27       	eor	r27, r27
    1ea4:	9c 01       	movw	r18, r24
	//sample = tmp >> 12; // For 16-bit sample input.  (output 12 bit sample)
	sample = sample >> 4;	// For 16-bit sample input.  (output 12 bit sample)
    1ea6:	84 e0       	ldi	r24, 0x04	; 4
    1ea8:	36 95       	lsr	r19
    1eaa:	27 95       	ror	r18
    1eac:	8a 95       	dec	r24
    1eae:	e1 f7       	brne	.-8      	; 0x1ea8 <__vector_13+0x136>

Shift by 12:

(1)    1e9c:	2c e0       	ldi	r18, 0x0C	; 12
(1)    1e9e:	b6 95       	lsr	r27
(1)    1ea0:	a7 95       	ror	r26
(1)    1ea2:	97 95       	ror	r25
(1)    1ea4:	87 95       	ror	r24
(1)    1ea6:	2a 95       	dec	r18
(1/2) 1ea8:	d1 f7       	brne	.-12     	; 0x1e9e <__vector_13+0x12c>
(1)    1eaa:	28 2f       	mov	r18, r24

I count 8 cycles in that loop, x12, -1 so 95 cycles +1 for the mov, so 96.

And now the shift by 8 then a shift by 4:

(1)    1e9c:	89 2f       	mov	r24, r25
(1)    1e9e:	9a 2f       	mov	r25, r26
(1)    1ea0:	ab 2f       	mov	r26, r27
(1)    1ea2:	bb 27       	eor	r27, r27
(1)    1ea4:	9c 01       	movw	r18, r24
	sample = sample >> 4;	// For 16-bit sample input.  (output 12 bit sample)
(1)    1ea6:	84 e0       	ldi	r24, 0x04	; 4
(1)    1ea8:	36 95       	lsr	r19
(1)    1eaa:	27 95       	ror	r18
(1)    1eac:	8a 95       	dec	r24
(1/2) 1eae:	e1 f7       	brne	.-8      	; 0x1ea8 <__vector_13+0x136>

Jeez, I can already tell before I even add it up that that is going to be way less.

Let's see, 5 cycles for the shift by 8, and 6 cycles in that loop x4, -1 (unless I shouldn't count the ldi instruction? I can't tell where the jump is jumping back to but I think maybe it's to lsr)
That gives me a total of.... 28 cycles! Holy crap that is 3x as fast.

I guess the lesson here is if there's a shift by 8 that can be done, do it explicitly, because the compiler ain't gonna do it for ya. And don't shift by 4 twice if you can do a shift by 8 instead.

New and improved hardware SPI DAC update code for WaveHC, with fine volume control:

(volume needs to be changed to a uint8_t in the wave class if it isn't, I forget what it defaulted to)

// Beware hidden spaces after \'s in defines!
// I guess no ; is needed after the asm because when you call MultiU8X16to24() you will put one after it, and that will just be replaced with the code below.

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"mul %A1, %B2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)


//------------------------------------------------------------------------------
// timer interrupt for DAC
ISR(TIMER1_COMPA_vect) {
  if (!playing) return;

  if (playpos >= playend) { // If we've reached the end of the buffer...
    if (sdstatus == SD_READY) {
    
      // swap double buffers
      playpos = sdbuff;
      playend = sdend;
      sdbuff = sdbuff != buffer1 ? buffer1 : buffer2;
      
      sdstatus = SD_FILLING;
      // interrupt to call SD reader
	    TIMSK1 |= _BV(OCIE1B);
    }
    else if (sdstatus == SD_END_FILE) {
      playing->stop();
      return;
    }
    else {
      // count overrun error if not at end of file
      if (playing->remainingBytesInChunk) {
        playing->errors++;
      }
      return;
    }
  }

  //uint8_t dh, dl;

  uint16_t sample;

  if (playing->BitsPerSample == 16) {
  
    // 16-bit is signed
    //dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1.  So xoring 101 with 000 gives 101.  While xoring 101 with 100 gives 001.  I guess fipping this one bit is all that is needed to convert from signed to unsigned! 
    //dl = playpos[0];
    
	// Calculate 16-bit sample:

		sample = 0X80 ^ playpos[1];
		sample = sample << 8;
		sample = sample | playpos[0];

	playpos += 2;

  }
  else {
  
    // 8-bit is unsigned
    //dh = playpos[0];
    //dl = 0;

	// Calculate 16-bit sample:
		sample = playpos[0]<<8;   
	
    playpos++;

  }
  
#if DVOLUME

	// dh, dl -> 16 bit tmp:
	//uint16_t tmp = (dh << 8) | dl;
	//tmp >>= playing->volume;
	
	// dh, dl -> 12 bit tmp:
 	//uint32_t tmp = (dh << 4) | (dl >> 4);
	//tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max. 

	// Adjust sample by volume:
		uint32_t tmp;
		MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
	
		sample = tmp >> 8; // Divide by 256 to finish adusting sample by volume.   	
	sample = sample >> 4;	// Convert 16 bit sample to 12 bit.

		// It is 3x faster to shift by 8 and then a shift by 4 than to let the compiler figure out it can do the same with a shift by 12!

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
     		PORTB &= ~0b10000; 

     // Send 16 bits to the DAC.
	// First 4 bits are configuration bits. 0011 = (DAC A, unbuffered, 1x gain, not muted)
	// Last 12 bits specify the output voltage. (MSB first)  
	
		//SPI.transfer((tmp >> 8)|0b00110000);
		//SPI.transfer(tmp);

#if DVOLUME 

    		SPDR = (sample >> 8)|0b00110000; // sample is 12-bit
    		while (!(SPSR & _BV(SPIF)));
		
    		SPDR = sample; //SPDR = tmp;
    		while (!(SPSR & _BV(SPIF)));

#else

		// This code is broken because I no longer store dh and dl seperately.

    		SPDR = 0b00110000 | (dh >> 4);
    		while (!(SPSR & _BV(SPIF)));
		
    		SPDR = (dh << 4) | (dl >> 4);
    		while (!(SPSR & _BV(SPIF)));

#endif

     // Take the SS pin high to de-select the DAC:
      	PORTB |= 0b10000;

}

This could probably be optimized quite a bit more if pure assembler was used stating from the point where the sample data was loaded into the sample variable. If some of those shifts could be removed that would save a lot of cycles.

This thread leads me to another question:

If the chip has a hardware multiplier which takes two clock cycles...why doesn't the compiler use it to do shift operations instead of creating a loop of single-bit shifts?

Hm... tried to get rid of one assembly instruction and have the OR operation done to r19 instead of copying it to another register and doing it on that, but the compiler instead added a movw command earlier in the code to move the data into new registers anyway:

	sample = tmp >> 8; // For 12-bit sample input.  This should be optimized to just copying some bytes.		
    1e9c:	89 2f       	mov	r24, r25
    1e9e:	9a 2f       	mov	r25, r26
    1ea0:	ab 2f       	mov	r26, r27
    1ea2:	bb 27       	eor	r27, r27
    1ea4:	9c 01       	movw	r18, r24
	//sample = tmp >> 12; // For 16-bit sample input.  (output 12 bit sample)
	sample = sample >> 4;	// For 16-bit sample input.  (output 12 bit sample)
    1ea6:	84 e0       	ldi	r24, 0x04	; 4
    1ea8:	36 95       	lsr	r19
    1eaa:	27 95       	ror	r18
    1eac:	8a 95       	dec	r24
    1eae:	e1 f7       	brne	.-8      	; 0x1ea8 <__vector_13+0x136>

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
     		PORTB &= ~0b10000; 
    1eb0:	2c 98       	cbi	0x05, 4	; 5
		//SPI.transfer(tmp);

#if DVOLUME 

		//SPDR = (tmp >> 8)|0b00110000; // tmp = 12-bit.  Does the C compiler optimize out the shift here?
    		SPDR = (sample >> 8)|0b00110000; // sample = 12-bit
    1eb2:	83 2f       	mov	r24, r19
    1eb4:	80 63       	ori	r24, 0x30	; 48
    1eb6:	8e bd       	out	0x2e, r24	; 46
    		while (!(SPSR & _BV(SPIF)));
    1eb8:	0d b4       	in	r0, 0x2d	; 45
    1eba:	07 fe       	sbrs	r0, 7
    1ebc:	fd cf       	rjmp	.-6      	; 0x1eb8 <__vector_13+0x146>
		sample = tmp >> 8; // Divide by 256 to finish adusting sample by volume.   	
    1e9c:	89 2f       	mov	r24, r25
    1e9e:	9a 2f       	mov	r25, r26
    1ea0:	ab 2f       	mov	r26, r27
    1ea2:	bb 27       	eor	r27, r27
	sample = sample >> 4;	// Convert 16 bit sample to 12 bit.
    1ea4:	24 e0       	ldi	r18, 0x04	; 4
    1ea6:	96 95       	lsr	r25
    1ea8:	87 95       	ror	r24
    1eaa:	2a 95       	dec	r18
    1eac:	e1 f7       	brne	.-8      	; 0x1ea6 <__vector_13+0x134>
		// It is 3x faster to shift by 8 and then a shift by 4 than to let the compiler figure out it can do the same with a shift by 12!

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
     		PORTB &= ~0b10000; 
    1eae:	2c 98       	cbi	0x05, 4	; 5
		//SPI.transfer((tmp >> 8)|0b00110000);
		//SPI.transfer(tmp);

#if DVOLUME 
		
		sample = sample | 0b0011000000000000;
    1eb0:	90 63       	ori	r25, 0x30	; 48
		SPDR = sample >> 8;
    1eb2:	9e bd       	out	0x2e, r25	; 46
    		//SPDR = (sample >> 8)|0b00110000; // sample is 12-bit
    		while (!(SPSR & _BV(SPIF)));
    1eb4:	0d b4       	in	r0, 0x2d	; 45
    1eb6:	07 fe       	sbrs	r0, 7
    1eb8:	fd cf       	rjmp	.-6      	; 0x1eb4 <__vector_13+0x142>

fungus:
If the chip has a hardware multiplier which takes two clock cycles...why doesn't the compiler use it to do shift operations instead of creating a loop of single-bit shifts?

I don't know. Possibly we don't have the latest version of the compiler. Possibly that particular optimization was omitted from the code generation section.