Poor GCC optimization

Hi,

I have been writing some time intensive code for Arduino and after analyzing the generated asm of the time intensive bits it seems that GCC does pretty poor job in optimizing the C++ code. Here's an example (smp is of type int8_t and res is int16_t):

      uint8_t vol=channel->volume;
      res+=(smp*vol)>>8;
    37c6:	2e 2f       	mov	r18, r30
    37c8:	33 27       	eor	r19, r19
    37ca:	27 fd       	sbrc	r18, 7
    37cc:	30 95       	com	r19
    37ce:	8f 85       	ldd	r24, Y+15	; 0x0f
    37d0:	90 e0       	ldi	r25, 0x00	; 0
    37d2:	fc 01       	movw	r30, r24
    37d4:	2e 9f       	mul	r18, r30
    37d6:	c0 01       	movw	r24, r0
    37d8:	2f 9f       	mul	r18, r31
    37da:	90 0d       	add	r25, r0
    37dc:	3e 9f       	mul	r19, r30
    37de:	90 0d       	add	r25, r0
    37e0:	11 24       	eor	r1, r1
    37e2:	89 2f       	mov	r24, r25
    37e4:	99 0f       	add	r25, r25
    37e6:	99 0b       	sbc	r25, r25
    37e8:	a8 0e       	add	r10, r24
    37ea:	b9 1e       	adc	r11, r25

So in this case GCC emits 3 (!) muls for the simple 8-bit x 8-bit multiplication, which should be single mul on ATmega328, and all this code seems pretty excessive for these lines of C++. So, I would say that there are no optimizations enabled by the GCC at all. Is there some compiler flags where I can enable optimization or is the AVR port of GCC just so poor in optimizing code?

player.ino (280 KB)

mod_player.h (4.4 KB)

mod_player.cpp (11.1 KB)

How are smp and vol declared? Also channel->volume?

Can you make a full sketch that demonstrates your point and not just two lines of code?

Here's an example (smp is of type int8_t and res is int16_t):

Sorry, I missed that bit. :slight_smile:

I made a test:

volatile byte volume;
volatile byte smp;

int16_t res;

void setup ()
  {
  uint8_t vol = volume;
  res += (smp * vol) >> 8;  
  Serial.println (res);
  }  // end of setup

void loop () { }

Generated for setup:

000000c0 <setup>:
  c0:	60 91 10 01 	lds	r22, 0x0110
  c4:	80 91 11 01 	lds	r24, 0x0111
  c8:	68 9f       	mul	r22, r24
  ca:	b0 01       	movw	r22, r0
  cc:	11 24       	eor	r1, r1
  ce:	67 2f       	mov	r22, r23
  d0:	77 0f       	add	r23, r23
  d2:	77 0b       	sbc	r23, r23
  d4:	80 91 12 01 	lds	r24, 0x0112
  d8:	90 91 13 01 	lds	r25, 0x0113
  dc:	68 0f       	add	r22, r24
  de:	79 1f       	adc	r23, r25
  e0:	70 93 13 01 	sts	0x0113, r23
  e4:	60 93 12 01 	sts	0x0112, r22
  e8:	88 e9       	ldi	r24, 0x98	; 152
  ea:	91 e0       	ldi	r25, 0x01	; 1
  ec:	4a e0       	ldi	r20, 0x0A	; 10
  ee:	50 e0       	ldi	r21, 0x00	; 0
  f0:	0e 94 0c 03 	call	0x618	; 0x618 <_ZN5Print7printlnEii>
  f4:	08 95       	ret

Just one mul there. I made the variables volatile otherwise the compiler would probably optimize everything away. It actually optimizes very aggressively.

Added the files to the original post.

It's the mod_player::mix_buffer_batch() function

JarkkoL:
So, I would say that there are no optimizations enabled by the GCC at all.

And I would say that is not correct:

/Applications/Arduino_1.0.5.app/Contents/Resources/Java/hardware/tools/avr/bin/avr-g++ -c -g -Os -Wall -fno-exceptions 
   -ffunction-sections -fdata-sections -mmcu=atmega328p -DF_CPU=16000000L -MMD -DUSB_VID=null -DUSB_PID=null 
   -DARDUINO=105

The -Os switch is enabled.

       -Os Optimize for size, but not at the expense of speed.  -Os enables all -O2 optimizations that do not typically increase code size.
           However, instructions are chosen for best performance, regardless of size.

I made a mistake with the data types.

Revised sketch:

volatile uint8_t volume;
volatile int8_t smp;

int16_t res;

void setup ()
  {
  uint8_t vol = volume;
  res += (smp * vol) >> 8;  
  Serial.println (res);
  }  // end of setup

void loop () { }

Gives:

000000c0 <setup>:
  c0:	20 91 10 01 	lds	r18, 0x0110
  c4:	80 91 11 01 	lds	r24, 0x0111
  c8:	30 e0       	ldi	r19, 0x00	; 0
  ca:	99 27       	eor	r25, r25
  cc:	87 fd       	sbrc	r24, 7
  ce:	90 95       	com	r25
  d0:	28 9f       	mul	r18, r24
  d2:	b0 01       	movw	r22, r0
  d4:	29 9f       	mul	r18, r25
  d6:	70 0d       	add	r23, r0
  d8:	38 9f       	mul	r19, r24
  da:	70 0d       	add	r23, r0
  dc:	11 24       	eor	r1, r1
  de:	67 2f       	mov	r22, r23
  e0:	77 0f       	add	r23, r23
  e2:	77 0b       	sbc	r23, r23
  e4:	80 91 12 01 	lds	r24, 0x0112
  e8:	90 91 13 01 	lds	r25, 0x0113
  ec:	68 0f       	add	r22, r24
  ee:	79 1f       	adc	r23, r25
  f0:	70 93 13 01 	sts	0x0113, r23
  f4:	60 93 12 01 	sts	0x0112, r22
  f8:	88 e9       	ldi	r24, 0x98	; 152
  fa:	91 e0       	ldi	r25, 0x01	; 1
  fc:	4a e0       	ldi	r20, 0x0A	; 10
  fe:	50 e0       	ldi	r21, 0x00	; 0
 100:	0e 94 14 03 	call	0x628	; 0x628 <_ZN5Print7printlnEii>
 104:	08 95       	ret

I think your problem here is that you are mixing signed and unsigned 8-bit types.

https://www.securecoding.cert.org/confluence/display/seccode/INT02-C.+Understand+integer+conversion+rules

According to the rules for integer promotion (from that page):

Integer Promotions

Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int. Integer promotions are applied as part of the usual arithmetic conversions to certain argument expressions; operands of the unary +, -, and ~ operators; and operands of the shift operators. The following code fragment shows the application of integer promotions:
...

So the compiler is required to promoted your operands to integers, hence the extra multiplications. There is a difference between poor optimization, and the compiler following rules it is required to follow.

Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though. This isn't the only missed optimization but there are plenty of others as well as I can tell. I was hoping to keep this code in C++ but it seems I would have to resort to inline asm to fix these performance issues ): By simply writing the multiplication in asm I was already able to increase the sampling rate from 25KHz to 30KHz

Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though.

Are you sure?

Multiplying an unsigned 8-bit number by a signed one means you have these ranges:

(0 to 255) X (-128 to +127).

In any case can't you have them as unsigned types? I showed that only took one mul if you do that.

This isn't the only missed optimization but there are plenty of others as well as I can tell.

What would they be?

This might be relevant:

Are you sure?

Multiplying an unsigned 8-bit number by a signed one means you have these ranges:

(0 to 255) X (-128 to +127).

In any case can't you have them as unsigned types? I showed that only took one mul if you do that.

There is special "mulsu" instruction for that purpose. You can't use unsigned mul.

What would they be?

For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.

I would have to agree that the 'optimizer' is an optimistic name :slight_smile: The game is given away by the flag -Os. In other words the gcc authors admit that they can deal with size (true), speed (doubtful) but not both at the same time. Hand assembly can do both at the same time, but I gave that up 20 years ago.

If you are really that constrained, you have no choice but to keep modifying the code and checking the assembly results until you trick the 'optimizer' into correct behavior. Or simply writing the assembler yourself. Or use a faster chip. Maybe its time for 32 bits.

Ok, maybe I could change IDE to use -O2 instead, and maybe it generates better code. Or is the an optimization #pragma expression for GCC I can use?

I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.

JarkkoL:
For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.

Looks like that was fixed in 4.7.0 in 2011.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50910

JarkkoL:
I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.

I haven't done it myself but you could try altering your build process to use the later avr-gcc program.

As far as I understand, C has no concept of the idea that the result of an operation can be a different type than the operands (as in "an 8x8 multiply gives a 16bit result.") So I think that...

res += (smp * vol) >> 8;

has two possible interpretations:

  1. smp and vol are 8 bits, and are promoted to 16bit quantities before the multiply, giving a 16bit result. (this is what is happening, right?)
  2. smp and vol are 8 bits, the result is 8 bits, shifting right by 8 bits gives zero, and the statement is a no-op. (this is not what you want, is it?)

You wanted "smp and vol are 8 bits, the result of a multiply is 16bits, give me the high 8 of those 16bits" - C is not going to do that.

"mixed-size math" remains one of the areas where assembly language retains a big advantage over C.

It's easy to do and works very well.

I haven't done it myself but you could try altering your build process to use the later avr-gcc program.

Ah, that's a good idea. I didn't realize GCC that comes with Arduino download is from 2008.

It's easy to do and works very well.

Great, can you tell me how to update avr-gcc exactly?

Edit: Is the latest GCC 4.8.1 available for AVR & Windows that I could easily install?

http://forum.arduino.cc/index.php?topic=51984.msg371307#msg371307
http://forum.arduino.cc/index.php?topic=37965.0

Atmel provides the latest compiler here...