Go Down

Topic: Poor GCC optimization (Read 2 times) previous topic - next topic

JarkkoL

Jul 27, 2013, 06:26 am Last Edit: Jul 27, 2013, 06:42 am by JarkkoL Reason: 1
Hi,

I have been writing some time intensive code for Arduino and after analyzing the generated asm of the time intensive bits it seems that GCC does pretty poor job in optimizing the C++ code. Here's an example (smp is of type int8_t and res is int16_t):
Code: [Select]

     uint8_t vol=channel->volume;
     res+=(smp*vol)>>8;
   37c6: 2e 2f       mov r18, r30
   37c8: 33 27       eor r19, r19
   37ca: 27 fd       sbrc r18, 7
   37cc: 30 95       com r19
   37ce: 8f 85       ldd r24, Y+15 ; 0x0f
   37d0: 90 e0       ldi r25, 0x00 ; 0
   37d2: fc 01       movw r30, r24
   37d4: 2e 9f       mul r18, r30
   37d6: c0 01       movw r24, r0
   37d8: 2f 9f       mul r18, r31
   37da: 90 0d       add r25, r0
   37dc: 3e 9f       mul r19, r30
   37de: 90 0d       add r25, r0
   37e0: 11 24       eor r1, r1
   37e2: 89 2f       mov r24, r25
   37e4: 99 0f       add r25, r25
   37e6: 99 0b       sbc r25, r25
   37e8: a8 0e       add r10, r24
   37ea: b9 1e       adc r11, r25

So in this case GCC emits 3 (!) muls for the simple 8-bit x 8-bit multiplication, which should be single mul on ATmega328, and all this code seems pretty excessive for these lines of C++. So, I would say that there are no optimizations enabled by the GCC at all. Is there some compiler flags where I can enable optimization or is the AVR port of GCC just so poor in optimizing code?

Nick Gammon

How are smp and vol declared? Also channel->volume?

Can you make a full sketch that demonstrates your point and not just two lines of code?
http://www.gammon.com.au/electronics

Nick Gammon

Quote

Here's an example (smp is of type int8_t and res is int16_t):


Sorry, I missed that bit. :)
http://www.gammon.com.au/electronics

Nick Gammon

I made a test:

Code: [Select]

volatile byte volume;
volatile byte smp;

int16_t res;

void setup ()
  {
  uint8_t vol = volume;
  res += (smp * vol) >> 8; 
  Serial.println (res);
  }  // end of setup

void loop () { }


Generated for setup:


Code: [Select]

000000c0 <setup>:
  c0: 60 91 10 01 lds r22, 0x0110
  c4: 80 91 11 01 lds r24, 0x0111
  c8: 68 9f        mul r22, r24
  ca: b0 01        movw r22, r0
  cc: 11 24        eor r1, r1
  ce: 67 2f        mov r22, r23
  d0: 77 0f        add r23, r23
  d2: 77 0b        sbc r23, r23
  d4: 80 91 12 01 lds r24, 0x0112
  d8: 90 91 13 01 lds r25, 0x0113
  dc: 68 0f        add r22, r24
  de: 79 1f        adc r23, r25
  e0: 70 93 13 01 sts 0x0113, r23
  e4: 60 93 12 01 sts 0x0112, r22
  e8: 88 e9        ldi r24, 0x98 ; 152
  ea: 91 e0        ldi r25, 0x01 ; 1
  ec: 4a e0        ldi r20, 0x0A ; 10
  ee: 50 e0        ldi r21, 0x00 ; 0
  f0: 0e 94 0c 03 call 0x618 ; 0x618 <_ZN5Print7printlnEii>
  f4: 08 95        ret


Just one mul there. I made the variables volatile otherwise the compiler would probably optimize everything away. It actually optimizes very aggressively.
http://www.gammon.com.au/electronics

JarkkoL


Can you make a full sketch that demonstrates your point and not just two lines of code?

Added the files to the original post.

It's the mod_player::mix_buffer_batch() function

Nick Gammon

#5
Jul 27, 2013, 06:44 am Last Edit: Jul 27, 2013, 09:04 am by Nick Gammon Reason: 1

So, I would say that there are no optimizations enabled by the GCC at all.


And I would say that is not correct:

Code: [Select]

/Applications/Arduino_1.0.5.app/Contents/Resources/Java/hardware/tools/avr/bin/avr-g++ -c -g -Os -Wall -fno-exceptions
   -ffunction-sections -fdata-sections -mmcu=atmega328p -DF_CPU=16000000L -MMD -DUSB_VID=null -DUSB_PID=null
   -DARDUINO=105


The -Os switch is enabled.

Code: [Select]

      -Os Optimize for size, but not at the expense of speed.  -Os enables all -O2 optimizations that do not typically increase code size.
          However, instructions are chosen for best performance, regardless of size.  
http://www.gammon.com.au/electronics

Nick Gammon

I made a mistake with the data types.

Revised sketch:

Code: [Select]

volatile uint8_t volume;
volatile int8_t smp;

int16_t res;

void setup ()
  {
  uint8_t vol = volume;
  res += (smp * vol) >> 8; 
  Serial.println (res);
  }  // end of setup

void loop () { }


Gives:

Code: [Select]

000000c0 <setup>:
  c0: 20 91 10 01 lds r18, 0x0110
  c4: 80 91 11 01 lds r24, 0x0111
  c8: 30 e0        ldi r19, 0x00 ; 0
  ca: 99 27        eor r25, r25
  cc: 87 fd        sbrc r24, 7
  ce: 90 95        com r25
  d0: 28 9f        mul r18, r24
  d2: b0 01        movw r22, r0
  d4: 29 9f        mul r18, r25
  d6: 70 0d        add r23, r0
  d8: 38 9f        mul r19, r24
  da: 70 0d        add r23, r0
  dc: 11 24        eor r1, r1
  de: 67 2f        mov r22, r23
  e0: 77 0f        add r23, r23
  e2: 77 0b        sbc r23, r23
  e4: 80 91 12 01 lds r24, 0x0112
  e8: 90 91 13 01 lds r25, 0x0113
  ec: 68 0f        add r22, r24
  ee: 79 1f        adc r23, r25
  f0: 70 93 13 01 sts 0x0113, r23
  f4: 60 93 12 01 sts 0x0112, r22
  f8: 88 e9        ldi r24, 0x98 ; 152
  fa: 91 e0        ldi r25, 0x01 ; 1
  fc: 4a e0        ldi r20, 0x0A ; 10
  fe: 50 e0        ldi r21, 0x00 ; 0
100: 0e 94 14 03 call 0x628 ; 0x628 <_ZN5Print7printlnEii>
104: 08 95        ret

http://www.gammon.com.au/electronics

Nick Gammon

I think your problem here is that you are mixing signed and unsigned 8-bit types.

https://www.securecoding.cert.org/confluence/display/seccode/INT02-C.+Understand+integer+conversion+rules

According to the rules for integer promotion (from that page):

Quote

Integer Promotions

Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int. Integer promotions are applied as part of the usual arithmetic conversions to certain argument expressions; operands of the unary +, -, and ~ operators; and operands of the shift operators. The following code fragment shows the application of integer promotions:
...


So the compiler is required to promoted your operands to integers, hence the extra multiplications. There is a difference between poor optimization, and the compiler following rules it is required to follow.
http://www.gammon.com.au/electronics

JarkkoL

Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though. This isn't the only missed optimization but there are plenty of others as well as I can tell. I was hoping to keep this code in C++ but it seems I would have to resort to inline asm to fix these performance issues ): By simply writing the multiplication in asm I was already able to increase the sampling rate from 25KHz to 30KHz

Nick Gammon

Quote

Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though.


Are you sure?

Multiplying an unsigned 8-bit number by a signed one means you have these ranges:

(0 to 255) X (-128 to +127).

In any case can't you have them as unsigned types? I showed that only took one mul if you do that.

Quote

This isn't the only missed optimization but there are plenty of others as well as I can tell.


What would they be?
http://www.gammon.com.au/electronics

Nick Gammon

This might be relevant:

http://mekonik.wordpress.com/2009/03/18/arduino-avr-gcc-multiplication/
http://www.gammon.com.au/electronics

JarkkoL

Quote

Are you sure?

Multiplying an unsigned 8-bit number by a signed one means you have these ranges:

(0 to 255) X (-128 to +127).

In any case can't you have them as unsigned types? I showed that only took one mul if you do that.

There is special "mulsu" instruction for that purpose. You can't use unsigned mul.

Quote

What would they be?

For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.

joe mcd

I would have to agree that the 'optimizer' is an optimistic name :)  The game is given away by the flag -Os.  In other words the gcc authors admit that they can deal with size (true), speed (doubtful) but not both at the same time.  Hand assembly can do both at the same time, but I gave that up 20 years ago.

If you are really that constrained, you have no choice but to keep modifying the code and checking the assembly results until you trick the 'optimizer' into correct behavior.  Or simply writing the assembler yourself.  Or use a faster chip.  Maybe its time for 32 bits.


JarkkoL

Ok, maybe I could change IDE to use -O2 instead, and maybe it generates better code. Or is the an optimization #pragma expression for GCC I can use?

I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.

Nick Gammon


For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.


Looks like that was fixed in 4.7.0 in 2011.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50910

http://www.gammon.com.au/electronics

Go Up