Poor GCC optimization

Added the files to the original post.

It's the mod_player::mix_buffer_batch() function

JarkkoL:
So, I would say that there are no optimizations enabled by the GCC at all.

And I would say that is not correct:

/Applications/Arduino_1.0.5.app/Contents/Resources/Java/hardware/tools/avr/bin/avr-g++ -c -g -Os -Wall -fno-exceptions 
   -ffunction-sections -fdata-sections -mmcu=atmega328p -DF_CPU=16000000L -MMD -DUSB_VID=null -DUSB_PID=null 
   -DARDUINO=105

The -Os switch is enabled.

       -Os Optimize for size, but not at the expense of speed.  -Os enables all -O2 optimizations that do not typically increase code size.
           However, instructions are chosen for best performance, regardless of size.

I made a mistake with the data types.

Revised sketch:

volatile uint8_t volume;
volatile int8_t smp;

int16_t res;

void setup ()
  {
  uint8_t vol = volume;
  res += (smp * vol) >> 8;  
  Serial.println (res);
  }  // end of setup

void loop () { }

Gives:

000000c0 <setup>:
  c0:	20 91 10 01 	lds	r18, 0x0110
  c4:	80 91 11 01 	lds	r24, 0x0111
  c8:	30 e0       	ldi	r19, 0x00	; 0
  ca:	99 27       	eor	r25, r25
  cc:	87 fd       	sbrc	r24, 7
  ce:	90 95       	com	r25
  d0:	28 9f       	mul	r18, r24
  d2:	b0 01       	movw	r22, r0
  d4:	29 9f       	mul	r18, r25
  d6:	70 0d       	add	r23, r0
  d8:	38 9f       	mul	r19, r24
  da:	70 0d       	add	r23, r0
  dc:	11 24       	eor	r1, r1
  de:	67 2f       	mov	r22, r23
  e0:	77 0f       	add	r23, r23
  e2:	77 0b       	sbc	r23, r23
  e4:	80 91 12 01 	lds	r24, 0x0112
  e8:	90 91 13 01 	lds	r25, 0x0113
  ec:	68 0f       	add	r22, r24
  ee:	79 1f       	adc	r23, r25
  f0:	70 93 13 01 	sts	0x0113, r23
  f4:	60 93 12 01 	sts	0x0112, r22
  f8:	88 e9       	ldi	r24, 0x98	; 152
  fa:	91 e0       	ldi	r25, 0x01	; 1
  fc:	4a e0       	ldi	r20, 0x0A	; 10
  fe:	50 e0       	ldi	r21, 0x00	; 0
 100:	0e 94 14 03 	call	0x628	; 0x628 <_ZN5Print7printlnEii>
 104:	08 95       	ret

I think your problem here is that you are mixing signed and unsigned 8-bit types.

https://www.securecoding.cert.org/confluence/display/seccode/INT02-C.+Understand+integer+conversion+rules

According to the rules for integer promotion (from that page):

Integer Promotions

Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int. Integer promotions are applied as part of the usual arithmetic conversions to certain argument expressions; operands of the unary +, -, and ~ operators; and operands of the shift operators. The following code fragment shows the application of integer promotions:
...

So the compiler is required to promoted your operands to integers, hence the extra multiplications. There is a difference between poor optimization, and the compiler following rules it is required to follow.

Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though. This isn't the only missed optimization but there are plenty of others as well as I can tell. I was hoping to keep this code in C++ but it seems I would have to resort to inline asm to fix these performance issues ): By simply writing the multiplication in asm I was already able to increase the sampling rate from 25KHz to 30KHz

Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though.

Are you sure?

Multiplying an unsigned 8-bit number by a signed one means you have these ranges:

(0 to 255) X (-128 to +127).

In any case can't you have them as unsigned types? I showed that only took one mul if you do that.

This isn't the only missed optimization but there are plenty of others as well as I can tell.

What would they be?

This might be relevant:

Are you sure?

Multiplying an unsigned 8-bit number by a signed one means you have these ranges:

(0 to 255) X (-128 to +127).

In any case can't you have them as unsigned types? I showed that only took one mul if you do that.

There is special "mulsu" instruction for that purpose. You can't use unsigned mul.

What would they be?

For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.

I would have to agree that the 'optimizer' is an optimistic name :slight_smile: The game is given away by the flag -Os. In other words the gcc authors admit that they can deal with size (true), speed (doubtful) but not both at the same time. Hand assembly can do both at the same time, but I gave that up 20 years ago.

If you are really that constrained, you have no choice but to keep modifying the code and checking the assembly results until you trick the 'optimizer' into correct behavior. Or simply writing the assembler yourself. Or use a faster chip. Maybe its time for 32 bits.

Ok, maybe I could change IDE to use -O2 instead, and maybe it generates better code. Or is the an optimization #pragma expression for GCC I can use?

I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.

JarkkoL:
For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.

Looks like that was fixed in 4.7.0 in 2011.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50910

JarkkoL:
I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.

I haven't done it myself but you could try altering your build process to use the later avr-gcc program.

As far as I understand, C has no concept of the idea that the result of an operation can be a different type than the operands (as in "an 8x8 multiply gives a 16bit result.") So I think that...

res += (smp * vol) >> 8;

has two possible interpretations:

  1. smp and vol are 8 bits, and are promoted to 16bit quantities before the multiply, giving a 16bit result. (this is what is happening, right?)
  2. smp and vol are 8 bits, the result is 8 bits, shifting right by 8 bits gives zero, and the statement is a no-op. (this is not what you want, is it?)

You wanted "smp and vol are 8 bits, the result of a multiply is 16bits, give me the high 8 of those 16bits" - C is not going to do that.

"mixed-size math" remains one of the areas where assembly language retains a big advantage over C.

It's easy to do and works very well.

I haven't done it myself but you could try altering your build process to use the later avr-gcc program.

Ah, that's a good idea. I didn't realize GCC that comes with Arduino download is from 2008.

It's easy to do and works very well.

Great, can you tell me how to update avr-gcc exactly?

Edit: Is the latest GCC 4.8.1 available for AVR & Windows that I could easily install?

http://forum.arduino.cc/index.php?topic=51984.msg371307#msg371307
http://forum.arduino.cc/index.php?topic=37965.0

Atmel provides the latest compiler here...

Except in special cases, no compiler can optimize for both speed and size at the same time. In addition to compiling for speed and compiling for size, and third thing people need to choose is whether the time spent doing the compilation is more important than either the time it takes the executable code to run or the size of the code.

And as mentioned in other posts, the GCC shipped with the AVR is based on a release from 5 years ago.

I updated the compiler to the latest one from Atmel (4.7.2) and it does much better job with optimization. For multiplication it's now also using the "mulsu" instruction instead of 3 muls:

      res+=(smp*vol)>>8;
    383e:	0f 2f       	mov	r16, r31
    3840:	1e 2f       	mov	r17, r30
    3842:	10 03       	mulsu	r17, r16
    3844:	f0 01       	movw	r30, r0
    3846:	11 24       	eor	r1, r1
    3848:	ef 2f       	mov	r30, r31
    384a:	ff 0f       	add	r31, r31
    384c:	ff 0b       	sbc	r31, r31
    384e:	2e 0f       	add	r18, r30
    3850:	3f 1f       	adc	r19, r31

It still seems a bit suboptimal though. For example the two first mov instructions are useless and mulsu could take r30 and r31 straight as inputs.

I updated the compiler to the latest one from Atmel (4.7.2)

How do you do that?

Reply #19