I have been writing some time-intensive code for Arduino, and after analyzing the generated asm of the time-intensive bits it seems that GCC does a pretty poor job of optimizing the C++ code. Here's an example (smp is of type int8_t and res is int16_t):
So in this case GCC emits 3 (!) muls for the simple 8-bit x 8-bit multiplication, which should be a single mul on ATmega328, and all this code seems pretty excessive for these lines of C++. So I would say that GCC has no optimizations enabled at all. Are there some compiler flags I can use to enable optimization, or is the AVR port of GCC just that poor at optimizing code?
Just one mul there. I made the variables volatile; otherwise the compiler would probably optimize everything away. It actually optimizes very aggressively.
-Os Optimize for size, but not at the expense of speed. -Os enables all -O2 optimizations that do not typically increase code size.
However, instructions are chosen for best performance, regardless of size.
According to the rules for integer promotion (from that page):
Integer Promotions
Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int. Integer promotions are applied as part of the usual arithmetic conversions to certain argument expressions; operands of the unary +, -, and ~ operators; and operands of the shift operators. The following code fragment shows the application of integer promotions:
...
So the compiler is required to promote your operands to integers, hence the extra multiplications. There is a difference between poor optimization and the compiler following rules it is required to follow.
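To make the promotion rule concrete, here is a minimal sketch. The helper names are made up for illustration, and the operand types (int8_t and uint8_t) are an assumption based on the types mentioned in this thread. It only shows that the expression `smp * vol` is evaluated as an int, not as an 8-bit value.

```cpp
#include <stdint.h>

// Hypothetical helper: the multiplication as the compiler sees it.
// Both operands are promoted to int before the multiply happens.
int promoted_product(int8_t smp, uint8_t vol) {
    return smp * vol;
}

// Hypothetical helper: the type of the expression has the size of int
// (2 bytes on AVR, usually 4 bytes on a PC), never 1 byte.
bool product_has_int_size() {
    int8_t smp = 0;
    uint8_t vol = 0;
    return sizeof(smp * vol) == sizeof(int);
}
```

On an ATmega328 int is 16 bits, so the promoted multiply is a 16-bit x 16-bit operation unless the compiler can prove the upper bytes are just sign or zero extension.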
Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though. This isn't the only missed optimization but there are plenty of others as well as I can tell. I was hoping to keep this code in C++, but it seems I would have to resort to inline asm to fix these performance issues ): By simply writing the multiplication in asm I was already able to increase the sampling rate from 25 kHz to 30 kHz.
Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though.
Are you sure?
Multiplying an unsigned 8-bit number by a signed one means you have these ranges:
(0 to 255) X (-128 to +127).
In any case can't you have them as unsigned types? I showed that only took one mul if you do that.
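A rough sketch of why the signedness matters, with function names invented for the example. The extremes of the ranges above show that a signed x unsigned product needs a signed 16-bit result, while unsigned x unsigned fits an unsigned 16-bit result, which is what the plain AVR `mul` instruction produces (`muls` and `mulsu` are the signed and mixed variants).

```cpp
#include <stdint.h>

// Signed sample times unsigned volume: result spans -32640 .. 32385,
// so it needs a signed 16-bit type.
int16_t scale_signed(int8_t smp, uint8_t vol) {
    return (int16_t)smp * vol;
}

// Both operands unsigned: result spans 0 .. 65025,
// which fits an unsigned 16-bit type.
uint16_t scale_unsigned(uint8_t smp, uint8_t vol) {
    return (uint16_t)smp * vol;
}
```

Whether the compiler turns either of these into a single multiply instruction depends on the avr-gcc version and flags, so it is worth checking the generated asm rather than assuming.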
This isn't the only missed optimization but there are plenty of others as well as I can tell.
I would have to agree that the 'optimizer' is an optimistic name. The game is given away by the flag -Os. In other words the gcc authors admit that they can deal with size (true), speed (doubtful) but not both at the same time. Hand assembly can do both at the same time, but I gave that up 20 years ago.
If you are really that constrained, you have no choice but to keep modifying the code and checking the assembly results until you trick the 'optimizer' into correct behavior. Or simply write the assembly yourself. Or use a faster chip. Maybe it's time for 32 bits.
Ok, maybe I could change the IDE to use -O2 instead, and maybe it generates better code. Or is there an optimization #pragma for GCC I can use?
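For what it's worth, reasonably recent GCC releases do have function-specific option pragmas. The sketch below assumes the avr-gcc bundled with your IDE is new enough to accept them (older ones may just ignore the pragma with a warning), and `mix_sample` is a made-up function name standing in for the hot loop body.

```cpp
#include <stdint.h>

// Save the command-line options, then request -O2 for what follows.
#pragma GCC push_options
#pragma GCC optimize ("O2")

// Hypothetical hot function: scale a sample by a volume and accumulate.
// Right-shifting a negative product is implementation-defined in older
// C++ standards; GCC performs an arithmetic shift.
int16_t mix_sample(int16_t res, int8_t smp, uint8_t vol) {
    return res + ((smp * vol) >> 8);
}

// Restore the original options for the rest of the file.
#pragma GCC pop_options
```

This only changes the optimization level for the functions between the two pragmas, so the rest of the sketch still builds with -Os. It does not guarantee a single mul; the asm still needs checking.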
I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.
JarkkoL:
I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.
I haven't done it myself, but you could try altering your build process to use a later version of avr-gcc.
As far as I understand, C has no concept of the idea that the result of an operation can be a different type than the operands (as in "an 8x8 multiply gives a 16bit result.") So I think that...
res += (smp * vol) >> 8;
has two possible interpretations:
smp and vol are 8 bits, and are promoted to 16bit quantities before the multiply, giving a 16bit result. (this is what is happening, right?)
smp and vol are 8 bits, the result is 8 bits, shifting right by 8 bits gives zero, and the statement is a no-op. (this is not what you want, is it?)
You wanted "smp and vol are 8 bits, the result of a multiply is 16bits, give me the high 8 of those 16bits" - C is not going to do that.
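The two interpretations can be sketched side by side. The function names are invented for the example, and the second one only simulates the hypothetical "8-bit result" reading by truncating the product before the shift; it is not something the language does on its own.

```cpp
#include <stdint.h>

// Interpretation 1 (what C and C++ actually do): operands are promoted,
// the product is a full int, and the shift keeps its upper bits.
int16_t high_byte_promoted(int8_t smp, uint8_t vol) {
    return (smp * vol) >> 8;
}

// Interpretation 2 (hypothetical): pretend the product were only 8 bits
// wide. Truncating first throws the upper byte away, so the shift
// always yields zero.
int16_t high_byte_truncated(int8_t smp, uint8_t vol) {
    uint8_t narrow = (uint8_t)(smp * vol);
    return narrow >> 8;
}
```

With smp = 100 and vol = 128 the product is 12800, so the first version returns 50 and the second returns 0.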
"mixed-size math" remains one of the areas where assembly language retains a big advantage over C.