Added the files to the original post.
It's the mod_player::mix_buffer_batch() function
Added the files to the original post.
It's the mod_player::mix_buffer_batch() function
JarkkoL:
So, I would say that there are no optimizations enabled by the GCC at all.
And I would say that is not correct:
/Applications/Arduino_1.0.5.app/Contents/Resources/Java/hardware/tools/avr/bin/avr-g++ -c -g -Os -Wall -fno-exceptions
-ffunction-sections -fdata-sections -mmcu=atmega328p -DF_CPU=16000000L -MMD -DUSB_VID=null -DUSB_PID=null
-DARDUINO=105
The -Os switch is enabled.
-Os Optimize for size, but not at the expense of speed. -Os enables all -O2 optimizations that do not typically increase code size.
However, instructions are chosen for best performance, regardless of size.
I made a mistake with the data types.
Revised sketch:
volatile uint8_t volume;
volatile int8_t smp;
int16_t res;
void setup ()
{
uint8_t vol = volume;
res += (smp * vol) >> 8;
Serial.println (res);
} // end of setup
void loop () { }
Gives:
000000c0 <setup>:
c0: 20 91 10 01 lds r18, 0x0110
c4: 80 91 11 01 lds r24, 0x0111
c8: 30 e0 ldi r19, 0x00 ; 0
ca: 99 27 eor r25, r25
cc: 87 fd sbrc r24, 7
ce: 90 95 com r25
d0: 28 9f mul r18, r24
d2: b0 01 movw r22, r0
d4: 29 9f mul r18, r25
d6: 70 0d add r23, r0
d8: 38 9f mul r19, r24
da: 70 0d add r23, r0
dc: 11 24 eor r1, r1
de: 67 2f mov r22, r23
e0: 77 0f add r23, r23
e2: 77 0b sbc r23, r23
e4: 80 91 12 01 lds r24, 0x0112
e8: 90 91 13 01 lds r25, 0x0113
ec: 68 0f add r22, r24
ee: 79 1f adc r23, r25
f0: 70 93 13 01 sts 0x0113, r23
f4: 60 93 12 01 sts 0x0112, r22
f8: 88 e9 ldi r24, 0x98 ; 152
fa: 91 e0 ldi r25, 0x01 ; 1
fc: 4a e0 ldi r20, 0x0A ; 10
fe: 50 e0 ldi r21, 0x00 ; 0
100: 0e 94 14 03 call 0x628 ; 0x628 <_ZN5Print7printlnEii>
104: 08 95 ret
I think your problem here is that you are mixing signed and unsigned 8-bit types.
According to the rules for integer promotion (from that page):
Integer Promotions
Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int. Integer promotions are applied as part of the usual arithmetic conversions to certain argument expressions; operands of the unary +, -, and ~ operators; and operands of the shift operators. The following code fragment shows the application of integer promotions:
...
So the compiler is required to promoted your operands to integers, hence the extra multiplications. There is a difference between poor optimization, and the compiler following rules it is required to follow.
Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though. This isn't the only missed optimization but there are plenty of others as well as I can tell. I was hoping to keep this code in C++ but it seems I would have to resort to inline asm to fix these performance issues ): By simply writing the multiplication in asm I was already able to increase the sampling rate from 25KHz to 30KHz
Regardless of type promotion the compiler can recognize that the values for multiplication originate from two 8-bit values and perform the optimization though.
Are you sure?
Multiplying an unsigned 8-bit number by a signed one means you have these ranges:
(0 to 255) X (-128 to +127).
In any case can't you have them as unsigned types? I showed that only took one mul if you do that.
This isn't the only missed optimization but there are plenty of others as well as I can tell.
What would they be?
This might be relevant:
Are you sure?
Multiplying an unsigned 8-bit number by a signed one means you have these ranges:
(0 to 255) X (-128 to +127).
In any case can't you have them as unsigned types? I showed that only took one mul if you do that.
There is special "mulsu" instruction for that purpose. You can't use unsigned mul.
What would they be?
For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.
I would have to agree that the 'optimizer' is an optimistic name The game is given away by the flag -Os. In other words the gcc authors admit that they can deal with size (true), speed (doubtful) but not both at the same time. Hand assembly can do both at the same time, but I gave that up 20 years ago.
If you are really that constrained, you have no choice but to keep modifying the code and checking the assembly results until you trick the 'optimizer' into correct behavior. Or simply writing the assembler yourself. Or use a faster chip. Maybe its time for 32 bits.
Ok, maybe I could change IDE to use -O2 instead, and maybe it generates better code. Or is the an optimization #pragma expression for GCC I can use?
I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.
JarkkoL:
For example GCC fails do use right arithmetic shift for res/=4 but does the very slow division instead.
Looks like that was fixed in 4.7.0 in 2011.
JarkkoL:
I have Teensy 3.0 as well but I like to have this code running nicely on Uno for the sake of the microcontroller optimization challenge and exercise (: I haven't been writing asm for quite some time either, but maybe it's time to get a bit familiar with AVR asm.
I haven't done it myself but you could try altering your build process to use the later avr-gcc program.
As far as I understand, C has no concept of the idea that the result of an operation can be a different type than the operands (as in "an 8x8 multiply gives a 16bit result.") So I think that...
res += (smp * vol) >> 8;
has two possible interpretations:
You wanted "smp and vol are 8 bits, the result of a multiply is 16bits, give me the high 8 of those 16bits" - C is not going to do that.
"mixed-size math" remains one of the areas where assembly language retains a big advantage over C.
It's easy to do and works very well.
I haven't done it myself but you could try altering your build process to use the later avr-gcc program.
Ah, that's a good idea. I didn't realize GCC that comes with Arduino download is from 2008.
It's easy to do and works very well.
Great, can you tell me how to update avr-gcc exactly?
Edit: Is the latest GCC 4.8.1 available for AVR & Windows that I could easily install?
http://forum.arduino.cc/index.php?topic=51984.msg371307#msg371307
http://forum.arduino.cc/index.php?topic=37965.0
Atmel provides the latest compiler here...
Except in special cases, no compiler can optimize for both speed and size at the same time. In addition to compiling for speed and compiling for size, and third thing people need to choose is whether the time spent doing the compilation is more important than either the time it takes the executable code to run or the size of the code.
And as mentioned in other posts, the GCC shipped with the AVR is based on a release from 5 years ago.
I updated the compiler to the latest one from Atmel (4.7.2) and it does much better job with optimization. For multiplication it's now also using the "mulsu" instruction instead of 3 muls:
res+=(smp*vol)>>8;
383e: 0f 2f mov r16, r31
3840: 1e 2f mov r17, r30
3842: 10 03 mulsu r17, r16
3844: f0 01 movw r30, r0
3846: 11 24 eor r1, r1
3848: ef 2f mov r30, r31
384a: ff 0f add r31, r31
384c: ff 0b sbc r31, r31
384e: 2e 0f add r18, r30
3850: 3f 1f adc r19, r31
It still seems a bit suboptimal though. For example the two first mov instructions are useless and mulsu could take r30 and r31 straight as inputs.
I updated the compiler to the latest one from Atmel (4.7.2)
How do you do that?