
Topic: optimizing code? (Read 12867 times)

JustinHoMi

Feb 28, 2009, 01:48 am Last Edit: Feb 28, 2009, 01:50 am by JustinHoMi Reason: 1
Any tips on optimizing code for the Arduino? That is, for execution speed as well as size. Of course, general C optimization tips are welcome too! I've found a few general websites for optimizing C, but nothing really for embedded dev.

I figure I might as well start developing good habits now!

JustinHoMi

#1
Feb 28, 2009, 01:58 am Last Edit: Feb 28, 2009, 01:59 am by JustinHoMi Reason: 1
I guess I'll start with a quick question. Does the compiler automatically convert multiplication and division by powers of two into bit shifts?

For example... would this...

Code: [Select]
x = y * 8;

be the same as this, in the generated assembly language?

Code: [Select]
x = y << 3;

westfw

#2
Feb 28, 2009, 03:44 am Last Edit: Feb 28, 2009, 03:45 am by westfw Reason: 1
Hmm.  Start with good habits, eh?

1) You should hardly ever make a lot of effort to "optimize" code that doesn't actually need to be optimized.

2) You should think REALLY, REALLY hard about sacrificing code clarity for the sake of speed or space.  Think of the cost of a bigger or faster CPU vs $100/hour for programmer time...

3) You get the most improvement from optimization techniques that are independent of language - moving fixed calculations outside of loops, for instance (there's a small sketch of this after the list).  You're also likely to lose more performance in functions that are beyond your ability to modify than in slightly sub-optimal statement syntax.

4) You should learn to look at (and understand) the produced assembly language code using avr-objdump, at least to the point of finding and counting the number of instructions executed for a particular statement.

5) The compiler is supposed to do a "pretty good" job of producing code, within the limits of the language.  If it does something really awful, you should complain to the compiler developers.

6) For extreme optimization, assembly language may be preferred over particularly obscure C syntax.  External ASM modules written in standard assembler syntax may be preferred over inline assembler in the weird GNU asm syntax.
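
For point 3, a quick sketch of what I mean by moving a fixed calculation out of a loop (the names here are made up for illustration, not from anyone's real code):

Code: [Select]
// (offset + gain) never changes inside the loop, so compute it once.
void scaleSamples(int *samples, int count, int offset, int gain) {
  int scale = offset + gain;           // hoisted out of the loop
  for (int i = 0; i < count; i++) {
    samples[i] = samples[i] * scale;   // instead of * (offset + gain) every pass
  }
}

A decent compiler will often hoist this for you, but it can't when it isn't sure the values stay constant, and writing it this way keeps the intent obvious anyway.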

--------
Enough philosophy.  On to the meat.

7) Use data types that are the minimum possible size.  The AVR is an 8-bit CPU, while C's "int" is 16 bits (for avr-gcc) and a long is 32 bits.  You will get lots of extra code if you use longs or ints where you don't need them.

8) C has a rule that arithmetic is performed on operands promoted to at least int size (or to the size of the largest operand, if that's bigger than an int).  This can have unexpected results that you should watch for (see below).

9) Re-arrange math so that it avoids the need for floating point or large variables.  For instance, deal with baudrate/10 instead of baudrate.
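
As a rough sketch of what I mean by re-arranging the math (the scale factor below is an invented example value, just for illustration): instead of converting an ADC reading with a float multiply, scale by 10 and stay in integers:

Code: [Select]
// Returns temperature in tenths of a degree, avoiding floating point.
// (the 4883/1000 factor is a made-up example value)
int readTempTenths(int adcValue) {
  // float version would be: adcValue * 0.4883
  long tenths = (long)adcValue * 4883L / 1000L;
  return (int)tenths;
}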

Quote
Does the compiler automatically convert multiplication and division of powers of two to bit shifts?
Ah!  A particularly interesting question.   Let's look!
Code: [Select]
int x,y;
byte a,z;
void loop (void)
{
 x = digitalRead(1);
 z = digitalRead(3);
 y = x * 8;
 Serial.print(y);
 a = (byte)((byte) z * (byte) 8);
 Serial.print((int)a);
 y = x << 3;
 a = z << 3;
}

Code: [Select]
 y = x * 8;
126:   60 91 06 01     lds     r22, 0x0106
12a:   70 91 07 01     lds     r23, 0x0107
12e:   43 e0           ldi     r20, 0x03       ; 3
130:   66 0f           add     r22, r22
132:   77 1f           adc     r23, r23
134:   4a 95           dec     r20
136:   e1 f7           brne    .-8             ; 0x130 <loop+0x26>
138:   70 93 09 01     sts     0x0109, r23
13c:   60 93 08 01     sts     0x0108, r22
For int types, we get a loop of adds.  This makes sense; an add is just as fast and small as a shift anyway, and the AVR doesn't have single-instruction multibit shifts.

Code: [Select]

  a = (byte)((byte) z * (byte) 8);
14a:   60 91 0b 01     lds     r22, 0x010B
14e:   70 e0           ldi     r23, 0x00       ; 0
150:   33 e0           ldi     r19, 0x03       ; 3
152:   66 0f           add     r22, r22
154:   77 1f           adc     r23, r23
156:   3a 95           dec     r19
158:   e1 f7           brne    .-8             ; 0x152 <loop+0x48>
15a:   60 93 0a 01     sts     0x010A, r22
The byte variable gets the same treatment (including 16-bit adds) in spite of my efforts to tell the compiler it should only deal with 8 bits.

Code: [Select]

 y = x << 3;
166:   80 91 06 01     lds     r24, 0x0106
16a:   90 91 07 01     lds     r25, 0x0107
16e:   23 e0           ldi     r18, 0x03       ; 3
170:   88 0f           add     r24, r24
172:   99 1f           adc     r25, r25
174:   2a 95           dec     r18
176:   e1 f7           brne    .-8             ; 0x170 <loop+0x66>
178:   90 93 09 01     sts     0x0109, r25
17c:   80 93 08 01     sts     0x0108, r24
The same code happens with ints when using shift explicitly.

Code: [Select]

 a = z << 3;
180:   80 91 0b 01     lds     r24, 0x010B
184:   88 0f           add     r24, r24
186:   88 0f           add     r24, r24
188:   88 0f           add     r24, r24
18a:   80 93 0a 01     sts     0x010A, r24
But shift is smarter with byte variables!

Code: [Select]
 a = z << 4;
180:   80 91 0b 01     lds     r24, 0x010B
184:   82 95           swap    r24
186:   80 7f           andi    r24, 0xF0       ; 240
188:   80 93 0a 01     sts     0x010A, r24
Interestingly, the compiler is smart enough to optimize a 4-bit shift of the byte to a Nybble swap and mask...

halley

westfw, lovely examination of the avr-gcc optimizations, and I'm in complete agreement with your general advice.

Your #1 is very important, and in some circles they call the failure to remember #1 the folly of "premature optimization."  Optimizing before you know what needs optimizing leads to wasted time and increased risk of introducing bugs.

Your #2 is also known as "don't be clever."  Some folks think they know how to influence the compiler into producing exactly the right assembler output, or how to cram more functionality into each line of their code.  By being clever, you're actually less likely to get an even better optimization (like that nybble swap), and you again risk introducing a bug in the overly clever code.

koyaanisqatsi

LOL! I had posted to this old thread a while back:
http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1227053665

But I think the answer here should pretty much satisfy the above one too.  After getting trapped in the idea that I needed to write the tightest, fastest possible code on every line, I started to feel like it wasn't worth the effort.  I DO use byte instead of int most of the time now, though.   ;)

Reminds me of when I was complaining to my technology VP that our new app was "so heavy and slow" that it was swamping the server.  His response: We can spend $50K and 9 months on trying to improve the code or $30K and two weeks on a more powerful server to run it.  We bought the server.

JustinHoMi

#5
Mar 03, 2009, 04:09 am Last Edit: Mar 03, 2009, 04:11 am by JustinHoMi Reason: 1
Quote
After getting trapped in the idea that I needed to write the tightest, fastest possible code on every line, I started to feel like it wasn't worth the effort.  


Well, when you've maxed out the flash or RAM on the controller, then you don't have any other option. I maxed out both on my current project, which led me to this post.

And yes... a 328 is on its way. :)

Thanks for the tips above. I didn't realize that the disassembler would show the C source alongside the assembly! That's great!

mem

Justin, the assembler output posted above did not come from a disassembler; it's produced by a utility called avr-objdump.exe that uses an intermediate compiler output file to show C source mixed with assembler. http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1233497593

JustinHoMi

Gotcha. I mistakenly called it a disassembler. Makes sense.

retrolefty

My definition of optimizing code =

Any sketch I write that doesn't cause red stuff to print in the IDE output window and also actually performs the function I'm trying to accomplish. Anything better than that I'll leave to you software geeks.  ;)

Lefty

westfw

Quote
the assembler output posted above did not come from a disassembler; it's produced by a utility called avr-objdump.exe

It depends on how you define things.  I believe that objdump DOES do "true disassembly" of the object, to assembler code.  But the intertwining of C code is done using the original source and line-number information saved in the object files for the debugger; you don't get a utility that outputs C from object code, you get one that outputs assembler and matches that up with the C code you already have.

mekon83

#10
Mar 08, 2009, 11:58 pm Last Edit: Mar 09, 2009, 12:16 am by mekon83 Reason: 1
The bottom line is that AVR-GCC is buggy. If you need performance, you gotta write some assembly. I'm doing some signal processing with a 20 kHz sample rate and those multiplications are costly. The compiler is doing a lousy job. For instance, I posted this on the AVR Freaks forum, http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=75909 :
Multiplication by a constant
Code: [Select]
int c;
int b = -234; // something, doesn't matter

int main(void) {
  c = 63 * b;
  return 0;
}

compiles into a loop with shifts that takes 36 cycles, while direct multiplication using the mul instruction takes 8. That's 4.5x faster. It's a well-known AVR-GCC bug: the compiler always compiles multiplication by powers of 2 as shifts, which is slow on the AVR platform because it only has 1-bit shift instructions.

So, I like the philosophy of writing readable code, but with a working compiler. Another thing: I'm doing it for fun, so there's no cost in spending some more time ...

But I'd recommend writing y *= 8 instead of y <<= 3.
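
If you want to see what a change costs on your own board without digging through the assembly, a rough timing sketch along these lines works (this is just an illustration, not a proper benchmark):

Code: [Select]
// volatile keeps the compiler from optimising the work away.
volatile int input = -234;
volatile int result;

void setup() {
  Serial.begin(9600);
  unsigned long start = micros();
  for (unsigned int i = 0; i < 10000; i++) {
    result = 63 * input;
  }
  unsigned long elapsed = micros() - start;
  Serial.print("10000 multiplies took (us): ");
  Serial.println(elapsed);
}

void loop() { }

The loop overhead is included in the number, so it's only good for comparing one version against another, not for exact cycle counts.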

TB205gti

Sorry to resurrect this old thread, but I need some input on this :)

Has anybody investigated whether there is a speed/space improvement from using pointers as arguments to functions and using them to return values?

eg..

Code: [Select]

void hubba(unsigned char* chr){
  // ...blabla...
  *chr = something;   // write the result through the pointer
}


versus

Code: [Select]

unsigned char hubba(){
  unsigned char ret;
  // ...blabla...
  return ret;
}


IMHO the SRAM usage would be less when using pointers, but are there any speed improvements?

pluggy

Quote
Any sketch I write that doesn't cause red stuff to print in the IDE output window and also actually performs the function I'm trying to accomplish.


Works for me.......  :)
http://pluggy.is-a-geek.com/index.html

Korman

#13
Oct 08, 2010, 11:12 pm Last Edit: Oct 09, 2010, 09:36 am by Korman Reason: 1
Quote
IMHO the SRAM usage would be less when using pointers, but are there any speed improvements?


Instinctively I would have said the contrary, that the pointer version is worse in all aspects. But instincts don't really count for anything in optimisation, so I spent a little time on the problem.

My first conclusion was that it's not easy to write a little test program that the compiler won't optimise away. In the end I used this code:

Code: [Select]
void hubba(unsigned char* chr){
   delay (10);
   *chr = laps & 0xf3;
}

unsigned char bubba(){
   delay (14);
   return laps & 0xf5;
}

...

unsigned char c;
hubba (&c);
Serial.println (c);
c = bubba ();
Serial.println (c);


The delay and the Serial.println are used to stop the compiler from getting too smart.

The functions themselves now look like this in assembler:
Code: [Select]
unsigned char bubba(){
   delay (14);
118:   6e e0           ldi r22, 0x0E   ; 14
11a:   70 e0           ldi r23, 0x00   ; 0
11c:   80 e0           ldi r24, 0x00   ; 0
11e:   90 e0           ldi r25, 0x00   ; 0
120:   0e 94 b0 01     call    0x360   ; 0x360 <delay>
124:   80 91 36 01     lds r24, 0x0136
   return laps & 0xf5;
}
128:   85 7f           andi    r24, 0xF5   ; 245
12a:   08 95           ret


The return value is returned in r24 and not via RAM, which makes the code quite small.

Code: [Select]
void hubba(unsigned char* chr){
12c:   0f 93           push    r16
12e:   1f 93           push    r17
130:   8c 01           movw    r16, r24
   delay (10);
132:   6a e0           ldi r22, 0x0A   ; 10
134:   70 e0           ldi r23, 0x00   ; 0
136:   80 e0           ldi r24, 0x00   ; 0
138:   90 e0           ldi r25, 0x00   ; 0
13a:   0e 94 b0 01     call    0x360   ; 0x360 <delay>
   *chr = laps & 0xf3;
13e:   80 91 36 01     lds r24, 0x0136
142:   83 7f           andi    r24, 0xF3   ; 243
144:   f8 01           movw    r30, r16
146:   80 83           st  Z, r24
}
148:   1f 91           pop r17
14a:   0f 91           pop r16
14c:   08 95           ret


The code is longer and takes more cycles, in large part, I guess, because of moving stuff onto the stack and back. If the functions were more complicated, the first version would probably also need that, so they'd end up pretty even.

My conclusion is that return by value has a lot of potential benefits, like passing the return value in registers, inlining, keeping local variables in registers and easier readability, but I wasn't really able to point out any downsides. At worst, it will be as efficient as the version with pointers.

The other question to ask is what the point of optimising parameter passing is at all. If the call overhead is relevant and worth optimising, the handling of the stack frames and the function call itself will use up a lot more than the return value handling; you would probably need to get rid of the function or inline it. And if the call overhead isn't a problem, optimising the return value passing is a waste of time.

So when is passing by reference preferable? Here we come back to the problem of creating a meaningful test case. For simple variable types the pointer overhead only makes things worse: the compiler will handle them in registers and save the processor all the loading and storing. But as soon as you move to bigger return values that need to be passed back on the stack and copied around, for example arrays or classes, the pointer might become more efficient and use less RAM. While I doubt the efficiency will matter much, the benefit of using less RAM can become critical.
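
For instance, something like this (a made-up struct, just to illustrate the shape of the two options):

Code: [Select]
struct Reading {
  unsigned int samples[16];   // 32 bytes - big enough that copies matter
};

// Return by value: the whole struct may get copied back to the caller.
struct Reading readByValue() {
  struct Reading r;
  for (byte i = 0; i < 16; i++) r.samples[i] = analogRead(0);
  return r;
}

// Fill through a pointer: the caller's struct is written in place, no copy.
void readInPlace(struct Reading *r) {
  for (byte i = 0; i < 16; i++) r->samples[i] = analogRead(0);
}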

My final answer to the question is thus: it depends, and don't make your program more complicated for a performance gain you can't measure.

Korman

Senso

But the 328p has a hardware multiplier; how can it never get used by all those multiplications???
