optimizing code?

Any tips on optimizing code for the Arduino? That is, for execution speed as well as size. Of course, general C optimization tips are welcome too! I've found a few general websites on optimizing C, but nothing really for embedded development.

I figure I might as well start developing good habits now!

I guess I'll start with a quick question. Does the compiler automatically convert multiplication and division of powers of two to bit shifts?

For example... would this...

x = y * 8;

be the same as this, in the generated assembly language?

x = y << 3;

Hmm. Start with good habits, eh?

  1. You should hardly ever make a lot of effort to "optimize" code that doesn't actually need to be optimized.

  2. You should think REALLY, REALLY hard about sacrificing code clarity for the sake of speed or space. Think of the cost of a bigger or faster CPU vs $100/hour for programmer time...

  3. You get the most improvement from optimization techniques that are independent of language - moving fixed calculations outside of loops, for instance. You're also likely to lose more performance in functions that are beyond your ability to modify than in slightly sub-optimal statement syntax.

  4. You should learn to look at (and understand) the produced assembly language code using avr-objdump, at least to the point of finding and counting the number of instructions executed for a particular statement.

  5. The compiler is supposed to do a "pretty good" job of producing code, within the limits of the language. If it does something really awful, you should complain to the compiler developers.

  6. For extreme optimization, assembly language may be preferable to particularly obscure C syntax. External ASM modules written in standard assembler syntax may be preferable to inline assembler in the weird GNU asm syntax.
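A quick sketch of point 3, hoisting a fixed calculation out of a loop. This is plain C rather than an Arduino sketch, and the function names and parameters are made up for illustration:

```c
#include <stdint.h>

/* Before: scale * offset is recomputed on every iteration. */
uint16_t sum_slow(const uint8_t *buf, uint8_t len, uint8_t scale, uint8_t offset) {
    uint16_t total = 0;
    for (uint8_t i = 0; i < len; i++)
        total += buf[i] + (uint16_t)(scale * offset);  /* loop-invariant work */
    return total;
}

/* After: compute the invariant once, outside the loop. */
uint16_t sum_fast(const uint8_t *buf, uint8_t len, uint8_t scale, uint8_t offset) {
    uint16_t bias = (uint16_t)(scale * offset);  /* hoisted out of the loop */
    uint16_t total = 0;
    for (uint8_t i = 0; i < len; i++)
        total += buf[i] + bias;
    return total;
}
```

Both versions return the same result; the second just does the multiply once instead of once per iteration. A good optimizer will often hoist this itself, but it can't when the invariant expression involves volatile data or a function call it can't see through.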


Enough philosophy. On to the meat.

  1. Use data types of the minimum possible size. The AVR is an 8-bit CPU, while C's "int" is 16 bits (for avr-gcc) and a long is 32 bits. You will get lots of extra code if you use longs or ints where you don't need them.

  2. C has a rule that arithmetic is performed with data types that are at least int in size (or the size of the largest operand, if that's bigger than an int). This "integer promotion" can have unexpected results that you should look for (see below.)

  3. Rearrange math to avoid the need for floating point or large variables. For instance, deal with baudrate/10 instead of baudrate.
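The promotion rule in point 2 can be illustrated in plain C (uint8_t standing in for Arduino's byte):

```c
#include <stdint.h>

/* Both operands are 8-bit, but C promotes them to int before multiplying,
   so the full product survives... */
uint16_t mul_promoted(uint8_t a, uint8_t b) {
    return (uint16_t)(a * b);   /* 200 * 2 = 400, no overflow here */
}

/* ...unless the result is truncated back to 8 bits. */
uint8_t mul_truncated(uint8_t a, uint8_t b) {
    return (uint8_t)(a * b);    /* 400 wraps to 144 */
}
```

The flip side, as the disassembly below shows, is that the promotion also means the compiler may generate 16-bit arithmetic even when you only wanted 8 bits.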

Does the compiler automatically convert multiplication and division of powers of two to bit shifts?

Ah! A particularly interesting question. Let's look!

int x,y;
byte a,z;
void loop (void)
{
  x = digitalRead(1);
  z = digitalRead(3);
  y = x * 8;
  Serial.print(y);
  a = (byte)((byte) z * (byte) 8);
  Serial.print((int)a);
  y = x << 3;
  a = z << 3;
}
  y = x * 8;
 126:   60 91 06 01     lds     r22, 0x0106
 12a:   70 91 07 01     lds     r23, 0x0107
 12e:   43 e0           ldi     r20, 0x03       ; 3
 130:   66 0f           add     r22, r22
 132:   77 1f           adc     r23, r23
 134:   4a 95           dec     r20
 136:   e1 f7           brne    .-8             ; 0x130 <loop+0x26>
 138:   70 93 09 01     sts     0x0109, r23
 13c:   60 93 08 01     sts     0x0108, r22

For int types, we get a loop of adds. This makes sense; an add is just as fast and small as a shift, and the AVR doesn't have single-instruction multi-bit shifts.

   a = (byte)((byte) z * (byte) 8);
 14a:   60 91 0b 01     lds     r22, 0x010B
 14e:   70 e0           ldi     r23, 0x00       ; 0
 150:   33 e0           ldi     r19, 0x03       ; 3
 152:   66 0f           add     r22, r22
 154:   77 1f           adc     r23, r23
 156:   3a 95           dec     r19
 158:   e1 f7           brne    .-8             ; 0x152 <loop+0x48>
 15a:   60 93 0a 01     sts     0x010A, r22

The byte variable gets the same treatment (including 16-bit adds) in spite of my efforts to tell the compiler it should only deal with 8 bits.

  y = x << 3;
 166:   80 91 06 01     lds     r24, 0x0106
 16a:   90 91 07 01     lds     r25, 0x0107
 16e:   23 e0           ldi     r18, 0x03       ; 3
 170:   88 0f           add     r24, r24
 172:   99 1f           adc     r25, r25
 174:   2a 95           dec     r18
 176:   e1 f7           brne    .-8             ; 0x170 <loop+0x66>
 178:   90 93 09 01     sts     0x0109, r25
 17c:   80 93 08 01     sts     0x0108, r24

The same code happens with ints when using shift explicitly.

  a = z << 3;
 180:   80 91 0b 01     lds     r24, 0x010B
 184:   88 0f           add     r24, r24
 186:   88 0f           add     r24, r24
 188:   88 0f           add     r24, r24
 18a:   80 93 0a 01     sts     0x010A, r24

But shift is smarter with byte variables!

  a = z << 4;
 180:   80 91 0b 01     lds     r24, 0x010B
 184:   82 95           swap    r24
 186:   80 7f           andi    r24, 0xF0       ; 240
 188:   80 93 0a 01     sts     0x010A, r24

Interestingly, the compiler is smart enough to optimize a 4-bit shift of the byte into a nybble swap and mask...
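That swap-and-mask trick can be checked in plain C: swapping the nybbles and masking off the low ones gives exactly `z << 4` truncated to a byte.

```c
#include <stdint.h>

/* What the AVR 'swap' instruction does: exchange high and low nybbles. */
uint8_t nybble_swap(uint8_t z) {
    return (uint8_t)((z << 4) | (z >> 4));
}

/* swap + andi 0xF0 is equivalent to an 8-bit left shift by 4:
   the low nybble of (z << 4) is always zero, so masking it away loses nothing. */
uint8_t shift4_via_swap(uint8_t z) {
    return nybble_swap(z) & 0xF0;
}
```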

westfw, lovely examination of the avr-gcc optimizations, and spot-on agreement with your general advice.

Your #1 is very important, and in some circles they call the failure to remember #1 the folly of "premature optimization." Optimizing before you know what needs optimizing leads to wasted time and increased risk of introducing bugs.

Your #2 is also known as "don't be clever." Some folks think they know how to coax the compiler into producing exactly the right assembler output, or how to cram more functionality into each line of their code. By being clever, you may actually prevent the compiler from finding an even better optimization (like that nybble swap), or again risk a bug in the overly clever code.

LOL! I had posted to this old thread a while back:
http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1227053665

But I think the answer here should pretty much satisfy the above one too. After getting trapped in the idea that I needed to write the tightest, fastest possible code on every line, I started to feel like it wasn't worth the effort. I DO use byte instead of int most of the time now, though. :wink:

Reminds me of when I was complaining to my technology VP that our new app was "so heavy and slow" that it was swamping the server. His response: We can spend $50K and 9 months on trying to improve the code or $30K and two weeks on a more powerful server to run it. We bought the server.

After getting trapped in the idea that I needed to write the tightest, fastest possible code on every line, I started to feel like it wasn't worth the effort.

Well, when you've maxed out the flash or RAM on the controller, you don't have any other option. I maxed out both on my current project, which led me to this post.

And yes... a 328 is on its way. :slight_smile:

Thanks for the tips above. I didn't realize that the disassembler would show the c source alongside the assembly! That's great!

Justin, the assembler output posted above did not come from a disassembler; it's produced by a utility called avr-objdump.exe that uses an intermediate compiler output file to show C source mixed with assembler. http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1233497593

Gotcha. I mistakenly called it a disassembler. Makes sense.

My definition of optimizing code =

Any sketch I write that doesn't cause red stuff to print in the IDE output window and also actually performs the function I'm trying to accomplish. Anything better than that I'll leave to you software geeks. :wink:

Lefty

the assembler output posted above did not come from a disassembler, its produced by a utility called avr-objdump.exe

It depends on how you define things. I believe that objdump DOES do a "true disassembly" of the object file into assembler code. But the interleaving of C code is done using the original source and the line-number information saved in the object files for the debugger; you don't get a utility that outputs C from object code, you get one that outputs assembler and matches it up with C code you already have.

The bottom line is that avr-gcc is buggy. If you need performance, you have to write some assembly. I'm doing signal processing at a 20 kHz sample rate, and those multiplications are costly. The compiler is doing a lousy job. For instance, I posted this on the AVR Freaks forum, http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=75909 :
Multiplication by a constant

int c;
int b = -234; // something, doesn't matter

int main(void) {
  c = 63 * b;
  return 0;
}

compiles into a loop with shifts that takes 36 cycles, while direct multiplication using the mul instruction takes 8. That's 4.5x faster. It's a well-known avr-gcc bug: the compiler always compiles multiplication by powers of 2 as shifts, which is slow on the AVR platform since there are only 1-bit shift instructions.
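One hand workaround for this particular constant, if the compiler insists on the shift loop, is to rewrite the multiply as a shift-and-subtract, since 63 = 64 - 1. This is just a sketch (shown unsigned, because left-shifting negative values is technically undefined in C; whether it actually beats mul depends on the compiler and types):

```c
#include <stdint.h>

/* 63 * b rewritten as (b << 6) - b: one shift-by-6 and one subtract,
   instead of a generic multiply or a shift-and-add loop. */
uint16_t mul63(uint16_t b) {
    return (uint16_t)((b << 6) - b);
}
```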

So, I like the philosophy of writing readable code, but with a working compiler. Another thing: I'm doing this for fun, so there's no cost in spending some more time ...

But I'd recommend writing y *= 8 instead of y <<= 3.

Sorry to resurrect this old thread, but I need to have some input on this :slight_smile:

Has anybody investigated if there is a speed/space improvement by using pointers as arguments to functions - and use them as return values..

eg..

void hubba(unsigned char* chr) {
  // ... blabla ...
  *chr = /* something */;
}

versus

unsigned char hubba() {
  unsigned char ret;
  // ... blabla ...
  return ret;
}

IMHO the SRAM usage would be less when using pointers, but are there any speed improvements?

Any sketch I write that doesn't cause red stuff to print in the IDE output window and also actually performs the function I'm trying accomplish.

Works for me....... :slight_smile:

IMHO the SRAM usage would be less when using pointers, but are there any speed improvements?

Instinctively I would have said the contrary: that the pointer version is worse in all respects. But instinct doesn't count for much in optimisation, so I spent a little time on the problem.

The first conclusion I reached is that it's not obvious how to write a little test program that the compiler won't optimise away. In the end I used this code:

void hubba(unsigned char* chr){
    delay (10);
    *chr = laps & 0xf3;
}

unsigned char bubba(){
    delay (14);
    return laps & 0xf5;
}

...

unsigned char c;
hubba (&c);
Serial.println (c);
c = bubba ();
Serial.println (c);

The delay and the Serial.println are used to stop the compiler from getting too smart.

The functions themselves now look like this in assembler:

unsigned char bubba(){
    delay (14);
 118:   6e e0           ldi r22, 0x0E   ; 14
 11a:   70 e0           ldi r23, 0x00   ; 0
 11c:   80 e0           ldi r24, 0x00   ; 0
 11e:   90 e0           ldi r25, 0x00   ; 0
 120:   0e 94 b0 01     call    0x360   ; 0x360 <delay>
 124:   80 91 36 01     lds r24, 0x0136
    return laps & 0xf5;
}
 128:   85 7f           andi    r24, 0xF5   ; 245
 12a:   08 95           ret

The return value is passed back in r24 rather than via RAM, which keeps the code quite small.

void hubba(unsigned char* chr){
 12c:   0f 93           push    r16
 12e:   1f 93           push    r17
 130:   8c 01           movw    r16, r24
    delay (10);
 132:   6a e0           ldi r22, 0x0A   ; 10
 134:   70 e0           ldi r23, 0x00   ; 0
 136:   80 e0           ldi r24, 0x00   ; 0
 138:   90 e0           ldi r25, 0x00   ; 0
 13a:   0e 94 b0 01     call    0x360   ; 0x360 <delay>
    *chr = laps & 0xf3;
 13e:   80 91 36 01     lds r24, 0x0136
 142:   83 7f           andi    r24, 0xF3   ; 243
 144:   f8 01           movw    r30, r16
 146:   80 83           st  Z, r24
}
 148:   1f 91           pop r17
 14a:   0f 91           pop r16
 14c:   08 95           ret

The code is longer and takes more cycles, in large part, I guess, because of moving stuff onto the stack and back. If the functions were more complicated, the first version would probably need that too, so they'd end up pretty even.

My conclusion is that return by value has a lot of potential benefits, like returning values in registers, inlining, storing local variables in registers, and easier readability, but I wasn't really able to point out any downsides. At worst, it will be as efficient as the version with pointers.

The other question to ask is: what's the point of fiddling with parameter passing at all? If the call overhead is relevant and worth optimising, the handling of the stack frames and the function call itself will cost a lot more than the return-value handling; you'd probably need to get rid of the function or inline it. And if the call overhead isn't a problem, optimising the return-value passing is a waste of time.

So when is passing by reference preferable? Here we come back to the problem of constructing a meaningful test case. For simple variable types, the pointer overhead only makes things worse: the compiler will handle them in registers and save the processor all the loading and storing. But as soon as you move to bigger return values that need to be passed back on the stack and copied around, for example arrays or classes, the pointer can become more efficient and use less RAM. While I doubt the efficiency will matter much, using less RAM can become critical.
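That last point, sketched in plain C (the struct and function names are made up for illustration):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t samples[32]; } Block;

/* Return by value: the whole 32-byte struct may get copied back to the caller. */
Block make_block_byvalue(uint8_t fill) {
    Block b;
    memset(b.samples, fill, sizeof b.samples);
    return b;
}

/* Out-pointer: write straight into storage the caller already owns. */
void make_block_byptr(Block *out, uint8_t fill) {
    memset(out->samples, fill, sizeof out->samples);
}
```

Both fill the same bytes; the difference is only where the data lives and how many times it's copied. Note that modern compilers often elide the copy in the by-value version anyway (named-return-value optimisation), so measure before choosing.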

My final answer to the question is thus: it depends, and don't make your program more complicated for a performance gain you can't measure.

Korman

But the 328p has a hardware multiplier; how can it never be used for all those multiplications???

But the 328p has a hardware multiplier; how can it never be used for all those multiplications???

You're referring to the older messages with "multiply by constant" examples?

The first possibility is that the AVR multiply instruction doesn't work the way the compiler wants multiply to work (it's an 8-bit multiplier, IIRC, and I mentioned that C has this thing about wanting to do math with 16-bit quantities, for example.)

But I think there is a long-standing "bug" in avr-gcc where it just doesn't use the HW multiplier, even when that would result in shorter/faster code. Perhaps due to the first issue.