
### Topic: #define versus const and other software performance issues

#### econjack

##### Oct 13, 2011, 04:15 pm
I've just added Maik Schmidt's Arduino: A Quick Start Guide to my expanding library. Overall, I like the book and think it's well-written. (I've written 15 programming books and taught in the computer science department at a Big Ten university so I'm comfortable commenting on writing style.) However, there are a few places where he is misleading and perhaps even wrong.

On page 94, he makes the statement:

So, whenever you define a constant value in your program, declare it as such (using const, not using #define).

Blanket statements like this are always dangerous, and #define has a place in C. First, a #define is a textual replacement...that's all. The result is that there is no entry in the symbol table...there is no rvalue or lvalue. So what, you might ask? Well, it has two impacts on your Arduino code. First, #define's are resolved at compile time, not run time. Second, coupled with the first fact, this suggests that the code should run slightly faster for #define's since there is no table lookup. I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define. On the downside, because #define's are textual replacements and are not in the symbol table, they cannot be examined by a debugger...they're gone after preprocessing. However, since they are constants, this is no big loss.

Another thing to keep in mind is that one of the slowest basic math operations you can do is division. At one place in the book, he shows the code:

return (voltage * 1000 - 500) / 10;

This can be rewritten as:

return (voltage * 1000 - 500) * .10;     // Original was divided by 10

In a tight loop, this change saved about 3% execution time. If you're working with integers, you can also do bit shifting to divide by 10. If you do something weird like this, make sure you comment what you did. (I didn't time bit shifting, but my guess is that it would be even faster.)

For what it's worth, I did the multiply-versus-divide test in Visual Studio using C# and found almost a 66% speed improvement using the multiply on floating-point numbers. If you work in VS, forget the float data type. Because of the built-in numeric coprocessor (descended from the old 8087 chip) in Pentium-class chips, all floats get promoted to double, the operation is calculated, and then the result must be "demoted" back to a float for the answer. All this promotion and demotion takes time. The #define-versus-const tradeoff in VS is virtually the same as the Arduino results.

Finally, when I started the test in the Arduino IDE, I was stunned at the speed of the loop. It was so fast that I tripled the number of loop passes and ran the test again. No change in the time. Then I had a flat-forehead epiphany (you know, where you ram the heel of your hand into your forehead and wonder how you could be so dumb). Because I did nothing with the result of the single math statement in the loop, the compiler simply optimized the loop away! This suggested a pretty good optimizer, so I checked to see where the Arduino compiler came from. I then learned it's based on the GNU (open source) compiler, which is a pretty darn good compiler.

The lessons to be learned: 1) #define's are not bad things and can actually give a slight performance boost, 2) when working with floating-point numbers, use multiplication by a constant instead of division by a constant whenever you can, and 3) resolve any constant expressions yourself rather than computing them in code. For example, a temperature conversion:

F = C x 9/5 + 32

is faster if rewritten:

F = C x 1.8 + 32      // 1.8 = 9/5

While this change won't make a hill of beans difference alone, it could if it's in a tight loop.

I hope this helps...

#### ckiick

#1
##### Oct 13, 2011, 05:20 pm
Nice analysis. But I have to take exception to point #3.

Quote
F = C x 9/5 + 32

is faster if rewritten:

F = C x 1.8 + 32      // 1.8 = 9/5

That is only correct if F and C are floating-point numbers. If they are integers, the first expression will be much faster. The Arduino chip has no built-in floating-point support; it's all done by emulation in the libraries. So the second expression involves converting an int to a float, doing a floating-point multiplication, and converting back to an int, whereas the 8-bit multiply instruction takes 2 cycles. Integer division takes longer, but it is surely less than a floating-point multiply.

$0.02

#### econjack

#2
##### Oct 13, 2011, 07:01 pm
Yep, speed issues are only for floating point numbers.

I was not aware that the Arduino compiler was integer only and there was no native floating point support. I'm not sure what the impact is in this case. Do you know if there is any thought about native fp for the compiler? I know the gnu compiler supports both floating point types. (I think it meets the old X3J11 C language standards.)

#### ckiick

#3
##### Oct 13, 2011, 08:50 pm
The compiler does floating point: it's part of the C language (it's gcc-avr). The hardware has no floating point support, so all the floating point math is done by emulation.  The compiler inserts the code automatically to emulate floating point operations.  It's just very expensive in terms of cycles used and in code space as well, compared to integer math.

Most of the time, I try really hard to not do floating point operations on an arduino, even if it makes the code a little messier.  If performance is an issue, you're better off doing integer math.

#### CapnBry

#4
##### Oct 13, 2011, 08:55 pm (Last Edit: Oct 13, 2011, 08:57 pm by CapnBry)
It isn't that the compiler (gcc-avr) doesn't support floats; it is the target hardware that has no FPU. Therefore all float math has to be emulated.

I also disagree that, in the float case, "F = C x 9/5 + 32" is slower than the rewritten "F = C x 1.8 + 32", because the compiler will fold 9/5 into 1.8 at compile time. You also, of course, have to promote the operation to floating point to get the desired result: "F = C x (9.0f/5.0f) + 32.0f".

Finally, where consts are used, the compiler replaces them with the const values, so they end up being equivalent to a #define. The difference in speed you're seeing is most likely because consts are typed and #defines are not. If you say:
const float X = 7
vs
#define X 7
and time Y = X + 1, the #define will be faster because the const float promoted the expression to float. The same is true when consting an int, long, or char: the compiler will make a better decision than the one you forced on it. Remember that the AVR is an 8-bit CPU, so using smaller types results in smaller and faster code.

#### Coding Badly

#5
##### Oct 13, 2011, 10:39 pm
I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define.

There is a flaw in your test or in your measurements.

#### Coding Badly

#6
##### Oct 13, 2011, 10:43 pm
Quote
return (voltage * 1000 - 500) / 10;
This can be rewritten as:
return (voltage * 1000 - 500) * .10;     // Original was divided by 10
In a tight loop, this change saved about 3% execution time.

Not when voltage is an integer datatype.

#### Coding Badly

#7
##### Oct 13, 2011, 10:47 pm (Last Edit: Oct 13, 2011, 10:50 pm by Coding Badly)
Quote
(I didn't time bit shifting, but my guess is that it would be even faster.)

The GCC AVR compiler does a very good job optimizing.  When bit-shifting is a better choice, the compiler will use it.

In other words, first write code that is correct, can be debugged, and can be maintained.  Then, and only then, worry about hand optimizations.

#### nickgammon

#8
##### Oct 14, 2011, 08:17 am

I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define.

Using #define:

```cpp
#define COUNT 20

void setup () {}

void loop ()
{
  volatile byte x = 0;

  for (byte i = 0; i < COUNT; i++)
    x++;
}
```

Generates:

```
void loop ()
{
  volatile byte x = 0;

  for (byte i = 0; i < COUNT; i++)
  bc:  9f 5f     subi r25, 0xFF  ; 255
  be:  94 31     cpi  r25, 0x14  ; 20
  c0:  d1 f7     brne .-12       ; 0xb6 <loop+0xe>
    x++;
}
  c2:  0f 90     pop r0
  c4:  cf 91     pop r28
  c6:  df 91     pop r29
  c8:  08 95     ret
```

Using const int:

```cpp
const int COUNT = 20;

void setup () {}

void loop ()
{
  volatile byte x = 0;

  for (byte i = 0; i < COUNT; i++)
    x++;
}
```

Generates:

```
void loop ()
{
  volatile byte x = 0;

  for (byte i = 0; i < COUNT; i++)
  bc:  9f 5f     subi r25, 0xFF  ; 255
  be:  94 31     cpi  r25, 0x14  ; 20
  c0:  d1 f7     brne .-12       ; 0xb6 <loop+0xe>
    x++;
}
  c2:  0f 90     pop r0
  c4:  cf 91     pop r28
  c6:  df 91     pop r29
  c8:  08 95     ret
```

It generates the same code! There are no percentage differences there.

In other words, first write code that is correct, can be debugged, and can be maintained.  Then, and only then, worry about hand optimizations.

I agree with this. Don't try to outsmart the compiler.

However you can help it by thinking about your data types. For example:

```cpp
long foo [100];
```

Compared to:

```cpp
byte foo [100];
```

Now, if "foo" can only ever hold values from 0 to 100, then the "byte" version is going to help the compiler generate code that is smaller and faster. It can't guess what sort of data you are expecting to store.
Please post technical questions on the forum, not by personal message. Thanks!

More info: http://www.gammon.com.au/electronics

#### CapnBry

#9
##### Oct 14, 2011, 03:36 pm

The GCC AVR compiler does a very good job optimizing.  When bit-shifting is a better choice, the compiler will use it.

While we're on the subject of bit twiddling, I've seen the compiler turn this:
```cpp
X = (A << 4) | B
```
into
```
swap A
X = A & 0xf0
X = A | B
```
The AVR documentation says that logical left shift takes one clock cycle so it would make more sense to me to have it LSL A, 4 rather than swap the nibbles and mask out the bottom nibble. Any idea why the compiler did it that way? Does it take one cycle *per shift*? The only other reason I could think of was to preserve the flags.

#### Coding Badly

#10
##### Oct 14, 2011, 07:44 pm

There is no "LSL A, 4" instruction.  The instruction is "LSL A".  One bit shifted per machine instruction.

The swap-plus-bitwise-and is two machine instructions.  Bit shifting left four bits is four machine instructions.

#### robtillaart

#11
##### Oct 14, 2011, 08:04 pm (Last Edit: Oct 14, 2011, 08:17 pm by robtillaart)
Quote
return (voltage * 1000 - 500) / 10;

This can be rewritten as:

return (voltage * 1000 - 500) * .10;     // Original was divided by 10

it can even be rewritten as

return voltage * 100 - 50;

and I recently rewrote F = C x 9/5 + 32 with all integers as F = (C x 18 + 322) / 10: the same amount of math, but "much" higher precision.

Rethinking the math before coding is often the best way to optimize, imho...

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

#### CapnBry

#12
##### Oct 14, 2011, 09:07 pm

There is no "LSL A, 4" instruction.  The instruction is "LSL A".

Ah, that's what I expected, but apparently I didn't read the instruction set reference completely. My mind so often jumps straight to the Intel analogue, which does support multiple shifts per instruction (although it used to take multiple clocks even to shift one bit).

#### buzzdavidson

#13
##### Oct 14, 2011, 09:57 pm

In other words, first write code that is correct, can be debugged, and can be maintained.  Then, and only then, worry about hand optimizations.

I absolutely agree with this statement, and would actually take it a bit further. Clarity in code is absolutely essential for the ongoing maintenance and support of software. Overly optimized code is difficult to read, easy to misinterpret, and hard to maintain and extend. This rule is especially important for library developers, who tend to move on to the next shiny project once their current library has been released. Keep in mind that someone else will need to be able to update and maintain your code. In the case of Arduino, you can safely replace "someone else" with "thousands of other developers."

As an example: while your wickedly tight 300-character one-liner with multiple bit shifts and repeated magic numbers may save a couple of processor cycles at runtime, it will be utterly incomprehensible to you in six months, and another developer simply won't get it. At best they may take the time to break the statement down into a more maintainable block of code; in reality they will probably screw up the end result.

http://www2.research.att.com/~bs/JSF-AV-rules.pdf is an excellent, well-thought-out set of guidelines for C++ developers, written by Bjarne Stroustrup (father of C++). Those of us who have written production C++ for military subcontracts know this one well; it is based on hard-earned knowledge from large-scale, real-world projects.

#### nickgammon

#14
##### Oct 14, 2011, 11:03 pm
This made me laugh, from my above post:

```
for (byte i = 0; i < COUNT; i++)
  bc:  9f 5f     subi r25, 0xFF  ; 255
```

Apparently the compiler prefers to subtract 255 from a byte, rather than add 1.
