#define versus const and other software performance issues

I've just added Maik Schmidt's Arduino: A Quick Start Guide to my expanding library. Overall, I like the book and think it's well-written. (I've written 15 programming books and taught in the computer science department at a Big Ten university, so I'm comfortable commenting on writing style.) However, there are a few places where he is misleading and perhaps even wrong.

On page 94, he makes the statement:

So, whenever you define a constant value in your program, declare it as such (using const, not using #define).

Blanket statements like this are always dangerous, and #define has a place in C. First, a #define is a textual replacement...that's all. The result is that there is no entry in the symbol table...there is no rvalue or lvalue. So what, you might ask. Well, it has two impacts on your Arduino code. First, #define's are resolved at compile time, not run time. Second, coupled with the first fact, this suggests that the code should run slightly faster for #define's since there is no table lookup. I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define. On the downside, because #define's are textual replacements and they are not in the symbol table, they cannot be examined by a debugger...they're gone after compilation. However, since they are constants, this is no big loss.
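
To make the distinction concrete, here's a minimal sketch (the names are mine, not the book's) showing the two styles side by side:

#define LED_DELAY_DEF 250                   // textual replacement: no type, no symbol-table entry
const unsigned int LED_DELAY_CONST = 250;   // typed constant the compiler and a debugger can see

void setup ()
{
  pinMode (13, OUTPUT);
}

void loop ()
{
  digitalWrite (13, HIGH);
  delay (LED_DELAY_DEF);     // the preprocessor has already substituted the literal 250 here
  digitalWrite (13, LOW);
  delay (LED_DELAY_CONST);   // a named, typed constant with the same value
}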

Another thing to keep in mind is that just about the slowest basic math operation you can do is a division. At one place in the book, he shows the code:

return (voltage * 1000 - 500) / 10;

This can be rewritten as:

return (voltage * 1000 - 500) * .10; // Original was divided by 10

In a tight loop, this change saved about 3% execution time. If you're working with integers, you can also do bit shifting to divide by 10. If you do something weird like this, make sure you comment what you did. (I didn't time bit shifting, but my guess is that it would be even faster.)

For what it's worth, I did the multiply versus divide test in Visual Studio using C# and found almost a 66% speed improvement using the multiply on floating-point numbers. If you work in VS, forget the float data type. Because of the built-in numeric coprocessor in Pentium-class chips (descended from the old 8087), all floats get promoted to double, the operation is calculated, and then the result must be "demoted" back to a float for the answer. All this promotion and demotion takes time. The #define versus const tradeoff in VS is virtually the same as the Arduino results.

Finally, when I started the test in the Arduino IDE, I was stunned at the speed of the loop. It was so fast that I tripled the number of loop passes and ran the test again. No change in the time. Then I had a flat-forehead epiphany (you know, where you ram the heel of your hand into your forehead and wonder how you could be so dumb). Because I did nothing with the result of the single math statement in the loop, the compiler simply ignored the loop! This suggested a pretty good optimizer, so I checked to see where the Arduino compiler came from. I then learned it's based on the GNU (open source) compiler, which is a pretty darn good compiler.
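
For anyone who wants to reproduce the divide-versus-multiply comparison, here is a stripped-down sketch of the kind of test loop I'm describing (not the exact program I ran), with the result marked volatile so the compiler can't throw the work away:

void setup ()
{
  Serial.begin (115200);

  volatile float result = 0;
  unsigned long start;

  start = millis ();
  for (unsigned long i = 0; i < 50000UL; i++)
    result = (i * 1000.0 - 500.0) / 10.0;      // divide
  Serial.print ("divide:   ");
  Serial.println (millis () - start);

  start = millis ();
  for (unsigned long i = 0; i < 50000UL; i++)
    result = (i * 1000.0 - 500.0) * 0.10;      // multiply by the reciprocal
  Serial.print ("multiply: ");
  Serial.println (millis () - start);
}

void loop () {}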

The lessons to be learned: 1) #define's are not bad things and actually give a slight performance boost, 2) when working with floating-point numbers and constants, use multiplication instead of division whenever you can, and 3) resolve any constant expressions yourself rather than leaving them for the code to compute. For example, a temperature conversion:

F = C x 9/5 + 32

is faster if rewritten:

F = C x 1.8 + 32 // 1.8 = 9/5

While this change won't make a hill of beans difference alone, it could if it's in a tight loop.

I hope this helps...

Nice analysis. But I have to take exception to point #3.

F = C x 9/5 + 32

is faster if rewritten:

F = C x 1.8 + 32 // 1.8 = 9/5

That is only correct if F and C are floating-point numbers. If they are integers, the first expression will be much faster. The Arduino chip has no built-in floating-point support; it's all done by emulation in the libraries. So the second expression involves converting an int to a float, doing a floating-point multiplication, and converting back to an int, whereas the 8-bit multiply instruction takes 2 cycles. Integer division takes longer, but it is surely less than a floating-point multiply.

$0.02

Yep, speed issues are only for floating point numbers.

I was not aware that the Arduino compiler was integer only and there was no native floating point support. I'm not sure what the impact is in this case. Do you know if there is any thought about native fp for the compiler? I know the gnu compiler supports both floating point types. (I think it meets the old X3J11 C language standards.)

The compiler does floating point: it's part of the C language (it's gcc-avr). The hardware has no floating point support, so all the floating point math is done by emulation. The compiler inserts the code automatically to emulate floating point operations. It's just very expensive in terms of cycles used and in code space as well, compared to integer math.

Most of the time, I try really hard not to do floating-point operations on an Arduino, even if it makes the code a little messier. If performance is an issue, you're better off doing integer math.
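
For example, the (voltage * 1000 - 500) / 10 calculation quoted earlier can stay entirely in integer math if you work in millivolts. Here's a sketch of the idea, assuming a 5 V reference and a TMP36-style sensor (500 mV offset, 10 mV per degree), with a pin name I made up:

const int sensorPin = A0;   // assumed wiring, just for illustration

void setup ()
{
  Serial.begin (115200);
}

void loop ()
{
  // 0..1023 from analogRead maps to 0..5000 mV; do the scaling with long math
  long millivolts = analogRead (sensorPin) * 5000L / 1023;
  long tenthsC = millivolts - 500;   // 10 mV per degree C, so 1 mV = 0.1 degree
  Serial.print (tenthsC / 10);       // whole degrees
  Serial.print ('.');
  Serial.println (tenthsC % 10);     // tenths (assumes a positive reading)
  delay (1000);
}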

It isn't that the compiler (gcc-avr) doesn't support floats; it's that the target hardware has no FPU. Therefore all float math has to be emulated.

I also disagree that, in the float case, "F = C x 9/5 + 32" is slower than the rewritten "F = C x 1.8 + 32", because the compiler will fold the constant expression at compile time. You do, of course, have to promote the operation to floating point to get the desired result: "F = C x (9.0f/5.0f) + 32.0f".

Finally, where consts are used, the compiler replaces them with the const values, so they end up being equivalent to a #define. The difference in speed you're seeing is most likely because consts are typed and #defines are not. If you say:

const float X = 7

versus

#define X 7

and time Y = X + 1, the #define will be faster because the const float promotes the expression to float. The same is true when you const an int, long, or char: left to itself, the compiler will often make a better choice than the type you forced on it. Remember that the AVR is an 8-bit CPU, so using smaller types results in smaller and faster code.
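
Spelled out as a compilable sketch (the variable names are mine):

const float X_FLOAT = 7;   // typed: drags any arithmetic with it into (emulated) floating point
#define X_DEFINE 7         // untyped: the expression below stays in integer math

void setup ()
{
  Serial.begin (115200);

  volatile int y;
  y = X_FLOAT + 1;    // software-emulated float add, then truncated back to int
  y = X_DEFINE + 1;   // plain integer add
  Serial.println (y);
}

void loop () {}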

econjack:
I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define.

There is a flaw in your test or in your measurements.

return (voltage * 1000 - 500) / 10;
This can be rewritten as:
return (voltage * 1000 - 500) * .10; // Original was divided by 10
In a tight loop, this change saved about 3% execution time.

Not when voltage is an integer datatype.

(I didn't time bit shifting, but my guess is that it would be even faster.)

The GCC AVR compiler does a very good job optimizing. When bit-shifting is a better choice, the compiler will use it.

In other words, first write code that is correct, can be debugged, and can be maintained. Then, and only then, worry about hand optimizations.

econjack:
I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define.

Using #define:

#define COUNT 20

void setup (){}

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
    x++;
}

Generates:

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
  bc:	9f 5f       	subi	r25, 0xFF	; 255
  be:	94 31       	cpi	r25, 0x14	; 20
  c0:	d1 f7       	brne	.-12     	; 0xb6 <loop+0xe>
    x++;
}
  c2:	0f 90       	pop	r0
  c4:	cf 91       	pop	r28
  c6:	df 91       	pop	r29
  c8:	08 95       	ret

Using const int:

const int COUNT = 20;

void setup (){}

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
    x++;
}

Generates:

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
  bc:	9f 5f       	subi	r25, 0xFF	; 255
  be:	94 31       	cpi	r25, 0x14	; 20
  c0:	d1 f7       	brne	.-12     	; 0xb6 <loop+0xe>
    x++;
}
  c2:	0f 90       	pop	r0
  c4:	cf 91       	pop	r28
  c6:	df 91       	pop	r29
  c8:	08 95       	ret

It generates the same code! There are no percentage differences there.

I agree with this. Don't try to outsmart the compiler.

However you can help it by thinking about your data types. For example:

long foo [100];

Compared to:

byte foo [100];

Now if "foo" can only ever hold 0 to 100, then the "byte" version is going to help the compiler to generate code that is smaller and faster. It can't guess what sort of data you are expecting to get.

While we're on the subject of bit twiddling, I've seen the compiler turn this:

X = (A << 4) | B

into

swap A
X = A & 0xf0
X = X | B

The AVR documentation says that a logical left shift takes one clock cycle, so it would make more sense to me to use LSL A, 4 rather than swapping the nibbles and masking out the bottom nibble. Any idea why the compiler did it that way? Does it take one cycle per bit shifted? The only other reason I could think of was to preserve the flags.

There is no "LSL A, 4" instruction. The instruction is "LSL A". One bit shifted per machine instruction.

The swap-plus-bitwise-and is two machine instructions. Bit shifting left four bits is four machine instructions.

return (voltage * 1000 - 500) / 10;

This can be rewritten as:

return (voltage * 1000 - 500) * .10; // Original was divided by 10

it can even be rewritten as

return voltage * 100 - 50;

and I recently rewrote F = C x 9/5 + 32 with all integers as F = (C x 18 + 322) / 10; same amount of math, but "much" higher precision.

Rethinking the math before coding is often the best way to optimize, imho.

Ah, that's what I expected, but apparently I didn't read the instruction set reference completely. So often my mind jumps straight to the Intel analogue, which does support multiple shifts per instruction (although it used to take multiple clocks even to shift one bit).

I absolutely agree with this statement, and would actually take it a bit further. Clarity in code is absolutely essential for the ongoing maintenance and support of software. Overly-optimized code is difficult to read, easy to misinterpret, and difficult to maintain and extend. This rule is especially important for library developers, who tend to move on to the next shiny project once their current library has been released. Keep in mind that someone else will need to be able to update and maintain your code. In the case of Arduino, you can safely replace "someone else" with "thousands of other developers."

As an example, while your wickedly tight 300-character one-liner with multiple bit shifts and repeated magic numbers may save a couple of processing cycles at runtime, it will be utterly incomprehensible to you in six months, and another developer simply won't get it. At best they may take the time to break the statement down into a more maintainable block of code; in reality they will probably screw up the end result.

http://www2.research.att.com/~bs/JSF-AV-rules.pdf is an excellent, well-thought-out set of guidelines for C++ developers, written by Bjarne Stroustrup (father of C++). Those of us who have written production C++ for military subcontracts know this one well; it is based upon hard-earned knowledge from large-scale, real-world projects.

This made me laugh, from my above post:

for (byte i = 0; i < COUNT; i++)
  bc:	9f 5f       	subi	r25, 0xFF	; 255

Apparently the compiler prefers to subtract 255 from a byte, rather than add 1.

Actually, I believe there is no add-immediate instruction. Yup. There is an add-immediate-word but no add-immediate-byte.

When Atmel says the AVR processor is "RISC", they aren't kidding!

Well I suppose they don't need it, and subtracting is arguably more useful than adding.

If you're working with integers, you can also do bit shifting to divide by 10. If you do something weird like this, make sure you comment what you did.

Please explain how. Bit shifting one place divides by 2. Then, you still need to divide by 5. I can't figure out how to shift in any direction to divide by 5.

A bit of Googling and I found an answer for at least an approximation. And hey, isn't most stuff an approximation? This test sketch:

const boolean SHOW_RESULTS = false;
const unsigned long ITERATIONS = 50000;

void setup ()
{

  Serial.begin (115200);
  Serial.println ();

  unsigned long start;
 

  Serial.println ("By shifting:");
  
  start = millis ();

  for (unsigned long n = 0; n < ITERATIONS; n++)
  {
    volatile unsigned long n2, n4, n32, n64, n102, ans;

    n2 = n     << 1;    // n * 2
    n4 = n2    << 1;    // n * 4
    n32 = n4   << 3;    // n * 32
    n64 = n32  << 1;    // n * 64
    n102 = n64 + n32 + n4 + n2;  // n * 102
    ans = n102 >> 10;  // n * 102 / 1024

    if (SHOW_RESULTS)
    {
      Serial.print ("n = ");
      Serial.print (n, DEC);
      Serial.print (" n / 10 = ");
      Serial.println (ans, DEC);
    }

  } // end for loop

  Serial.print ("time taken = ");
  Serial.println (millis () - start, DEC);

  Serial.println ("By multiplying and then shifting:");
  
  start = millis ();

  for (unsigned long n = 0; n < ITERATIONS; n++)
  {
    volatile unsigned long ans;

    ans = (n * 102) >> 10;  // n * 102 / 1024

    if (SHOW_RESULTS)
    {
      Serial.print ("n = ");
      Serial.print (n, DEC);
      Serial.print (" n / 10 = ");
      Serial.println (ans, DEC);
    }

  } // end for loop

  Serial.print ("time taken = ");
  Serial.println (millis () - start, DEC);
  
  Serial.println ("By dividing:");
  
  start = millis ();

  for (unsigned long n = 0; n < ITERATIONS; n++)
  {
    volatile unsigned long ans;

    ans = n / 10;

    if (SHOW_RESULTS)
    {
      Serial.print ("n = ");
      Serial.print (n, DEC);
      Serial.print (" n / 10 = ");
      Serial.println (ans, DEC);
    }

  } // end for loop

  Serial.print ("time taken = ");
  Serial.println (millis () - start, DEC);

} // end setup

void loop() {}

As written, I get this:

By shifting:
time taken = 771
By multiplying and then shifting:
time taken = 294
By dividing:
time taken = 1956

So, the basic method is to multiply by 102 and then shift right 10 bits (i.e., divide by 1024). This is an approximation in the sense that we want to multiply by 0.1 but are actually multiplying by 102/1024 (0.09961).

I tried three ways, the first was to multiply by 102 by doing assorted shifts, the second was to let the compiler do the multiplication, and the third to let the compiler do the division.

The figures show that the division was the slowest (1956 ms), my shifting idea second slowest (771 ms), and the multiply-then-shift fastest (294 ms). In fact the multiply-then-shift is over 6 times as fast, if you can stand the loss of precision.

Obviously this is only really useful if you know the numbers in advance. It's also interesting that my attempt to second-guess the compiler and do a "more efficient" multiply by 102 was actually quite a lot slower than what the compiler generated, left to its own devices.

Your method would be faster if you hadn't made the intermediates n2, n4, n32, n64 and n102 unnecessarily volatile. You could make it slightly faster still with the chain 2,3,6,96,102.

The reason the multiply method is fastest is that the ATmega has a 2-cycle 8-bit multiply instruction built in. You really want the shift to be 8 or 16 bits, as then the shifting can be done for free (the compiler just moves the bytes in the long down by 1 or 2 places). Try (n*6554)>>16
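
As a sketch of that suggestion (my own helper function, and still an approximation: 6554/65536 is about 0.10001, so the result can come out one too high once n gets up around 16384 or more):

// Approximate n / 10 for a 16-bit value using the hardware multiplier and a
// byte-aligned shift; the intermediate needs 32 bits so it can't overflow.
unsigned int div10 (unsigned int n)
{
  return ((unsigned long) n * 6554UL) >> 16;
}

void setup ()
{
  Serial.begin (115200);
  Serial.println (div10 (1234));   // prints 123
}

void loop () {}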