#define versus const and other software performance issues

It isn't that the compiler (avr-gcc) doesn't support floats; it's the target hardware that has no FPU. Therefore all float math has to be emulated in software.

I also disagree that "F = C x 9/5 + 32" is slower in the float case when rewritten as "F = C x (9/5) + 32", because the compiler will fold 9/5 into 1.8 at compile time. You also, of course, have to promote the operation to floating point to get the desired result: "F = C x (9.0f/5.0f) + 32.0f"

Finally, where consts are used, the compiler substitutes the constant values, so they end up being equivalent to a #define. The difference in speed you're seeing is most likely because consts are typed and #defines are not. If you write:
const float X = 7;
vs
#define X 7
and time Y = X + 1, the #define will be faster because the const float promotes the expression to floating point. The same is true when you const an int, long, or char: left to itself, the compiler will make a better type decision than the one you forced on it. Remember that the AVR is an 8-bit CPU, so using smaller types results in smaller and faster code.

econjack:
I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define.

There is a flaw in your test or in your measurements.

return (voltage * 1000 - 500) / 10;
This can be rewritten as:
return (voltage * 1000 - 500) * .10; // Original was divided by 10
In a tight loop, this change saved about 3% execution time.

Not when voltage is an integer datatype.

(I didn't time bit shifting, but my guess is that it would be even faster.)

The GCC AVR compiler does a very good job optimizing. When bit-shifting is a better choice, the compiler will use it.

In other words, first write code that is correct, can be debugged, and can be maintained. Then, and only then, worry about hand optimizations.

econjack:
I wrote a small test program and found that, in a tight loop, there is about a 3.5% hit on performance when using const versus #define.

Using #define:

#define COUNT 20

void setup (){}

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
    x++;
}

Generates:

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
  bc:	9f 5f       	subi	r25, 0xFF	; 255
  be:	94 31       	cpi	r25, 0x14	; 20
  c0:	d1 f7       	brne	.-12     	; 0xb6 <loop+0xe>
    x++;
}
  c2:	0f 90       	pop	r0
  c4:	cf 91       	pop	r28
  c6:	df 91       	pop	r29
  c8:	08 95       	ret

Using const int:

const int COUNT = 20;

void setup (){}

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
    x++;
}

Generates:

void loop ()
{
  volatile byte x = 0;
  
  for (byte i = 0; i < COUNT; i++)
  bc:	9f 5f       	subi	r25, 0xFF	; 255
  be:	94 31       	cpi	r25, 0x14	; 20
  c0:	d1 f7       	brne	.-12     	; 0xb6 <loop+0xe>
    x++;
}
  c2:	0f 90       	pop	r0
  c4:	cf 91       	pop	r28
  c6:	df 91       	pop	r29
  c8:	08 95       	ret

It generates the same code! There are no percentage differences there.

I agree with this. Don't try to outsmart the compiler.

However you can help it by thinking about your data types. For example:

long foo [100];

Compared to:

byte foo [100];

Now if "foo" can only ever hold 0 to 100, then the "byte" version is going to help the compiler to generate code that is smaller and faster. It can't guess what sort of data you are expecting to get.

While we're on the subject of bit twiddling, I've seen the compiler turn this:

X = (A << 4) | B

into

swap A
A = A & 0xf0
X = A | B

The AVR documentation says that logical left shift takes one clock cycle so it would make more sense to me to have it LSL A, 4 rather than swap the nibbles and mask out the bottom nibble. Any idea why the compiler did it that way? Does it take one cycle per shift? The only other reason I could think of was to preserve the flags.

There is no "LSL A, 4" instruction. The instruction is "LSL A". One bit shifted per machine instruction.

The swap-plus-bitwise-and is two machine instructions. Bit shifting left four bits is four machine instructions.

return (voltage * 1000 - 500) / 10;

This can be rewritten as:

return (voltage * 1000 - 500) * .10; // Original was divided by 10

it can even be rewritten as

return voltage * 100 - 50;

and I recently rewrote F = C x 9/5 + 32 with all integers as F = (C x 18 + 322) / 10: the same amount of math, but "much" higher precision.

Rethinking the math before coding is often the best way to optimize, imho.

Ah, that's what I expected; apparently I didn't read the instruction set reference completely. So often my mind jumps straight to the Intel analogue, which does support multiple shifts per instruction (although it used to take multiple clocks even to shift one bit).

I absolutely agree with this statement, and would actually take it a bit further. Clarity in code is absolutely essential for the ongoing maintenance and support of software. Overly-optimized code is difficult to read, easy to misinterpret, and difficult to maintain and extend. This rule is especially important for library developers, who tend to move on to the next shiny project once their current library has been released. Keep in mind that someone else will need to be able to update and maintain your code. In the case of Arduino, you can safely replace "someone else" with "thousands of other developers."

As an example, while your wickedly tight 300-character one-liner with multiple bit shifts and repeated magic numbers may save a couple of processing cycles at runtime, it will be utterly incomprehensible to you in six months, and another developer simply won't get it. At best they may take the time to break the statement down into a more maintainable block of code; in reality they will probably screw up the end result.

http://www2.research.att.com/~bs/JSF-AV-rules.pdf is an excellent, well-thought-out set of guidelines for C++ developers, developed at Lockheed Martin with input from Bjarne Stroustrup (father of C++). Those of us who have written production C++ for military subcontracts know this one well; it is based upon hard-earned knowledge from large-scale, real-world projects.

This made me laugh, from my above post:

for (byte i = 0; i < COUNT; i++)
  bc:	9f 5f       	subi	r25, 0xFF	; 255

Apparently the compiler prefers to subtract 255 from a byte, rather than add 1.

Actually, I believe there is no add-immediate instruction. Yup. There is an add-immediate-word but no add-immediate-byte.

When Atmel says the AVR processor is "RISC", they aren't kidding!

Well I suppose they don't need it, and subtracting is arguably more useful than adding.

If you're working with integers, you can also do bit shifting to divide by 10. If you do something weird like this, make sure you comment what you did.

Please explain how. Bit shifting one place divides by 2. Then, you still need to divide by 5. I can't figure out how to shift in any direction to divide by 5.

A bit of Googling and I found an answer for at least an approximation. And hey, isn't most stuff an approximation? This test sketch:

const boolean SHOW_RESULTS = false;
const unsigned long ITERATIONS = 50000;

void setup ()
{

  Serial.begin (115200);
  Serial.println ();

  unsigned long start;
 

  Serial.println ("By shifting:");
  
  start = millis ();

  for (unsigned long n = 0; n < ITERATIONS; n++)
  {
    volatile unsigned long n2, n4, n32, n64, n102, ans;

    n2 = n     << 1;    // n * 2
    n4 = n2    << 1;    // n * 4
    n32 = n4   << 3;    // n * 32
    n64 = n32  << 1;    // n * 64
    n102 = n64 + n32 + n4 + n2;  // n * 102
    ans = n102 >> 10;  // n * 102 / 1024

    if (SHOW_RESULTS)
    {
      Serial.print ("n = ");
      Serial.print (n, DEC);
      Serial.print (" n / 10 = ");
      Serial.println (ans, DEC);
    }

  } // end for loop

  Serial.print ("time taken = ");
  Serial.println (millis () - start, DEC);

  Serial.println ("By multiplying and then shifting:");
  
  start = millis ();

  for (unsigned long n = 0; n < ITERATIONS; n++)
  {
    volatile unsigned long ans;

    ans = (n * 102) >> 10;  // n * 102 / 1024

    if (SHOW_RESULTS)
    {
      Serial.print ("n = ");
      Serial.print (n, DEC);
      Serial.print (" n / 10 = ");
      Serial.println (ans, DEC);
    }

  } // end for loop

  Serial.print ("time taken = ");
  Serial.println (millis () - start, DEC);
  
  Serial.println ("By dividing:");
  
  start = millis ();

  for (unsigned long n = 0; n < ITERATIONS; n++)
  {
    volatile unsigned long ans;

    ans = n / 10;

    if (SHOW_RESULTS)
    {
      Serial.print ("n = ");
      Serial.print (n, DEC);
      Serial.print (" n / 10 = ");
      Serial.println (ans, DEC);
    }

  } // end for loop

  Serial.print ("time taken = ");
  Serial.println (millis () - start, DEC);

} // end setup

void loop() {}

As written, I get this:

By shifting:
time taken = 771
By multiplying and then shifting:
time taken = 294
By dividing:
time taken = 1956

So, the basic method is to multiply by 102 and then shift right 10 bits (ie. divide by 1024). This is an approximation in the sense that we want to multiply by 0.1 but are actually multiplying by 102/1024 (0.09961).

I tried three ways, the first was to multiply by 102 by doing assorted shifts, the second was to let the compiler do the multiplication, and the third to let the compiler do the division.

The figures show that the division was the slowest (1956 ms), my shifting idea second slowest (771 ms), and the multiply-then-shift fastest (294 ms). In fact the multiply-then-shift is over six times as fast, if you can stand the loss of precision.

Obviously this is only really useful if you know the numbers in advance. It's also interesting that my attempt to second-guess the compiler and do a "more efficient" multiply by 102 was actually quite a lot slower than what the compiler generated, left to its own devices.

Your method would be faster if you hadn't made the intermediates n2, n4, n32, n64 and n102 unnecessarily volatile. You could make it slightly faster still with the chain 2,3,6,96,102.

The reason the multiply method is fastest is because the ATmega has a 2 cycle 8-bit multiply instruction built in. You really want the shift to be 8 or 16 bits as then the shifting can be done for free (the compiler just moves the bytes in the long down by 1 or 2 places). Try (n*6554)>>16

Your method would be faster if you hadn't made the intermediates n2, n4, n32, n64 and n102 unnecessarily volatile

You are right. I had volatile there to convince the compiler to do the work at all, but making only ans volatile gave me this:

By shifting:
time taken = 497
By multiplying and then shifting:
time taken = 294
By dividing:
time taken = 1956

Try (n*6554)>>16

Good call. I got:

By multiplying and then shifting:
time taken = 75

Closer too. Rather than multiplying by 0.1 we are multiplying by 0.1000061.

Which is about the same number of significant digits expected from float. Orders of magnitude faster and just as accurate. What's not to like!

I wrote a shift-multiply finder a few years back in C#: Optimizing integer divisions with Multiply Shift in C# - CodeProject.