Stack Cost, Inline, and Compiler Optimization

For the sake of readability and maintainability, I have a lot of subprocesses broken out into individual functions, even though they may only be called from one place. In many cases, that one place may be inside a loop.

So now I'm trying to optimize one of my sketches for speed and ram, and I started wondering what the cost of this was in processor cycles and bytes of ram pushed to and from the stack. I did some googling but couldn't find any clear answers, especially in regards to Arduino and Atmel.

Then I recalled the "inline" command and started researching it. Supposedly the compiler optimizations often do this by itself, but I'm also aware that when the Arduino IDE launches the gcc compiler, it optimizes for size and not speed. I've also read a lot of people bitching that its optimizations often aren't very good.

So I am seeking advice from you advanced gurus - Should I:

  1. leave it up to the compiler - or
  2. add the inline keyword to my functions that are only called from one place, in loops - or
  3. move the code out of the separate function and into the loop?

And, can someone quote me the approximate cost in time and ram to push and pull from the stack?

I am a programming heretic - I do not use functions! Everything is written out inline.
I did add in my very first function recently, to send display messages to a MAX7219. But that was the first time since late 2010 when I started writing Arduino code.

Your call. Nick Gammon did some interrupt time testing, showed it took like 6-8uS to jump to an ISR and back, I imagine a function call would be similar.
Try the Interrupts section here

My answer is : maybe options 1, maybe option 2 and maybe option 3 :wink:

This is Nick Gammons page about the stack for an interrupt: Gammon Forum : Electronics : Microprocessors : Interrupts
But that could be very different from calling a normal function.

The compiler can do more than you think. For example the time to use a stack and call a function could be small or big, the compiler can optimize that.

Do you have a test sketch for me, so I can try some optimizations ?

When you want to optimize things, the compiler can do its part and you should do your part. For example the use of buffers and the functions like memcpy() and so.
The Arduino doesn't have a profiler, so you have to guess which part of the code is executed most of the time, and would benefit to optimize it.

The most compiler options for the avr-gcc compiler are for loop-optimization and inline functions. I tried a number of those, but it didn't do a lot.

Arduino IDE 1.5.8 BETA uses the newer compiler. It allows the -flto option. I use that for all my Arduino boards and standalone microcontrollers now. It's a very good optimization for size and speed (5 to 25%).
To set the -flto option : Code size with different versions and -flto optimization - Libraries - Arduino Forum

(while I was typing this, CrossRoads already mentioned Nick Gammon's page)

I've also read a lot of people bitching that its optimizations often aren't very good.

I'm not one of those people. I don't see any references to ISRs in the original question.

And, can someone quote me the approximate cost in time and ram to push and pull from the stack?

First, the compiler will tend to inline small functions, or even large functions called once. So there may be no overhead at all. As for calling a function I think it will be one machine cycle per byte, and it would take two cycles therefore to push a return address, and another two to pull it back. Basically the overhead would be minimal.

I always code for readability.

I wouldn't even bother with the inline keyword. It's just a compiler suggestion, it may not make any difference.

What is more likely to make a difference is choosing good data types (as small as will do the job), avoiding floating point arithmetic, using the F macro for strings, that sort of thing.

So now I'm trying to optimize one of my sketches for speed and ram ...

For a more detailed response please attach this sketch.

I did a test, using a Morse Code sketch I had lying around. Manually inlining a function called once saved 50 bytes of program code. That appeared to be (in part):

// get a line from serial (file name) 
//  forces to upper case
void getline (char * buf, size_t bufsize)
 196:	cf 92       	push	r12
 198:	df 92       	push	r13
 19a:	ff 92       	push	r15
 19c:	0f 93       	push	r16
 19e:	1f 93       	push	r17
 1a0:	cf 93       	push	r28
 1a2:	df 93       	push	r29
 1a4:	8c 01       	movw	r16, r24
 1a6:	eb 01       	movw	r28, r22
 1a8:	04 c0       	rjmp	.+8      	; 0x1b2 <_Z7getlinePcj+0x1c>

...

  }     // end of getline
 20c:	df 91       	pop	r29
 20e:	cf 91       	pop	r28
 210:	1f 91       	pop	r17
 212:	0f 91       	pop	r16
 214:	ff 90       	pop	r15
 216:	df 90       	pop	r13
 218:	cf 90       	pop	r12
 21a:	08 95       	ret

So 2 bytes per push/pop in program code, plus it seems it takes 2 cycles, not one, that I wrote earlier. Adding "inline" in front of this function did indeed save the 50 bytes in the compiled code. So perhaps there is some point, particularly for functions you know are only called once. That way you get the readability, without the overhead of the extra size and time taken.

A call takes 4 cycles and a return takes 4 cycles.

I tried a very simple example:

void setup()
{
  Serial.begin(115200);
  foo();
}

void loop() {}

void foo() {
  Serial.println("Hello world.");
}

Although this function is called only once the compiler did not put it inline. It generated call and return instructions. When I made foo() inline, either by declaring it inline or by placing the code inside setup(), the size of the sketch actually increased by two bytes. The inline code lacked the call/ret instructions but for some reason added two pushes and two pops to setup(), hence the two byte difference.

I didn't time the executions but based on the assembly code it would be a wash.

This isn't a very useful example but it does hint at possible unpredictability.

Arduino uses "-Os" to optimize for size, perhaps the compiler decided that not doing it inline resulted in shorter code. That simple "-Os" has a whole world of optimization decisions behind it.

DrWizard:
So I am seeking advice from you advanced gurus - Should I:

  1. leave it up to the compiler - or
  2. add the inline keyword to my functions that are only called from one place, in loops - or
  3. move the code out of the separate function and into the loop?

I'm not a Guru and I am only advanced in years.

For me readabilitiy is the number one goal. I am much more concerned to save my time rather than the Arduino's.

If it works properly (which includes fast enough) then do nothing.
If it does not work properly (including running too slow) then figure out where the problem is.

My guess is that far more run-time speed can be gained by modifying the program than by any attempt to improve on the compiler's optimizations.

...R

jboyton:
A call takes 4 cycles and a return takes 4 cycles.

Yes but you have to take into account the pushes and pops the compiler generates to protect the registers used inside the function.

Back to the original question:

So now I'm trying to optimize one of my sketches for speed and ram,

One of the favorite RAM optimizations is to move const char arrays or lookup tables to PROGMEM. (google +PROGMEM +Arduino +tutorial) The price is a small performance penalty, but can save a lot of precious RAM.

Second a good way to optimize RAM and speed is to check the datatypes of the variables used.
A short check list:

  • 8 bit vars are faster than 16 bits vars are faster than 32 bits vars
  • floats are faster than long (but float have less significant bits)
  • moving math from float domain to 16 bit int domain speeds things up
  • avoid division, use multiplication if possible.
  • reuse calculated values (do math only once)
  • mixed unsigned and signed should be avoided if possible.
  • lookup tables can be faster than functions (but costs more RAM)
  • exact function might be slower than approximations that often are good enough.
  • the use of the String class can slow things down and is not RAM friendly.

Finally the algorithm used can make a big difference. E.g. Recursive functions are slower than iterative and use more RAM in general.

If you post the actual code we might help to find some hot spots.

you have to take into account the pushes and pops the compiler generates to protect the registers used inside the function.

Maybe. "small" functions won't need to save many (or any!) registers. "large" function would need to do the equivalent of saving those registers even if the code was inline. Argument passing MOST of the time is done in registers, so you're not always putting arguments on the stack. local variables in functions give back their memory (if they use any) when the function returns, while having "many" local variables in a single function (or global variables) uses that memory ALL the time.

I'd say: use your functions, but you should actually LOOK at the machine code/assembler produced by the compiler, and SEE if it seems to have a lot of overhead associated with the function calls. See "avr-objdump." Usually it won't.

(I actually think that "we" don't suggestion dividing problems into functions NEARLY as much as we should, here...)

The size of the sketch (in Flash) is not my issue. Ram usage and speed are the issue. Lemme tell you what the project is: I am doing WS2801 "NeoPixels" on my Christmas tree, and also 2 trees for friends. Depending on the size of our trees, we are using 250, 400, and 500 pixels. These are being controlled by Arduino Mega 2560's because I needed the extra RAM just for the buffers for the pixels, if nothing else. I have programmed in some very advanced (and very cool looking!) effects, but refresh speed is an issue on some effects.

I was a professional programmer for 30 years before retiring. I have a ton of experience programming in other languages, including C# and Java just to name a few. But C++ I'm still struggling a bit with; it's very unforgiving. I wholly agree that readability and maintainability are very important. But, efficiency is just one of my pet peeves. That's left over from my IMSAI and Altair days I have already gone to great lengths to optimize this sketch. Variable sizes are the minimum necessary. All strings and constants are in PROGMEM. I cache results of calculations where I can. I do have some floating point math still in the sketch and I know that is causing a major slowdown, but I'm working on converting that to fixed point. About the only thing I haven't tried to optimize yet is the inlining of all my different processes, which as I mentioned earlier, are in separate functions for readability. But many of them are called from within loops so the stack cost -- (assuming the compiler doesn't optimize it well for me) -- would add up and be considerable. There are lots and lots of loops in this sketch, often nested many levels deep in order to do the complex effects.

The Arduino libraries can sometimes be replaced with direct writing to registers, which could be 20 times faster. Can you show us the sketch or the most import part with loops ?

I think most things have already been said:

  • There is no such thing as stack cost, at least not stack cost that is predictable. The ATmega microcontrollers have many registers, and the compiler uses those, as DrWizard wrote.
  • The new avr-gcc compiler allows the -flto option which can increase the speed and decrease the size by 5 to 25% as I wrote.
  • I disagree with robtillaart, floats usually take more time than 32-bit long (and I have tested it). But the difference is not that big. Dividing is however a lot slower than multiplexing.
  • The default -Os (optimize for size) can be overwritten by a speed optimization. Since you have a Arduino Mega, code size is no problem. And it will increase the speed a lot.

The gnu avr gcc compiler has decades of development for optimizations, they even outsmart your 30 years as professional programmer ;D It is impossible to tell if using more functions would slow things down.
You can use our knowledge about optimizing the Arduino specific things. It will not be the loops, neither the use of many functions, but calls like digitalWrite(2, HIGH), which could be 20 times slower than bitSet( PORTE, 4).

DrWizard:
The size of the sketch (in Flash) is not my issue. Ram usage and speed are the issue.

Post your code please. This is all hypothetical. If you are not short of PROGMEM then floating point calculations could conceivably be done by table lookup.

See NeoPixels Revealed: How to (not need to) generate precisely timed signals | josh.com

He managed to drive over 1000 pixels using only 5% of a Arduino Duemilanove's RAM (2 kB).

I agree with Nick Gammon, this topic has too much talk about trivial things.
So I also did a small test on my Arduino Mega 2560. I used the raytracer by greg-kennedy with a few modification.
http://forum.arduino.cc/index.php?topic=281076.0
The modified sketch that I used is attached.

Arduino IDE 1.0.6 : 53 seconds
Arduion IDE 1.5.8 : 57 seconds (that's not good, slower)

I tried some options in the platform.txt for Arduino IDE 1.5.8:
-flto : 53 seconds (from 57 to 53, okay)
-O3 or -Ofast : 49 seconds (optimized for speed instead of size)

I could not go lower than 49 seconds. I tried -funsafe-loop-optimizations -ffast-math -funsafe-math-optimizations but it stayed near 49 seconds.

My conclusion: That is only a small improvement. A higher speed can be achieved by a different code implementation, or specific use of the ATmega microcontroller or hardware.

greg-bench.ino (37.9 KB)

And the answer is ...

#1 let the compiler worry about it. It does optimize for inlining. Consider the following 3 sketches:
First, the control:

void setup()
{
  Serial.begin (115200);
  int r;
  unsigned long startm = micros();
  for (int i = 0; i < 2048; i++)
  {
    r = foobar(i);
    Serial.print(r);  // If I don't do something with the result, the compiler optimizes it all out
  }
  Serial.println ();
  Serial.println (micros() - startm);
}

void loop()
{
}

int foobar(int barfoo)
{
  return barfoo * 3 - 5;
}

Next, we try it with the inline keyword:

void setup()
{
  Serial.begin (115200);
  int r;
  unsigned long startm = micros();
  for (int i = 0; i < 2048; i++)
  {
    r = foobar(i);
    Serial.print(r);  // If I don't do something with the result, the compiler optimizes it all out
  }
  Serial.println ();
  Serial.println (micros() - startm);
}

void loop()
{
}

int inline foobar(int barfoo)
{
  return barfoo * 3 - 5;
}

And for the third test, we move the code to be inline:

void setup()
{
  Serial.begin (115200);
  int r;
  unsigned long startm = micros();
  for (int i = 0; i < 2048; i++)
  {
    r = i * 3 - 5;
    Serial.print(r);  // If I don't do something with the result, the compiler optimizes it all out
  }
  Serial.println ();
  Serial.println (micros() - startm);
}

void loop()
{
}

I didn't look at the assembly output, but all 3 versions compile to 4022 bytes, and run in 659288 microseconds, so I would expect all 3 versions compiled the same.

Now, for those of you who wanted to see my code, here is my WS2801 Xmas light sketch. It is still a work in progress, but I worked to keep ram usage low at the expense of flash. I also take advantage of the EEPROM to hold data that doesn't change often (and that I want to retain between power cycles). It does require a Mega 2560 because of the size of the buffers for the pixel colors and list. The calcFade function is my biggest time hog with 2 floating point calculations but I' working on changing that to scaled integers. I'm sure I could reduce flash usage quite a bit by ditching the FastLED library and using my own bitbang routine for the pixels. I started to do that but couldn't figure out the correct way to do the pointers.

WS2811_XmasFull (2).Release 1.01 for Richard.zip (21.5 KB)

I didn't look at the assembly output, but all 3 versions compile to 4022 bytes, and run in 659288 microseconds, so I would expect all 3 versions compiled the same.

Yes, but the Serial.prints are the limiting factor. A useful trick in this case is to make a variable volatile so the compiler doesn't optimize away the result.

So:

volatile int r;

void setup()
{
  Serial.begin (115200);
  unsigned long startm = micros();
  for (int i = 0; i < 2048; i++)
  {
    r = foobar(i);
  }
  Serial.println ();
  Serial.println (micros() - startm);
}

void loop()
{
}

int foobar(int barfoo)
{
  return barfoo * 3 - 5;
}

Size: 2452 bytes, time 1448 µS. (For all three versions).

So your conclusion was correct, but from an unsound test. :wink:

Your baudrate is only 9600: Serial.begin(9600);
When the Serial output buffer (in the Serial library) is filled, the Arduino waits until there is a free byte in the buffer. After that it fills the buffer with one byte, and waits again.
Reading the EEPROM is not very fast but writing EEPROM is very slow.

About your loop: you check the buttons, and that uses I2C. That is very slow.
The num_leds is 50 to 350 and that is the amount you call calcFade() and buttonCheck().
The calcFade() is straightforward.

I looked into the Adafruit lcd.readButtons()
digitalRead is called 5 times : Adafruit-RGB-LCD-Shield-Library/Adafruit_RGBLCDShield.cpp at master · adafruit/Adafruit-RGB-LCD-Shield-Library · GitHub
That causes 5 i2c sessions : https://github.com/adafruit/Adafruit-RGB-LCD-Shield-Library/blob/master/Adafruit_MCP23017.cpp
Every session uses the Wire.requestFrom() to read one byte. The Wire.requestFrom() function waits until the i2c session has finished.
The amount of floating point calculations that can be done instead of lcd.readButtons!

I suggest to use 115200 baudrate, and don't print from within deep loops. Try to rewrite the calcFade with unsigned long or unsigned int. The calcFade might be 20% to 50% faster (just a wild guess).
And don't call readButtons() inside that deep loop at all. Perhaps two levels higher, or no more than 5 times a second. The loop might be 5% to 100% faster (an even wilder guess).

Thanks guys! Useful information, but:

Serial debugging output is completely shutoff in the release version -- by undeclaring Debug_Mode in DebugUtils.h.
The effect currently being played is held in ram. It reads it into ram from EEPROM only at the start of a new effect. I have noticed a tiny delay there, but that doesn't happen often enough to be an issue.
Writing to EEPROM is done only if the user edits their configuration, which at the moment isn't even possible because I haven't finished the menus.
I knew the floating point math in calcFade() was causing a serious slowdown, and I'm working on changing that to fixed point math, as mentioned above. That only affects 2 of the effects though.
readButtons() gets called during the calcFade() loops because the floating math point was so slow, and areadButtons() was needed to keep the user interface / menu system responsive for the user. Once I fix the math problem, I should no longer need to do a readButtons() in that loop for responsiveness.
*** I did not realize however how bad Adafruit's readButtons() function was. So I think I need to write my own more direct version of it.

Thanks again fellas! I learned some new tricks to put in my bag!