Function defined in separate .cpp file -> slower??

Hi,

I'm profiling a critical function and I need to keep the execution time to the very minimum.
After lots of headaches, I have realized that if I declare and define the function in a different .h and .cpp file than the one that is calling the function, the computation time is much larger!!

Here's an example (obviously useless, just for demonstration):

main.ino

#include "fooAux.h"
#define N       100

unsigned long t1, t2;

int16_t fooIn(float x)
{
   if (x > 0.5f)
    return 2;
   else 
    return 1;   
}

void setup() {
  Serial.begin(115200);
}

void loop() 
{
  float x = -1.0f;
  
  // Function defined in current .cpp file
  t1 = micros(); 
  for(uint8_t i = 0; i < N; ++i)
  {
    fooIn(x);
    asm("");
  } 
  t2 = micros();
  Serial.println("Time fooIn: " + String((float)(t2-t1)/N));
  
  
  // Function defined in fooAux.cpp
  t1 = micros(); 
  for(uint8_t i = 0; i < N; ++i)
  {
    fooAux(x);
    asm("");
  } 
  t2 = micros();
  Serial.println("Time fooAux: " + String((float)(t2-t1)/N));  
  
  delay(2000);
}

fooAux.h

#ifndef FOOAUX_H
#define FOOAUX_H

#include <stdint.h>

int16_t fooAux(float x);

#endif

fooAux.cpp

#include "fooAux.h"

int16_t fooAux(float x)
{
	if (x > 0.5f)
		return 2;
	else
		return 1;		
}

When I run this program I get the following printout:
Time fooIn: 0.24
Time fooAux: 4.88

It takes like 20 times more time to execute!! What on earth is going on here?

Thanks!

Yes, of course: a function defined in a separate file cannot easily be inlined.
The compiler will often optimise calls to small functions whose definition it can see by inlining
the code at the call site, rather than generating code to call it.
Separate files are compiled separately, and the resulting object code is combined by the linker,
which fixes up all the calls between functions so they have the right run-time addresses.
Unless link-time optimisation (GCC's -flto) is enabled, the linker cannot inline the code, so a real call using the general calling convention is made.

But you also have to consider whether the compiler has completely optimised away the call altogether:
if a function causes no side effects and its result is not used, the compiler can simply drop it, since
nothing else in the program can ever tell whether it ran (except for benchmark timings).
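
One common way to let the compiler inline a small function that lives outside the sketch is to put its definition, not just its declaration, in the header. The following is only a sketch of that idea: the name fooAuxInline is made up for illustration and is not part of the original code.

```cpp
// fooAux.h (hypothetical variant): the definition lives in the header
#ifndef FOOAUX_H
#define FOOAUX_H

#include <stdint.h>

// 'static inline' gives every file that includes this header its own copy,
// so the compiler can see the body and may inline it at each call site.
static inline int16_t fooAuxInline(float x)
{
  return (x > 0.5f) ? 2 : 1;
}

#endif
```

Whether the call is actually inlined is still up to the optimizer, and the result must still be used for the call to survive in a benchmark.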

Tinrik:
It takes like 20 times more time to execute!! What on earth is going on here?

Are you really sure that anything "executes" when you get the short time as a result?

My guess is that, since you never use the results of fooIn() or fooAux() in any way, this happens:
The compiler "optimizes out" all of the calls to fooIn() from the program, so that fooIn() is never actually executed.

And your "demonstration" code just shows how clever optimizing compilers can be in certain situations, and how the compiler can remove function calls completely if the result doesn't matter and is never used.

BTW: If your code is having "a critical function and I need to keep the execution time to the very minimum" then I'm wondering why you use "float" as a data type in your code.

Indeed I realized the compiler was optimizing out the call to fooIn(); if I comment it out I get the same result.

Then I guess I have to do some operation with the result of fooIn(), which will then impact the execution time. Is there a way to do this kind of benchmarking without such an impact? Is it possible to disable all compiler optimizations within the Arduino IDE?

Thanks!

To measure time in this kind of loop, make the essential data volatile (as you would for variables shared with interrupts).
Then the compiler cannot optimize away the assignments in the loop.

int16_t foo1(float x)
{
  if (x > 0.5)
    return 2;
  else
    return 1;
}

int16_t foo2(float x)
{
  if (int(x * 2) > 1) return 2;
  return 1;
}

volatile int a;

uint32_t start;
uint32_t stop;

void setup()
{
  Serial.begin(115200);
  Serial.print("Start ");
  Serial.println(__FILE__);

  float x = analogRead(A0) / 1023.0;

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    a = foo1(x);
  }
  stop = micros();
  Serial.println(stop - start);
  Serial.println(a);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    a = foo2(x);
  }
  stop = micros();
  Serial.println(stop - start);
  Serial.println(a);
}

void loop()
{
}

then I'm wondering why you use "float" as a data type in your code.

Or Strings. Converting the float to a String so that print() can unwrap the string wrapped in the String is NOT a way to improve performance of anything.

obviously useless

What is obvious is that that is not your real code. It is NOT at all obvious that you are not doing the same stupid stuff in the real code.

Assign the return value to a volatile variable, and I'm sure you will see different results.

Tinrik:
Then I guess I have to do some operation with the result of fooIn(), which will then impact the execution time. Is there a way to do this kind of benchmarking without such an impact?

No. But you can create different for-loops "doing something relevant" and compare:
1.) a for-loop doing "hardly anything" but at the same time "something relevant", so the code gets executed
2.) a for-loop to be benchmarked, doing the same thing plus an extra function call
The time you want to measure is the time difference between those two then.

Tinrik:
Is it possible to disable all compiler optimizations within the Arduino IDE?

It depends on the Arduino-IDE version.
In some IDE versions it is impossible, perhaps.
In some IDE versions it requires complicated tweaking of the IDE and Arduino core.
In some other IDE versions it requires some simpler tweaking of the IDE and Arduino core.

Perhaps give that code a try for benchmarking:

#include "fooAux.h"
#define N       100

unsigned long t0, t1, t2;

int16_t fooIn(float x)
{
   if (x > 0.5f)
    return 2;
   else 
    return 1;   
}

void setup() {
  Serial.begin(115200);
}

void loop() 
{
  float x = -1.0f;
  volatile byte dummy=0;
  x= x+dummy; // this line prevents 'x' from being treated as a constant whose value is known at compile time
  //  first create a loop that does 'nearly nothing' as a baseline for comparison
  t0= micros();
  for(uint8_t i = 0; i < N; ++i)
  {
    dummy+=i;
    asm("");
  } 
  t0=micros()-t0;
  Serial.print("dummy= ");Serial.println(dummy);
  Serial.print("t0= ");Serial.println(t0);delay(100);

  dummy=0;
  // Function defined in current .cpp file
  t1 = micros(); 
  for(uint8_t i = 0; i < N; ++i)
  {
    dummy+=fooIn(x);
    asm("");
  } 
  t1= micros()-t1;
  Serial.print("dummy= ");Serial.println(dummy);
  Serial.print("t1= ");Serial.println(t1);delay(100);

  dummy=0;
  // Function defined in fooAux.cpp
  t2 = micros(); 
  for(uint8_t i = 0; i < N; ++i)
  {
    dummy+=fooAux(x);
    asm("");
  } 
  t2 = micros()-t2;
  Serial.print("dummy= ");Serial.println(dummy);
  Serial.print("t2= ");Serial.println(t2);delay(100);

  Serial.println();
  Serial.print("Extra time for using fooIn(x)= ");Serial.println(t1-t0);
  Serial.print("Extra time for using fooAux(x)= ");Serial.println(t2-t0);
  Serial.println("--------------------\r\n");
  delay(2000);
}
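
One thing to watch in the subtraction at the end: the times are unsigned, so if a measured loop happens to come out shorter than the baseline (micros() only has a resolution of 4 microseconds on 16 MHz boards), the difference wraps around to a huge number. A minimal sketch of a guard for that, which also averages over the loop count, could look like this. The helper name extraMicrosPerCall is hypothetical.

```cpp
#include <stdint.h>

// Hypothetical helper: average extra time per call in microseconds.
// 'baseline' is the time of the empty comparison loop, 'measured' the time
// of the loop containing the call, 'n' the number of iterations.
float extraMicrosPerCall(uint32_t baseline, uint32_t measured, uint16_t n)
{
  // Avoid division by zero and unsigned wrap-around when measured < baseline.
  if (n == 0 || measured <= baseline)
    return 0.0f;
  return (float)(measured - baseline) / (float)n;
}
```

In the sketch above it would be called as, for example, Serial.println(extraMicrosPerCall(t0, t2, N)); after the three loops have run.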

jurs:
It depends on the Arduino-IDE version.
In some IDE versions it is impossible, perhaps.
In some IDE versions it requires complicated tweaking of the IDE and Arduino core.
In some other IDE versions it requires some simpler tweaking of the IDE and Arduino core.

You can just do it in code.

Global optimizations disabled (per file)

This will work in IDE versions 1.5.7 and later.

#pragma GCC push_options
#pragma GCC optimize ("O0")

//Code here

#pragma GCC pop_options

On a per-function basis, this should work in all IDE versions:

void __attribute__((optimize("O0"))) func() {
    // code that does not get optimized.
}
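
If the goal is only to make sure a real call happens, without turning off optimization of the function body, GCC also has a noinline attribute. This is a sketch under that assumption, not something tested in this thread, and the name fooNoInline is made up for illustration.

```cpp
#include <stdint.h>

// GCC attribute: never inline this function at its call sites.
// The body itself is still compiled at the normal optimization level.
int16_t __attribute__((noinline)) fooNoInline(float x)
{
  return (x > 0.5f) ? 2 : 1;
}
```

Note that noinline alone does not stop the compiler from dropping a call whose result is unused, so the result should still be stored in a volatile variable as suggested earlier.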

Great tips, thank you all!! Now I can profile my code more reliably.