Speed of floating point operations

Hi,

I'm new to this forum, but I have been working with Arduinos for quite a while.
I have a question about the execution time of floating point operations. I found this old thread: http://arduino.cc/forum/index.php/topic,40901.0.html mentioning the speed of execution and was wondering where those numbers were coming from...
I get different (very strange and confusing) results...

First I used the function micros() to time my operations, but I read that the resolution of micros() is 4 µs?
For my second attempt, I'm using Timer1, with prescaler 1, on a Duemilanove, so I should have 16 ticks/µs.

This is the code I'm trying to time:

void setup()
{
  Serial.begin(9600);
  TCCR1B &= 0xF8;
  TCCR1B |= (1 << CS10);
}

void loop()
{
  float fnumber;
  float fresult = 0.0;
  
  
  uint16_t time;
  TCNT1 = 0;

  fnumber = 50.0;
  fresult = sqrt(fnumber);
  //fresult = sin(fnumber);
  //delayMicroseconds(10);

  time = TCNT1;
   
  Serial.print("delay: ");
  Serial.println(time, DEC);
  
  Serial.print("sqrt(");
  Serial.print(fnumber, DEC);
  Serial.print(") = ");
  Serial.println(fresult, DEC);  
  
  delay(1000);
}

Executing a sqrt() or sin() gives me a delta of 1, meaning 1/16th of a µs. This can't be right, but I can't figure out what I'm doing wrong...
Inserting a delayMicroseconds(10) call gives me (roughly) the correct delta of 156 (160 expected), so my timer seems to work correctly.

Using the function micros() instead of Timer1 gives me a delta of 4 µs...

Is an Arduino really that fast at executing floating point calculations?
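
For reference, the tick arithmetic behind the "160 expected" figure can be sketched as below. This is only an illustration in plain C++ (so it can be checked on a desktop compiler); the names kCpuHz, kTicksPerMicro, ticksToMicros and microsToTicks are made up for this example, and it assumes a 16 MHz clock, prescaler 1, and a Timer1 that counts linearly from 0 to 65535. The Arduino core may configure Timer1 in a different mode at startup, in which case raw TCNT1 readings would not be a simple tick count.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 16 MHz CPU clock (Duemilanove) and Timer1 prescaler 1,
// so one timer tick equals one CPU clock cycle.
constexpr uint32_t kCpuHz = 16000000UL;                  // hypothetical name
constexpr uint32_t kTicksPerMicro = kCpuHz / 1000000UL;  // 16 ticks per microsecond

// Convert a 16-bit tick count to (fractional) microseconds.
inline double ticksToMicros(uint16_t ticks) {
    return static_cast<double>(ticks) / kTicksPerMicro;
}

// Expected tick count for a delay given in whole microseconds.
// A 16-bit counter wraps after 65536 ticks (4096 us at 16 ticks/us),
// so only delays up to 4095 us fit in a single reading.
inline uint16_t microsToTicks(uint32_t us) {
    assert(us <= 4095);
    return static_cast<uint16_t>(us * kTicksPerMicro);
}
```

With these assumptions, microsToTicks(10) gives 160 ticks, and the measured 156 ticks correspond to ticksToMicros(156) = 9.75 µs.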

Your parameter is effectively a constant - maybe the compiler optimised the call to sqrt away.

What he said.

A much better method of benchmarking operations like this is to run the calculation in a loop for, say, 10,000 iterations. You take a timestamp before, and a timestamp after the loop (using micros or even millis if the number of iterations is high enough) and some simple math gets you the average time per operation. It's much more reliable, evens out hiccups due to things like interrupts, can time operations that take less time than the timer resolution, and would be portable to boards with different clock rates. You also avoid most problems with the compiler optimizing code in ways you don't expect. And you don't have to do all that messing around with timers, either.
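
A minimal sketch of that loop-based approach, written as plain C++ so it can be tried on a desktop compiler. The function name averageSqrtNanos is made up for this example. On an Arduino you would take the two timestamps with micros() instead of std::chrono. The volatile qualifiers on the input and the result are an extra precaution added in this sketch so the compiler can neither precompute sqrt(50) nor drop the store; they are not part of the advice above.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>

// Returns the average time per sqrt call in nanoseconds, measured over
// `iterations` runs. Loop overhead is included in the average.
double averageSqrtNanos(uint32_t iterations) {
    assert(iterations > 0);

    volatile float input = 50.0f;  // volatile read: value is not a compile-time constant
    volatile float sink = 0.0f;    // volatile write: the store cannot be removed

    auto start = std::chrono::steady_clock::now();  // on Arduino: micros()
    for (uint32_t i = 0; i < iterations; ++i) {
        sink = std::sqrt(input);
    }
    auto stop = std::chrono::steady_clock::now();   // on Arduino: micros()
    (void)sink;  // silence "set but not used" warnings

    double totalNanos =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return totalNanos / iterations;
}
```

Calling averageSqrtNanos(10000) returns one averaged figure. To estimate the loop overhead, the same loop can be timed with a plain copy (sink = input) and the two results subtracted.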

I don't know why you are getting those results, but try it this way and see if you get more reasonable results.

Since you're calling a known function whose result only depends on its argument, and the argument is a compile-time constant, it's conceivable that the floating point calculation has been optimised out by the compiler. If this was happening then you might get a different execution time if you included a value which was not a compile-time constant.

Edit add: too slow!

OK... replacing 50.0 with analogRead(0) gives me saner results for sin() and sqrt() (200-300 clock pulses).

But what about the next example:

void setup()
{
  Serial.begin(9600);
  TCCR1B &= 0xF8;
  TCCR1B |= (1 << CS10);
}

void loop()
{
  float fnumber1, fnumber2;
  float fresult = 0.0;
  
  fnumber1 = (float) analogRead(0);
  fnumber2 = (float) analogRead(1);
  
  uint16_t time;
  TCNT1 = 0;

  fresult = fnumber1 / fnumber2;
  
  time = TCNT1;
   
  Serial.print("delay: ");
  Serial.println(time, DEC);
  
  
  Serial.print(fnumber1, DEC);
  Serial.print(" / ");
  Serial.print(fnumber2, DEC);
  Serial.print(" = ");
  Serial.println(fresult, DEC);  
  
  delay(1000);
}

I'm dividing 2 floating point numbers (no compile-time constants). Again, I get a delta of 1 clock pulse... for a floating point division.

Optimised out again -

  uint16_t time;
  TCNT1 = 0;
     10a:	e4 e8       	ldi	r30, 0x84	; 132
     10c:	f0 e0       	ldi	r31, 0x00	; 0
     10e:	11 82       	std	Z+1, r1	; 0x01
     110:	10 82       	st	Z, r1

  fresult = fnumber1 / fnumber2;
  
  time = TCNT1;
     112:	e0 80       	ld	r14, Z
     114:	f1 80       	ldd	r15, Z+1	; 0x01
   
  Serial.print("delay: ");

Can't understand why. You're printing it out later in the code, so it shouldn't be getting optimized out, but this suggests it is.

disassembled like so -

Duane B

rcarduino.blogspot.com

I moved fresult to global scope and made it volatile to force the compiler to leave it alone. Now we get the following, and your test will work -

  uint16_t time;
  TCNT1 = 0;
     10a:	04 e8       	ldi	r16, 0x84	; 132
     10c:	10 e0       	ldi	r17, 0x00	; 0
     10e:	f8 01       	movw	r30, r16
     110:	11 82       	std	Z+1, r1	; 0x01
     112:	10 82       	st	Z, r1

  fresult = fnumber1 / fnumber2;
     114:	c6 01       	movw	r24, r12
     116:	b5 01       	movw	r22, r10
     118:	a4 01       	movw	r20, r8
     11a:	93 01       	movw	r18, r6
     11c:	0e 94 58 06 	call	0xcb0	; 0xcb0 <__divsf3>
     120:	60 93 24 01 	sts	0x0124, r22
     124:	70 93 25 01 	sts	0x0125, r23
     128:	80 93 26 01 	sts	0x0126, r24
     12c:	90 93 27 01 	sts	0x0127, r25
  
  time = TCNT1;
     130:	f8 01       	movw	r30, r16
     132:	e0 80       	ld	r14, Z
     134:	f1 80       	ldd	r15, Z+1	; 0x01

Again, here's how I get the disassembly -

Duane B

rcarduino.blogspot.com

just use

volatile float fresult = sqrt(fnumber);

and check your timing....

volatile says to the compiler you may not optimize this statement.

DuaneB:
Can't understand why. You're printing it out later in the code, so it shouldn't be getting optimized out, but this suggests it is.

I don't think the calculation could have been eliminated completely - all I can think is that the compiler has reordered the code so that the calculation no longer occurs between the timing statements.

DuaneB:
disassembled like so -

RCArduino: How To View Arduino and Arduino Due Assembly

Thanks a lot for that link!! That's exactly what I was missing...

PeterH:

DuaneB:
Can't understand why. You're printing it out later in the code, so it shouldn't be getting optimized out, but this suggests it is.

I don't think the calculation could have been eliminated completely - all I can think is that the compiler has reordered the code so that the calculation no longer occurs between the timing statements.

Using the volatile keyword when declaring the variable solved this problem too. I'm getting 20-40 clock pulses per floating point division now (calculating the average of 10000 divisions as suggested).

Maybe one last question... Is there a way to remove the -Os compile option and get rid of all the optimizations? Just for the sake of this kind of exercise?

(Never tried)
Move the compiler to another folder and put a proxy.exe in its place that just passes along the parameters you like.

20-40 clockpulses

I am very surprised that it's that fast. Just to be sure, do you mean milliseconds or clock pulses?

Duane B

volatile float fresult = sqrt(fnumber);

and check your timing....

only stops the compiler from optimizing away the copy of one float variable to the next (which I thought was slightly faster than 20 clock pulses).

What about

volatile float fnumber= 50.0;
float fresult = sqrt(fnumber);

might be considerably slower.
Either try it or look at assembly code again ...

robtillaart:
just use

volatile float fresult = sqrt(fnumber);

and check your timing....

volatile says to the compiler you may not optimize this statement.

All volatile says is it may not alter the stores or loads to the variable. In particular, the compiler is allowed to optimize the sqrt value and save it away in a temporary value or compute it in the compiler, and then store the saved value in the loop.
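
To make that distinction concrete, here is a small sketch in plain C++ (the function names sqrtVolatileResult and sqrtVolatileInput are invented for this example). Both functions return the same value; what may differ is the generated code, which is worth checking in the disassembly rather than assuming.

```cpp
#include <cassert>
#include <cmath>

// Variant 1: only the result is volatile.
// The store to `result` must happen, but the argument is still a
// compile-time constant, so the compiler is allowed to evaluate
// sqrt(50.0f) itself and simply store the precomputed value.
float sqrtVolatileResult() {
    volatile float result = std::sqrt(50.0f);
    return result;
}

// Variant 2: the input is volatile.
// The load from `number` must really happen at run time, so the
// compiler cannot know the value and has to perform the sqrt
// at run time.
float sqrtVolatileInput() {
    volatile float number = 50.0f;
    float result = std::sqrt(number);
    return result;
}
```

For timing purposes, Variant 2 (or a volatile input combined with a volatile result) is the one that keeps the calculation itself in the measured code.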

The code sequences that replace the original code have implicit conversions from integer to floating point as well as the sqrt operation. In addition, if you ever move the code to a different processor, like say a Due that uses the Arm chip, storing the result of sqrt into a float variable will cause an implicit double to single conversion.

Which all goes to show what a nonsense measuring the performance and optimization of contrived sequences of code is.

If you have a real application - then it gets interesting.

Duane B.

All volatile says is it may not alter the stores or loads to the variable.

It tells the compiler not to assume anything about the value in a variable, even if it appears that the variable has not been written to.

The issue in this code is that the resulting variable is not used between the timing statements, so the division was optimized away. You can force a use of that division by assigning it to a (volatile) variable.

zatalian:
Is there a way to remove the -Os compile option and get rid of all the optimizations? Just for the sake of this kind of exercise?

You are proposing to disable all optimisations in order to make it possible to measure the performance? Doesn't that render the performance measurements meaningless?

Which all goes to show what a nonsense measuring the performance and optimization of contrived sequences of code is.

If you have a real application - then it gets interesting.

Agree with you, unless your goal is to learn about optimizations and how they are done (there is always that other option ;-)

DuaneB:

20-40 clockpulses

I am very surprised that it's that fast. Just to be sure, do you mean milliseconds or clock pulses?

Duane B

Clock pulses... This is the code I used:

void setup()
{
  Serial.begin(9600);
  TCCR1B &= 0xF8;
  TCCR1B |= (1 << CS10);
}

void loop()
{
  float fnumber1, fnumber2;
  volatile float fresult = 0.0;
  
  uint16_t time;
  unsigned long total = 0;  // must start at zero before accumulating
  
  for (int i = 0; i < 10000; i++)
  {
    fnumber1 = random() / 1000.0;
    fnumber2 = random() / 1000.0;
  
    TCNT1 = 0;
    fresult = fnumber1 / fnumber2;
    time = TCNT1;
  
    total += time;
  } 
  total /= 10000;
  
  Serial.print("delta: ");
  Serial.println(total, DEC);  // print the average, not the last sample
  
  
  Serial.print(fnumber1, DEC);
  Serial.print(" / ");
  Serial.print(fnumber2, DEC);
  Serial.print(" = ");
  Serial.println(fresult, DEC);  
  

}

robtillaart:

Which all goes to show what a nonsense measuring the performance and optimization of contrived sequences of code is.

If you have a real application - then it gets interesting.

Agree with you, unless your goal is to learn about optimizations and how they are done (there is always that other option ;-)

Well, disabling optimizations can be useful to compare floating point operations with integer operations. The whole point of this exercise was to know, before I have a complete project, whether the Arduino will be fast enough and whether I will be able to use floating point or will have to do all the calculations with integers.

But in the end... these measurements will indeed be estimates and real measurements can only be taken in real programs. I totally agree with that statement.
Thanks to everybody for this very informative discussion.