Topic: Speed of floating point operations

zatalian

Hi,

I'm new to this forum, but I have been working with Arduinos for quite a while.
I have a question about the execution time of floating point operations. I found this old thread: http://arduino.cc/forum/index.php/topic,40901.0.html mentioning execution speeds, and was wondering where those numbers were coming from...
I get different (very strange and confusing) results...

First I used the function micros() to time my operations, but I read that the resolution of micros() is 4 µs.
For my second attempt, I'm using Timer1 with prescaler 1 on a Duemilanove, so I should have 16 ticks/µs.

This is the code I'm trying to time:

Code: [Select]

void setup()
{
  Serial.begin(9600);
  TCCR1B &= 0xF8;          // clear the Timer1 clock-select bits (CS12..CS10)
  TCCR1B |= (1 << CS10);   // prescaler 1: Timer1 counts at the CPU clock
}

void loop()
{
  float fnumber;
  float fresult = 0.0;
 
 
  uint16_t time;
  TCNT1 = 0;

  fnumber = 50.0;
  fresult = sqrt(fnumber);
  //fresult = sin(fnumber);
  //delayMicroseconds(10);

  time = TCNT1;
   
  Serial.print("delay: ");
  Serial.println(time, DEC);
 
  Serial.print("sqrt(");
  Serial.print(fnumber, DEC);
  Serial.print(") = ");
  Serial.println(fresult, DEC); 
 
  delay(1000);
}


Executing sqrt() or sin() gives me a delta of 1, meaning 1/16th of a µs. This can't be right, but I can't figure out what I'm doing wrong...
Inserting a delayMicroseconds(10) call gives me (roughly) the correct delta of 156 (160 expected), so my timer seems to work correctly.

Using the function micros() instead of Timer1 gives me a delta of 4 µs...

Is an Arduino really that fast at executing floating point calculations?
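
For reference, the tick arithmetic used above can be written out. This is only a minimal sketch, assuming a 16 MHz clock and prescaler 1 as described for the Duemilanove; ticksToMicros, cpuHz and prescaler are names made up for this illustration and do not appear in the thread.

```cpp
#include <stdint.h>

// Convert a timer tick count to microseconds.
// cpuHz and prescaler are hypothetical parameters of this sketch.
static double ticksToMicros(uint32_t ticks, uint32_t cpuHz, uint32_t prescaler)
{
  // One tick lasts prescaler / cpuHz seconds, i.e. prescaler * 1e6 / cpuHz microseconds.
  return (double)ticks * (double)prescaler * 1e6 / (double)cpuHz;
}
```

Under these assumptions a delta of 1 tick is 0.0625 µs and 156 ticks is 9.75 µs. A 16-bit counter at prescaler 1 wraps after 65536 ticks (4096 µs at 16 MHz), so longer intervals need a larger prescaler or overflow counting.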



AWOL

Your parameter is effectively a constant - maybe the compiler optimised the call to sqrt away.


ckiick

A much better method of benchmarking operations like this is to run the calculation in a loop for, say, 10,000 iterations.  You take a timestamp before, and a timestamp after the loop (using micros or even millis if the number of iterations is high enough) and some simple math gets you the average time per operation.  It's much more reliable, evens out hiccups due to things like interrupts, can time operations that take less time than the timer resolution, and would be portable to boards with different clock rates.  You also avoid most problems with the compiler optimizing code in ways you don't expect.  And you don't have to do all that messing around with timers, either.

I don't know why you are getting those results, but try it this way and see if you get more reasonable results.
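
The loop approach described above could look roughly like the following. It is a sketch, not a definitive benchmark: nowMicros, benchInput, benchOutput and averageSqrtMicros are hypothetical names, the volatile globals are an added assumption to keep the compiler from folding the call, and the non-Arduino branch only exists so the code can also be compiled on a PC.

```cpp
#include <stdint.h>
#include <math.h>

#if defined(ARDUINO)
#include <Arduino.h>
static uint32_t nowMicros() { return micros(); }
#else
#include <chrono>
// Host stand-in for micros(), so the sketch also builds off-target.
static uint32_t nowMicros()
{
  using namespace std::chrono;
  return (uint32_t)duration_cast<microseconds>(
      steady_clock::now().time_since_epoch()).count();
}
#endif

// Volatile operand and result: every iteration must load and store for real.
volatile float benchInput  = 50.0f;
volatile float benchOutput = 0.0f;

// Average time per sqrt() call in microseconds over the given iteration count.
static float averageSqrtMicros(uint32_t iterations)
{
  uint32_t start = nowMicros();
  for (uint32_t i = 0; i < iterations; ++i) {
    benchOutput = sqrt(benchInput);
  }
  uint32_t elapsed = nowMicros() - start;  // unsigned subtraction tolerates one wrap
  return (float)elapsed / (float)iterations;
}
```

Calling averageSqrtMicros(10000) from loop() and printing the result gives the average. The figure includes loop overhead and the volatile load and store, so timing an otherwise identical loop without the sqrt() call and subtracting it would give a cleaner number.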

PeterH

Since you're calling a known function whose result only depends on its argument, and the argument is a compile-time constant, it's conceivable that the floating point calculation has been optimised out by the compiler. If this was happening then you might get a different execution time if you included a value which was not a compile-time constant.

Edit add: too slow!

zatalian

OK... replacing 50.0 with analogRead(0) gives me saner results for sin() and sqrt() (200-300 clock pulses).

But what about the next example:

Code: [Select]

void setup()
{
  Serial.begin(9600);
  TCCR1B &= 0xF8;          // clear the Timer1 clock-select bits (CS12..CS10)
  TCCR1B |= (1 << CS10);   // prescaler 1: Timer1 counts at the CPU clock
}

void loop()
{
  float fnumber1, fnumber2;
  float fresult = 0.0;
 
  fnumber1 = (float) analogRead(0);
  fnumber2 = (float) analogRead(1);
 
  uint16_t time;
  TCNT1 = 0;

  fresult = fnumber1 / fnumber2;
 
  time = TCNT1;
   
  Serial.print("delay: ");
  Serial.println(time, DEC);
 
 
  Serial.print(fnumber1, DEC);
  Serial.print(" / ");
  Serial.print(fnumber2, DEC);
  Serial.print(" = ");
  Serial.println(fresult, DEC); 
 
  delay(1000);
}


I'm dividing two floating point numbers (no compile-time constants). Again, I get a delta of 1 clock pulse... for a floating point division.


DuaneB

Optimised out again -

Code: [Select]

  uint16_t time;
  TCNT1 = 0;
     10a: e4 e8        ldi r30, 0x84 ; 132
     10c: f0 e0        ldi r31, 0x00 ; 0
     10e: 11 82        std Z+1, r1 ; 0x01
     110: 10 82        st Z, r1

  fresult = fnumber1 / fnumber2;
 
  time = TCNT1;
     112: e0 80        ld r14, Z
     114: f1 80        ldd r15, Z+1 ; 0x01
   
  Serial.print("delay: ");



Can't understand why; you're printing it out later in the code, so it shouldn't be getting optimized out, but this suggests it is.

Disassembled like so -

http://rcarduino.blogspot.com/2012/09/how-to-view-arduino-assembly.html

Duane B

rcarduino.blogspot.com

DuaneB

I moved fresult to global scope and made it volatile to force the compiler to leave it alone. Now we get the following, and your test will work -

Code: [Select]
  uint16_t time;
  TCNT1 = 0;
     10a: 04 e8        ldi r16, 0x84 ; 132
     10c: 10 e0        ldi r17, 0x00 ; 0
     10e: f8 01        movw r30, r16
     110: 11 82        std Z+1, r1 ; 0x01
     112: 10 82        st Z, r1

  fresult = fnumber1 / fnumber2;
     114: c6 01        movw r24, r12
     116: b5 01        movw r22, r10
     118: a4 01        movw r20, r8
     11a: 93 01        movw r18, r6
     11c: 0e 94 58 06 call 0xcb0 ; 0xcb0 <__divsf3>
     120: 60 93 24 01 sts 0x0124, r22
     124: 70 93 25 01 sts 0x0125, r23
     128: 80 93 26 01 sts 0x0126, r24
     12c: 90 93 27 01 sts 0x0127, r25
 
  time = TCNT1;
     130: f8 01        movw r30, r16
     132: e0 80        ld r14, Z
     134: f1 80        ldd r15, Z+1 ; 0x01


Again, here's how I get the disassembly -

http://rcarduino.blogspot.com/2012/09/how-to-view-arduino-assembly.html

Duane B


robtillaart

Nov 08, 2012, 07:39 pm
just use

volatile float fresult = sqrt(fnumber);

and check your timing....

volatile says to the compiler you may not optimize this statement.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

PeterH


Quote
Can't understand why; you're printing it out later in the code, so it shouldn't be getting optimized out, but this suggests it is.


I don't think the calculation could have been eliminated completely - all I can think is that the compiler has reordered the code so that the calculation no longer occurs between the timing statements.

zatalian


Quote
Disassembled like so -

http://rcarduino.blogspot.com/2012/09/how-to-view-arduino-assembly.html


Thanks a lot for that link!! That's exactly what I was missing...



Quote
Can't understand why; you're printing it out later in the code, so it shouldn't be getting optimized out, but this suggests it is.

Quote
I don't think the calculation could have been eliminated completely - all I can think is that the compiler has reordered the code so that the calculation no longer occurs between the timing statements.


Using the volatile keyword when declaring the variable solved this problem too. I'm now getting 20-40 clock pulses per floating point division (calculating the average of 10000 divisions as suggested).


Maybe one last question... Is there a way to remove the -Os compile option and get rid of all the optimizations? Just for the sake of this kind of exercise?



robtillaart

(never tried)
Move the compiler to another folder and put a proxy.exe in its place that passes on only the parameters you like.
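
A less invasive alternative to swapping out the compiler might be GCC's per-function optimize attribute. The sketch below assumes GCC 4.4 or newer (older avr-gcc releases do not accept the attribute); NO_OPT and divideUnoptimized are hypothetical names for this illustration. The GCC documentation describes the attribute as a debugging aid, so treat it as an experiment rather than a recommended setup.

```cpp
// NO_OPT is a hypothetical helper macro for this sketch.
// Assumption: GCC 4.4 or newer; other compilers get an empty macro.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_OPT __attribute__((optimize("O0")))
#else
#define NO_OPT
#endif

// Compiled without optimization even when the rest of the sketch uses -Os.
NO_OPT static float divideUnoptimized(float a, float b)
{
  return a / b;
}
```

Only the body of the marked function is affected. It does not by itself pin the call between the two timer reads, so the volatile approach from earlier in the thread is still worth keeping.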

DuaneB

Quote
20-40 clockpulses


I am very surprised that it's that fast. Just to be sure, do you mean milliseconds or clock pulses?

Duane B

michael_x

Quote
volatile float fresult = sqrt(fnumber);

and check your timing....


That only stops the compiler from optimizing away the copy of one float variable to the next (which I thought was slightly faster than 20 clock pulses).

What about
volatile float fnumber= 50.0;
float fresult = sqrt(fnumber);


This might be considerably slower.
Either try it or look at the assembly code again...

MichaelMeissner


Quote
just use

volatile float fresult = sqrt(fnumber);

and check your timing....

volatile says to the compiler you may not optimize this statement.

All volatile says is that the compiler may not alter the loads from or stores to the variable. In particular, the compiler is still allowed to optimize the sqrt computation: it can compute the value once and keep it in a temporary, or evaluate it at compile time, and then just store the saved value inside the loop.

The code sequences that replaced the original code have implicit conversions from integer to floating point in addition to the sqrt operation. Also, if you ever move the code to a different processor, such as a Due, which uses an ARM chip, storing the result of sqrt into a float variable will cause an implicit double-to-single conversion.
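
One way to act on this is to make both the operands and the result volatile, so the loads, the division and the store all have to sit between the two timer accesses. The following is a sketch under assumptions, not the code anyone in the thread actually ran: gNumerator, gDenominator, gQuotient, startTimer1 and timeOneDivision are hypothetical names, and the non-AVR branch is only a stand-in so the code compiles on a PC (where the fake counter never ticks). It also clears TCCR1A, on the assumption that the Arduino core leaves Timer1 in a PWM mode in which TCNT1 does not count straight up to 65535.

```cpp
#include <stdint.h>

#if defined(__AVR__)
#include <avr/io.h>   // real Timer1 registers on the target
#else
// Host stand-ins so the sketch also compiles off-target; they never tick.
static volatile uint8_t  TCCR1A = 0;
static volatile uint8_t  TCCR1B = 0;
static volatile uint16_t TCNT1  = 0;
#define CS10 0
#endif

// Hypothetical volatile globals for the operands and the result.
volatile float gNumerator   = 0.0f;
volatile float gDenominator = 1.0f;
volatile float gQuotient    = 0.0f;

// Put Timer1 in normal counting mode at the CPU clock (prescaler 1).
static void startTimer1()
{
  TCCR1A = 0;             // assumption: PWM on Timer1 is not needed
  TCCR1B = (1 << CS10);
}

// Ticks spent on one division, including the volatile loads and the store.
static uint16_t timeOneDivision()
{
  TCNT1 = 0;
  gQuotient = gNumerator / gDenominator;  // volatile accesses keep their order
  return TCNT1;
}
```

In a sketch, setup() would call startTimer1(), and loop() would fill gNumerator and gDenominator (for example from analogRead()) before calling timeOneDivision(). The result includes a few cycles for the loads and the store, which could be estimated by timing a plain volatile copy the same way.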
