Calculation speed (floats vs ints)

So how much impact does this have on speed? Is there a simple way to test it or a simple rule of thumb?

I tried a simple sketch in which two variables are declared as ints and multiplied to form a third int, and then a pin is switched high/low so I can see the loop period on a scope. Then I changed the variables from int to float and got the same period. I guess the "loop time" in Arduino is the limiting factor in this case?

If the numbers are all constants, the compiler may be optimizing out the math.
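
To illustrate what the optimizer is allowed to do here, a minimal sketch in plain C++ (the names folded, measured, vx, vy and vz are made up for this example, not taken from anyone's sketch). With ordinary local constants the compiler may work out 3 * 4 at build time, whereas volatile operands have to be read and written on every pass:

```cpp
// With plain constants the compiler is free to compute 3 * 4 at build time,
// so the multiply may not exist in the generated code at all.
int folded() {
  int x = 3, y = 4;
  return x * y;
}

// volatile forces a real load of each operand and a real store of the
// result every time the statement runs.
volatile int vx = 3, vy = 4, vz;

int measured() {
  vz = vx * vy;
  return vz;
}
```

Both functions return 12; the difference is only in whether the multiplication is guaranteed to happen at run time.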

Floats should be significantly slower.

Also, tighten the loop, and don’t use digitalWrite() to twiddle the pin. digitalWrite() is slow; it’s something like 50 clock cycles.

Pick a pin to output the signal on, set it as output, and look at pinout charts. You’ll see it marked like PA1 or PC2 or whatever. To toggle the current state of that pin:

PINA=(1<<1);
or
PINC=(1<<2);

and so on. Much, much faster.

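i.e.
volatile float float1 = 0;   // volatile so the compiler can’t drop the math
volatile float float3 = 1.5; // arbitrary value, only there to be added
void setup() {
  pinMode(A2,OUTPUT); // A2 is PC2 on an ATmega328P board such as the Uno
  while (1){
    float1+=float3; //gotta change its value so it doesn’t get optimized out.
    PINC=(1<<2); // writing a 1 to a PINx bit toggles that pin
  }
}
void loop() {
  //code will never get here since there’s an infinite loop in setup()
}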

Something like that will probably work better - that was just off the top of my head.
Note that I think someone actually did more rigorous performance calculations here and posted them somewhere.

If you test performance, you should declare the ints and/or the floats volatile so the compiler won’t optimize them away. In the sketch below, multiplication is tested for 3 datatypes in a loop of 1000 iterations. Note that the loop overhead is the same for all 3 loops.

//
//    FILE: .ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: demo
//    DATE: 2016-01-13
//     URL: http://forum.arduino.cc/index.php?topic=371813.0
//
// Released to the public domain
//

uint32_t start;
uint32_t stop;

volatile int x = 3, y = 4, z;
volatile long k = 3, l = 4, m;
volatile float p = 3, q = 4, r;

void setup()
{
  Serial.begin(115200);
  Serial.print("Start ");
  Serial.println(__FILE__);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y * x;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k * l;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p * q;
  }
  stop = micros();
  Serial.println(stop - start);
}

void loop()
{
}

==>

Start sketch_jan13b.ino
1804
6384
10156

Floats should be significantly slower.

Note that floating point divides may be very close to, or even faster than, 32-bit integer ("long") divides, because the code essentially has to perform the same divide algorithm on only 24 bits instead of 32.

Updated sketch:

  • added byte in multiply
  • added divide for 4 datatypes.
  • added addition for 4 datatypes (subtraction is same)
//
//    FILE: mulCompare.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.01
// PURPOSE: demo
//    DATE: 2016-01-13
//     URL: http://forum.arduino.cc/index.php?topic=371813
//
// Released to the public domain
//

uint32_t start;
uint32_t stop;

volatile byte a = 3, b = 4, c;
volatile int x = 3, y = 4, z;
volatile long k = 3, l = 4, m;
volatile float p = 3, q = 4, r;

void setup()
{
  Serial.begin(115200);
  Serial.print("Start ");
  Serial.println(__FILE__);
  Serial.println("multiply compare, time per 1000 micros");

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    c = a * b;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y * x;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k * l;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p * q;
  }
  stop = micros();
  Serial.println(stop - start);


  Serial.println("divide compare, time per 1000 micros");

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    c = a / b;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y / x;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k / l;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p / q;
  }
  stop = micros();
  Serial.println(stop - start);

  Serial.println("add compare, time per 1000 micros");

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    c = a + b;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y + x;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k + l;
  }
  stop = micros();
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p + q;
  }
  stop = micros();
  Serial.println(stop - start);
}

void loop()
{
}

output (byte int long float)

multiply compare, time per 1000 micros
976
1804
6568
10160

divide compare, time per 1000 micros
5968
14628
40888
29408

add compare, time per 1000 micros
768
1256
2156
7896
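
For a rough feel of what these numbers mean per operation, they can be converted to clock cycles. This is only a sketch in plain C++ under the assumption that the board is a stock Uno running at 16 MHz (not stated for this run); the names cyclesPerIteration, printCycles and printAll are made up, and the figures still include the loop overhead:

```cpp
#include <cstdio>

// Assumption: 16 MHz CPU clock (stock Arduino Uno).
constexpr double kClockMHz = 16.0;

// Convert "microseconds per 1000 iterations" into clock cycles per iteration.
// Loop overhead and the volatile loads/stores are included in the result.
constexpr double cyclesPerIteration(unsigned long microsPer1000) {
  return microsPer1000 * kClockMHz / 1000.0;
}

// Print one block of the table (byte, int, long, float) in cycles.
void printCycles(const char* op, const unsigned long (&t)[4]) {
  static const char* const kTypes[4] = {"byte", "int", "long", "float"};
  for (int i = 0; i < 4; i++) {
    std::printf("%-8s %-5s %6.1f cycles\n", op, kTypes[i],
                cyclesPerIteration(t[i]));
  }
}

// Timings copied from the output above.
void printAll() {
  const unsigned long mulT[4] = {976, 1804, 6568, 10160};
  const unsigned long divT[4] = {5968, 14628, 40888, 29408};
  const unsigned long addT[4] = {768, 1256, 2156, 7896};
  printCycles("multiply", mulT);
  printCycles("divide", divT);
  printCycles("add", addT);
}
```

Under that assumption a byte multiply comes out at about 16 cycles per pass and a long divide at about 654.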

Just to check I ran this on a Teensy 3.2 at 72MHz. Here are the results:

multiply compare, time per 1000 micros
230
181
167
601

divide compare, time per 1000 micros
170
252
167
515

add compare, time per 1000 micros
143
169
154
1045

And the same on an Arduino Uno:

multiply compare, time per 1000 micros
968
1812
6632
10160

divide compare, time per 1000 micros
5960
14628
40888
29408

add compare, time per 1000 micros
768
1252
2156
7896

Damn, is that ARM M4 chip fast.

Yep, can blink an LED like it's nobody's business 8)

Wow, I had no idea that data types had such a huge impact on speed. Same with the type of operation performed. Dividing also had a huge cost compared to addition/subtraction.

How come float is even faster sometimes than the other types on the ARM?

float = 23-bit mantissa (24 bits with the implicit leading 1) == 3 bytes, plus an 8-bit exponent
long = 32-bit integer == 4 bytes

For division the exponents are subtracted, which is very fast for 8 bits and takes < 5% of the time.
So in effect you are comparing a 3-byte division with a 4-byte division.
From this reasoning the time for a long should be approx 4/3 x the time for a float.
Looking at the numbers for the UNO we see
float = ~30 uSec and long = ~40 uSec per division

Note that other numbers might give different results.
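
As a quick check of the 4/3 estimate against the UNO divide timings posted earlier (40888 for long, 29408 for float, per 1000 divides), a small sketch in plain C++; the function names are made up for the example:

```cpp
// UNO divide timings from the output above, microseconds per 1000 divides.
constexpr double kLongDivide = 40888.0;
constexpr double kFloatDivide = 29408.0;

// Ratio predicted by comparing a 4-byte with a 3-byte division.
constexpr double predictedRatio() { return 4.0 / 3.0; }

// Ratio actually measured on the UNO.
constexpr double measuredRatio() { return kLongDivide / kFloatDivide; }
```

The measured ratio comes out near 1.39 against a predicted 1.33, so the estimate is in the right ballpark for these operand values.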

Here is the result from Arduino DUE including double:
(please note that on DUE word: 16bit, int: 32bit, long: 32bit, float: 32bit, double: 64bit)

multiply compare, time per 1000 micros
byte: 345
word: 173
int: 298
long: 296
float: 888
double: 1195

divide compare, time per 1000 micros
byte: 137
word: 161
int: 359
long: 219
float: 874
double: 1072

add compare, time per 1000 micros
byte: 132
word: 270
int: 220
long: 296
float: 1356
double: 1702

In the end I am confused. Why does byte take more time than word?
Why are int and long not the same (on DUE both are 32 bit)?

Here is modified code from robtillaart:

//
//    FILE: mulCompare.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.01
// PURPOSE: demo
//    DATE: 2016-01-13
//     URL: http://forum.arduino.cc/index.php?topic=371813
//
// Released to the public domain
//

uint32_t start;
uint32_t stop;

volatile byte a = 3, b = 4, c;
volatile word aw = 3, bw = 4, cw;
volatile int x = 3, y = 4, z;
volatile long k = 3, l = 4, m;
volatile float p = 3, q = 4, r;
volatile double s = 3, t = 4, w;

void setup()
{
  Serial.begin(250000);
  Serial.print("Start ");
  Serial.println(__FILE__);
  Serial.println("multiply compare, time per 1000 micros");

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    c = a * b;
  }
  stop = micros();
  Serial.print("byte: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    cw = aw * bw;
  }
  stop = micros();
  Serial.print("word: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y * x;
  }
  stop = micros();
  Serial.print("int: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k * l;
  }
  stop = micros();
  Serial.print("long: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p * q;
  }
  stop = micros();
  Serial.print("float: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    w = s * t;
  }
  stop = micros();
  Serial.print("double: ");
  Serial.println(stop - start);


  Serial.println("divide compare, time per 1000 micros");

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    c = a / b;
  }
  stop = micros();
  Serial.print("byte: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    cw = aw / bw;
  }
  stop = micros();
  Serial.print("word: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y / x;
  }
  stop = micros();
  Serial.print("int: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k / l;
  }
  stop = micros();
  Serial.print("long: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p / q;
  }
  stop = micros();
  Serial.print("float: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    w = s / t;
  }
  stop = micros();
  Serial.print("double: ");
  Serial.println(stop - start);

  Serial.println("add compare, time per 1000 micros");

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    c = a + b;
  }
  stop = micros();
  Serial.print("byte: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    cw = aw + bw;
  }
  stop = micros();
  Serial.print("word: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    z = y + x;
  }
  stop = micros();
  Serial.print("int: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    m = k + l;
  }
  stop = micros();
  Serial.print("long: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    r = p + q;
  }
  stop = micros();
  Serial.print("float: ");
  Serial.println(stop - start);

  start = micros();
  for (int i = 0; i < 1000; i++)
  {
    w = s + t;
  }
  stop = micros();
  Serial.print("double: ");
  Serial.println(stop - start);
}

void loop()
{
}

The DUE's main microcontroller is a 32-bit part (SAM3X8E), so all accesses are optimized for 32-bit variables.

Many ARMs have some sort of code memory caching that can cause the same code to take different times just because of how it happens to be positioned in memory. It can be very difficult to figure out a cycle count, even looking at the assembly code :-(

I'd like to know why the double and float timings are so close - I'd expect doubles to be about half the speed of float...

Operations shorter than 32bit can require extra "extend" instructions, depending on operation and how good the compiler is - I've seen articles stressing that 32bit math can be faster than 8bit math on an ARM. But I don't think this explains why 16bit math would be faster than 8bit math...
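
One way to see where such extra instructions could come from, as a sketch in plain C++ rather than anything measured on the DUE: C++ promotes 8-bit operands to int before multiplying, so the result has to be narrowed back to 8 bits afterwards, while 32-bit operands need no such step. The function names mulByte and mulNative are made up for the example:

```cpp
#include <cstdint>
#include <type_traits>

// Integer promotion: the product of two uint8_t values has type int.
static_assert(std::is_same<decltype(uint8_t{1} * uint8_t{1}), int>::value,
              "8-bit operands are promoted to int before the multiply");

// The multiply happens at int width; the cast narrows the result back to
// 8 bits, which on a 32-bit CPU can cost an extra instruction.
uint8_t mulByte(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>(a * b);
}

// Native-width multiply: nothing to widen or narrow on a 32-bit CPU.
uint32_t mulNative(uint32_t a, uint32_t b) {
  return a * b;
}
```

For example, mulByte(200, 2) yields 144 because 400 is cut down to its low 8 bits, while mulNative(200, 2) yields 400.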

Here is the result from an ESP32:

multiply compare, time per 1000 micros
byte: 45
word: 57
int: 59
long: 57
float: 62
double: 543

divide compare, time per 1000 micros
byte: 61
word: 72
int: 64
long: 69
float: 881
double: 2203

add compare, time per 1000 micros
byte: 57
word: 65
int: 57
long: 57
float: 58
double: 317

===

Here is the result from an nRF52 (Adafruit Feather Bluefruit):

multiply compare, time per 1000 micros
byte: 0
word: 0
int: 976
long: 977
float: 0
double: 976

divide compare, time per 1000 micros
byte: 0
word: 977
int: 976
long: 0
float: 0
double: 977

add compare, time per 1000 micros
byte: 0
word: 977
int: 977
long: 0
float: 0
double: 977

Any suggestions as to why we see no time at all for float operations on the nRF52?
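
One possible explanation, offered as a guess rather than something confirmed for this board: every reading is either 0 or 976/977, which is what you would see if micros() only advanced in steps of 1/1024 s (about 976.56 us). The following plain C++ sketch models such a coarse clock; the 1024 Hz tick rate, the 60 us loop duration and the names coarseMicros and measured are all assumptions made up for illustration:

```cpp
#include <cstdint>

// Hypothetical model: micros() is derived from a tick counter at 1024 Hz.
constexpr uint64_t kTickHz = 1024;

// What such a micros() would report at a given true time in microseconds.
uint32_t coarseMicros(uint64_t trueMicros) {
  uint64_t ticks = trueMicros * kTickHz / 1000000;  // whole ticks elapsed
  return static_cast<uint32_t>(ticks * 1000000 / kTickHz);
}

// What "stop - start" would print for a loop of the given true duration.
uint32_t measured(uint64_t startTrueMicros, uint64_t durationMicros) {
  return coarseMicros(startTrueMicros + durationMicros) -
         coarseMicros(startTrueMicros);
}
```

With this model a 60 us loop reads as 0 when it fits inside one tick (measured(0, 60)) and as 976 or 977 when it happens to straddle a tick boundary (measured(950, 60) and measured(1930, 60)). If that is what is happening, running many more iterations per measurement would make the loop long compared with the tick and give usable numbers.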