Maths performance

So, I'd always had the general idea that floating point and division were slow. But I was curious to see some real numbers. So, I played around a bit.

Because division is theoretically done by a series of repeated subtractions, I expected a large difference in performance when the denominator is small or large. Here is what I found:

Number of microseconds required to perform 255 operations on:

Byte addition:                       152
Byte multiplication:                 196
Word addition:                       224
Word multiplication:                 252
Long addition:                       388
Long multiplication:                 444
Float multiplication:              2,572
Float addition:                    2,756
Byte division, large denominator:  3,420
Word division, large denominator:  3,468
Byte division, small denominator:  3,524
Word division, small denominator:  3,700
Float division, small denominator: 7,520
Float division, large denominator: 7,532
Long division, large denominator:  9,852
Long division, small denominator: 11,320

('word' == unsigned int)

So, a few surprises:

Compared to 8-bit math, 16-bit math is not twice as slow -- it's only about 1.5x slower.

Floating point division is faster than long division. If you have a lot of 'long' divisions inside a loop, you may see a performance gain by copying the variable values to floats before entering the loop.

Floating multiplication is faster than floating addition (though not by much).

Word division in my first test was faster than byte division -- I had cast my numerator as a (byte). When I instead cast it as a (word), byte division became faster than word division (those are the results shown above). Apparently the compiler uses 'word' internally for division, so casting to (byte) actually slows things down; the fact that byte and word division performance is very similar supports this theory. So if you're doing division, there's not much benefit to using an 8-bit integer over a 16-bit integer. Alternatively, implementing your own 8-bit division routine in assembler may be able to outperform GCC.

The performance difference between small and large denominators is much smaller than I expected (though still measurable).

Denominator size has less of an impact on floats than it does on other data types (in fact, the difference of large vs small denominator in floats is probably statistically insignificant, due to insufficient measurement timing precision).

Executive summary: use the smallest data type possible, and remember that long division is slower than floating-point division.


tylernt:
Floating point division is faster than long division. If you have a lot of 'long' divisions inside a loop, you may see a performance gain by copying the variable values to floats before entering the loop.

Floats are 4 bytes on the Arduino, same as a long, but floats only offer 6-7 digits of precision, while a long or unsigned long has 10 digits. If you are working with numbers that truly require 4 bytes, putting them into floats will potentially compromise their accuracy. Also, you can't meaningfully do equality checks on floats (e.g. if (x == y)), so if your code requires that, sticking the numbers into floats will break it.

Number of microseconds required to perform 255 operations on:

Byte addition:                       152

One byte addition is a one clock cycle machine instruction so the correct number is (1/16000000) * 255 * 1 * 1000000 = 15.9375 microseconds. If the two load instructions are included the number is still significantly smaller at (1/16000000) * 255 * (1+2+2) * 1000000 = 79.6875 microseconds. If we add in the subsequent store the value is (1/16000000) * 255 * (1+2+2+2) * 1000000 = 111.5625 microseconds.

In other words, it is impossible for an AVR processor running at 16 MHz to take as long as 152 microseconds to perform a one byte addition 255 times. Which makes your other numbers suspect.

1 cycle if loading/storing to register. I used volatile variables to keep the compiler from optimizing away my loop, so there are some loads/stores in there too...

Uh huh. lds to load the left side from SRAM (2 cycles). lds to load the right side from SRAM (2 cycles). add to add the two numbers (1 cycle). sts to store the result to SRAM (2 cycles). 2 + 2 + 1 +2 = 7 cycles. I'll let you do the rest of the math so you can verify my number.

Sorry, I didn't intend for these numbers to be absolute measures of performance. There are many combinations of constant, register, and RAM for both sources and for destinations, so absolute compute times will vary based on what you're actually doing (and what the compiler decides to optimize).

My intention was only to provide a relative comparison. Still, for completeness' sake, here is the code:

  {
  Serial.print("Byte addition: ");
  volatile byte x = 0;
  volatile byte addend = 1;
  unsigned long timer1 = micros();
  for(byte i = 0; i < 255; i++)
    {
    x = (byte)255 + (byte)addend;
    }
  unsigned long timer2 = micros();
  Serial.println(timer2 - timer1);
  }

...with the other data types and operators substituted as appropriate. (The increment / compare / jump involved with the 'for' loop will also affect the results, but that is the same penalty for all data types and operators.)

The performance difference between small and large denominators is much smaller than I expected

division isn't done by repeated subtraction; it's done by essentially "long division", so it's more-or-less proportional to the length of the operands in bits, rather than the actual value (subtle difference!) There might also be some optimizations that mean that the number of one bits affects the timing.

This also explains why floating point multiplication/division is faster - it only has 24bits (of mantissa), rather than 32bits.

westfw:
division isn't done by repeated subtraction; it's done by essentially "long division", so it's more-or-less proportional to the length of the operands in bits, rather than the actual value (subtle difference!) There might also be some optimizations that mean that the number of one bits affects the timing.

This also explains why floating point multiplication/division is faster - it only has 24bits (of mantissa), rather than 32bits.

Ah, that explains it! :)

Sorry, I didn't intend for these numbers to be absolute measures of performance.

Your code says something different than your words.

  Serial.print("Byte addition: ");
...
  Serial.println(timer2 - timer1);
...

Interleaving calls to Serial.print{ln} dramatically affects the results. You are essentially benchmarking your test code and the Serial interrupt service routine.

Are you sure? As I see it, the logic is: Serial -> mark time -> test -> mark time -> Serial. It doesn't seem like the Serial writes would affect the time-keeping at all.

EDIT: Oh wait... I think I see. Serial.write() just dumps to a buffer, doesn't it? And actually writing out is based on an interrupt, which could occur during the test loop.

Coding Badly:
Sorry, I didn't intend for these numbers to be absolute measures of performance.

Your code says something different than your words.

OK, I mean these numbers aren't meant to cover every possible permutation of constant, register, and RAM. This is just one use case (out of many) for comparing the relative performance of the various data types and operations, to help people decide which data types and operators to use when performance is an issue.

Interleaving calls to Serial.print{ln} dramatically affects the results. You are essentially benchmarking your test code and the Serial interrupt service routine.

Ah. I added Serial.flush(). Updated results:

Byte addition:                       152
Byte multiplication:                 200
Word addition:                       228
Word multiplication:                 240
Long addition:                       392
Long multiplication:                 440
Float multiplication:              2,516
Float addition:                    2,692
Byte division, large denominator:  3,356
Word division, large denominator:  3,404
Byte division, small denominator:  3,448
Word division, small denominator:  3,628
Float division, small denominator: 7,376
Float division, large denominator: 7,380
Long division, large denominator:  9,660
Long division, small denominator: 11,100

tylernt:
Ah. I added Serial.flush(). Updated results:

Apparently, in the wrong place(s)...

Serial.begin( 115200 );
Byte addition: 152

Serial.begin( 9600 );
Byte addition: 144

If Serial.flush was working the way you expected, the results should be identical.

joshuabardwell:
Oh wait... I think I see. Serial.write() just dumps to a buffer, doesn't it? And actually writing out is based on an interrupt, which could occur during the test loop.

Correct.

The millis / micros interrupt service routine also affects benchmarking.

If it's interrupts that are the problem, isn't Serial.flush a red herring? Why not just call noInterrupts() before the test run and interrupts() after?

joshuabardwell:
Why not just call noInterrupts() before the test run and interrupts() after?

That interferes with millis / micros. For very short tests it won't matter; for longer tests it will (micros() relies on the Timer0 overflow interrupt, which fires roughly every 1.024 ms on a 16 MHz AVR, so any result over 1 ms is questionable and any result over 2 ms is toast).

For some benchmarking (e.g. overall application performance) including the interrupt service routine time is desirable. In tylernt's case, it is not.

If it's interrupts that are the problem, isn't Serial.flush a red herring?

The problem with Serial is easy to mitigate: collect the results and print at the end of all tests (followed by a flush and/or a suitable delay).

@Tylernt,

Thanks for sharing these results.
The fact that there are all kinds of interference with the measurements is called "real life" imho ;)

Is the test sketch on github or downloadable e.g. to run on other platforms ?

tylernt:
1 cycle if loading/storing to register. I used volatile variables to keep the compiler from optimizing away my loop, so there are some loads/stores in there too...

Don't do that, just make the loop calculations actually calculate a value you use:

int timing_func (int b)  // pass in b so it can't be constant folded
{
  int a = 0 ;
  unsigned long ts = micros () ;
  for (byte i = 0 ; i < 250 ; i++)  // loop counter renamed so it doesn't shadow a
  {
    a += b ;  // unrolled loop body to reduce loop overhead
    b += a ;  // make code too complex to optimize; this is a Fibonacci series BTW
    a += b ;
    b += a ;
  }
  ts = micros () - ts ;
  Serial.println (ts) ;  // microseconds for 1000 additions (250 iterations x 4)
  return a ;  // use a so that its calculation can't be optimized away
}
You have to estimate and remove the loop overhead (I've used a byte variable for the loop to reduce this overhead).

There is a bit of an art to writing a benchmark for an optimizing compiler; it's very easy to have your code vanish to an optimization.

You can also do 2 loops, the first with 2 additions and the second with 1 addition, and subtract the durations. Then the loop overhead is removed automagically.

  unsigned long ts2 = micros () ;
  for (unsigned int i = 0 ; i < 1000 ; i++)  // note: a byte counter could never reach 1000
  {
     x = a + b;
     x = a + b;
  }
  ts2 = micros () - ts2 ;

  unsigned long ts1 = micros () ;
  for (unsigned int i = 0 ; i < 1000 ; i++)
  {
     x = a + b;
  }
  ts1 = micros () - ts1 ;

  duration = ((ts2 - ts1) + R) / 1000;  // R = 500 for rounding, or 0 for no rounding

Oh... yes, of course. Duh.

Thanks robtillaart. Clever how you removed the loop overhead! 8)

So, just to see how much millis() etc affected the numbers, I turned interrupts off and just used timer1 to time things. I also threw in some comparisons of register, constant, and RAM. Here's what I got:

Approximate Relative Performance Index, various maths operations on various data types:
Byte addition, register + register -> register: 1
Byte addition, register + register -> RAM:     94
Byte addition, constant + register -> RAM:     94
Byte addition:                                140
Byte multiplication:                          187
Word addition:                                219
Word multiplication:                          234
Long addition:                                375
Long multiplication:                          422
Float multiplication:                       2,453
Float addition:                             2,625
Byte division, large denominator:           3,265
Word division, large denominator:           3,312
Byte division, small denominator:           3,359
Word division, small denominator:           3,531
Float division, small denominator:          7,187
Float division, large denominator:          7,187
Long division, large denominator:           9,406
Long division, small denominator:          10,812
Unless specified otherwise, all operations are constant + RAM -> RAM.

Rather than specify microseconds and number of operations, I'm calling these numbers "Approximate Relative Performance Index" numbers. This is not an attempt to say "it'll take this long to do this many of these operations", it's an attempt to say "this is relatively and approximately how much faster/slower various things take." No attempt has been made to be perfectly scientifically precise. The results are good enough for my purposes.

Anyway, it appears that my previous tests were relatively close to this one; the results didn't change wildly, so the previous list of "surprises" still stands. One new interesting result is that constants are just as fast as registers, though that's not really a surprise.

Code attached.

MathTest2.ino (5.86 KB)