Math execution times for Arduino

Hello Arduinoland residents,

I searched high and low for the execution times of the basic (addition, multiply, divide) and less-used (sqrt, sin, etc.) math operations on the Arduino when using different data types. I could find nothing definitive, so I took it upon myself to measure the speeds. I used the Mega2560, so these measurements should correlate well with the Uno. It took a while to figure out a way to keep the compiler from optimizing the math operation out of existence, but I think I got it figured out. The results are in the attached picture.
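To give an idea of the approach, here is a rough sketch of one way to do it (not my exact test code): declare the operands and the result volatile so the compiler can't fold the calculation away, and time a single operation between two micros() calls.

// Rough sketch of a single-shot measurement. The volatile qualifiers keep
// the compiler from computing the product at compile time or dropping it.
volatile long a = 123456L;
volatile long b = 789L;
volatile long result;

void setup() {
  Serial.begin(9600);

  unsigned long t0 = micros();
  result = a * b;                 // the operation being timed
  unsigned long t1 = micros();

  // Note: this single-shot reading still includes the micros() call itself
  // and only has ~4 us resolution.
  Serial.print("elapsed us: ");
  Serial.println(t1 - t0);
}

void loop() {}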

The most interesting result is that floating point division is actually faster than 32-bit integer division! What takes 44us with long operands only takes 34us with float operands.

Anyway, I hope these results help someone out there. And if anyone has done similar tests, do these results jibe with yours?

Math Speeds Mega2560.jpg

Is it just me or are those pretty blurry?

Just you. Looks fine to me, once you enlarge it from thumbnail view.

Interesting results from a quick-n-dirty "how long should I plan for this to take" perspective, although optimizations, library versions, clock speeds, and lots of other considerations keep this from being authoritative or definitive. For that, nothing beats an object dump and a comparison against the Atmel instruction clock-cycle reference, of course. :)

I would've expected FP math to take longer than 32-bit int as well. Huh.

Thanks for the share.

SirNickity: I would've expected FP math to take longer than 32-bit int as well. Huh.

FP math only works on a 24-bit mantissa.

On an 8-bit CPU, every extra byte adds up to a lot of extra cycles.

The multiplication timing results disagree with my tests on an UNO. I measured ~5 us for int32_t, and int16_t should be even faster, as it takes only 2 clock cycles plus the overhead of pushing registers; anyway, I think < 1 us.
Here is a page for comparison:
http://www.gnu.org/savannah-checkouts/non-gnu/avr-libc/user-manual/benchmarks.html

Actually, a (int16_t * int16_t) multiply is definitely more than 2 clock cycles. The only multiply available on the Uno/Mega takes 8-bit operands (it's an 8-bit microcontroller, remember). An 8-bit x 8-bit multiply does produce a 16-bit result, but if the operands are both 16-bit, then it takes multiple instructions to complete.
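Roughly, the compiler has to build it out of 8x8 partial products, something like this in C (just an illustration of the math; the actual routine is emitted by the compiler):

// Illustration: a 16x16 multiply built from 8x8 partial products, because
// the AVR's MUL instruction only takes 8-bit operands.
// With a = ah*256 + al and b = bh*256 + bl:
//   a*b = ah*bh*65536 + (ah*bl + al*bh)*256 + al*bl
// (ah*bh only affects bits 16..31, which a 16-bit result discards.)
uint16_t mul16(uint16_t a, uint16_t b) {
  uint8_t al = a & 0xFF, ah = a >> 8;
  uint8_t bl = b & 0xFF, bh = b >> 8;

  uint16_t p = (uint16_t)al * bl;            // low 8x8 product, one MUL
  p += (uint16_t)((uint16_t)al * bh) << 8;   // cross term, another MUL
  p += (uint16_t)((uint16_t)ah * bl) << 8;   // cross term, another MUL
  return p;
}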

Regardless, I think your 5us measurement for 32-bit multiply is more correct than my measurement. I did these measurements only once between calls to micros(), which means the time for micros() to execute is included in the measurement. In addition, like micros(), my measurements only have 4 us resolution (well, it's better than that because I did it multiple times and took an average). The reason I didn't loop is because I wanted to be certain the compiler couldn't optimize anything. However, I just measured the execution time of micros() (with the looping method), and it takes 3774ns minus the roughly 3-4 clocks for the loops (say 250ns). So subtract 3.5 us from all my measurements! I think I might go ahead and re-measure all the math with the looping method just to compare...

Actually, a (int16_t * int16_t) multiply is definitely more than 2 clock cycles.

Oops, you are right, a 16x16 multiply requires about 4 MULs at 2 cycles each, so 0.5 us plus overhead. AFAIR, I used a "for" loop of 10000 iterations to get better results, and declared some variables inside the loop as volatile. Another thing I've noticed is that sqrt timing depends on the value: when the argument is smaller, the operation goes faster. Have you clicked on the link? Actually your results are in good proximity to theirs, +10..20%.

Ok, much better viewing on a laptop. Smartphone wouldn’t expand the image.

Magician: Another thing I've noticed is that sqrt timing depends on the value: when the argument is smaller, the operation goes faster.

Yep. You'll find that with integer division as well - the loop ends when there are no more 1 bits in the partial result.
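For the division case, here's a toy shift-and-subtract divide (not the avr-libc routine, just an illustration) where the amount of work tracks the magnitude of the dividend, so small values finish sooner:

// Illustration only: the loop runs once per significant bit of the dividend.
uint16_t div16(uint16_t dividend, uint16_t divisor) {
  uint16_t quotient = 0;
  uint16_t remainder = 0;

  // Find the highest set bit of the dividend; smaller dividends mean
  // fewer passes through the loop below.
  int8_t bit = 15;
  while (bit >= 0 && !(dividend & (1U << bit))) bit--;

  for (; bit >= 0; bit--) {
    remainder = (remainder << 1) | ((dividend >> bit) & 1);
    quotient <<= 1;
    if (remainder >= divisor) {   // subtract when the divisor fits
      remainder -= divisor;
      quotient |= 1;
    }
  }
  return quotient;                // remainder now holds dividend % divisor
}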

You'll also see it with multiply on chips which have no hardware multiplier.

Yeah, I was worried about the value of the operands making a difference… but not worried enough to spend a lot of time on it. I went ahead and got some more accurate results (by looping 65535 times in most cases). I tried to have the operand change for each calculation as well as I could without adding time, so perhaps it gives a good “average” result. Although + and * seemed quite independent of operand values. Anyway, check out the attached image for the new results. And I did subtract out the time that I estimated the looping to take (I guessed 250 ns for each iter).
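Something along these lines is the general idea (a sketch rather than my exact code; the volatile variables here are one way to keep the operation from being optimized away):

// Sketch of the nested-loop timing method. Two 8-bit counters give
// 255*255 = 65025 iterations of the operation under test.
volatile float a = 1234.5f;
volatile float b = 6.789f;
volatile float result;

void setup() {
  Serial.begin(9600);

  unsigned long t0 = micros();
  for (uint8_t i = 255; i != 0; i--) {
    for (uint8_t j = 255; j != 0; j--) {
      result = a / b;             // operation under test
    }
  }
  unsigned long t1 = micros();

  const unsigned long iters = 255UL * 255UL;
  // Average per-operation time in ns, minus a guessed ~250 ns of loop
  // overhead per iteration (the volatile accesses add a little of that).
  float nsPerOp = (float)(t1 - t0) * 1000.0f / iters - 250.0f;

  Serial.print("approx ns per op: ");
  Serial.println(nsPerOp);
}

void loop() {}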

The surprising result here is that int8 division is almost as slow as int16! I wonder if int8 division is not really implemented, and it just executes as int16? Float division is definitely quite a bit faster than int32. And I’m not sure if the differences in sqrt() are from differences in operand values or from converting to float (because I suspect sqrt() is only actually implemented for float).
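One thing that would fit that guess: in C, both operands of a division get promoted to int (16 bits on AVR) before the divide, so something like the snippet below normally goes through the same 16-bit division routine anyway (assuming the compiler doesn't optimize it back down).

// Illustration: even "8-bit" division goes through C's integer promotions.
int8_t div8(int8_t a, int8_t b) {
  // a and b are promoted to int (16-bit on AVR) before the division,
  // so this is effectively a 16-bit divide.
  return a / b;
}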

Oh, by the way, these times are in ns, not us (typo on top of image file).

Math Speeds Mega2560-2.jpg

DireSpume: And I'm not sure if the differences in sqrt() are from differences in operand values or from converting to float (because I suspect sqrt() is only actually implemented for float).

You're right: http://www.nongnu.org/avr-libc/user-manual/group__avr__math.html

Since you can't view pics as a guest, I thought it would be a good idea to put the results as text instead:

Nanosecond execution times on Arduino Mega2560 (16 MHz)

Operation   uint8      int16    int32     int64     float**
+           63^        884      1763      8428      10943
*           125^       1449     4592      57038     10422
/           15859^^    15969    41866*    274809    31951
sqrt        54251      54448    70884     -         47127
micros()    3524

Measured using double nested loops (usually 255*255). The extra time required for looping was estimated at 250 ns and subtracted out. % was also tested but is the same as /, as expected.

*  unsigned result was faster at 39413
** results varied a bit, highest value reported
^^ unsigned result was faster at 13174
^  not measured, from data sheet

Notes:
- Division of any data type is quite time consuming, int32 being especially bad.
- Float division is faster than int32, perhaps because the mantissa for float numbers is only 24-bit? Although + and * are worse.
- int8 division is the same speed as int16 division, perhaps because division for int8 isn't implemented and executes as int16 (just a guess though). But who does int8 division anyway? Well, I could see doing modulo I guess.
- Somewhat surprising is that division of unsigned ints seems marginally faster than signed ints.
- The time for an operation to complete is somewhat dependent on the value of the operands, especially for division, so computations may complete in a shorter time.
- sqrt() is only implemented for floats, so the time increase with ints is probably at least partially due to the time required to convert to float (or it's just different because of different operand values used for the different tests).
- Compiler optimization often leads to dramatic improvement in computations over what these numbers suggest.

That doesn’t look too bad to me, but you might consider putting it in code tags. That way, it’ll use a fixed-width font, and spaces are more-or-less guaranteed to line up correctly, even across browsers, OSes, and devices.

You might be interested in this page:

http://www.nongnu.org/avr-libc/user-manual/benchmarks.html