# Speed of math operations (particularly division) on Arduino

We ran into some performance issues on a recent project and were curious as to how various operations affected execution time. The common wisdom is that division is slower than multiplication or the other standard operations. We looked around for some Arduino-specific data but didn't find much, so we decided to run some brief tests ourselves: http://www.engblaze.com/faster-code-fridays-understand-division-and-speed-of-operations/

Interestingly, division is much slower than other math. As one of our commenters points out, the 8-bit AVRs handle integer multiplication, addition, and subtraction in hardware, but have to generate code for division.

Normally there are other things to worry about first when you're tackling code optimization, but the difference is so stark that it's something to keep in mind if you're doing serious number-crunching.

That was an interesting article. Thanks for sharing.

Yes, very enlightening information. Now we need some of our software gurus to show us methods to avoid (or limit the use of) divisions when using integer math.

Lefty

There is a topic on how to improve performance by replacing division with a multiply / shift:
http://arduino.cc/forum/index.php/topic,60924.0.html
I read somewhere that the compiler is smart enough to check whether an "int" is really necessary as a "for" loop counter, as in this example:

```
for (int i = 0; i <= 100; i++) {
  for (int j = 0; j <= 100; j++) {
```

I think "i" and "j" would be "downscaled" to an unsigned byte, but even a byte-by-byte multiply (mul8_8_16) requires at least 2 cycles, so 10,000 multiplications should take around 125 ns x 10,000 = 1.25 milliseconds, not

> And for our finicky multiplication test, each run completed in 72 us.

which is too fast for the two loops, yet too much for each mul8_8_16.

Interesting, thanks. I did some tests which agree in general ways with the posted conclusions. Division is slower than multiplication. But you expect that, because how many people can do long division quickly? Compared to multiplication?

Sketch:

```
#define DATATYPE byte
#define OPERATION i / j

volatile DATATYPE x;
unsigned long time;

void setup ()
{
  Serial.begin(115200);
  Serial.println();

  time = micros();
  for (DATATYPE i = 0; i < 100; i++) {
    for (DATATYPE j = 0; j < 100; j++) {
      x = OPERATION;
    } // end of for j
  }   // end of for i
  time = micros() - time;
  Serial.print("Completion time: ");
  Serial.println(time);
} // end of setup

void loop() {}
```

By varying the two defines (DATATYPE and OPERATION) I tested various combinations. I then subtracted 3800 µs from each measured time, which was the time taken for an operation of "1" (i.e. the loops plus a straight assignment).

```
Operation   Data type   Time µs   Time each µs  Clock cycles  x baseline

multiply    byte           640         0.064         1               1
multiply    int          3,808         0.3808        6               6
multiply    long         7,620         0.762        12              12

divide      byte       126,924        12.6924      203             198
divide      int        148,936        14.8936      238             233
divide      long       395,544        39.5544      633             618
```

The column “x baseline” compares to a straight byte multiply.

So certainly divides are slower. And using longer data types is slower.

If you know the value in advance it would obviously be faster to multiply by the inverse than divide.

There are application notes, "AVR200: Multiply and Divide Routines" and "AVR201: Using the AVR® Hardware Multiplier". I'm not arguing with the results, just curious about the difference between what Nick posted and what Atmel calculates. It's probably no wonder that division is slower than the "assembler" macros, and the numbers would also vary with the compiler flags (-Os, -O2, -O3, etc.), but what bugs me is how multiplication could go faster, even faster than the Instruction Set Summary says (MUL 8 x 8 = 2 cycles). An error in the "correction" constant 3800? At the same time, the way the time scales with data type is in good agreement with the table:

Table 1. Performance Summary

```
8-bit x 8-bit Routines                          Words (Cycles)
Unsigned multiply 8 x 8 = 16 bits                1 (2)
Signed multiply 8 x 8 = 16 bits                  1 (2)

16-bit x 16-bit Routines
Signed/unsigned multiply 16 x 16 = 16 bits       6 (9)
Unsigned multiply 16 x 16 = 32 bits             13 (17)
Signed multiply 16 x 16 = 32 bits               15 (19)
Signed multiply-accumulate 16 x 16 += 32 bits   19 (23)
```

Magician:
There is a topic on how to improve performance by replacing division with a multiply / shift:
http://arduino.cc/forum/index.php/topic,60924.0.html
I read somewhere that the compiler is smart enough to check whether an "int" is really necessary as a "for" loop counter, as in this example:

```
for (int i = 0; i <= 100; i++) {
  for (int j = 0; j <= 100; j++) {
```

I think "i" and "j" would be "downscaled" to an unsigned byte, but even a byte-by-byte multiply (mul8_8_16) requires at least 2 cycles, so 10,000 multiplications should take around 125 ns x 10,000 = 1.25 milliseconds, not

> And for our finicky multiplication test, each run completed in 72 us.

which is too fast for the two loops, yet too much for each mul8_8_16.

The compiler has optimized away one of the loops. If you want to avoid compiler optimizations like this, you have to use the value computed in one iteration in the next iteration, and then use it after the loops complete. The easiest way to do this is something like:

```
for (DATATYPE i = 0; i < 100; i++) {
  for (DATATYPE j = 0; j < 100; j++) {
    x ^= OPERATION;
  }
}
```

I tried += but the compiler just optimized it away (compilers go to great lengths to optimize [nested] loops, because that's where optimization really pays off). Using -= it didn't optimize away, and ^= seems a better bet since it is unrelated to the arithmetic ops being tested.

Another good way to defeat optimization is to pass the initial value (here, of x) as a parameter to the function doing the timing tests. The compiler then can't know its starting value.

Just for info, I tested several data types.

```
approx. execution time in µs for one operation incl. write to RAM (vars are volatile)

Operator                        + -     *       / %

byte                            0.41    0.60     5.4
char (signed)                   0.41    0.60    15.2

unsigned int  (range byte)      0.85    1.35    13.1
unsigned int  (full range)      0.85    1.35    13.2
int  (range short int)          0.85    1.35    15.2
int  (full range)               0.85    1.35    15.2

unsigned long                   1.7     6.1     43.6
long                            1.7     6.1     44.7

float                           6.5     9.8     28.9
```
```
unsigned long time1, time2;

void loop() {
  volatile char x, a, b, min, max;  // forces x to be in RAM, not in registers (as in every normal code)
  min = -127;
  max = 127;

  time1 = micros(); // reference time
  for (unsigned int i = 0; i < 10000; i++) {
    a = random(min, max);
    b = random(min, max);
    asm(""); // avoid code optimization by the compiler
  }
  time1 = micros() - time1;

  time2 = micros();
  for (unsigned int i = 0; i < 10000; i++) {
    a = random(min, max);
    b = random(min, max);
    x = a / b;
    asm(""); // avoid code optimization by the compiler
  }
  time2 = micros() - time2;

  Serial.print("Completion time: ");
  Serial.print((time2 - time1) / 10);   // (µs / 10000 ops) * 1000 = ns per op
  Serial.println(" ns per operation");
}
```

Maybe related, maybe not. But the 328 does not have a DIV instruction, so division has to be implemented entirely in software.

For division by a constant, you can sometimes save time by multiplying by 1/c, instead of dividing by c. It's not much, but it's something.


NOBSI,
Interesting that your tests show float division is significantly faster than long int division.

Just unexpected!

Something that hasn't really been mentioned is that an int16 * int16 can produce an int32 result, but unless you instruct avr-gcc not to throw away the upper 16 bits, that's exactly what it's going to do.

In this sketch I compared the speeds of 5 ways to do int32 = int16 * int16. I used timer 1 to get the best resolution on the measurement and I turned off interrupts so that they wouldn’t affect it either.

```
#include "Arduino.h"
#include <stdint.h>

#define TEST_PIN 2
#define TEST_PIN_MASK (1 << TEST_PIN) // Port D, bit 2
#define NS_PER_SEC 1000000000
#define TIME_RES 10
#define TIME_SCALE (uint32_t)(1ULL*NS_PER_SEC*TIME_RES/F_CPU)

// Macro from -> https://github.com/rekka/avrmultiplication/blob/master/mult16x16.h
// longRes = intIn1 * intIn2
// signed16 * signed16
#define MultiS16X16to32(longRes, intIn1, intIn2) \
  asm volatile ( \
    "clr r26 \n\t" \
    "mul %A1, %A2 \n\t" \
    "movw %A0, r0 \n\t" \
    "muls %B1, %B2 \n\t" \
    "movw %C0, r0 \n\t" \
    "mulsu %B2, %A1 \n\t" \
    "sbc %D0, r26 \n\t" \
    "add %B0, r0 \n\t" \
    "adc %C0, r1 \n\t" \
    "adc %D0, r26 \n\t" \
    "mulsu %B1, %A2 \n\t" \
    "sbc %D0, r26 \n\t" \
    "add %B0, r0 \n\t" \
    "adc %C0, r1 \n\t" \
    "adc %D0, r26 \n\t" \
    "clr r1 \n\t" \
    : \
    "=&r" (longRes) \
    : \
    "a" (intIn1), \
    "a" (intIn2) \
    : \
    "r26" \
  )

#define PRINT_MUL(p,a,b)    \
  {                         \
    Serial.print(p);        \
    Serial.print(F(" = ")); \
    Serial.print(a);        \
    Serial.print(F(" * ")); \
    Serial.println(b);      \
  }

enum
{
  INT16_MUL,
  INT16_MACRO_MUL,
  INT16_CAST_MUL,
  INT32_MUL,
  FLOAT_MUL,
  NUM_MUL
};

char int16_mul_string[] = "INT16_MUL";
char int16_macro_mul_string[] = "INT16_MACRO_MUL";
char int16_cast_mul_string[] = "INT16_CAST_MUL";
char int32_mul_string[] = "INT32_MUL";
char float_mul_string[] = "FLOAT_MUL";

const char *test_strings[NUM_MUL] =
{
  int16_mul_string,
  int16_macro_mul_string,
  int16_cast_mul_string,
  int32_mul_string,
  float_mul_string
};

volatile int16_t int16a, int16b;
volatile int32_t int16p, int16_cast_p, int16_macro_p;
volatile int32_t int32a, int32b, int32p;
volatile float floata, floatb, floatp;

void setup()
{
  Serial.begin(115200);

  // Turn on Timer1 with no prescaling.
  TCCR1A = 0;
  TCCR1B = 1;
  TCCR1C = 0;

  // Set up pin to see if I'm calculating time correctly.
  digitalWrite(TEST_PIN, LOW);
  pinMode(TEST_PIN, OUTPUT);

  // Display timing information.
  Serial.print(F("F_CPU = "));
  Serial.print(F_CPU);
  Serial.println(F("Hz."));
  Serial.print(F("Timer1 scale = "));
  Serial.print(TIME_SCALE/TIME_RES);
  Serial.print(F("."));
  Serial.print(TIME_SCALE%TIME_RES);
  Serial.println(F("ns/tick."));
}

void loop()
{
  static uint32_t slow_down_serial_timer;
  const uint32_t slow_down_serial_time = 1000000;
  uint16_t timestamps[NUM_MUL+1];
  uint8_t idx;

  if(micros() - slow_down_serial_timer > slow_down_serial_time)
  {
    slow_down_serial_timer += slow_down_serial_time;

    // Generate random inputs
generate_section:
    int16a = random(INT16_MIN, INT16_MAX);
    int16b = random(INT16_MIN, INT16_MAX);
    int32a = (int32_t)int16a;
    int32b = (int32_t)int16b;
    floata = (float)int32a;
    floatb = (float)int32b;

    // Measure the time for each operation
math_section:
    cli();
    timestamps[0] = TCNT1;
    int16p = int16a * int16b;
    timestamps[1] = TCNT1;
    MultiS16X16to32(int16_macro_p, int16a, int16b);
    timestamps[2] = TCNT1;
    int16_cast_p = (int32_t)int16a * (int32_t)int16b;
    timestamps[3] = TCNT1;
    int32p = int32a * int32b;
    timestamps[4] = TCNT1;
    floatp = floata * floatb;
    timestamps[5] = TCNT1;
    sei();

    // Print the measured times
print_section:
    Serial.println(F("==================================================="));
    for(idx = 0; idx < NUM_MUL; idx++)
    {
      uint32_t temp_time = TIME_SCALE*(timestamps[idx+1]-timestamps[idx]);
      Serial.print(test_strings[idx]);
      Serial.print(F(" : "));
      Serial.print(temp_time/TIME_RES);
      Serial.print(F("."));
      Serial.print(temp_time%TIME_RES);
      Serial.print(F("ns. ("));
      Serial.print((timestamps[idx+1]-timestamps[idx]));
      Serial.println(F("ticks)."));
      switch(idx)
      {
        case INT16_MUL:
          PRINT_MUL(int16p, int16a, int16b);
          break;
        case INT16_MACRO_MUL:
          PRINT_MUL(int16_macro_p, int16a, int16b);
          break;
        case INT16_CAST_MUL:
          PRINT_MUL(int16_cast_p, int16a, int16b);
          break;
        case INT32_MUL:
          PRINT_MUL(int32p, int32a, int32b);
          break;
        case FLOAT_MUL:
          PRINT_MUL(floatp, floata, floatb);
          break;
      }
    }
  }
}
```

These are a sample of the results:

```
F_CPU = 16000000Hz. Timer1 scale = 62.5ns/tick.

INT16_MUL : 2375.0ns. (38ticks).
-2607 = -15961 * -13369
INT16_MACRO_MUL : 2750.0ns. (44ticks).
213382609 = -15961 * -13369
INT16_CAST_MUL : 4312.5ns. (69ticks).
213382609 = -15961 * -13369
INT32_MUL : 6562.5ns. (105ticks).
213382609 = -15961 * -13369
FLOAT_MUL : 9562.5ns. (153ticks).
213382608.00 = -15961.00 * -13369.00
```

• Method one (INT16_MUL) is the fastest, which doesn't mean much because the product is just wrong.
• Method two (INT16_MACRO_MUL) is nearly as fast, but it uses an assembly macro to prevent the compiler from throwing away the upper 16 bits of the product.
• Method three (INT16_CAST_MUL) is almost twice as slow, as it casts the int16s into int32s before the multiplication; the compiler then actually performs an int32 * int32 multiplication. Smarter compilers exist, btw, that will recognize this casting pattern as a place where they can do an int16 * int16 = int32 (making the macro in method two unnecessary).
• Method four (INT32_MUL) shows that avr-gcc is at least doing some optimization when the int16s are cast to int32s, since a full int32 * int32 multiply still takes longer than method three.
• Method five (FLOAT_MUL) uses floating point, and not only is it super slow, it also adds error to the product.

I noticed that I was getting different numbers from the other people in the thread so I verified my measurements. If you add up all of these measurements they are ~25.5us. The pulse I measured on TEST_PIN shows that the measurements are correct.

I also inspected the assembly code, to make sure that the number of cycles measured was equal to the number of cycles counted by hand. The first calculation of 38 cycles seems to match the assembly code:

```
math_section:
    cli();
 380: f8 94       cli
 382: 5a 9a       sbi 0x0b, 2 ; 11
    timestamps[0] = TCNT1;                   ; cycles (total cycles)
 384: 80 91 84 00 lds r24, 0x0084          ; 2 (2)
 388: 90 91 85 00 lds r25, 0x0085          ; 2 (4)
 38c: 9a 83       std Y+2, r25             ; 2 (6)
 38e: 89 83       std Y+1, r24             ; 2 (8)
    int16p = int16a * int16b;
 390: 40 91 96 01 lds r20, 0x0196          ; 2 (10)
 394: 50 91 97 01 lds r21, 0x0197          ; 2 (12)
 398: 20 91 94 01 lds r18, 0x0194          ; 2 (14)
 39c: 30 91 95 01 lds r19, 0x0195          ; 2 (16)
 3a0: 42 9f       mul r20, r18             ; 2 (18)
 3a2: c0 01       movw r24, r0             ; 1 (19)
 3a4: 43 9f       mul r20, r19             ; 2 (21)
 3a6: 90 0d       add r25, r0              ; 1 (22)
 3a8: 52 9f       mul r21, r18             ; 2 (24)
 3aa: 90 0d       add r25, r0              ; 1 (25)
 3ac: 11 24       eor r1, r1               ; 1 (26)
 3ae: aa 27       eor r26, r26             ; 1 (27)
 3b0: 97 fd       sbrc r25, 7              ; 1/2/3 (28/29/30)
 3b2: a0 95       com r26                  ; 1 (29/30/31)
 3b4: ba 2f       mov r27, r26             ; 1 (30/31/32)
 3b6: 80 93 90 01 sts 0x0190, r24          ; 2 (32/33/34)
 3ba: 90 93 91 01 sts 0x0191, r25          ; 2 (34/35/36)
 3be: a0 93 92 01 sts 0x0192, r26          ; 2 (36/37/38)
 3c2: b0 93 93 01 sts 0x0193, r27          ; 2 (38/39/40)
```

I can provide the rest of the assembly code if anyone is interested. I mostly generated it to be sure the optimizer wasn't doing anything to affect the measurements.

At this point I’m pretty sure my calculations are correct. I’m not sure why the discrepancy with the previous posters, however.

Edited to Add: I didn’t realize this was a resurrected thread with 9618 views!

The result in (1) is correct. -2607 = -15961 * -13369
You multiplied two 16-bit integers and stored the result in a 16-bit integer. If you want the result to fit in a 16-bit integer, you’ll have to pick two smaller numbers to multiply.

Pete

el_supremo: The result in (1) is correct. -2607 = -15961 * -13369 You multiplied two 16-bit integers and stored the result in a 16-bit integer. If you want the result to fit in a 16-bit integer, you'll have to pick two smaller numbers to multiply.

No, I stored all of my products in int32s:

```
volatile int16_t int16a, int16b;
volatile int32_t int16p, int16_cast_p, int16_macro_p;
```

There are embedded compilers, btw, that will keep the upper 16 bits from an int16 * int16 multiply as long as you use a 32-bit variable to store the product. They have to list it in the "deviations from the standard" section of their manual, but they're usually proud of doing it.

Oh right. The result being stored in a 32-bit integer is irrelevant. The result of multiplying a 16-bit int by a 16-bit int is defined in the C language to be a 16-bit int.

Pete

el_supremo:
Oh right. The result being stored in a 32-bit integer is irrelevant. The result of multiplying a 16-bit int by a 16-bit int is defined in the C language to be a 16-bit int.

Yes, one of the most annoying things about the language honestly.

And to make it even worse, processor MUL instructions almost always produce both parts of the product (I’ve never seen one that didn’t).

The compiler sometimes does extra work to zero out the upper half of the result just because the laws of C say so!

By the time this thread got done, divide by 10 and remainder got -very- optimized.

Fast / 10 and % 10 thread.

GoForSmoke:
By the time this thread got done, divide by 10 and remainder got -very- optimized.

Fast / 10 and % 10 thread.

Hmm…interesting. I tested his first solution in that thread (I'm not sure which version turned out to be the absolute fastest). It seems he reduced the operation time by 61% over using / and %, and by 28% over using the ((1 << 32)/10) multiply-then-shift-by-32 trick.

```
F_CPU = 16000000Hz. Timer1 scale = 62.5ns/tick.

DIVMOD10 : 39562.5ns. (633ticks).
213382609 = 21338260.9
FUNCTION_DIVMOD10 : 15375.0ns. (246ticks).
213382609 = 21338260.9
MULTSHIFT_DIVMOD10 : 21375.0ns. (342ticks).
213382609 = 21338260.9
```

This is the code I used.

```
#include "Arduino.h"
#include <stdint.h>

#define TEST_PIN 2
#define TEST_PIN_MASK (1 << TEST_PIN) // Port D, bit 2
#define NS_PER_SEC 1000000000
#define TIME_RES 10
#define TIME_SCALE (uint32_t)(1ULL*NS_PER_SEC*TIME_RES/F_CPU)

#define PRINT_DIVMOD10(p,a,b)   \
  {                             \
    Serial.print(p);            \
    Serial.print(F(" = "));     \
    Serial.print(a);            \
    Serial.print(F("."));       \
    Serial.println(b);          \
  }

enum
{
  DIVMOD10,
  FUNCTION_DIVMOD10,
  MULTSHIFT_DIVMOD10,
  NUM_DIVMOD10
};

char divmod10_string[] = "DIVMOD10";
char function_divmod10_string[] = "FUNCTION_DIVMOD10";
char multshift_divmod10_string[] = "MULTSHIFT_DIVMOD10";

const char *test_strings[NUM_DIVMOD10] =
{
  divmod10_string,
  function_divmod10_string,
  multshift_divmod10_string
};

uint32_t div_divmod10, mod_divmod10;
uint32_t div_function_divmod10, mod_function_divmod10;
uint32_t div_multshift_divmod10, mod_multshift_divmod10;

//void divmod10(uint32_t in, uint32_t &div, uint32_t &mod);

void setup()
{
  Serial.begin(115200);

  // Turn on Timer1 with no prescaling.
  TCCR1A = 0;
  TCCR1B = 1;
  TCCR1C = 0;

  // Set up pin to see if I'm calculating time correctly.
  digitalWrite(TEST_PIN, LOW);
  pinMode(TEST_PIN, OUTPUT);

  // Display timing information.
  Serial.print(F("F_CPU = "));
  Serial.print(F_CPU);
  Serial.println(F("Hz."));
  Serial.print(F("Timer1 scale = "));
  Serial.print(TIME_SCALE/TIME_RES);
  Serial.print(F("."));
  Serial.print(TIME_SCALE%TIME_RES);
  Serial.println(F("ns/tick."));
}

void loop()
{
  static uint32_t slow_down_serial_timer;
  const uint32_t slow_down_serial_time = 1000000;
  uint16_t timestamps[NUM_DIVMOD10+1];
  uint8_t idx;
  uint32_t temp_uint32;

  if(micros() - slow_down_serial_timer > slow_down_serial_time)
  {
    slow_down_serial_timer += slow_down_serial_time;

    // Generate random inputs
generate_section:
    temp_uint32 = (uint32_t)random(INT16_MIN, INT16_MAX) * (uint32_t)random(INT16_MIN, INT16_MAX);

    // Measure the time for each operation
math_section:
    cli();
    timestamps[0] = TCNT1;
    div_divmod10 = temp_uint32 / 10;
    mod_divmod10 = temp_uint32 % 10;
    timestamps[1] = TCNT1;
    divmod10(temp_uint32, div_function_divmod10, mod_function_divmod10);
    timestamps[2] = TCNT1;
    div_multshift_divmod10 = ((uint64_t)temp_uint32 * ((1ULL << 32)/10) + (1ULL << 31)) >> 32;
    mod_multshift_divmod10 = temp_uint32 - div_multshift_divmod10 * 10;
    timestamps[3] = TCNT1;
    sei();

    // Print the measured times
print_section:
    Serial.println(F("==================================================="));
    for(idx = 0; idx < NUM_DIVMOD10; idx++)
    {
      uint32_t temp_time = TIME_SCALE*(timestamps[idx+1]-timestamps[idx]);
      Serial.print(test_strings[idx]);
      Serial.print(F(" : "));
      Serial.print(temp_time/TIME_RES);
      Serial.print(F("."));
      Serial.print(temp_time%TIME_RES);
      Serial.print(F("ns. ("));
      Serial.print((timestamps[idx+1]-timestamps[idx]));
      Serial.println(F("ticks)."));
      switch(idx)
      {
        case DIVMOD10:
          PRINT_DIVMOD10(temp_uint32, div_divmod10, mod_divmod10);
          break;
        case FUNCTION_DIVMOD10:
          PRINT_DIVMOD10(temp_uint32, div_function_divmod10, mod_function_divmod10);
          break;
        case MULTSHIFT_DIVMOD10:
          PRINT_DIVMOD10(temp_uint32, div_multshift_divmod10, mod_multshift_divmod10);
          break;
      }
    }
  }
}

void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
  // q = in * 0.8;
  uint32_t q = (in >> 1) + (in >> 2);
  q = q + (q >> 4);
  q = q + (q >> 8);
  q = q + (q >> 16);  // not needed for 16 bit version

  // q = q / 8;  ==> q = in * 0.1;
  q = q >> 3;

  // determine error
  uint32_t r = in - ((q << 3) + (q << 1));   // r = in - q*10;
  div = q + (r > 9);
  if (r > 9) mod = r - 10;
  else mod = r;
}
```

If he is solely doing this to display fixed-point integers over the serial port, however, then he'd be much better off inserting the '.' as he converts the integers to ASCII. I did this by writing my own printf() and including a "%#.#d" formatter where the 2nd # gives the position of the implied decimal point.

I've gotten too old for such mental gymnastics, but does that work through some decimal-base magic, or could it be adapted to other bases, in particular prime-number bases?

BTW, subsequent algorithms in that thread make improvements by leaps. I had tried my own table-lookup based constructions (a small table that can be applied to different word lengths) later on and came up short. Not bad, but not as fast.

PS - there's probably a centuries old technique from India to do all this.

Atmel has some optimized routines and performance results listed here that might be of interest.