
Topic: Speed of math operations (particularly division) on Arduino

engblaze

We ran into some performance issues on a recent project and were curious as to how various operations affected execution time.  The common wisdom is that division is slower than multiplication or the other standard operations.  We looked around for some Arduino-specific data but didn't find much, so we decided to run some brief tests ourselves: http://www.engblaze.com/faster-code-fridays-understand-division-and-speed-of-operations/

Interestingly, division is much slower than the other operations.  As one of our commenters points out, the 8-bit AVRs handle integer multiplication, addition, and subtraction in hardware, but the compiler has to generate code (a library routine) for division.

Normally there are other things to worry about first when you're tackling code optimization, but the difference is so stark that it's something to keep in mind if you're doing serious number-crunching.

mromani

That was an interesting article. Thanks for sharing.

retrolefty

Yes, very enlightening information. Now we need some of our software gurus to show us methods to avoid (or limit the use of) divisions when using integer math.

Lefty

Magician

There is a topic on how to improve performance by replacing division with a multiply and a shift:
http://arduino.cc/forum/index.php/topic,60924.0.html
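For illustration, a small example of that idea (my own sketch, not taken from the linked thread): a byte divide by the constant 10 can be replaced by one multiply and one shift, because 205/2048 is close enough to 1/10 that the result is exact for every byte value.

Code: [Select]
// Divide a byte by 10 with a multiply and a shift instead of a real division.
// (x * 205) >> 11 equals x / 10 for all x in 0..255; verify the constant if
// you change the divisor or widen the input range.
uint8_t div10(uint8_t x)
{
    return ((uint16_t)x * 205) >> 11;
}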
I read somewhere that the compiler is smart enough to check whether an "int" is really necessary as a "for" loop counter, as in this example:
Code: [Select]
for (int i = 0; i <= 100; i++) {
        for (int j = 0; j <= 100; j++) {

I think "i" and "j" would be "downscaled" to unsigned bytes, but even a byte-by-byte multiply (mul8_8_16) requires at least 2 cycles, so 10,000 multiplications should take around 125 ns x 10,000 = 1.25 milliseconds, and not
Quote
And for our finicky multiplication test, each run completed in 72 us.
That is too fast for the two nested loops, yet too long for a single muls8_8_16.

nickgammon

Interesting, thanks. I did some tests which broadly agree with the posted conclusions. Division is slower than multiplication. But you'd expect that: how many people can do long division quickly, compared to multiplication?

Sketch:

Code: [Select]
#define DATATYPE byte
#define OPERATION i / j

volatile DATATYPE x;
unsigned long time;

void setup ()
{
    Serial.begin(115200);
    Serial.println ();

    time = micros();
    for (DATATYPE i = 0; i < 100; i++) {
        for (DATATYPE j = 0; j < 100; j++) {
            x = OPERATION;
        } // end of for j
    }  // end of for i
    time = micros() - time;
    Serial.print("Completion time: ");
    Serial.println(time);
} // end of setup

void loop() {}


By varying the two defines (DATATYPE and OPERATION) I tested various combinations. I then subtracted 3,800 us from each measured time, that being the time taken when OPERATION was simply "1" (i.e. the loop overhead plus a straight assignment).

Code: [Select]

Operation   Data type    Time (us)   Time each (us)   Clock cycles   x baseline

multiply    byte             640          0.064             1              1
multiply    int            3,808          0.3808            6              6
multiply    long           7,620          0.762            12             12

divide      byte         126,924         12.6924          203            198
divide      int          148,936         14.8936          238            233
divide      long         395,544         39.5544          633            618


The column "x baseline" compares to a straight byte multiply.

So certainly divides are slower. And using longer data types is slower.

If you know the divisor in advance, it would obviously be faster to multiply by the inverse than to divide.

Magician

There are application notes, "AVR200: Multiply and Divide Routines" and "AVR201: Using the AVR® Hardware Multiplier". I'm not arguing with the results, just curious about the difference between what Nick posted and what Atmel calculates. It's probably no wonder that division is slower than the "assembler" macros, and it would also vary with the compiler flags (-O0, -O2, -O3, etc.), but what bugs me is how multiplication could go faster. Even faster than the instruction set summary says (MUL 8 x 8 = 2 cycles)?
An error in the "correction" constant of 3,800? At the same time, the "proportion" of time versus data type is in good agreement with the table:
Quote
Table 1. Performance Summary

8-bit x 8-bit Routines                                                     Word (Cycles)
Unsigned multiply 8 x 8 = 16 bits                                 1 (2)
Signed multiply 8 x 8 = 16 bits                                     1 (2)

16-bit x 16-bit Routines
Signed/unsigned multiply 16 x 16 = 16 bits                  6 (9)
Unsigned multiply 16 x 16 = 32 bits                            13 (17)
Signed multiply 16 x 16 = 32 bits                                15 (19)
Signed multiply-accumulate 16 x 16 += 32 bits           19 (23)


MarkT


Quote from: Magician
There is a topic on how to improve performance by replacing division with a multiply and a shift:
http://arduino.cc/forum/index.php/topic,60924.0.html
I read somewhere that the compiler is smart enough to check whether an "int" is really necessary as a "for" loop counter, as in this example:
Code: [Select]
for (int i = 0; i <= 100; i++) {
        for (int j = 0; j <= 100; j++) {

I think "i" and "j" would be "downscaled" to unsigned bytes, but even a byte-by-byte multiply (mul8_8_16) requires at least 2 cycles, so 10,000 multiplications should take around 125 ns x 10,000 = 1.25 milliseconds, and not
Quote
And for our finicky multiplication test, each run completed in 72 us.
That is too fast for the two nested loops, yet too long for a single muls8_8_16.


The compiler has optimized away one of the loops. If you want to avoid compiler optimizations like this, you have to use the value computed in one iteration in the next iteration, and then use it after the loops complete. The easiest way to do this is something like:
Code: [Select]

    for (DATATYPE i = 0; i < 100; i++) {
        for (DATATYPE j = 0; j < 100; j++) {
            x ^= OPERATION;
        }
    }

I tried += but the compiler just optimized it away (compilers go to great lengths to optimize [nested] loops, because that's where optimization really pays off).  Using -= it didn't optimize away, and ^= seems a better bet since it is unrelated to arithmetic ops.

Another good way to defeat optimization is to pass the initial value (here, of x) in as a parameter to the function doing the timing tests.  The compiler then can't know its starting value.
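For example, something along these lines (a rough sketch of that idea; the function and its name are mine, not from the post above):

Code: [Select]
// The starting value arrives as a parameter, so the compiler cannot treat it
// as a known constant, and the result is handed back so the work isn't discarded.
unsigned long timeDivides(byte seed, byte &result)
{
    byte x = seed;
    unsigned long start = micros();
    for (byte i = 1; i < 100; i++)
        for (byte j = 1; j < 100; j++)
            x ^= i / j;
    result = x;
    return micros() - start;
}

// called e.g. as timeDivides(analogRead(A0) & 0xFF, r), then print both
// the elapsed time and r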

Nobsi

Just for info, I tested several data types.

Code: [Select]

Approximate execution time in us for one operation, incl. the write to RAM (variables are volatile)

Data type                       + or -     *      / or %

byte                             0.41     0.60      5.4
char (signed)                    0.41     0.60     15.2

unsigned int (byte range)        0.85     1.35     13.1
unsigned int (full range)        0.85     1.35     13.2
int (short int range)            0.85     1.35     15.2
int (full range)                 0.85     1.35     15.2

unsigned long                    1.7      6.1      43.6
long                             1.7      6.1      44.7

float                            6.5      9.8      28.9



Code: [Select]

unsigned long time1, time2;           // elapsed times in microseconds

void loop() {
    volatile char x, a, b, min, max;  // volatile forces these into RAM, not into registers (as in normal code)
    min = -127;
    max = 127;

    time1 = micros(); // reference time
    for (unsigned int i = 0; i < 10000; i++) {
         a = random(min, max);
         b = random(min, max);
         asm(""); // avoid code optimization of the compiler
    }
    time1 = micros() - time1;

    time2 = micros();
    for (unsigned int i = 0; i < 10000; i++) {
         a = random(min, max);
         b = random(min, max);
         x = a/b;
         asm(""); // avoid code optimization of the compiler
    }
    time2 = micros() - time2;

    Serial.print("Completion time: ");
    Serial.print((time2-time1)/10);
    Serial.println("ns per operation");
}


Used Boards: Uno, Mega, Redboard

sterretje

Maybe related, maybe not, but the 328 does not have a DIV instruction, so division has to be implemented entirely in software.
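Roughly what that software routine has to do, sketched here in C (the real routine the compiler links in is hand-written assembler, but the shape is similar: one pass per result bit):

Code: [Select]
// Simplified shift-and-subtract (restoring) division, just to show why a
// 16-bit divide costs a couple of hundred cycles on a chip with no DIV.
uint16_t soft_div16(uint16_t num, uint16_t den)
{
    uint16_t quot = 0, rem = 0;
    for (int8_t i = 15; i >= 0; i--) {       // one iteration per result bit
        rem = (rem << 1) | ((num >> i) & 1); // bring down the next bit
        if (rem >= den) {                    // does the divisor fit?
            rem -= den;
            quot |= (uint16_t)1 << i;
        }
    }
    return quot;                             // rem now holds num % den
}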

aarg

For division by a constant, you can sometimes save time by multiplying by 1/c, instead of dividing by c. It's not much, but it's something.
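A minimal sketch of that for a constant float divisor (my own example; how much it helps depends on the data type and on what the compiler already does):

Code: [Select]
// 1.0 / 3.0 is folded to a constant at compile time, so each call does a
// float multiply instead of the (slower) float divide.
const float third = 1.0 / 3.0;

float scaled(float sample)
{
    return sample * third;   // instead of sample / 3.0
}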

lastchancename

NOBSI,
Interesting that your tests show float division is significantly faster than the long int division.

Just unexpected!

BigBobby

Something that hasn't really been mentioned is that an int16 * int16 can produce an int32 result, but unless you arrange for avr-gcc to keep the upper 16 bits, it will throw them away.

In this sketch I compared the speeds of five ways to do int32 = int16 * int16.  I used Timer1 to get the best resolution on the measurement, and I turned off interrupts so that they wouldn't affect it either.

Code: [Select]

#include "Arduino.h"
#include <stdint.h>

#define TEST_PIN 2
#define TEST_PIN_MASK (1 << TEST_PIN) // Arduino pin 2 = Port D bit 2
#define NS_PER_SEC 1000000000
#define TIME_RES 10
#define TIME_SCALE (uint32_t)(1ULL*NS_PER_SEC*TIME_RES/F_CPU)

// Macro From -> https://github.com/rekka/avrmultiplication/blob/master/mult16x16.h
// longRes = intIn1 * intIn2
// signed16 * signed16
#define MultiS16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"muls %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mulsu %B2, %A1 \n\t" \
"sbc %D0, r26 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbc %D0, r26 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" \
)

#define PRINT_MUL(p,a,b)  \
{                         \
  Serial.print(p);        \
  Serial.print(F(" = ")); \
  Serial.print(a);        \
  Serial.print(F(" * "));  \
  Serial.println(b);      \
}

enum
{
  INT16_MUL,
  INT16_MACRO_MUL,
  INT16_CAST_MUL,
  INT32_MUL,
  FLOAT_MUL,
  NUM_MUL
};

char int16_mul_string[] = "INT16_MUL";
char int16_macro_mul_string[] = "INT16_MACRO_MUL";
char int16_cast_mul_string[] = "INT16_CAST_MUL";
char int32_mul_string[] = "INT32_MUL";
char float_mul_string[] = "FLOAT_MUL";

const char *test_strings[NUM_MUL] =
{
    int16_mul_string,
    int16_macro_mul_string,
    int16_cast_mul_string,
    int32_mul_string,
    float_mul_string
};

volatile int16_t int16a, int16b;
volatile int32_t int16p, int16_cast_p, int16_macro_p;
volatile int32_t int32a, int32b, int32p;
volatile float floata, floatb, floatp;

void setup()
{
  Serial.begin(115200);

  // Turn on Timer1 with no prescaling.
  TCCR1A = 0;
  TCCR1B = 1;
  TCCR1C = 0;

  // Set up pin to see if I'm calculating time correctly.
  digitalWrite(TEST_PIN, LOW);
  pinMode(TEST_PIN, OUTPUT);

  // Display timing information.
  Serial.print(F("F_CPU = "));
  Serial.print(F_CPU);
  Serial.println(F("Hz."));
  Serial.print(F("Timer1 scale = "));
  Serial.print(TIME_SCALE/TIME_RES);
  Serial.print(F("."));
  Serial.print(TIME_SCALE%TIME_RES);
  Serial.println(F("ns/tick."));
}

void loop()
{
  static uint32_t slow_down_serial_timer;
  const uint32_t slow_down_serial_time = 1000000;
  uint16_t timestamps[NUM_MUL+1];
  uint8_t idx;

  if(micros() - slow_down_serial_timer > slow_down_serial_time)
  {
    slow_down_serial_timer += slow_down_serial_time;

    // Generate Random Inputs
generate_section:
    int16a = random(INT16_MIN, INT16_MAX);
    int16b = random(INT16_MIN, INT16_MAX);
    int32a = (int32_t)int16a;
    int32b = (int32_t)int16b;
    floata = (float)int32a;
    floatb = (float)int32b;

    // Measure the time for each operation
math_section:
    cli();
    PORTD |= TEST_PIN_MASK;
    timestamps[0] = TCNT1;
    int16p = int16a * int16b;
    timestamps[1] = TCNT1;
    MultiS16X16to32(int16_macro_p, int16a, int16b);
    timestamps[2] = TCNT1;
    int16_cast_p = (int32_t)int16a * (int32_t)int16b;
    timestamps[3] = TCNT1;
    int32p = int32a * int32b;
    timestamps[4] = TCNT1;
    floatp = floata * floatb;
    timestamps[5] = TCNT1;
    PORTD &= ~TEST_PIN_MASK;
    sei();

    // Print the measured times
print_section:
    Serial.println(F("==================================================="));
    for(idx = 0; idx < NUM_MUL; idx++)
    {
      uint32_t temp_time = TIME_SCALE*(timestamps[idx+1]-timestamps[idx]);
      Serial.print(test_strings[idx]);
      Serial.print(F(" : "));
      Serial.print(temp_time/TIME_RES);
      Serial.print(F("."));
      Serial.print(temp_time%TIME_RES);
      Serial.print(F("ns. ("));
      Serial.print((timestamps[idx+1]-timestamps[idx]));
      Serial.println(F("ticks)."));
      switch(idx)
      {
        case INT16_MUL:
          PRINT_MUL(int16p, int16a, int16b);
          break;
        case INT16_MACRO_MUL:
          PRINT_MUL(int16_macro_p, int16a, int16b);
          break;
        case INT16_CAST_MUL:
          PRINT_MUL(int16_cast_p, int16a, int16b);
          break;
        case INT32_MUL:
          PRINT_MUL(int32p, int32a, int32b);
          break;
        case FLOAT_MUL:
          PRINT_MUL(floatp, floata, floatb);
          break;
      }
    }
  }
}


Here is a sample of the results:

Quote
F_CPU = 16000000Hz.
Timer1 scale = 62.5ns/tick.
===================================================
INT16_MUL : 2375.0ns. (38ticks).
-2607 = -15961 * -13369
INT16_MACRO_MUL : 2750.0ns. (44ticks).
213382609 = -15961 * -13369
INT16_CAST_MUL : 4312.5ns. (69ticks).
213382609 = -15961 * -13369
INT32_MUL : 6562.5ns. (105ticks).
213382609 = -15961 * -13369
FLOAT_MUL : 9562.5ns. (153ticks).
213382608.00 = -15961.00 * -13369.00
...
  • Method one (INT16_MUL) is the fastest, which doesn't mean much because the product is simply wrong.
  • Method two (INT16_MACRO_MUL) is nearly as fast, but it uses an assembly macro to prevent the compiler from throwing away the upper 16 bits of the product.
  • Method three (INT16_CAST_MUL) is almost twice as slow, as it casts the int16s to int32s before the multiplication; the compiler then actually does an int32 * int32 multiplication.  Smarter compilers exist, by the way, that recognize this casting pattern as a place where they can do an int16 * int16 = int32 (making the macro in method two unnecessary).
  • Method four (INT32_MUL) shows that avr-gcc is at least doing some optimization in the cast case, since a plain int32 * int32 = int32 still takes longer than method three.
  • Method five (FLOAT_MUL) uses floating point; not only is it very slow, it also adds error to the product.

I noticed that I was getting different numbers from the other people in the thread, so I verified my measurements.  Adding up all of these measurements gives about 25.5 us, and the pulse I measured on TEST_PIN confirms that figure.


I also inspected the assembly code, to make sure that the number of cycles measured was equal to the number of cycles counted by hand.  The first calculation of 38 cycles seems to match the assembly code:

Code: [Select]

math_section:
    cli();
     380: f8 94       cli
    PORTD |= TEST_PIN_MASK;
     382: 5a 9a       sbi 0x0b, 2 ; 11
    timestamps[0] = TCNT1;                     ; cycles (total cycles)
     384: 80 91 84 00 lds r24, 0x0084          ; 2 (2)
     388: 90 91 85 00 lds r25, 0x0085          ; 2 (4)
     38c: 9a 83       std Y+2, r25             ; 2 (6)
     38e: 89 83       std Y+1, r24             ; 2 (8)
    int16p = int16a * int16b;
     390: 40 91 96 01 lds r20, 0x0196          ; 2 (10)
     394: 50 91 97 01 lds r21, 0x0197          ; 2 (12)
     398: 20 91 94 01 lds r18, 0x0194          ; 2 (14)
     39c: 30 91 95 01 lds r19, 0x0195          ; 2 (16)
     3a0: 42 9f       mul r20, r18             ; 2 (18)
     3a2: c0 01       movw r24, r0             ; 1 (19)
     3a4: 43 9f       mul r20, r19             ; 2 (21)
     3a6: 90 0d       add r25, r0              ; 1 (22)
     3a8: 52 9f       mul r21, r18             ; 2 (24)
     3aa: 90 0d       add r25, r0              ; 1 (25)
     3ac: 11 24       eor r1, r1               ; 1 (26)
     3ae: aa 27       eor r26, r26             ; 1 (27)
     3b0: 97 fd       sbrc r25, 7              ; 1/2/3 (28/29/30)
     3b2: a0 95       com r26                  ; 1 (29/30/31)
     3b4: ba 2f       mov r27, r26             ; 1 (30/31/32)
     3b6: 80 93 90 01 sts 0x0190, r24          ; 2 (32/33/34)
     3ba: 90 93 91 01 sts 0x0191, r25          ; 2 (34/35/36)
     3be: a0 93 92 01 sts 0x0192, r26          ; 2 (36/37/38)
     3c2: b0 93 93 01 sts 0x0193, r27          ; 2 (38/39/40)


I can provide the rest of the assembly code if anyone is interested.  I mostly generated it to be sure the optimizer wasn't doing anything to affect the measurements.

At this point I'm pretty sure my measurements are correct; I'm just not sure where the discrepancy with the previous posters comes from.

Edited to Add:  I didn't realize this was a resurrected thread with 9618 views!

el_supremo

The result in (1) is correct. -2607 = -15961 * -13369
You multiplied two 16-bit integers and stored the result in a 16-bit integer. If you want the result to fit in a 16-bit integer, you'll have to pick two smaller numbers to multiply.

Pete

BigBobby

Quote from: el_supremo
The result in (1) is correct. -2607 = -15961 * -13369
You multiplied two 16-bit integers and stored the result in a 16-bit integer. If you want the result to fit in a 16-bit integer, you'll have to pick two smaller numbers to multiply.
No, I stored all of my products in int32s:
Code: [Select]

volatile int16_t int16a, int16b;
volatile int32_t int16p, int16_cast_p, int16_macro_p;


There are embedded compilers, by the way, that will keep the upper 16 bits from an int16 * int16 multiply as long as you use a 32-bit variable to store the product.  They have to list it in the "deviations from the standard" section of their manual, but they're usually proud of doing it.

el_supremo

Oh right. The result being stored in a 32-bit integer is irrelevant, though. In C, the result of multiplying a 16-bit int by a 16-bit int is an int, which is 16 bits on the AVR, so the upper bits are already gone before the assignment happens.
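In code, the fix is simply to widen at least one operand before the multiply, which is what BigBobby's INT16_CAST_MUL case does (values borrowed from the run posted above):

Code: [Select]
int16_t a = -15961, b = -13369;
int32_t wrong = a * b;             // 16-bit multiply; on the AVR the upper bits are lost: -2607
int32_t right = (int32_t)a * b;    // b is promoted too, so a full 32-bit product: 213382609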

Pete
