Efficient 8 bit division

What's the most efficient way to divide an unsigned 8-bit integer variable by a fixed unsigned 8-bit value on an atmega328? Successive subtraction? Binary tree? 256 byte lookup table?

E.G. uint8_t result, variable; result = variable / 25;

check the divmod10 discussion - http://forum.arduino.cc/index.php?topic=167414.0 -

a 256 lookup table will be fastest, and might be put in PROGMEM

a multiply shift is also often faster - http://www.codeproject.com/Articles/17480/Optimizing-integer-divisions-with-Multiply-Shift-i

which "const" division do you have in mind?

Thanks, that last suggestion could be the trick I was hoping to find. The result doesn't have to be exact, just pretty close. The constant divisor will either be 26 or 13 and the dividend anything from 0 to 255. This is within an ISR so I am motivated to make it short.

I was half-joking about the 256 byte lookup table though. It might be fastest but too expensive for me.

You can always divide by 13 and then shift by 2 if required.

A giant switch/case would work, too. You could use the gcc .. extension to cut down on the number of cases. You should only need 256/13 about 20 cases.

KeithRB: You can always divide by 13 and then shift by 2 if required.

I don't understand. How would dividing and shifting be any faster than just dividing?

If you come up with a super efficient divide by 13 algorithm, a simple test and shift gets you divide by 26, so you don't need two lookup tables or whatever.

Timing things like this is very confusing with the optimizing compiler silently doing its thing. It loves to throw away my code if it decides I don’t really want it, even if I do. I don’t believe the numbers I’m getting.

jboyton: Timing things like this is very confusing with the optimizing compiler silently doing its thing. It loves to throw away my code if it decides I don't really want it, even if I do. I don't believe the numbers I'm getting.

No, it throws it away if it does not have a lasting effect. A do nothing loop has no lasting effect. just use the index outside of the loop.

n / 13 <==> n * 79 >> 10

n / 26 <==> n * 79 >> 11

uint32_t start;
uint32_t stop;

volatile uint8_t t;

void setup() 
{
  Serial.begin(115200);
  Serial.println("Start ");

  start = micros();
  for (uint8_t i=0; i<255; i++)
  {
    t = i/26;
  }
  stop = micros();
  Serial.println(stop-start);

  start = micros();
  for (uint8_t i=0; i<255; i++)
  {
    t = i * 79 >> 11;
  }
  stop = micros();
  Serial.println(stop-start);

  start = micros();
  for (uint8_t i=0; i<255; i++)
  {
    t = i * 79 >> 10;
  }
  stop = micros();
  Serial.println(stop-start);
  Serial.println();

  for (uint8_t i=0; i<255; i++)
  {
    Serial.print(i/26);
    Serial.print("\t");
    Serial.print(i * 79 >> 11);
    Serial.print("\t");
    Serial.println(i/26 - (i * 79 >> 11) );
  }

  for (uint8_t i=0; i<255; i++)
  {
    Serial.print(i/13);
    Serial.print("\t");
    Serial.print(i * 79 >> 10);
    Serial.print("\t");
    Serial.println(i/13 - (i * 79 >> 10) );
  }
}

void loop() 
{
}

output in microseconds
1460 - normal division 26 = 5.73 usec
272 - shift multiply 26 = 1.07 usec
260 - shift muliply 13 = 1.02 usec

so factor ~5.3 faster (tested on IDE 1.5.4. UNO)

fast enough?

KeithRB: No, it throws it away if it does not have a lasting effect. A do nothing loop has no lasting effect. just use the index outside of the loop.

Right, but what is a "lasting effect"? I wanted to time a piece of code and the result would have had a lasting effect for me.

If you don't reference the result in a way that the compiler thinks is lasting, like with a Serial.print(), then the compiler discards it. Or if you load the input value to an operation with a fixed constant the compiler calculates the result at compile time and again discards the code. So I have to do a little extra dancing to get the compiler to do what I want. In one case I found that code that I had wrapped with micros() calls was moved outside of the two calls.

It would be nice at times like this to turn off optimization. How is this done?

robtillaart: output in microseconds 1460 - normal division 26 = 5.73 usec 272 - shift multiply 26 = 1.07 usec 260 - shift muliply 13 = 1.02 usec

so factor ~5.3 faster (tested on IDE 1.5.4. UNO)

fast enough?

Thanks for doing that. I had close to the same results last night but kept getting different answers depending on how I arranged to fool the compiler into doing the timing (it often fooled me instead).

If I cast the operation it is about 20% faster. And if I time it with timer1 instread of micros() it appears to be faster still.

On my machine (IDE 1.05) using micros() to time:

5.68 us - division by 26 1.02 us - multiply/shift (79, >>10) 0.82 us - multiply/shift (79, >>10) with casting

Using timer1 (no prescale) to time:

5.62 us - division by 26 0.85 us - multiply/shift (79, >>10) 0.67 us - multiply/shift (79, >>10) with casting

I think that's fast enough for my needs, assuming what I'm seeing is real.

jboyton:
I think that’s fast enough for my needs, assuming what I’m seeing is real.

I realized I was shifting by 10 instead of 11. So I repeated the timings for >>11:

Using micros():

5.68 us - division by 26
1.13 us - multiply/shift (79, >>11)
0.91 us - multiply/shift (79, >>11) with casting

Using timer1 (no prescale):

5.62 us - division by 26
4.12 us - multiply/shift (79, >>11) <----- ? ? ?
0.73 us - multiply/shift (79, >>11) with casting

More weirdness with that second to last one.
Why would an extra shift cause the timing to go haywire?

what board are you testing?

How does your casted formula looks like?

It might be that the shift >> 11 is not optimized well.

can you post the code that reproduced that haywire value?

Board/software: Arduino Uno R3; Mac OS X 10.9.5; IDE 1.0.5

Cast: result = uint16_t(i * 79) >> 11;

I have no idea what is going on. When I run the code shown below I get:

4.122 us – shift 11
0.878 us – shift 10
0.729 us – shift 11 with cast

Note the baud rate is 115200. If I change it to 9600 I get this instead:

3.983
0.838
0.689

I can also significantly change the outcome by merely rearranging my code in seemingly innocuous ways. That all seems a little weird to me.

void setup() {

  uint16_t t1Start, t1Stop;
  uint8_t result;
  uint8_t loops;
  int elapsedCycles;
  float elapsedTime;
  
  Serial.begin(115200);
  Serial.println("\nHello.\n");
  
  TIMSK1 = 0;
  TCCR1A = 0;
  TCCR1B = 1;
    
  loops = 255;

  t1Start = TCNT1;
  for (uint8_t i=0; i<loops; i++) {
    result = (i * 79) >> 11;
  }
  t1Stop = TCNT1;
  
  elapsedCycles = t1Stop - t1Start;
  elapsedTime = float(elapsedCycles) / 16.0;
  Serial.println("\n(i * 79) >> 11");
  Serial.print("result = "); Serial.println(result);
  Serial.print(t1Start); Serial.print(","); Serial.println(t1Stop);
  Serial.print("cycles: "); Serial.println(elapsedCycles);
  Serial.print("time (us): "); Serial.println(elapsedTime,3);
  Serial.print("loops: "); Serial.println(loops);
  Serial.print("time/loop (us): "); Serial.println(elapsedTime/float(loops),3);


  t1Start = TCNT1;
  for (uint8_t i=0; i<loops; i++) {
    result = (i * 79) >> 10;
  }
  t1Stop = TCNT1;

  elapsedCycles = t1Stop - t1Start;
  elapsedTime = float(elapsedCycles) / 16.0;
  Serial.println("\n(i * 79) >> 10");
  Serial.print("result = "); Serial.println(result);
  Serial.print(t1Start); Serial.print(","); Serial.println(t1Stop);
  Serial.print("cycles: "); Serial.println(elapsedCycles);
  Serial.print("time (us): "); Serial.println(elapsedTime,3);
  Serial.print("loops: "); Serial.println(loops);
  Serial.print("time/loop (us): "); Serial.println(elapsedTime/float(loops),3);



  Serial.println("\nuint16_t(i * 79) >> 11");
  t1Start = TCNT1;
  for (uint8_t i=0; i<loops; i++) {
    result = uint16_t(i * 79) >> 11;
  }
  t1Stop = TCNT1;

  elapsedCycles = t1Stop - t1Start;
  elapsedTime = float(elapsedCycles) / 16.0;
  Serial.print(t1Start); Serial.print(","); Serial.println(t1Stop);
  Serial.print("result = "); Serial.println(result);
  Serial.print("cycles: "); Serial.println(elapsedCycles);
  Serial.print("time (us): "); Serial.println(elapsedTime,3);
  Serial.print("loops: "); Serial.println(loops);
  Serial.print("time/loop (us): "); Serial.println(elapsedTime/float(loops),3);
}

void loop(){}

edit: Thinking about it, I suspect that the serial transmit is happening asynchronously during the succeeding timing loop, delaying it slightly.

the number 79 is signed making the (i*79) signed but that is also true for >> 10....

(i * (uint16_t)79) >> 11; ==> time/loop (us): 0.723

De shift 10 version is also one clock faster when casting.

result = uint16_t(i * 79) >> 10; ==> time/loop (us): 0.661

edit: Thinking about it, I suspect that the serial transmit is happening asynchronously during the succeeding timing loop, delaying it slightly.

you can circumvent that with a delay(3000) or so

(i * 79) >> 11 result = 9 27733,43901 cycles: 16168 time (us): 1010.500 loops: 255 time/loop (us): 3.963

(i * 79) >> 10 result = 19 6559,9879 cycles: 3320 time (us): 207.500 loops: 255 time/loop (us): 0.814

uint16_t(i * 79) >> 11 2476,5286 result = 9 cycles: 2810 time (us): 175.625 loops: 255 time/loop (us): 0.689

met shift 10 gecast

(i * 79) >> 10
result = 19
6559,9114
cycles: 2555
time (us): 159.688
loops: 255
time/loop (us): 0.626

robtillaart: you can circumvent that with a delay(3000) or so

Yes, thank you, I already realized that. Timing a few instructions seems so simple on the surface.

It looks like the multiply/shift is the ticket. Sub 1 us is what I was hoping for. Thank you again.

That said, I think I will spend the time to better understand the gcc inline asm syntax.

To prevent the compiler optimizing the timing loop, I used a random() to get another input value per iteration of the loop. The timing is much larger, but by looking at the diffs one can see the time less used by the optimized versions.

uint32_t start;
uint32_t stop;

volatile uint8_t t;

void setup() 
{
  Serial.begin(115200);
  Serial.println("Start ");

  randomSeed(analogRead(A0));

  uint8_t n = random(255);
  delay(1000);
  start = micros();
  for (uint8_t i=0; i<255; i++)
  {
    t = n/26;
    n = random(255);
  }
  stop = micros();
  Serial.println(stop-start);

  delay(1000);
  start = micros();
  for (uint8_t i=0; i<255; i++)
  {
    t = n * 79 >> 11;
    n = random(255);
  }
  stop = micros();
  Serial.println(stop-start);

  delay(1000);
  start = micros();
  for (uint8_t i=0; i<255; i++)
  {
    t = n * 79 >> 10;
    n = random(255);
  }
  stop = micros();
  Serial.println(stop-start);
  Serial.println();

  for (uint8_t i=0; i<255; i++)
  {
    Serial.print(i/26);
    Serial.print("\t");
    Serial.print(i * 79 >> 11);
    Serial.print("\t");
    Serial.println(i/26 - (i * 79 >> 11) );
  }

  for (uint8_t i=0; i<255; i++)
  {
    Serial.print(i/13);
    Serial.print("\t");
    Serial.print(i * 79 >> 10);
    Serial.print("\t");
    Serial.println(i/13 - (i * 79 >> 10) );
  }
}

void loop() 
{
}

Start
36444 => reference
35304 => 1140/255 ~4.5 usec faster
35296 => 1178/255 ~4.6 usec faster

so the mul/shift is between 4 and 5 usec faster than a division

[did not test the casting variation]