Fastest U32 divide by 10 routine

I've been tinkering with running a 328p at 20MHz. The boards running at 16MHz use a divide-by-64 prescaler on Timer0. The micros() routine prefixes the current timer value with the overflow count and then multiplies by four to convert to microseconds. In much the same way, I need to multiply by 3.2 to calculate microseconds.
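(For reference, the arithmetic behind that factor: with the divide-by-64 prescaler one Timer0 tick is 64 CPU cycles, so)

    64 / 16 MHz = 4.0 us per Timer0 tick  ->  micros = ticks * 4
    64 / 20 MHz = 3.2 us per Timer0 tick  ->  micros = ticks * 3.2 = ticks * 16 / 5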

I figured the easiest way to do this, in the least number of cycles, was to multiply by 32 and then divide by 10. Well, the multiply is really fast since it is just a 5-bit shift to the left. On the other hand, the divide is one expensive operation. The standard library takes 38uS to do the whole function call, up from 4uS on the 16MHz boards. I've been able to get the divide by 10 operation down to the point that the whole function call takes ~15.8uS.

Anybody know of anything faster? That's only about 300 cycles, which seems pretty quick. micros() is the only thing really suffering from the 20MHz speed boost, at least that I've found so far. All the other system timing seems pretty good. 25% faster is 25% faster.

Did you try multiply by 16 then divide by 5?

fungus: Did you try multiply by 16 then divide by 5?

Tried a couple of routines. The first one was slower, but the second one shaved almost 1uS (20 cycles) off the time. It clocks in at ~14.8uS per call. Thanks for the idea.

Another idea: 205/2048 is almost 0.1 (~0.1% error); might that be acceptable?

x = ((t << 5) * 205L) >> 11;

what does it clock?
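(A quick way to see how far that approximation drifts from a true divide by 10 - a sketch, not clocked on hardware, and the loop range is only a sample:)

    // accuracy check of ((x << 5) * 205) >> 11 against (x * 32) / 10
    void setup()
    {
      Serial.begin(115200);
      unsigned long worst = 0;
      for (unsigned long t = 0; t < 100000UL; t++)
      {
        unsigned long approx = ((t << 5) * 205UL) >> 11;   // *32, then ~/10
        unsigned long exact  = (t * 32UL) / 10UL;          // reference
        unsigned long diff   = (approx > exact) ? approx - exact : exact - approx;
        if (diff > worst) worst = diff;
      }
      Serial.print("worst absolute error: ");
      Serial.println(worst);
    }

    void loop() {}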

This is the core library function micros(). I'm trying to get it to return an accurate result as efficiently as possible. I appreciate your help, but I really want to keep the error as close to zero as possible, hence the multiply by 3.2.

Fungus' idea will allow me to double the time between rollovers. Not really much of a concern, especially since I have it set right now to time out after about 53 seconds for rollover testing purposes. Nothing that I have tested so far seems to be bothered by it. Omniboot and ArduinoISP work fine and all timer functions seem to be working properly. It's just this large increase in the amount of time that micros() takes to run vs. a 16MHz board, where the calculation is super lightweight since it ends up being some simple shifting.

The routine I have probably wouldn't strike anyone as being particularly efficient since it has several lines of code vs a single line. It's a bunch of shifting and a multiply or two, but it takes only about 1/3 the time of the regular integer divide by 10. Here is the routine to divide an unsigned long by 5:

z, q and r are unsigned long; m is the Timer0 overflow count; and t is the Timer0 current count

    z = (((m << 8) | t) * 16);   // ticks * 16
    q = (z >> 1) + (z >> 2);     // q ~ 0.75 * z
    q = q + (q >> 4);            // the next three lines scale q by ~16/15,
    q = q + (q >> 8);            //   so q ~ 0.8 * z
    q = q + (q >> 16);
    q = q >> 2;                  // q ~ z / 5
    r = z - q*5;                 // remainder of the estimate
    return q + (7*r >> 5);       // final rounding correction

Hard to believe that is three times faster than just doing:

return(((m<<8)|t) * 32) / 10;

http://stackoverflow.com/questions/2033210/c-fast-division-mod-by-10x

@afremont
I've seen that code before in the book Hacker's Delight - http://www.hackersdelight.org/divcMore.pdf - a great book for optimizing!

I think you can simplify it too:

// z = (((m << 8) | t) * 16);      
z = (m << 8) | t;   // removed factor 16

//  q = (z >> 1) + (z >> 2);         
q = (z << 1) + z;   // corrected factor 4
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
// q = q >> 2;    // corrected factor 4
r = (z << 4) - q*5;
return q + (7*r >> 5);

Performance?

Assuming it didn't break anything, you took almost another 1.5uS off it (~10%). 10,000 calls takes just over 132mS making for a function call and return time of 13.2uS now including loop overhead. Looks good, but I still have to do an exhaustive test to make sure every value works. Not something I would always do with an algorithm, but with optimized code like this, it's a necessity.
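(For the exhaustive test, a sweep along these lines is one option - a sketch; the range below is only a sample, since a full 32-bit sweep takes a very long time on an AVR:)

    // correctness sweep: shift-based divide by 5 vs the plain divide
    // (the wrapper name is mine, not anything from the core)
    unsigned long div5_shift(unsigned long z)
    {
      unsigned long q, r;
      q = (z >> 1) + (z >> 2);
      q = q + (q >> 4);
      q = q + (q >> 8);
      q = q + (q >> 16);
      q = q >> 2;
      r = z - q * 5;
      return q + (7 * r >> 5);
    }

    void setup()
    {
      Serial.begin(115200);
      unsigned long bad = 0;
      for (unsigned long z = 0; z < 500000UL; z++)
      {
        if (div5_shift(z) != z / 5) bad++;
      }
      Serial.print("mismatches: ");
      Serial.println(bad);      // should print 0 if the routine is exact
    }

    void loop() {}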

That's really neat that you can apparently visualize these functions in your head. I don't really understand how you took the multiply away. I mean you obviously accomplish it, but I don't fully get how you do it after starting to work on the divide? You don't appear to lose any precision. How does that work?

afremont: That's really neat that you can apparently visualize these functions in your head. I don't really understand how you took the multiply away. I mean you obviously accomplish it, but I don't fully get how you do it after starting to work on the divide? You don't appear to lose any precision. How does that work?

// don’t fear the bits :wink:

First: the multiply by 16 is a shift left by 4 positions.

Then the next thing the code did was to shift right 1 and 2 positions:
q = (z >> 1) + (z >> 2);
which is (partially) needless shifting.

so I first removed the * 16 and corrected that in the second line, creating these 2 lines

// z = (((m << 8) | t) * 16);      
z = (m << 8) | t;   // removed factor 16
//  q = (z >> 1) + (z >> 2);         
q = (z << 3) + (z << 2);   // corrected factor 16

Then I noticed the line q = q >> 2, which is a divide by 4 and effectively throws away the last two bits.

I saw that this newly created line: q = (z << 3) + (z << 2); complemented the line q = q >> 2, at least for one term.
As the lines in between are "multiplications", I could merge these two lines.

multiplications intermezzo:
q = q + (q>>4) can be read like q = int(q * 1.0625);
q = q + (q>>8) can be read like q = int(q * 1.00390625);
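(Taken together, those correction lines multiply by roughly 16/15:)

    (1 + 1/16) * (1 + 1/256) * (1 + 1/65536) ~ 1.06666 ~ 16/15

(so the 12*z from q = (z << 3) + (z << 2) becomes ~12.8*z, and the q >> 2 at the end brings it down to ~3.2*z - which is why that shift can be folded into the first line.)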

So that brought me to:

// z = (((m << 8) | t) * 16);      
z = (m << 8) | t;   // removed factor 16  (shift left 4)
//  q = (z >> 1) + (z >> 2);         
// q = (z << 3) + (z << 2);   // corrected factor 16
q = (z << 1) + z; // corrected factor 4   (shift right 2)
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
// q = q >> 2;    // corrected factor 4  (shift right 2) 

r = z  - q*5;  //<<<<<<<<<<<<<<<<<<<<<<<<<<< still some trouble here as I had patched the original value of z
return q + (7*r >> 5);

r = z - q*5;
became
r = (z << 4) - q * 5; // bring in the factor 16 again that I removed from z in the first line

so the final code is

z = (m << 8) | t;   
q = (z << 1) + z; 
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
r = (z << 4) - q * 5; 
return q + (7*r >> 5);

Now, while I write this, I see one other possible optimization; the trick is that

q = q + (q >> 4) + (q >> 8) + (q >> 16);
q = q + 16 * (q >> 8) + (q >> 8) + (q >> 16); // (q >> 4) = 16 * (q >>8)

q = q + 17 * (q >> 8) + (q >> 16);
q = q + 4352 * (q >> 16) + (q >> 16); // (q >> 8) = 256 * (q >> 16)

q = q + 4353*(q >> 16);

That should bring the code to

z = (m << 8) | t;   
q = (z << 1) + z; 
q = q + 4353*(q >> 16);  
r = (z << 4) - q * 5; 
return q + (7*r >> 5);

I don’t know if it is faster - I even expect it is slower - please check
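(A rough timing sketch for comparing the two variants - untested here, and the absolute numbers will differ between 16MHz and 20MHz boards; the volatile sink keeps the compiler from optimizing the loops away:)

    volatile unsigned long sink;   // keeps the optimizer honest

    void setup()
    {
      Serial.begin(115200);

      // variant 1: the three shift/add correction lines
      unsigned long start = micros();
      for (unsigned long z = 1000000UL; z < 1010000UL; z++)
      {
        unsigned long q = (z << 1) + z;
        q = q + (q >> 4);
        q = q + (q >> 8);
        q = q + (q >> 16);
        unsigned long r = (z << 4) - q * 5;
        sink = q + (7 * r >> 5);
      }
      unsigned long tShifts = micros() - start;

      // variant 2: the single 4353 multiply
      start = micros();
      for (unsigned long z = 1000000UL; z < 1010000UL; z++)
      {
        unsigned long q = (z << 1) + z;
        q = q + 4353UL * (q >> 16);
        unsigned long r = (z << 4) - q * 5;
        sink = q + (7 * r >> 5);
      }
      unsigned long tMul = micros() - start;

      Serial.print("shift version: "); Serial.println(tShifts);
      Serial.print("4353 version : "); Serial.println(tMul);
    }

    void loop() {}

(Note that the 4353 form skips the (q >> 4) and (q >> 8) corrections entirely whenever q is below 65536, so it needs a correctness sweep as well as a stopwatch.)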


Comment: These conversations are among the most interesting and educational I've had the good fortune to read... Both you guys are great to read. Thank You...

Doc

I'll attempt to explain some of the mathematics behind this. Multiplying by 3.2 is, as fungus points out, the same as multiplying by 16 and dividing by 5.

Multiply top and bottom by 3:
x * 16/5 == x * 3 * (16/15)

This is useful because there are easy ways to multiply by 3 and by 16/15.

x=x*3 is the same as x+=(x<<1)

x=x*16/15 can be approximated as this:
x+=(x>>4);
x+=(x>>8);
x+=(x>>16);
// x+=(x>>32); // for u64

(The maths behind that is that 16/15 == 1 + 1/16 + 1/256 + 1/4096 + 1/65536 + … == 1 + 1/16 + 1/(16*16) + 1/(16*16*16) + 1/(16*16*16*16) + … == (1+1/16)*(1+1/256)*(1+1/65536)*… )

So the bulk of the calculation is just in the four lines:

x+=(x<<1);
x+=(x>>4);
x+=(x>>8);
x+=(x>>16);

which is very similar to the beginning of robtillaart’s optimized version (the rest of that code is error correction).

However, the above few lines alone are in fact never out by more than about 4. This can be improved to 2 with an addition and a subtraction:

x+=1;
x+=(x<<1);
x+=(x>>4);
x+=(x>>8);
x+=(x>>16);
x-=1;
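(If anyone wants to verify those bounds, a sweep like this will do it - a sketch, comparing against the truncated 16*i/5 over a sampled range:)

    // measure the worst-case error of the trimmed approximation
    void setup()
    {
      Serial.begin(115200);
      long worst = 0;
      for (unsigned long i = 0; i < 100000UL; i++)
      {
        unsigned long x = i + 1;     // the +1 ... -1 trimmed version
        x += (x << 1);
        x += (x >> 4);
        x += (x >> 8);
        x += (x >> 16);
        x -= 1;
        long err = (long)x - (long)((i * 16UL) / 5UL);
        if (err > worst)  worst = err;
        if (-err > worst) worst = -err;
      }
      Serial.print("max |error|: ");
      Serial.println(worst);
    }

    void loop() {}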

multiply by 32 and then divide by 10.

Hmm. Since multiply is faster than divide (in general, but especially on AVR), it would be significantly better to find a fraction that involves divide by a power-of-2, and multiply by an arbitrary constant. Though it doesn’t look like there is any exact value that will work :frowning:

x+=(x<<1);
x+=(x>>4);
x+=(x>>8);
x+=(x>>16);

Time to look at the assembly code produced. This is exactly the sort of math where a C compiler will produce code that is less efficient than it could be (in particular, it won't/can't notice that x>>8 and x>>16 are no longer 32-bit numbers). Hmm. Not that it saves a lot. It's the (x>>4) that is particularly expensive. (All those lovely "shift based" algorithms tend to get a bit questionable when your CPU can only shift 8 bits by one bit at a time.) (I think I can get rid of about 7 instructions from the 8/16-bit shifted math operations. But the shift-by-4-bits is 21 instructions by itself.)
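(For anyone who wants to follow along: the generated code can be inspected by running avr-objdump -d on the compiled .elf - the exact path depends on where your IDE puts the build output, so that part you'll have to find yourself.)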


stimmer:
I'll attempt to explain some of the mathematics behind this. Multiplying by 3.2 is, as fungus points out, the same as multiplying by 16 and dividing by 5.

The reason I mentioned that deep piece of wisdom is that I know the Arduino multiply/divide routines shift the operands right and exit when they become zero. Also, the CPU can only shift one bit at a time; more than that takes a loop.

On a desktop PC it wouldn't make any difference. On an AVR chip, fewer bits in a parameter means fewer passes through a loop.

stimmer: This is useful because there are easy ways to multiply by 3 and by 16/15.

The Mega328 has an 8x8 multiply instruction that takes two clock cycles. The very fastest code would probably use that, not bit-shifts.

You'll probably have to write it in assembly language though... :)
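(For what it's worth, avr-gcc should already emit the hardware MUL when it can see a genuine 8x8 multiply, so a C starting point may not need hand-written assembly - a sketch, with the types chosen to make the 8x8 product visible to the compiler:)

    // 8x8 -> 16 product; on the 328P avr-gcc should compile this down to
    // the 2-cycle MUL instruction plus a couple of register moves
    static inline uint16_t mul8x8(uint8_t a, uint8_t b)
    {
      return (uint16_t)a * (uint16_t)b;
    }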

I wanted to say thank you to everyone for their effort on this. I really appreciate the input. Special thanks to Fungus and RobT for making the code about 17% faster in short order.

Since the hardware does have a MUL instruction, I'm looking into the methods of division by multiplying by a "magic number". These might save a significant amount of time. While this method of shifting runs in approx 1/3 the time the library does when dividing by 10, it's still taking more than 250 cycles to do the function call. I'm not complaining mind you, as this is great, but with a hardware multiply capability I'm thinking that it might be possible to cut this time in half, or better.

Again, I really appreciate the effort. :)

@afremont

Since the hardware does have a MUL instruction, I’m looking into the methods of division by multiplying by a “magic number”.

205 might be the magic number you are looking for; have you clocked it?

(reply#3)

Another idea : 205/2048 is almost 0.1 (~0.1% error) might be acceptable?
x = ((t << 5) * 205L) >> 11;
what does it clock?

@robtillaart, hey man I feel like I slighted you on the thanks. I really like how you incorporated the multiply by 16 into the formula, making it more efficient. I thought that was pretty slick, nice job. :) Thanks again and I'll check out that ratio of yours. I'm sure it will run extremely fast, but as for the 1% error, this is one nail I'd like to hit squarely on the head if I can. It's not like it's something that runs all the time like the Timer0 overflow routine, so I'd rather concentrate on getting it exactly right if I can. I'm happy with what you've done so far in terms of how long it takes to execute. I think it's acceptable as is, but I'd love to make it faster without sacrificing accuracy.

0.1% (1 in 1000). Still interested in what it clocks on your 20MHz duino.
you might add it to the core as a fbi_micros(); // fast but inaccurate :wink: to be used when performance is an issue.

found this interesting blog today - http://blogs.msdn.com/b/devdev/archive/2005/12/12/502980.aspx -

in short:
dividing by 5 ==> multiplying by 3435973837 & 0xFFFFFFFF

so:
y = (x*16)/5; => y = (x << 4) * 3435973837 & 0xFFFFFFFF; // updated

worth a try

@afremont,

Your original problem was multiplying by 3.2.
I wrote down 2 pages of binary approximations and came up with this slight variation.

void setup()
{
  Serial.begin(115200);

  unsigned long start = micros();
  for (long i=0; i< 1000; i++)
  {
    volatile long x = (i<<5) + (i<<1);
    x += (x>>5);
    x += (x>>8);
    x += (x>>16);
    x = x - (i<<5);
    //    Serial.print(i);
    //    Serial.print('\t');
    //    Serial.print(x);
    //    Serial.print('\t');
    //    Serial.println(x*1.0/i, 2);
  }
  unsigned long duration = micros() - start;
  Serial.println(duration/1000.0);
  Serial.println(duration/1000.0*16/20);

}

void loop(){
}

it “fails” the first 366* micros [ mod(2^32) ] but after that it does a fine multiply x 3.2

* but quite accurate for integer math

runs 13.90 micros on my 16MHz Arduino ==> expect 11.12 micros on 20MHz

can you confirm?
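(For anyone decoding the shifts, the arithmetic works out roughly as:)

    (i << 5) + (i << 1) = 34 * i
    34 * (1 + 1/32) * (1 + 1/256) * (1 + 1/65536) ~ 35.2   (since 35.2 / 34 = 1.035294...)
    35.2 * i - (i << 5) = 35.2 * i - 32 * i = 3.2 * i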

FYI,
the formula proposed above (post #17)

volatile unsigned long x = ((i << 4) * 3435973837UL) & 0xFFFFFFFF;

seems to only work if i is a multiple of 5, don't know why … (I have not investigated the details)

i	x	x/i
0	0	nan
1	3435973840	3435973888.00
2	2576980384	1288490240.00
3	1717986928	572662336.00
4	858993472	214748368.00
5	16	3.20
6	3435973856	572662336.00
7	2576980400	368140064.00
8	1717986944	214748368.00
9	858993488	95443720.00
10	32	3.20
11	3435973872	312361248.00
12	2576980416	214748368.00
13	1717986960	132152840.00
14	858993504	61356680.00
15	48	3.20
16	3435973888	214748368.00
17	2576980432	151587088.00
18	1717986976	95443720.00
19	858993520	45210188.00
20	64	3.20
21	3435973904	163617808.00
22	2576980448	117135480.00
23	1717986992	74695088.00
24	858993536	35791396.00
25	80	3.20
26	3435973920	132152840.00
27	2576980464	95443720.00
28	1717987008	61356680.00
29	858993552	29620466.00
30	96	3.20
31	3435973936	110837864.00
32	2576980480	80530640.00
33	1717987024	52060216.00
34	858993568	25264516.00
35	112	3.20
36	3435973952	95443720.00
37	2576980496	69648120.00
38	1717987040	45210188.00
39	858993584	22025476.00
40	128	3.20

(it is really fast, ~2 micros, but faulty :frowning: )
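(A possible explanation for the multiples-of-5 behaviour, plus the usual fix - this is my reading of the magic-number method, not something taken from the blog post above:)

    // 3435973837 * 5 = 17179869185 = 4 * 2^32 + 1, so 3435973837 is the
    // multiplicative inverse of 5 modulo 2^32.  Multiplying by it can only
    // *undo* an exact multiplication by 5; it is not a general divide,
    // which is why it only works when i is a multiple of 5.
    //
    // The general magic-number divide keeps the high part of the product
    // and shifts: 3435973837 = ceil(2^34 / 5), so for any 32-bit x
    //
    //     x / 5 == (x * 3435973837) >> 34      (using a 64-bit product)
    //
    // A sketch - the 32x32 -> 64 multiply is probably not cheap on the AVR:
    unsigned long div5_magic(unsigned long x)
    {
      return (unsigned long)(((unsigned long long)x * 3435973837UL) >> 34);
    }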