# Arduino Forum

## Development => Other Software Development => Topic started by: robtillaart on May 20, 2013, 08:39 pm

Title: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on May 20, 2013, 08:39 pm
For a project I needed both / 10 and  %10 several times. Although the straightforward code was fast enough I investigated if it could be even faster. Could be useful e.g. for a library or so. This resulted in the following divmod10() function that calculates both the divide by 10 and the modulo 10 "simultaneously".

Note 1: the code is written for unsigned 32 bit math. It can not be used for signed numbers. One can derive a 16 bit version by removing one line and making the types 16 bit.

Note 2: tested only on a UNO
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  // q = in * 0.8;  uint32_t q = (in >> 1) + (in >> 2);  q = q + (q >> 4);  q = q + (q >> 8);  q = q + (q >> 16);  // not needed for 16 bit version  // q = q / 8;  ==> q =  in *0.1;  q = q >> 3;    // determine error  uint32_t  r = in - ((q << 3) + (q << 1));   // r = in - q*10;  div = q + (r > 9);  if (r > 9) mod = r - 10;  else mod = r;}`

The code is based upon the divu10() code from the book Hackers Delight1. My insight was that the error formula in divu10() was in fact modulo 10 but not always. Sometimes it was 10 more.  I wrote a small test-sketch that is not exhaustive but gives quite some confidence. The performance of the divmod10() is very good, the output of the test-sketch determines timing in microseconds per call.

i/10      38.6920
i%10      38.6120
divmod10   14.9240

Divmod10() is more than twice the speed of /10  or  %10. When both are needed the gain is approx a factor 5, or ~60 micros per calculation.

Test-sketch
Code: [Select]
`////    FILE: divmod10.ino//  AUTHOR: Rob Tillaart//    DATE: 2013-05-20//// PUPROSE: fast divide and modulo by 10 in one function//// Code based upon book hackers delight// unsigned long start = 0;unsigned long stop = 0;volatile unsigned long q;void setup(){  Serial.begin(115200);  Serial.println("testing divmod10()\n");  Serial.print("i/10\t");  start = micros();  for (unsigned long i = 0; i < 1000; i++)  {    q = i/10;  }  stop = micros();  Serial.println((stop - start)/1000.0, 4);  Serial.print("i%10\t");  start = micros();  for (unsigned long i = 0; i < 1000; i++)  {    q = i%10;  }  stop = micros();  Serial.println((stop - start)/1000.0, 4);  Serial.println("--------------");  Serial.print("divmod10\t");  start = micros();  uint32_t m = 0;  uint32_t d = 0;  for (unsigned long i = 0; i < 1000; i++)  {    divmod10(i, d, m);  }  stop = micros();  Serial.println((stop - start)/1000.0, 4);  Serial.println("--------------");  Serial.println("test div 0..65535");  for (unsigned int i = 0; i< 65535; i++)  {    uint32_t a = i/10;    uint32_t d;    uint32_t m;    divmod10(i, d, m);    if ( abs(a - d) > 0)    {      Serial.print(i);      Serial.print("\t");      Serial.print(a);      Serial.print("\t");      Serial.println(d);    }  }  Serial.println("test mod 0..65535");  for (unsigned int i = 0; i < 65535; i++)  {    uint32_t a = i%10;    uint32_t d;    uint32_t m;    divmod10(i, d, m);    if ( abs(a - m) > 0)    {      Serial.print(i);      Serial.print("\t");      Serial.print(a);      Serial.print("\t");      Serial.println(m);    }  }  Serial.println("test div 0..maxlong (20K random samples)");  for (unsigned int i = 0; i< 20000; i++)  {    uint32_t n = random(0xFFFFFFFF);    uint32_t a = n/10;    uint32_t d;    uint32_t m;    divmod10(n, d, m);    if ( abs(a - d) > 0)    {      Serial.print(n);      Serial.print("\t");      Serial.print(a);      Serial.print("\t");      Serial.println(d);    }  }  Serial.println("test mod 0..maxlong (20K random samples)");  for (unsigned int i = 0; i< 20000; i++)  {    uint32_t n = random(0xFFFFFFFF);    uint32_t a = n%10;    uint32_t d;    uint32_t m;    divmod10(n, d, m);    if ( abs(a - m) > 0)    {      Serial.print(n);      Serial.print("\t");      Serial.print(a);      Serial.print("\t");      Serial.println(m);    }  }  Serial.println("done");}void loop(){}void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t q = (in >> 1) + (in >> 2);  q = q + (q >> 4);  q = q + (q >> 8);  q = q + (q >> 16);  q = q >> 3;  uint32_t  r = in - ((q << 3) + (q << 1)); // r = in - q*10;  div = q + (r > 9);  if (r > 9) mod = r - 10;  else mod = r;}`

As always comments, remarks are welcome.

1 - http://www.amazon.com/Hackers-Delight-Edition-Henry-Warren/dp/0321842685/ -
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Jantje on May 20, 2013, 09:22 pm
you keep on surprising me  :D
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on May 21, 2013, 03:36 am

No timing comparison to the[font=Courier New] div / ldiv [/font]functions?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on May 21, 2013, 11:05 pm
Not thought of div and ldiv. So wrappeed them into the test sketch :

testing divmod10()

i/10   38.6920
i%10   38.6120
--------------
ldiv   44.7880
--------------
div   15.3960
--------------
divmod10   14.9200
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

ldiv is very interesting as for almost the same performance one gets / & %

div is 16 bit and signed, where the rest is 32 bit (unsigned), still interesting numbers.
Should make a 16 bit test version.

Code: [Select]
`  Serial.print("ldiv\t");  start = micros();  ldiv_t ld;  volatile long x;  for (unsigned long i = 0; i < 1000; i++)  {    ld = ldiv(i,10);  }  stop = micros();  Serial.println((stop - start)/1000.0, 4);  Serial.println("--------------");  x = ld.quot;  Serial.print("div\t");  start = micros();  div_t id;  for (unsigned long i = 0; i < 1000; i++)  {    id = div(i,10);  }  stop = micros();  Serial.println((stop - start)/1000.0, 4);  Serial.println("--------------");  x = id.quot;`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on May 25, 2013, 01:33 pm
Hi robtillaart,

I wondered why the correction term (and therefore if-then-else) is needed. (This slows down performance.) I tried to find a precise approximation and came up with this:
Code: [Select]
`// for in between 0x0 .. 0x7fffffffvoid divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t x=in>>1; // div = in/10 <==> div = ((in/2)/5)  uint32_t q=x;  q = x + (q >> 8);  q = x + (q >> 8);  x++;  q = x + (q >> 8);  q = x + (q >> 8);  q = 3*q;  div = (q + (q >> 4)) >> 4;  mod = in - ((div << 3) + (div << 1));}`

On my Duemilanove it's faster then the divu10() code from the book Hackers Delight you gave. Maybe because no if-the-else is needed or shift by 8 is good for the optimizer. But I have no insight into the resulting assembler code.

Its a somewhat "weird" solution, again.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on May 25, 2013, 03:28 pm

i/10   38.6920
i%10   38.6120
--------------
ldiv   44.7880
--------------
div   15.3960
--------------
divmod10   14.9200  ( )
--------------
divmod10 2   13.6040  ( in/2/5 version)
--------------

That is 10% faster. Yes! almost 3x faster than the native one
your comments states it works for 0..0x7FFFFFFF  that is half of the possible numbers.
My 20 K random test show it seems to work also above it (akthough 20K samples is no proof)

- can it be faster?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on May 25, 2013, 04:22 pm
inspired by 10 = 2 x 5 as you did  I found this one

Code: [Select]
`void divmod10_2(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t x = in>>1;  uint32_t q = x;  q = x + (q >> 8);  q = x + (q >> 8);  x++;  q = x + (q >> 8);  q = x + (q >> 8);  q = 3*q;  div = (q + (q >> 4)) >> 4;  mod = in - (((div << 2) + div)<<1);  // <<<<<<<<<<< this line changed, in total one shift less.}`

it resulted in : 13.1600    another slice off

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on Jun 01, 2013, 01:44 pm
Yes, constrain 0..0x7FFFFFFF was needed because it only works in this interval.

BTW: Your test with random() is useless for testing in 0..0xFFFFFFFF because it produces only values between 0..RANDOM_MAX (with RANDOM_MAX = 0x7FFFFFFF). ( See http://www.nongnu.org/avr-libc/user-manual/group__avr__stdlib.html )

"can it be faster?"
Yes, it can!

I took the challenge and found a solution which is the fastest (so far on my atmega328p) and finally works on the complete interval 0..0xFFFFFFFF.
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){ uint32_t x=(in>>1); // div = in/10 <==> div = ((in/2)/5) uint32_t q=x; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x + 1; x= q; q= (q>>1) + x; q= (q>>3) + x; q= (q>>1) + x; div = (q >> 3); mod = in - (((div << 2) + div) << 1);}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 01, 2013, 08:16 pm
divmod10 ==> 12.6600

impressive!
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: ajk_ on Jun 06, 2013, 10:09 pm
Make them q and x register variables, it should go even faster.  :smiley-mr-green:
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 06, 2013, 10:24 pm
tried adding register to q and x, but no success,

Think the compiler either already optimizes the vars as they are not volatile, or the compiler ignores the keyword.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pYro_65 on Jun 07, 2013, 04:59 pm
I was curious so I had a quick browse.

According to the C++ standard,

Quote
D.4 register keyword [depr.register]
1 The use of the register keyword as a storage-class-specifier (7.1.1) is deprecated.

And 7.1.1:
Quote
The register speci?er shall be applied only to names of variables declared in a block (6.3) or to function parameters (8.4). It speci?es that the named variable has automatic storage duration (3.7.3). A variable declared without a storage-class-speci?er at block scope or declared as a function parameter has automatic storage duration by default.

So to summarize, register has the same semantics as auto. Local variables are auto by default, therefore register is completely redundant in C++.

In C, it simply stops you from taking the address as this isn't allowed, whereas if done in C++, the register semantics are simply ignored.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: ajk_ on Jun 08, 2013, 02:12 am
Nice to know that G++ does this. Not all compilers to though.
It does not mean that the code is incorrect in any case.
I guess it depends on the compiler and optimizer.

Cool hack none-the-less.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pYro_65 on Jun 08, 2013, 02:37 am
Quote
Nice to know that G++ does this. Not all compilers to though.

Thats not GCC documentation, its the ISO C++99 standard. The same goes for C++11.
Optimizing compilers are able to pick register candidates far better than us. Even then 'register', like 'inline' is only a hint, and can be ignored.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on Jun 10, 2013, 11:46 pm
Sorry for this. I don't want to abuse this thread for progress reports on senseless optimization. But, please excuse me ... I'm in a kind of flow.

Another ~1.5% off:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){ uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = (0.75*in) >> 3 uint32_t q= (x>>4) + x; x= q; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x; div = (q >> 3); mod = in - (((div << 2) + div) << 1);}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pYro_65 on Jun 11, 2013, 03:19 am

Sorry for this. I don't want to abuse this thread for progress reports on senseless optimization. But, please excuse me ... I'm in a kind of flow.

I don't think you're in danger of that, I'm sure robtillaart will enjoy testing your approach. Its not senseless either, someone one day will need to squeeze in a few extra repetitions that the standard method can't do in time.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 11, 2013, 05:52 pm
Well done genom2!!

On my "reference UNO" it clocks 12.4760

I looked at the code and I could remove 2 lines and it still passes the random test..UPdate : false positve see below

timing becomes a staggering 11.4680
Code: [Select]
`void divmod10_4(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = (0.75*in) >> 3  uint32_t q= (x>>4) + x;  // 0.796875 *in  x= q;  q= (q>>8) + x;     // 0.799987793 * in  q= (q>>8) + x;     // 0.803112745 * in  //q= (q>>8) + x;  //q= (q>>8) + x;  div = (q >> 3);  mod = in - (((div << 2) + div) << 1);}`

did a extended 1000K test

divmod10       11.4680
--------------
test div 0..65535
test mod 0..65535
test divmod 0..maxlong (1000K random samples)
done

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on Jun 11, 2013, 09:49 pm
No, you can't remove the 2 lines! As you said, a random test is no proof!

Take this number 7031721 as a counter example:
7031721/10 expected 703172 but your modification returns 703171

But an inconspicuous modification on calculating mod (in the last line) does the trick and another ~8% off:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){ uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8 uint32_t q= (x>>4) + x; x= q; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x;        q= (q>>8) + x; div = (q >> 3); mod = in - ((q & ~0x7) + (div << 1));}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 11, 2013, 10:29 pm
Great counter example!
and a very smart way to mod --> 11.7200  impressive !
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: ajk_ on Jun 12, 2013, 07:19 am
I don't have that exact Arduino model, but I'm curious -- if rotates happen one bit at a time, and take 2 instructions per bit, or 16 operations to rotate, perhaps a copy is faster? What do you get with this?  :smiley-mr-green:

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){        union {                uint32_t q;                uint8_t qq;        };        union {                uint32_t r;                uint8_t rr;        };        r = 0;        uint32_t x = (in|1) - (in >> 2); // div = in/10 <~~> div = (0.75*in) >> 3        q = (x >> 4) + x;        x = q;        rr = qq;        rr = qq;        rr = qq;        q = r +x        rr = qq;        rr = qq;        rr = qq;        q = r + x        rr = qq;        rr = qq;        rr = qq;        q = r + x        rr = qq;        rr = qq;        rr = qq;        q = r + x        div = (q >> 3);        mod = in - (((div << 2) + div) << 1);}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 12, 2013, 07:38 pm
fixed some missing ;

divmod10       14.5480

Problem might be that the index[] costs extra, with pointer math one can solve that. Have no time to try ...
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jun 12, 2013, 07:58 pm
Problem might be that the index[] costs extra, with pointer math one can solve that.

Costs twice.  Pointers / arrays tie up a register pair that could otherwise be used to hold a 16 bit value.  In the extreme case, this forces intermediate values to be stored on the stack.  Adding pointers / arrays typically causes minor problems for the optimizer (data has to be moved between registers).

Shifts by multiples of 4 bits (4, 8, 12, 16, ...) are generally not very expensive on AVR processors with the GCC compiler.  The compiler "knows" the results of a multiple-of-8 shift are ready-to-use in a another register (essentially free; basically what @ajk_ is doing).  The AVR processor has a "nibble swap" instruction that helps with multiple-of-4 shifts (not free but about the same price as a shift by 2 bits).
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: ajk_ on Jun 14, 2013, 06:57 am
Bummer, that is possibly why doing the same in assembler whips.
I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.
Looking at the AVR GCC optimizer 'bug list' there are a pile of missed opportunities to optimize.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jun 14, 2013, 08:21 am
I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.

Apparently I did not word my post very well.  The point is that the compiler does optimize 8, 16, 24 bit shifts.  They are optimized to the point of being essentially free.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 14, 2013, 09:45 pm
at the cost of 22bytes of flash:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8  uint32_t x;    asm volatile(    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );    q += x;    x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  x = (q >> 2);  mod = in - ((q & ~0x7) + (x & ~0x1)); //the 'x' here used to be: ((q>>3)<<1), but now it is just ((q>>2) & 0xFFFFFFFE) which result in the same value  div = (x >> 1); //div used to be (q>>3). It is now ((q>>2)>>1) which is the same.}`

I've basically targeted the '>>4' which was being done in a loop of 4 shifts.

I have also modified the last lines to reduce the number of bit shifts.

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   10.2680
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

Thats a further 12.39% reduction.

EDIT

For no extra cost on my last mod:

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8  uint32_t x;    asm volatile(    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );    q += x;    x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q &= ~0x7;  x = (q >> 2);  mod = in - (q + x);  div = (x >> 1);}`

I've modified where the (q & ~0x7) is done to remove the need for the (x & ~0x1) that I had to add.

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   10.0800
--------------
test div 0..MAX
test mod 0..MAX
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

Thats a further 1.83% reduction on my last one.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 15, 2013, 12:57 am
Wow!

didn't expect it would come so close to 10 usec - on my UNO it clocks even 10.2080 !

worked on those last lines too, but I had
Code: [Select]
`  x = (q >> 2);  div = (x >> 1);   mod = in - ((q & ~0x7) + (x & ~0x1));`
which took longer (12++) ==> so the order of the lines does matter. (it needs to preserve the previous value of x in the pre-last line)

Can we win time by the fact that mod is always between 0..9 so in fact an uint8_t should be large enough, is an uint32_t overkill?
Code: [Select]
`void divmod10_4_1(uint32_t in, uint32_t &div, uint8_t &mod)   // made mod a byte{  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8  uint32_t x;  asm volatile(  "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t":     "=r" (x):     "r" (q):    );  q += x;  x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  x = (q >> 2);  mod = in - (q & ~0x7) - (x & ~0x1);   div = (x >> 1);}`

Quote
testing divmod10()

i/10   38.6920
i%10   38.6120
--------------
divmod10       10.1440
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (1000K random samples)
done

The diff is only 0.064 usec ( 0.6% or just 1 instruction )  . So we are 3 instructions from the 10 microsecond limit..
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: aarondc on Jun 15, 2013, 11:46 am
This thread deserves an award or something. Lovin' it.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 15, 2013, 12:01 pm
While its good to reduce the mod to 'byte', it causes the compiler to drop one of its beautiful optimisations, plus miss another obvious one.

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  uint32_t* divPtr;  uint8_t* modPtr;  asm volatile(    "movw %0, %2  \n\t"    "movw %1, %3  \n\t"    : "=z" (divPtr), "=x" (modPtr)    : "r" (&div), "r" (&mod)    :  );//force the compiler to use the x register for mod and z for div.    //This is because the compiler can then use st Z, st Z+1, st Z+2 ...    //for div, and just st X for mod. before it used st Z for mod and had    //to do: st X+, st X+, st X+, st X+, X=X-3. So we save a 2clock instruction.    uint32_t q = (in|1);// div = in/10 <~~> div = 0.75*in/8  uint32_t x;  asm volatile( //x = in >> 2    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (in)    :  );    q = q - x;    asm volatile( //x = q >> 4    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );    q += x;    x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;    q &= ~0x7;  asm volatile( //x = q >> 2    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );  *modPtr = (in - (q + x));  *divPtr = (x >> 1);}`

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   9.2000
--------------
test div 0..MAX
test mod 0..MAX
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on Jun 15, 2013, 01:23 pm
To cite robtillaart's motto:
"In theory there is no difference between theory and practice, however in practice there are many..."

So, I tried this:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){ uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8 uint32_t q= (x>>4) + x; x= q; q= (q>>16) + (x>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q &= ~0x7; x = (q >> 2); mod = in - (q + x); div = (x >> 1);}`
At least in theory there should be some (at least one) saveings on assembler instruction. Because this:
Code: [Select]
` q= (q>>16) + (x>>8) + x; q= (q>>8) + x; q= (q>>8) + x;`
is a valid substitute for this:
Code: [Select]
` q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x;`

However in practice it made performance worse.

I agree, the (avr-)gcc does a beautiful job. But, from this result I have the feeling that shift 16 is not in its optimization scope.

Can manually coded assembler do the job? (Which is not in my scope.)

BTW: This is not a valid substitute:
Code: [Select]
` q= (q>>16) + (x>>8) + x; q= (q>>16) + (x>>8) + x;`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 15, 2013, 02:17 pm
I'm not sure that:
Code: [Select]
`x = q;q= (q>>16) + (x>>8) + x;`
is a valid substitution for
Code: [Select]
`x = qq= (q>>8) + x;q= (q>>8) + x;`

take q = 0x00000FFF

Code: [Select]
`x = 0x00000FFFq = 0x00000000 + 0x0000000F + 0x00000FFF = 0x0000100E`

In the original:
Code: [Select]
`x = 0x00000FFFq = 0x0000000F + 0x00000FFF = 0x0000100Eq = 0x00000010 + 0x00000FFF = 0x0000100F`

Notice the discrepancy? Its because you lost a vital carried bit.

Can manually coded assembler do the job? (Which is not in my scope.)

I have been :)
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 15, 2013, 04:41 pm
There is one more MASSIVE optimisation which can be acheived, but I can't make gcc do it.

The solution... write the entire function manually in assembler.

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)  //and that in arrives in r22:r25 quad   asm volatile(    "movw r30, %2   \n\t"  //uint32_t* divPtr = &div;    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;        "mov   r0, %A0  \n\t"  //byte temp = in    "movw r18, %A0  \n\t"  //uint32_t q = in;    "movw r20, %C0  \n\t"      "ori  r18, 0x01 \n\t"  //q |= 1;        "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"      "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"        "sub  r18, r22  \n\t" //q = q - x;    "sbc  r19, r23  \n\t"    "sbc  r20, r24  \n\t"    "sbc  r21, r25  \n\t"        "movw r22, r18  \n\t" //x = q;    "movw r24, r20  \n\t"    "lsr  r25       \n\t" //x = x >> 4;    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"        "add  r22, r18  \n\t" //x = x + q    "adc  r23, r19  \n\t"    "adc  r24, r20  \n\t"    "adc  r25, r21  \n\t"        "movw r18, r22  \n\t" //q = x    "movw r20, r24  \n\t"    "add  r18, r23  \n\t" //q = q + (x >> 8)    "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r1   \n\t"    "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"    "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"        "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"        "andi r18, 0xF8 \n\t" //q = q & ~0x7        "sub   r0, r18  \n\t" //in = in - q        "lsr  r21       \n\t" //q = q >> 2    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "lsr  r21       \n\t"    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"        "sub  r0, r18   \n\t" //in = in - q    "st    X, r0    \n\t" //mod = in;        "lsr  r21       \n\t" //q = q >> 1    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"        "st    Z, r18  \n\t" //div = q    "std  Z+1, r19  \n\t"    "std  Z+2, r20  \n\t"    "std  Z+3, r21  \n\t"        :    : "r" (in), "r" (&mod), "r" (&div)    : "r0", "r26", "r27", "r31", "r31"  );  }`

And the results speak for themselves:
Quote

i/10   38.8360
i%10   38.7400
--------------
divmod10   7.8800
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jun 15, 2013, 10:14 pm
Quote
divmod10   7.8800

Including the call and ret it takes 101 clock cycles to execute the code which comes out to 101 cycles / 16 cycles per microsecond = 6.3125 microseconds.  The measurement code needs some work.  Excellent job on the assembly!

Code: [Select]
`call 2movw r30, 2% 1movw r26, 1% 1mov r0, %A0 1movw r18, %A0 1movw r20, %C0 1ori r18, 0x01 1lsr r25 1ror r24 1ror r23 1ror r22 1lsr r25 1ror r24 1ror r23 1ror r22 1sub r18, r22 1sbc r19, r23 1sbc r20, r24 1sbc r21, r25 1movw r22, r18 1movw r24, r20 1lsr r25 1ror r24 1ror r23 1ror r22 1lsr r25 1ror r24 1ror r23 1ror r22 1lsr r25 1ror r24 1ror r23 1ror r22 1lsr r25 1ror r24 1ror r23 1ror r22 1add r22, r18 1adc r23, r19 1adc r24, r20 1adc r25, r21 1movw r18, r22 1movw r20, r24 1add r18, r23 1adc r19, r24 1adc r20, r25 1adc r21, r1 1mov r18, r19 1mov r19, r20 1mov r20, r21 1eor r21, r21 1add r18, r22 1adc r19, r23 1adc r20, r24 1adc r21, r25 1mov r18, r19 1mov r19, r20 1mov r20, r21 1eor r21, r21 1add r18, r22 1adc r19, r23 1adc r20, r24 1adc r21, r25 1mov r18, r19 1mov r19, r20 1mov r20, r21 1eor r21, r21 1add r18, r22 1adc r19, r23 1adc r20, r24 1adc r21, r25 1andi r18, 0xF8 1sub r0, r18 1lsr r21 1ror r20 1ror r19 1ror r18 1lsr r21 1ror r20 1ror r19 1ror r18 1sub r0, r18 1st X, r0 2lsr r21 1ror r20 1ror r19 1ror r18 1st Z, r18 2std Z+1, r19 2std Z+2, r20 2std Z+3, r21 2ret 4`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on Jun 16, 2013, 03:59 pm
I wonder why Tom Carpenter didn't took the chance to implement my suggested algorithmic improvement given in reply #28?

So, I coded it myself into the given assembler template and indeed found the expected performance gain. (I don't want to bother you with this solution because I found an even better one, see below.)

Meanwhile I found that:
Code: [Select]
` q= (q>>8) + x; q= (q>>16) + (x>>8) + x; q= (q>>8) + x;`
is a valid substitute for this:
Code: [Select]
` q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x;`
as well.

This last substitute offers even more optimization choices at assembler level and I (temporary) end up with this:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)  //and that in arrives in r22:r25 quad   asm volatile(    "movw r30, %2   \n\t"  //uint32_t* divPtr = &div;    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;    "mov   r0, %A0  \n\t"  //byte temp = in    "movw r18, %A0  \n\t"  //uint32_t q = in;    "movw r20, %C0  \n\t"    "ori  r18, 0x01 \n\t"  //q |= 1;    "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "sub  r18, r22  \n\t" //q = q - x;    "sbc  r19, r23  \n\t"    "sbc  r20, r24  \n\t"    "sbc  r21, r25  \n\t"    "movw r22, r18  \n\t" //x = q;    "movw r24, r20  \n\t"    "lsr  r25       \n\t" //x = x >> 4;    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "add  r22, r18  \n\t" //x = x + q    "adc  r23, r19  \n\t"    "adc  r24, r20  \n\t"    "adc  r25, r21  \n\t"    //q= x + (x>>8);    "movw r18, r22  \n\t" //q = x    "movw r20, r24  \n\t"    "add  r18, r23  \n\t" //q = q + (x >> 8)    "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r1   \n\t"    //q= (q>>16) + (x>>8) + x;    "movw r18, r20  \n\t" //q= (q>>16);    "clr  r20   \n\t"    "clr  r21   \n\t"    "add  r18, r23  \n\t" //q += (x >> 8)    "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r21  \n\t" // we need the carry only    "add  r18, r22  \n\t" //q += x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"    //q= (q>>8) + x;    "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "clr  r21       \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"    "andi r18, 0xF8 \n\t" //q = q & ~0x7    "sub   r0, r18  \n\t" //in = in - q    "lsr  r21       \n\t" //q = q >> 2    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "lsr  r21       \n\t"    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "sub  r0, r18   \n\t" //in = in - q    "st    X, r0    \n\t" //mod = in;    "lsr  r21       \n\t" //q = q >> 1    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "st    Z, r18  \n\t" //div = q    "std  Z+1, r19  \n\t"    "std  Z+2, r20  \n\t"    "std  Z+3, r21  \n\t"    :    : "r" (in), "r" (&mod), "r" (&div)    : "r0", "r26", "r27", "r31", "r31"  );}`
With another slice of ~4% off.

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 16, 2013, 05:02 pm

I wonder why Tom Carpenter didn't took the chance to implement my suggested algorithmic improvement given in reply #28?

See reply #29 for why i didn't. I thought that what you suggested in reply #28 was not a valid substitute. But actually having tested further, the discrepancy I found seems to self correct.

Your second seems a better optimisation. It saves 5 clock cycles.

Tidied up a bit:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)  //and that in arrives in r22:r25 quad   asm volatile(    "movw r30, %2   \n\t"  //uint32_t* divPtr = &div;    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;        "mov   r0, %A0  \n\t"  //byte temp = in    "movw r18, %A0  \n\t"  //uint32_t q = in;    "movw r20, %C0  \n\t"      "ori  r18, 0x01 \n\t"  //q |= 1;        "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"      "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"        "sub  r18, r22  \n\t" //q = q - x;    "sbc  r19, r23  \n\t"    "sbc  r20, r24  \n\t"    "sbc  r21, r25  \n\t"        "movw r22, r18  \n\t" //x = q;    "movw r24, r20  \n\t"    "lsr  r25       \n\t" //x = x >> 4;    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"        "add  r22, r18  \n\t" //x = x + q    "adc  r23, r19  \n\t"    "adc  r24, r20  \n\t"    "adc  r25, r21  \n\t"        "movw r18, r22  \n\t" //q = x    "movw r20, r24  \n\t"    "add  r18, r23  \n\t" //q = q + (x >> 8)    "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r1   \n\t"    "movw r18, r20  \n\t" //q = q >> 16    "eor  r20, r20  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r23  \n\t" //q = q + (x>>8)    "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r1   \n\t" //NOTE: r1 is a known 0.    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"        "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"        "andi r18, 0xF8 \n\t" //q = q & ~0x7        "sub   r0, r18  \n\t" //in = in - q        "lsr  r21       \n\t" //q = q >> 2    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "lsr  r21       \n\t"    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"        "sub  r0, r18   \n\t" //in = in - q    "st    X, r0    \n\t" //mod = in;        "lsr  r21       \n\t" //q = q >> 1    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"        "st    Z, r18  \n\t" //div = q    "std  Z+1, r19  \n\t"    "std  Z+2, r20  \n\t"    "std  Z+3, r21  \n\t"        :    : "r" (in), "r" (&mod), "r" (&div)    : "r0", "r26", "r27", "r31", "r31"  );}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 16, 2013, 08:19 pm
in one word - respect!

It becomes almost an assembler course! - I can read it but almost no experience writing avr asm.

@Tom & genom2
A question arises:  is porting the fastest C to asm automatically the fastest asm version?
I mean, the previous C versions that had less statements, can they be more efficient when hand-coded to asm?

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 16, 2013, 10:18 pm

@Tom & genom2
A question arises:  is porting the fastest C to asm automatically the fastest asm version?
I mean, the previous C versions that had less statements, can they be more efficient when hand-coded to asm?

Most things can be optimised by manually writing the assembly, at least in part.

A lot of the asm that gcc generated for the C versions of divmod10 was fine and very well optimised, in fact much I've just used. The trouble is that there are optimisations get missed, partly because to see them requires a bit of creativity.
The optimiser is very good, there are some times I look at what it comes up with and think, "wow... that was clever!". But there are other times I think, "why don't you just do this, it would be far quicker".

Really it is about trying to write the C in just the right order and manner that the compiler can optimise well. That is something which takes practice, patience and observing how different code ordering affects the compiler output.

Most of the time now I am using a lot of inline assembler in time critical areas, such as this interrupt routine:
Code: [Select]
`ISR(TIMER3_CAPT_vect) {  byte timeSegment = distributionSegment(DC);  byte index = (distributionWidth[DC] << 1) & timeSegment;  interruptOVFCount(DC) = *(int*)((byte*)timerOVF[DC] + index); //move to next pwm period  distributionSegment(DC) = timeSegment + 1;    unsigned int count = interruptCount(DC)-1; //OCR1B is used as it is an unused register which affords us quick access.  if (count == 0) {    register byte port asm("r18") = stepPort(DC);    register unsigned int startVal asm("r24") = currentMotorSpeed(DC);    //byte port = STEP0PORT;    //unsigned int startVal = currentMotorSpeed(DC);    interruptCount(DC) = startVal; //start val is used for specifying how many interrupts between steps. This tends to IVal after acceleration    if (port & stepHigh(DC)){      stepPort(DC) = port & stepLow(DC);      unsigned long jVal = synta.cmd.jVal[DC];      jVal = jVal + synta.cmd.stepDir[DC];      synta.cmd.jVal[DC] = jVal;      if(synta.cmd.gotoEn[DC]){        if (gotoPosn[DC] == jVal){ //reached the decelleration marker. Note that this line loads gotoPosn[] into r24:r27          //will decellerate the rest of the way. So first move gotoPosn[] to end point.          if (decelerationSteps(DC) & _BV(7)) {            /*            unsigned long gotoPos = gotoPosn[DC];            if (synta.cmd.stepDir[DC] < 0){              gotoPosn[DC] = gotoPos + decelerationSteps(DC);            } else {              gotoPosn[DC] = gotoPos - decelerationSteps(DC);            }            */            //--- This is a heavily optimised version of the code commented out just above ------------            //During compare of jVal and gotoPosn[], gotoPosn[] was loaded into r24 to r27            register char stepDir asm("r19") = synta.cmd.stepDir[DC];            asm volatile(              "in   %A0, %2  \n\t"              "cp   %1, __zero_reg__  \n\t"              "brlt .+4      \n\t"              "neg  %A0      \n\t"              "eor  %B0, %B0   \n\t"              "mov  %C0, %B0   \n\t"              "mov  %D0, %B0   \n\t"              "add  %A0, r24  \n\t" //add previously loaded gotoPosn[] registers to new gotoPosn[] registers.              "adc  %B0, r25  \n\t"              "adc  %C0, r26  \n\t"              "adc  %D0, r27  \n\t"              : "=a" (gotoPosn[DC]) //goto selects r18:r21. This adds sts commands for all 4 bytes              : "r" (stepDir),"I" (_SFR_IO_ADDR(decelerationSteps(DC)))       //stepDir is in r19 to match second byte of goto.              :            );            //-----------------------------------------------------------------------------------------            decelerationSteps(DC) = 0; //say we are stopping            synta.cmd.IVal[DC] = stopSpeed[DC]; //decellerate to min speed. This is possible in at most 80 steps.          } else {            goto stopMotorISR3;          }        }      }      if (currentMotorSpeed(DC) > stopSpeed[DC]) {stopMotorISR3:        synta.cmd.gotoEn[DC] = 0; //Not in goto mode.        synta.cmd.stopped[DC] = 1; //mark as stopped        timerDisable(DC);      }    } else {      stepPort(DC) = port | stepHigh(DC);      register unsigned int iVal asm("r20") = synta.cmd.IVal[DC];      //unsigned int iVal = synta.cmd.IVal[RA];      register byte increment asm("r0") = synta.cmd.stepIncrement[DC];      asm volatile(        "movw r18, %0   \n\t"        "sub  %A0, %2    \n\t"        "sbc  %B0, __zero_reg__    \n\t"        "cp   %A0, %A1   \n\t"        "cpc  %B0, %B1   \n\t"        "brge 1f         \n\t" //branch if iVal <= (startVal-increment)        "movw  %0, r18   \n\t"        "add  %A0, %2    \n\t"        "adc  %B0, __zero_reg__    \n\t"        "cp   %A1, %A0   \n\t"        "cpc  %B1, %B0   \n\t"        "brge 1f         \n\t" //branch if (startVal+increment) <= iVal        "movw  %0, %1   \n\t"  //copy iVal to startVal        "1:             \n\t"        : "=&w" (startVal) //startVal is in r24:25        : "a" (iVal), "t" (increment) //iVal is in r20:21        :      );      currentMotorSpeed(DC) = startVal; //store startVal      /*      if (startVal - increment >= iVal) { // if (iVal <= startVal-increment)        startVal = startVal - increment;      } else if (startVal + increment <= iVal){        startVal = startVal + increment;      } else {        startVal = iVal;      }      currentMotorSpeed(DC) = startVal;      */    }  } else {    interruptCount(DC) = count;  }}`
There is much that needs to be done in as short a space of time as possible. There are places where looking at the assembler I can see areas which I can see a better way. Sometimes that is just arranging if statements in a different way, others changing the use of variables.
Notice for example this line:
Code: [Select]
`  interruptOVFCount(DC) = *(int*)((byte*)timerOVF[DC] + index); //move to next pwm period`
I could have written:
Code: [Select]
`  interruptOVFCount(DC) = timerOVF[DC][index]; //move to next pwm period`
Which makes it a lot more readable, but results in 2 extra clock cycles.
Over the whole interrupt, two days of optimisation brought it down from 500+ clock cycles to a little over 170.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: genom2 on Jun 18, 2013, 09:45 pm
I'm not able to make a profound statement, because I have only little experience with assembler. So far I don't like assembler because its always specific to CPUs. (What about divmod10(..) for ARM based Arduinos?). On the other side I like programming with JAVA (which is - in theory - an interpreter). All my algorithmic improvements made for this thread are originally developed with JAVA on a PC (using '>>>' for unsigned right shift) and manually "compiled" to C, to make the proof on my Duemilanove.

In the past I used C to implement my ideas, just because there was no JAVA for microcontrollers like Arduinos. But there is light at the end of the tunnel: I'm the author of HaikuVM, a full featured open source JAVA VM for AVRs (works even for an ATmega8). Hoping HaikuVM becomes more popular in the future.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 19, 2013, 10:44 pm
[off topic]
Quote
Hoping HaikuVM becomes more popular in the future.

You can always discuss it in the exhibition gallery when it runs on an Arduino !
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 20, 2013, 07:08 pm
Wow, pretty amazing work!

This may be a dumb or overly obvious question, but would it be ok to include this in Arduino's Print.cpp file, which is LGLP licensed?

Currently that file only has attribution to David Mellis.  If I make a copy including this amazing code, who should I add to the attribution and how should your names appear?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 20, 2013, 11:02 pm
Quote
This may be a dumb or overly obvious question, but would it be ok to include this in Arduino's Print.cpp file, which is LGLP licensed?

Questions are never dumb ;)

Print.cpp is one I had in mind too, printing numbers could be (estimate) 10-30% faster.
imho the divmod10() should be usable from user code too, so encapsulate in a lib fastmath.h or arduino.h (?).

I do not know the legal details of the LGPL (can you explain it simply?), but my version of the code can be used.

Still todo: test on several platforms: 168, 268, tiny, teensy, due etc.

Quote
If I make a copy including this amazing code, who should I add to the attribution and how should your names appear?

imho: reference to this thread and optionally the forum names.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 22, 2013, 07:02 pm

Quote
divmod10   7.8800

Including the call and ret it takes 101 clock cycles to execute the code which comes out to 101 cycles / 16 cycles per microsecond = 6.3125 microseconds.  The measurement code needs some work.

Here a redo of the measurement code
The first loop tests 100.000 x 1 call to divmod. (includes loop overhead)
The second loop tests 100.000 x 2 calls to divmod. (includes loop overhead)

By subtracting the found values one get a better approximation of the time per divmod().

Code: [Select]
`unsigned long start = 0;unsigned long stop = 0;volatile unsigned long q;void setup(){  Serial.begin(115200);  Serial.println("testing divmod10()\n");  Serial.print("divmod10\t");  start = micros();  uint8_t m = 0;  uint32_t d = 0;  for (unsigned long i = 0; i < 100000; i++)  {    divmod10(i, d, m);  }  stop = micros();  float f1 = (stop - start)/100000.0;  Serial.println(f1, 4);  Serial.println("--------------");  Serial.print("divmod10\t");  start = micros();  m = 0;  d = 0;  for (unsigned long i = 0; i < 100000; i++)  {    divmod10(i, d, m);    divmod10(i, d, m);  }  stop = micros();  float f2 = (stop - start)/100000.0;  Serial.println(f2, 4);  Serial.println("--------------");  Serial.print("one run takes: ");  Serial.println(f2-f1, 4);  Serial.println("--------------");  Serial.println("done");}void loop(){}void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)  //and that in arrives in r22:r25 quad  asm volatile(  "movw r30, %2   \n\t"  //uint32_t* divPtr = &div;  "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;  "mov   r0, %A0  \n\t"  //byte temp = in  "movw r18, %A0  \n\t"  //uint32_t q = in;  "movw r20, %C0  \n\t"      "ori  r18, 0x01 \n\t"  //q |= 1;  "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0  "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"      "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "sub  r18, r22  \n\t" //q = q - x;  "sbc  r19, r23  \n\t"    "sbc  r20, r24  \n\t"    "sbc  r21, r25  \n\t"    "movw r22, r18  \n\t" //x = q;  "movw r24, r20  \n\t"    "lsr  r25       \n\t" //x = x >> 4;  "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "add  r22, r18  \n\t" //x = x + q  "adc  r23, r19  \n\t"    "adc  r24, r20  \n\t"    "adc  r25, r21  \n\t"    "movw r18, r22  \n\t" //q = x  "movw r20, r24  \n\t"    "add  r18, r23  \n\t" //q = q + (x >> 8)  "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r1   \n\t"    "mov  r18, r19  \n\t" //q = q >> 8  "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x  "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"    "mov  r18, r19  \n\t" //q = q >> 8  "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x  "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"    "mov  r18, r19  \n\t" //q = q >> 8  "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x  "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"    "andi r18, 0xF8 \n\t" //q = q & ~0x7  "sub   r0, r18  \n\t" //in = in - q  "lsr  r21       \n\t" //q = q >> 2   "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "lsr  r21       \n\t"    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "sub  r0, r18   \n\t" //in = in - q  "st    X, r0    \n\t" //mod = in;  "lsr  r21       \n\t" //q = q >> 1  "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "st     Z, r18  \n\t" //div = q  "std  Z+1, r19  \n\t"    "std  Z+2, r20  \n\t"    "std  Z+3, r21  \n\t": :     "r" (in), "r" (&mod), "r" (&div):     "r0", "r26", "r27", "r31", "r31"    );  }`

The output:
divmod10   7.6716
--------------
divmod10   14.4000
--------------
one run takes: 6.7284

6.3125 and 6.7284 differ 6.6%  where the original estimate 7.7360 differs 22.5%  (accuracy improved by factor 3+)

The difference of 0.4 uSec can be due to parameter passing?

in conclusion:
1) time measurement can be improved.
3) previous measurements are all including the loop overhead so ~1.1 uSec too high
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 22, 2013, 07:49 pm
The 0.4uS will be the for loop. You have to increment 'i', which takes 4 clock cycles, plus a compare is needed to check if 'i' is < 100000 which will take 4 as well.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 22, 2013, 08:22 pm
but both loop 1 and 2 have the same loop overhead
=> so the difference cannot include the loop overhead, so the time can only be the 100K calls to divmod()

I can only think of param handling, placing i and the addresses of d and m on the stack.

yes there is the call to millis() to but that is done only once for 100K calls so neglectible
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 22, 2013, 11:58 pm
True.

Looking at the assembly, there are 4 clock cycles to copy values into registers in the loop:
Code: [Select]
`     330: c8 01       movw r24, r16     332: b7 01       movw r22, r14     334: a5 01       movw r20, r10     336: 96 01       movw r18, r12     338: 04 df       rcall .-504     ; 0x142 <_Z8divmod10mRmRh> //first call     33a: c8 01       movw r24, r16  //each of these 4 are a single clock cycle     33c: b7 01       movw r22, r14     33e: a5 01       movw r20, r10     340: 96 01       movw r18, r12     342: ff de       rcall .-514     ; 0x142 <_Z8divmod10mRmRh> //second call`

That will account for about half of the discrepancy.

Beyond that there is no difference in loop overhead. I only looked at the code you put quickly so had missed what you were trying to do.

One thing to remember is that micros() is not that accurate and there may be some error in it. Especially as in the longer loop (the one which calls divmod10() twice), there will be more micros() interrupts occuring which will add additional overhead to that loop.

EDIT:

Interestingly with the latest version should be 103 clock cycles long including ret and call (104 on a mega as ret takes an aditional clock cycle). So the target is really only 6.4375us (6.6875us if you include the 4 movw clock cycles used to load the registers).

To improve measurement, i used Timer1 to count how many clock cycles are used, and disabled millis(), which removes any error from interrupts.
Code: [Select]
`unsigned long start = 0;unsigned long stop = 0;volatile unsigned long q;void setup(){  Serial.begin(115200);  Serial.println("testing divmod10()\n");    byte backup = TIMSK0;    Serial.print("divmod10\t");    TCCR1A = 0;  TCCR1B = 5;    uint8_t m = 0;  uint32_t d = 0;  Serial.flush(); //wait for serial buffer to clear  TIMSK0 = 0; //disable millis;  TCNT1 = 0;  for (unsigned long i = 0; i < 100000; i++)  {    divmod10(i, d, m);  }  stop = TCNT1; //how many clock cycles.  TIMSK0 = backup; //renable millis;  float f1 = stop*64;//There are 64us per clock cycle with a 1:1024 prescaler.  f1 /= 100000.0;  Serial.println(f1, 4);  Serial.println("--------------");  Serial.print("divmod10\t");  Serial.flush(); //wait for serial buffer to clear  m = 0;  d = 0;  TIMSK0 = 0; //disable millis;  TCNT1 = 0;  for (unsigned long i = 0; i < 100000; i++)  {    divmod10(i, d, m);    divmod10(i, d, m);  }  stop = TCNT1;  TIMSK0 = backup; //renable millis;  float f2 = stop * 64;//There are 64us per clock cycle with a 1:1024 prescaler.  f2 /= 100000.0;  Serial.println(f2, 4);  Serial.println("--------------");  Serial.print("one run takes: ");  Serial.println(f2-f1, 4);  Serial.println("--------------");  Serial.println("done");}void loop(){}void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)  //and that in arrives in r22:r25 quad  asm volatile(    "movw r30, %2   \n\t"  //uint32_t* divPtr = &div;    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;      "mov   r0, %A0  \n\t"  //byte temp = in    "movw r18, %A0  \n\t"  //uint32_t q = in;    "movw r20, %C0  \n\t"      "ori  r18, 0x01 \n\t"  //q |= 1;      "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"      "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"      "sub  r18, r22  \n\t" //q = q - x;    "sbc  r19, r23  \n\t"    "sbc  r20, r24  \n\t"    "sbc  r21, r25  \n\t"      "movw r22, r18  \n\t" //x = q;    "movw r24, r20  \n\t"    "lsr  r25       \n\t" //x = x >> 4;    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"    "lsr  r25       \n\t"    "ror  r24       \n\t"    "ror  r23       \n\t"    "ror  r22       \n\t"      "add  r22, r18  \n\t" //x = x + q    "adc  r23, r19  \n\t"    "adc  r24, r20  \n\t"    "adc  r25, r21  \n\t"      "movw r18, r22  \n\t" //q = x    "movw r20, r24  \n\t"    "add  r18, r23  \n\t" //q = q + (x >> 8)    "adc  r19, r24  \n\t"    "adc  r20, r25  \n\t"    "adc  r21, r1   \n\t"      "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"      "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"      "mov  r18, r19  \n\t" //q = q >> 8    "mov  r19, r20  \n\t"    "mov  r20, r21  \n\t"    "eor  r21, r21  \n\t"    "add  r18, r22  \n\t" //q = q + x    "adc  r19, r23  \n\t"    "adc  r20, r24  \n\t"    "adc  r21, r25  \n\t"      "andi r18, 0xF8 \n\t" //q = q & ~0x7      "sub   r0, r18  \n\t" //in = in - q      "lsr  r21       \n\t" //q = q >> 2    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"    "lsr  r21       \n\t"    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"      "sub  r0, r18   \n\t" //in = in - q    "st    X, r0    \n\t" //mod = in;      "lsr  r21       \n\t" //q = q >> 1    "ror  r20       \n\t"    "ror  r19       \n\t"    "ror  r18       \n\t"      "st    Z, r18  \n\t" //div = q    "std  Z+1, r19  \n\t"    "std  Z+2, r20  \n\t"    "std  Z+3, r21  \n\t"      :    :  "r" (in), "r" (&mod), "r" (&div)    : "r0", "r26", "r27", "r31", "r31"  );  }`

On an Arduino Mega, this yeilds:
Quote
testing divmod10()

divmod10   7.6877
--------------
divmod10   14.4378
--------------
one run takes: 6.7501
--------------
done

On a mega there are 104clocks plus the 4 for movw, so 108 in total. So the time required should be:
1000000 * 108 / 16000000 = 6.75us, so bang on.

On an Arduino Uno, it yeilds:
Quote
testing divmod10()

divmod10   7.6250
--------------
divmod10   14.3123
--------------
one run takes: 6.6874
--------------
done

On an Uno, there are 103clocks plus the 4 for movw, so 107 in total. So the time required should be:
1000000 * 107 / 16000000 = 6.6875us, so again, the test is bang on.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: drjiohnsmith on Jun 24, 2013, 11:49 am
on this forum we seem to have two threads running on ways to implement very efficient division and mod operations.

http://forum.arduino.cc/index.php?topic=172635.0

http://forum.arduino.cc/index.php?topic=167414.0

I'd hate to loose these over time,
which might well happen if they just reside on the forum.

is there any posibility that these could be put together, and put on the playground or something so they are not lost .

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 24, 2013, 12:23 pm
To reassure myself about this algorithm, I started an exhaustive test.  At 16 MHz, it should check all 2^32 inputs to divmod10 against the C library divide and modulus in 4 days and 3 hours.

To integrate this into the Print class, an even more efficient way might involve an inline assembly macro rather than a function call.  Not only does that eliminate the call and return instructions, but it also allows the compiler to directly replace the registers in the asm with whatever it allocated to the variables.  It avoids the overhead of the called function allocating lots of the registers, which the compiler would save if they're in use, and even if not were to be saved, the compiler's register allocator is forced to rearrange register usage in the surrounding code a function is called.

In Print, the original 32 bit input is replaced with the quotient, which seems to fit pretty well to this code.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 24, 2013, 01:56 pm
I know this might seem a daft question, but does the print class actually use a straight /10 ? I thought it had to work for different bases (BIN, HEX, DEC, OCT) etc.
I suppose you could have a different printNumber() funtion for base 10.

Once you have done it with calling the divmod10() function directly, I'll look into inlining the function.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 24, 2013, 07:02 pm
Quote
I know this might seem a daft question, but does the print class actually use a straight /10 ?

no it uses base, but I estimate 95% is done the DECimal way, then another 4% as HEX, 1% BIN, other bases < 1%

The printNumber could be implemented this way (optimized speed on cost of size)

Code: [Select]
`size_t Print::printNumber(unsigned long n, uint8_t base) {  char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.  char *str = &buf[sizeof(buf) - 1];  *str = '\0';  // prevent crash if called with base == 1  if (base < 2) base = 10;  do {    unsigned long m = n;    n /= base;    char c = m - base * n;    *--str = c < 10 ? c + '0' : c + 'A' - 10;  } while(n);  return write(str);}`

would become something like this, with optimized versions for DEC, HEX, BIN and OCT. (imho OCT is used to few to optimize, but OK)

Code: [Select]
`size_t Print::printNumber(unsigned long n, uint8_t base) {  char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.  char *str = &buf[sizeof(buf) - 1];  *str = '\0';  char c;  uint32_t d;  // prevent crash if called with base == 1  if (base < 2) base = 10;  switch(base)  {    case HEX:      do {        c = n & 0x0F;        n >>= 4;        *--str = c < 10 ? c + '0' : c + 'A' - 10;      } while(n);      break;    case DEC:      do {        divmod10(n, &d, &c)        n = d;        *--str = c + '0';      } while(n);      break;    case OCT:      // *--str = '0';   //leading 0 to indicate octal ?      do {        c = n & 0x07;        n >>= 3;        *--str = c + '0';      } while(n);      break;    case BIN:      // *--str = 'b';  // leading b to indicate binary ?      do {        c = n & 0x01;        n >>= 1;        *--str = c + '0';      } while(n);      break;    default:      do {        d = n;        n /= base;        c = d - base * n;        *--str = c < 10 ? c + '0' : c + 'A' - 10;      } while(n);    break;  }  return write(str);}`
(code not tested)
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 24, 2013, 07:07 pm
I was planning to make different printNumber functions, like printNumberDec, printNumberHex, printNumberGeneric.  Then printNumber would be an inline function defined in Print.h.

Nearly all Print usage is base 10, but even when it's others, the 2nd parameter is virtually always a compile time constant.  Using an inline function will allow the compiler to optimize away the test and directly insert the call to a specific version.  If the others aren't used in the entire program, the linker will not put their code into the final executable.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 24, 2013, 08:56 pm
I put divmod10 into Print.cpp.  I did some quick testing and it seems to work quite well.

This is Teensy's implementation of Print, which differs slightly from Arduino's.  It has several other optimizations....

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 24, 2013, 09:12 pm
Ideally, divmod10 could be an inline asm macro that modifies "n" in place and returns the remainder in a register.  That would eliminate the function call+return, the pointer overhead, and the compiler having to reallocate registers during the loop because a function is called.

It'll probably require a 32 bit temporary, if I understand the existing asm code?

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 24, 2013, 10:14 pm
Hi Paul,

I have done the same mod to the standard Arduino Print class, which is attached.

There are a couple of improvements I have made:
(1)
The no base print calls can just call printNumberDec() directly rather than going through printNumber().
Code: [Select]
`    size_t print(unsigned char b) { return printNumberDec(b, 0); }    size_t print(int n) { return print((long)n); }    size_t print(unsigned int n) { return printNumberDec(n, 0); }    size_t print(long n)      {if (n < 0) {return printNumberDec(-n, 1);} else {return printNumberDec(n, 0);} }    size_t print(unsigned long n) { return printNumberDec(n, 0); }`

(2)
As suggested, I have made an inline version of divmod10 for the printNumberDec() function. You should be able to merge this back into the Teensy version fairly readily.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 24, 2013, 10:20 pm
@Paul Stoffregen

a quick look at print.cpp

in the generic printNumber you have this code

Code: [Select]
` } else if (base == 1) { base = 10; }`
could be
Code: [Select]
` } else if (base == 1) { printNumberDEC(n, sign); }`

count += printNumber(int_part, sign, 10);
==>
count += printNumberDEC(int_part, sign);

Did you measure timing improvements?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 24, 2013, 10:43 pm
I converted it to inline asm too, slightly differently but very similar.

Code: [Select]
`#define divmod10_asm(in32, tmp32, mod8)                         \asm volatile (                                                  \        "mov    %2, %A0         \n\t" /* mod = in */            \        "ori    %A0, 1          \n\t" /* q = in | 1 */          \        "movw   %A1, %A0        \n\t" /* x = q */               \        "movw   %C1, %C0        \n\t"                           \        "lsr    %D1             \n\t"  /* x = x >> 2 */         \        "ror    %C1             \n\t"                           \        "ror    %B1             \n\t"                           \        "ror    %A1             \n\t"                           \        "lsr    %D1             \n\t"                           \        "ror    %C1             \n\t"                           \        "ror    %B1             \n\t"                           \        "ror    %A1             \n\t"                           \        "sub    %A0, %A1        \n\t" /* q = q - x  */          \        "sbc    %B0, %B1        \n\t"                           \        "sbc    %C0, %C1        \n\t"                           \        "sbc    %D0, %D1        \n\t"                           \        "movw   %A1, %A0        \n\t" /* x = q  */              \        "movw   %C1, %C0        \n\t"                           \        "lsr    %D1             \n\t" /* x = x >> 4  */         \        "ror    %C1             \n\t"                           \        "ror    %B1             \n\t"                           \        "ror    %A1             \n\t"                           \        "lsr    %D1             \n\t"                           \        "ror    %C1             \n\t"                           \        "ror    %B1             \n\t"                           \        "ror    %A1             \n\t"                           \        "lsr    %D1             \n\t"                           \        "ror    %C1             \n\t"                           \        "ror    %B1             \n\t"                           \        "ror    %A1             \n\t"                           \        "lsr    %D1             \n\t"                           \        "ror    %C1             \n\t"                           \        "ror    %B1             \n\t"                           \        "ror    %A1             \n\t"                           \        "add    %A1, %A0        \n\t" /* x = x + q */           \        "adc    %B1, %B0        \n\t"                           \        "adc    %C1, %C0        \n\t"                           \        "adc    %D1, %D0        \n\t"                           \        "movw   %A0, %A1        \n\t" /* q = x */               \        "movw   %C0, %C1        \n\t"                           \        "add    %A0, %B1        \n\t" /* q = q + (x >> 8) */            \        "adc    %B0, %C1        \n\t"                           \        "adc    %C0, %D1        \n\t"                           \        "adc    %D0, r1         \n\t"                           \        "mov    %A0, %B0        \n\t" /* q = q >> 8 */          \        "mov    %B0, %C0        \n\t"                           \        "mov    %C0, %D0        \n\t"                           \        "eor    %D0, %D0        \n\t"                           \        "add    %A0, %A1        \n\t" /* q = q + x */           \        "adc    %B0, %B1        \n\t"                           \        "adc    %C0, %C1        \n\t"                           \        "adc    %D0, %D1        \n\t"                           \        "mov    %A0, %B0        \n\t" /* q = q >> 8 */          \        "mov    %B0, %C0        \n\t"                           \        "mov    %C0, %D0        \n\t"                           \        "eor    %D0, %D0        \n\t"                           \        "add    %A0, %A1        \n\t" /* q = q + x */           \        "adc    %B0, %B1        \n\t"                           \        "adc    %C0, %C1        \n\t"                           \        "adc    %D0, %D1        \n\t"                           \        "mov    %A0, %B0        \n\t" /* q = q >> 8 */          \        "mov    %B0, %C0        \n\t"                           \        "mov    %C0, %D0        \n\t"                           \        "eor    %D0, %D0        \n\t"                           \        "add    %A0, %A1        \n\t" /* q = q + x */           \        "adc    %B0, %B1        \n\t"                           \        "adc    %C0, %C1        \n\t"                           \        "adc    %D0, %D1        \n\t"                           \        "andi   %A0, 0xF8       \n\t" /* q = q & ~0x7 */        \        "sub    %2, %A0         \n\t" /* mod = mod - q */       \        "lsr    %D0             \n\t" /* q = q >> 2  */         \        "ror    %C0             \n\t"                           \        "ror    %B0             \n\t"                           \        "ror    %A0             \n\t"                           \        "lsr    %D0             \n\t"                           \        "ror    %C0             \n\t"                           \        "ror    %B0             \n\t"                           \        "ror    %A0             \n\t"                           \        "sub    %2, %A0         \n\t" /* mod = mod - q */       \        "lsr    %D0             \n\t" /* q = q >> 1 */          \        "ror    %C0             \n\t"                           \        "ror    %B0             \n\t"                           \        "ror    %A0             \n\t"                           \        :  "+d" (in32), "=r" (tmp32), "=r" (mod8) : :           \)`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 24, 2013, 11:28 pm

Did you measure timing improvements?

Yes, just now, printing several integers with micros() read inside the printNumberXXX functions.

There are a couple caveats, but basically:

With printNumberAny(): 1412 us   -- single call to C library divide
With printNumberDec(): 218 us     -- using divmod10_asm

Caveat #1: Adding the storage of micro() causes the compiler to allocate many more registers.  I didn't analyze the generated printNumberDec code in detail, but it is clear from a casual look at the assembly listing that just adding the test causes the whole function to be slower.

Caveat #2: The timer0 interrupt is at play, so the measurement include extra time and sometimes changes slightly from run to run.

Here's the (very arbitrary) test code:

Code: [Select]
`extern uint32_t usec_print;void setup(){  for (int i=0; i<100; i++) {    Serial.println(12345678l);    Serial.println(0);    Serial.println(792942ul);    Serial.println(12345);    Serial.println(2586173071ul);    Serial.println(234);  }  Serial.print("usec = ");  Serial.println((float)usec_print / 100.0);}void loop(){}`

Here are the Print files.  These have the test code inside and printNumberAny is enabled...
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 25, 2013, 01:07 am
As a simple comparison, take this code:
Code: [Select]
`unsigned long start = 0;unsigned long stop = 0;volatile unsigned long q;void setup(){  Serial.begin(115200);  Serial.println("testing");    byte backup = TIMSK0;    TCCR1A = 0;  TCCR1B = 4;   TIMSK1 |= _BV(TOIE1);    Serial.flush(); //wait for serial buffer to clear  TIMSK0 = 0; //disable millis;  TCNT1 = 0;    Serial.println(1073741824ul);  Serial.println(101819584ul);  Serial.println(10737);    stop = TCNT1; //how many clock cycles.  TIMSK0 = backup; //renable millis;  stop*=16;//There are 16us per clock cycle with a 1:256 prescaler.  Serial.print("Time=");  Serial.println(stop);  Serial.println("done");  TIMSK1 &= ~_BV(TOIE1);}void loop(){}ISR(TIMER1_OVF_vect) {  Serial.println("I Overflowed!"); //just to make sure we can tell if this happens.}`

The old print functions:
Quote
testing
1073741824
101819584
10737
Time=1408
done

The new functions:
Quote
testing
1073741824
101819584
10737
Time=480
done

A rather arbitrary test I know, but there is a clear difference, almost 1ms is knocked off the time it takes to fill the Serial buffer.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 02:12 am
I ran this on Teensy 2.0.  It takes 224... so a good portion of that 480 on normal Arduino is probably the slow hardware serial code which uses a vtable function call and circular buffer manipulation overhead for each individual byte.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 25, 2013, 08:07 am
for long numbers:

original = 1408/24 = ~59us / digit to print
"tom" = 480/24  = ~20 us/digit to print    (33%)
"paul" = 224/24 = ~ 9.5uS /digit to print  (16%)

A quick arbitrary float test  (same prog as above, slightly different printnumber)
Code: [Select]
`  Serial.println(10737.41824, 4);  Serial.println(1.01819584, 4);  Serial.println(107.37, 2);`

original printnumber
testing
10737.4179
1.0182
107.37
Time=2144
done

divmod10 printnumber
testing
10737.4179
1.0182
107.37
Time=1472
done

so for floats there is also serious gain ~30% gain

(note this gain is only due to the integral part, the remainder part is a loop using a float mul per digit
Code: [Select]
`  // Extract digits from the remainder one at a time  while (digits-- > 0)  {    remainder *= 10.0;    int toPrint = int(remainder);    n += print(toPrint);    remainder -= toPrint;   } `
this can be optimized to something like:
Code: [Select]
`  // Extract digits from the remainder one at a time  long t = 1;  while (digits-- > 0) t *= 10;  remainder *= t;  n += print(long(remainder));`

a quick test exposes the flaw in the proposed code but also indicates there is room for improvement.

testing
10737.4179
1.182  <<<<<< FAIL !!!
107.37
Time=1152  <<<  (another 25%)
done
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 11:15 am

"tom" = 480/24  = ~20 us/digit to print    (33%)
"paul" = 224/24 = ~ 9.5uS /digit to print  (16%)

At 16 MHz, the divmod10_asm() macro is 5.2 us.

It's exactly 83 instructions, all of them single-cycle.  There's no call-return, no pointers, no load-store instructions.  It operates entirely on values in only 9 registers, which are chosen by the compiler's register allocator based on the needs of the surrounding code.  Unless there's more algorithm magic, I believe it's safe to say this 83 cycle version is the fastest possible implementation.

Obviously there's some room for improvement in actually moving the data....
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: drjiohnsmith on Jun 25, 2013, 11:42 am
fast.

and all are much of an improvement on the original aruino,

'all' we need to do is document up and make available in a form every one can use easily.

thats that hard part,

now I wonder if a pre parser could notice the opportunity to use these in 'normal' arduino code and substitute !

OK OK OK

well done to all

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 12:02 pm

'all' we need to do is document up and make available in a form every one can use easily.
thats that hard part,

How about the Print.cpp and Print.h Tom posted on reply #51 or I posted on #49 and #54 ?

All you have to do to use these is drop them into your Arduino's hardware/arduino/cores/arduino folder (or hardware/teensy/cores/teensy), replacing the Print.cpp and Print.h that are already there.

On a Mac, you'll need to control-click on Arduino and choose "show package contents", then navigate or search until you find the folder with Print.cpp and Print.h.  I believe it's Contents/Resources/Java/hardware/arduino/cores/arduino, but I don't have a Mac handy at the moment to check.

Admittedly, the copy I posted on #54 had test code enabled.  Here's a "final" version with all the test stuff commented, as ready-to-use as it gets.  These are tested with Teensy but probably will work with official Arduino boards.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: aarondc on Jun 25, 2013, 12:22 pm
When Arduino release a new IDE, etc, those libraries you have upgraded are potentially going to get blown away, and require work to reinstall / overwrite.

One potential solution would be if you could set up pathing so modified libraries were linked if they existed in rather than default libraries.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 12:41 pm
I restarted my exhaustive test, using the asm macro version.  In a few days I should be able to confirm if it matches the C library version for all 2^32 possible inputs....
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 25, 2013, 07:34 pm
Code: [Select]
`Unless there's more algorithm magic, I believe it's safe to say this 83 cycle version is the fastest possible implementation.`
That's why I asked several posts back if one of the earlier C version - shorter in terms of C - could be faster is asm as it might be less statements.

Code: [Select]
`In a few days I should be able to confirm if it matches the C library version for all 2^32 possible inputs....`
how do you do this test? Generate strings and compare them?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 07:41 pm

how do you do this test? Generate strings and compare them?

Compute the quotient and modulus using both and check if either is different.  Basically, like this:

Code: [Select]
`                q1 = n / 10;                m1 = n - q1 * 10;                q2 = n;                divmod10_asm(q2, tmp, m2);                if (q1 != q2 || m1 != m2) {                        errorcount++;                }`

Here is the complete code.   It should run on any AVR-based Arduino compatible board.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 07:46 pm
Opps, forgot this.  To run it on non-Teensy boards, you'll also have to add this .h file.  Or you could comment out the stuff that prints dots and blinks the LED, but that would not be any fun.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 25, 2013, 08:13 pm

would it not be faster to
Code: [Select]
`for(unsigned long(n = 0; n < 0xFFFFFFFF; n++){  divmod10_asm(n, d, m);   if (n != (10*d + m)) Serial.println(i);}`
as division is quite expensive as we know ;)
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 25, 2013, 08:27 pm
That would be much faster, aside from a couple minor code problems.  But does only comparing the output to itself really form a valid test?

The slow way will complete in a few days, which really isn't very long in the grand scheme of things.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 25, 2013, 11:16 pm
Quote
But does only comparing the output to itself really form a valid test?

no there is a small chance in theory that there exist a (d, m) pair that would also solve the n = 10*d + m equation   (e.g. n = 100, d = 0,  m = 100)
but given the fact that the code is a white box and we know how the code works this chance is minimal.

To prevent this theoretical issue one could add a test for m to be smaller than 10.

Code: [Select]
`for(unsigned long(n = 0; n < 0xFFFFFFFF; n++){  divmod10_asm(n, d, m);   if (n != (10*d + m) || m >= 10 ) Serial.println(n);}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 25, 2013, 11:27 pm
Even still, using a formula to prove itself is circular reasoning, which is not a proof at all. (http://en.wikipedia.org/wiki/Circular_reasoning)
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 26, 2013, 01:21 am
The verification only needs to be run once, so it's probably not worth optimizing.  It's already been running over 12 hours and will be done in a couple days.

I believe a much better optimization effort would involve looking into why Arduino's Print takes 15 to actually do something with each digit that's generated in only 5 us.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: stimmer on Jun 26, 2013, 05:37 pm
Is there any reason for avoiding using mul in the assembly code? I know not all atmels have mul (I think some attinys don't) but it's there on the Uno and Mega.

Here's a version using mul and fixed point arithmetic:
Code: [Select]
`void setup() {  Serial.begin(115200);}void loop() {    uint32_t i=1;  uint32_t q=0;  uint8_t r=1;  while(1){    uint32_t j=i;    uint8_t k;     uint8_t t;    asm volatile(        // this code fragment works out i/10 and i%10 by calculating i*(51/256)*(256/255)/2 == i*51/510 == i/10        // by "j.k" I mean 32.8 fixed point, j is integer part, k is fractional part    // j.k = ((j+1.0)*51.0)/256.0    // (we add 1 because we will be using the floor of the result later)          " ldi %2,51     \n\t"      " mul %A0,%2    \n\t"      " clr %A0       \n\t"      " add r0,%2     \n\t"            " adc %A0,r1    \n\t"      " mov %1,r0     \n\t"      " mul %B0,%2    \n\t"      " clr %B0       \n\t"      " add %A0,r0    \n\t"      " adc %B0,r1    \n\t"      " mul %C0,%2    \n\t"      " clr %C0       \n\t"      " add %B0,r0    \n\t"      " adc %C0,r1    \n\t"      " mul %D0,%2    \n\t"      " clr %D0       \n\t"      " add %C0,r0    \n\t"      " adc %D0,r1    \n\t"      " clr r1        \n\t"                // j.k = j.k*256.0/255.0     // (actually the result will always be slightly low, which we need because we use the floor of the result)            " add %1,%A0    \n\t"      " adc %A0,%B0   \n\t"      " adc %B0,%C0   \n\t"      " adc %C0,%D0   \n\t"      " adc %D0,r1    \n\t"      " add %1,%B0    \n\t"      " adc %A0,%C0   \n\t"      " adc %B0,%D0   \n\t"      " adc %C0,r1    \n\t"      " adc %D0,r1    \n\t"      " add %1,%D0    \n\t"      " adc %A0,r1    \n\t"      " adc %B0,r1    \n\t"      " adc %C0,r1    \n\t"      " adc %D0,r1    \n\t"                    // j.k = j.k / 2.0              " lsr %D0       \n\t"      " ror %C0       \n\t"      " ror %B0       \n\t"      " ror %A0       \n\t"      " ror %1        \n\t"            // now i*51.0/510.0 < j.k < (i+1.0)*51.0/510.0   // therefore j == i/10 (since j == floor(j.k))   // and (k*10)>>8 == i % 10            " ldi %2,10     \n\t"      " mul %1,%2     \n\t"      " mov %1,r1     \n\t"            " clr r1        \n\t"            :"+r"(j),"=d"(k),"=d"(t)      :       :  );            if((j!=q) || (k!=r)){        Serial.print(i);Serial.print(" ");        Serial.print(j);Serial.print(" ");        Serial.print(k);Serial.println(" ");      }      if((i & 16777215)==0){Serial.print(i);Serial.print(" ");Serial.println(millis());        if(i==0)return;}      i++;      r++;if(r==10){r=0;q++;}        }}`

I've tested it on various values but the full test will take 6 hours to run.

My way of testing for correctness is to keep track of the expected quotient and remainder, incrementing the remainder each time the divisor is incremented and when the remainder gets to 10, increase the quotient:

Code: [Select]
`  uint32_t i=1;  uint32_t q=0;  uint8_t r=1;  while(1){ [[ work out div/mod 10 of i here and check against q and r]]      i++;      r++;if(r==10){r=0;q++;}        }`

This is fast and simple, and I hope it's agreed that it is obviously a correct way of checking.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 26, 2013, 10:23 pm
@stimmer
Can you wrap your code in a similar function as divmod10() and run a performance test for comparison?

please use Toms code as starter - http://forum.arduino.cc/index.php?topic=167414.msg1288779#msg1288779 -
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: stimmer on Jun 27, 2013, 12:14 am
As a normal function it would be dominated by overhead and marshalling. As an inline function or macro it should take 3us. The calculation is 43 instructions long, 5 of which are mul, so it takes 48 cycles in total. In a quick real-world test the result I got was 3.1245us as an inline function.

The full 32-bit range test has now completed without error :)
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Jun 27, 2013, 04:29 am
Lol. This is starting to get insane.

Credit where credit is due, thats an amazing optimisation stimmer  :D.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 27, 2013, 07:05 pm
Quote
In a quick real-world test the result I got was 3.1245us as an inline function.

speechless ...

@Paul, there is a serious new candidate for Print.cpp ;)

Recall we came from
i/10      38.6920 us
i%10      38.6120 us
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 27, 2013, 07:40 pm
IN a follow up on my own message

A quick arbitrary float test  (same prog as above, slightly different printnumber)
Code: [Select]
`  Serial.println(10737.41824, 4);  Serial.println(1.01819584, 4);  Serial.println(107.37, 2);`

original print.cpp:

testing
10737.4179
1.0182
107.37
Time=2144
done

divmod10 version:

testing
10737.4179
1.0182
107.37
Time=1472
done

so for floats there is also serious gain ~30% gain

(note this gain is only due to the integral part, the remainder part is a loop using a float mul per digit)
Code: [Select]
`  // Extract digits from the remainder one at a time  while (digits-- > 0)  {    remainder *= 10.0;    int toPrint = int(remainder);    n += print(toPrint);    remainder -= toPrint;   } `

Created a faster snippet to print the decimal part of a float (replaces the snippet above)
// not tested yet
Code: [Select]
`  //  print decimal part  uint32_t t = 1;  while (digits-- > 0) t = ((t<<2) + t)<<1;       // t *= 10;  uint32_t rem1 = remainder * t;       // only float mul left  uint32_t rem2 = ((rem1<<2) + rem1)<<1;      // rem2 = rem1*10;  // loop for the leading zero's  while (t > rem2)  {    n += print('0');    rem2 = ((rem2 <<2) + rem2)<<1;      // rem2 *=10;  }  n += print(rem1);      // uses the divmod10() for the remaining digits`

results

testing
10737.4179
1.0182
107.37
Time=1136  << 22% improved wrt 1472
done
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 27, 2013, 10:00 pm
Wow Stimmer, that's really awesome!

Here's some test results:

Stimmer's optimization:            3.9 us/digit
Hacker's Delight & Rob & Tom:  6.4 us/digit
Original C library divide:           43.1 us/digit

These measurements were made on a Teensy 2.0.  These include the looping and array write overhead inside Print, with the burden of a 32 bit variable in the code holding the start time during the test.  Overhead to actually store the data to the output device buffer is not measured, so you should see very similar results between official Arduino and Teensy.

Here's the sketch I used for testing:

Code: [Select]
`extern uint32_t usec_print;void setup() {  Serial.begin(115200);  Serial.println("Benchmark print integer speed:");  Serial.println(1073741824ul);  Serial.println(1018195846ul);  Serial.println(1820728843ul);  Serial.println(2904720785ul);  Serial.println(3678923723ul);  Serial.println(4152282824ul);  Serial.println(1875627853ul);  Serial.println(2532822983ul);  Serial.println(3519086527ul);  Serial.println(4023424234ul);  Serial.print("Speed = ");  Serial.print((float)usec_print / 100.0);  Serial.println(" us/digit");}void loop() {}`

Here are the ready-to-use Print files.  To select which algorithm is used, comment & uncomment the #defines at line 155.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jun 27, 2013, 10:47 pm
(I think some attinys don't)

Correct.  ATtiny processors do not have a MUL instruction.

Congratulations on the wicked fast code!
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 27, 2013, 11:21 pm
I added another measurement to the test program.

With Stimmer's optimization:

Speed = 3.84 us/digit, 6.16 us/byte

The us/digit measures only the speed Print converts the 32 bit integer to a string.

The us/byte measures the entire time to put all the data into the transmit buffer.

Code: [Select]
`extern uint32_t usec_print;void setup() {  uint32_t begin_us, end_us;  Serial.begin(115200);  Serial.println("Benchmark print integer speed:");  begin_us = micros();  Serial.print(1073741824ul);  Serial.print(1018195846ul);  Serial.print(1820728843ul);  Serial.print(2904720785ul);  Serial.print(3678923723ul);  Serial.print(4152282824ul);  Serial.print(1875627853ul);  Serial.print(2532822983ul);  Serial.println(3519086527ul);  Serial.println(4023424234ul);  end_us = micros();  Serial.print("Speed = ");  Serial.print((float)usec_print / 100.0);  Serial.print(" us/digit, ");  Serial.print((float)(end_us - begin_us) / 100.0);  Serial.println(" us/byte");}void loop() {}`

Print.cpp and Print.h are the same as #77.

I tried to bring use them on Arduino Uno, but it seems my Print and header file arrangement has diverged too far from Arduino's over the years.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: fat16lib on Jun 27, 2013, 11:42 pm
Many users need faster decimal print for logging data to SD cards.  The big problem with Print is this line
Code: [Select]
`  return write(str);`
A class derived from Print may have a fast write for strings but it is not uses since Print defines these members
Code: [Select]
`size_t write(const char *str) {      if (str == NULL) return 0;      return write((const uint8_t *)str, strlen(str));    }size_t Print::write(const uint8_t *buffer, size_t size){  size_t n = 0;  while (size--) {    n += write(*buffer++);  }  return n;}`
So printNumber uses the above write(const char *str) and the string is written character at a time.

I did a test with the new version of print.  It is faster but not as much as one would expect.

I timed this loop
Code: [Select]
`  for (int i = 0; i < 20000; i++) {    file.println(i);  }`

The result for the old print is:
Quote

Time 9.38 sec
File size 128.89 KB
Write 13.74 KB/sec

For the new print with Stimmer's optimization.
Quote

Time 6.00 sec
File size 128.89 KB
Write 21.50 KB/sec

For a simple printField() function with no optimized divide that uses SdFat's write(const char*)
Quote

Time 2.66 sec
File size 128.89 KB
Write 48.44 KB/sec

So the improvement in Print will not be fully realized as long as printNumber uses Print::write(const char*).

I believe there is a way to fix this in a class derived from Print since size_t Print::write(const char *str) is not pure virtual.  Correct me if I am wrong.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 27, 2013, 11:47 pm

So printNumber uses the above write(const char *str) and the string is written character at a time.

If the SD library implements write(buf, len), then the it's supposed to be called instead of the 1-byte-at-a-time fallback in Print.cpp.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jun 27, 2013, 11:59 pm

I just tried it here with SD and it's terribly slow.  Because I don't have the exact program you used, here's what I tried.

Code: [Select]
`#include <SD.h>const int chipSelect = 0;void setup(){  Serial.begin(115200);  while (!Serial) ;  Serial.print("Initializing SD card...");  if (!SD.begin(chipSelect)) {    Serial.println("Card failed, or not present");    return;  }  Serial.println("card initialized.");  File dataFile = SD.open("datalog.txt", FILE_WRITE);  if (dataFile) {    uint32_t begin_us, end_us;    begin_us = micros();    for (int i = 0; i < 20000; i++) {      dataFile.println(i);    }    dataFile.close();    end_us = micros();    Serial.print("seconds =  ");    Serial.print((float)(end_us - begin_us) / 1000000.0);  }  }void loop() {}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: fat16lib on Jun 28, 2013, 12:23 am
Yes but if I declared Print::write(const char*) virtual and put the default definition in Print.cpp I get this improvement.

Old Print
Quote

Time 6.85 sec
File size 128.89 KB
Write 18.83 KB/sec

The new Print uses the size_t Print::write(const uint8_t *buffer, size_t size) but SdFat defines

Code: [Select]
`  int write(const void* buf, size_t nbyte);`

If I override the Print def of write for the new Print I get:
Quote

Time 2.61 sec
File size 128.89 KB
Write 49.40 KB/sec

A big improvement over the old print and no overrides:
Quote

Time 9.47 sec
File size 128.89 KB
Write 13.61 KB/sec

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: fat16lib on Jun 28, 2013, 12:27 am
I am using a very new version of SdFat, not the 3-4 year old version in SD.h.

I have some extra stuff in my loop to time the max and min print time but commenting it out doesn't change much.

The new SdFat is much faster, it allows flash programming to happen in the SD while the next block buffer is being filled by the Arduino program.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: fat16lib on Jun 28, 2013, 12:49 am
Paul,

I tried your sketch on a Uno with 1.0.5 and get a fast write with SD.h and the new Print.  About the same as with the new SdFat.  Looks like you don't even need the SdFat improvements at 50 KB/sec.

Quote
Initializing SD card...card initialized.
seconds =  2.53
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: stimmer on Jun 28, 2013, 08:23 pm
There is another speed optimization to make when you are using divmod10 to convert a long into base-10 digits: Only the first few digits of a 10-digit number need full 32-bit precision. Once the high byte is 0 you only need 24 bit precision (saves at least 10 cycles), then when the two highest bytes are 0 you only need a 16 bit divmod10 (saves over a microsecond). Once the three highest bytes are 0 the 8-bit divmod10 only takes a microsecond, and for the very last digit you don't need to do a division at all.

Code: [Select]
`   while(n & 0xff000000){divmod10_asm32(n, digit, tmp8);*--p = digit + '0';}   while(n & 0xff0000){divmod10_asm24(n, digit, tmp8);*--p = digit + '0';}   while(n & 0xff00){divmod10_asm16(n, digit, tmp8);*--p = digit + '0';}   while((n & 0xff)>9){divmod10_asm8(n, digit, tmp8);*--p = digit + '0';}   *--p = n + '0';`

The downside is an increase in code size of about 200 bytes. I've attached a modified Print.cpp including all the assembler macros, my profiling indicates it's 0.5us-1.5us faster per digit on average (which adds up to 7us faster for 5-10 digit numbers). I have not fully tested for accuracy yet.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jun 29, 2013, 04:05 pm
a well implemented trick - trading space for speed -

There are other tricks possible, not investigated though

Consider the fact that the largest uint32_t  < 50000 * 10000 * 10000;
with a divmod10000() one could divide the uint32 in 4digit chunks which can be processed by divmod10_asm16

a divmod1000() might be broader usable

another trick would use a divmod100() to extract 2 digits at a time => fit in a byte. => divmod10_asm8() to split it
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: fat16lib on Jul 08, 2013, 05:56 pm
I have used the ideas here to improve the speed of logging data to SD cards.  I wrote functions to print a numerical field to an SD card that are as much as five times faster than the standard Arduino print.

I am now able to log data from four analog pins on an Uno to an SD at 2000 Hz.  This is 8000 ADC values per second.  This is near the limit since each conversion takes about 110 microseconds.

I am using a small RTOS, Nil RTOS, written by Giovanni Di Sirio.  I ported this RTOS to avr arduinos http://code.google.com/p/rtoslibs/ (http://code.google.com/p/rtoslibs/).

I wrote a version of analogRead() that sleeps while the ADC conversion occurs.  This saves about 90 microseconds of CPU time per conversion.

I have also modified SdFat so it no longer busy waits for flash programming in the SD to complete.

Here are test results for 16-bit unsigned fields and 32-bit float fields.
Quote

This test compares the standard print function
with printField.  Each test prints 20,000 values.

Test of println(uint16_t)
Time 6.51 sec
File size 128.89 KB
Write 19.80 KB/sec

Test of printField(uint16_t, char)
Time 1.27 sec
File size 128.89 KB
Write 101.33 KB/sec

Test of println(double)
Time 17.17 sec
File size 160.00 KB
Write 9.32 KB/sec

Test of printField(double, char)
Time 3.56 sec
File size 160.00 KB
Write 44.94 KB/sec

Here are the test loops for the writes:
Code: [Select]
`   switch(test) {    case 0:      for (uint16_t i = 0; i < N_PRINT; i++) {        file.println(i);      }      break;    case 1:      for (uint16_t i = 0; i < N_PRINT; i++) {        file.printField(i, '\n');      }      break;    case 2:      for (uint16_t i = 0; i < N_PRINT; i++) {        file.println((double)0.0001*i, 4);      }      break;    case 3:      for (uint16_t i = 0; i < N_PRINT; i++) {        file.printField((double)0.0001*i, '\n', 4);      }   }`

I have attached the implementation of printField().
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jul 08, 2013, 09:47 pm
Impressive numbers, really good !

think divmod10() should become part of the core seeing these and the above numbers.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jul 08, 2013, 10:07 pm

think divmod10() should become part of the core seeing these and the above numbers.

I'm planning to release it in the next version of Teensyduino.

Stimmer, is Arduino Print LGPL ok with you?  Would you like your name to appear, other than "Stimmer" and link to this forum topic?  You deserve credit.... just let me know what should be in the comments?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jul 08, 2013, 10:22 pm
Stimmer, is Arduino Print LGPL ok with you?  Would you like your name to appear, other than "Stimmer" and link to this forum topic?

Credit is not enough.  Open source licenses require a copyright claim by Stimmer.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jul 08, 2013, 10:27 pm
Ok then.  I'll hold off publishing anything until Stimmer replies.

I'm currently working on a Mac-related USB problem, so it's not like there's any hurry.  But it would be nice to publish this code to a wider audience, if that's allowed?
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Jantje on Jul 08, 2013, 11:44 pm
@Paul
;) Shouldn't you be working on a space related issue with the Teensy uploader  ]:D
Best regards
Jantje
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: drjiohnsmith on Jul 09, 2013, 09:35 am
agree these should be standard libs of the core,

but the arduino team wont,
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: stimmer on Jul 09, 2013, 03:36 pm

think divmod10() should become part of the core seeing these and the above numbers.

I'm planning to release it in the next version of Teensyduino.

Stimmer, is Arduino Print LGPL ok with you?  Would you like your name to appear, other than "Stimmer" and link to this forum topic?  You deserve credit.... just let me know what should be in the comments?

Yes that's OK. Consider what I wrote to be in the public domain so anyone can use it under any license. Just put something like "this section of code based on public domain code by Stimmer, see post at forum.arduino.cc/whatever-the-url-is ".
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jul 09, 2013, 07:55 pm
Quote
agree these should be standard libs of the core,
but the arduino team wont,

I'm definitely not sure of the latter.
The code is not core lib ready yet. For Arduino-team the portability of core libs is a very important issue. The divmod10() function should be decorated with #ifdefs to select the C (portable to DUE?) or the assembly (328 only?) implementation .

So the fun part is over, now the real work starts ;)

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Jul 09, 2013, 08:26 pm
Quote

So the fun part is over, now the real work starts

It ought to be pretty simple to surround this code with  #ifdef __AVR__, or perhaps a check for the AVR architecture with multiply?  In fact, the Print.cpp I posted has #ifdef checks to use either the original C code, Stimmer's optimization, or your version of the Hackers Delight algorithm.  Yes, it requires some careful attention, but this part really isn't very difficult.

Whether the Arduino Team will accept this for the Print.cpp that ships for all Arduino boards is a good question.  Historically, they've been pretty uninterested in speed optimizations.

I definitely do plan to publish this in the next version of Teensyduino, likely within the next 6 weeks.  Of course, if Coding Badly is satisfied with the licensing?  ... Thanks Stimmer!  :D  You definitely did the heavy lifting with an amazing optimization!

Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jul 09, 2013, 08:50 pm
Of course, if Coding Badly is satisfied with the licensing?

Good enough for me.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jul 11, 2013, 10:31 pm
update on printfloat version using divmod10()       (follows - http://forum.arduino.cc/index.php?topic=167414.msg1295109#msg1295109 - )

a new version to print the remainder (limited to 10 decimals) of a float
- timing is slightly shorter
- code is more straightforward than the previous one

insert in Print.cpp  ==> size_t Print::printFloat()
Code: [Select]
`  // Print the decimal point, but only if there are digits beyond  if (digits > 0)  {    n += print(".");    uint32_t t = 1;    for (uint8_t i=0; i<digits; i++) t = ((t<<2) + t)<<1;  // t *= 10    uint32_t rem = remainder * t;    char z="0000000000";  // max 10 decimals    z[digits]='\0';      uint32_t d = 0;    uint8_t m = 0;    while (rem > 10)    {      divmod10(rem, d, m);      z[--digits] = m + '0';      rem = d;    }    z[--digits] = rem + '0';  // last digit    n += print(z);  }`

testing
10737.4179
1.0182
107.37
Time=1104      << original Print.cpp 2144;      almost 50% off (for this particular test)
done

testsketch
Code: [Select]
`unsigned long start = 0;unsigned long stop = 0;volatile unsigned long q;void setup(){  Serial.begin(115200);  Serial.println("testing");    byte backup = TIMSK0;    TCCR1A = 0;  TCCR1B = 4;  TIMSK1 |= _BV(TOIE1);    Serial.flush(); //wait for serial buffer to clear  TIMSK0 = 0; //disable millis;  TCNT1 = 0;    Serial.println(10737.41824, 4);  Serial.println(1.01819584, 4);  Serial.println(107.37, 2);    stop = TCNT1; //how many clock cycles.  TIMSK0 = backup; //renable millis;  stop*=16;//There are 16us per clock cycle with a 1:256 prescaler.  Serial.print("Time=");  Serial.println(stop);  Serial.println("done");  TIMSK1 &= ~_BV(TOIE1);}void loop(){}ISR(TIMER1_OVF_vect) {  Serial.println("I Overflowed!"); //just to make sure we can tell if this happens.}`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jul 11, 2013, 10:36 pm
update
replace n += print(".");   with    n += print('.');

Time=1072  << original 2144 is 50% off
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Coding Badly on Jul 12, 2013, 12:12 am

Does this make any difference...
Code: [Select]
`n += write('.');`
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jul 12, 2013, 08:09 pm

Does this make any difference...
Code: [Select]
`n += write('.');`

not measurable with the test script, but removing some "stacked calls" and use write() where possible may add up.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Jul 21, 2013, 04:30 pm
update on printfloat version using divmod10() . Code see- http://forum.arduino.cc/index.php?topic=167414.msg1312959#msg1312959 -

Incorporated the Stimmer ASM divmod10_asm into the floating point test. - http://forum.arduino.cc/index.php?topic=167414.msg1293679#msg1293679 -
It strips of another 3% for printing floats

output:
testing
10737.4179
1.0182
107.37
Time=1008  << original 2144 is 53% off
done

1 millisecond for 19 digits is about 50+ uSec per digit (iso 100)

update: printFloat continues in its own thread here - http://forum.arduino.cc/index.php?topic=179111.0 -
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Oct 18, 2013, 04:55 pm
I published a Teensyduino release candidate today, which includes Stimmer's optimization.

Thanks to everyone who worked and contributed to this awesome speedup.  Soon it'll be in widespread use on Teensy 2.0 and Teensy++ 2.0 boards.  :)
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Oct 18, 2013, 08:17 pm
Thanks Paul,

We can propose on the developers list that the divmod10 code (including #ifdef  to select asm or C version) becomes a separate .h file within the core.  I've seen " x/10 x%10" code constructs in several libs
(recently - http://forum.arduino.cc//index.php?topic=190472.0 - )

what do you think?

update: the divmod.h (?) could include work discussed here - http://forum.arduino.cc//index.php?topic=172635.0 -
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Oct 18, 2013, 08:27 pm
I think it's a good idea.

But I also find proposing stuff on the developer mail list to be exhausting.  I'm not personally planning to propose it and respond to the many "bike shedding" replies that are likely to result.

If you propose it, I'll chime in with a "+1".  When/if Arduino publishes it, I'll adopt whatever naming and header file convention they use.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: pjrc on Oct 18, 2013, 10:42 pm
If you do write up a good proposal on the developer mail list and Cristian replies favorably (really, anything other than saying no or probably not), perhaps I'll adopt it for Teensyduino 1.17 or 1.18?  I just posted the first release candidate today.  I don't have a deadline for 1.17, but unless there's any problems or other urgent stuff to add, I'll probably release sometime next week.

I can't see the harm in placing this into a header where it can be used by libraries.  There just needs to be some consensus from Arduino (where really, Cristian's voice is the only one that matters) so I don't end up publishing this in a way that will later break when/if they do it.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: odometer on Nov 03, 2013, 05:23 pm
Personally, I would love to see it wrapped in a function which, for purposes of this post, I shall call digit().

Code: [Select]
`Example use of digit():long n = 4216738;byte ones = digit(n, 0);  // now, ones == 8byte tens = digit(n, 1);  // now, tens == 3byte hund = digit(n, 2);  // now, hund == 7byte thou = digit(n, 3);  // now, thou == 6byte myri = digit(n, 4);  // now, myri == 1byte lakh = digit(n, 5);  // now, lakh == 2byte mill = digit(n, 6);  // now, mill == 4byte cror = digit(n, 7);  // now, cror == 0`
This would be extremely useful for displays.

I'm not sure how it should behave for negative numbers, though.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: Tom Carpenter on Nov 03, 2013, 05:28 pm
for negatives you would have to first negate to make them positive, then use divmod10 then negate the resulting division value (the mod value I believe would stay positive). If you are doing it multiple times in a row then you would leave the results all positive and then at the very end negate the remaining division part.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: robtillaart on Nov 03, 2013, 07:09 pm

Personally, I would love to see it wrapped in a function which, for purposes of this post, I shall call digit().

Code: [Select]
`Example use of digit():long n = 4216738;byte ones = digit(n, 0);  // now, ones == 8byte tens = digit(n, 1);  // now, tens == 3byte hund = digit(n, 2);  // now, hund == 7byte thou = digit(n, 3);  // now, thou == 6byte myri = digit(n, 4);  // now, myri == 1byte lakh = digit(n, 5);  // now, lakh == 2byte mill = digit(n, 6);  // now, mill == 4byte cror = digit(n, 7);  // now, cror == 0`
This would be extremely useful for displays.

I'm not sure how it should behave for negative numbers, though.

@odometer
Have you seen this thread, - http://forum.arduino.cc/index.php?topic=179111.0 -
it uses divmod10 to increase the speed of the print class from which display classes are derived. But also the Serial class is derived from Print.
Title: Re: divmod10() : a fast replacement for /10 and %10 (unsigned)
Post by: samepaul on Sep 12, 2019, 10:59 pm
hi Rob

I know that Arduino code mainly uses LGPL license, but since this topic is not source code it can have less restrictive licensing.
Unfortunately for my project (not related to Arduino) I cannot use any license requiring exposure of proprietary sources.