Go Down

Topic: divmod10() : a fast replacement for /10 and %10 (unsigned) (Read 72142 times)previous topic - next topic

pYro_65

Sorry for this. I don't want to abuse this thread for progress reports on senseless optimization. But, please excuse me ... I'm in a kind of flow.

I don't think you're in danger of that, I'm sure robtillaart will enjoy testing your approach. Its not senseless either, someone one day will need to squeeze in a few extra repetitions that the standard method can't do in time.

robtillaart

#16
Jun 11, 2013, 05:52 pmLast Edit: Jun 11, 2013, 10:17 pm by robtillaart Reason: 1
Well done genom2!!

On my "reference UNO" it clocks 12.4760

I looked at the code and I could remove 2 lines and it still passes the random test..UPdate : false positve see below

timing becomes a staggering 11.4680
Code: [Select]
`void divmod10_4(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = (0.75*in) >> 3  uint32_t q= (x>>4) + x;  // 0.796875 *in  x= q;  q= (q>>8) + x;     // 0.799987793 * in  q= (q>>8) + x;     // 0.803112745 * in  //q= (q>>8) + x;  //q= (q>>8) + x;  div = (q >> 3);  mod = in - (((div << 2) + div) << 1);}`

did a extended 1000K test

divmod10       11.4680
--------------
test div 0..65535
test mod 0..65535
test divmod 0..maxlong (1000K random samples)
done

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

genom2

#17
Jun 11, 2013, 09:49 pmLast Edit: Jun 11, 2013, 10:21 pm by genom2 Reason: 1
No, you can't remove the 2 lines! As you said, a random test is no proof!

Take this number 7031721 as a counter example:
7031721/10 expected 703172 but your modification returns 703171

But an inconspicuous modification on calculating mod (in the last line) does the trick and another ~8% off:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){ uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8 uint32_t q= (x>>4) + x; x= q; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x;        q= (q>>8) + x; div = (q >> 3); mod = in - ((q & ~0x7) + (div << 1));}`

robtillaart

Great counter example!
and a very smart way to mod --> 11.7200  impressive !
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

ajk_

#19
Jun 12, 2013, 07:19 amLast Edit: Jun 12, 2013, 07:21 am by ajk_ Reason: 1
I don't have that exact Arduino model, but I'm curious -- if rotates happen one bit at a time, and take 2 instructions per bit, or 16 operations to rotate, perhaps a copy is faster? What do you get with this?

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){        union {                uint32_t q;                uint8_t qq[4];        };        union {                uint32_t r;                uint8_t rr[4];        };        r = 0;        uint32_t x = (in|1) - (in >> 2); // div = in/10 <~~> div = (0.75*in) >> 3        q = (x >> 4) + x;        x = q;        rr[0] = qq[1];        rr[1] = qq[2];        rr[2] = qq[3];        q = r +x        rr[0] = qq[1];        rr[1] = qq[2];        rr[2] = qq[3];        q = r + x        rr[0] = qq[1];        rr[1] = qq[2];        rr[2] = qq[3];        q = r + x        rr[0] = qq[1];        rr[1] = qq[2];        rr[2] = qq[3];        q = r + x        div = (q >> 3);        mod = in - (((div << 2) + div) << 1);}`

robtillaart

fixed some missing ;

divmod10       14.5480

Problem might be that the index[] costs extra, with pointer math one can solve that. Have no time to try ...
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Problem might be that the index[] costs extra, with pointer math one can solve that.

Costs twice.  Pointers / arrays tie up a register pair that could otherwise be used to hold a 16 bit value.  In the extreme case, this forces intermediate values to be stored on the stack.  Adding pointers / arrays typically causes minor problems for the optimizer (data has to be moved between registers).

Shifts by multiples of 4 bits (4, 8, 12, 16, ...) are generally not very expensive on AVR processors with the GCC compiler.  The compiler "knows" the results of a multiple-of-8 shift are ready-to-use in a another register (essentially free; basically what @ajk_ is doing).  The AVR processor has a "nibble swap" instruction that helps with multiple-of-4 shifts (not free but about the same price as a shift by 2 bits).

ajk_

Bummer, that is possibly why doing the same in assembler whips.
I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.
Looking at the AVR GCC optimizer 'bug list' there are a pile of missed opportunities to optimize.

I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.

Apparently I did not word my post very well.  The point is that the compiler does optimize 8, 16, 24 bit shifts.  They are optimized to the point of being essentially free.

Tom Carpenter

#24
Jun 14, 2013, 09:45 pmLast Edit: Jun 14, 2013, 11:59 pm by Tom Carpenter Reason: 1
at the cost of 22bytes of flash:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8  uint32_t x;    asm volatile(    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );    q += x;    x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  x = (q >> 2);  mod = in - ((q & ~0x7) + (x & ~0x1)); //the 'x' here used to be: ((q>>3)<<1), but now it is just ((q>>2) & 0xFFFFFFFE) which result in the same value  div = (x >> 1); //div used to be (q>>3). It is now ((q>>2)>>1) which is the same.}`

I've basically targeted the '>>4' which was being done in a loop of 4 shifts.

I have also modified the last lines to reduce the number of bit shifts.

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   10.2680
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

Thats a further 12.39% reduction.

EDIT

For no extra cost on my last mod:

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint32_t &mod){  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8  uint32_t x;    asm volatile(    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );    q += x;    x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q &= ~0x7;  x = (q >> 2);  mod = in - (q + x);  div = (x >> 1);}`

I've modified where the (q & ~0x7) is done to remove the need for the (x & ~0x1) that I had to add.

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   10.0800
--------------
test div 0..MAX
test mod 0..MAX
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

Thats a further 1.83% reduction on my last one.
~Tom~

robtillaart

Wow!

didn't expect it would come so close to 10 usec - on my UNO it clocks even 10.2080 !

worked on those last lines too, but I had
Code: [Select]
`  x = (q >> 2);  div = (x >> 1);   mod = in - ((q & ~0x7) + (x & ~0x1));`
which took longer (12++) ==> so the order of the lines does matter. (it needs to preserve the previous value of x in the pre-last line)

Can we win time by the fact that mod is always between 0..9 so in fact an uint8_t should be large enough, is an uint32_t overkill?
Code: [Select]
`void divmod10_4_1(uint32_t in, uint32_t &div, uint8_t &mod)   // made mod a byte{  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8  uint32_t x;  asm volatile(  "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t":     "=r" (x):     "r" (q):    );  q += x;  x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  x = (q >> 2);  mod = in - (q & ~0x7) - (x & ~0x1);   div = (x >> 1);}`

Quote
testing divmod10()

i/10   38.6920
i%10   38.6120
--------------
divmod10       10.1440
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (1000K random samples)
done

The diff is only 0.064 usec ( 0.6% or just 1 instruction )  . So we are 3 instructions from the 10 microsecond limit..
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

aarondc

This thread deserves an award or something. Lovin' it.

Tom Carpenter

#27
Jun 15, 2013, 12:01 pmLast Edit: Jun 15, 2013, 12:05 pm by Tom Carpenter Reason: 1
While its good to reduce the mod to 'byte', it causes the compiler to drop one of its beautiful optimisations, plus miss another obvious one.

Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){  uint32_t* divPtr;  uint8_t* modPtr;  asm volatile(    "movw %0, %2  \n\t"    "movw %1, %3  \n\t"    : "=z" (divPtr), "=x" (modPtr)    : "r" (&div), "r" (&mod)    :  );//force the compiler to use the x register for mod and z for div.    //This is because the compiler can then use st Z, st Z+1, st Z+2 ...    //for div, and just st X for mod. before it used st Z for mod and had    //to do: st X+, st X+, st X+, st X+, X=X-3. So we save a 2clock instruction.    uint32_t q = (in|1);// div = in/10 <~~> div = 0.75*in/8  uint32_t x;  asm volatile( //x = in >> 2    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (in)    :  );    q = q - x;    asm volatile( //x = q >> 4    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );    q += x;    x= q;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;  q= (q>>8) + x;    q &= ~0x7;  asm volatile( //x = q >> 2    "movw %A0, %A1  \n\t"    "movw %C0, %C1  \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    "lsr %D0       \n\t"    "ror %C0       \n\t"    "ror %B0       \n\t"    "ror %A0       \n\t"    : "=r" (x)    : "r" (q)    :  );  *modPtr = (in - (q + x));  *divPtr = (x >> 1);}`

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   9.2000
--------------
test div 0..MAX
test mod 0..MAX
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done
~Tom~

genom2

#28
Jun 15, 2013, 01:23 pmLast Edit: Jun 15, 2013, 01:37 pm by genom2 Reason: 1
To cite robtillaart's motto:
"In theory there is no difference between theory and practice, however in practice there are many..."

So, I tried this:
Code: [Select]
`void divmod10(uint32_t in, uint32_t &div, uint8_t &mod){ uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8 uint32_t q= (x>>4) + x; x= q; q= (q>>16) + (x>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q &= ~0x7; x = (q >> 2); mod = in - (q + x); div = (x >> 1);}`
At least in theory there should be some (at least one) saveings on assembler instruction. Because this:
Code: [Select]
` q= (q>>16) + (x>>8) + x; q= (q>>8) + x; q= (q>>8) + x;`
is a valid substitute for this:
Code: [Select]
` q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x; q= (q>>8) + x;`

However in practice it made performance worse.

I agree, the (avr-)gcc does a beautiful job. But, from this result I have the feeling that shift 16 is not in its optimization scope.

Can manually coded assembler do the job? (Which is not in my scope.)

BTW: This is not a valid substitute:
Code: [Select]
` q= (q>>16) + (x>>8) + x; q= (q>>16) + (x>>8) + x;`

Tom Carpenter

#29
Jun 15, 2013, 02:17 pmLast Edit: Jun 15, 2013, 02:20 pm by Tom Carpenter Reason: 1
I'm not sure that:
Code: [Select]
`x = q;q= (q>>16) + (x>>8) + x;`
is a valid substitution for
Code: [Select]
`x = qq= (q>>8) + x;q= (q>>8) + x;`

take q = 0x00000FFF

Code: [Select]
`x = 0x00000FFFq = 0x00000000 + 0x0000000F + 0x00000FFF = 0x0000100E`

In the original:
Code: [Select]
`x = 0x00000FFFq = 0x0000000F + 0x00000FFF = 0x0000100Eq = 0x00000010 + 0x00000FFF = 0x0000100F`

Notice the discrepancy? Its because you lost a vital carried bit.

Can manually coded assembler do the job? (Which is not in my scope.)

I have been
~Tom~

Go Up