Pages: 1 [2] 3 4 ... 8   Go Down
Author Topic: divmod10() : a fast replacement for /10 and %10 (unsigned)  (Read 14307 times)
0 Members and 2 Guests are viewing this topic.
North Queensland, Australia
Offline Offline
Edison Member
*
Karma: 53
Posts: 1799
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Sorry for this. I don't want to abuse this thread for progress reports on senseless optimization. But, please excuse me ... I'm in a kind of flow.

I don't think you're in danger of that, I'm sure robtillaart will enjoy testing your approach. Its not senseless either, someone one day will need to squeeze in a few extra repetitions that the standard method can't do in time.
Logged


Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 170
Posts: 12482
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Well done genom2!!

On my "reference UNO" it clocks 12.4760


I looked at the code and I could remove 2 lines and it still passes the random test..UPdate : false positve see below

timing becomes a staggering 11.4680
Code:
void divmod10_4(uint32_t in, uint32_t &div, uint32_t &mod)
{
  uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = (0.75*in) >> 3
  uint32_t q= (x>>4) + x;  // 0.796875 *in
  x= q;
  q= (q>>8) + x;     // 0.799987793 * in
  q= (q>>8) + x;     // 0.803112745 * in
  //q= (q>>8) + x;
  //q= (q>>8) + x;

  div = (q >> 3);
  mod = in - (((div << 2) + div) << 1);
}

did a extended 1000K test

divmod10       11.4680
--------------
test div 0..65535
test mod 0..65535
test divmod 0..maxlong (1000K random samples)
done


« Last Edit: June 11, 2013, 03:17:46 pm by robtillaart » Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Offline Offline
Newbie
*
Karma: 2
Posts: 12
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

No, you can't remove the 2 lines! As you said, a random test is no proof!

Take this number 7031721 as a counter example:
7031721/10 expected 703172 but your modification returns 703171

But an inconspicuous modification on calculating mod (in the last line) does the trick and another ~8% off:
Code:
void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
uint32_t q= (x>>4) + x;
x= q;
q= (q>>8) + x;
q= (q>>8) + x;
q= (q>>8) + x;
        q= (q>>8) + x;

div = (q >> 3);
mod = in - ((q & ~0x7) + (div << 1));
}
« Last Edit: June 11, 2013, 03:21:22 pm by genom2 » Logged

Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 170
Posts: 12482
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Great counter example!
and a very smart way to mod --> 11.7200  impressive !
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Offline Offline
Newbie
*
Karma: 0
Posts: 48
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I don't have that exact Arduino model, but I'm curious -- if rotates happen one bit at a time, and take 2 instructions per bit, or 16 operations to rotate, perhaps a copy is faster? What do you get with this?  smiley-mr-green

Code:
void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{

        union {
                uint32_t q;
                uint8_t qq[4];
        };

        union {
                uint32_t r;
                uint8_t rr[4];
        };

        r = 0;
        uint32_t x = (in|1) - (in >> 2); // div = in/10 <~~> div = (0.75*in) >> 3
        q = (x >> 4) + x;
        x = q;
        rr[0] = qq[1];
        rr[1] = qq[2];
        rr[2] = qq[3];
        q = r +x

        rr[0] = qq[1];
        rr[1] = qq[2];
        rr[2] = qq[3];
        q = r + x

        rr[0] = qq[1];
        rr[1] = qq[2];
        rr[2] = qq[3];
        q = r + x

        rr[0] = qq[1];
        rr[1] = qq[2];
        rr[2] = qq[3];
        q = r + x


        div = (q >> 3);
        mod = in - (((div << 2) + div) << 1);
}


« Last Edit: June 12, 2013, 12:21:53 am by ajk_ » Logged

Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 170
Posts: 12482
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

fixed some missing ;

divmod10       14.5480

Problem might be that the index[] costs extra, with pointer math one can solve that. Have no time to try ...
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Global Moderator
Dallas
Offline Offline
Shannon Member
*****
Karma: 176
Posts: 12286
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Problem might be that the index[] costs extra, with pointer math one can solve that.

Costs twice.  Pointers / arrays tie up a register pair that could otherwise be used to hold a 16 bit value.  In the extreme case, this forces intermediate values to be stored on the stack.  Adding pointers / arrays typically causes minor problems for the optimizer (data has to be moved between registers).

Shifts by multiples of 4 bits (4, 8, 12, 16, ...) are generally not very expensive on AVR processors with the GCC compiler.  The compiler "knows" the results of a multiple-of-8 shift are ready-to-use in a another register (essentially free; basically what @ajk_ is doing).  The AVR processor has a "nibble swap" instruction that helps with multiple-of-4 shifts (not free but about the same price as a shift by 2 bits).
Logged

Offline Offline
Newbie
*
Karma: 0
Posts: 48
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Bummer, that is possibly why doing the same in assembler whips.
I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.
Looking at the AVR GCC optimizer 'bug list' there are a pile of missed opportunities to optimize.
Logged

Global Moderator
Dallas
Offline Offline
Shannon Member
*****
Karma: 176
Posts: 12286
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.

Apparently I did not word my post very well.  The point is that the compiler does optimize 8, 16, 24 bit shifts.  They are optimized to the point of being essentially free.
Logged

Leeds, UK
Offline Offline
Edison Member
*
Karma: 72
Posts: 1642
Once the magic blue smoke is released, it won't go back in!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

at the cost of 22bytes of flash:
Code:
void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
  uint32_t x;
  
  asm volatile(
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  
  q += x;
  
  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;

  x = (q >> 2);
  mod = in - ((q & ~0x7) + (x & ~0x1)); //the 'x' here used to be: ((q>>3)<<1), but now it is just ((q>>2) & 0xFFFFFFFE) which result in the same value
  div = (x >> 1); //div used to be (q>>3). It is now ((q>>2)>>1) which is the same.
}

I've basically targeted the '>>4' which was being done in a loop of 4 shifts.

I have also modified the last lines to reduce the number of bit shifts.

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   10.2680
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

Thats a further 12.39% reduction.


EDIT

For no extra cost on my last mod:

Code:
void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
  uint32_t x;
  
  asm volatile(
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  
  q += x;
  
  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;

  q &= ~0x7;
  x = (q >> 2);
  mod = in - (q + x);
  div = (x >> 1);
}

I've modified where the (q & ~0x7) is done to remove the need for the (x & ~0x1) that I had to add.

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   10.0800
--------------
test div 0..MAX
test mod 0..MAX
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done

Thats a further 1.83% reduction on my last one.
« Last Edit: June 14, 2013, 04:59:08 pm by Tom Carpenter » Logged

~Tom~

Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 170
Posts: 12482
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Wow!

didn't expect it would come so close to 10 usec - on my UNO it clocks even 10.2080 !

worked on those last lines too, but I had
Code:
  x = (q >> 2);
  div = (x >> 1);
  mod = in - ((q & ~0x7) + (x & ~0x1));
which took longer (12++) ==> so the order of the lines does matter. (it needs to preserve the previous value of x in the pre-last line)

Can we win time by the fact that mod is always between 0..9 so in fact an uint8_t should be large enough, is an uint32_t overkill?
Code:
void divmod10_4_1(uint32_t in, uint32_t &div, uint8_t &mod)   // made mod a byte
{
  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
  uint32_t x;

  asm volatile(
  "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
:
    "=r" (x)
:
    "r" (q)
:
    );

  q += x;

  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;

  x = (q >> 2);
  mod = in - (q & ~0x7) - (x & ~0x1);
  div = (x >> 1);
}

Quote
testing divmod10()

i/10   38.6920
i%10   38.6120
--------------
divmod10       10.1440
--------------
test div 0..65535
test mod 0..65535
test div 0..maxlong (1000K random samples)
done

The diff is only 0.064 usec ( 0.6% or just 1 instruction )  . So we are 3 instructions from the 10 microsecond limit..
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Melbourne, Australia
Offline Offline
God Member
*****
Karma: 8
Posts: 567
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

This thread deserves an award or something. Lovin' it.
Logged

Windows serial port monitor: Tellurium | Arduino serial port debugging library: DBG | Cusom LCD char generator | Technical questions will only be answered in forum threads

Leeds, UK
Offline Offline
Edison Member
*
Karma: 72
Posts: 1642
Once the magic blue smoke is released, it won't go back in!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

While its good to reduce the mod to 'byte', it causes the compiler to drop one of its beautiful optimisations, plus miss another obvious one.

I have added them:
Code:
void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
  uint32_t* divPtr;
  uint8_t* modPtr;
  asm volatile(
    "movw %0, %2  \n\t"
    "movw %1, %3  \n\t"
    : "=z" (divPtr), "=x" (modPtr)
    : "r" (&div), "r" (&mod)
    :
  );//force the compiler to use the x register for mod and z for div.
    //This is because the compiler can then use st Z, st Z+1, st Z+2 ...
    //for div, and just st X for mod. before it used st Z for mod and had
    //to do: st X+, st X+, st X+, st X+, X=X-3. So we save a 2clock instruction.
  
  uint32_t q = (in|1);// div = in/10 <~~> div = 0.75*in/8
  uint32_t x;
  asm volatile( //x = in >> 2
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (in)
    :
  );
  
  q = q - x;
  
  asm volatile( //x = q >> 4
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  
  q += x;
  
  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  
  q &= ~0x7;
  asm volatile( //x = q >> 2
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  *modPtr = (in - (q + x));
  *divPtr = (x >> 1);
}

Quote
testing divmod10()

i/10   38.8360
i%10   38.7400
--------------
divmod10   9.2000
--------------
test div 0..MAX
test mod 0..MAX
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done
« Last Edit: June 15, 2013, 05:05:29 am by Tom Carpenter » Logged

~Tom~

Offline Offline
Newbie
*
Karma: 2
Posts: 12
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

To cite robtillaart's motto:
"In theory there is no difference between theory and practice, however in practice there are many..."

So, I tried this:
Code:
void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
uint32_t q= (x>>4) + x;
x= q;
q= (q>>16) + (x>>8) + x;
q= (q>>8) + x;
q= (q>>8) + x;

q &= ~0x7;
x = (q >> 2);
mod = in - (q + x);
div = (x >> 1);
}
At least in theory there should be some (at least one) saveings on assembler instruction. Because this:
Code:
q= (q>>16) + (x>>8) + x;
q= (q>>8) + x;
q= (q>>8) + x;
is a valid substitute for this:
Code:
q= (q>>8) + x;
q= (q>>8) + x;
q= (q>>8) + x;
q= (q>>8) + x;


However in practice it made performance worse.

I agree, the (avr-)gcc does a beautiful job. But, from this result I have the feeling that shift 16 is not in its optimization scope.

Can manually coded assembler do the job? (Which is not in my scope.)

BTW: This is not a valid substitute:
Code:
q= (q>>16) + (x>>8) + x;
q= (q>>16) + (x>>8) + x;
« Last Edit: June 15, 2013, 06:37:43 am by genom2 » Logged

Leeds, UK
Offline Offline
Edison Member
*
Karma: 72
Posts: 1642
Once the magic blue smoke is released, it won't go back in!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

I'm not sure that:
Code:
x = q;
q= (q>>16) + (x>>8) + x;
is a valid substitution for
Code:
x = q
q= (q>>8) + x;
q= (q>>8) + x;

take q = 0x00000FFF

In your substitution:
Code:
x = 0x00000FFF
q = 0x00000000 + 0x0000000F + 0x00000FFF = 0x0000100E

In the original:
Code:
x = 0x00000FFF
q = 0x0000000F + 0x00000FFF = 0x0000100E
q = 0x00000010 + 0x00000FFF = 0x0000100F

Notice the discrepancy? Its because you lost a vital carried bit.


Can manually coded assembler do the job? (Which is not in my scope.)
I have been smiley
« Last Edit: June 15, 2013, 07:20:58 am by Tom Carpenter » Logged

~Tom~

Pages: 1 [2] 3 4 ... 8   Go Up
Jump to: