divmod10() : a fast replacement for /10 and %10 (unsigned)

fixed some missing ;

divmod10 14.5480

Problem might be that the index[] costs extra, with pointer math one can solve that. Have no time to try ...

robtillaart: Problem might be that the index[] costs extra, with pointer math one can solve that.

It costs twice. Pointers and arrays tie up a register pair that could otherwise be used to hold a 16-bit value. In the extreme case, this forces intermediate values to be stored on the stack. Adding pointers or arrays also typically causes minor problems for the optimizer (data has to be moved between registers).

Shifts by multiples of 4 bits (4, 8, 12, 16, ...) are generally not very expensive on AVR processors with the GCC compiler. The compiler "knows" the result of a multiple-of-8 shift is ready to use in another register (essentially free; basically what @ajk_ is doing). The AVR processor has a "nibble swap" instruction that helps with multiple-of-4 shifts (not free, but about the same price as a shift by 2 bits).

Bummer; that is possibly why doing the same in assembler whips.
I am kind of disappointed that the AVR GCC misses the obvious shift-by-8/16/24 optimizations.
Looking at the AVR GCC optimizer 'bug list', there is a pile of missed opportunities to optimize.

ajk_: I am kind of disappointed that the AVR GCC misses the obvious shift-by 8/16/24 optimizations.

Apparently I did not word my post very well. The point is that the compiler [u]does[/u] optimize 8, 16, and 24-bit shifts. They are optimized to the point of being essentially free.

At the cost of 22 bytes of flash:

void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
  uint32_t x;
  
  asm volatile(
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  
  q += x;
  
  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;

  x = (q >> 2);
  mod = in - ((q & ~0x7) + (x & ~0x1)); // the 'x' here used to be ((q>>3)<<1); it is now ((q>>2) & 0xFFFFFFFE), which results in the same value
  div = (x >> 1); // div used to be (q>>3); it is now ((q>>2)>>1), which is the same
}

I’ve basically targeted the ‘>>4’, which was being done as a loop of 4 single-bit shifts.

I have also modified the last lines to reduce the number of bit shifts.

testing divmod10()

i/10 38.8360
i%10 38.7400

divmod10 10.2680

test div 0…65535
test mod 0…65535
test div 0…maxlong (20K random samples)
test mod 0…maxlong (20K random samples)
done

That's a further 12.39% reduction.

EDIT

For no extra cost on my last mod:

void divmod10(uint32_t in, uint32_t &div, uint32_t &mod)
{
  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
  uint32_t x;
  
  asm volatile(
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  
  q += x;
  
  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;

  q &= ~0x7;
  x = (q >> 2);
  mod = in - (q + x);
  div = (x >> 1);
}

I’ve modified where the (q & ~0x7) is done to remove the need for the (x & ~0x1) that I had to add.

testing divmod10()

i/10 38.8360
i%10 38.7400

divmod10 10.0800

test div 0…MAX
test mod 0…MAX
test div 0…maxlong (20K random samples)
test mod 0…maxlong (20K random samples)
done

That's a further 1.83% reduction on my last one.

Wow!

Didn't expect it would come so close to 10 usec; on my UNO it even clocks 10.2080!

worked on those last lines too, but I had

  x = (q >> 2);
  div = (x >> 1); 
  mod = in - ((q & ~0x7) + (x & ~0x1));

which took longer (12+ usec) ==> so the order of the lines does matter (the second-to-last line needs to preserve the previous value of x).

Can we win time from the fact that mod is always between 0…9, so in fact a uint8_t should be large enough? Is a uint32_t overkill?

void divmod10_4_1(uint32_t in, uint32_t &div, uint8_t &mod)   // made mod a byte
{
  uint32_t q= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
  uint32_t x;

  asm volatile(
  "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );

  q += x;

  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;

  x = (q >> 2);
  mod = in - (q & ~0x7) - (x & ~0x1); 
  div = (x >> 1);
}

testing divmod10()

i/10 38.6920
i%10 38.6120

divmod10 10.1440

test div 0…65535
test mod 0…65535
test div 0…maxlong (1000K random samples)
done

The diff is only 0.064 usec (0.6%, or just 1 instruction). So we are 3 instructions from the 10 microsecond limit…

This thread deserves an award or something. Lovin' it.

While it's good to reduce mod to a byte, it causes the compiler to drop one of its beautiful optimisations, plus miss another obvious one.

I have added them:

void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
  uint32_t* divPtr;
  uint8_t* modPtr;
  asm volatile( 
    "movw %0, %2  \n\t"
    "movw %1, %3  \n\t"
    : "=z" (divPtr), "=x" (modPtr)
    : "r" (&div), "r" (&mod)
    :
  );//force the compiler to use the x register for mod and z for div.
    //This is because the compiler can then use st Z, st Z+1, st Z+2 ...
    //for div, and just st X for mod. before it used st Z for mod and had
    //to do: st X+, st X+, st X+, st X+, X=X-3. So we save a 2clock instruction. 
  
  uint32_t q = (in|1);// div = in/10 <~~> div = 0.75*in/8
  uint32_t x;
  asm volatile( //x = in >> 2
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (in)
    :
  );
  
  q = q - x;
  
  asm volatile( //x = q >> 4
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  
  q += x;
  
  x= q;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  q= (q>>8) + x;
  
  q &= ~0x7;
  asm volatile( //x = q >> 2
    "movw %A0, %A1  \n\t"
    "movw %C0, %C1  \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    "lsr %D0       \n\t"
    "ror %C0       \n\t"
    "ror %B0       \n\t"
    "ror %A0       \n\t"
    : "=r" (x)
    : "r" (q)
    :
  );
  *modPtr = (in - (q + x));
  *divPtr = (x >> 1);
}

testing divmod10()

i/10 38.8360
i%10 38.7400

divmod10 9.2000

test div 0…MAX
test mod 0…MAX
test div 0…maxlong (20K random samples)
test mod 0…maxlong (20K random samples)
done

To cite robtillaart’s motto:
“In theory there is no difference between theory and practice, however in practice there are many…”

So, I tried this:

void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
	uint32_t x= (in|1) - (in>>2); // div = in/10 <~~> div = 0.75*in/8
	uint32_t q= (x>>4) + x;
	x= q;
	q= (q>>16) + (x>>8) + x;
	q= (q>>8) + x;
	q= (q>>8) + x;

	q &= ~0x7;
	x = (q >> 2);
	mod = in - (q + x);
	div = (x >> 1);
}

At least in theory there should be some savings (at least one assembler instruction), because this:

	q= (q>>16) + (x>>8) + x;
	q= (q>>8) + x;
	q= (q>>8) + x;

is a valid substitute for this:

	q= (q>>8) + x;
	q= (q>>8) + x;
	q= (q>>8) + x;
	q= (q>>8) + x;

However in practice it made performance worse.

I agree, avr-gcc does a beautiful job. But from this result I have the feeling that a shift by 16 is not in its optimization scope.

Can manually coded assembler do the job? (Which is not in my scope.)

BTW: This is not a valid substitute:

	q= (q>>16) + (x>>8) + x;
	q= (q>>16) + (x>>8) + x;

I'm not sure that:

x = q;
q= (q>>16) + (x>>8) + x;

is a valid substitution for

x = q
q= (q>>8) + x;
q= (q>>8) + x;

take q = 0x00000FFF

In your substitution:

x = 0x00000FFF
q = 0x00000000 + 0x0000000F + 0x00000FFF = 0x0000100E

In the original:

x = 0x00000FFF
q = 0x0000000F + 0x00000FFF = 0x0000100E
q = 0x00000010 + 0x00000FFF = 0x0000100F

Notice the discrepancy? It's because you lost a vital carried bit.

genom2: Can manually coded assembler do the job? (Which is not in my scope.)

I have been :)

There is one more MASSIVE optimisation which can be achieved, but I can't make gcc do it.

The solution... write the entire function manually in assembler.

void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));
void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)
  //and that in arrives in r22:r25 quad
   asm volatile(
    "movw r30, %2   \n\t"  //uint32_t* divPtr = ÷
    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;
    
    "mov   r0, %A0  \n\t"  //byte temp = in
    "movw r18, %A0  \n\t"  //uint32_t q = in;
    "movw r20, %C0  \n\t"  
    "ori  r18, 0x01 \n\t"  //q |= 1;
    
    "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"  
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    
    "sub  r18, r22  \n\t" //q = q - x;
    "sbc  r19, r23  \n\t"
    "sbc  r20, r24  \n\t"
    "sbc  r21, r25  \n\t"
    
    "movw r22, r18  \n\t" //x = q;
    "movw r24, r20  \n\t"
    "lsr  r25       \n\t" //x = x >> 4;
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    
    "add  r22, r18  \n\t" //x = x + q
    "adc  r23, r19  \n\t"
    "adc  r24, r20  \n\t"
    "adc  r25, r21  \n\t"
    
    "movw r18, r22  \n\t" //q = x
    "movw r20, r24  \n\t"
    "add  r18, r23  \n\t" //q = q + (x >> 8)
    "adc  r19, r24  \n\t"
    "adc  r20, r25  \n\t"
    "adc  r21, r1   \n\t"

    "mov  r18, r19  \n\t" //q = q >> 8
    "mov  r19, r20  \n\t"
    "mov  r20, r21  \n\t"
    "eor  r21, r21  \n\t"
    "add  r18, r22  \n\t" //q = q + x
    "adc  r19, r23  \n\t"
    "adc  r20, r24  \n\t"
    "adc  r21, r25  \n\t"

    "mov  r18, r19  \n\t" //q = q >> 8
    "mov  r19, r20  \n\t"
    "mov  r20, r21  \n\t"
    "eor  r21, r21  \n\t"
    "add  r18, r22  \n\t" //q = q + x
    "adc  r19, r23  \n\t"
    "adc  r20, r24  \n\t"
    "adc  r21, r25  \n\t"
     
    "mov  r18, r19  \n\t" //q = q >> 8
    "mov  r19, r20  \n\t"
    "mov  r20, r21  \n\t"
    "eor  r21, r21  \n\t"
    "add  r18, r22  \n\t" //q = q + x
    "adc  r19, r23  \n\t"
    "adc  r20, r24  \n\t"
    "adc  r21, r25  \n\t"
    
    "andi r18, 0xF8 \n\t" //q = q & ~0x7
    
    "sub   r0, r18  \n\t" //in = in - q
    
    "lsr  r21       \n\t" //q = q >> 2 
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    "lsr  r21       \n\t"
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    
    "sub  r0, r18   \n\t" //in = in - q
    "st    X, r0    \n\t" //mod = in;
    
    "lsr  r21       \n\t" //q = q >> 1
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    
    "st       Z, r18  \n\t" //div = q
    "std  Z+1, r19  \n\t"
    "std  Z+2, r20  \n\t"
    "std  Z+3, r21  \n\t"
    
    : 
    : "r" (in), "r" (&mod), "r" (&div)
    : "r0", "r26", "r27", "r31", "r31"
  );  
}

And the results speak for themselves:

i/10 38.8360

i%10 38.7400

divmod10 7.8800

test div 0..65535
test mod 0..65535
test div 0..maxlong (20K random samples)
test mod 0..maxlong (20K random samples)
done


Including the call and ret, it takes 101 clock cycles to execute the code, which comes out to 101 cycles / 16 cycles per microsecond = 6.3125 microseconds. The measurement code needs some work. Excellent job on the assembly!

call			2
movw	r30,	%2	1
movw	r26,	%1	1
mov	r0,	%A0	1
movw	r18,	%A0	1
movw	r20,	%C0	1
ori	r18,	0x01	1
lsr	r25		1
ror	r24		1
ror	r23		1
ror	r22		1
lsr	r25		1
ror	r24		1
ror	r23		1
ror	r22		1
sub	r18,	r22	1
sbc	r19,	r23	1
sbc	r20,	r24	1
sbc	r21,	r25	1
movw	r22,	r18	1
movw	r24,	r20	1
lsr	r25		1
ror	r24		1
ror	r23		1
ror	r22		1
lsr	r25		1
ror	r24		1
ror	r23		1
ror	r22		1
lsr	r25		1
ror	r24		1
ror	r23		1
ror	r22		1
lsr	r25		1
ror	r24		1
ror	r23		1
ror	r22		1
add	r22,	r18	1
adc	r23,	r19	1
adc	r24,	r20	1
adc	r25,	r21	1
movw	r18,	r22	1
movw	r20,	r24	1
add	r18,	r23	1
adc	r19,	r24	1
adc	r20,	r25	1
adc	r21,	r1	1
mov	r18,	r19	1
mov	r19,	r20	1
mov	r20,	r21	1
eor	r21,	r21	1
add	r18,	r22	1
adc	r19,	r23	1
adc	r20,	r24	1
adc	r21,	r25	1
mov	r18,	r19	1
mov	r19,	r20	1
mov	r20,	r21	1
eor	r21,	r21	1
add	r18,	r22	1
adc	r19,	r23	1
adc	r20,	r24	1
adc	r21,	r25	1
mov	r18,	r19	1
mov	r19,	r20	1
mov	r20,	r21	1
eor	r21,	r21	1
add	r18,	r22	1
adc	r19,	r23	1
adc	r20,	r24	1
adc	r21,	r25	1
andi	r18,	0xF8	1
sub	r0,	r18	1
lsr	r21		1
ror	r20		1
ror	r19		1
ror	r18		1
lsr	r21		1
ror	r20		1
ror	r19		1
ror	r18		1
sub	r0,	r18	1
st	X,	r0	2
lsr	r21		1
ror	r20		1
ror	r19		1
ror	r18		1
st	Z,	r18	2
std	Z+1,	r19	2
std	Z+2,	r20	2
std	Z+3,	r21	2
ret			4

I wonder why Tom Carpenter didn't take the chance to implement my suggested algorithmic improvement given in reply #28?

So, I coded it myself into the given assembler template and indeed found the expected performance gain. (I don't want to bother you with this solution because I found an even better one, see below.)

Meanwhile I found that:

    q= (q>>8) + x;
    q= (q>>16) + (x>>8) + x;
    q= (q>>8) + x;

is a valid substitute for this:

    q= (q>>8) + x;
    q= (q>>8) + x;
    q= (q>>8) + x;
    q= (q>>8) + x;

as well.

This last substitute offers even more optimization choices at assembler level, and I (temporarily) end up with this:

void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));
void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)
  //and that in arrives in r22:r25 quad
   asm volatile(
    "movw r30, %2   \n\t"  //uint32_t* divPtr = ÷
    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;

    "mov   r0, %A0  \n\t"  //byte temp = in
    "movw r18, %A0  \n\t"  //uint32_t q = in;
    "movw r20, %C0  \n\t"
    "ori  r18, 0x01 \n\t"  //q |= 1;

    "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"

    "sub  r18, r22  \n\t" //q = q - x;
    "sbc  r19, r23  \n\t"
    "sbc  r20, r24  \n\t"
    "sbc  r21, r25  \n\t"

    "movw r22, r18  \n\t" //x = q;
    "movw r24, r20  \n\t"
    "lsr  r25       \n\t" //x = x >> 4;
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"

    "add  r22, r18  \n\t" //x = x + q
    "adc  r23, r19  \n\t"
    "adc  r24, r20  \n\t"
    "adc  r25, r21  \n\t"

    //q= x + (x>>8);
    "movw r18, r22  \n\t" //q = x
    "movw r20, r24  \n\t"
    "add  r18, r23  \n\t" //q = q + (x >> 8)
    "adc  r19, r24  \n\t"
    "adc  r20, r25  \n\t"
    "adc  r21, r1   \n\t"

            //q= (q>>16) + (x>>8) + x;
            "movw r18, r20  \n\t" //q= (q>>16);
            "clr  r20   \n\t"
            "clr  r21   \n\t"
            "add  r18, r23  \n\t" //q += (x >> 8)
            "adc  r19, r24  \n\t"
            "adc  r20, r25  \n\t"
            "adc  r21, r21  \n\t" // we need the carry only
            "add  r18, r22  \n\t" //q += x
            "adc  r19, r23  \n\t"
            "adc  r20, r24  \n\t"
            "adc  r21, r25  \n\t"


    //q= (q>>8) + x;
    "mov  r18, r19  \n\t" //q = q >> 8
    "mov  r19, r20  \n\t"
    "mov  r20, r21  \n\t"
    "clr  r21       \n\t"
    "add  r18, r22  \n\t" //q = q + x
    "adc  r19, r23  \n\t"
    "adc  r20, r24  \n\t"
    "adc  r21, r25  \n\t"

    "andi r18, 0xF8 \n\t" //q = q & ~0x7

    "sub   r0, r18  \n\t" //in = in - q

    "lsr  r21       \n\t" //q = q >> 2
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    "lsr  r21       \n\t"
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"

    "sub  r0, r18   \n\t" //in = in - q
    "st    X, r0    \n\t" //mod = in;

    "lsr  r21       \n\t" //q = q >> 1
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"

    "st       Z, r18  \n\t" //div = q
    "std  Z+1, r19  \n\t"
    "std  Z+2, r20  \n\t"
    "std  Z+3, r21  \n\t"

    :
    : "r" (in), "r" (&mod), "r" (&div)
    : "r0", "r26", "r27", "r31", "r31"
  );
}

With another slice of ~4% off.

genom2: I wonder why Tom Carpenter didn't take the chance to implement my suggested algorithmic improvement given in reply #28?

See reply #29 for why I didn't. I thought that what you suggested in reply #28 was not a valid substitute. But having tested further, the discrepancy I found seems to self-correct.

Your second seems a better optimisation. It saves 5 clock cycles.

Tidied up a bit:

void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) __attribute__((noinline));
void divmod10(uint32_t in, uint32_t &div, uint8_t &mod)
{
  //assumes that div/mod pointers arrive in r18:r19 and r20:r21 pairs (doesn't matter which way around)
  //and that in arrives in r22:r25 quad
   asm volatile(
    "movw r30, %2   \n\t"  //uint32_t* divPtr = ÷
    "movw r26, %1    \n\t" //uint32_t* modPtr = &mod;
    
    "mov   r0, %A0  \n\t"  //byte temp = in
    "movw r18, %A0  \n\t"  //uint32_t q = in;
    "movw r20, %C0  \n\t"  
    "ori  r18, 0x01 \n\t"  //q |= 1;
    
    "lsr  r25       \n\t"  //x = in >> 2 //note: x reuses registers of 'in', as 'in' was backed up in r0
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"  
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    
    "sub  r18, r22  \n\t" //q = q - x;
    "sbc  r19, r23  \n\t"
    "sbc  r20, r24  \n\t"
    "sbc  r21, r25  \n\t"
    
    "movw r22, r18  \n\t" //x = q;
    "movw r24, r20  \n\t"
    "lsr  r25       \n\t" //x = x >> 4;
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    "lsr  r25       \n\t"
    "ror  r24       \n\t"
    "ror  r23       \n\t"
    "ror  r22       \n\t"
    
    "add  r22, r18  \n\t" //x = x + q
    "adc  r23, r19  \n\t"
    "adc  r24, r20  \n\t"
    "adc  r25, r21  \n\t"
    
    "movw r18, r22  \n\t" //q = x
    "movw r20, r24  \n\t"
    "add  r18, r23  \n\t" //q = q + (x >> 8)
    "adc  r19, r24  \n\t"
    "adc  r20, r25  \n\t"
    "adc  r21, r1   \n\t"

    "movw r18, r20  \n\t" //q = q >> 16 
    "eor  r20, r20  \n\t"
    "eor  r21, r21  \n\t"
    "add  r18, r23  \n\t" //q = q + (x>>8)
    "adc  r19, r24  \n\t"
    "adc  r20, r25  \n\t"
    "adc  r21, r1   \n\t" //NOTE: r1 is a known 0.
    "add  r18, r22  \n\t" //q = q + x
    "adc  r19, r23  \n\t"
    "adc  r20, r24  \n\t"
    "adc  r21, r25  \n\t"
     
    "mov  r18, r19  \n\t" //q = q >> 8
    "mov  r19, r20  \n\t"
    "mov  r20, r21  \n\t"
    "eor  r21, r21  \n\t"
    "add  r18, r22  \n\t" //q = q + x
    "adc  r19, r23  \n\t"
    "adc  r20, r24  \n\t"
    "adc  r21, r25  \n\t"
    
    "andi r18, 0xF8 \n\t" //q = q & ~0x7
    
    "sub   r0, r18  \n\t" //in = in - q
    
    "lsr  r21       \n\t" //q = q >> 2 
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    "lsr  r21       \n\t"
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    
    "sub  r0, r18   \n\t" //in = in - q
    "st    X, r0    \n\t" //mod = in;
    
    "lsr  r21       \n\t" //q = q >> 1
    "ror  r20       \n\t"
    "ror  r19       \n\t"
    "ror  r18       \n\t"
    
    "st       Z, r18  \n\t" //div = q
    "std  Z+1, r19  \n\t"
    "std  Z+2, r20  \n\t"
    "std  Z+3, r21  \n\t"
    
    : 
    : "r" (in), "r" (&mod), "r" (&div)
    : "r0", "r26", "r27", "r31", "r31"
  );
}

In one word: respect!

It is almost becoming an assembler course! I can read it, but I have almost no experience writing AVR asm.

@Tom & genom2: A question arises: is porting the fastest C to asm automatically the fastest asm version? I mean, the previous C versions that had fewer statements, can they be more efficient when hand-coded to asm?

robtillaart:
@Tom & genom2
A question arises: is porting the fastest C to asm automatically the fastest asm version?
I mean, the previous C versions that had fewer statements, can they be more efficient when hand-coded to asm?

Most things can be optimised by manually writing the assembly, at least in part.

A lot of the asm that gcc generated for the C versions of divmod10 was fine and very well optimised; in fact, much of it I’ve just reused. The trouble is that some optimisations get missed, partly because seeing them requires a bit of creativity.
The optimiser is very good; there are times I look at what it comes up with and think, “wow… that was clever!”. But there are other times I think, “why don’t you just do this, it would be far quicker”.

Really it is about trying to write the C in just the right order and manner that the compiler can optimise well. That is something which takes practice, patience and observing how different code ordering affects the compiler output.

Most of the time now I am using a lot of inline assembler in time critical areas, such as this interrupt routine:

ISR(TIMER3_CAPT_vect) {
  byte timeSegment = distributionSegment(DC);
  byte index = (distributionWidth[DC] << 1) & timeSegment;
  interruptOVFCount(DC) = *(int*)((byte*)timerOVF[DC] + index); //move to next pwm period
  distributionSegment(DC) = timeSegment + 1;
  
  unsigned int count = interruptCount(DC)-1; //OCR1B is used as it is an unused register which affords us quick access.
  if (count == 0) {
    register byte port asm("r18") = stepPort(DC);
    register unsigned int startVal asm("r24") = currentMotorSpeed(DC);
    //byte port = STEP0PORT;
    //unsigned int startVal = currentMotorSpeed(DC);
    interruptCount(DC) = startVal; //start val is used for specifying how many interrupts between steps. This tends to IVal after acceleration
    if (port & stepHigh(DC)){
      stepPort(DC) = port & stepLow(DC);
      unsigned long jVal = synta.cmd.jVal[DC];
      jVal = jVal + synta.cmd.stepDir[DC];
      synta.cmd.jVal[DC] = jVal;
      if(synta.cmd.gotoEn[DC]){
        if (gotoPosn[DC] == jVal){ //reached the deceleration marker. Note that this line loads gotoPosn[] into r24:r27
          //will decelerate the rest of the way. So first move gotoPosn[] to end point.
          if (decelerationSteps(DC) & _BV(7)) {
            /*
            unsigned long gotoPos = gotoPosn[DC];
            if (synta.cmd.stepDir[DC] < 0){
              gotoPosn[DC] = gotoPos + decelerationSteps(DC);
            } else {
              gotoPosn[DC] = gotoPos - decelerationSteps(DC);
            }
            */
            //--- This is a heavily optimised version of the code commented out just above ------------
            //During compare of jVal and gotoPosn[], gotoPosn[] was loaded into r24 to r27
            register char stepDir asm("r19") = synta.cmd.stepDir[DC];
            asm volatile(
              "in   %A0, %2  \n\t"
              "cp   %1, __zero_reg__  \n\t"
              "brlt .+4      \n\t"
              "neg  %A0      \n\t"
              "eor  %B0, %B0   \n\t"
              "mov  %C0, %B0   \n\t"
              "mov  %D0, %B0   \n\t"
              "add  %A0, r24  \n\t" //add previously loaded gotoPosn[] registers to new gotoPosn[] registers.
              "adc  %B0, r25  \n\t"
              "adc  %C0, r26  \n\t"
              "adc  %D0, r27  \n\t"
              : "=a" (gotoPosn[DC]) //goto selects r18:r21. This adds sts commands for all 4 bytes
              : "r" (stepDir),"I" (_SFR_IO_ADDR(decelerationSteps(DC)))       //stepDir is in r19 to match second byte of goto.
              :
            );
            //-----------------------------------------------------------------------------------------
            decelerationSteps(DC) = 0; //say we are stopping
            synta.cmd.IVal[DC] = stopSpeed[DC]; //decelerate to min speed. This is possible in at most 80 steps.
          } else {
            goto stopMotorISR3;
          }
        }
      } 
      if (currentMotorSpeed(DC) > stopSpeed[DC]) {
stopMotorISR3:
        synta.cmd.gotoEn[DC] = 0; //Not in goto mode.
        synta.cmd.stopped[DC] = 1; //mark as stopped
        timerDisable(DC);
      }
    } else {
      stepPort(DC) = port | stepHigh(DC);
      register unsigned int iVal asm("r20") = synta.cmd.IVal[DC];
      //unsigned int iVal = synta.cmd.IVal[RA];
      register byte increment asm("r0") = synta.cmd.stepIncrement[DC];
      asm volatile(
        "movw r18, %0   \n\t"
        "sub  %A0, %2    \n\t"
        "sbc  %B0, __zero_reg__    \n\t"
        "cp   %A0, %A1   \n\t"
        "cpc  %B0, %B1   \n\t"
        "brge 1f         \n\t" //branch if iVal <= (startVal-increment)
        "movw  %0, r18   \n\t"
        "add  %A0, %2    \n\t"
        "adc  %B0, __zero_reg__    \n\t"
        "cp   %A1, %A0   \n\t"
        "cpc  %B1, %B0   \n\t"
        "brge 1f         \n\t" //branch if (startVal+increment) <= iVal
        "movw  %0, %1   \n\t"  //copy iVal to startVal
        "1:             \n\t"
        : "=&w" (startVal) //startVal is in r24:25
        : "a" (iVal), "t" (increment) //iVal is in r20:21
        : 
      );
      currentMotorSpeed(DC) = startVal; //store startVal
      /*
      if (startVal - increment >= iVal) { // if (iVal <= startVal-increment)
        startVal = startVal - increment;
      } else if (startVal + increment <= iVal){
        startVal = startVal + increment;
      } else {
        startVal = iVal;
      }
      currentMotorSpeed(DC) = startVal;
      */
    }
  } else {
    interruptCount(DC) = count;
  }
}

There is much that needs to be done in as short a time as possible. There are places where, looking at the assembler, I can see a better way. Sometimes that is just arranging if statements in a different way, other times changing the use of variables.
Notice for example this line:

  interruptOVFCount(DC) = *(int*)((byte*)timerOVF[DC] + index); //move to next pwm period

I could have written:

  interruptOVFCount(DC) = timerOVF[DC][index]; //move to next pwm period

Which makes it a lot more readable, but results in 2 extra clock cycles.
Over the whole interrupt, two days of optimisation brought it down from 500+ clock cycles to a little over 170.

I’m not able to make a profound statement, because I have only a little experience with assembler. So far I don’t like assembler because it’s always specific to a CPU. (What about divmod10(…) for ARM-based Arduinos?) On the other hand, I like programming in JAVA (which is, in theory, interpreted). All my algorithmic improvements made for this thread were originally developed in JAVA on a PC (using ‘>>>’ for unsigned right shift) and manually “compiled” to C to make the proof on my Duemilanove.

In the past I used C to implement my ideas, just because there was no JAVA for microcontrollers like Arduinos. But there is light at the end of the tunnel: I’m the author of HaikuVM, a full-featured open-source JAVA VM for AVRs (it works even on an ATmega8). Hoping HaikuVM becomes more popular in the future.

[off topic]

Hoping HaikuVM becomes more popular in the future.

You can always discuss it in the exhibition gallery when it runs on an Arduino !

Wow, pretty amazing work!

This may be a dumb or overly obvious question, but would it be OK to include this in Arduino's Print.cpp file, which is LGPL-licensed?

Currently that file only has attribution to David Mellis. If I make a copy including this amazing code, who should I add to the attribution and how should your names appear?

This may be a dumb or overly obvious question, but would it be OK to include this in Arduino's Print.cpp file, which is LGPL-licensed?

Questions are never dumb ;)

Print.cpp is one I had in mind too; printing numbers could be (I estimate) 10-30% faster. IMHO, divmod10() should be usable from user code too, so encapsulate it in a lib: fastmath.h or Arduino.h (?).

I do not know the legal details of the LGPL (can you explain it simply?), but my version of the code can be used.

Still to do: test on several platforms: 168, 268, tiny, teensy, due, etc.

If I make a copy including this amazing code, who should I add to the attribution and how should your names appear?

IMHO: a reference to this thread and, optionally, the forum names.