Go Down

Topic: divmod10() : a fast replacement for /10 and %10 (unsigned) (Read 72468 times) previous topic - next topic

pjrc

To reassure myself about this algorithm, I started an exhaustive test.  At 16 MHz, it should check all 2^32 inputs to divmod10 against the C library divide and modulus in 4 days and 3 hours.

To integrate this into the Print class, an even more efficient way might involve an inline assembly macro rather than a function call.  Not only does that eliminate the call and return instructions, but it also allows the compiler to directly replace the registers in the asm with whatever it allocated to the variables.  It avoids the overhead of the called function allocating lots of the registers, which the compiler would save if they're in use, and even if not were to be saved, the compiler's register allocator is forced to rearrange register usage in the surrounding code a function is called.

In Print, the original 32 bit input is replaced with the quotient, which seems to fit pretty well to this code.

Tom Carpenter

#46
Jun 24, 2013, 01:56 pm Last Edit: Jun 24, 2013, 02:02 pm by Tom Carpenter Reason: 1
I know this might seem a daft question, but does the print class actually use a straight /10 ? I thought it had to work for different bases (BIN, HEX, DEC, OCT) etc.
I suppose you could have a different printNumber() funtion for base 10.

Once you have done it with calling the divmod10() function directly, I'll look into inlining the function.
~Tom~

robtillaart

#47
Jun 24, 2013, 07:02 pm Last Edit: Jun 24, 2013, 07:04 pm by robtillaart Reason: 1
Quote
I know this might seem a daft question, but does the print class actually use a straight /10 ?


no it uses base, but I estimate 95% is done the DECimal way, then another 4% as HEX, 1% BIN, other bases < 1%

The printNumber could be implemented this way (optimized speed on cost of size)

Code: [Select]

size_t Print::printNumber(unsigned long n, uint8_t base) {
 char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.
 char *str = &buf[sizeof(buf) - 1];

 *str = '\0';

 // prevent crash if called with base == 1
 if (base < 2) base = 10;

 do {
   unsigned long m = n;
   n /= base;
   char c = m - base * n;
   *--str = c < 10 ? c + '0' : c + 'A' - 10;
 } while(n);

 return write(str);
}


would become something like this, with optimized versions for DEC, HEX, BIN and OCT. (imho OCT is used to few to optimize, but OK)

Code: [Select]

size_t Print::printNumber(unsigned long n, uint8_t base) {
 char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.
 char *str = &buf[sizeof(buf) - 1];

 *str = '\0';

 char c;
 uint32_t d;

 // prevent crash if called with base == 1
 if (base < 2) base = 10;

 switch(base)
 {
   case HEX:
     do {
       c = n & 0x0F;
       n >>= 4;
       *--str = c < 10 ? c + '0' : c + 'A' - 10;
     } while(n);
     break;

   case DEC:
     do {
       divmod10(n, &d, &c)
       n = d;
       *--str = c + '0';
     } while(n);
     break;

   case OCT:
     // *--str = '0';   //leading 0 to indicate octal ?
     do {
       c = n & 0x07;
       n >>= 3;
       *--str = c + '0';
     } while(n);
     break;

   case BIN:
     // *--str = 'b';  // leading b to indicate binary ?
     do {
       c = n & 0x01;
       n >>= 1;
       *--str = c + '0';
     } while(n);
     break;

   default:
     do {
       d = n;
       n /= base;
       c = d - base * n;
       *--str = c < 10 ? c + '0' : c + 'A' - 10;
     } while(n);
   break;
 }
 return write(str);
}

(code not tested)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

pjrc

I was planning to make different printNumber functions, like printNumberDec, printNumberHex, printNumberGeneric.  Then printNumber would be an inline function defined in Print.h.

Nearly all Print usage is base 10, but even when it's others, the 2nd parameter is virtually always a compile time constant.  Using an inline function will allow the compiler to optimize away the test and directly insert the call to a specific version.  If the others aren't used in the entire program, the linker will not put their code into the final executable.

pjrc

I put divmod10 into Print.cpp.  I did some quick testing and it seems to work quite well.

This is Teensy's implementation of Print, which differs slightly from Arduino's.  It has several other optimizations....


pjrc

Ideally, divmod10 could be an inline asm macro that modifies "n" in place and returns the remainder in a register.  That would eliminate the function call+return, the pointer overhead, and the compiler having to reallocate registers during the loop because a function is called.

It'll probably require a 32 bit temporary, if I understand the existing asm code?


Tom Carpenter

#51
Jun 24, 2013, 10:14 pm Last Edit: Jun 24, 2013, 10:22 pm by Tom Carpenter Reason: 1
Hi Paul,

I have done the same mod to the standard Arduino Print class, which is attached.

There are a couple of improvements I have made:
(1)
 The no base print calls can just call printNumberDec() directly rather than going through printNumber().
Code: [Select]
   size_t print(unsigned char b) { return printNumberDec(b, 0); }
   size_t print(int n) { return print((long)n); }
   size_t print(unsigned int n) { return printNumberDec(n, 0); }
   size_t print(long n)      {if (n < 0) {return printNumberDec(-n, 1);} else {return printNumberDec(n, 0);} }
   size_t print(unsigned long n) { return printNumberDec(n, 0); }


(2)
 As suggested, I have made an inline version of divmod10 for the printNumberDec() function. You should be able to merge this back into the Teensy version fairly readily.
~Tom~

robtillaart

@Paul Stoffregen

a quick look at print.cpp

in the generic printNumber you have this code

Code: [Select]
} else if (base == 1) {
base = 10;
}

could be
Code: [Select]
} else if (base == 1) {
printNumberDEC(n, sign);
}


count += printNumber(int_part, sign, 10);  
==>
count += printNumberDEC(int_part, sign);

Did you measure timing improvements?
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

pjrc

I converted it to inline asm too, slightly differently but very similar.

Code: [Select]

#define divmod10_asm(in32, tmp32, mod8)                         \
asm volatile (                                                  \
        "mov    %2, %A0         \n\t" /* mod = in */            \
        "ori    %A0, 1          \n\t" /* q = in | 1 */          \
        "movw   %A1, %A0        \n\t" /* x = q */               \
        "movw   %C1, %C0        \n\t"                           \
        "lsr    %D1             \n\t"  /* x = x >> 2 */         \
        "ror    %C1             \n\t"                           \
        "ror    %B1             \n\t"                           \
        "ror    %A1             \n\t"                           \
        "lsr    %D1             \n\t"                           \
        "ror    %C1             \n\t"                           \
        "ror    %B1             \n\t"                           \
        "ror    %A1             \n\t"                           \
        "sub    %A0, %A1        \n\t" /* q = q - x  */          \
        "sbc    %B0, %B1        \n\t"                           \
        "sbc    %C0, %C1        \n\t"                           \
        "sbc    %D0, %D1        \n\t"                           \
        "movw   %A1, %A0        \n\t" /* x = q  */              \
        "movw   %C1, %C0        \n\t"                           \
        "lsr    %D1             \n\t" /* x = x >> 4  */         \
        "ror    %C1             \n\t"                           \
        "ror    %B1             \n\t"                           \
        "ror    %A1             \n\t"                           \
        "lsr    %D1             \n\t"                           \
        "ror    %C1             \n\t"                           \
        "ror    %B1             \n\t"                           \
        "ror    %A1             \n\t"                           \
        "lsr    %D1             \n\t"                           \
        "ror    %C1             \n\t"                           \
        "ror    %B1             \n\t"                           \
        "ror    %A1             \n\t"                           \
        "lsr    %D1             \n\t"                           \
        "ror    %C1             \n\t"                           \
        "ror    %B1             \n\t"                           \
        "ror    %A1             \n\t"                           \
        "add    %A1, %A0        \n\t" /* x = x + q */           \
        "adc    %B1, %B0        \n\t"                           \
        "adc    %C1, %C0        \n\t"                           \
        "adc    %D1, %D0        \n\t"                           \
        "movw   %A0, %A1        \n\t" /* q = x */               \
        "movw   %C0, %C1        \n\t"                           \
        "add    %A0, %B1        \n\t" /* q = q + (x >> 8) */            \
        "adc    %B0, %C1        \n\t"                           \
        "adc    %C0, %D1        \n\t"                           \
        "adc    %D0, r1         \n\t"                           \
        "mov    %A0, %B0        \n\t" /* q = q >> 8 */          \
        "mov    %B0, %C0        \n\t"                           \
        "mov    %C0, %D0        \n\t"                           \
        "eor    %D0, %D0        \n\t"                           \
        "add    %A0, %A1        \n\t" /* q = q + x */           \
        "adc    %B0, %B1        \n\t"                           \
        "adc    %C0, %C1        \n\t"                           \
        "adc    %D0, %D1        \n\t"                           \
        "mov    %A0, %B0        \n\t" /* q = q >> 8 */          \
        "mov    %B0, %C0        \n\t"                           \
        "mov    %C0, %D0        \n\t"                           \
        "eor    %D0, %D0        \n\t"                           \
        "add    %A0, %A1        \n\t" /* q = q + x */           \
        "adc    %B0, %B1        \n\t"                           \
        "adc    %C0, %C1        \n\t"                           \
        "adc    %D0, %D1        \n\t"                           \
        "mov    %A0, %B0        \n\t" /* q = q >> 8 */          \
        "mov    %B0, %C0        \n\t"                           \
        "mov    %C0, %D0        \n\t"                           \
        "eor    %D0, %D0        \n\t"                           \
        "add    %A0, %A1        \n\t" /* q = q + x */           \
        "adc    %B0, %B1        \n\t"                           \
        "adc    %C0, %C1        \n\t"                           \
        "adc    %D0, %D1        \n\t"                           \
        "andi   %A0, 0xF8       \n\t" /* q = q & ~0x7 */        \
        "sub    %2, %A0         \n\t" /* mod = mod - q */       \
        "lsr    %D0             \n\t" /* q = q >> 2  */         \
        "ror    %C0             \n\t"                           \
        "ror    %B0             \n\t"                           \
        "ror    %A0             \n\t"                           \
        "lsr    %D0             \n\t"                           \
        "ror    %C0             \n\t"                           \
        "ror    %B0             \n\t"                           \
        "ror    %A0             \n\t"                           \
        "sub    %2, %A0         \n\t" /* mod = mod - q */       \
        "lsr    %D0             \n\t" /* q = q >> 1 */          \
        "ror    %C0             \n\t"                           \
        "ror    %B0             \n\t"                           \
        "ror    %A0             \n\t"                           \
        :  "+d" (in32), "=r" (tmp32), "=r" (mod8) : :           \
)

pjrc

#54
Jun 24, 2013, 11:28 pm Last Edit: Jun 25, 2013, 12:18 am by Paul Stoffregen Reason: 1

Did you measure timing improvements?


Yes, just now, printing several integers with micros() read inside the printNumberXXX functions.

There are a couple caveats, but basically:

With printNumberAny(): 1412 us   -- single call to C library divide
With printNumberDec(): 218 us     -- using divmod10_asm

Caveat #1: Adding the storage of micro() causes the compiler to allocate many more registers.  I didn't analyze the generated printNumberDec code in detail, but it is clear from a casual look at the assembly listing that just adding the test causes the whole function to be slower.

Caveat #2: The timer0 interrupt is at play, so the measurement include extra time and sometimes changes slightly from run to run.

Here's the (very arbitrary) test code:

Code: [Select]

extern uint32_t usec_print;

void setup()
{
 for (int i=0; i<100; i++) {
   Serial.println(12345678l);
   Serial.println(0);
   Serial.println(792942ul);
   Serial.println(12345);
   Serial.println(2586173071ul);
   Serial.println(234);
 }
 Serial.print("usec = ");
 Serial.println((float)usec_print / 100.0);
}

void loop()
{
}


Here are the Print files.  These have the test code inside and printNumberAny is enabled...

Tom Carpenter

As a simple comparison, take this code:
Code: [Select]
unsigned long start = 0;
unsigned long stop = 0;
volatile unsigned long q;

void setup()
{

  Serial.begin(115200);
  Serial.println("testing");
 
  byte backup = TIMSK0;
 
  TCCR1A = 0;
  TCCR1B = 4;
  TIMSK1 |= _BV(TOIE1);
 
  Serial.flush(); //wait for serial buffer to clear
  TIMSK0 = 0; //disable millis;
  TCNT1 = 0;
 
  Serial.println(1073741824ul);
  Serial.println(101819584ul);
  Serial.println(10737);
 
  stop = TCNT1; //how many clock cycles.
  TIMSK0 = backup; //renable millis;
  stop*=16;//There are 16us per clock cycle with a 1:256 prescaler.
  Serial.print("Time=");
  Serial.println(stop);
  Serial.println("done");
  TIMSK1 &= ~_BV(TOIE1);
}

void loop()
{
}

ISR(TIMER1_OVF_vect) {
  Serial.println("I Overflowed!"); //just to make sure we can tell if this happens.
}


The old print functions:
Quote
testing
1073741824
101819584
10737
Time=1408
done


The new functions:
Quote
testing
1073741824
101819584
10737
Time=480
done


A rather arbitrary test I know, but there is a clear difference, almost 1ms is knocked off the time it takes to fill the Serial buffer.
~Tom~

pjrc

I ran this on Teensy 2.0.  It takes 224... so a good portion of that 480 on normal Arduino is probably the slow hardware serial code which uses a vtable function call and circular buffer manipulation overhead for each individual byte.

robtillaart

for long numbers:

original = 1408/24 = ~59us / digit to print
"tom" = 480/24  = ~20 us/digit to print    (33%)
"paul" = 224/24 = ~ 9.5uS /digit to print  (16%)


A quick arbitrary float test  (same prog as above, slightly different printnumber)
Code: [Select]
  Serial.println(10737.41824, 4);
  Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);


original printnumber
testing
10737.4179
1.0182
107.37
Time=2144
done

divmod10 printnumber
testing
10737.4179
1.0182
107.37
Time=1472
done

so for floats there is also serious gain ~30% gain

(note this gain is only due to the integral part, the remainder part is a loop using a float mul per digit
Code: [Select]

  // Extract digits from the remainder one at a time
  while (digits-- > 0)
  {
    remainder *= 10.0;
    int toPrint = int(remainder);
    n += print(toPrint);
    remainder -= toPrint;
  }

this can be optimized to something like:
Code: [Select]

  // Extract digits from the remainder one at a time
  long t = 1;
  while (digits-- > 0) t *= 10;
  remainder *= t;
  n += print(long(remainder));


a quick test exposes the flaw in the proposed code but also indicates there is room for improvement.

testing
10737.4179
1.182  <<<<<< FAIL !!!
107.37
Time=1152  <<<  (another 25%)
done
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

pjrc

#58
Jun 25, 2013, 11:15 am Last Edit: Jun 25, 2013, 11:24 am by Paul Stoffregen Reason: 1

"tom" = 480/24  = ~20 us/digit to print    (33%)
"paul" = 224/24 = ~ 9.5uS /digit to print  (16%)


At 16 MHz, the divmod10_asm() macro is 5.2 us.

It's exactly 83 instructions, all of them single-cycle.  There's no call-return, no pointers, no load-store instructions.  It operates entirely on values in only 9 registers, which are chosen by the compiler's register allocator based on the needs of the surrounding code.  Unless there's more algorithm magic, I believe it's safe to say this 83 cycle version is the fastest possible implementation.

Obviously there's some room for improvement in actually moving the data....

drjiohnsmith

fast.

and all are much of an improvement on the original aruino,

'all' we need to do is document up and make available in a form every one can use easily.

thats that hard part,

now I wonder if a pre parser could notice the opportunity to use these in 'normal' arduino code and substitute !

OK OK OK

thats another task..

well done to all


Go Up