divmod10() : a fast replacement for /10 and %10 (unsigned)

I restarted my exhaustive test, using the asm macro version. In a few days I should be able to confirm if it matches the C library version for all 2^32 possible inputs....

Unless there's more algorithm magic, I believe it's safe to say this 83 cycle version is the fastest possible implementation.

That's why I asked several posts back if one of the earlier C version - shorter in terms of C - could be faster is asm as it might be less statements.

In a few days I should be able to confirm if it matches the C library version for all 2^32 possible inputs....

how do you do this test? Generate strings and compare them?

robtillaart:
how do you do this test? Generate strings and compare them?

Compute the quotient and modulus using both and check if either is different. Basically, like this:

                q1 = n / 10;
                m1 = n - q1 * 10;
                q2 = n;
                divmod10_asm(q2, tmp, m2);
                if (q1 != q2 || m1 != m2) {
                        errorcount++;
                }

Here is the complete code. It should run on any AVR-based Arduino compatible board.

divmod10_asm_test.zip (1.68 KB)

Opps, forgot this. To run it on non-Teensy boards, you'll also have to add this .h file. Or you could comment out the stuff that prints dots and blinks the LED, but that would not be any fun.

elapsedMillis.h (4.2 KB)

would it not be faster to

for(unsigned long(n = 0; n < 0xFFFFFFFF; n++)
{
  divmod10_asm(n, d, m); 
  if (n != (10*d + m)) Serial.println(i);
}

as division is quite expensive as we know :wink:

That would be much faster, aside from a couple minor code problems. But does only comparing the output to itself really form a valid test?

The slow way will complete in a few days, which really isn't very long in the grand scheme of things.

But does only comparing the output to itself really form a valid test?

no there is a small chance in theory that there exist a (d, m) pair that would also solve the n = 10*d + m equation (e.g. n = 100, d = 0, m = 100)
but given the fact that the code is a white box and we know how the code works this chance is minimal.

To prevent this theoretical issue one could add a test for m to be smaller than 10.

for(unsigned long(n = 0; n < 0xFFFFFFFF; n++)
{
  divmod10_asm(n, d, m); 
  if (n != (10*d + m) || m >= 10 ) Serial.println(n);
}

Even still, using a formula to prove itself is circular reasoning, which is not a proof at all. (Circular reasoning - Wikipedia)

The verification only needs to be run once, so it's probably not worth optimizing. It's already been running over 12 hours and will be done in a couple days.

I believe a much better optimization effort would involve looking into why Arduino's Print takes 15 to actually do something with each digit that's generated in only 5 us.

Is there any reason for avoiding using mul in the assembly code? I know not all atmels have mul (I think some attinys don't) but it's there on the Uno and Mega.

Here's a version using mul and fixed point arithmetic:

void setup() {
  Serial.begin(115200);
}

void loop() {
  
  uint32_t i=1;
  uint32_t q=0;
  uint8_t r=1;
  while(1){
    uint32_t j=i;
    uint8_t k; 
    uint8_t t;
    asm volatile(
    
    // this code fragment works out i/10 and i%10 by calculating i*(51/256)*(256/255)/2 == i*51/510 == i/10
    
    // by "j.k" I mean 32.8 fixed point, j is integer part, k is fractional part
    // j.k = ((j+1.0)*51.0)/256.0
    // (we add 1 because we will be using the floor of the result later)
    
      " ldi %2,51     \n\t"
      " mul %A0,%2    \n\t"
      " clr %A0       \n\t"
      " add r0,%2     \n\t"      
      " adc %A0,r1    \n\t"
      " mov %1,r0     \n\t"
      " mul %B0,%2    \n\t"
      " clr %B0       \n\t"
      " add %A0,r0    \n\t"
      " adc %B0,r1    \n\t"
      " mul %C0,%2    \n\t"
      " clr %C0       \n\t"
      " add %B0,r0    \n\t"
      " adc %C0,r1    \n\t"
      " mul %D0,%2    \n\t"
      " clr %D0       \n\t"
      " add %C0,r0    \n\t"
      " adc %D0,r1    \n\t"
      " clr r1        \n\t"  
     
     
    // j.k = j.k*256.0/255.0 
    // (actually the result will always be slightly low, which we need because we use the floor of the result)
      
      " add %1,%A0    \n\t"
      " adc %A0,%B0   \n\t"
      " adc %B0,%C0   \n\t"
      " adc %C0,%D0   \n\t"
      " adc %D0,r1    \n\t"
      " add %1,%B0    \n\t"
      " adc %A0,%C0   \n\t"
      " adc %B0,%D0   \n\t"
      " adc %C0,r1    \n\t"
      " adc %D0,r1    \n\t"
      " add %1,%D0    \n\t"
      " adc %A0,r1    \n\t"
      " adc %B0,r1    \n\t"
      " adc %C0,r1    \n\t"
      " adc %D0,r1    \n\t"      
      
    
    // j.k = j.k / 2.0    
    
      " lsr %D0       \n\t"
      " ror %C0       \n\t"
      " ror %B0       \n\t"
      " ror %A0       \n\t"
      " ror %1        \n\t"   
   
   
   // now i*51.0/510.0 < j.k < (i+1.0)*51.0/510.0
   // therefore j == i/10 (since j == floor(j.k))
   // and (k*10)>>8 == i % 10
      
      " ldi %2,10     \n\t"
      " mul %1,%2     \n\t"
      " mov %1,r1     \n\t"
      
      " clr r1        \n\t"


      
      :"+r"(j),"=d"(k),"=d"(t)
      : 
      :  );
      
      if((j!=q) || (k!=r)){
        Serial.print(i);Serial.print(" ");
        Serial.print(j);Serial.print(" ");
        Serial.print(k);Serial.println(" ");
      }
      if((i & 16777215)==0){Serial.print(i);Serial.print(" ");Serial.println(millis());
        if(i==0)return;}
      i++;
      r++;if(r==10){r=0;q++;}
      
  }
}

I've tested it on various values but the full test will take 6 hours to run.

My way of testing for correctness is to keep track of the expected quotient and remainder, incrementing the remainder each time the divisor is incremented and when the remainder gets to 10, increase the quotient:

  uint32_t i=1;
  uint32_t q=0;
  uint8_t r=1;
  while(1){

 [[ work out div/mod 10 of i here and check against q and r]]

      i++;
      r++;if(r==10){r=0;q++;}
      
  }

This is fast and simple, and I hope it's agreed that it is obviously a correct way of checking.

@stimmer
Can you wrap your code in a similar function as divmod10() and run a performance test for comparison?

please use Toms code as starter - divmod10() : a fast replacement for /10 and %10 (unsigned) - #44 by TCWORLD - Libraries - Arduino Forum -

As a normal function it would be dominated by overhead and marshalling. As an inline function or macro it should take 3us. The calculation is 43 instructions long, 5 of which are mul, so it takes 48 cycles in total. In a quick real-world test the result I got was 3.1245us as an inline function.

The full 32-bit range test has now completed without error :slight_smile:

Lol. This is starting to get insane.

Credit where credit is due, thats an amazing optimisation stimmer :D.

In a quick real-world test the result I got was 3.1245us as an inline function.

speechless ...

@Paul, there is a serious new candidate for Print.cpp :wink:

Recall we came from

   i/10      38.6920 us
   i%10      38.6120 us

IN a follow up on my own message

robtillaart:
A quick arbitrary float test (same prog as above, slightly different printnumber)

  Serial.println(10737.41824, 4);

Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);




original print.cpp:

testing
10737.4179
1.0182
107.37
Time=2144
done

divmod10 version:

testing
10737.4179
1.0182
107.37
Time=1472
done

so for floats there is also serious gain ~30% gain 

(note this gain is only due to the integral part, the remainder part is a loop using a float mul per digit)


// Extract digits from the remainder one at a time
  while (digits-- > 0)
  {
    remainder *= 10.0;
    int toPrint = int(remainder);
    n += print(toPrint);
    remainder -= toPrint;
  }

Created a faster snippet to print the decimal part of a float (replaces the snippet above)
// not tested yet

  //  print decimal part
  uint32_t t = 1;
  while (digits-- > 0) t = ((t<<2) + t)<<1;       // t *= 10;
  uint32_t rem1 = remainder * t;       // only float mul left
  uint32_t rem2 = ((rem1<<2) + rem1)<<1;      // rem2 = rem1*10;
  // loop for the leading zero's
  while (t > rem2)
  {
    n += print('0');
    rem2 = ((rem2 <<2) + rem2)<<1;      // rem2 *=10;
  }
  n += print(rem1);      // uses the divmod10() for the remaining digits

results

testing
10737.4179
1.0182
107.37
Time=1136 << 22% improved wrt 1472
done

Wow Stimmer, that's really awesome!

Here's some test results:

Stimmer's optimization: 3.9 us/digit
Hacker's Delight & Rob & Tom: 6.4 us/digit
Original C library divide: 43.1 us/digit

These measurements were made on a Teensy 2.0. These include the looping and array write overhead inside Print, with the burden of a 32 bit variable in the code holding the start time during the test. Overhead to actually store the data to the output device buffer is not measured, so you should see very similar results between official Arduino and Teensy.

Here's the sketch I used for testing:

extern uint32_t usec_print;

void setup() {
  Serial.begin(115200);
  Serial.println("Benchmark print integer speed:");
  Serial.println(1073741824ul);
  Serial.println(1018195846ul);
  Serial.println(1820728843ul);
  Serial.println(2904720785ul);
  Serial.println(3678923723ul);
  Serial.println(4152282824ul);
  Serial.println(1875627853ul);
  Serial.println(2532822983ul);
  Serial.println(3519086527ul);
  Serial.println(4023424234ul);
  Serial.print("Speed = ");
  Serial.print((float)usec_print / 100.0);
  Serial.println(" us/digit");
}

void loop() {}

Here are the ready-to-use Print files. To select which algorithm is used, comment & uncomment the #defines at line 155.

Print.cpp (10.7 KB)

Print.h (6.77 KB)

stimmer:
(I think some attinys don't)

Correct. ATtiny processors do not have a MUL instruction.

Congratulations on the wicked fast code!

I added another measurement to the test program.

With Stimmer's optimization:

Speed = 3.84 us/digit, 6.16 us/byte

The us/digit measures only the speed Print converts the 32 bit integer to a string.

The us/byte measures the entire time to put all the data into the transmit buffer.

extern uint32_t usec_print;

void setup() {
  uint32_t begin_us, end_us;
  Serial.begin(115200);
  Serial.println("Benchmark print integer speed:");
  begin_us = micros();
  Serial.print(1073741824ul);
  Serial.print(1018195846ul);
  Serial.print(1820728843ul);
  Serial.print(2904720785ul);
  Serial.print(3678923723ul);
  Serial.print(4152282824ul);
  Serial.print(1875627853ul);
  Serial.print(2532822983ul);
  Serial.println(3519086527ul);
  Serial.println(4023424234ul);
  end_us = micros();
  Serial.print("Speed = ");
  Serial.print((float)usec_print / 100.0);
  Serial.print(" us/digit, ");
  Serial.print((float)(end_us - begin_us) / 100.0);
  Serial.println(" us/byte");
}

void loop() {}

Print.cpp and Print.h are the same as #77.

I tried to bring use them on Arduino Uno, but it seems my Print and header file arrangement has diverged too far from Arduino's over the years.

Many users need faster decimal print for logging data to SD cards. The big problem with Print is this line

  return write(str);

A class derived from Print may have a fast write for strings but it is not uses since Print defines these members

size_t write(const char *str) {
      if (str == NULL) return 0;
      return write((const uint8_t *)str, strlen(str));
    }

size_t Print::write(const uint8_t *buffer, size_t size)
{
  size_t n = 0;
  while (size--) {
    n += write(*buffer++);
  }
  return n;
}

So printNumber uses the above write(const char *str) and the string is written character at a time.

I did a test with the new version of print. It is faster but not as much as one would expect.

I timed this loop

  for (int i = 0; i < 20000; i++) {
    file.println(i);
  }

The result for the old print is:

Time 9.38 sec
File size 128.89 KB
Write 13.74 KB/sec

For the new print with Stimmer's optimization.

Time 6.00 sec
File size 128.89 KB
Write 21.50 KB/sec

For a simple printField() function with no optimized divide that uses SdFat's write(const char*)

Time 2.66 sec
File size 128.89 KB
Write 48.44 KB/sec

So the improvement in Print will not be fully realized as long as printNumber uses Print::write(const char*).

I believe there is a way to fix this in a class derived from Print since size_t Print::write(const char *str) is not pure virtual. Correct me if I am wrong.

fat16lib:
So printNumber uses the above write(const char *str) and the string is written character at a time.

If the SD library implements write(buf, len), then the it's supposed to be called instead of the 1-byte-at-a-time fallback in Print.cpp.