divmod10() : a fast replacement for /10 and %10 (unsigned)

There is another speed optimization to make when you are using divmod10 to convert a long into base-10 digits: Only the first few digits of a 10-digit number need full 32-bit precision. Once the high byte is 0 you only need 24 bit precision (saves at least 10 cycles), then when the two highest bytes are 0 you only need a 16 bit divmod10 (saves over a microsecond). Once the three highest bytes are 0 the 8-bit divmod10 only takes a microsecond, and for the very last digit you don't need to do a division at all.

	  while(n & 0xff000000){divmod10_asm32(n, digit, tmp8);*--p = digit + '0';}
	  while(n & 0xff0000){divmod10_asm24(n, digit, tmp8);*--p = digit + '0';}
	  while(n & 0xff00){divmod10_asm16(n, digit, tmp8);*--p = digit + '0';}
	  while((n & 0xff)>9){divmod10_asm8(n, digit, tmp8);*--p = digit + '0';}
	  *--p = n + '0';

The downside is an increase in code size of about 200 bytes. I've attached a modified Print.cpp including all the assembler macros, my profiling indicates it's 0.5us-1.5us faster per digit on average (which adds up to 7us faster for 5-10 digit numbers). I have not fully tested for accuracy yet.

Print.cpp (13.6 KB)

a well implemented trick - trading space for speed -

There are other tricks possible, not investigated though

Consider the fact that the largest uint32_t < 50000 * 10000 * 10000;
with a divmod10000() one could divide the uint32 in 4digit chunks which can be processed by divmod10_asm16

a divmod1000() might be broader usable

another trick would use a divmod100() to extract 2 digits at a time => fit in a byte. => divmod10_asm8() to split it

I have used the ideas here to improve the speed of logging data to SD cards. I wrote functions to print a numerical field to an SD card that are as much as five times faster than the standard Arduino print.

I am now able to log data from four analog pins on an Uno to an SD at 2000 Hz. This is 8000 ADC values per second. This is near the limit since each conversion takes about 110 microseconds.

I am using a small RTOS, Nil RTOS, written by Giovanni Di Sirio. I ported this RTOS to avr arduinos Google Code Archive - Long-term storage for Google Code Project Hosting..

I wrote a version of analogRead() that sleeps while the ADC conversion occurs. This saves about 90 microseconds of CPU time per conversion.

I have also modified SdFat so it no longer busy waits for flash programming in the SD to complete.

Here are test results for 16-bit unsigned fields and 32-bit float fields.

This test compares the standard print function
with printField. Each test prints 20,000 values.

Test of println(uint16_t)
Time 6.51 sec
File size 128.89 KB
Write 19.80 KB/sec

Test of printField(uint16_t, char)
Time 1.27 sec
File size 128.89 KB
Write 101.33 KB/sec

Test of println(double)
Time 17.17 sec
File size 160.00 KB
Write 9.32 KB/sec

Test of printField(double, char)
Time 3.56 sec
File size 160.00 KB
Write 44.94 KB/sec

Here are the test loops for the writes:

   switch(test) {
    case 0:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.println(i);
      }
      break;

    case 1:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.printField(i, '\n');
      }
      break;

    case 2:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.println((double)0.0001*i, 4);
      }
      break;

    case 3:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.printField((double)0.0001*i, '\n', 4);
      }
   }

I have attached the implementation of printField().

PrintField.cpp (8.85 KB)

Impressive numbers, really good !

think divmod10() should become part of the core seeing these and the above numbers.

robtillaart:
think divmod10() should become part of the core seeing these and the above numbers.

I'm planning to release it in the next version of Teensyduino.

Stimmer, is Arduino Print LGPL ok with you? Would you like your name to appear, other than "Stimmer" and link to this forum topic? You deserve credit.... just let me know what should be in the comments?

Credit is not enough. Open source licenses require a copyright claim by Stimmer.

Ok then. I'll hold off publishing anything until Stimmer replies.

I'm currently working on a Mac-related USB problem, so it's not like there's any hurry. But it would be nice to publish this code to a wider audience, if that's allowed?

@Paul
:wink: Shouldn't you be working on a space related issue with the Teensy uploader ]:smiley:
Best regards
Jantje

agree these should be standard libs of the core,

but the arduino team wont,

Yes that's OK. Consider what I wrote to be in the public domain so anyone can use it under any license. Just put something like "this section of code based on public domain code by Stimmer, see post at forum.arduino.cc/whatever-the-url-is ".

agree these should be standard libs of the core,
but the arduino team wont,

I'm definitely not sure of the latter.
The code is not core lib ready yet. For Arduino-team the portability of core libs is a very important issue. The divmod10() function should be decorated with #ifdefs to select the C (portable to DUE?) or the assembly (328 only?) implementation .

So the fun part is over, now the real work starts :wink:

So the fun part is over, now the real work starts

It ought to be pretty simple to surround this code with #ifdef AVR, or perhaps a check for the AVR architecture with multiply? In fact, the Print.cpp I posted has #ifdef checks to use either the original C code, Stimmer's optimization, or your version of the Hackers Delight algorithm. Yes, it requires some careful attention, but this part really isn't very difficult.

Whether the Arduino Team will accept this for the Print.cpp that ships for all Arduino boards is a good question. Historically, they've been pretty uninterested in speed optimizations.

I definitely do plan to publish this in the next version of Teensyduino, likely within the next 6 weeks. Of course, if Coding Badly is satisfied with the licensing? ... Thanks Stimmer! :smiley: You definitely did the heavy lifting with an amazing optimization!

Good enough for me.

update on printfloat version using divmod10() (follows - divmod10() : a fast replacement for /10 and %10 (unsigned) - #77 by robtillaart - Libraries - Arduino Forum - )

a new version to print the remainder (limited to 10 decimals) of a float

  • timing is slightly shorter
  • code is more straightforward than the previous one

insert in Print.cpp ==> size_t Print::printFloat()

  // Print the decimal point, but only if there are digits beyond
  if (digits > 0)
  {
    n += print("."); 

    uint32_t t = 1;
    for (uint8_t i=0; i<digits; i++) t = ((t<<2) + t)<<1;  // t *= 10
    uint32_t rem = remainder * t;
    char z[11]="0000000000";  // max 10 decimals
    z[digits]='\0';
  
    uint32_t d = 0;
    uint8_t m = 0;
    while (rem > 10)
    {
      divmod10(rem, d, m);
      z[--digits] = m + '0';
      rem = d;
    }
    z[--digits] = rem + '0';  // last digit
    n += print(z);
  }

testing
10737.4179
1.0182
107.37
Time=1104 << original Print.cpp 2144; almost 50% off (for this particular test)
done

testsketch

unsigned long start = 0;
unsigned long stop = 0;
volatile unsigned long q;

void setup()
{

  Serial.begin(115200);
  Serial.println("testing");
  
  byte backup = TIMSK0;
  
  TCCR1A = 0;
  TCCR1B = 4; 
  TIMSK1 |= _BV(TOIE1);
  
  Serial.flush(); //wait for serial buffer to clear
  TIMSK0 = 0; //disable millis;
  TCNT1 = 0;
  
  Serial.println(10737.41824, 4);
  Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);
  
  stop = TCNT1; //how many clock cycles.
  TIMSK0 = backup; //renable millis;
  stop*=16;//There are 16us per clock cycle with a 1:256 prescaler.
  Serial.print("Time=");
  Serial.println(stop);
  Serial.println("done");
  TIMSK1 &= ~_BV(TOIE1);
}

void loop()
{
}

ISR(TIMER1_OVF_vect) {
  Serial.println("I Overflowed!"); //just to make sure we can tell if this happens.
}

update
replace n += print("."); with n += print('.');

Time=1072 << original 2144 is 50% off

Does this make any difference...

n += write('.');

not measurable with the test script, but removing some "stacked calls" and use write() where possible may add up.

update on printfloat version using divmod10() . Code see- divmod10() : a fast replacement for /10 and %10 (unsigned) - #100 by robtillaart - Libraries - Arduino Forum -

Incorporated the Stimmer ASM divmod10_asm into the floating point test. - divmod10() : a fast replacement for /10 and %10 (unsigned) - #72 by stimmer - Libraries - Arduino Forum -
It strips of another 3% for printing floats

output:
testing
10737.4179
1.0182
107.37
Time=1008 << original 2144 is 53% off
done

1 millisecond for 19 digits is about 50+ uSec per digit (iso 100)

update: printFloat continues in its own thread here - faster printing of floats by divmod10() and others - Libraries - Arduino Forum -

I published a Teensyduino release candidate today, which includes Stimmer's optimization.

Thanks to everyone who worked and contributed to this awesome speedup. Soon it'll be in widespread use on Teensy 2.0 and Teensy++ 2.0 boards. :slight_smile:

Thanks Paul,

We can propose on the developers list that the divmod10 code (including #ifdef to select asm or C version) becomes a separate .h file within the core. I've seen " x/10 x%10" code constructs in several libs
(recently - TM1638 display library - Libraries - Arduino Forum - )

what do you think?

update: the divmod.h (?) could include work discussed here - divu5() and divu3(). Fast replacements for unsigned division by 3, 5, 6, 10, 15 - Libraries - Arduino Forum -