Pages: 1 ... 4 5 [6] 7 8   Go Down
Author Topic: divmod10() : a fast replacement for /10 and %10 (unsigned)  (Read 19664 times)
0 Members and 1 Guest are viewing this topic.
Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 217
Posts: 13705
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
In a quick real-world test the result I got was 3.1245us as an inline function.
speechless ...

@Paul, there is a serious new candidate for Print.cpp smiley-wink

Recall we came from
   i/10      38.6920 us
   i%10      38.6120 us
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 217
Posts: 13705
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

IN a follow up on my own message

A quick arbitrary float test  (same prog as above, slightly different printnumber)
Code:
  Serial.println(10737.41824, 4);
  Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);

original print.cpp:

testing
10737.4179
1.0182
107.37
Time=2144
done

divmod10 version:

testing
10737.4179
1.0182
107.37
Time=1472
done

so for floats there is also serious gain ~30% gain

(note this gain is only due to the integral part, the remainder part is a loop using a float mul per digit)
Code:
  // Extract digits from the remainder one at a time
  while (digits-- > 0)
  {
    remainder *= 10.0;
    int toPrint = int(remainder);
    n += print(toPrint);
    remainder -= toPrint;
  }

Created a faster snippet to print the decimal part of a float (replaces the snippet above)
// not tested yet
Code:
  //  print decimal part
  uint32_t t = 1;
  while (digits-- > 0) t = ((t<<2) + t)<<1;       // t *= 10;
  uint32_t rem1 = remainder * t;       // only float mul left
  uint32_t rem2 = ((rem1<<2) + rem1)<<1;      // rem2 = rem1*10;
  // loop for the leading zero's
  while (t > rem2)
  {
    n += print('0');
    rem2 = ((rem2 <<2) + rem2)<<1;      // rem2 *=10;
  }
  n += print(rem1);      // uses the divmod10() for the remaining digits

results

testing
10737.4179
1.0182
107.37
Time=1136  << 22% improved wrt 1472
done
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

0
Offline Offline
God Member
*****
Karma: 26
Posts: 610
Always making something...
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Wow Stimmer, that's really awesome!

Here's some test results:

Stimmer's optimization:            3.9 us/digit
Hacker's Delight & Rob & Tom:  6.4 us/digit
Original C library divide:           43.1 us/digit

These measurements were made on a Teensy 2.0.  These include the looping and array write overhead inside Print, with the burden of a 32 bit variable in the code holding the start time during the test.  Overhead to actually store the data to the output device buffer is not measured, so you should see very similar results between official Arduino and Teensy.

Here's the sketch I used for testing:

Code:
extern uint32_t usec_print;

void setup() {
  Serial.begin(115200);
  Serial.println("Benchmark print integer speed:");
  Serial.println(1073741824ul);
  Serial.println(1018195846ul);
  Serial.println(1820728843ul);
  Serial.println(2904720785ul);
  Serial.println(3678923723ul);
  Serial.println(4152282824ul);
  Serial.println(1875627853ul);
  Serial.println(2532822983ul);
  Serial.println(3519086527ul);
  Serial.println(4023424234ul);
  Serial.print("Speed = ");
  Serial.print((float)usec_print / 100.0);
  Serial.println(" us/digit");
}

void loop() {}

Here are the ready-to-use Print files.  To select which algorithm is used, comment & uncomment the #defines at line 155.

* Print.cpp (10.71 KB - downloaded 21 times.)
* Print.h (6.77 KB - downloaded 38 times.)
« Last Edit: June 27, 2013, 03:04:55 pm by Paul Stoffregen » Logged

Global Moderator
Dallas
Offline Offline
Shannon Member
*****
Karma: 207
Posts: 12903
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

(I think some attinys don't)

Correct.  ATtiny processors do not have a MUL instruction.

Congratulations on the wicked fast code!
Logged

0
Offline Offline
God Member
*****
Karma: 26
Posts: 610
Always making something...
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

I added another measurement to the test program.

With Stimmer's optimization:

Speed = 3.84 us/digit, 6.16 us/byte

The us/digit measures only the speed Print converts the 32 bit integer to a string.

The us/byte measures the entire time to put all the data into the transmit buffer.

Code:
extern uint32_t usec_print;

void setup() {
  uint32_t begin_us, end_us;
  Serial.begin(115200);
  Serial.println("Benchmark print integer speed:");
  begin_us = micros();
  Serial.print(1073741824ul);
  Serial.print(1018195846ul);
  Serial.print(1820728843ul);
  Serial.print(2904720785ul);
  Serial.print(3678923723ul);
  Serial.print(4152282824ul);
  Serial.print(1875627853ul);
  Serial.print(2532822983ul);
  Serial.println(3519086527ul);
  Serial.println(4023424234ul);
  end_us = micros();
  Serial.print("Speed = ");
  Serial.print((float)usec_print / 100.0);
  Serial.print(" us/digit, ");
  Serial.print((float)(end_us - begin_us) / 100.0);
  Serial.println(" us/byte");
}

void loop() {}

Print.cpp and Print.h are the same as #77.

I tried to bring use them on Arduino Uno, but it seems my Print and header file arrangement has diverged too far from Arduino's over the years.
Logged

0
Offline Offline
Edison Member
*
Karma: 64
Posts: 1634
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Many users need faster decimal print for logging data to SD cards.  The big problem with Print is this line
Code:
  return write(str);
A class derived from Print may have a fast write for strings but it is not uses since Print defines these members
Code:
size_t write(const char *str) {
      if (str == NULL) return 0;
      return write((const uint8_t *)str, strlen(str));
    }

size_t Print::write(const uint8_t *buffer, size_t size)
{
  size_t n = 0;
  while (size--) {
    n += write(*buffer++);
  }
  return n;
}
So printNumber uses the above write(const char *str) and the string is written character at a time.

I did a test with the new version of print.  It is faster but not as much as one would expect.

I timed this loop
Code:
  for (int i = 0; i < 20000; i++) {
    file.println(i);
  }

The result for the old print is:
Quote
Time 9.38 sec
File size 128.89 KB
Write 13.74 KB/sec
For the new print with Stimmer's optimization.
Quote
Time 6.00 sec
File size 128.89 KB
Write 21.50 KB/sec

For a simple printField() function with no optimized divide that uses SdFat's write(const char*)
Quote
Time 2.66 sec
File size 128.89 KB
Write 48.44 KB/sec

So the improvement in Print will not be fully realized as long as printNumber uses Print::write(const char*).

I believe there is a way to fix this in a class derived from Print since size_t Print::write(const char *str) is not pure virtual.  Correct me if I am wrong.
Logged

0
Offline Offline
God Member
*****
Karma: 26
Posts: 610
Always making something...
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

So printNumber uses the above write(const char *str) and the string is written character at a time.

If the SD library implements write(buf, len), then the it's supposed to be called instead of the 1-byte-at-a-time fallback in Print.cpp.
Logged

0
Offline Offline
God Member
*****
Karma: 26
Posts: 610
Always making something...
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Are you using your newer SdFat?

I just tried it here with SD and it's terribly slow.  Because I don't have the exact program you used, here's what I tried.

Code:
#include <SD.h>

const int chipSelect = 0;

void setup()
{
  Serial.begin(115200);
  while (!Serial) ;
  Serial.print("Initializing SD card...");
  if (!SD.begin(chipSelect)) {
    Serial.println("Card failed, or not present");
    return;
  }
  Serial.println("card initialized.");
  File dataFile = SD.open("datalog.txt", FILE_WRITE);
  if (dataFile) {
    uint32_t begin_us, end_us;
    begin_us = micros();
    for (int i = 0; i < 20000; i++) {
      dataFile.println(i);
    }
    dataFile.close();
    end_us = micros();
    Serial.print("seconds =  ");
    Serial.print((float)(end_us - begin_us) / 1000000.0);
  }  
}

void loop() {}
Logged

0
Offline Offline
Edison Member
*
Karma: 64
Posts: 1634
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Yes but if I declared Print::write(const char*) virtual and put the default definition in Print.cpp I get this improvement.

Old Print
Quote
Time 6.85 sec
File size 128.89 KB
Write 18.83 KB/sec

The new Print uses the size_t Print::write(const uint8_t *buffer, size_t size) but SdFat defines

Code:
 int write(const void* buf, size_t nbyte);

If I override the Print def of write for the new Print I get:
Quote
Time 2.61 sec
File size 128.89 KB
Write 49.40 KB/sec

A big improvement over the old print and no overrides:
Quote
Time 9.47 sec
File size 128.89 KB
Write 13.61 KB/sec

« Last Edit: June 27, 2013, 05:30:45 pm by fat16lib » Logged

0
Offline Offline
Edison Member
*
Karma: 64
Posts: 1634
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I am using a very new version of SdFat, not the 3-4 year old version in SD.h.

I have some extra stuff in my loop to time the max and min print time but commenting it out doesn't change much.

The new SdFat is much faster, it allows flash programming to happen in the SD while the next block buffer is being filled by the Arduino program.
« Last Edit: June 27, 2013, 05:36:47 pm by fat16lib » Logged

0
Offline Offline
Edison Member
*
Karma: 64
Posts: 1634
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Paul,

I tried your sketch on a Uno with 1.0.5 and get a fast write with SD.h and the new Print.  About the same as with the new SdFat.  Looks like you don't even need the SdFat improvements at 50 KB/sec.

Quote
Initializing SD card...card initialized.
seconds =  2.53
Logged

Offline Offline
God Member
*****
Karma: 32
Posts: 507
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

There is another speed optimization to make when you are using divmod10 to convert a long into base-10 digits: Only the first few digits of a 10-digit number need full 32-bit precision. Once the high byte is 0 you only need 24 bit precision (saves at least 10 cycles), then when the two highest bytes are 0 you only need a 16 bit divmod10 (saves over a microsecond). Once the three highest bytes are 0 the 8-bit divmod10 only takes a microsecond, and for the very last digit you don't need to do a division at all.

Code:
  while(n & 0xff000000){divmod10_asm32(n, digit, tmp8);*--p = digit + '0';}
  while(n & 0xff0000){divmod10_asm24(n, digit, tmp8);*--p = digit + '0';}
  while(n & 0xff00){divmod10_asm16(n, digit, tmp8);*--p = digit + '0';}
  while((n & 0xff)>9){divmod10_asm8(n, digit, tmp8);*--p = digit + '0';}
  *--p = n + '0';

The downside is an increase in code size of about 200 bytes. I've attached a modified Print.cpp including all the assembler macros, my profiling indicates it's 0.5us-1.5us faster per digit on average (which adds up to 7us faster for 5-10 digit numbers). I have not fully tested for accuracy yet.

* Print.cpp (13.58 KB - downloaded 15 times.)
Logged


Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 217
Posts: 13705
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

a well implemented trick - trading space for speed -

There are other tricks possible, not investigated though

Consider the fact that the largest uint32_t  < 50000 * 10000 * 10000;
with a divmod10000() one could divide the uint32 in 4digit chunks which can be processed by divmod10_asm16

a divmod1000() might be broader usable

another trick would use a divmod100() to extract 2 digits at a time => fit in a byte. => divmod10_asm8() to split it
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

0
Offline Offline
Edison Member
*
Karma: 64
Posts: 1634
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I have used the ideas here to improve the speed of logging data to SD cards.  I wrote functions to print a numerical field to an SD card that are as much as five times faster than the standard Arduino print.

I am now able to log data from four analog pins on an Uno to an SD at 2000 Hz.  This is 8000 ADC values per second.  This is near the limit since each conversion takes about 110 microseconds.

I am using a small RTOS, Nil RTOS, written by Giovanni Di Sirio.  I ported this RTOS to avr arduinos http://code.google.com/p/rtoslibs/

I wrote a version of analogRead() that sleeps while the ADC conversion occurs.  This saves about 90 microseconds of CPU time per conversion.

I have also modified SdFat so it no longer busy waits for flash programming in the SD to complete.

Here are test results for 16-bit unsigned fields and 32-bit float fields.
Quote
This test compares the standard print function
with printField.  Each test prints 20,000 values.

Test of println(uint16_t)
Time 6.51 sec
File size 128.89 KB
Write 19.80 KB/sec

Test of printField(uint16_t, char)
Time 1.27 sec
File size 128.89 KB
Write 101.33 KB/sec

Test of println(double)
Time 17.17 sec
File size 160.00 KB
Write 9.32 KB/sec

Test of printField(double, char)
Time 3.56 sec
File size 160.00 KB
Write 44.94 KB/sec

Here are the test loops for the writes:
Code:
   switch(test) {
    case 0:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.println(i);
      }
      break;

    case 1:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.printField(i, '\n');
      }
      break;

    case 2:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.println((double)0.0001*i, 4);
      }
      break;

    case 3:
      for (uint16_t i = 0; i < N_PRINT; i++) {
        file.printField((double)0.0001*i, '\n', 4);
      }
   }

I have attached the implementation of printField().
 

* PrintField.cpp (8.85 KB - downloaded 20 times.)
Logged

Global Moderator
Netherlands
Offline Offline
Shannon Member
*****
Karma: 217
Posts: 13705
In theory there is no difference between theory and practice, however in practice there are many...
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Impressive numbers, really good !

think divmod10() should become part of the core seeing these and the above numbers.
Logged

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Pages: 1 ... 4 5 [6] 7 8   Go Up
Jump to: