Go Down

Topic: faster printing of floats by divmod10() and others (Read 48287 times) previous topic - next topic

robtillaart

Jul 24, 2013, 02:10 pm Last Edit: Jul 24, 2013, 02:34 pm by robtillaart Reason: 1
This is a continuation of my experiments of improving the performance of printing floats in the Print class by means of the fast divmod10() algorithm and some additional improvements. This discussion started in the divmod10() thread but this topic is broader that's why I started a new thread.

Some of printFloat() related posts: (timings refer to the test script)
- http://forum.arduino.cc/index.php?topic=167414.msg1291487#msg1291487 - initial test (2144 uSec for reference Print.cpp, 1472 uSec for improved version)
- http://forum.arduino.cc/index.php?topic=167414.msg1295109#msg1295109 - improved performance of remainder (1136 uSec)
- http://forum.arduino.cc/index.php?topic=167414.msg1312959#msg1312959 - new remainder code, (1104 uSec, followed by a faster '.' => 1072uSec)
- http://forum.arduino.cc/index.php?topic=167414.msg1324061#msg1324061 - used Stimmer's  ASM divmod10_asm (1008 uSec)

I have been adapting the printFloat() function for some time now, not only speeding up performance but also include support for Engineering and Scientific notation. This latter is a breaking change in the interface which is discussed here - http://forum.arduino.cc/index.php?topic=166041.0

The current version of printFloat takes 720 uSec for the test, almost 3x as fast as the original print.cpp [IDE 1.0.4]
This is mainly due to new remainder + rounding code.

test-code (definitely not complete) with a small check to see if 0..6 digits works OK.
Code: [Select]

unsigned long start = 0;
unsigned long stop = 0;
volatile unsigned long q;

void setup()
{
 Serial.begin(115200);
 Serial.println("testing");

 Serial.println(1.5, 6);
 Serial.println(1.49999, 6);
 Serial.println(1.9999, 6);
 Serial.println(1.5, 5);
 Serial.println(1.49999, 5);
 Serial.println(1.9999, 5);
 Serial.println(1.5, 4);
 Serial.println(1.49999, 4);
 Serial.println(1.9999, 4);
 Serial.println(1.5, 3);
 Serial.println(1.49999, 3);
 Serial.println(1.9999, 3);
 Serial.println(1.5, 2);
 Serial.println(1.49999, 2);
 Serial.println(1.9999, 2);
 Serial.println(1.5, 1);
 Serial.println(1.49999, 1);
 Serial.println(1.9999, 1);
 Serial.println(1.5, 0);
 Serial.println(1.49999, 0);
 Serial.println(1.9999, 0);
 Serial.println();

 byte backup = TIMSK0;

 TCCR1A = 0;
 TCCR1B = 4;
 TIMSK1 |= _BV(TOIE1);

 Serial.flush(); //wait for serial buffer to clear
 TIMSK0 = 0;     //disable millis;
 TCNT1 = 0;

 Serial.println(10737.41824, 4);
 Serial.println(1.01819584, 4);
 Serial.println(107.37, 2);
 // test odd #digits
 //  Serial.println(10737.41824, 5);
 //  Serial.println(1.01819584, 3);
 //  Serial.println(107.37, 1);

 // code to check the overhead float vs long ~~ same amount of digits.
 //  Serial.println(10737);
 //  Serial.println(1);
 //  Serial.println(107);
 //  Serial.println(4182);
 //  Serial.println(9181);  // 0181 would be octal :)
 //  Serial.println(37);

 stop = TCNT1;     //how many clock cycles.
 TIMSK0 = backup;  //renable millis;
 stop *= 16;       //There are 16us per clock cycle with a 1:256 prescaler.

 Serial.print("\nTime=");
 Serial.println(stop);
 Serial.print("per char incl .\\r\\n : ");
 Serial.println(stop/28.0);    // depends on test run to be right
 Serial.println("done");

 TIMSK1 &= ~_BV(TOIE1);
}

void loop()
{
}

ISR(TIMER1_OVF_vect) {
 // reentrancy ...
 // Serial.println("I Overflowed!"); //just to make sure we can tell if this happens.
}


The output:
Code: [Select]
testing
...

10737.4179
1.0182
107.37

Time=720     <<<<<<<<<<<<<<<<<<<<<<<<< came from 2144 -> factor 3
per char incl .\r\n : 25.71
done



My current print.h and print.cpp file are attached (warning contains experimental code)

The last improvements are
- new strategy (again) for remainder code    // original + prev version in comments, (several intermediate version rejected ;)
- optimized the rounding loop    // yes, a lookup table for rounding is faster, ==> ~40 additional bytes

as always, improvements and remarks are welcome
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

darrylp

#1
Jul 24, 2013, 09:37 pm Last Edit: Jul 24, 2013, 09:45 pm by darryl Reason: 1
do floats / doubles on the AVR actually have as much as 10 digits of precision ?  if only accurate down to say 8, then its another 20% saving building the part after the decimal point.

I too have looked and thought about trying to speed up float printing.

i'd say the divmod10_asm could be made a routine to save inlining it twice, and then accept the call overhead from printNumber using it as well.

EDIT:
also replacing the test for <0 at the at the start to print the leading minu sign ,is quicker if swapped for <code> if ( signbit(number) )  </code> this allows for -0.0 which is legal for floating point numbers,  and can code reduce the infinite test to come after this check as well.
--
 Darryl

robtillaart

Thanks Darryl,

a reaction per item:
- IEE754 has only 7 significant digits so you have a point there (in practice). However if a number is really small e.g. 3E-15 then even 20 digits would work (in theory) with the original code . With the SCIentific notation the number would be normalized to the significant digits only and then 7 or 8 is enough.
I do not know if the DUE implements a 64 bit double? Code should work there too, (so it should be 16 digits max?)

- I'll check if supporting max 7 digits gives some gain. => I expect rounding code can be optimized with 3 bit masks instead of a loop

- I'll add making the inline a routine to my TODO => I expect it will save only memory

- The double minus check was already on my TODO,  did not consider the  signbit()  function. I have written some sign manipulation code that directly affects the internal representation, it is fast but obscure and will therefore not use it. I expect signbit() does something similar.

I am also looking into a test that can detect NAN and INF in one step as both values have 0xFF as exponent. That would separate the printable numbers from NAN/INF in one compare instead of two tests.  byte getExponent(float n);   Removing these "special value tests" conditionally would definitely give a performance gain.

Finally I think there can be some gain in creating the factor t that is used to make an uint32_t of the decimal part. This loop has the same # iterations as the rounding loop, which I optimized by taking 2 digits per iteration (reduced the FP multiplies). Also for smaller decimal parts an uint16_t might work too?

In short there are still some unexplored corners.

Thanks for your remarks,
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

darrylp

#3
Jul 25, 2013, 09:43 am Last Edit: Jul 25, 2013, 09:52 am by darryl Reason: 1
here is the top of printFloat checks i changed it to.

Code: [Select]

 // is really a number ?
 if (isnan(number)) return print("NaN");

 // Handle negative numbers - as in floating point -0.0 is valid !  idiots !
 if ( signbit(number) ) {
prn_cnt += print('-');
number = -number;
 }
 // is it too big ?
 if (isinf(number)) return print("Inf");



oops, can see i've changed name of var 'n' into 'prn_cnt'.  Surely something like  ' byte getExponent(float n); @ wont allow returing correctly if NaN or Inf ( or its -Inf ) counter part ?


as for the number of digitis on small numbers, i.e.  1E-15
then a similar check as for the large ones to switch into SCI notation format ?

Code: [Select]

 // print numbers bigger than maxfloat in E notation, if asking for default
 // small numbers will still be printed as 0.00 as the "number string" in E notation is longer for small numbers.
 // WARNING - may cause broken layout on LCD display if user was expecting a certain format
 if ( (Enotation == DEF) && (number > 1000000000.0) ) Enotation = SCI;

--
 Darryl

robtillaart

#4
Jul 25, 2013, 11:17 am Last Edit: Jul 25, 2013, 11:19 am by robtillaart Reason: 1
replaced the tests for NAN and INF with a single test if the binary exponent is 0xFF, as this indicates both

patch for print.cpp
Code: [Select]

// check IEEE754 bit layout to understand code below.
// note the exponent is a power of 2
byte getExponentByte(float number)
{
   uint8_t e = (*(((byte*) &number)+3) & 0x7F) << 1;
   if (*(((byte*) &number)+2) & 0x80) e++;
   return e;
}

size_t Print::printFloat(double number, uint8_t digits, uint8_t notation)
{
   size_t n = 0;
   int exponent = 0;

   byte e = getExponentByte(number);
   // Recognize NAN and INF in single test
   if (e == 0xFF)
   {
       if (isnan(number))
       {
           n += print("nan");
           return n;
       }
       // if not NAN => INF
       if (number < 0.0) n += write('-');
       n += print("inf");
       return n;
   }

   // handle negative values
   if (number < 0.0)
   {
       n += write('-');
       number = -number;
   }

....


the time dropped from 720 to 704   gain ~2%,  

I need to check the timer code in the test program, as the step size becomes relative too big.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

Some note about the timing.
The test sketch clocks 704 uS now (~25/char). If I do the test 100 times in a loop the time becomes about 2380 uSec (~85/char).
That is the time it takes to sent the char at 115200 baud, not the time to store it in the buffer.
As in the test the serial buffer is not filled up it mainly measures the time it takes to convert which is not blocked by a full buffer.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

Small test with #digits in the original Print.cpp

  f = 1.23456789E-15;
  Serial.println(f, 25);

outputs: 0.0000000000000012345678806    // 8 significant digits, the 9th digit is wrong (not much)

even possible to print 60 digits:  0.000000000000001234567880630493164062500000000000000000000000   

This quick test learns two things:
- keeping space for 10 significant digits is 1 digit too much as above example failed at the 9th digit (I want to be able to see the "error")
- limiting the decimal part to 10 (or less) digits breaks above functionality. Although I expect it is seldom used it is still a break in the interface.

back to drawing board?
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart


adjusted the test sketch for finer timing granularity, steps of 16 uSecs are too big
Code: [Select]

....
  TCCR1A = 0;
  TCCR1B = 2;   // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< changed prescaler
  TIMSK1 |= _BV(TOIE1);

  Serial.flush(); //wait for serial buffer to clear
  TIMSK0 = 0;     //disable millis;
  TCNT1 = 0;

  Serial.println(10737.41824, 4);
  Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);

  stop = TCNT1;     //how many clock cycles.
  TIMSK0 = backup;  //renable millis;
  stop /= 2;       // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< changed conversion to uSecs

  Serial.print("\nTime=");
  Serial.print(stop);
...


latest timing goes from 704 to 698 due to increased accuracy.   (24.93 /char)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

@Darryl
replaced the negative number test in Print.cpp with signbit() as you proposed
Code: [Select]

    // handle negative values
    // if (number < 0.0)
    if (signbit(number))
    {
        n += write('-');
        number = -number;
    }

Time=687

that's about 10 uS off (~3.5 uS per number)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#9
Jul 25, 2013, 10:24 pm Last Edit: Jul 25, 2013, 10:27 pm by robtillaart Reason: 1
Changed type of temp array for leading zeros, and the call to write() at the end of the decimal part.
allowed me to give the size param directly (=digits) so it skipped one loop over the array.
Code: [Select]

   // Print the decimal part
   if (digits > 0)
   {
       n += write('.');
       double remainder = number - int_part;

       // make an unsigned long of the decimal part
       uint32_t t = 10;     // at least one decimal
       for (uint8_t i=1; i < digits; i++) t = ((t<<2) + t)<<1;  // t *= 10  
       uint32_t rem = remainder * t;
       
       // handle preceding zero's !
       uint8_t z[10]="000000000";  // max 9 decimals
       uint8_t *str = &z[digits];
       *str-- = '\0';

       uint32_t d;
       uint8_t m;
       while (rem >= 10)
       {
           divmod10_asm(rem, m, d);
           *str-- = m + '0';
       }
       *str = rem + '0';   // last digit can directly be 'printed'
       n += write(z, digits);
   }

Time=673
per char incl .\r\n : 24.04

This might work in printNumber() too.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#10
Jul 25, 2013, 10:37 pm Last Edit: Jul 25, 2013, 10:50 pm by robtillaart Reason: 1
Yes, works for  Print::printNumber() too, only need one pointer to track the end of the string.

Code: [Select]
size_t Print::printNumber(unsigned long n, uint8_t base) {
   char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.
   char *str = &buf[sizeof(buf) - 1];
   char *estr = str;
   *str = '\0';

   uint8_t c;
   uint32_t d;

   // prevent crash if called with base <= 1
   if (base < 2) base = 10;

   switch(base)
   {
   case HEX:
       do {
           c = n & 0x0F;
           n >>= 4;
           *--str = c < 10 ? c + '0' : c + ('A' - 10);
       } while(n > 0);
       break;

   case DEC:
       do {
           divmod10_asm(n, c, d);
           *--str = c + '0';
       } while(n > 0);
       break;

   case OCT:
       // *--str = '0';   //leading 0 to indicate octal ?
       do {
           c = n & 0x07;
           n >>= 3;
           *--str = c + '0';
       } while(n > 0);
       break;

   case BIN:
       // *--str = 'b';  // leading b to indicate binary ?
       do {
           c = n & 0x01;
           n >>= 1;
           *--str = c + '0';
       } while(n > 0);
       break;

   default:
       do {
           d = n;
           n /= base;
           c = d - base * n;
           *--str = c < 10 ? c + '0' : c + 'A' - 10;
       } while(n);
       break;
   }
   return write((const uint8_t*)str, (size_t)(estr-str));
}


As printFloat() uses printNumber() for the integer part its timing is improved too :)

Time=665       
per char incl .\r\n : 23.75
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#11
Jul 25, 2013, 10:54 pm Last Edit: Jul 25, 2013, 11:21 pm by robtillaart Reason: 1
saw another small gain in printNumber() -- comments are the "original" code

patch
Code: [Select]

   case DEC:
       while (n >= 10)
       {
           divmod10_asm(n, c, d);
           *--str = c + '0';
       }
       *--str = n + '0';' // last digit is right by definition -> one call less to divmod10_asm()
       // do {
           // divmod10_asm(n, c, d);
           // *--str = c + '0';
       // } while(n > 0);
       break;

gain 2uSec per number printed.  (this step can be applied to other bases too)

Time=659
per char incl .\r\n : 23.54

Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

darrylp

#12
Jul 26, 2013, 02:16 pm Last Edit: Jul 26, 2013, 02:50 pm by darryl Reason: 1
nice,  its getting much quicker.

however the test line

 Serial.println(10737.41824, 4);

is actually outputting

 10737.4179

that's a fail, we've gone over 7 total digits....  I guess we need to combine the digits >0 to the remaining ( requested ) decimal places and cut them down, instead of showing trailing zero, i'd think a space in that place would be better.


So, we are hitting the 7 or so significant digits.  going on that line of thought.  i'm using a MAX_FLOAT_DIGITS set to 6
and instead of calculating the the multiplier, use a lookup,  its still quicker if you go for 8 digits max on  floats, but the space saving goes then.

so using the following in the section where we calculate the remainder

Code: [Select]

 const uint32_t remMult[FLOAT_MAX_DIGITS] = { 10, 100, 1000, 10000, 100000, 1000000 }; //, 10000000, 100000000 };

....

 uint32_t rem = remainder * remMult[digits - 1];

--
 Darryl

darrylp

oops, forgot to say, the times,   my system, doesn't have all the changes you've done

but the reduced max number of decimal digits, and the lookup give me 664 for the test, and 23.71 per char.

that's down from 696 before i'm fairly sure it was.
as most of my projects are quite code space hungry,  i'm keeping a close eye on the end program size. hence not taking all of your mods,  the trimming of max number of digits ever being printed etc.

So, Rob, do a quick try with a lookup rather than looping for the remainder multiplier calc. its a definite speed up, and at 6 values, a code saving as well. 8 digits its fractionally bigger.... but as i said earlier, we have already moved into wrong digits coming our after seven digits being printed ( unless the first digit being printed is a zero.
--
 Darryl

darrylp

#14
Jul 26, 2013, 03:20 pm Last Edit: Jul 26, 2013, 03:26 pm by darryl Reason: 1

that's a fail, we've gone over 7 total digits....  I guess we need to combine the digits >0 to the remaining ( requested ) decimal places and cut them down, instead of showing trailing zero, i'd think a space in that place would be better.


Code: [Select]

 // Extract the integer part of the number and print it
 uint32_t int_part = number;
 uint8_t tmp = prn_cnt;
 prn_cnt += print(int_part, DEC);
 uint8_t digits_available = 7 - ( prn_cnt - tmp );
 if ( digits > digits_available ) digits = digits_available;


down from my 664 to 641,  and 22.89 per char. ( this value isnt right as we no longer print 28 chars in the test file )
argh, as we are printing less, of course its going to be quicker.... but for streaming data to SD or out a serial port etc. then its a win surely ?


* of course i'm using prn_cnt  instead of 'n'

the 10737.41824  requested to print with 4 decimal places now print  10737.41,   as the rounding happened before we knew the integer part of the number, we couldn't ( does it seem worth the time to do the calc twice ?  to see the count of the significant digits we will use for the integer part ) before we do some rounding on the actual number.  hmm. not sure what to suggest as ideally we should see 10737.42 i think. or maybe the truncated value is better ?

ps. I think we need a fuller set of test figures, which will show real improvements for a better range of cases.
--
 Darryl

Go Up