faster printing of floats by divmod10() and others

robtillaart · July 24, 2013, 12:10pm

This is a continuation of my experiments of improving the performance of printing floats in the Print class by means of the fast divmod10() algorithm and some additional improvements. This discussion started in the divmod10() thread but this topic is broader that's why I started a new thread.

Some of printFloat() related posts: (timings refer to the test script)

divmod10() : a fast replacement for /10 and %10 (unsigned) - #58 by robtillaart - Libraries - Arduino Forum - initial test (2144 uSec for reference Print.cpp, 1472 uSec for improved version)
divmod10() : a fast replacement for /10 and %10 (unsigned) - #77 by robtillaart - Libraries - Arduino Forum - improved performance of remainder (1136 uSec)
divmod10() : a fast replacement for /10 and %10 (unsigned) - #100 by robtillaart - Libraries - Arduino Forum - new remainder code, (1104 uSec, followed by a faster '.' => 1072uSec)
divmod10() : a fast replacement for /10 and %10 (unsigned) - #104 by robtillaart - Libraries - Arduino Forum - used Stimmer's ASM divmod10_asm (1008 uSec)

I have been adapting the printFloat() function for some time now, not only speeding up performance but also include support for Engineering and Scientific notation. This latter is a breaking change in the interface which is discussed here - Proposed update for the printFloat code of print.cpp - Libraries - Arduino Forum

The current version of printFloat takes 720 uSec for the test, almost 3x as fast as the original print.cpp [IDE 1.0.4]
This is mainly due to new remainder + rounding code.

test-code (definitely not complete) with a small check to see if 0..6 digits works OK.

unsigned long start = 0;
unsigned long stop = 0;
volatile unsigned long q;

void setup()
{
  Serial.begin(115200);
  Serial.println("testing");

  Serial.println(1.5, 6);
  Serial.println(1.49999, 6);
  Serial.println(1.9999, 6);
  Serial.println(1.5, 5);
  Serial.println(1.49999, 5);
  Serial.println(1.9999, 5);
  Serial.println(1.5, 4);
  Serial.println(1.49999, 4);
  Serial.println(1.9999, 4);
  Serial.println(1.5, 3);
  Serial.println(1.49999, 3);
  Serial.println(1.9999, 3);
  Serial.println(1.5, 2);
  Serial.println(1.49999, 2);
  Serial.println(1.9999, 2);
  Serial.println(1.5, 1);
  Serial.println(1.49999, 1);
  Serial.println(1.9999, 1);
  Serial.println(1.5, 0);
  Serial.println(1.49999, 0);
  Serial.println(1.9999, 0);
  Serial.println();

  byte backup = TIMSK0;

  TCCR1A = 0;
  TCCR1B = 4; 
  TIMSK1 |= _BV(TOIE1);

  Serial.flush(); //wait for serial buffer to clear
  TIMSK0 = 0;     //disable millis;
  TCNT1 = 0;

  Serial.println(10737.41824, 4);
  Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);
  // test odd #digits
  //  Serial.println(10737.41824, 5);
  //  Serial.println(1.01819584, 3);
  //  Serial.println(107.37, 1);

  // code to check the overhead float vs long ~~ same amount of digits.
  //  Serial.println(10737);
  //  Serial.println(1);
  //  Serial.println(107);
  //  Serial.println(4182);
  //  Serial.println(9181);  // 0181 would be octal :)
  //  Serial.println(37);

  stop = TCNT1;     //how many clock cycles.
  TIMSK0 = backup;  //renable millis;
  stop *= 16;       //There are 16us per clock cycle with a 1:256 prescaler.

  Serial.print("\nTime=");
  Serial.println(stop);
  Serial.print("per char incl .\\r\\n : ");
  Serial.println(stop/28.0);    // depends on test run to be right
  Serial.println("done");

  TIMSK1 &= ~_BV(TOIE1);
}

void loop()
{
}

ISR(TIMER1_OVF_vect) {
  // reentrancy ...
  // Serial.println("I Overflowed!"); //just to make sure we can tell if this happens.
}

The output:

testing
...

10737.4179
1.0182
107.37

Time=720     <<<<<<<<<<<<<<<<<<<<<<<<< came from 2144 -> factor 3
per char incl .\r\n : 25.71
done

My current print.h and print.cpp file are attached (warning contains experimental code)

The last improvements are

new strategy (again) for remainder code // original + prev version in comments, (several intermediate version rejected
optimized the rounding loop // yes, a lookup table for rounding is faster, ==> ~40 additional bytes

as always, improvements and remarks are welcome

Print.h (2.88 KB)

Print.cpp (14.7 KB)

darrylp · July 24, 2013, 7:37pm

do floats / doubles on the AVR actually have as much as 10 digits of precision ? if only accurate down to say 8, then its another 20% saving building the part after the decimal point.

I too have looked and thought about trying to speed up float printing.

i'd say the divmod10_asm could be made a routine to save inlining it twice, and then accept the call overhead from printNumber using it as well.

EDIT:
also replacing the test for <0 at the at the start to print the leading minu sign ,is quicker if swapped for if ( signbit(number) ) this allows for -0.0 which is legal for floating point numbers, and can code reduce the infinite test to come after this check as well.

robtillaart · July 24, 2013, 9:02pm

Thanks Darryl,

a reaction per item:

IEE754 has only 7 significant digits so you have a point there (in practice). However if a number is really small e.g. 3E-15 then even 20 digits would work (in theory) with the original code . With the SCIentific notation the number would be normalized to the significant digits only and then 7 or 8 is enough.
I do not know if the DUE implements a 64 bit double? Code should work there too, (so it should be 16 digits max?)
I'll check if supporting max 7 digits gives some gain. => I expect rounding code can be optimized with 3 bit masks instead of a loop
I'll add making the inline a routine to my TODO => I expect it will save only memory
The double minus check was already on my TODO, did not consider the signbit() function. I have written some sign manipulation code that directly affects the internal representation, it is fast but obscure and will therefore not use it. I expect signbit() does something similar.

I am also looking into a test that can detect NAN and INF in one step as both values have 0xFF as exponent. That would separate the printable numbers from NAN/INF in one compare instead of two tests. byte getExponent(float n); Removing these "special value tests" conditionally would definitely give a performance gain.

Finally I think there can be some gain in creating the factor t that is used to make an uint32_t of the decimal part. This loop has the same # iterations as the rounding loop, which I optimized by taking 2 digits per iteration (reduced the FP multiplies). Also for smaller decimal parts an uint16_t might work too?

In short there are still some unexplored corners.

Thanks for your remarks,

darrylp · July 25, 2013, 7:43am

here is the top of printFloat checks i changed it to.

  // is really a number ?
  if (isnan(number)) return print("NaN");

  // Handle negative numbers - as in floating point -0.0 is valid !  idiots !
  if ( signbit(number) ) {
	prn_cnt += print('-');
	number = -number;
  }
  // is it too big ?
  if (isinf(number)) return print("Inf");

oops, can see i've changed name of var 'n' into 'prn_cnt'. Surely something like ' byte getExponent(float n); @ wont allow returing correctly if NaN or Inf ( or its -Inf ) counter part ?

as for the number of digitis on small numbers, i.e. 1E-15
then a similar check as for the large ones to switch into SCI notation format ?

  // print numbers bigger than maxfloat in E notation, if asking for default
  // small numbers will still be printed as 0.00 as the "number string" in E notation is longer for small numbers.
  // WARNING - may cause broken layout on LCD display if user was expecting a certain format
  if ( (Enotation == DEF) && (number > 1000000000.0) ) Enotation = SCI;

robtillaart · July 25, 2013, 9:17am

replaced the tests for NAN and INF with a single test if the binary exponent is 0xFF, as this indicates both

patch for print.cpp

// check IEEE754 bit layout to understand code below.
// note the exponent is a power of 2
byte getExponentByte(float number)
{
    uint8_t e = (*(((byte*) &number)+3) & 0x7F) << 1;
    if (*(((byte*) &number)+2) & 0x80) e++;
    return e;
}

size_t Print::printFloat(double number, uint8_t digits, uint8_t notation) 
{ 
    size_t n = 0;
    int exponent = 0;

    byte e = getExponentByte(number);
    // Recognize NAN and INF in single test 
    if (e == 0xFF) 
    {
        if (isnan(number))
        {
            n += print("nan");
            return n;
        }
        // if not NAN => INF
        if (number < 0.0) n += write('-');
        n += print("inf");
        return n;
    }

    // handle negative values
    if (number < 0.0)
    {
        n += write('-');
        number = -number;
    }

....

the time dropped from 720 to 704 gain ~2%,

I need to check the timer code in the test program, as the step size becomes relative too big.

robtillaart · July 25, 2013, 10:29am

Some note about the timing.
The test sketch clocks 704 uS now (~25/char). If I do the test 100 times in a loop the time becomes about 2380 uSec (~85/char).
That is the time it takes to sent the char at 115200 baud, not the time to store it in the buffer.
As in the test the serial buffer is not filled up it mainly measures the time it takes to convert which is not blocked by a full buffer.

robtillaart · July 25, 2013, 11:02am

Small test with #digits in the original Print.cpp

f = 1.23456789E-15;
Serial.println(f, 25);

outputs: 0.0000000000000012345678806 // 8 significant digits, the 9th digit is wrong (not much)

even possible to print 60 digits: 0.000000000000001234567880630493164062500000000000000000000000

This quick test learns two things:

keeping space for 10 significant digits is 1 digit too much as above example failed at the 9th digit (I want to be able to see the "error")
limiting the decimal part to 10 (or less) digits breaks above functionality. Although I expect it is seldom used it is still a break in the interface.

back to drawing board?

robtillaart · July 25, 2013, 11:11am

adjusted the test sketch for finer timing granularity, steps of 16 uSecs are too big

....
  TCCR1A = 0;
  TCCR1B = 2;   // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< changed prescaler
  TIMSK1 |= _BV(TOIE1);

  Serial.flush(); //wait for serial buffer to clear
  TIMSK0 = 0;     //disable millis;
  TCNT1 = 0;

  Serial.println(10737.41824, 4);
  Serial.println(1.01819584, 4);
  Serial.println(107.37, 2);

  stop = TCNT1;     //how many clock cycles.
  TIMSK0 = backup;  //renable millis;
  stop /= 2;       // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< changed conversion to uSecs

  Serial.print("\nTime=");
  Serial.print(stop);
...

latest timing goes from 704 to 698 due to increased accuracy. (24.93 /char)

robtillaart · July 25, 2013, 7:44pm

@Darryl
replaced the negative number test in Print.cpp with signbit() as you proposed

    // handle negative values
    // if (number < 0.0)
    if (signbit(number))
    {
        n += write('-');
        number = -number;
    }

Time=687

that's about 10 uS off (~3.5 uS per number)

robtillaart · July 25, 2013, 8:24pm

Changed type of temp array for leading zeros, and the call to write() at the end of the decimal part.
allowed me to give the size param directly (=digits) so it skipped one loop over the array.

    // Print the decimal part
    if (digits > 0) 
    {
        n += write('.');
        double remainder = number - int_part;

        // make an unsigned long of the decimal part
        uint32_t t = 10;     // at least one decimal
        for (uint8_t i=1; i < digits; i++) t = ((t<<2) + t)<<1;  // t *= 10  
        uint32_t rem = remainder * t;
        
        // handle preceding zero's !
        uint8_t z[10]="000000000";  // max 9 decimals
        uint8_t *str = &z[digits];
        *str-- = '\0'; 

        uint32_t d;
        uint8_t m;
        while (rem >= 10)
        {
            divmod10_asm(rem, m, d);
            *str-- = m + '0';
        }
        *str = rem + '0';   // last digit can directly be 'printed'
        n += write(z, digits);
    }

Time=673
per char incl .\r\n : 24.04

This might work in printNumber() too.

robtillaart · July 25, 2013, 8:37pm

Yes, works for Print::printNumber() too, only need one pointer to track the end of the string.

size_t Print::printNumber(unsigned long n, uint8_t base) {
    char buf[8 * sizeof(long) + 1]; // Assumes 8-bit chars plus zero byte.
    char *str = &buf[sizeof(buf) - 1];
    char *estr = str;
    *str = '\0';

    uint8_t c;
    uint32_t d;

    // prevent crash if called with base <= 1
    if (base < 2) base = 10;

    switch(base)
    {
    case HEX:
        do {
            c = n & 0x0F;
            n >>= 4;
            *--str = c < 10 ? c + '0' : c + ('A' - 10);
        } while(n > 0);
        break;

    case DEC:
        do {
            divmod10_asm(n, c, d);
            *--str = c + '0';
        } while(n > 0);
        break;

    case OCT:
        // *--str = '0';   //leading 0 to indicate octal ?
        do {
            c = n & 0x07;
            n >>= 3;
            *--str = c + '0';
        } while(n > 0);
        break;

    case BIN:
        // *--str = 'b';  // leading b to indicate binary ?
        do {
            c = n & 0x01;
            n >>= 1;
            *--str = c + '0';
        } while(n > 0);
        break;

    default:
        do {
            d = n;
            n /= base;
            c = d - base * n;
            *--str = c < 10 ? c + '0' : c + 'A' - 10;
        } while(n);
        break;
    }
    return write((const uint8_t*)str, (size_t)(estr-str));
}

As printFloat() uses printNumber() for the integer part its timing is improved too

Time=665
per char incl .\r\n : 23.75

robtillaart · July 25, 2013, 8:54pm

saw another small gain in printNumber() -- comments are the "original" code

patch

    case DEC:
        while (n >= 10)
        {
            divmod10_asm(n, c, d);
            *--str = c + '0';
        }
        *--str = n + '0';' // last digit is right by definition -> one call less to divmod10_asm()
        // do {
            // divmod10_asm(n, c, d);
            // *--str = c + '0';
        // } while(n > 0);
        break;

gain 2uSec per number printed. (this step can be applied to other bases too)

Time=659
per char incl .\r\n : 23.54

darrylp · July 26, 2013, 12:16pm

nice, its getting much quicker.

however the test line

Serial.println(10737.41824, 4);

is actually outputting

10737.4179

that's a fail, we've gone over 7 total digits.... I guess we need to combine the digits >0 to the remaining ( requested ) decimal places and cut them down, instead of showing trailing zero, i'd think a space in that place would be better.

So, we are hitting the 7 or so significant digits. going on that line of thought. i'm using a MAX_FLOAT_DIGITS set to 6
and instead of calculating the the multiplier, use a lookup, its still quicker if you go for 8 digits max on floats, but the space saving goes then.

so using the following in the section where we calculate the remainder

  const uint32_t remMult[FLOAT_MAX_DIGITS] = { 10, 100, 1000, 10000, 100000, 1000000 }; //, 10000000, 100000000 };

....

  uint32_t rem = remainder * remMult[digits - 1];

darrylp · July 26, 2013, 12:48pm

oops, forgot to say, the times, my system, doesn't have all the changes you've done

but the reduced max number of decimal digits, and the lookup give me 664 for the test, and 23.71 per char.

that's down from 696 before i'm fairly sure it was.
as most of my projects are quite code space hungry, i'm keeping a close eye on the end program size. hence not taking all of your mods, the trimming of max number of digits ever being printed etc.

So, Rob, do a quick try with a lookup rather than looping for the remainder multiplier calc. its a definite speed up, and at 6 values, a code saving as well. 8 digits its fractionally bigger.... but as i said earlier, we have already moved into wrong digits coming our after seven digits being printed ( unless the first digit being printed is a zero.

darrylp · July 26, 2013, 1:20pm

darryl:
that's a fail, we've gone over 7 total digits.... I guess we need to combine the digits >0 to the remaining ( requested ) decimal places and cut them down, instead of showing trailing zero, i'd think a space in that place would be better.

  // Extract the integer part of the number and print it
  uint32_t int_part = number;
  uint8_t tmp = prn_cnt;
  prn_cnt += print(int_part, DEC);
  uint8_t digits_available = 7 - ( prn_cnt - tmp );
  if ( digits > digits_available ) digits = digits_available;

down from my 664 to 641, and 22.89 per char. ( this value isnt right as we no longer print 28 chars in the test file )
argh, as we are printing less, of course its going to be quicker.... but for streaming data to SD or out a serial port etc. then its a win surely ?

of course i'm using prn_cnt instead of 'n'

the 10737.41824 requested to print with 4 decimal places now print 10737.41, as the rounding happened before we knew the integer part of the number, we couldn't ( does it seem worth the time to do the calc twice ? to see the count of the significant digits we will use for the integer part ) before we do some rounding on the actual number. hmm. not sure what to suggest as ideally we should see 10737.42 i think. or maybe the truncated value is better ?

ps. I think we need a fuller set of test figures, which will show real improvements for a better range of cases.

robtillaart · July 26, 2013, 1:48pm

start with the first one.

however the test line
Serial.println(10737.41824, 4);
is actually outputting
10737.4179
that's a fail, we've gone over 7 total digits....

Yep a known behaviour which I wouldn't call fail as we reach the borders of the precision of the (UNO) double. I do not know if the DUE has a real 64bit double.

Changing the meaning of the parameter ("N decimal places" to "significant decimal places only with a max of N") would be a breaking interface change. I understand the advantages of your proposal but imho it is the responsibility of the programmer to define what to print.

I guess we need to combine the digits >0 to the remaining ( requested ) decimal places and cut them down, instead of showing trailing zero, i'd think a space in that place would be better.

Trailing zero's are/can be significant. Scientifically there is a big difference between 2 Celsius and 2.0 Celsius. The standard error is defined as max 0.5 the size of the last significant digit.

const uint32_t remMult[FLOAT_MAX_DIGITS] = { 1, 10, 100, 1000, 10000, 100000, 1000000 }; //, 10000000, 100000000 };
....
  uint32_t rem = remainder * remMult[digits];

I would add one value to remove the -1, one value takes 4 bytes a subtraction maybe more + a tiny bit faster?.

#define FLOAT_MAX_DIGITS is a good one for readability.

robtillaart · July 26, 2013, 1:53pm

So, Rob, do a quick try with a lookup rather than looping for the remainder multiplier calc. its a definite speed up, and at 6 values, a code saving as well. 8 digits its fractionally bigger.... but as i said earlier, we have already moved into wrong digits coming our after seven digits being printed ( unless the first digit being printed is a zero.

Have you read my "intermezzo" about the original printing of 1E-15 with 25 digits? There are a lot of leading zero's, and yes it is a bit awkward maybe but still it was working code that is broken. Opinion?

I will give the lookup a try and let you know results asap.

robtillaart · July 26, 2013, 2:26pm

with lookup table for the multiply factor I get:

Time=648
per char incl .\r\n : 23.14
done

added a lookup table for the rounding too, because it involves floating point multiplications it gives bigger gains

     const float rounding[] = { 5E-1, 5E-2, 5E-3, 5E-4, 5E-5, 5E-6, 5E-7, 5E-8, 5E-9, 5E-10 };
    number += rounding[digits];

Time=602
per char incl .\r\n : 21.50
done

WRT the size of the lib, support for SCI and ENG should be conditional, just as the test for NAN/INF so I included:

#ifdef PRINT_NAN_INF
---
#endif

#ifdef PRINT_SCIENTIFIC  
...
#endif

A test removing all SCI/ ENG code and the tests for NAN / INF gained another 22 uSec.

Time=580
per char incl .\r\n : 20.71
done

(the code size reduced quite a bit too, no figures as it is not clean code )

robtillaart · July 26, 2013, 2:38pm

 // Extract the integer part of the number and print it
  uint32_t int_part = number;
  uint8_t tmp = prn_cnt;
  prn_cnt += print(int_part, DEC);
  uint8_t digits_available = 7 - ( prn_cnt - tmp );              //  (prn_cnt - tmp) can be 8  causing an overflow which is caught in the next if
  if ( digits > digits_available ) digits = digits_available;

looks "80%" good, => it does not catch the case where the int part has more digits than significant.

In the end I think the solution to get significant digits right is the SCI notation. Used with max 6 decimals if the number is bigger than 999,999 OR smaller than 0.01

darrylp · July 26, 2013, 3:32pm

robtillaart:

 // Extract the integer part of the number and print it

uint32_t int_part = number;
uint8_t tmp = prn_cnt;
prn_cnt += print(int_part, DEC);
uint8_t digits_available = 7 - ( prn_cnt - tmp ); // (prn_cnt - tmp) can be 8 causing an overflow which is caught in the next if
if ( digits > digits_available ) digits = digits_available;



looks "80%" good, => it does not catch the case where the int part has more digits than significant. 

In the end I think the solution to get significant digits right is the SCI notation. Used with max 6 decimals if the number is bigger than 999,999 OR smaller than 0.01

this is where I was going with the 7 significant digits total.... 32 bit doubles, are what 6 bits for the exponent ( inc sign of the exponent ) thus we are down to 25 bits for all other actual real digits ( as again sign needed ) the 5 bit ( 2^5 == 32 ) works upto approx 37 / 38 as we can we dont have to hold the number in the remaining bits as SCI format. thats lower third of 8 actual digits being stored.

i've built in a check when asking for more than max_float_digits, to limit to the max ( 6 ) and then if the number is less than 0.000001, then switch into SCI format for output, thus you can print 0.0000000000001234567 it comes out as 1.234567e-13. sure maybe wrap in conitionals to get a default behaviour, but the ongoing changes to speed up printing, are going to take lot of convincing to get merged into mainstream code.
just as well offer an accurate output, and auto switch into SCI ( ENG ) notation as the numbers get small. as the actual number stored in the float/double var is to that form, and will have lost information the moment its been stored in code ( memory / register )

Topic		Replies	Views
divmod10() : a fast replacement for /10 and %10 (unsigned) Libraries	114	81770	May 6, 2021
Performance enhancements to Print.cpp Libraries	1	724	May 6, 2021
Variable precision printing of floats Bugs & Suggestions	6	1778	May 6, 2021
Proposed update for the printFloat code of print.cpp Libraries	22	13515	May 6, 2021
divu5() and divu3(). Fast replacements for unsigned division by 3, 5, 6, 10, 15 Libraries	17	16136	May 6, 2021

faster printing of floats by divmod10() and others

Related Topics