Go Down

Topic: faster printing of floats by divmod10() and others (Read 48263 times) previous topic - next topic

robtillaart

start with the first one.
Quote

however the test line
  Serial.println(10737.41824, 4);
is actually outputting
  10737.4179
that's a fail, we've gone over 7 total digits....


Yep a known behaviour which I wouldn't call fail as we reach the borders of the precision of the (UNO) double. I do not know if the DUE has a real 64bit double.

Changing the meaning of the parameter ("N decimal places" to  "significant decimal places only with a max of N") would be a breaking interface change. I understand the advantages of your proposal but imho it is the responsibility of the programmer to define what to print.

Quote
I guess we need to combine the digits >0 to the remaining ( requested ) decimal places and cut them down, instead of showing trailing zero, i'd think a space in that place would be better.

Trailing zero's are/can be significant. Scientifically there is a big difference between 2 Celsius and 2.0 Celsius. The standard error is defined as max 0.5 the size of the last significant digit.

Code: [Select]
const uint32_t remMult[FLOAT_MAX_DIGITS] = { 1, 10, 100, 1000, 10000, 100000, 1000000 }; //, 10000000, 100000000 };
....
  uint32_t rem = remainder * remMult[digits];

I would add one value to remove the -1, one value takes 4 bytes a subtraction maybe more + a tiny bit faster?.

#define FLOAT_MAX_DIGITS  is a good one for readability.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#16
Jul 26, 2013, 03:53 pm Last Edit: Jul 26, 2013, 04:29 pm by robtillaart Reason: 1
Quote
So, Rob, do a quick try with a lookup rather than looping for the remainder multiplier calc. its a definite speed up, and at 6 values, a code saving as well. 8 digits its fractionally bigger.... but as i said earlier, we have already moved into wrong digits coming our after seven digits being printed ( unless the first digit being printed is a zero.

Have you read my "intermezzo" about the original printing of 1E-15 with 25 digits?  There are a lot of leading zero's, and yes it is a bit awkward maybe but still it was working code that is broken. Opinion?

I will give the lookup a try and let you know results asap.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#17
Jul 26, 2013, 04:26 pm Last Edit: Jul 26, 2013, 04:29 pm by robtillaart Reason: 1
with lookup table for the multiply factor I get:

Time=648
per char incl .\r\n : 23.14
done

added a lookup table for the rounding too, because it involves floating point multiplications it gives bigger gains
Code: (experimental) [Select]
     const float rounding[] = { 5E-1, 5E-2, 5E-3, 5E-4, 5E-5, 5E-6, 5E-7, 5E-8, 5E-9, 5E-10 };
   number += rounding[digits];


Time=602
per char incl .\r\n : 21.50
done

WRT the size of the lib, support for SCI and ENG should be conditional, just as the test for NAN/INF so I included:
Code: (experimental) [Select]

#ifdef PRINT_NAN_INF
---
#endif

#ifdef PRINT_SCIENTIFIC  
...
#endif


A test removing all SCI/ ENG code and the tests for NAN / INF gained another 22 uSec.

Time=580
per char incl .\r\n : 20.71
done

(the code size reduced quite a bit too, no figures as it is not clean code )
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

Code: [Select]

// Extract the integer part of the number and print it
  uint32_t int_part = number;
  uint8_t tmp = prn_cnt;
  prn_cnt += print(int_part, DEC);
  uint8_t digits_available = 7 - ( prn_cnt - tmp );              //  (prn_cnt - tmp) can be 8  causing an overflow which is caught in the next if
  if ( digits > digits_available ) digits = digits_available;

looks "80%" good, => it does not catch the case where the int part has more digits than significant.

In the end I think the solution to get significant digits right is the SCI notation. Used with max 6 decimals if the number is bigger than 999,999 OR smaller than 0.01
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

darrylp


Code: [Select]

// Extract the integer part of the number and print it
  uint32_t int_part = number;
  uint8_t tmp = prn_cnt;
  prn_cnt += print(int_part, DEC);
  uint8_t digits_available = 7 - ( prn_cnt - tmp );              //  (prn_cnt - tmp) can be 8  causing an overflow which is caught in the next if
  if ( digits > digits_available ) digits = digits_available;

looks "80%" good, => it does not catch the case where the int part has more digits than significant.

In the end I think the solution to get significant digits right is the SCI notation. Used with max 6 decimals if the number is bigger than 999,999 OR smaller than 0.01



this is where I was going with the 7 significant digits total....  32 bit doubles, are what 6 bits for the exponent ( inc sign of the exponent ) thus we are down to 25 bits for all other actual real digits ( as again sign needed ) the 5 bit ( 2^5 == 32 ) works upto approx 37 / 38 as we can we dont have to hold the number in the remaining bits as SCI format.  thats lower third of 8 actual digits being stored.

i've built in a check when asking for more than max_float_digits, to limit to the max ( 6 ) and then if the number is less than 0.000001, then switch into SCI format for output, thus you can print 0.0000000000001234567 it comes out as 1.234567e-13.  sure maybe wrap in conitionals to get a default behaviour,  but the ongoing changes to speed up printing, are going to take lot of convincing to get merged into mainstream code.
just as well offer an accurate output, and auto switch into SCI ( ENG ) notation as the numbers get small. as the actual number stored in the float/double var is to that form, and will have lost information the moment its been stored in code ( memory / register )
--
 Darryl

darrylp



Code: [Select]
const uint32_t remMult[FLOAT_MAX_DIGITS] = { 1, 10, 100, 1000, 10000, 100000, 1000000 }; //, 10000000, 100000000 };
....
  uint32_t rem = remainder * remMult[digits];

I would add one value to remove the -1, one value takes 4 bytes a subtraction maybe more + a tiny bit faster?.


this one is a freebie from the compiler.....  you get the same execution time, but lose on storing an extra remMult value ( ie wasting 4 bytes )

the compiler will have digits in a register, and use a lookup pointer ( in reg r30 & r31, and then add on an offset to locate where they will be stored in memory. so in this case as digits wont be changing, it will adjust the offset to get the right value.

Code: [Select]

  if (digits > 0) {
842: 11 23        and r17, r17
844: 09 f4        brne .+2      ; 0x848 <_ZN5Print10printFloatEdh11printFormat+0x204>
846: 8c c0        rjmp .+280    ; 0x960 <__stack+0x61>
848: d3 01        movw r26, r6
84a: ed 91        ld r30, X+
84c: fc 91        ld r31, X
84e: 01 90        ld r0, Z+
850: f0 81        ld r31, Z
852: e0 2d        mov r30, r0
854: c3 01        movw r24, r6
856: 6e e2        ldi r22, 0x2E ; 46
858: 09 95        icall
85a: 9b 87        std Y+11, r25 ; 0x0b
85c: 8a 87        std Y+10, r24 ; 0x0a
      prn_cnt += print(buf);
      remainder -= toPrint;
    }
#else
    // make an unsigned long of the decimal part
    uint32_t rem = remainder * remMult[digits - 1];
85e: 21 2f        mov r18, r17
860: 30 e0        ldi r19, 0x00 ; 0
862: 39 87        std Y+9, r19 ; 0x09
864: 28 87        std Y+8, r18 ; 0x08
866: f9 01        movw r30, r18
868: ee 0f        add r30, r30
86a: ff 1f        adc r31, r31
86c: ee 0f        add r30, r30
86e: ff 1f        adc r31, r31
870: e1 5a        subi r30, 0xA1 ; 161
872: fe 4f        sbci r31, 0xFE ; 254
874: 60 81        ld r22, Z
876: 71 81        ldd r23, Z+1 ; 0x01
878: 82 81        ldd r24, Z+2 ; 0x02
87a: 93 81        ldd r25, Z+3 ; 0x03


is the code produced when using digits-1 pay attention to lines at address 870 & 872.

now with the extra value in the remMult array, and no fixed offset on the digits it gives us this code
Code: [Select]

if (digits > 0) {
842: 11 23        and r17, r17
844: 09 f4        brne .+2      ; 0x848 <_ZN5Print10printFloatEdh11printFormat+0x204>
846: 8c c0        rjmp .+280    ; 0x960 <__stack+0x61>
848: d3 01        movw r26, r6
84a: ed 91        ld r30, X+
84c: fc 91        ld r31, X
84e: 01 90        ld r0, Z+
850: f0 81        ld r31, Z
852: e0 2d        mov r30, r0
854: c3 01        movw r24, r6
856: 6e e2        ldi r22, 0x2E ; 46
858: 09 95        icall
85a: 9b 87        std Y+11, r25 ; 0x0b
85c: 8a 87        std Y+10, r24 ; 0x0a
      prn_cnt += print(buf);
      remainder -= toPrint;
    }
#else
    // make an unsigned long of the decimal part
    uint32_t rem = remainder * remMult[digits];
85e: 21 2f        mov r18, r17
860: 30 e0        ldi r19, 0x00 ; 0
862: 39 87        std Y+9, r19 ; 0x09
864: 28 87        std Y+8, r18 ; 0x08
866: f9 01        movw r30, r18
868: ee 0f        add r30, r30
86a: ff 1f        adc r31, r31
86c: ee 0f        add r30, r30
86e: ff 1f        adc r31, r31
870: ed 59        subi r30, 0x9D ; 157
872: fe 4f        sbci r31, 0xFE ; 254
874: 60 81        ld r22, Z
876: 71 81        ldd r23, Z+1 ; 0x01
878: 82 81        ldd r24, Z+2 ; 0x02
87a: 93 81        ldd r25, Z+3 ; 0x03


so I'd say save the space, its also less vars to copy from .data at startup ;-)
--
 Darryl

robtillaart

#21
Jul 27, 2013, 08:10 pm Last Edit: Jul 27, 2013, 09:49 pm by robtillaart Reason: 1
Quote
but the ongoing changes to speed up printing, are going to take lot of convincing to get merged into mainstream code.

That is not my primary goal but it would be great if some parts become mainstream code. More important is to have working optimized code available for those who need it.

At the moment the improvements are becoming rather small. A test with printing similar output with integers gives the upper limit of the code.

with integers
Code: (printing integers) [Select]
10737.4182
1.9181
107.37

Time=456
per char incl .\r\n : 16.29
done


with floats
Code: (printing floats) [Select]

10737.4179
1.0182
107.37

Time=576
per char incl .\r\n : 20.57
done

a difference of 4.3 uSecond per char on average (or relative speaking 20%)
Note: the rounding code (lookuptable) takes ~35uSec for 3 numbers, that counts for ~1.25 uSec per char on average.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#22
Jul 27, 2013, 08:29 pm Last Edit: Jul 27, 2013, 09:52 pm by robtillaart Reason: 1
Quote
i've built in a check when asking for more than max_float_digits, to limit to the max ( 6 ) and then if the number is less than 0.000001, then switch into SCI format for output, thus you can print 0.0000000000001234567 it comes out as 1.234567e-13.  sure maybe wrap in conitionals to get a default behaviour,  but the ongoing changes to speed up printing, ...

just as well offer an accurate output, and auto switch into SCI ( ENG ) notation as the numbers get small. as the actual number stored in the float/double var is to that form, and will have lost information the moment its been stored in code ( memory / register )


Thinking in terms of requirements (esp backwards compatibility even if wrong) I come to the following:
- if the notation is default, the old behaviour should work, the #digits is the #decimals even if 42, it may be implemented faster of course
- if the notation is SCI or ENG the amount of decimals should be maximized to the significant portion (max 6 decimals) => I fully agree with you on that point.
 ( the ENG notation makes it more complex, should we get rid of it? )
- the programmer must explicitly choose for the SCI (ENG) notation, (you get what you ask for),

In short:  the new interpretation of #digits is coupled to the SCI (ENG) notation which is explicitly asked for by the programmer.

What do you think of this :
Code: [Select]

size_t Print::printFloat(double number, uint8_t digits, uint8_t notation)
{
 ... // some common tests
 if (notation != DEF)
 {
   return printFloatSCI(...);
 }
 ... // default code
}


updated (max 6 decimals)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

Quote
so I'd say save the space, its also less vars to copy from .data at startup ;-)

agree!
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#24
Jul 27, 2013, 08:57 pm Last Edit: Jul 27, 2013, 09:54 pm by robtillaart Reason: 1
Found another minor speedup

replaced the long lookup table with equivalent floats so no conversion is needed. The footprint is equal, both 4 bytes per element.
(yeah with hindsight this one was obvious ;)
Code: (experimental) [Select]

       const float remMult[] = { 1E1, 1E2, 1E3, 1E4, 1E5, 1E6, 1E7, 1E8, 1E9 };
       uint32_t rem = remainder * remMult[digits-1];


Code: (timing) [Select]

Time=563
per char incl .\r\n : 20.11
done

again 13 uSecs of, bringing the diff compared to integers to 3.82 uSec / char  

(note this is without SCI support and without NAN testing; with both enabled time = 600 or 21.43 per char)

update: SCI should max support 6 decimals so the remMult[] table can be smaller
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

small test - The forced to SCI mode costs 21uSec for 3 numbers ( mainly the 2 float comparisons)
Code: [Select]

    // if (notation == DEF)
    // {
        // if (number >= 1E9) notation = SCI;
        // else if (number <= 1E-4) notation = SCI;
    // }
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#26
Jul 28, 2013, 12:51 am Last Edit: Jul 28, 2013, 01:49 am by robtillaart Reason: 1
2 optimizations
  • "Inlining" the printNumber(int_part, DEC); good for 14 uSec
  • restrict size of the lookuptables;

    Code: [Select]

    10737.4179
    1.0182
    107.37

    Time=543
    per char incl .\r\n : 19.39
    done

    next action clean up code ...
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

#27
Jul 28, 2013, 02:07 am Last Edit: Jul 28, 2013, 02:09 am by robtillaart Reason: 1
attached the code sofar, cleaned up a bit and merged different strategies.
  • If the user selects SCI/ENG mode the # decimal digits is max 6
  • If the user selects #digits > 6 it will execute slower code paths  (some optimizations still possible but not all) -- only possible in the DEF mode
  • An open issue of automatic switching to SCI mode if number is too large or too small. - I propose an new mode for that as its behaviour is new.
  • Some code "sleeps in the comments"
  • Conditional code for NAN/INF testing and for SCI/ENG support by means of #define - print.h
  • An open issue in printNumber() to prepend O for octal numbers, and b for Binary and 0x for Hex numbers?

    I finally pushed another 1 uSec out of it and ended at

    Time=542
    per char incl .\r\n : 19.36
    done

    Next step is recover it to maintainable code and do more testing (platform, number ranges etc)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart


Did a test with the adjusted printFloat() on a Teensy 2.0 - had to patch the Teensy print.h and print.cpp as these are a bit different.

Test 1 : Teensy reference without optimization
Code: [Select]
10737.4179
1.0182
107.37

Time=1260
per char incl .\r\n : 45.00
done


Test 2 : Teensy optimized
Code: [Select]

10737.4179
1.0182
107.37

Time=369
per char incl .\r\n : 13.18
done


That's an increase of 31.82 uSec or about 70%

The fact that Teensy is faster that the UNO (369 vs 542 --30%) is due to more efficient implementation of write()


Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

(Back to the UNO again)

I changed one implementation of the print class to
Code: (new) [Select]
size_t Print::write(const char *str)
{
    char *p = (char *)str;
    size_t n = 0;
    while (*p != 0) {
        n += write(*p++);
    }
    return n;
}


instead of (print.h)
Code: (original) [Select]

     size_t write(const char *str) {
       if (str == NULL) return 0;
       return write((const uint8_t *)str, strlen(str));
     }

Adjusted two calls from n = write(z, digits); to   n += write((const char*)z);  (one for the int_part and the remainder part)

and the new figures are
Code: [Select]
10737.4179
1.0182
107.37

Time=536
per char incl .\r\n : 19.14
done


It surprised me to find another 6 uSeconds;  that is about one uSec per integer.
The old implementation has 3x as much admin - (once counting the strlen, and a var and pointer in write()) - while the new just increments one pointer.

This means that printNumber() can be improved by this one too. => TODO list.

(looks like it never stops ;)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Go Up