
Topic: faster printing of floats by divmod10() and others

darryl

#30
Jul 29, 2013, 05:04 pm Last Edit: Jul 29, 2013, 05:49 pm by darryl Reason: 1
I've now gone for code size optimising; a couple of ideas:

Code: [Select]

 if ( notation == DEF ) {
   if ( (number > 8388608.0) || (number < 0.000001) ) {
     // as we head into SCI / ENG world add digits for output
     notation = SCI;
     digits += 6;
   }
 }

I've done this because 8388608.0 is the value above which we have to move to E notation if we only want to print accurate numbers, e.g. 8388609 actually prints as 8388610 :-(
The E form still shows the wrong digits, but it lets the user 'have a guess' that the number might not be exactly as shown.

I've enum'd the DEF, SCI and ENG values, and by passing these we get a free compiler sanity check on the values. Not perfect, but useful IMO. (It also saves having to take a local copy in Enotation.)
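A minimal sketch of the enum idea, for anyone following along. The names Notation and chooseNotation are made up for illustration and are not taken from the attached files; the thresholds are the ones from the snippet above, and the number is assumed to be non-negative.

```cpp
#include <cstdint>

// Hypothetical enum; the post only says DEF, SCI and ENG were "enum'd".
enum Notation { DEF, SCI, ENG };

// Assumed helper: leave SCI/ENG alone, but move DEF to SCI outside the
// range used in the snippet above, adding six digits as that snippet does.
Notation chooseNotation(float number, Notation notation, uint8_t &digits) {
  if (notation == DEF) {
    if ((number > 8388608.0f) || (number < 0.000001f)) {
      notation = SCI;
      digits += 6;
    }
  }
  return notation;
}
```

With the enum as the parameter type, a call such as chooseNotation(x, 2, d) no longer compiles in C++, which is the "free" check mentioned above.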

I've also changed printNumber to take a third argument (the number of digits we must print). A simple
#define  NO_LEADING_ZERO 0  allows sensible-looking code for existing uses of the function. (For your example program the compiler only has to reload r16 twice with the extra argument being passed, plus one extra time where we use it to get leading zeros.)

Code: [Select]

 // Extract the integer part of the number and print it watching how many digits we have
 uint32_t int_part = number;
 uint8_t tmp = prn_cnt;
 prn_cnt += printNumber(int_part, DEC, NO_LEADING_ZERO);

 // see if we are going to be printing too many digits, we can save time doing the decimal half.
 uint8_t digits_available = 7 - ( prn_cnt - tmp );
 if ( digits > digits_available ) digits = digits_available;

 if (digits > 0) {
   prn_cnt += write('.');

   double remainder = number - int_part;

   // make an unsigned long of the decimal part, of a certain length, i.e. with leading zeros!
   uint32_t rem = remainder * remMult[digits - 1];
   prn_cnt += printNumber(rem, DEC, digits);
 }

The above is the partial code that now handles printing of the float, and the faster divmod10_asm is handled in only one place, in the printNumber routine.
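To make the digit budget easier to follow, here is a rough host-side sketch of the same split into integer part and zero-padded fraction. It is an assumption-laden illustration, not the library code: formatFloat, appendNumber and kRemMult are hypothetical names (kRemMult stands in for remMult[], whose contents are not shown), it writes into a std::string instead of calling write(), it truncates rather than rounds, it ignores negative numbers, and the guard for more than seven integer digits is an addition of the sketch.

```cpp
#include <cstdint>
#include <string>

// Multipliers for 1..7 fractional digits (assumed shape of remMult[]).
static const uint32_t kRemMult[7] = {10, 100, 1000, 10000,
                                     100000, 1000000, 10000000};

// Append num in decimal, left-padded with '0' to at least minDigits digits.
// Returns the number of characters appended.
static uint8_t appendNumber(std::string &out, uint32_t num, uint8_t minDigits) {
  char buf[11];
  uint8_t n = 0;
  do {
    buf[n++] = (char)('0' + num % 10);
    num /= 10;
  } while (num);
  while (n < minDigits && n < sizeof(buf)) buf[n++] = '0';
  for (uint8_t i = n; i > 0; i--) out += buf[i - 1];
  return n;
}

// Print at most seven digits in total: integer part first, then as many
// fractional digits as the remaining budget allows.
std::string formatFloat(float number, uint8_t digits) {
  std::string out;
  uint32_t intPart = (uint32_t)number;          // truncates toward zero
  uint8_t used = appendNumber(out, intPart, 0);

  uint8_t available = (used < 7) ? (uint8_t)(7 - used) : 0;
  if (digits > available) digits = available;

  if (digits > 0) {
    out += '.';
    float remainder = number - (float)intPart;
    uint32_t rem = (uint32_t)(remainder * kRemMult[digits - 1]);
    appendNumber(out, rem, digits);             // keeps the leading zeros
  }
  return out;
}
```

For example, formatFloat(3.0625f, 4) gives "3.0625" because the fraction 625 is padded to four digits, and formatFloat(123456.5f, 3) gives "123456.5" because only one digit of the budget is left.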

Code: [Select]

size_t Print::printNumber(uint32_t num, uint8_t base, uint8_t leading_zeros) {
 char buf[33];
 char *str = &buf[sizeof(buf) - 1];
 *str = '\0';

 uint8_t mod, tmp;
 int8_t extra_digits = leading_zeros;

 do {
#ifdef USE_STIMMER_OPTIMIZATION
   if ( base == DEC )
   {
     divmod10_asm32(num, mod, tmp);
     *--str = mod + '0';
     extra_digits -= 1;
   }
   else
#endif
   {
     *--str = '0' + num % base;
     if ( *str > '9' ) *str += 7;
     num /= base;
     extra_digits -= 1;
   }
 } while (num);

 for ( ; extra_digits > 0; extra_digits-- ) *--str = '0';
 return write(str);
}


I'm happy with not having optimised versions of print for HEX, OCT and BINary, and this nicely shrinks the code that gets added.
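For readers without the asm macro to hand, the decimal path can be tried on a PC with a portable stand-in. This is only a sketch: divmod10 below is the usual shift-and-add divide-by-ten with a correction step, not the divmod10_asm32 macro itself, and toDecimal is a hypothetical name that returns a std::string instead of calling write().

```cpp
#include <cstdint>
#include <string>

// Portable shift-and-add divide by 10: approximate in * 0.8, divide by 8,
// then correct the estimate using the remainder.
static void divmod10(uint32_t in, uint32_t &div, uint8_t &mod) {
  uint32_t q = (in >> 1) + (in >> 2);   // in * 0.75
  q += q >> 4;
  q += q >> 8;
  q += q >> 16;                         // now close to in * 0.8
  q >>= 3;                              // close to in / 10, never too high
  uint32_t r = in - ((q << 3) + (q << 1));  // r = in - q * 10
  if (r > 9) {                          // estimate was one too low
    q += 1;
    r -= 10;
  }
  div = q;
  mod = (uint8_t)r;
}

// Decimal-only version of the loop above, with optional leading zeros.
std::string toDecimal(uint32_t num, uint8_t leadingZeros) {
  char buf[11];                         // 10 digits + terminator
  char *str = &buf[sizeof(buf) - 1];
  *str = '\0';
  int8_t extra = (int8_t)leadingZeros;

  do {
    uint8_t mod;
    divmod10(num, num, mod);            // 'in' is passed by value, so this is safe
    *--str = (char)('0' + mod);
    extra -= 1;
  } while (num);

  for (; extra > 0 && str > buf; extra--) *--str = '0';
  return std::string(str);
}
```

The leading_zeros argument is really a minimum field width: toDecimal(625, 4) yields "0625", while toDecimal(625, 0) yields "625".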

So it's slightly slower than the version you've posted, but with a code shrink. Numbers printed before the decimal point are always right, and rounding is cut off after a total of seven digits has been printed (ignoring sign, exponent, etc.).

Note: previously I've changed a few vars from int down to uint8_t, to allow the compiler to use a single register; base is an example.

Also, you need to use smaller vars where possible: using int instead of (u)int8_t means lots of extra code for checking and updating a register pair (e.g. exponent++).
That's another example: on gcc 4.3.2, exponent++ and exponent-- on (u)int8_t vars will quite often be extended to 16 bits, and thus waste time / CPU cycles.  exponent -= 1; is faster and smaller because the compiler produces code for a single register rather than a pair.



Oops, a few edits for spelling and clearer text. Here are my times on an UNO:

10737.41
1.0182
107.37

Time=448
per char incl .\r\n : 17.23   <----- corrected in the code for only printing 26 not 28 chars.
done



--
 Darryl

robtillaart

#31
Jul 29, 2013, 07:40 pm Last Edit: Jul 29, 2013, 07:46 pm by robtillaart Reason: 1
@Darryl,
good points you make. (+1)

Code: [Select]

if ( notation == DEF ) {
   if ( (number > 8388608.0) || (number < 0.000001) ) {
     // as we head into SCI / ENG world add digits for output
     notation = SCI;
     digits += 6;
   }
 }

1) For this behaviour I would like to add a new format, e.g. DEFSCI, so that the DEF behavior is 100% backwards compatible.
2) The lower limit should be higher (0.01), as printing with 2 decimals is, I think, the most common use.


Can you post your print.h/.cpp so I can give it a try here?  

(My time=536, your time=448; I want to understand why yours is so much faster.)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

robtillaart

My latest optimizations are in the low-level write() functions, especially write(const char *str). This may interfere with overridden implementations of them. [block device vs char device]

Some other optimizations:

14 bytes smaller (not measurable in speed)
Code: [Select]
size_t Print::println(void)
{
    write('\r');
    write('\n');
    return 2;
    // size_t n = print('\r');
    // n += print('\n');
    // return n;
}





8 bytes smaller (3 uS faster)
Code: [Select]

size_t Print::write(const uint8_t *buffer, size_t size)
{
    size_t n = 0;
    for (; n < size; n++) {
        write(*buffer++);
    }
    return n;
    // size_t n = 0;
    // while (size--) {
        // n += write(*buffer++);
    // }
    // return n;
}


Other print functions can be optimized in similar ways: doing the repeated addition n += write(...); in a loop costs extra when the size is known in advance.
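The two loop styles can be compared side by side on a PC with a small mock. This is only a sketch: StringPrint is a hypothetical stand-in that collects bytes in a std::string and is not the Arduino Print class.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a Print-like class; write() always succeeds here.
struct StringPrint {
  std::string out;

  size_t write(uint8_t c) {
    out += (char)c;
    return 1;
  }

  // Stock style: add up every return value of write().
  size_t writeCounted(const uint8_t *buffer, size_t size) {
    size_t n = 0;
    while (size--) {
      n += write(*buffer++);
    }
    return n;
  }

  // Style from the post above: the loop index doubles as the count.
  size_t writeIndexed(const uint8_t *buffer, size_t size) {
    size_t n = 0;
    for (; n < size; n++) {
      write(*buffer++);
    }
    return n;
  }
};
```

Both return the same count as long as write() returns 1 for every byte. They differ when a derived class can return 0 for a byte it failed to write, since the indexed version reports size regardless; that is the trade-off behind the remark about overridden implementations.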
Rob Tillaart


darryl

#33
Jul 29, 2013, 09:05 pm Last Edit: Jul 29, 2013, 09:13 pm by darryl Reason: 1


Quote from: robtillaart
> 1) For this behaviour I would like to add a new format, e.g. DEFSCI, so that the DEF behavior is 100% backwards compatible.
> 2) The lower limit should be higher (0.01), as printing with 2 decimals is, I think, the most common use.
>
> Can you post your print.h/.cpp so I can give it a try here?
>
> (My time=536, your time=448; I want to understand why yours is so much faster.)



Sure, make changes as you see fit... I've attached my files. You will notice that I've gone over the core library and changed certain call parameters to reduce the size being used, int down to 8 bits for example.
Another speed-up is using bool instead of the stock boolean, which by default is a typedef of unsigned char (uint8_t). Duh!

I use a macro-based hardware serial, and for writing I don't buffer. I do, however, run my serial port at 1,000,000 baud, so that is a probable speed-up, although with the default TX & RX buffering I think I remember the buffers being 32 bytes in size. (I use a Mega 2560 frequently and wanted the RAM, so I explicitly define each serial port I want live, and I only buffer RX.) So perhaps my faster baud rate (with busy polling to write the next byte) doesn't actually gain much over the buffered TX version. I know my hardware serial replacement saves over 500 bytes.


--
 Darryl

robtillaart

#34
Jul 29, 2013, 09:21 pm Last Edit: Jul 29, 2013, 09:26 pm by robtillaart Reason: 1
Thanks for sharing, will look at it later this week, was a (32bit) long day ;)
In the 0.22 version the params were all uint8_t where possible.

Your additional mods explain quite a bit!

You should also have a look at the Teensy hardware serial code, it has some optimizations too. (www.pjrc.com)

Update: if you want to reduce size more, you could stop counting the chars printed and just return 0 or 1 (so you can still use it as a boolean). Breaking, but smaller ;)




Rob Tillaart


darryl


Quote from: robtillaart
> Thanks for sharing, will look at it later this week, was a (32bit) long day ;)
> In the 0.22 version the params were all uint8_t where possible.
>
> Your additional mods explain quite a bit!
>
> You should also have a look at the Teensy hardware serial code, it has some optimizations too. (www.pjrc.com)
>
> Update: if you want to reduce size more, you could stop counting the chars printed and just return 0 or 1 (so you can still use it as a boolean). Breaking, but smaller ;)



Yes, over the years I've peered at most of the code that's commonly out there. This serial I'm happy with; it's a good compromise for me on speed and size, and it doesn't eat up RAM when I only want to use one or two serial ports.

I had thought of returning a boolean from the print calls, but as the result is passed back in a register, it's not much use. The bool vars inside sections of code often end up in the zero or T flag, sometimes the carry flag. I get a bit over the top at times, and should really code in assembler! ;-)

Take this bit of code in wiring.c (the micros() call):
Code: [Select]

// help the compiler generate some sensible code for the ((m<<8) + t)
__asm__ volatile (
"mov %D0, %C0" "\n\t"
"mov %C0, %B0" "\n\t"
"mov %B0, %A0" "\n\t"
"mov %A0, %[lo_byte]" "\n\t"
: "=r" (m)
: "0" (m), [lo_byte] "r" (t) );
return m * ( 64 / clockCyclesPerMicrosecond() );


I couldn't get the compiler to come up with sensible code for shifting the value in m left by 8 bits, so I had to give it a helping hand.
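In plain C++ terms, the four mov instructions do the following. This is a sketch for checking the logic on a PC; the function names are made up, and %A0 to %D0 are taken to be the low to high bytes of m.

```cpp
#include <cstdint>

// What the asm computes: shift m left by one byte and drop t into the
// freed low byte. Because that low byte is zero after the shift, this is
// the same value as (m << 8) + t.
uint32_t shiftInByte(uint32_t m, uint8_t t) {
  return (m << 8) | t;
}

// The same thing spelled out byte by byte, mirroring the register moves.
uint32_t shiftInByteBytewise(uint32_t m, uint8_t t) {
  uint8_t a = (uint8_t)m;           // %A0, low byte
  uint8_t b = (uint8_t)(m >> 8);    // %B0
  uint8_t c = (uint8_t)(m >> 16);   // %C0
  uint8_t d;                        // %D0, old high byte is discarded

  d = c;                            // mov %D0, %C0
  c = b;                            // mov %C0, %B0
  b = a;                            // mov %B0, %A0
  a = t;                            // mov %A0, %[lo_byte]

  return ((uint32_t)d << 24) | ((uint32_t)c << 16) | ((uint32_t)b << 8) | a;
}
```

The order of the moves matters: each register is copied upward before it is overwritten, which is why the asm starts at %D0 and works down.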




--
 Darryl

robtillaart

for size sake ...

Code: [Select]
if (notation != DEF) {
  prn_cnt += write('e');

  if (exponent >= 0) {
    // the print below here, will do the minus sign print for us
    prn_cnt += write('+');
  }

  prn_cnt += print(exponent, DEC);
}

could be
Code: [Select]
if (notation != DEF) {
  prn_cnt += write('e');
  prn_cnt += print(exponent, DEC);
}

as the + is implicit and therefore optional ;)
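The two output formats side by side, as a small host-side sketch. exponentSuffix is a hypothetical helper that builds a std::string; the library code prints directly with write() and print().

```cpp
#include <string>

// Build the exponent suffix, with or without an explicit '+' for
// non-negative exponents. Negative exponents get their '-' from the
// integer conversion in both cases.
std::string exponentSuffix(int exponent, bool explicitPlus) {
  std::string s = "e";
  if (explicitPlus && exponent >= 0) {
    s += '+';
  }
  s += std::to_string(exponent);
  return s;
}
```

So exponentSuffix(6, true) gives "e+6" and exponentSuffix(6, false) gives "e6", while -3 comes out as "e-3" either way.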
Rob Tillaart


darryl


Quote from: robtillaart
> for size sake ...
> ....
> as the + is implicit and therefore optional ;)


Yes, back when you first pushed the suggestion for SCI & ENG support, a few messages discussed the output format of the exponent part, upper or lower case E etc. I decided I liked it best with the + going between the lower-case e and the actual value of the exponent. On small LCD displays it's nicer to read, I think.

Guess I should pass a thanks on, as I use your stats library and the running average quite a lot ;-)
--
 Darryl

pito

What is the latest, fastest, best-optimized, bug-free print.cpp and print.h?
Thnx.

robtillaart


Quote from: pito
> What is the latest, fastest, best-optimized, bug-free print.cpp and print.h?
> Thnx.


I have attached my latest print.cpp and print.h. It is substantially faster than the default for almost all datatypes (except char).
I do not claim it is the fastest/best optimized or bug free. I have been using this version since this thread started, in fact a bit longer.
In that time I have encountered a few issues and they are all fixed; most are discussed in this thread.
Over the last month I did not encounter any new issues, so I would call it a stable beta (customer trial ready).

Besides the performance, it also includes SCIentific notation for small and large floats, so any value a 32-bit float can represent is supported.
Check print.h and uncomment the appropriate section:
Code: [Select]
// uncomment if you want: int64 support, scientific notation and overflow testing
// #define PRINT_LONGLONG
// #define PRINT_SCIENTIFIC
// #define PRINT_NAN_INF


Please test the performance before and after, so you get an indication of the gain.
Please post anything unexpected in this thread; I check almost daily, so I can reproduce/fix ASAP.

Rob Tillaart


pito

#40
Aug 30, 2013, 03:21 pm Last Edit: Aug 30, 2013, 05:21 pm by pito Reason: 1
Thanks!
I did some brief testing of what I have found here, printing into a .CSV file on an SD card. A typical CSV record (9 floats and 3 integers) is shown below. This is under NilRtos using a FIFO, so what is measured is the elapsed time for file.print() from the FIFO_struct to the SD card's buffer:

Code: [Select]
,1377866878,667,0.1243,70.69,74.39,78.26,80.73,82.29,82.51,87.89,76.89,1720

Original file.print: ~7ms
Darryl file.print:   ~5ms
Rob(the latest) file.print: ~3.1ms
SdFat's file.printField: ~1.7ms

robtillaart

Note: Darryl did not implement all ideas discussed as he did not want the class to grow too much.
Rob Tillaart


fat16lib

The SdFat printField buffers the number plus the field separator or the CR/LF for end of line.  There is a high overhead for each call when writing to an SD.
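The general idea of buffering a number together with its terminator can be sketched like this. It is not the SdFat code: CountingSink and writeField are hypothetical names, and the sink simply counts calls so the saving is visible on a PC.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sink that records how many write calls it receives.
struct CountingSink {
  std::string out;
  unsigned calls = 0;

  size_t write(const char *buf, size_t len) {
    out.append(buf, len);
    calls++;
    return len;
  }
};

// Format num plus its terminator into one local buffer, back to front,
// then issue a single write. term is a field separator such as ',';
// passing '\n' emits CR/LF for end of line.
size_t writeField(CountingSink &sink, uint32_t num, char term) {
  char buf[13];                       // 10 digits + CR + LF, one spare
  char *end = &buf[sizeof(buf)];
  char *str = end;

  if (term == '\n') {
    *--str = '\n';
    *--str = '\r';
  } else {
    *--str = term;
  }
  do {
    *--str = (char)('0' + num % 10);
    num /= 10;
  } while (num);

  return sink.write(str, (size_t)(end - str));
}
```

Two fields such as 667 and 1720 then cost two write calls in total, instead of one call per number plus one per separator.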

I will soon post a version of SdFat that is even faster using ideas in this forum topic.

pito

@rob: it seems ENG does not work when SCI is not enabled (and from the source I can see that ENG is a part of SCI) :)
Maybe, for clarity, I would do:
Code: [Select]
#define PRINT_LONGLONG
#define PRINT_SCIENTIFIC_AND_ENGINEERING
#define PRINT_NAN_INF

robtillaart

Good point, would it make sense to enable them separately?

(don't know if that's easy in the code as these two (SCI/ENG) are intertwined)
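One possible way to offer separate switches without untangling the code is to let the ENG switch imply the SCI one. This is only a sketch of the preprocessor pattern: PRINT_ENGINEERING is a made-up name, only PRINT_SCIENTIFIC appears in the print.h discussed here, and the two helper functions exist just to make the result checkable.

```cpp
// Hypothetical user-facing switch; the user asks for ENG only.
#define PRINT_ENGINEERING

// ENG shares the exponent code with SCI, so asking for ENG pulls SCI in.
#if defined(PRINT_ENGINEERING) && !defined(PRINT_SCIENTIFIC)
#define PRINT_SCIENTIFIC
#endif

// Report which notations ended up compiled in.
bool hasScientific() {
#ifdef PRINT_SCIENTIFIC
  return true;
#else
  return false;
#endif
}

bool hasEngineering() {
#ifdef PRINT_ENGINEERING
  return true;
#else
  return false;
#endif
}
```

Going the other way (SCI without ENG) would need the ENG-specific parts wrapped in #ifdef PRINT_ENGINEERING, which depends on how intertwined the code really is.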
Rob Tillaart

