
Topic: why are fp arithmetics so slow on a M0/Zero, compared to AVR or M3 Due?

dsyleixa

yes, thank you!
Yesterday it was a bit too late to rewrite my code on my own, but I have just finished updating it myself with this new function:

Code: [Select]
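float test_float_math32() { // 2,500,000 32-bit float ops: 500,000 iterations x 5 (mult, sqrt, sin, exp, mult)
  volatile float s = (float)PI;   // volatile keeps the compiler from optimizing the loop away
  unsigned long y;

  for (y = 0; y < 500000UL; y++) {
    s *= sqrtf(s);                // float variants (sqrtf, sinf, expf) avoid promotion to double
    s = sinf(s);
    s = expf(s);
    s *= s;
  }
  return s; // debug
}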

and I have already found out that on both the M3 and the M0, float32 is about 2x as fast as float64!
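For anyone who wants to try the float32 vs. float64 comparison off-target, here is a minimal sketch in plain C++. It assumes a desktop compiler and uses std::chrono in place of the Arduino timer. The names math32, math64 and time_us are made up for this illustration, and the iteration count is a parameter rather than the fixed 500000 used above. It is not the benchmark code itself, just the same loop body in both precisions.

```cpp
#include <math.h>
#include <chrono>

// Same loop body as test_float_math32(), with the iteration count as a parameter.
// Uses the float variants sqrtf/sinf/expf so nothing is promoted to double.
float math32(unsigned long n) {
    volatile float s = 3.14159265f;   // volatile: keep the loop from being optimized away
    for (unsigned long y = 0; y < n; y++) {
        s = s * sqrtf(s);
        s = sinf(s);
        s = expf(s);
        s = s * s;
    }
    return s;
}

// Double-precision counterpart using sqrt/sin/exp.
double math64(unsigned long n) {
    volatile double s = 3.14159265358979;
    for (unsigned long y = 0; y < n; y++) {
        s = s * sqrt(s);
        s = sin(s);
        s = exp(s);
        s = s * s;
    }
    return s;
}

// Hypothetical helper: wall-clock time of one call, in microseconds.
template <typename F>
long long time_us(F f, unsigned long n) {
    auto t0 = std::chrono::steady_clock::now();
    volatile auto result = f(n);      // keep the call from being discarded
    (void)result;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Comparing time_us(math32, 500000UL) with time_us(math64, 500000UL) gives the ratio for whatever machine runs it. On a desktop CPU with a double-precision FPU the gap will likely look very different from the M0 and Due numbers here.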

My results for the M3 and M0 with float32 and float64:

Code: [Select]
Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI +32bit float
  0      7746  int_Add
  1     15795  int_Mult
  2     89054  float_op (float)
  3     17675  randomize
  4     18650  matrx_algb
  5      6328  arr_sort
  6      9944  GPIO_toggle
  7      6752  Graphics
runtime ges.:  171944   
benchmark:     290


Code: [Select]
Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI +double fp
  0      7746  int_Add
  1     15795  int_Mult
  2    199888  float_op (double)
  3     17727  randomize
  4     18559  matrx_algb
  5      6330  arr_sort
  6      9734  GPIO toggle
  7      6759  Graphics   
runtime ges.:  282538   
benchmark:     176 

 
Code: [Select]
Arduino DUE + adafruit_ILI9341 Hardware-SPI + 32bit float
  0      4111  int_Add
  1      1389  int_Mult
  2     29124  float_op (float)
  3      3853  randomize
  4      4669  matrx_algb
  5      2832  arr_sort
  6     11859  GPIO_toggle
  7      6142  Graphics   
runtime ges.:  63979     
benchmark:     781 

 
Code: [Select]
Arduino DUE + adafruit_ILI9341 Hardware-SPI + double fp
  0      4111  int_Add
  1      1389  int_Mult
  2     57225  float_op (double)
  3      3852  randomize
  4      4666  matrx_algb
  5      2833  arr_sort
  6     11787  GPIO toggle
  7      6143  Graphics   
runtime ges.:  92006     
benchmark:     543



in comparison: Mega2560
Code: [Select]
Arduino MEGA + ILI9225 + Karlson UTFT
  0     90244  int_Add
  1    237402  int_Mult
  2    163613  float_op (float)
  3    158567  randomize
  4     46085  matrx_algb
  5     23052  arr_sort
  6     41569  GPIO toggle
  7     62109  Graphics   
runtime ges.:    822641
benchmark:        60   
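
A side note on reading the tables: the "benchmark" line in each listing is consistent with 50,000,000 divided by the total runtime, truncated to an integer (e.g. 50,000,000 / 171944 gives 290). That formula is inferred from the printed numbers only, not taken from the benchmark source, so treat the sketch below as an assumption. The names total_runtime, benchmark_score and kM0Float are hypothetical.

```cpp
#include <cstddef>

// Sum of the per-test runtimes, i.e. the "runtime ges." (total runtime) line.
unsigned long total_runtime(const unsigned long* rows, std::size_t count) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < count; i++) {
        sum += rows[i];
    }
    return sum;
}

// ASSUMPTION: score = 50,000,000 / total runtime, using truncating integer division.
// Returns 0 for a zero total to avoid dividing by zero.
unsigned long benchmark_score(unsigned long total) {
    return total ? 50000000UL / total : 0UL;
}

// Per-test runtimes copied from the M0 32-bit float listing above.
const unsigned long kM0Float[] = {7746, 15795, 89054, 17675, 18650, 6328, 9944, 6752};
```

With these values, total_runtime(kM0Float, 8) gives 171944 and benchmark_score(171944) gives 290, matching the listing. The same formula also reproduces the other four scores (176, 781, 543 and 60).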


I was just about to publish this and was surprised to find that you had already done so, and for some extra platforms as well - great!
(and this M4 thing is really amazing!) 8)
So back to my original question, to summarize: IIUC, the poor M0 fp performance is mostly due to poor fp code optimization in the M0 core, compared to AVR and the M3 Due; and second, it turned out that using float32 with the XXXf-type fp functions (sqrtf, sinf, expf) can make it 2x as fast.
That is very valuable to know!

Thanks a lot for your efforts!

PS, edit, offtopic:
do you think the Adafruit ItsyBitsy M4 Express featuring the ATSAMD51 
https://www.adafruit.com/product/3800
has an FPU, too? They just write "ATSAMD51 32-bit Cortex M4 core running at 120 MHz, Hardware DSP and floating point support" but do not write "M4F"...?

westfw



dsyleixa

Quote
And for grins, a SAMD51:
Code: [Select]

2     24482  double_op
2      2772  float_op


(I'm not convinced that the double floating point library "properly" utilizes the single-precision hardware.  Sigh.)


tbh, it took me some time to understand -
but yes, now I see:
the fp double test on the M4F is not much faster than on the Due M3 (relative to CPU clock), although the M4 claims to have a hardware FPU -
that's really strange, but the benchmark test finally brought it to light...

Anyone got a Teensy 3.5 or 3.6 to check that?
(rhetorically asked... sure - at least the Teensy factory owner, probably... ;) )

MartinL

Hi dsyleixa,

Quote
PS, edit, offtopic:
do you think the Adafruit ItsyBitsy M4 Express featuring the ATSAMD51
https://www.adafruit.com/product/3800
has an FPU, too? They just write "ATSAMD51 32-bit Cortex M4 core running at 120 MHz, Hardware DSP and floating point support" but do not write "M4F"...?
The Adafruit Itsy Bitsy M4 looks like a great board for the price.

The on-board SAMD51G19A does include the single precision hardware floating point unit.

Other points to note are that the Itsy Bitsy M4's microcontroller runs crystalless (the Metro M4 and Feather M4, on the other hand, have an external crystal) and that the board's 48-pin G variant doesn't include I2S support. Other than that, it offers excellent number-crunching power in a tiny package.

dsyleixa

yes, thanks, the "crystalless" thing is new to me (I'm not sure what effect it has, though)
-  but what about the M4 double-fp issue?
Quote
the fp double test on the M4F is not much faster than on the Due M3 (relative to CPU clock), although the M4 claims to have a hardware FPU -
that's really strange, but the benchmark test finally brought it to light...

Anyone got a Teensy 3.5 or 3.6 to check that?
(rhetorically asked... sure - at least the Teensy factory owner, probably... ;) )

MartinL

Crystalless just means that the microcontroller is using its own internal, less accurate 32kHz oscillator as the basis for its internal 48MHz, 100MHz and 120MHz clocks and timing.

It will only really affect your project if you have some absolute timing measurement, such as a Real Time Clock (RTC).

The single precision hardware Floating Point Unit (FPU) on the Arm Cortex M4F only supports "float" and not "double" for floating point operations.
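
In practice this means code only benefits from the FPU if it stays in single precision from end to end. Below is a minimal sketch of the usual pitfalls, assuming only standard C++ promotion rules and nothing board-specific. The function names are made up for illustration.

```cpp
#include <math.h>
#include <type_traits>

// An unsuffixed literal such as 0.5 has type double, so float * 0.5 is computed
// in double precision. The f suffix keeps the whole expression in float.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays float");
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal becomes double");

// The f-suffixed math functions take and return float.
static_assert(std::is_same<decltype(sinf(1.0f)), float>::value,
              "sinf returns float");
static_assert(std::is_same<decltype(sqrt(2.0)), double>::value,
              "sqrt(double) returns double");

// Single precision throughout: float argument, float function, float literal.
float half_sine_single(float x) {
    return sinf(x) * 0.5f;
}

// Same calculation, but every step runs in double before the result is
// narrowed back to float at the end.
float half_sine_double(float x) {
    return static_cast<float>(sin(static_cast<double>(x)) * 0.5);
}
```

Both functions return practically the same value. The difference is only in which arithmetic routines get used along the way, which is what the float_op and double_op rows in this thread measure.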

avr_fred

Quote
Anyone got a Teensy 3.5 or 3.6 to check that?
Yes, using westfw's test code without the display, results are:

Code: [Select]
Benchmark for Teensy 3.5 120MHz
init test arrays
start test
  0      2254  int_Add
  1       868  int_Mult
  2     36066  double_op
  2      2635  float_op
  3      1449  randomize
  4      2831  matrx_algb
  5      1231  arr_sort
  6      2212  GPIO toggle
  




westfw

Quote
to summarize: IIUC, the poor M0 fp performance is mostly due to poor fp code optimization in the M0 core
I want to emphasize that by "core" here, we're actually talking about the CPU hardware, or the gcc floating point libraries.  Not the "Arduino Core" that frequently comes up in discussion...


Quote
the fp double test on the M4F is not much faster than on the Due M3 (relative to CPU clock), although the M4 claims to have a hardware FPU -
that's really strange, but the benchmark test finally brought it to light...
I checked the object code.  The gcc "double" floating point functions for CM4 do NOT make use of the single-precision floating point hardware.  In retrospect, I'm not quite sure how one would go about doing that...

dsyleixa

@westfw, I was referring to what I thought I had understood from your earlier post: that the MEGA2560 benefits from highly optimized AVR fp code and the Due from ARM fp code optimization, but the M0 from neither, just native gcc, IIUC.

As for the M4, I actually don't understand what is decisive for the slow double-FP execution speed, whether it's a hardware or a software issue, given that the M4F only provides a single-precision FPU.

@avr_fred, thank you very much for testing! (edit: I just noticed that you tested the Teensy 3.5.)
In the end it looks like the same slow double-FP execution speed as already shown for westfw's Adafruit M4, and apparently even a little slower than that (BTW, the matrix_algebra test also uses double-precision fp, if available) ...

westfw

Apparently having single-precision FP hardware isn't much help with double-precision computation.
That sort-of makes sense.  First, trying to figure out how to utilize it makes my head hurt.  Second, "double precision" has 53 bits of mantissa, which is more than twice the 24 bits of single precision, so it's worse than you'd think at first.  Third, there's a lot of value to the compiler using "common" code for the double precision math, so different platforms get the same results...
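
The mantissa widths quoted above can be checked directly. This is a small sketch, assuming IEEE 754 binary32/binary64 for float and double. That holds on the ARM boards discussed here, but not on AVR, where double is typically the same 32-bit type as float. The helper name fits_in_float is hypothetical.

```cpp
#include <limits>

// Number of mantissa (significand) bits, including the implicit leading 1.
constexpr int kFloatBits  = std::numeric_limits<float>::digits;   // 24 for IEEE 754 binary32
constexpr int kDoubleBits = std::numeric_limits<double>::digits;  // 53 for IEEE 754 binary64

// True if the integer survives a round trip through float unchanged.
bool fits_in_float(long long v) {
    return static_cast<long long>(static_cast<float>(v)) == v;
}
```

For example, fits_in_float(16777216) is true (2^24 is exactly representable), while fits_in_float(16777217) is false, because 2^24 + 1 needs 25 mantissa bits. Multiplying two 53-bit mantissas gives a product up to 106 bits wide, so a software double multiply cannot be done with a single 24-bit hardware multiply.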

dsyleixa

hmmmh, I see - in the end I learnt a remarkable amount about coding, optimization, compiling, execution, graphics libs, and FPU utilization by developing and running those benchmark tests!
(not only about Arduinos, but also about the Raspi, through a ported version of the code)

thank you very much for your contributions!
