Why is FP arithmetic so slow on an M0/Zero compared to AVR or M3 Due?

Yes, thank you!
It was a bit too late yesterday to rewrite my code, but I have just finished updating it myself with this new function:

float test_float_math32() { // 500,000 passes x 5 ops = 2,500,000 32-bit float ops (mult, sqrt, sin, exp)
  volatile float s = (float)PI;
  unsigned long y;

  for (y = 0; y < 500000UL; y++) {
    s *= sqrtf(s);
    s = sinf(s);
    s = expf(s);
    s *= s;
  }
  return s; // debug
}
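
For the float64 rows, the same loop has to run on double with the double math functions. Here is a minimal sketch of such a counterpart, assuming it mirrors test_float_math32() one to one. The name test_float_math64 and the iterations parameter are made up for illustration, and it uses plain <cmath> so it also compiles off-target:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical double-precision twin of test_float_math32(): same five
// operations per pass, but on double with sqrt/sin/exp instead of
// sqrtf/sinf/expf. Name and parameter are illustrative assumptions.
double test_float_math64(uint32_t iterations = 500000UL) {
    volatile double s = 3.14159265358979323846;  // same start value as PI

    for (uint32_t y = 0; y < iterations; y++) {
        s = s * std::sqrt(s);  // multiply + square root
        s = std::sin(s);       // result is in [-1, 1]
        s = std::exp(s);       // result is in [1/e, e], so always positive
        s = s * s;             // result is in [1/e^2, e^2]
    }
    return s;  // returned so the compiler cannot discard the loop
}
```

Note that with the default avr-gcc settings on the Mega2560, double is the same 32-bit type as float, so a separate double run makes no difference there.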

and I have already found that on both the M3 and the M0, float32 is about 2x as fast as float64!

My results for the M3 and M0 with float32 and float64:

Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI + 32bit float
  0      7746  int_Add
  1     15795  int_Mult
  2     89054  float_op (float)
  3     17675  randomize
  4     18650  matrx_algb
  5      6328  arr_sort
  6      9944  GPIO_toggle
  7      6752  Graphics 
runtime ges.:  171944    
benchmark:     290

Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI + double fp
  0      7746  int_Add
  1     15795  int_Mult
  2    199888  float_op (double)
  3     17727  randomize
  4     18559  matrx_algb
  5      6330  arr_sort
  6      9734  GPIO toggle
  7      6759  Graphics   
runtime ges.:  282538    
benchmark:     176

Arduino DUE + adafruit_ILI9341 Hardware-SPI + 32bit float
  0      4111  int_Add
  1      1389  int_Mult
  2     29124  float_op (float)
  3      3853  randomize
  4      4669  matrx_algb
  5      2832  arr_sort
  6     11859  GPIO_toggle
  7      6142  Graphics   
runtime ges.:  63979     
benchmark:     781

Arduino DUE + adafruit_ILI9341 Hardware-SPI + double fp
  0      4111  int_Add
  1      1389  int_Mult
  2     57225  float_op (double)
  3      3852  randomize
  4      4666  matrx_algb
  5      2833  arr_sort
  6     11787  GPIO toggle
  7      6143  Graphics   
runtime ges.:  92006     
benchmark:     543

For comparison: Mega2560

Arduino MEGA + ILI9225 + Karlson UTFT
  0     90244  int_Add
  1    237402  int_Mult
  2    163613  float_op (float)
  3    158567  randomize
  4     46085  matrx_algb
  5     23052  arr_sort
  6     41569  GPIO toggle
  7     62109  Graphics   
runtime ges.:    822641 
benchmark:        60
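
A note on reading the tables: "runtime ges." is the total runtime (German "gesamt"), and the "benchmark" score appears to be 50,000,000 divided by that total, rounded down. This rule is inferred from the five result blocks above, not taken from the benchmark source, so treat it as an assumption. The helper names below are made up for illustration:

```cpp
#include <cstdint>

// Assumed scoring rule, inferred from the posted numbers only:
// score = 50,000,000 / total runtime, using integer division.
uint32_t benchmark_score(uint32_t total_runtime) {
    const uint32_t kScale = 50000000;  // assumed constant
    return kScale / total_runtime;
}

// How many times slower one test row is than another,
// e.g. float_op (double) versus float_op (float).
double slowdown(uint32_t slow_time, uint32_t fast_time) {
    return static_cast<double>(slow_time) / static_cast<double>(fast_time);
}
```

With the posted totals this reproduces all five scores (290, 176, 781, 543 and 60). For the float_op row, double is about 2.24x slower than float on the M0 (199888 vs. 89054) and about 1.96x slower on the Due (57225 vs. 29124).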

I was just about to publish this and was surprised to find that you had already done so, and for some additional platforms as well - great!
(and this M4 thing is really amazing!) 8)
So, back to my original question, to summarize: IIUC, the poor M0 FP performance is mostly due to poorly optimized FP code in the M0 core compared to AVR and the M3 Due, and second, it turned out that using the float32 variants of the math functions (the ones with the f suffix, such as sqrtf, sinf, expf) can make it about 2x as fast.
That is very valuable to know!
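
To make the float32 point concrete, here is a small sketch of where a calculation stays in float and where it quietly switches to double. The helper names are made up for illustration, and <type_traits> is assumed to be available (it is on desktop and ARM GCC toolchains, but not on the stock AVR core):

```cpp
#include <math.h>
#include <type_traits>

// Stays in float32: float argument, f-suffixed function, f-suffixed literal.
inline float half_root_f(float x) { return 0.5f * sqrtf(x); }

// Computed in float64: the literal 0.5 is a double, and so is sqrt().
inline double half_root_d(float x) { return 0.5 * sqrt(static_cast<double>(x)); }

// Compile-time checks of the resulting expression types.
static_assert(std::is_same<decltype(sqrtf(1.0f)), float>::value,
              "sqrtf returns float");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float times float literal stays float");
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float times double literal becomes double");
static_assert(std::is_same<decltype(sqrt(1.0)), double>::value,
              "sqrt on a double returns double");
```

So on a core without an FPU, a single unsuffixed literal like 0.5 is enough to pull an expression into the slower double routines. GCC's -Wdouble-promotion warning can help to spot such places.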

Thanks a lot for your efforts!

PS, edit, off-topic:
Do you think the Adafruit ItsyBitsy M4 Express featuring the ATSAMD51 has an FPU, too? They just write "ATSAMD51 32-bit Cortex M4 core running at 120 MHz, Hardware DSP and floating point support", but they do not write "M4F"...?