# why are fp arithmetics so slow on a M0/Zero, compared to AVR or M3 Due?

hello,
why are fp arithmetics so slow on a M0/Zero, compared to AVR or M3 Due?

My test code

``````
#define PI  M_PI

//--------------------------------------------
double test_float_math() { // 2,500,000 fp (double) mult, transcend.

volatile double s=(double)PI;
unsigned long y;

for(y=0;y<500000;y++) {
s*=sqrt(s);
s=sin(s);
s=exp(s);
s*=s;
}
return s; // debug
}

//--------------------------------------------
float test_float_math32() { // 2,500,000 32bit float mult, transcend.

volatile float s=(float)PI;
unsigned long y;

for(y=0;y<500000;y++) {
s*=sqrt(s);
s=sin(s);
s=exp(s);
s*=s;
}
return s;  // debug
}
``````

shows the runtimes (ms): (updated)

Arduino Mega2560 float: 163,613
Arduino M0/ItsyBitsy double: 199,888
Arduino Due double: 57,225

I actually expected both M0 benchmarks to be ~100,000 ms (optimized double expectedly a little faster than 32-bit float), so somewhere in between the AVR and the Due score.

(tbh, based on cpu clock, I actually expected even the Due score to be 5x as fast as on AVR (i.e., ~30,000 ms), so 2x as fast as it shows actually, and the M0 score about half of that)

But even much slower than on an AVR??

To make the test fair, the Due should also be tested with floats.

I actually didn't expect it to differ so much, and indeed, it doesn't with the function above.

edit, update:
with the new XXXf style fp functions, both M0 and M3 are indeed 2x as fast for float32 compared to float64!
Apologies, I stand corrected!

An arduino mega runs at 16MHz. An Arduino Due runs at 84MHz. The Due can perform more operations per second than the Mega. A float takes the same number of operations regardless of platform, so it makes sense that the Due can do the task faster. As for why the Zero is different, I don’t know

bos1714:
An arduino mega runs at 16MHz. An Arduino Due runs at 84MHz. The Due can perform more operations per second than the Mega. A float takes the same number of operations regardless of platform, so it makes sense that the Due can do the task faster. As for why the Zero is different, I don’t know

that 32 and 64 bit fp are not supposed to differ so much was exactly what I already stated.

But for
Mega2560 16 MHz
M0 zero 48 MHz
M3 Due 84 MHz

I would expect
M0 Zero: 3x as fast as Mega2560
M3 Due: 5x as fast as Mega2560

Both ARM boards are much slower than that, but especially for the M0/Zero it's crucial !
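For reference, the clock-only expectation above works out as follows. This is only a sketch of the arithmetic, assuming the cycle count per operation were identical on all three boards (the replies further down explain why it is not). `expected_ms` and `speedup` are made-up helper names, and the runtimes are the ones posted above. Keep in mind that the Mega figure is a float run while the two ARM figures are double runs.

```cpp
// Clock-only scaling of the posted float_op runtimes (hypothetical helpers).
// Assumes identical cycles per operation on every board, which is exactly
// the assumption the measurements contradict.
constexpr double expected_ms(double ref_ms, double ref_mhz, double target_mhz) {
    return ref_ms * ref_mhz / target_mhz;
}

// How many times faster than the reference a measured run actually was.
constexpr double speedup(double ref_ms, double measured_ms) {
    return ref_ms / measured_ms;
}

constexpr double kMegaMs = 163613.0;  // Mega2560 @ 16 MHz, float
constexpr double kM0Ms   = 199888.0;  // M0 @ 48 MHz, double
constexpr double kDueMs  = 57225.0;   // Due @ 84 MHz, double

constexpr double kM0Expected  = expected_ms(kMegaMs, 16.0, 48.0);  // about 54,538 ms
constexpr double kDueExpected = expected_ms(kMegaMs, 16.0, 84.0);  // about 31,164 ms
constexpr double kM0Speedup   = speedup(kMegaMs, kM0Ms);           // below 1: slower than the Mega
constexpr double kDueSpeedup  = speedup(kMegaMs, kDueMs);          // roughly 2.9x, not 5.25x

static_assert(kM0Speedup < 1.0, "M0 double run is slower than the Mega float run");
```

So by clock alone the M0 would be expected at about 54.5 s and the Due at about 31 s, while the measured double runs are about 200 s and 57 s.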

I wonder if you need to have "for(y=0;y<500000UL;y++)". Perhaps it is not doing as many loops as you think.

edit -
tested!
500000UL gives the same result like 500000 before in that function

The test is of a few math library functions. See the following for an older test of multiply and divide.

https://forum.arduino.cc/index.php?topic=431169.15

gdsports:
The test is of a few math library functions. See the following for an older test of multiply and divide.

Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2) - Microcontrollers - Arduino Forum

in which respect is that supposed to resolve my TO question?

Arduino Due (float): 58298 ms

First of all, to test with "float", you'll need to change your function calls to the fXXX versions.
Otherwise, they're still using doubles, and you're adding the overhead of converting your
float arguments to doubles as well.

(Could you post your complete sketch? I don't feel like filling in the missing parts just to duplicate your results...)

2nd, the speed of flash is essentially constant at about 20MHz. Due and Zero both have some sort of flash accelerator in front of their flash, but you won't get "n times faster because the clock is n times faster" behavior. Zero runs the flash with 1 wait state (at 48MHz)

And Due runs the flash with 4 wait states (at 84MHz)

The "flash acceleration" is ... poorly documented and pretty unpredictable.
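A rough way to see why both chips end up near that 20 MHz figure: assume, as a deliberate simplification, that every uncached flash access costs one cycle plus the wait states. The helper name `uncached_fetch_mhz` is made up for this sketch, and the model ignores the flash accelerator completely, so treat the numbers as a worst case rather than a measurement.

```cpp
// Worst-case flash access rate when every access pays the full wait states
// (hypothetical helper; ignores the flash accelerator/cache entirely).
constexpr double uncached_fetch_mhz(double cpu_mhz, int wait_states) {
    return cpu_mhz / (1 + wait_states);
}

constexpr double kZeroFetchMhz = uncached_fetch_mhz(48.0, 1);  // 24 MHz
constexpr double kDueFetchMhz  = uncached_fetch_mhz(84.0, 4);  // 16.8 MHz
```

Under this assumption the Due's higher core clock buys nothing at the flash interface, so any advantage has to come from the accelerator hitting often enough.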

3rd, the AVR has highly optimized (for both speed and size) floating point libraries implemented as part of avr-libc, written in assembly language. Cortex-M3 (Due) has optimized ARM assembly code for basic floating point operations, but I think it uses generic C code for the high-level math functions. Cortex-M0 has nothing special at all :-( It's using generic C code to implement all the floating point functions (slower and bigger). (Figuring this out via the gcc source code is a pain.) You might be interested in Qfplib: a family of floating-point libraries for ARM Cortex-M cores.

4th, this is probably because the Cortex-M0 architecture is sort-of sucky. While still essentially "ARM", it's been pared down to a sort of bare minimum subset that is pretty unpleasant for assembly language authors (or compilers) to deal with. You wouldn't THINK that it would take a lot of effort to convert the CM3 assembly floating point code to CM0, but ... you'd be wrong.

For a good time, get ahold of one of the new SAMD51-based boards. Cortex-M4F processor with hardware single-precision floating point. Very zippy, at least for single precision basic math!

hi,
What I don't understand is the fXXX thing.... (?) what is that?

but here is the code, it's part of a longer benchmark test:

``````
// Brickbench benchmark test
#include "SPI.h"

#if defined(__SAM3X8E__)
#undef F
#define F(string_literal) string_literal
#endif

// Arduino TFT pins
#define    tft_cs     10
#define    tft_dc      9
#define    tft_rst     8

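// Adafruit Hardware SPI, no RST
// (display includes and tft object are not in the posted code; assumed Adafruit ILI9341 setup)
#include <Adafruit_GFX.h>
#include <Adafruit_ILI9341.h>
Adafruit_ILI9341 tft = Adafruit_ILI9341(tft_cs, tft_dc);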

#define  TimerMS() millis() // platform compatib.

unsigned long runtime[8];

#define tpin1  11  // GPIO test pin digitalWrite
#define tpin2  12  // GPIO test pin digitalWrite
#define tpin3  13  // GPIO test pin digitalRead

void TFTprint(char sbuf[], int16_t x, int16_t y) {
tft.setCursor(x, y);
tft.print(sbuf);
}

int a[500], b[500], c[500], t[500];

//--------------------------------------------
// Mersenne Twister
//--------------------------------------------

unsigned long randM(void) {

const int M = 7;
const unsigned long A[2] = { 0, 0x8ebfd028 };

static unsigned long y[25];
static int index = 25+1;

if (index >= 25) {
int k;
if (index > 25) {
unsigned long r = 9, s = 3402;
for (k=0 ; k<25 ; ++k) {
r = 509845221 * r + 3;
s *= s + 1;
y[k] = s + (r >> 10);
}
}
for (k=0 ; k<25-M ; ++k)
y[k] = y[k+M] ^ (y[k] >> 1) ^ A[y[k] & 1];
for (; k<25 ; ++k)
y[k] = y[k+(M-25)] ^ (y[k] >> 1) ^ A[y[k] & 1];
index = 0;
}

unsigned long e = y[index++];
e ^= (e << 7) & 0x2b5b2500;
e ^= (e << 15) & 0xdb8b0000;
e ^= (e >> 16);
return e;
}

//--------------------------------------------
// Matrix Algebra
//--------------------------------------------

// matrix * matrix multiplication (matrix product)

void MatrixMatrixMult(int N, int M, int K, double *A, double *B, double *C) {
int i, j, s;
for (i = 0; i < N; ++i) {
for (j = 0; j < K; ++j) {
C[i*K+j] = 0;
for (s = 0; s < M; ++s) {
C[i*K+j] = C[i*K+j] + A[i*M+s] * B[s*K+j]; // A is N x M, B is M x K (row-major)
}
}
}
}

// matrix determinant

double MatrixDet(int N, double A[]) {
int i, j, i_count, j_count, count = 0;
double Asub[N - 1][N - 1], det = 0;

if (N == 1)
return *A;
if (N == 2)
return ((*A) * (*(A+1+1*N)) - (*(A+1*N)) * (*(A+1)));

for (count = 0; count < N; count++) {
i_count = 0;
for (i = 1; i < N; i++) {
j_count = 0;
for (j = 0; j < N; j++) {
if (j == count)
continue;
Asub[i_count][j_count] = *(A+i+j*N);
j_count++;
}
i_count++;
}
det += pow(-1, count) * A[0+count*N] * MatrixDet(N - 1, &Asub[0][0]);
}
return det;
}

//--------------------------------------------
// shell sort
//--------------------------------------------

void shellsort(int size, int* A)
{
int i, j, increment;
int temp;
increment = size / 2;

while (increment > 0) {
for (i = increment; i < size; i++) {
j = i;
temp = A[i];
while ((j >= increment) && (A[j-increment] > temp)) {
A[j] = A[j - increment];
j = j - increment;
}
A[j] = temp;
}

if (increment == 2)
increment = 1;
else
increment = (unsigned int) (increment / 2.2);
}
}

//--------------------------------------------
// gnu quick sort
// (optional)
//--------------------------------------------

int compare_int (const int *a, const int *b)
{
int  temp = *a - *b;

if (temp > 0)          return  1;
else if (temp < 0)     return -1;
else                   return  0;
}

// gnu qsort:
// void qsort (void *a, size_t count, size_t size, compare_function)
// gnu qsort call for a[500] array of int:
// qsort (a, 500, sizeof(a[0]), compare_int)

//--------------------------------------------
// benchmark test procedures
//--------------------------------------------

int test_Int_Add() { // 50,000,000 int +,- plus 5000000 counter
int i=1, j=11, k=112, l=1111, m=11111, n=-1, o=-11, p=-111, q=-1112, r=-11111;
unsigned long x;
volatile long s=0;
for(x=0;x<5000000;++x) {
s+=i; s+=j; s+=k; s+=l; s+=m; s+=n; s+=o; s+=p; s+=q; s+=r;
}
return s;
}

//--------------------------------------------
long test_Int_Mult() { // 10,000,000 int *,/ plus 10,000,000 counter
int  x;
unsigned long y;
volatile long s;

for(y=0;y<500000;y++) {
s=1;
for(x=1;x<=10;++x) { s*=x;}
for(x=10;x>0;--x) { s/=x;}
}
return s;
}

#define PI  M_PI

//--------------------------------------------
double test_float_math() { // 2,500,000 fp mult, transcend. plus 500000 counter

volatile double s=PI;
unsigned long y;

for(y=0;y<500000;++y) {
s*=sqrt(s);
s=sin(s);
s=exp(s);
s*=s;
}
return s;
}

//--------------------------------------------
// bug fixed !!
float test_float_math32() { // 2,500,000 32bit float mult, transcend.
volatile float s=(float)PI;
unsigned long y;

for(y=0;y<500000UL;y++) {
s*=sqrtf(s);
s=sinf(s);
s=expf(s);
s*=s;
}
return s; // debug
}

//--------------------------------------------
long test_rand_MT() { // 2,500,000 PRNGs
volatile unsigned long s;
unsigned long y;

for(y=0;y<2500000;++y) {
s=randM()%10001;
}
return s;
}
``````

code too large. part 2 to be continued...
AND can re-post only in 5 minutes !!! (effing restrictions!)

part 2:

``````
//--------------------------------------------
double test_matrix_math() { // 150,000 2D Matrix algebra (mult, det)
unsigned long x;

double A[2][2], B[2][2], C[2][2];
double S[3][3], T[3][3];
unsigned long s;

for(x=0;x<50000;x++) {

A[0][0]=1;   A[0][1]=3;
A[1][0]=2;   A[1][1]=4;

B[0][0]=10;  B[0][1]=30;
B[1][0]=20;  B[1][1]=40;

MatrixMatrixMult(2, 2, 2, A[0], B[0], C[0]);

A[0][0]=1;   A[0][1]=3;
A[1][0]=2;   A[1][1]=4;

MatrixDet(2, A[0]);

S[0][0]=1;   S[0][1]=4;  S[0][2]=7;
S[1][0]=2;   S[1][1]=5;  S[1][2]=8;
S[2][0]=3;   S[2][1]=6;  S[2][2]=9;

MatrixDet(3, S[0]);

s=(S[0][0]*S[1][1]*S[2][2]);

}

return s;
}

//--------------------------------------------
// for array copy using void *memcpy(void *dest, const void *src, size_t n);

long test_Sort(){ // 1500 shellsort of random array[500]
unsigned long s;
int y, i;

int t[500];

for(y=0;y<500;y++) {
memcpy(t, a, sizeof(a));
shellsort(500, t);

memcpy(t, b, sizeof(b));
shellsort(500, t);

memcpy(t, c, sizeof(c));
shellsort(500, t);
}
return y;
}

//--------------------------------------------
int32_t test_GPIO(){  // 6,000,000 toggle GPIO r/w  plus counter
volatile static bool w=false, r;
uint32_t y;
for (y=0; y<2000000; y++) {
digitalWrite(tpin1, w);
w=!w;
digitalWrite(tpin2, w&!r);  // optional: (tpin2,  w&r);
}
return 1;
}

/*
//--------------------------------------------
int32_t test_GPIO_AVR() {  // 6,000,000 GPIO bit r/w, optionally for MEGA2560 pins
volatile static bool w=false, r;
uint32_t y;
for (y=0; y<2000000; y++) {
bitWrite(PORTB, PB5, w);
w=!w;
bitWrite(PORTB, PB6, w&!r);
}
return 1; // debug
}
*/

//--------------------------------------------
inline void displayValues() {

char buf[120];
tft.fillScreen(ILI9341_BLACK); // clrscr()

sprintf (buf, "%3d %9ld  int_Add",    0, runtime[0]); TFTprint(buf, 0,10);
sprintf (buf, "%3d %9ld  int_Mult",   1, runtime[1]); TFTprint(buf, 0,20);
sprintf (buf, "%3d %9ld  float_op",   2, runtime[2]); TFTprint(buf, 0,30);
sprintf (buf, "%3d %9ld  randomize",  3, runtime[3]); TFTprint(buf, 0,40);
sprintf (buf, "%3d %9ld  matrx_algb", 4, runtime[4]); TFTprint(buf, 0,50);
sprintf (buf, "%3d %9ld  arr_sort",   5, runtime[5]); TFTprint(buf, 0,60);
sprintf (buf, "%3d %9ld  GPIO_togg",  6, runtime[6]); TFTprint(buf, 0,70);
sprintf (buf, "%3d %9ld  Graphics",   7, runtime[7]); TFTprint(buf, 0,80);
}

//--------------------------------------------
int32_t test_TextOut() {  //10*8 lines of text
int  y;
char buf[120];

for(y=0;y<10;++y) {
tft.fillScreen(ILI9341_BLACK); // clrscr()
sprintf (buf, "%3d %9d  int_Add",    y, 1000);  TFTprint(buf, 0,10);
sprintf (buf, "%3d %9d  int_Mult",   0, 1010);  TFTprint(buf, 0,20);
sprintf (buf, "%3d %9d  float_op",   0, 1020);  TFTprint(buf, 0,30);
sprintf (buf, "%3d %9d  randomize",  0, 1030);  TFTprint(buf, 0,40);
sprintf (buf, "%3d %9d  matrx_algb", 0, 1040);  TFTprint(buf, 0,50);
sprintf (buf, "%3d %9d  GPIO_togg",  0, 1050);  TFTprint(buf, 0,60);
sprintf (buf, "%3d %9d  Graphics",   0, 1060);  TFTprint(buf, 0,70);
sprintf (buf, "%3d %9d  testing...", 0, 1070);  TFTprint(buf, 0,80);

}
return y;
}

//--------------------------------------------
int32_t test_graphics() { // 10x 8 shapes
int y;
char buf[120];

for(y=0;y<10;++y) {
tft.fillScreen(ILI9341_BLACK);
sprintf (buf, "%3d", y);  TFTprint(buf, 0,80);
tft.drawCircle(50, 40, 10, ILI9341_WHITE);
tft.fillCircle(30, 24, 10, ILI9341_WHITE);
tft.drawLine(10, 10, 60, 60, ILI9341_WHITE);
tft.drawLine(50, 20, 90, 70, ILI9341_WHITE);
tft.drawRect(20, 20, 40, 40, ILI9341_WHITE);
tft.fillRect(65, 25, 20, 30, ILI9341_WHITE);
tft.drawCircle(70, 30, 15, ILI9341_WHITE);
}
return y;
}

//--------------------------------------------
long test(){
unsigned long time0, x, y;
double s;
char  buf[120];
int   i;
float f;

Serial.println("init test arrays");

for(y=0;y<500;++y) {
a[y]=randM()%30000; b[y]=randM()%30000; c[y]=randM()%30000;
}

Serial.println("start test");
delay(10);

time0= TimerMS();
s=test_Int_Add();
runtime[0]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  int_Add",    0, runtime[0]); Serial.println( buf);

time0=TimerMS();
s=test_Int_Mult();
runtime[1]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  int_Mult",   1, runtime[1]); Serial.println( buf);

time0=TimerMS();
s=test_float_math();
runtime[2]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  float_op",   2, runtime[2]); Serial.println( buf);

time0=TimerMS();
s=test_rand_MT();
runtime[3]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  randomize",  3, runtime[3]); Serial.println( buf);

time0=TimerMS();
s=test_matrix_math();
runtime[4]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  matrx_algb", 4, runtime[4]); Serial.println( buf);

time0=TimerMS();
s=test_Sort();
runtime[5]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  arr_sort",   5, runtime[5]); Serial.println( buf);

// GPIO R/W toggle test
//Serial.println("GPIO toggle test");
time0=TimerMS();
s=test_GPIO();
runtime[6]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  GPIO toggle", 6, runtime[6]); Serial.println( buf);

// lcd display text / graphs
time0=TimerMS();
s=test_TextOut();
s=test_graphics();
runtime[7]=TimerMS()-time0;
sprintf (buf, "%3d %9ld  Graphics   ", 7, runtime[7]); Serial.println( buf);

Serial.println();

y = 0;
for (x = 0; x < 8; ++x) {
y += runtime[x];
}

displayValues();
sprintf (buf, "runtime ges.:  %-9ld ", y);
Serial.println( buf);   TFTprint(buf, 0,90);

x=50000000.0/y;
sprintf (buf, "benchmark:     %-9ld ", x);
Serial.println( buf);   TFTprint(buf, 0,100);

return 1;
}

//--------------------------------------------
void setup() {

Serial.begin(115200);

// Setup the LCD
tft.begin();
tft.setRotation(3);
tft.setTextColor(ILI9341_WHITE); tft.setTextSize(1);
Serial.println("tft started");

pinMode(tpin1, OUTPUT);
pinMode(tpin2, OUTPUT);
pinMode(tpin3, INPUT_PULLUP);

}

void loop() {
char  buf[120];
test();

sprintf (buf, "Ende Benchmark");
Serial.println( buf);
TFTprint(buf, 0, 110);

while(1);
}

// Brickbench (C) 2018
``````

What I don't understand is the fXXX thing.... (?) what is that?

Sorry; I got that wrong...
As part of the C standard, "sin()" is defined as an operation on doubles that results in a double.
If you want a "float" version, you need to use sinf() instead. Similarly for the other math functions.

Here's a quote from "man sin":

NAME
sin -- sine function

SYNOPSIS
#include <math.h>

double
sin(double x);

long double
sinl(long double x);

float
sinf(float x);
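The signatures above can be checked at compile time. The following is a small sketch, not anything from the benchmark itself: `same_type`, `step_float` and `step_double` are names invented here. It deliberately avoids calling plain sin() with a float argument, because whether that picks a float overload or the double routine depends on which headers the C++ toolchain pulls in; judging by the timings in this thread, the ARM cores ended up in the double routine.

```cpp
#include <math.h>

// Minimal same-type check so the sketch needs no <type_traits>
// (hypothetical helper, not part of any library).
template <typename A, typename B> struct same_type { static const bool value = false; };
template <typename A> struct same_type<A, A> { static const bool value = true; };

// The "f" variants take and return float, so nothing is widened to double.
static_assert(same_type<decltype(sinf(1.0f)), float>::value, "sinf returns float");
static_assert(same_type<decltype(sqrtf(1.0f)), float>::value, "sqrtf returns float");
static_assert(same_type<decltype(expf(1.0f)), float>::value, "expf returns float");

// With a double argument the classic name is the double routine.
static_assert(same_type<decltype(sin(1.0)), double>::value, "sin(double) returns double");

// One loop body of the benchmark, kept in single precision throughout.
float step_float(float s) {
    s *= sqrtf(s);
    s = sinf(s);
    s = expf(s);
    s *= s;
    return s;
}

// The same loop body in double precision, for comparison.
double step_double(double s) {
    s *= sqrt(s);
    s = sin(s);
    s = exp(s);
    s *= s;
    return s;
}
```

On AVR the comparison is moot, since avr-gcc as used by Arduino makes double the same 32-bit type as float.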

The other thing to watch out for is that constants like "3.14159" are normally defined to be doubles, so if you use an expression like:

``````
    float f = myfloat * 3.14;
``````

You'll get myfloat promoted to double and a call to the double version of multiply.
You could say:

``````
    float f = myfloat * 3.14f;
``````
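The promotion can be made visible with sizeof, as sketched below. `scale_promoted` and `scale_single` are illustrative names only. The size checks only tell the two cases apart where double is wider than float, which holds on the ARM boards and on a PC but not on AVR.

```cpp
// float * double literal: the whole expression is computed as double.
static_assert(sizeof(1.0f * 3.14) == sizeof(double), "unsuffixed literal promotes to double");

// float * float literal: the expression stays in single precision.
static_assert(sizeof(1.0f * 3.14f) == sizeof(float), "f-suffixed literal keeps float");

// Double multiply, result narrowed back to float on return (hypothetical helper).
constexpr float scale_promoted(float x) { return x * 3.14; }

// Pure float multiply (hypothetical helper).
constexpr float scale_single(float x) { return x * 3.14f; }
```

Both helpers return nearly the same value; the difference is which multiply routine the compiler has to call on a chip without a double-precision FPU.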

Or, there is a compiler switch ("-fsingle-precision-constant"): gcc - Make C floating point literals float (rather than double) - Stack Overflow

ok, I see, thank you very much!
Already the float thing is more complicated than I expected (I can't use compiler switches in my sketch and can't deal with flags at all). Perhaps I'll try sinf, expf and sqrtf some time later, but in the end it doesn't matter so much.

Far more important is the missing code optimization for the M0, IIUYC. I think I have now (partially) understood the crucial issues, and I simply have to accept that. Nevertheless, it was very valuable to learn that from you. Again, thank you very much for your explanations!

I had to modify your benchmark a bit to get it to work on my display-less board.
Is this consistent with the timing that you're seeing?

Benchmark for Arduino Due
init test arrays
start test
1 1389 int_Mult
2 57199 double_op
2 28729 float_op
3 4042 randomize
4 4635 matrx_algb
5 2832 arr_sort
6 11429 GPIO toggle

And for SAMD21:

``````
Benchmark for SFE SAMD21
init test arrays
start test
1     15803  int_Mult
2    199420  double_op
2     89084  float_op
3     17631  randomize
4     18682  matrx_algb
5      6331  arr_sort
6     10178  GPIO toggle
``````

And for grins, a SAMD51:

``````
Benchmark for Adafruit Metro M4
init test arrays
start test
1       872  int_Mult
2     24482  double_op
2      2772  float_op
3      1680  randomize
4      2077  matrx_algb
5      1553  arr_sort
6      2395  GPIO toggle
``````

(I'm not convinced that the double floating point library "properly" utilizes the single-precision hardware. Sigh.)

benchmark_dsyleixa.ino (12.7 KB)

yes, thank you!
yesterday it was a bit late to rewrite my code on my own, but just now I finished my code update with this new function:

``````
float test_float_math32() { // 2,500,000 32bit float mult, transcend.
volatile float s=(float)PI;
unsigned long y;

for(y=0;y<500000UL;y++) {
s*=sqrtf(s);
s=sinf(s);
s=expf(s);
s*=s;
}
return s; // debug
}
``````

and I found out already that on M3 and M0 float32 is 2x as fast as float64!

my results for M3 and M0 with float32 and float64:

``````
Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI +32bit float
1     15795  int_Mult
2     89054  float_op (float)
3     17675  randomize
4     18650  matrx_algb
5      6328  arr_sort
6      9944  GPIO_toggle
7      6752  Graphics
runtime ges.:  171944
benchmark:     290
``````
``````
Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI +double fp
1     15795  int_Mult
2    199888  float_op (double)
3     17727  randomize
4     18559  matrx_algb
5      6330  arr_sort
6      9734  GPIO toggle
7      6759  Graphics
runtime ges.:  282538
benchmark:     176
``````
``````
Arduino DUE + adafruit_ILI9341 Hardware-SPI + 32bit float
1      1389  int_Mult
2     29124  float_op (float)
3      3853  randomize
4      4669  matrx_algb
5      2832  arr_sort
6     11859  GPIO_toggle
7      6142  Graphics
runtime ges.:  63979
benchmark:     781
``````
``````
Arduino DUE + adafruit_ILI9341 Hardware-SPI + double fp
1      1389  int_Mult
2     57225  float_op (double)
3      3852  randomize
4      4666  matrx_algb
5      2833  arr_sort
6     11787  GPIO toggle
7      6143  Graphics
runtime ges.:  92006
benchmark:     543
``````

in comparison: Mega2560

``````
Arduino MEGA + ILI9225 + Karlson UTFT
1    237402  int_Mult
2    163613  float_op (float)
3    158567  randomize
4     46085  matrx_algb
5     23052  arr_sort
6     41569  GPIO toggle
7     62109  Graphics
runtime ges.:    822641
benchmark:        60
``````
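The printed totals and scores can be cross-checked against the rows. The sketch below does that for the two float32 runs; `sum_rows` and `score` are made-up helper names, and the score formula is the 50,000,000 / total division from the end of test(). The printed rows start at 1 while the total covers all eight runtime[] entries, so the gap is presumably row 0 (int_Add), which is not shown in the listings.

```cpp
// Sum the seven printed rows (recursive so it stays a C++11 constexpr).
constexpr unsigned long sum_rows(const unsigned long* rows, int n) {
    return n == 0 ? 0UL : rows[n - 1] + sum_rows(rows, n - 1);
}

// Score as computed at the end of test(): 50,000,000 / total runtime in ms.
constexpr unsigned long score(unsigned long total_ms) {
    return 50000000UL / total_ms;
}

// Rows 1..7 as printed for the float32 runs.
constexpr unsigned long kM0Float[7]  = {15795, 89054, 17675, 18650, 6328, 9944, 6752};
constexpr unsigned long kDueFloat[7] = {1389, 29124, 3853, 4669, 2832, 11859, 6142};

// Printed total minus the printed rows = time of the row that is not shown.
constexpr unsigned long kM0Hidden  = 171944UL - sum_rows(kM0Float, 7);   // 7746 ms
constexpr unsigned long kDueHidden = 63979UL  - sum_rows(kDueFloat, 7);  // 4111 ms

static_assert(score(171944UL) == 290, "matches the printed M0 float score");
static_assert(score(63979UL) == 781, "matches the printed Due float score");
```

The same subtraction on the two double runs gives the same 7746 ms and 4111 ms, which fits a fixed-cost integer test that does not depend on the float change.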

I just now wanted to publish that and surprisingly found that you did that already and also for some other extra platforms - great!
(and this M4 thing is really amazing!) 8)
So back to my TO question, to summarize: IIUC, the poor M0 fp performance is mostly due to poorly optimized fp library code for the M0, compared to AVR and M3 Due, and 2nd, it turned out that float32 with the XXXf type fp functions can make it 2x as fast.
That is very precious to know!

Thanks a lot for your efforts!

PS, edit, offtopic:
do you think the Adafruit ItsyBitsy M4 Express featuring the ATSAMD51 has got the fpu, too? they write just "ATSAMD51 32-bit Cortex M4 core running at 120 MHz, Hardware DSP and floating point support" but do not write "M4F" though...?

Yes, all of the adafruit m4 boards use the same chip.

;D

westfw:
And for grins, a SAMD51:

``````
2     24482  double_op
2      2772  float_op
``````

(I'm not convinced that the double floating point library "properly" utilizes single point hardware. Sigh.)

tbh, it took some time for me to understand -
but yes, now I see...:
the fp double test on M4F is not much faster than on the Due M3 (in light of cpu clock), although the M4 claims to have a hardware fpu -
that's really strange, but the benchmark test eventually brought it to light...
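Putting numbers on "in light of cpu clock": converting the posted runtimes into CPU cycles per loop iteration removes the clock difference. This is only a sketch. `cycles_per_iteration` is a made-up helper, the runtimes are the ones from the display-less runs above, and it assumes the Metro M4 ran at the 120 MHz quoted for the SAMD51 and the Due at 84 MHz.

```cpp
// CPU cycles spent per loop iteration (hypothetical helper).
// ms * 1e-3 s * MHz * 1e6 Hz / 500000 iterations  ==  ms * MHz / 500
constexpr double cycles_per_iteration(double runtime_ms, double cpu_mhz) {
    return runtime_ms * cpu_mhz / 500.0;
}

constexpr double kDueDoubleCycles = cycles_per_iteration(57199.0, 84.0);   // about 9609
constexpr double kM4DoubleCycles  = cycles_per_iteration(24482.0, 120.0);  // about 5876
constexpr double kDueFloatCycles  = cycles_per_iteration(28729.0, 84.0);   // about 4826
constexpr double kM4FloatCycles   = cycles_per_iteration(2772.0, 120.0);   // about 665

// Per-cycle improvement of the M4 over the Due for each precision.
constexpr double kDoubleGain = kDueDoubleCycles / kM4DoubleCycles;  // about 1.6x fewer cycles
constexpr double kFloatGain  = kDueFloatCycles / kM4FloatCycles;    // about 7.3x fewer cycles
```

So per cycle the M4 gains roughly 7x on float but well under 2x on double, which is what you would expect if the double routines run in software and get little help from the single-precision unit.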

Anyone got a Teensy 3.5 or 3.6 to check that?
(rhetorically asked.... sure - at least the Teensy factory owner, probably... )

Hi dsylexia,

PS, edit, offtopic:
do you think the Adafruit ItsyBitsy M4 Express featuring the ATSAMD51
has got the fpu, too? they write just "ATSAMD51 32-bit Cortex M4 core running at 120 MHz, Hardware DSP and floating point support" but do not write "M4F" though...?

The Adafruit Itsy Bitsy M4 looks like a great board for the price.

The on-board SAMD51G19A does include the single precision hardware floating point unit.

Other points to note are that the Itsy Bitsy M4's microcontroller runs crystalless (Metro M4 and Feather M4, on the other hand, have an external crystal) and the board's 48-pin G variant doesn't include I2S support. Other than that it offers excellent number crunching power in a tiny package.