FYI, here are some more results from v1.01 (pragma -O1) on Teensy 3.5/3.6/3.2 and on Dragonfly (STM32L4 @ 80 MHz, hardware float):
t3.6@180MHz
INT_LOOP(30000) bench...= 500 microseconds 60.00MIPS
LONG_LOOP(30000) bench...= 502 microseconds 59.76MIPS
FLOAT_DIV(30000) bench...= 2503 microseconds 14.99MFLOPS
DOUBLE_DIV(30000) bench...= 9343 microseconds 3.39MFLOPS
FLOAT_MUL(30000) bench...= 667 microseconds 181.82MFLOPS
DOUBLE_MUL(30000) bench...= 7008 microseconds 4.61MFLOPS
t3.6@120MHz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 14019 microseconds 2.26MFLOPS
FLOAT_MUL(30000) bench...= 1001 microseconds 120.97MFLOPS
DOUBLE_MUL(30000) bench...= 10514 microseconds 3.07MFLOPS
t3.5@120MHz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
FLOAT_DIV(30000) bench...= 3758 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 18797 microseconds 1.66MFLOPS
FLOAT_MUL(30000) bench...= 1003 microseconds 120.97MFLOPS
DOUBLE_MUL(30000) bench...= 10529 microseconds 3.07MFLOPS
t3.2@120MHz
INT_LOOP(30000) bench...= 751 microseconds 39.95MIPS
LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
FLOAT_DIV(30000) bench...= 8784 microseconds 3.74MFLOPS
DOUBLE_DIV(30000) bench...= 17559 microseconds 1.79MFLOPS
FLOAT_MUL(30000) bench...= 6771 microseconds 4.99MFLOPS
DOUBLE_MUL(30000) bench...= 10533 microseconds 3.07MFLOPS
dragonfly@80MHz
INT_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
LONG_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
FLOAT_DIV(30000) bench...= 5641 microseconds 6.65MFLOPS
DOUBLE_DIV(30000) bench...= 21813 microseconds 1.45MFLOPS
FLOAT_MUL(30000) bench...= 1883 microseconds 39.79MFLOPS
DOUBLE_MUL(30000) bench...= 16173 microseconds 1.99MFLOPS
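A note on reading the rate column: the MIPS figures match iterations divided by elapsed time, but the MFLOPS figures only match if the empty-loop time is subtracted first. That is my inference from the numbers, not something taken from the benchmark source, so treat the sketch below as an assumption. It uses the t3.6@180MHz times from above; the function names are made up for illustration.

```python
# Recompute the rate column of the t3.6@180MHz block from the reported times.
# ASSUMPTION: float/double rates use (elapsed - LONG_LOOP elapsed) as the
# denominator. This is inferred from the posted numbers, not from the sketch.

N = 30000  # iterations per benchmark, from the "(30000)" in each line

# Elapsed times in microseconds, copied from the t3.6@180MHz block.
times_us = {
    "INT_LOOP": 500,
    "LONG_LOOP": 502,
    "FLOAT_DIV": 2503,
    "DOUBLE_DIV": 9343,
    "FLOAT_MUL": 667,
    "DOUBLE_MUL": 7008,
}


def loop_rate(name):
    """Iterations per microsecond (millions per second) for a bare loop."""
    return N / times_us[name]


def op_rate(name, baseline="LONG_LOOP"):
    """Operations per microsecond after subtracting the empty-loop time."""
    return N / (times_us[name] - times_us[baseline])


def cycles_per_iteration(name, clock_mhz):
    """CPU cycles per iteration: microseconds * MHz gives cycles."""
    return times_us[name] * clock_mhz / N


for name in ("INT_LOOP", "LONG_LOOP"):
    print(f"{name}: {loop_rate(name):.2f} MIPS")
for name in ("FLOAT_DIV", "DOUBLE_DIV", "FLOAT_MUL", "DOUBLE_MUL"):
    print(f"{name}: {op_rate(name):.2f} MFLOPS")
print(f"INT_LOOP: {cycles_per_iteration('INT_LOOP', 180):.2f} cycles/iteration")
```

With the overhead subtracted, FLOAT_MUL on the t3.6@180MHz is left with only 165 microseconds for 30000 multiplies, so a few microseconds of timing jitter moves that MFLOPS figure noticeably. The slower double-precision and t3.2 numbers are much less sensitive to this.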