Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)

FYI, here are some more results from v1.01 of the benchmark (built with pragma -O1) on the Teensy 3.6, 3.5, and 3.2 and on the Dragonfly (STM32L4 @ 80 MHz, hardware float):

    Teensy 3.6 @ 180 MHz
        INT_LOOP(30000) bench...= 500 microseconds 60.00MIPS
        LONG_LOOP(30000) bench...= 502 microseconds 59.76MIPS
        FLOAT_DIV(30000) bench...= 2503 microseconds 14.99MFLOPS
        DOUBLE_DIV(30000) bench...= 9343 microseconds 3.39MFLOPS
        FLOAT_MUL(30000) bench...= 667 microseconds 181.82MFLOPS
        DOUBLE_MUL(30000) bench...= 7008 microseconds 4.61MFLOPS

    Teensy 3.6 @ 120 MHz
        INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
        LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
        FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
        DOUBLE_DIV(30000) bench...= 14019 microseconds 2.26MFLOPS
        FLOAT_MUL(30000) bench...= 1001 microseconds 120.97MFLOPS
        DOUBLE_MUL(30000) bench...= 10514 microseconds 3.07MFLOPS

    Teensy 3.5 @ 120 MHz
        INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
        LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
        FLOAT_DIV(30000) bench...= 3758 microseconds 9.99MFLOPS
        DOUBLE_DIV(30000) bench...= 18797 microseconds 1.66MFLOPS
        FLOAT_MUL(30000) bench...= 1003 microseconds 120.97MFLOPS
        DOUBLE_MUL(30000) bench...= 10529 microseconds 3.07MFLOPS

    Teensy 3.2 @ 120 MHz
        INT_LOOP(30000) bench...= 751 microseconds 39.95MIPS
        LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
        FLOAT_DIV(30000) bench...= 8784 microseconds 3.74MFLOPS
        DOUBLE_DIV(30000) bench...= 17559 microseconds 1.79MFLOPS
        FLOAT_MUL(30000) bench...= 6771 microseconds 4.99MFLOPS
        DOUBLE_MUL(30000) bench...= 10533 microseconds 3.07MFLOPS

    Dragonfly @ 80 MHz
        INT_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
        LONG_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
        FLOAT_DIV(30000) bench...= 5641 microseconds 6.65MFLOPS
        DOUBLE_DIV(30000) bench...= 21813 microseconds 1.45MFLOPS
        FLOAT_MUL(30000) bench...= 1883 microseconds 39.79MFLOPS
        DOUBLE_MUL(30000) bench...= 16173 microseconds 1.99MFLOPS
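
A note for anyone cross-checking the numbers: the MFLOPS column is not simply 30000 divided by the printed time. The figures above are consistent with the LONG_LOOP time being subtracted as loop overhead before dividing, while the two loop tests use the raw time. That is inferred from the posted output only, not from the benchmark source, so treat it as an assumption. A minimal Python sketch (the function names `raw_rate`, `net_rate`, and `summarize` are made up for illustration) that reproduces the Teensy 3.6 @ 180 MHz column:

```python
# Reproduce the rate column from the raw timings posted above.
# Assumption (inferred from the numbers, not from the benchmark source):
# the float/double rates subtract the LONG_LOOP time as loop overhead.

ITERATIONS = 30000  # loop count shown in the output, e.g. INT_LOOP(30000)


def raw_rate(elapsed_us, iterations=ITERATIONS):
    """Operations per microsecond, i.e. millions of operations per second."""
    return iterations / elapsed_us


def net_rate(elapsed_us, overhead_us, iterations=ITERATIONS):
    """Same as raw_rate, but with the empty-loop time subtracted first."""
    return iterations / (elapsed_us - overhead_us)


# Timings in microseconds, copied from the Teensy 3.6 @ 180 MHz run.
t36_180 = {
    "INT_LOOP": 500,
    "LONG_LOOP": 502,
    "FLOAT_DIV": 2503,
    "DOUBLE_DIV": 9343,
    "FLOAT_MUL": 667,
    "DOUBLE_MUL": 7008,
}


def summarize(timings):
    """Return the rate for each test, rounded to two decimals like the output."""
    overhead = timings["LONG_LOOP"]
    rates = {}
    for name, elapsed in timings.items():
        if name.endswith("_LOOP"):
            rates[name] = round(raw_rate(elapsed), 2)  # MIPS, raw time
        else:
            rates[name] = round(net_rate(elapsed, overhead), 2)  # MFLOPS, net time
    return rates


if __name__ == "__main__":
    for name, rate in summarize(t36_180).items():
        print(f"{name:11s} {rate:7.2f}")
```

Under that assumption, FLOAT_MUL on the 3.6 at 180 MHz works out to 30000 / (667 - 502) = 181.82, matching the printed figure. The net time there is only 165 microseconds, so that particular number is very sensitive to a few microseconds of timing error and is best read alongside the raw 667 microseconds.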