Dear All
In my research lab we are developing a custom application, and it called for a rough cost-versus-performance evaluation of number-crunching capability. The benchmark we used consists of simple loops, structured so that the compiler cannot simplify them away. It is not very sophisticated, but it resembles many of the operations we currently perform (including storing the result directly back into one of the operands). There are certainly more sophisticated benchmarks around, but I like the idea of sharing the results.
These are the results we got with the Arduino 1.6.9 environment.
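For readers without the attachment, here is a minimal sketch of the kind of loop described above. It is an illustration only, not the attached code: the names `run_float_div` and `time_float_div_us` and the starting values are assumptions, and it uses standard C++ timing so it can be tried on a PC.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical kernel: repeated float division whose result is stored back
// into one of its own operands. `volatile` forces every load and store, so
// the compiler cannot fold the loop into a constant or drop it.
float run_float_div(std::uint32_t iterations) {
    volatile float acc = 1.0e6f;
    volatile float divisor = 1.0001f;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        acc = acc / divisor;
    }
    return acc;
}

// Wall-clock time of the kernel in microseconds. On an Arduino board the
// two clock reads would be calls to micros() instead.
std::int64_t time_float_div_us(std::uint32_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    (void)run_float_div(iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_float_div_us(30000)` would correspond to one FLOAT_DIV(30000) line in the results, under these assumptions.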
Update - 1.01
- I introduced a modification suggested by Riva
- Westfw pointed out the different default compiler optimizations on different platforms
Now the compiler optimization is fixed to -O1 and, as expected, the Due is closer to the Teensy 3.2 in terms of performance
- I added the bench on the Teensy LC
05/04/2018 Update on the results
- Arduino Zero added
- Arduino Pro 1284 (24MHz) added (thanks Budvar10)
01/05/2018
- Adafruit Metro M4 Express (SAMD51 @120MHz, cache on) added (thanks gdsports)
18/09/2019
- Teensy 4.0 added
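The update notes say the optimization level is now fixed to -O1, but not how. One way to pin it from inside a sketch when building with GCC is a function-level optimize pragma. This may well differ from what the attached code does, and `int_loop` is only a hypothetical stand-in for a benchmark function.

```cpp
#include <cstdint>

// GCC-specific: request -O1 for the functions defined between push and pop,
// whatever optimization level the board package passes on the command line.
#pragma GCC push_options
#pragma GCC optimize ("O1")

// Hypothetical integer kernel: the volatile accumulator keeps the loop
// from being collapsed into a closed-form sum.
std::uint32_t int_loop(std::uint32_t iterations) {
    volatile std::uint32_t acc = 0;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        acc = acc + i;
    }
    return acc;
}

#pragma GCC pop_options
```

Keeping the pragma next to the benchmark functions means every board is measured at the same level without editing each platform's build settings.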
Generic STM32F103C8T6 72MHz (Cortex-M3)
INT_LOOP(30000) bench...= 2924 microseconds 10.26MIPS
LONG_LOOP(30000) bench...= 2926 microseconds 10.25MIPS
FLOAT_DIV(30000) bench...= 27979 microseconds 1.20MFLOPS
DOUBLE_DIV(30000) bench...= 38000 microseconds 0.86MFLOPS
FLOAT_MUL(30000) bench...= 20463 microseconds 1.71MFLOPS
DOUBLE_MUL(30000) bench...= 25891 microseconds 1.31MFLOPS
Arduino Nano (ATmega328 16MHz AVR)
INT_LOOP(30000) bench...= 7544 microseconds 3.98MIPS
LONG_LOOP(30000) bench...= 13408 microseconds 2.24MIPS
FLOAT_DIV(30000) bench...= 154792 microseconds 0.21MFLOPS
DOUBLE_DIV(30000) bench...= 154800 microseconds 0.21MFLOPS
FLOAT_MUL(30000) bench...= 156744 microseconds 0.21MFLOPS
DOUBLE_MUL(30000) bench...= 156736 microseconds 0.21MFLOPS
Arduino Zero (Atmel ATSAMD21G18 48MHz Cortex-M0+)
INT_LOOP(30000) bench...= 116898 microseconds 11.92MIPS
LONG_LOOP(30000) bench...= 116898 microseconds 11.93MIPS
FLOAT_DIV(30000) bench...= 116898 microseconds 0.38MFLOPS
DOUBLE_DIV(30000) bench...= 113126 microseconds 0.27MFLOPS
FLOAT_MUL(30000) bench...= 92387 microseconds 0.33MFLOPS
DOUBLE_MUL(30000) bench...= 116898 microseconds 0.26MFLOPS
Arduino Due (Atmel SAM3X8E 84 MHz Cortex-M3)
INT_LOOP(30000) bench...= 1074 microseconds 27.93MIPS
LONG_LOOP(30000) bench...= 1107 microseconds 27.10MIPS
FLOAT_DIV(30000) bench...= 25859 microseconds 1.21MFLOPS
DOUBLE_DIV(30000) bench...= 37966 microseconds 0.81MFLOPS
FLOAT_MUL(30000) bench...= 18659 microseconds 1.71MFLOPS
DOUBLE_MUL(30000) bench...= 25450 microseconds 1.23MFLOPS
Teensy LC (MKL26Z64 Cortex-M0+ 48MHz)
INT_LOOP(30000) bench...= 2508 microseconds 11.96MIPS
LONG_LOOP(30000) bench...= 2512 microseconds 11.94MIPS
FLOAT_DIV(30000) bench...= 76705 microseconds 0.40MFLOPS
DOUBLE_DIV(30000) bench...= 101840 microseconds 0.30MFLOPS
FLOAT_MUL(30000) bench...= 80471 microseconds 0.38MFLOPS
DOUBLE_MUL(30000) bench...= 106242 microseconds 0.29MFLOPS
Teensy 3.2 (MK20DX256 Cortex-M4 96 MHz)
INT_LOOP(30000) bench...= 940 microseconds 31.91MIPS
LONG_LOOP(30000) bench...= 944 microseconds 31.78MIPS
FLOAT_DIV(30000) bench...= 10977 microseconds 2.99MFLOPS
DOUBLE_DIV(30000) bench...= 21317 microseconds 1.47MFLOPS
FLOAT_MUL(30000) bench...= 8463 microseconds 3.99MFLOPS
DOUBLE_MUL(30000) bench...= 13162 microseconds 2.46MFLOPS
Teensy 3.2 (MK20DX256 Cortex-M4 72MHz)
INT_LOOP(30000) bench...= 1253 microseconds 23.94MIPS
LONG_LOOP(30000) bench...= 1256 microseconds 23.89MIPS
FLOAT_DIV(30000) bench...= 14635 microseconds 2.24MFLOPS
DOUBLE_DIV(30000) bench...= 25083 microseconds 1.26MFLOPS
FLOAT_MUL(30000) bench...= 11288 microseconds 2.99MFLOPS
DOUBLE_MUL(30000) bench...= 17551 microseconds 1.84MFLOPS
ESP8266 esp-12e 160MHz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 751 microseconds 39.95MIPS
FLOAT_DIV(30000) bench...= 7500 microseconds 4.45MFLOPS
DOUBLE_DIV(30000) bench...= 8063 microseconds 4.10MFLOPS
FLOAT_MUL(30000) bench...= 9938 microseconds 3.27MFLOPS
DOUBLE_MUL(30000) bench...= 10688 microseconds 3.02MFLOPS
ESP8266 esp-12e 80MHz
INT_LOOP(30000) bench...= 1504 microseconds 19.95MIPS
LONG_LOOP(30000) bench...= 1501 microseconds 19.99MIPS
FLOAT_DIV(30000) bench...= 15001 microseconds 2.22MFLOPS
DOUBLE_DIV(30000) bench...= 16126 microseconds 2.05MFLOPS
FLOAT_MUL(30000) bench...= 19876 microseconds 1.63MFLOPS
DOUBLE_MUL(30000) bench...= 21377 microseconds 1.51MFLOPS
#From mantoui
Teensy 3.6 @180MHz
INT_LOOP(30000) bench...= 500 microseconds 60.00MIPS
LONG_LOOP(30000) bench...= 502 microseconds 59.76MIPS
FLOAT_DIV(30000) bench...= 2503 microseconds 14.99MFLOPS
DOUBLE_DIV(30000) bench...= 9343 microseconds 3.39MFLOPS
FLOAT_MUL(30000) bench...= 667 microseconds 181.82MFLOPS
DOUBLE_MUL(30000) bench...= 7008 microseconds 4.61MFLOPS
Teensy 3.6 @120MHz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 14019 microseconds 2.26MFLOPS
FLOAT_MUL(30000) bench...= 1001 microseconds 120.97MFLOPS
DOUBLE_MUL(30000) bench...= 10514 microseconds 3.07MFLOPS
Teensy 3.5 @120MHz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
FLOAT_DIV(30000) bench...= 3758 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 18797 microseconds 1.66MFLOPS
FLOAT_MUL(30000) bench...= 1003 microseconds 120.97MFLOPS
DOUBLE_MUL(30000) bench...= 10529 microseconds 3.07MFLOPS
Teensy 3.2 @120MHz
INT_LOOP(30000) bench...= 751 microseconds 39.95MIPS
LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
FLOAT_DIV(30000) bench...= 8784 microseconds 3.74MFLOPS
DOUBLE_DIV(30000) bench...= 17559 microseconds 1.79MFLOPS
FLOAT_MUL(30000) bench...= 6771 microseconds 4.99MFLOPS
DOUBLE_MUL(30000) bench...= 10533 microseconds 3.07MFLOPS
Dragonfly @80MHz
INT_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
LONG_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
FLOAT_DIV(30000) bench...= 5641 microseconds 6.65MFLOPS
DOUBLE_DIV(30000) bench...= 21813 microseconds 1.45MFLOPS
FLOAT_MUL(30000) bench...= 1883 microseconds 39.79MFLOPS
DOUBLE_MUL(30000) bench...= 16173 microseconds 1.99MFLOPS
#From Budvar10
Arduino-PRO 1284 (ATmega1284P 24MHz)
INT_LOOP(30000) bench...= 5024 microseconds 5.97MIPS
LONG_LOOP(30000) bench...= 8992 microseconds 3.34MIPS
FLOAT_DIV(30000) bench...= 96789 microseconds 0.34MFLOPS
DOUBLE_DIV(30000) bench...= 96800 microseconds 0.34MFLOPS
FLOAT_MUL(30000) bench...= 98058 microseconds 0.34MFLOPS
DOUBLE_MUL(30000) bench...= 98059 microseconds 0.34MFLOPS
#From gdsports
Adafruit Metro M4 Express (SAMD51 @120MHz, cache on)
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 14022 microseconds 2.26MFLOPS
FLOAT_MUL(30000) bench...= 1002 microseconds 120.48MFLOPS
DOUBLE_MUL(30000) bench...= 10516 microseconds 3.07MFLOPS
Teensy 4.0 @600MHz
FLOAT_DIV(30000) bench...= 200 microseconds 300.00MFLOPS
DOUBLE_DIV(30000) bench...= 201 microseconds 297.03MFLOPS
FLOAT_MUL(30000) bench...= 150 microseconds 600.00MFLOPS
DOUBLE_MUL(30000) bench...= 300 microseconds 150.00MFLOPS
Time (ms)...= 396577 ms
INT_LOOP(30000) bench...= 300 microseconds 600.00MIPS
LONG_LOOP(30000) bench...= 300 microseconds 300.00MIPS
FLOAT_DIV(30000) bench...= 300 microseconds 300.00MFLOPS
The code is attached.
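The post does not say how the MIPS and MFLOPS columns are computed. The listed figures are consistent with the reading sketched below, which is inferred from the numbers rather than taken from the attached sketch: integer rates are iterations divided by elapsed microseconds, and floating-point rates first subtract the LONG_LOOP time as loop overhead. The function names are hypothetical.

```cpp
#include <cstdint>

// Iterations per microsecond is the same as millions of operations per second.
double mips(std::uint32_t iterations, double elapsed_us) {
    return iterations / elapsed_us;
}

// Assumption: the bare loop cost (the LONG_LOOP time on the same board)
// is subtracted before the floating-point rate is computed.
double mflops(std::uint32_t iterations, double elapsed_us, double loop_us) {
    return iterations / (elapsed_us - loop_us);
}
```

For the STM32F103 rows, `mips(30000, 2924)` gives about 10.26 and `mflops(30000, 27979, 2926)` gives about 1.20. For the Teensy 3.6 at 180MHz, `mflops(30000, 667, 502)` gives about 181.82, which is why a board with a hardware FPU can show a FLOAT_MUL rate far above its integer rate.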
Very soon I will have a comparison of the typical A/D converter noise on the different platforms. The Teensy boards clearly have more muscle, and their integer performance per MHz is also very solid. However, in terms of cost/performance, the STM32 board, a generic clone bought for around $2, is difficult to beat.
Cheers!
Trycage
bench_test_101.ino (3.39 KB)