Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)

moises1953:
Operations taking less time than the calibration loop? That is not possible. Perhaps the output of the time functions is formatted incorrectly.

Arduino Zero (Atmel ATSAMD21G18 48MHz Cortex-M0+)
INT_LOOP(30000) bench...= 116898 microseconds 11.92MIPS
LONG_LOOP(30000) bench...= 116898 microseconds 11.93MIPS
FLOAT_DIV(30000) bench...= 116898 microseconds 0.38MFLOPS
DOUBLE_DIV(30000) bench...= 113126 microseconds 0.27MFLOPS
FLOAT_MUL(30000) bench...= 92387 microseconds 0.33MFLOPS
DOUBLE_MUL(30000) bench...= 116898 microseconds 0.26MFLOPS

At high speed the results are imprecise:
Teensy 3.6 (Cortex-M4 @ 180 MHz). The result of FLOAT_MUL is 181.82 MIPS.
The empty reference loop repeats the following high-level operations:
1) increment
2) compare
3) jump
It takes 502 microseconds for 30000 iterations, so 59.76 Mloops/s. The high-level operation rate is therefore 59.76 * 3 = 179.28 MIPS.
How is it possible to achieve 181.82 MIPS using FLOAT_MUL? Without optimizations it should be 180 MIPS, or maybe 179.28.

Each benchmarked operation is an operation plus an assignment, and maybe the assignment time was negligible. Including an assignment to a constant in the LONG calibration loop may be a better approach, as suggested by westfw.
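
A minimal sketch of that idea, written in portable C++ so it can be tried on a PC. It is not the benchmark's actual calibration loop: the names `sink` and `calibration_loop_us` and the constant value are made up for illustration, and on an Arduino board the timing would use `micros()` instead of `std::chrono`.

```cpp
#include <chrono>

// volatile so the compiler cannot optimize the store (or the loop) away.
// Hypothetical name, not from the benchmark.
volatile long sink;

// Times a calibration loop whose body only assigns a constant,
// so that the assignment cost is included in the loop weight.
// Returns the elapsed time in microseconds.
long long calibration_loop_us(long iterations) {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++) {
        sink = 12345L;  // assignment to a constant, no arithmetic
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Subtracting this time from a loop that does `sink = a * b;` would then leave roughly the cost of the multiplication alone, assuming the compiler generates the same loop structure in both cases.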

It may be interesting to measure the assignment time (and MIPS) of different data types.

The attachment contains a comparative table of operation MIPS, assigning 3 operations to each loop.

Thanks Moises. I am grateful you took the time to look at the code.

I wrote the code a while ago (indeed, 180 MHz microcontrollers were not exactly a target).

If I recall correctly, I tried to make all the loops look similar in structure to the calibration loop (so I could remove the loop weight). A float operation should give about 180 MFLOPS on a Cortex-M4 with FPU. I see your points; however, the accuracy is quite undermined by the use of the function micros (which has a granularity of 8 microseconds), and a loop of 30000 is probably quite insufficient. Actually I think 181.82 MFLOPS is quite close, but that number of digits is definitely pointless.
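
To put a rough number on the granularity point, here is a small sketch. It assumes the 502 microseconds and 3 operations per loop quoted earlier in the thread, and assumes the true elapsed time lies within one timer step of the measured one. `MipsBounds` and `mips_bounds` are hypothetical names, not part of the benchmark.

```cpp
// Hypothetical helper, not part of the benchmark.
struct MipsBounds {
    double low;   // rate if the true time was one timer step longer
    double high;  // rate if the true time was one timer step shorter
};

// ASSUMPTION: the true elapsed time is within +/- granularity_us of
// measured_us. Requires measured_us > granularity_us.
MipsBounds mips_bounds(unsigned long iterations, unsigned ops_per_loop,
                       double measured_us, double granularity_us) {
    double ops = static_cast<double>(iterations) * ops_per_loop;
    MipsBounds b;
    b.low = ops / (measured_us + granularity_us);
    b.high = ops / (measured_us - granularity_us);
    return b;
}
```

With `mips_bounds(30000, 3, 502.0, 8.0)` the range runs from about 176.5 to about 182.2, so both 179.28 and 181.82 fall inside the measurement uncertainty under these assumptions.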

The "DUMMY" assignments were made (if I still recall) because they had some effect on the compiled code. A better programmer would probably have coded directly in assembler, taking care to make all the loops exactly the same (and I am also a lazy programmer most of the time!).

I recall testing the different suggestions (looking at the compiled code), but I did not have time to improve the bench for high speed (without affecting the old results).

:slight_smile:

Marco