Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)

Due, slightly modified; seems more reasonable.

Time (ms)...= 12083 ms
INT_LOOP(30000) bench...= 1151 microseconds 26.06MIPS
LONG_LOOP(30000) bench...= 1131 microseconds 26.53MIPS
FLOAT_DIV(30000) bench...= 28098 microseconds 1.11MFLOPS
DOUBLE_DIV(30000) bench...= 36951 microseconds 0.84MFLOPS
FLOAT_MUL(30000) bench...= 19788 microseconds 1.61MFLOPS
DOUBLE_MUL(30000) bench...= 24436 microseconds 1.29MFLOPS
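
For anyone wanting to check the figures: they look consistent with "loop passes per microsecond", with the empty-loop time subtracted first for the floating-point lines. That is inferred from the numbers above, not taken from the benchmark source, so treat the helper below (the name mops and its parameters are made up for illustration) as a sketch of the arithmetic only. Note that on this reading "MIPS" counts loop passes, not machine instructions.

```c
#include <stdint.h>

/* Millions of operations per second, counting one operation per loop pass.
   overhead_us is the time of an empty loop of the same length; pass 0 for
   the integer loops themselves. ASSUMPTION: this formula is reverse-
   engineered from the printed results, not copied from the benchmark. */
static double mops(uint32_t passes, uint32_t elapsed_us, uint32_t overhead_us)
{
    /* passes per microsecond equals millions of passes per second */
    return (double)passes / (double)(elapsed_us - overhead_us);
}
```

mops(30000, 1151, 0) comes out at about 26.06 and mops(30000, 28098, 1131) at about 1.11, matching the INT_LOOP and FLOAT_DIV lines. Subtracting 1151 instead of 1131 rounds to the same values, so the output doesn't say which empty loop is used as the baseline.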

It turns out that the Due build compiles with the optimization flag "-Os", while Teensy3 compiles with just "-O".
On AVR, -Os seems to incorporate nearly all of the useful optimizations from -O, but that doesn't seem to be the case for ARM. With -Os, Due produces code like this for the integer loop:

  for (ic=ie; ic<(ie+30000); ic++) //this syntax avoid compiler semplifications
   8018a:    460b          mov    r3, r1
   8018c:    f501 42ea     add.w    r2, r1, #29952    ; 0x7500
   80190:    322f          adds    r2, #47    ; 0x2f
   80192:    429a          cmp    r2, r3
   80194:    db01          blt.n    8019a <loop+0x52>
   80196:    3301          adds    r3, #1
   80198:    e7f8          b.n    8018c <loop+0x44>

Notice that the branch at the end goes back to 8018c (the "add.w" instruction), so there are 6 instructions in the loop, and the loop limit gets recomputed on every pass.
With -O, it does:

  for (ic=ie; ic<(ie+30000); ic++) //this syntax avoid compiler semplifications
   80186:       6823            ldr     r3, [r4, #0]
   80188:       4aa0            ldr     r2, [pc, #640]  ; (8040c <loop+0x2c4>)
   8018a:       6013            str     r3, [r2, #0]
   8018c:       f503 42ea       add.w   r2, r3, #29952  ; 0x7500
   80190:       3230            adds    r2, #48 ; 0x30
   80192:       4293            cmp     r3, r2
   80194:       da05            bge.n   801a2 <loop+0x5a>
   80196:       f247 5330       movw    r3, #30000      ; 0x7530
   8019a:       3b01            subs    r3, #1
   8019c:       d1fd            bne.n   8019a <loop+0x52>

This reduces the loop itself to a decrement and a branch (the "subs"/"bne" pair), executed an "equivalent" number of times, but it has to check the initial condition separately, so the code is a bit bigger. (Why they can't use the same 30000 for the add and the loop counter, I'm not sure...) The sketch uses 28,792 bytes with -O, vs 28,044 bytes with -Os: about 3% larger.
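
In C terms, the two listings have roughly the shapes below. This is a hand-written paraphrase of the disassembly, not compiler output; the function names and the passes counter are added here just so the two versions can be compared.

```c
#include <stdint.h>

/* Shape of the -Os code: the limit ie+30000 is rebuilt and compared on
   every pass. ASSUMPTION: ie is small enough that ie + 30000 does not
   overflow. */
static uint32_t loop_os_shape(int32_t ie)
{
    uint32_t passes = 0;
    for (int32_t ic = ie; ic < ie + 30000; ic++)
        passes++;
    return passes;
}

/* Shape of the -O code: one up-front test of the initial condition
   (cmp/bge), then a bare countdown from 30000 (subs/bne). */
static uint32_t loop_o_shape(int32_t ie)
{
    uint32_t passes = 0;
    if (ie < ie + 30000) {
        uint32_t n = 30000;
        do {
            passes++;
            n--;
        } while (n != 0);
    }
    return passes;
}
```

Both versions make 30000 passes for ie = 0; the difference is only in how much work each pass costs.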

Just for kicks, using -O3 makes for 29,528 bytes, and succeeds in completely optimizing the loops away, giving a speed of up to 769 MIPS :)
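
If the goal is a timing loop that survives -O3, one common trick with GCC (which the Arduino toolchains use) is an empty volatile asm statement in the loop body. The sketch below is not from the benchmark; the function name spin and its parameters are hypothetical, and the barrier is a GCC/Clang extension rather than standard C.

```c
#include <stdint.h>

/* Timing loop that the optimizer is not allowed to delete.
   The asm statement emits no instructions, but because it is volatile
   the compiler has to keep one occurrence per pass, so the loop stays.
   ASSUMPTION: ie + passes does not overflow. */
static uint32_t spin(int32_t ie, int32_t passes)
{
    uint32_t done = 0;
    for (int32_t ic = ie; ic < ie + passes; ic++) {
        __asm__ volatile("" ::: "memory");  /* compiler barrier, no code */
        done++;
    }
    return done;
}
```

The compiler is still free to pick the cheapest loop form it can find (a countdown, say), so this keeps the loop alive but doesn't pin down how many instructions each pass takes.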