Due, slightly modified; seems more reasonable.
Time (ms)...= 12083 ms
INT_LOOP(30000) bench...= 1151 microseconds 26.06MIPS
LONG_LOOP(30000) bench...= 1131 microseconds 26.53MIPS
FLOAT_DIV(30000) bench...= 28098 microseconds 1.11MFLOPS
DOUBLE_DIV(30000) bench...= 36951 microseconds 0.84MFLOPS
FLOAT_MUL(30000) bench...= 19788 microseconds 1.61MFLOPS
DOUBLE_MUL(30000) bench...= 24436 microseconds 1.29MFLOPS
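As a sanity check on the figures above, here is a rough sketch of the arithmetic that appears to be behind them. The integer results match iterations divided by elapsed microseconds. The float results only match if the empty-loop time (about 1151 microseconds) is subtracted first. That subtraction is inferred from the numbers, not taken from the benchmark source, and the function name `mops` is made up for illustration.

```c
#include <stdint.h>

/* Millions of operations per second for `iterations` operations that took
 * `elapsed_us` microseconds, after subtracting `overhead_us` of empty-loop
 * time. Operations per microsecond is numerically the same as millions of
 * operations per second.
 * ASSUMPTION: the overhead subtraction is inferred from the printed
 * results; the benchmark source is not shown in this post. */
double mops(uint32_t iterations, uint32_t elapsed_us, uint32_t overhead_us)
{
    return (double)iterations / (double)(elapsed_us - overhead_us);
}
```

With the numbers from the run above, `mops(30000, 1151, 0)` comes out at about 26.06 and `mops(30000, 28098, 1151)` at about 1.11, which lines up with the INT_LOOP and FLOAT_DIV lines.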
It turns out that Due compiles with optimization flag "-Os", while Teensy3 compiles with just "-O".
On AVR, -Os seems to incorporate nearly all of the useful optimizations from -O, but that doesn't seem to be the case for ARM. With -Os, Due produces code like this for the integer loop:
for (ic=ie; ic<(ie+30000); ic++) //this syntax avoid compiler semplifications
8018a: 460b mov r3, r1
8018c: f501 42ea add.w r2, r1, #29952 ; 0x7500
80190: 322f adds r2, #47 ; 0x2f
80192: 429a cmp r2, r3
80194: db01 blt.n 8019a <loop+0x52>
80196: 3301 adds r3, #1
80198: e7f8 b.n 8018c <loop+0x44>
Notice that the branch at the end goes back to 8018c (the "add.w" instruction), so there are 6 instructions in the loop (8018c through 80198), and the upper bound gets recomputed on every pass.
With -O, it does:
for (ic=ie; ic<(ie+30000); ic++) //this syntax avoid compiler semplifications
80186: 6823 ldr r3, [r4, #0]
80188: 4aa0 ldr r2, [pc, #640] ; (8040c <loop+0x2c4>)
8018a: 6013 str r3, [r2, #0]
8018c: f503 42ea add.w r2, r3, #29952 ; 0x7500
80190: 3230 adds r2, #48 ; 0x30
80192: 4293 cmp r3, r2
80194: da05 bge.n 801a2 <loop+0x5a>
80196: f247 5330 movw r3, #30000 ; 0x7530
8019a: 3b01 subs r3, #1
8019c: d1fd bne.n 8019a <loop+0x52>
This reduces the loop to a single decrement (plus its branch) that runs an "equivalent" number of times, but the initial condition has to be checked separately, so the code is a bit bigger. (Why they can't use the same 30000 for the add and the loop counter, I'm not sure...) The sketch uses 28,792 bytes, vs 28,044 bytes with -Os: about 3% larger...
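To make the difference between the two listings easier to see, here is a rough C rendering of what each one does. It is only a model of the shape of the generated code, not the benchmark source. The names `bound_of`, `count_up` and `count_down` are made up, and the unsigned cast is just there to mimic the 32-bit wraparound of the add instructions without relying on signed overflow.

```c
#include <stdint.h>

#define N 30000

/* Upper bound ie + N with 32-bit wraparound, as the add instructions
 * compute it. The conversion back to int32_t is implementation-defined
 * in C; it wraps on GCC and Clang, which is what this model assumes. */
static int32_t bound_of(int32_t ie)
{
    return (int32_t)((uint32_t)ie + (uint32_t)N);
}

/* Shape of the -Os code: the bound is recomputed and compared on every pass. */
uint32_t count_up(int32_t ie)
{
    uint32_t passes = 0;
    for (int32_t ic = ie; ic < bound_of(ie); ic++)
        passes++;
    return passes;
}

/* Shape of the -O code: one up-front check, then a plain countdown
 * from N to zero (the subs/bne pair in the listing). */
uint32_t count_down(int32_t ie)
{
    uint32_t passes = 0;
    if (ie >= bound_of(ie))
        return 0;               /* bound wrapped around: loop body never runs */
    for (int32_t n = N; n != 0; n--)
        passes++;
    return passes;
}
```

Both versions run the body the same number of times for any starting value. The up-front check in `count_down` is what covers the case where the bound wraps around, which is why the -O code can't simply drop it.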
Just for kicks, using -O3 makes for 29,528 bytes, and succeeds in completely optimizing the loops away, giving a reported speed of up to 769 MIPS.
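One common way to stop the optimizer from deleting an empty timing loop is to give the body a side effect it can't remove, such as a store to a volatile variable. A minimal sketch, assuming a loop shaped like the one in the listings; `sink` and `int_loop` are made-up names, not part of the benchmark:

```c
#include <stdint.h>

/* Volatile so the compiler must perform every store, even at -O3. */
static volatile int32_t sink;

/* Same loop shape as the benchmark's integer loop, but each pass writes
 * the counter to `sink`, so the loop cannot be optimized away.
 * Returns the last value stored, or `ie` if the body never ran. */
int32_t int_loop(int32_t ie)
{
    sink = ie;
    for (int32_t ic = ie; ic < ie + 30000; ic++)
        sink = ic;
    return sink;
}
```

The catch is that this adds a store to every pass, so the timing is no longer for a bare loop and isn't directly comparable with the figures above.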