Any ideas why the DOUBLE_DIV() test is slow on Teensy 3.5?
Due, slightly modified; seems more reasonable.
Time (ms)...= 12083 ms
INT_LOOP(30000) bench...= 1151 microseconds 26.06MIPS
LONG_LOOP(30000) bench...= 1131 microseconds 26.53MIPS
FLOAT_DIV(30000) bench...= 28098 microseconds 1.11MFLOPS
DOUBLE_DIV(30000) bench...= 36951 microseconds 0.84MFLOPS
FLOAT_MUL(30000) bench...= 19788 microseconds 1.61MFLOPS
DOUBLE_MUL(30000) bench...= 24436 microseconds 1.29MFLOPS
It turns out that Due compiles with optimization flag "-Os", while Teensy3 compiles with just "-O". On AVR, -Os seems to incorporate nearly all of the useful optimizations from -O, but that doesn't seem to be the case for ARM. With -Os, Due produces code like this for the integer loop:
for (ic=ie; ic<(ie+30000); ic++) //this syntax avoid compiler semplifications
8018a: 460b mov r3, r1
8018c: f501 42ea add.w r2, r1, #29952 ; 0x7500
80190: 322f adds r2, #47 ; 0x2f
80192: 429a cmp r2, r3
80194: db01 blt.n 8019a
80196: 3301 adds r3, #1
80198: e7f8 b.n 8018c
Notice that the branch at the end goes back to 8018c (the "add.w" instruction), so there are six instructions in the loop. With -O, it does:
for (ic=ie; ic<(ie+30000); ic++) //this syntax avoid compiler semplifications
80186: 6823 ldr r3, [r4, #0]
80188: 4aa0 ldr r2, [pc, #640] ; (8040c )
8018a: 6013 str r3, [r2, #0]
8018c: f503 42ea add.w r2, r3, #29952 ; 0x7500
80190: 3230 adds r2, #48 ; 0x30
80192: 4293 cmp r3, r2
80194: da05 bge.n 801a2
80196: f247 5330 movw r3, #30000 ; 0x7530
8019a: 3b01 subs r3, #1
8019c: d1fd bne.n 8019a
This reduces the loop to a single instruction that it does an "equivalent" number of times, but it has to check the initial condition separately, so it's a bit bigger. (Why they can't use the same 30000 for the add and the loop counter, I'm not sure...) Sketch uses 28,792 bytes, vs Sketch uses 28,044 with -Os - about 3% larger...
Just for kicks, using -O3 makes for 29,528 bytes, and succeeds in completely optimizing the loops away, giving a speed of up to 769 MIPS :-)
Riva: Nice to have some more benchmarking tests, but the test will give wrong values for the loop times because you're printing several lines of text before calculating the loop duration. I think the elapsed=micros()-elapsed; should come directly after the for loop.
Thanks Riva,
I will implement your modification.
trycage
@ron_sutherland, @hansibull and @Budvar10, if it is not too much trouble, could you re-run the latest version of bench (which now fixes the compiler options on every platform)? I will publish your results at the top of the post.
Thanks
Trycage
I suggest modifying a volatile variable inside the loop.
volatile byte dosomething;
:
for (lc=le; lc<(le+30000); lc++) //this syntax avoid compiler semplifications
{
dosomething = 0;
}
Because null loops are pretty boring. Then you won't need to be so tricky with your loops, either...
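A minimal host-side sketch of this suggestion, assuming a desktop C++ compiler where `std::chrono::steady_clock` stands in for Arduino's `micros()`; the function name `time_volatile_loop` is made up for illustration. A store to a `volatile` object counts as an observable side effect, so the compiler has to perform it on every iteration and the loop survives optimization even with a plain constant bound.

```cpp
#include <chrono>
#include <cstdint>

// Written on every iteration; volatile forces the compiler to keep the store.
volatile std::uint8_t dosomething;

// Runs `iterations` volatile stores and returns the elapsed time in microseconds.
// steady_clock is a host-side stand-in for micros() (an assumption, not the benchmark's API).
std::int64_t time_volatile_loop(long iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (long lc = 0; lc < iterations; lc++) {
        dosomething = 0;
    }
    // Second timestamp is taken right after the loop, before any printing.
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_volatile_loop(30000)` mirrors the iteration count used by the bench.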
Various "long" variables used to hold timestamps should be "unsigned long"
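The reason for `unsigned long` can be sketched as follows, assuming a 32-bit microsecond counter that wraps around like the one `micros()` returns; `elapsed_us` is a hypothetical helper, not part of the bench. Unsigned subtraction is modulo 2^32, so the difference stays correct even when the counter overflows between the two readings.

```cpp
#include <cstdint>

// Elapsed microseconds between two readings of a 32-bit wrapping counter.
// Unsigned arithmetic is modulo 2^32, so a wrap between `start` and `now` cancels out.
std::uint32_t elapsed_us(std::uint32_t start, std::uint32_t now) {
    return now - start;
}
```

For example, a start reading of 4294967290 followed by a reading of 10 (after the wrap) still yields 16 microseconds. With a signed type the same subtraction would give a negative duration.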
...or put NOP instruction there
{
__asm__ __volatile__("nop"); // AVR
}
EDIT: Forget this. I completely overlooked that several different processors are involved; it was a stupid idea. :)
"nop" isn't guaranteed to be the right assembly on all chips. (Mind you, you'd have to be out of your mind as a chip designer not to have a "nop" instruction, but it could happen...)
@westfw Yes, yes, you posted just as I was realizing my mistake... :-*
Thanks westfw. Our initial version of the code included some operations in the INT loop, but we reasoned that the FOR statement already contains an increment operation. The code uses the INT loop to calibrate the speed of the FLOAT loop, and it is probably good enough for a rough comparison between the platforms we have.
I could probably code a WHILE statement in which the comparison and the increment appear as separate, recognizable operations, but I have the feeling that it would not make much difference to the compiler.
Thanks a lot for the input.
FYI, here are some more results v1.01 (pragma -O1) on Teensy 3.5/3.6/3.2 and on dragonfly (STM32L4@80MHz, hardware float)
t3.6 @180mhz
INT_LOOP(30000) bench...= 500 microseconds 60.00MIPS
LONG_LOOP(30000) bench...= 502 microseconds 59.76MIPS
FLOAT_DIV(30000) bench...= 2503 microseconds 14.99MFLOPS
DOUBLE_DIV(30000) bench...= 9343 microseconds 3.39MFLOPS
FLOAT_MUL(30000) bench...= 667 microseconds 181.82MFLOPS
DOUBLE_MUL(30000) bench...= 7008 microseconds 4.61MFLOPS
t3.6 @120mhz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 14019 microseconds 2.26MFLOPS
FLOAT_MUL(30000) bench...= 1001 microseconds 120.97MFLOPS
DOUBLE_MUL(30000) bench...= 10514 microseconds 3.07MFLOPS
t3.5@120mhz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
FLOAT_DIV(30000) bench...= 3758 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 18797 microseconds 1.66MFLOPS
FLOAT_MUL(30000) bench...= 1003 microseconds 120.97MFLOPS
DOUBLE_MUL(30000) bench...= 10529 microseconds 3.07MFLOPS
t3.2@120mhz
INT_LOOP(30000) bench...= 751 microseconds 39.95MIPS
LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
FLOAT_DIV(30000) bench...= 8784 microseconds 3.74MFLOPS
DOUBLE_DIV(30000) bench...= 17559 microseconds 1.79MFLOPS
FLOAT_MUL(30000) bench...= 6771 microseconds 4.99MFLOPS
DOUBLE_MUL(30000) bench...= 10533 microseconds 3.07MFLOPS
dragonfly@80MHz
INT_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
LONG_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
FLOAT_DIV(30000) bench...= 5641 microseconds 6.65MFLOPS
DOUBLE_DIV(30000) bench...= 21813 microseconds 1.45MFLOPS
FLOAT_MUL(30000) bench...= 1883 microseconds 39.79MFLOPS
DOUBLE_MUL(30000) bench...= 16173 microseconds 1.99MFLOPS
-Updated Added Arduino Zero and Arduino Pro 1284 (Thanks Budvar10)
Adafruit Metro M4 Express (samd51 @120MHz) cache on
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
DOUBLE_DIV(30000) bench...= 14022 microseconds 2.26MFLOPS
FLOAT_MUL(30000) bench...= 1002 microseconds 120.48MFLOPS
DOUBLE_MUL(30000) bench...= 10516 microseconds 3.07MFLOPS
@gdsports Thanks!!!!
Then:
-Update Added Adafruit Metro M4 Express (Thanks gdsports)
Operations taking less time than the calibration loop? Not possible. Maybe the time values are formatted incorrectly.
Arduino Zero (Atmel ATSAMD21G18 48MHz Cortex-M0+)
INT_LOOP(30000) bench…= 116898 microseconds 11.92MIPS
LONG_LOOP(30000) bench…= 116898 microseconds 11.93MIPS
FLOAT_DIV(30000) bench…= 116898 microseconds 0.38MFLOPS
DOUBLE_DIV(30000) bench…= 113126 microseconds 0.27MFLOPS
FLOAT_MUL(30000) bench…= 92387 microseconds 0.33MFLOPS
DOUBLE_MUL(30000) bench…= 116898 microseconds 0.26MFLOPS
At high speed the results are imprecise:
Teensy 3.6 (Cortex-M4 @ 180MHz). The result of FLOAT_MUL is 181.82 MIPS.
Each iteration of the empty reference loop consists of the following high-level operations:
1)increment
2)compare
3)jump
It takes 502 microseconds for 30000 iterations, i.e. 59.76 million loops per second. In high-level operations that is 59.76*3 = 179.28 MIPS.
How is it possible to achieve 181.82 MIPS with FLOAT_MUL? Without optimizations it should be at most 180 MIPS, or perhaps 179.28.
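The arithmetic above, written out as a small sketch. The helper names are made up for illustration, and the factor of three is this post's own assumption of three high-level operations per empty-loop iteration.

```cpp
// Millions of loop iterations per second: iterations divided by microseconds.
double mloops_per_second(long iterations, double elapsed_us) {
    return static_cast<double>(iterations) / elapsed_us;
}

// Scales the loop rate by the assumed number of high-level operations per iteration.
double high_level_mips(long iterations, double elapsed_us, int ops_per_iteration) {
    return mloops_per_second(iterations, elapsed_us) * ops_per_iteration;
}
```

With the Teensy 3.6 figures, `mloops_per_second(30000, 502)` comes to about 59.76 and `high_level_mips(30000, 502, 3)` to about 179.28.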
Each test iteration is an operation plus an assignment, and perhaps the assignment time is negligible. Including an assignment of a constant in the LONG calibration loop, as suggested by westfw, may be a better approach.
It might be interesting to measure the assignment time (and MIPS) for different data types.
The attachment contains a comparative table of operation MIPS, counting 3 operations per loop iteration.
Arduino Benchmark-v0101.doc (22 KB)
This is the code of FDIV loop:
fa=(float)random(1,2);
fb=(float)random(1,1000);
fb=0; // this line must be suppressed
fg=0;
le=random(1,2);
elapsed=micros();
for (lc=le; lc<(le+30000); lc++)
{
fb=fb/fa;
}
elapsed=micros()-elapsed;
If fb is initialized to 0, every operation is 0./fa, so this initialization must be removed, here and in the DDIV loop.
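A small sketch of why the stray `fb=0;` matters, written as plain C++ so it can be checked off-target; `repeated_divide` is a hypothetical helper, not part of the bench. Starting from zero, every iteration computes 0/fa and the value never changes, so the loop no longer exercises a general division.

```cpp
// Applies fb = fb / fa `iterations` times and returns the final value.
float repeated_divide(float fb, float fa, int iterations) {
    for (int i = 0; i < iterations; i++) {
        fb = fb / fa;
    }
    return fb;
}
```

`repeated_divide(0.0f, fa, 30000)` returns 0 for any nonzero `fa`. It may also be worth checking the divisor: as far as I know, Arduino's `random(min, max)` excludes the upper bound, so `random(1,2)` always yields 1 and `fb/fa` would leave `fb` unchanged even without the zero.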
The variables fg and dg serve no purpose and can be removed.
The addition inside the for condition is extra overhead.
Proposed code for FDIV:
fa=(float)random(1,2);
fb=(float)random(1,1000);
le=random(1,2);
lg=le+30000;
elapsed=micros();
for (lc=le; lc<lg; lc++) //this syntax avoid compiler semplifications?
{
fb=fb/fa;
}
elapsed=micros()-elapsed;
// compute MIPS and display
The int loop could be an ISUM (integer sum):
ia=random(1,2);
ib=random(1,1000);
le=random(1,2);
lg=le+30000;
elapsed=micros();
for (lc=le; lc<lg; lc++) //this syntax avoid compiler semplifications?
{
ib=ib+ia;
}
elapsed=micros()-elapsed;
// compute MIPS and display
Thanks Moises. I am grateful you took the time to look at the code.
I wrote the code a while ago (indeed, 180MHz microcontrollers were not exactly a target). If I recall correctly, I tried to make all the loops look similar "in structure" to the calibration loop (so I could remove the loop weight). A float should give about 180MFLOPS on a Cortex-M4 with FPU. I see your points; however, the accuracy is considerably undermined by the use of the function micros (which has a granularity of 8 microseconds), and a loop of 30000 is probably far from sufficient. Actually I think 181.82MFLOPS is quite close, but that number of digits is definitely pointless.
The "DUMMY" assignments were made (if I still recall) because they had some effect on the compiled code. A better programmer would probably have coded directly in assembler, taking care to make all the loops exactly the same (and I am also a lazy programmer most of the time!).
I recall testing the different suggestions (looking at the compiled code), but I did not have time to improve the bench for high speed (without affecting the old results).
:)
Marco
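To put a rough number on the granularity point, here is a sketch of the worst-case relative error caused by timer resolution alone. The helper name is made up, and the 8-microsecond step is simply the figure quoted in the post, not a measured value.

```cpp
// Worst-case relative error (in percent) when a duration of `elapsed_us`
// is measured with a timer that only advances in steps of `granularity_us`.
double quantization_error_percent(double elapsed_us, double granularity_us) {
    return 100.0 * granularity_us / elapsed_us;
}
```

For a 500-microsecond run, `quantization_error_percent(500, 8)` gives 1.6%, so a reading of 60.00 MIPS would only be good to about 1 MIPS either way. A longer loop shrinks this error proportionally.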
ESP32
INT_LOOP(30000) bench...= 1 microseconds 30000.00MIPS
LONG_LOOP(30000) bench...= 1 microseconds 30000.00MIPS
FLOAT_DIV(30000) bench...= 6420 microseconds 4.67MFLOPS
DOUBLE_DIV(30000) bench...= 5036 microseconds 5.96MFLOPS
FLOAT_MUL(30000) bench...= 501 microseconds 60.00MFLOPS
DOUBLE_MUL(30000) bench...= 5544 microseconds 5.41MFLOPS
I also wrote a benchmark for different MCUs, both AVRs and ARMs.
The benchmark performs low- and high-level tests for integers, floats, doubles, bit shifts, random numbers, sorting, matrix algebra, GPIO r/w, and graphics.
The test will run even without a TFT attached; you may keep the #included Adafruit libs or optionally substitute proprietary ones.
Update: the test for the Raspberry Pi has now also been completed.
As AVRs don't feature 64-bit doubles, the 32-bit float test is performed twice, without issuing penalty points though (which admittedly is not fair to the ARM boards).
( … to be continued … )
test design:
0 int_Add 50,000,000 int +,- plus counter
1 int_Mult 10,000,000 int *,/ plus counter
2 fp32_ops 2,500,000 fp32 mult, transc. plus counter
3 fp64_ops 2,500,000 fp64 mult, transc. plus counter (if N/A: 32bit)
4 randomize 2,500,000 Mersenne PRNG (+ * & ^ << >>)
5 matrx_algb 150,000 2D Matrix algebra (mult, det)
6 arr_sort 1500 shellsort of random array[500]
7 GPIO toggle 6,000,000 toggle GPIO r/w plus counter
8 Graphics 10*8 textlines + 10*8 shapes + 20 clrscr
Reference values (update: now also completed for the Raspi; "ges." in the output below abbreviates the German "gesamt", i.e. total):
Arduino MEGA + ILI9225 + Karlson UTFT + Arduino GPIO-r/w
0 90244 int_Add
1 237402 int_Mult
2 163613 fp32_ops(float)
3 163613 fp32_ops(float=double)
4 158567 randomize
5 46085 matrx_algb
6 23052 arr_sort
7 41569 GPIO toggle
8 62109 Graphics
runtime ges.: 986254
benchmark: 51
Arduino MEGA + ILI9225 + Karlson UTFT + Register bitRead/Write
0 90238 int_Add
1 237387 int_Mult
2 163602 fp32_ops (float)
3 163602 fp32_ops (float=double)
4 158557 randomize
5 45396 matrx_algb
6 23051 arr_sort
7 4528 GPIO_toggle bit r/w
8 62106 Graphics
runtime ges.: 948467
benchmark: 53
Arduino MEGA + adafruit_ILI9341 Hardware-SPI Arduino GPIO r/w
0 90244 int_Add
1 237401 int_Mult
2 163612 fp32_ops (float)
3 163612 fp32_ops (float=double)
4 158725 randomize
5 46079 matrx_algb
6 23051 arr_sort
7 41947 GPIO toggle
8 6915 Graphics
runtime ges.: 931586
benchmark: 54
Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI
0 7746 int_Add
1 15795 int_Mult
2 89054 fp32_ops
3 199888 fp64_ops(double)
4 17675 randomize
5 18650 matrx_algb
6 6328 arr_sort
7 9944 GPIO_toggle
8 6752 Graphics
runtime ges.: 371832
benchmark: 134
Arduino DUE + adafruit_ILI9341 Hardware-SPI
0 4111 int_Add
1 1389 int_Mult
2 29124 fp32_ops(float)
3 57225 fp64_ops(double)
4 3853 randomize
5 4669 matrx_algb
6 2832 arr_sort
7 11859 GPIO_toggle
8 6142 Graphics
runtime ges.: 121204
benchmark: 413
Arduino/Adafruit M4 + adafruit_HX8357 Hardware-SPI
0 2253 int_Add
1 872 int_Mult
2 2773 fp32_ops (float)
3 24455 fp64_ops (double)
4 1680 randomize
5 1962 matrx_algb
6 1553 arr_sort
7 2395 GPIO_toggle
8 4600 Graphics
runtime ges.: 39864
benchmark: 1254
Arduino/Adafruit ESP32 + adafruit_HX8357 Hardware-SPI
0 2308 int_Add
1 592 int_Mult
2 1318 fp32_ops
3 14528 fp64_ops
4 825 randomize
5 1101 matrx_algb
6 687 arr_sort
7 972 GPIO_toggle
8 3053 Graphics
runtime ges.: 25384
benchmark: 1969
Raspberry Pi:
Raspi 2 (v1): 4x 900MHz, GPU 400MHz, no CPU overclock, full-HD, openVG:
0 384 int_Add
1 439 int_Mult
2 346 fp32_ops(float)
3 441 fp64_ops(double)
4 399 randomize
5 173 matrx_algb
6 508 arr_sort
7 823 GPIO_toggle
8 2632 graphics
runtime ges.: 6145
benchmark: 8137
edit: updated for
Arduino/Adafruit Feather ESP32
Ardubench_22_ILI9341_Adafruit.zip (4.88 KB)
Trycage, thanks a lot for posting the list of benchmarks, and for updating them as new boards appeared.
To avoid transcription errors on my part, I grabbed the table as-is and parsed it with Python. I've normalized the figures relative to the STM32, then took the mean of the normalized figures for each device and plotted them.
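A rough sketch of that normalization step, shown in C++ for consistency with the rest of the thread rather than in the Python actually used. The function name is hypothetical, and it is an assumption that each device's figures are divided test-by-test by the reference device's figures before averaging; the same code applies whether timings or Mops values are fed in.

```cpp
#include <cstddef>
#include <vector>

// Mean of the element-wise ratios device[i] / reference[i].
// Assumes both vectors list the same tests in the same order and are non-empty.
double mean_normalized(const std::vector<double>& device,
                       const std::vector<double>& reference) {
    double sum = 0.0;
    for (std::size_t i = 0; i < device.size(); i++) {
        sum += device[i] / reference[i];
    }
    return sum / static_cast<double>(device.size());
}
```

Feeding the reference device's own figures in as `device` returns 1.0, which is a quick sanity check on the parsing.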
It seems that the MIPS & MFLOPS were calculated from the timings in the same way for the STM32, Arduino Nano & Due, Teensy LC/3.2/3.2@120MHz, and ESP8266. The timings for the Arduino Zero seem overlong, although its MIPS & MFLOPS look OK. For the Teensy 3.5/3.6/4.0 and Dragonfly, the MIPS and MFLOPS appear to have been calculated in a way that gives roughly doubled values.
I guess that this comes from getting the timings from different people, and possibly different versions of the benchmark, although it would be nice to figure out where this difference is coming from.
Larger image: https://i.imgur.com/PlJjQ72.png
Puzzling over why there was an inconsistency between the two ways of plotting the benchmarks, I've looked at the C++ code of the benchmark and I can see that the Mops figure is simply 30000 divided by the timing. So if the table is accurate, then multiplying the timing by the Mops figure should regenerate that 30000 figure, i.e. the number of times the test was run.
I've written another Python program to parse the benchmark table and highlight in red any row where that figure is not within 10% of 30000. Even with that wide tolerance, many rows fall outside it.
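The check itself is simple enough to sketch. This is not the Python program mentioned above, just a minimal C++ illustration of the same test with made-up function names; the 30000 count and the 10% tolerance are the values stated in the post.

```cpp
#include <cmath>

// Number of iterations implied by a table row: time (us) times rate (millions per second).
double implied_iterations(double time_us, double mops) {
    return time_us * mops;
}

// True when the implied count is within `tolerance` (0.10 means 10%) of `expected`.
bool is_consistent(double time_us, double mops,
                   double expected = 30000.0, double tolerance = 0.10) {
    return std::fabs(implied_iterations(time_us, mops) - expected) <= tolerance * expected;
}
```

For the STM32 rows, `is_consistent(2924, 10.26)` holds (implied count 30000.24), while `is_consistent(27979, 1.20)` does not (33574.80).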
I don't know why that should be so. Anyway, that's enough for today, I'll try to figure that out some other time.
The python parsing code:
The parsed table:
TEST TIME MOPS TIME*MOPS
---- ---- ---- ---------
[color=grey]STM32F103C8T6 72MHz (Cortex-M3)[/color]
INT_LOOP 2924 μs 10.26 Mips 30000.24
LONG_LOOP 2926 μs 10.25 Mips 29991.50
FLOAT_DIV 27979 μs 1.20 Mflops [color=red] 33574.80[/color]
DOUBLE_DIV 38000 μs 0.86 Mflops 32680.00
FLOAT_MUL 20463 μs 1.71 Mflops [color=red] 34991.73[/color]
DOUBLE_MUL 25891 μs 1.31 Mflops [color=red] 33917.21[/color]
[color=grey]Arduino Nano (ATMega328 16MHz AVR)[/color]
INT_LOOP 7544 μs 3.98 Mips 30025.12
LONG_LOOP 13408 μs 2.24 Mips 30033.92
FLOAT_DIV 154792 μs 0.21 Mflops 32506.32
DOUBLE_DIV 154800 μs 0.21 Mflops 32508.00
FLOAT_MUL 156744 μs 0.21 Mflops 32916.24
DOUBLE_MUL 156736 μs 0.21 Mflops 32914.56
[color=grey]Arduino Zero (Atmel ATSAMD21G18 48MHz Cortex-M0+)[/color]
INT_LOOP 116898 μs 11.92 Mips [color=red] 1393424.16[/color]
LONG_LOOP 116898 μs 11.93 Mips [color=red] 1394593.14[/color]
FLOAT_DIV 116898 μs 0.38 Mflops [color=red] 44421.24[/color]
DOUBLE_DIV 113126 μs 0.27 Mflops 30544.02
FLOAT_MUL 92387 μs 0.33 Mflops 30487.71
DOUBLE_MUL 116898 μs 0.26 Mflops 30393.48
[color=grey]Arduino Due (Atmel SAM3X8E 84 MHz Cortex-M3)[/color]
INT_LOOP 1074 μs 27.93 Mips 29996.82
LONG_LOOP 1107 μs 27.10 Mips 29999.70
FLOAT_DIV 25859 μs 1.21 Mflops 31289.39
DOUBLE_DIV 37966 μs 0.81 Mflops 30752.46
FLOAT_MUL 18659 μs 1.71 Mflops 31906.89
DOUBLE_MUL 25450 μs 1.23 Mflops 31303.50
[color=grey]Teensy LC (MKL26Z64 Cortex-M0 48MHz)[/color]
INT_LOOP 2508 μs 11.96 Mips 29995.68
LONG_LOOP 2512 μs 11.94 Mips 29993.28
FLOAT_DIV 76705 μs 0.40 Mflops 30682.00
DOUBLE_DIV 101840 μs 0.30 Mflops 30552.00
FLOAT_MUL 80471 μs 0.38 Mflops 30578.98
DOUBLE_MUL 106242 μs 0.29 Mflops 30810.18
[color=grey]Teensy 3.2 (MK20DX256 Cortex-M4 96 MHz)[/color]
INT_LOOP 940 μs 31.91 Mips 29995.40
LONG_LOOP 944 μs 31.78 Mips 30000.32
FLOAT_DIV 10977 μs 2.99 Mflops 32821.23
DOUBLE_DIV 21317 μs 1.47 Mflops 31335.99
FLOAT_MUL 8463 μs 3.99 Mflops [color=red] 33767.37[/color]
DOUBLE_MUL 13162 μs 2.46 Mflops 32378.52
[color=grey]Teensy 3.2 (MK20DX256 Cortex-M4 72MHz)[/color]
INT_LOOP 1253 μs 23.94 Mips 29996.82
LONG_LOOP 1256 μs 23.89 Mips 30005.84
FLOAT_DIV 14635 μs 2.24 Mflops 32782.40
DOUBLE_DIV 25083 μs 1.26 Mflops 31604.58
FLOAT_MUL 11288 μs 2.99 Mflops [color=red] 33751.12[/color]
DOUBLE_MUL 17551 μs 1.84 Mflops 32293.84
[color=grey]ESP8266 esp-12e 160MHz[/color]
INT_LOOP 752 μs 39.89 Mips 29997.28
LONG_LOOP 751 μs 39.95 Mips 30002.45
FLOAT_DIV 7500 μs 4.45 Mflops [color=red] 33375.00[/color]
DOUBLE_DIV 8063 μs 4.10 Mflops [color=red] 33058.30[/color]
FLOAT_MUL 9938 μs 3.27 Mflops 32497.26
DOUBLE_MUL 10688 μs 3.02 Mflops 32277.76
[color=grey]ESP8266 esp-12e 80MHz[/color]
INT_LOOP 1504 μs 19.95 Mips 30004.80
LONG_LOOP 1501 μs 19.99 Mips 30004.99
FLOAT_DIV 15001 μs 2.22 Mflops [color=red] 33302.22[/color]
DOUBLE_DIV 16126 μs 2.05 Mflops [color=red] 33058.30[/color]
FLOAT_MUL 19876 μs 1.63 Mflops 32397.88
DOUBLE_MUL 21377 μs 1.51 Mflops 32279.27
[color=grey]#From mantoui[/color]
[color=grey]teensy3.6 @180mhz[/color]
INT_LOOP 500 μs 60.00 Mips 30000.00
LONG_LOOP 502 μs 59.76 Mips 29999.52
FLOAT_DIV 2503 μs 14.99 Mflops [color=red] 37519.97[/color]
DOUBLE_DIV 9343 μs 3.39 Mflops 31672.77
FLOAT_MUL 667 μs 181.82 Mflops [color=red] 121273.94[/color]
DOUBLE_MUL 7008 μs 4.61 Mflops 32306.88
[color=grey]teensy3.6 @120mhz[/color]
INT_LOOP 752 μs 39.89 Mips 29997.28
LONG_LOOP 753 μs 39.84 Mips 29999.52
FLOAT_DIV 3756 μs 9.99 Mflops [color=red] 37522.44[/color]
DOUBLE_DIV 14019 μs 2.26 Mflops 31682.94
FLOAT_MUL 1001 μs 120.97 Mflops [color=red] 121090.97[/color]
DOUBLE_MUL 10514 μs 3.07 Mflops 32277.98
[color=grey]teensy3.5@120mhz[/color]
INT_LOOP 752 μs 39.89 Mips 29997.28
LONG_LOOP 755 μs 39.74 Mips 30003.70
FLOAT_DIV 3758 μs 9.99 Mflops [color=red] 37542.42[/color]
DOUBLE_DIV 18797 μs 1.66 Mflops 31203.02
FLOAT_MUL 1003 μs 120.97 Mflops [color=red] 121332.91[/color]
DOUBLE_MUL 10529 μs 3.07 Mflops 32324.03
[color=grey]teensy3.2@120mhz[/color]
INT_LOOP 751 μs 39.95 Mips 30002.45
LONG_LOOP 755 μs 39.74 Mips 30003.70
FLOAT_DIV 8784 μs 3.74 Mflops 32852.16
DOUBLE_DIV 17559 μs 1.79 Mflops 31430.61
FLOAT_MUL 6771 μs 4.99 Mflops [color=red] 33787.29[/color]
DOUBLE_MUL 10533 μs 3.07 Mflops 32336.31
[color=grey]dragonfly@80MHz [/color]
INT_LOOP 1129 μs 26.57 Mips 29997.53
LONG_LOOP 1129 μs 26.57 Mips 29997.53
FLOAT_DIV 5641 μs 6.65 Mflops [color=red] 37512.65[/color]
DOUBLE_DIV 21813 μs 1.45 Mflops 31628.85
FLOAT_MUL 1883 μs 39.79 Mflops [color=red] 74924.57[/color]
DOUBLE_MUL 16173 μs 1.99 Mflops 32184.27
[color=grey]#From Budvar10[/color]
INT_LOOP 5024 μs 5.97 Mips 29993.28
LONG_LOOP 8992 μs 3.34 Mips 30033.28
FLOAT_DIV 96789 μs 0.34 Mflops 32908.26
DOUBLE_DIV 96800 μs 0.34 Mflops 32912.00
FLOAT_MUL 98058 μs 0.34 Mflops [color=red] 33339.72[/color]
DOUBLE_MUL 98059 μs 0.34 Mflops [color=red] 33340.06[/color]
[color=grey]#From gdsports[/color]
INT_LOOP 752 μs 39.89 Mips 29997.28
LONG_LOOP 753 μs 39.84 Mips 29999.52
FLOAT_DIV 3756 μs 9.99 Mflops [color=red] 37522.44[/color]
DOUBLE_DIV 14022 μs 2.26 Mflops 31689.72
FLOAT_MUL 1002 μs 120.48 Mflops [color=red] 120720.96[/color]
DOUBLE_MUL 10516 μs 3.07 Mflops 32284.12
[color=grey]Teensy 4.0 @600MHz[/color]
FLOAT_DIV 200 μs 300.00 Mflops [color=red] 60000.00[/color]
DOUBLE_DIV 201 μs 297.03 Mflops [color=red] 59703.03[/color]
FLOAT_MUL 150 μs 600.00 Mflops [color=red] 90000.00[/color]
DOUBLE_MUL 300 μs 150.00 Mflops [color=red] 45000.00[/color]
INT_LOOP 300 μs 600.00 Mips [color=red] 180000.00[/color]
LONG_LOOP 300 μs 300.00 Mips [color=red] 90000.00[/color]
FLOAT_DIV 300 μs 300.00 Mflops [color=red] 90000.00[/color]