Arduino Forum

Topic started by: trycage on Oct 23, 2016, 07:14 pm

Title: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Oct 23, 2016, 07:14 pm
Dear All
In my research lab we are developing a custom application that required a cost-versus-performance evaluation of number-crunching capability. The benchmark we used consists of simple loops, structured so that the compiler cannot simplify them away. It is not very sophisticated, but it resembles many operations we currently perform (including storing the result directly in one of the operands). There are certainly more sophisticated benchmarks around, but I like the idea of sharing the results.
These are the results we got with the Arduino 1.6.9 environment.

--------------------------------------------------------------------
Update - 1.01
-I introduced a modification suggested by Riva
-Westfw pointed out that the default compiler optimization level differs between platforms.
Now the compiler optimization is fixed to -O1 and, as expected, the Due is closer to the Teensy 3.2 in terms of performance
-I added the benchmark on the Teensy LC

05/04/2018 Update on the results
-Arduino Zero added
-Arduino Pro 1284 (24MHz) added (Thanks Budvar10)

01/05/2018
-Adafruit Metro M4 Express (samd51 @120MHz), cache on, added (Thanks gdsports)


Generic STM32F103C8T6 72MHz (Cortex-M3)
INT_LOOP(30000) bench...= 2924 microseconds 10.26MIPS
LONG_LOOP(30000) bench...= 2926 microseconds 10.25MIPS
FLOAT_DIV(30000) bench...= 27979 microseconds 1.20MFLOPS
DOUBLE_DIV(30000) bench...= 38000 microseconds 0.86MFLOPS
FLOAT_MUL(30000) bench...= 20463 microseconds 1.71MFLOPS
DOUBLE_MUL(30000) bench...= 25891 microseconds 1.31MFLOPS

Arduino Nano (ATMega328 16MHz AVR)
INT_LOOP(30000) bench...= 7544 microseconds 3.98MIPS
LONG_LOOP(30000) bench...= 13408 microseconds 2.24MIPS
FLOAT_DIV(30000) bench...= 154792 microseconds 0.21MFLOPS
DOUBLE_DIV(30000) bench...= 154800 microseconds 0.21MFLOPS
FLOAT_MUL(30000) bench...= 156744 microseconds 0.21MFLOPS
DOUBLE_MUL(30000) bench...= 156736 microseconds 0.21MFLOPS

Arduino Zero (Atmel ATSAMD21G18 48MHz Cortex-M0+)
INT_LOOP(30000) bench...= 116898 microseconds 11.92MIPS
LONG_LOOP(30000) bench...= 116898 microseconds 11.93MIPS
FLOAT_DIV(30000) bench...= 116898 microseconds 0.38MFLOPS
DOUBLE_DIV(30000) bench...= 113126 microseconds 0.27MFLOPS
FLOAT_MUL(30000) bench...= 92387 microseconds 0.33MFLOPS
DOUBLE_MUL(30000) bench...= 116898 microseconds 0.26MFLOPS


Arduino Due (Atmel SAM3X8E 84 MHz Cortex-M3)
INT_LOOP(30000) bench...= 1074 microseconds 27.93MIPS
LONG_LOOP(30000) bench...= 1107 microseconds 27.10MIPS
FLOAT_DIV(30000) bench...= 25859 microseconds 1.21MFLOPS
DOUBLE_DIV(30000) bench...= 37966 microseconds 0.81MFLOPS
FLOAT_MUL(30000) bench...= 18659 microseconds 1.71MFLOPS
DOUBLE_MUL(30000) bench...= 25450 microseconds 1.23MFLOPS


Teensy LC (MKL26Z64 Cortex-M0+ 48MHz)
INT_LOOP(30000) bench...= 2508 microseconds 11.96MIPS
LONG_LOOP(30000) bench...= 2512 microseconds 11.94MIPS
FLOAT_DIV(30000) bench...= 76705 microseconds 0.40MFLOPS
DOUBLE_DIV(30000) bench...= 101840 microseconds 0.30MFLOPS
FLOAT_MUL(30000) bench...= 80471 microseconds 0.38MFLOPS
DOUBLE_MUL(30000) bench...= 106242 microseconds 0.29MFLOPS


Teensy 3.2 (MK20DX256 Cortex-M4 96 MHz)
INT_LOOP(30000) bench...= 940 microseconds 31.91MIPS
LONG_LOOP(30000) bench...= 944 microseconds 31.78MIPS
FLOAT_DIV(30000) bench...= 10977 microseconds 2.99MFLOPS
DOUBLE_DIV(30000) bench...= 21317 microseconds 1.47MFLOPS
FLOAT_MUL(30000) bench...= 8463 microseconds 3.99MFLOPS
DOUBLE_MUL(30000) bench...= 13162 microseconds 2.46MFLOPS


Teensy 3.2 (MK20DX256 Cortex-M4 72MHz)
INT_LOOP(30000) bench...= 1253 microseconds 23.94MIPS
LONG_LOOP(30000) bench...= 1256 microseconds 23.89MIPS
FLOAT_DIV(30000) bench...= 14635 microseconds 2.24MFLOPS
DOUBLE_DIV(30000) bench...= 25083 microseconds 1.26MFLOPS
FLOAT_MUL(30000) bench...= 11288 microseconds 2.99MFLOPS
DOUBLE_MUL(30000) bench...= 17551 microseconds 1.84MFLOPS

ESP8266 esp-12e 160MHz
INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
LONG_LOOP(30000) bench...= 751 microseconds 39.95MIPS
FLOAT_DIV(30000) bench...= 7500 microseconds 4.45MFLOPS
DOUBLE_DIV(30000) bench...= 8063 microseconds 4.10MFLOPS
FLOAT_MUL(30000) bench...= 9938 microseconds 3.27MFLOPS
DOUBLE_MUL(30000) bench...= 10688 microseconds 3.02MFLOPS


ESP8266 esp-12e 80MHz
INT_LOOP(30000) bench...= 1504 microseconds 19.95MIPS
LONG_LOOP(30000) bench...= 1501 microseconds 19.99MIPS
FLOAT_DIV(30000) bench...= 15001 microseconds 2.22MFLOPS
DOUBLE_DIV(30000) bench...= 16126 microseconds 2.05MFLOPS
FLOAT_MUL(30000) bench...= 19876 microseconds 1.63MFLOPS
DOUBLE_MUL(30000) bench...= 21377 microseconds 1.51MFLOPS



#From mantoui

teensy3.6 @180mhz
      INT_LOOP(30000) bench...= 500 microseconds 60.00MIPS
      LONG_LOOP(30000) bench...= 502 microseconds 59.76MIPS
      FLOAT_DIV(30000) bench...= 2503 microseconds 14.99MFLOPS
      DOUBLE_DIV(30000) bench...= 9343 microseconds 3.39MFLOPS
      FLOAT_MUL(30000) bench...= 667 microseconds 181.82MFLOPS
      DOUBLE_MUL(30000) bench...= 7008 microseconds 4.61MFLOPS

teensy3.6 @120mhz
     INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
     LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
     FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
     DOUBLE_DIV(30000) bench...= 14019 microseconds 2.26MFLOPS
     FLOAT_MUL(30000) bench...= 1001 microseconds 120.97MFLOPS
     DOUBLE_MUL(30000) bench...= 10514 microseconds 3.07MFLOPS

teensy3.5@120mhz
     INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
     LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
     FLOAT_DIV(30000) bench...= 3758 microseconds 9.99MFLOPS
     DOUBLE_DIV(30000) bench...= 18797 microseconds 1.66MFLOPS
     FLOAT_MUL(30000) bench...= 1003 microseconds 120.97MFLOPS
     DOUBLE_MUL(30000) bench...= 10529 microseconds 3.07MFLOPS

teensy3.2@120mhz
     INT_LOOP(30000) bench...= 751 microseconds 39.95MIPS
     LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
     FLOAT_DIV(30000) bench...= 8784 microseconds 3.74MFLOPS
     DOUBLE_DIV(30000) bench...= 17559 microseconds 1.79MFLOPS
     FLOAT_MUL(30000) bench...= 6771 microseconds 4.99MFLOPS
     DOUBLE_MUL(30000) bench...= 10533 microseconds 3.07MFLOPS

dragonfly@80MHz      
    INT_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
    LONG_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
    FLOAT_DIV(30000) bench...= 5641 microseconds 6.65MFLOPS
    DOUBLE_DIV(30000) bench...= 21813 microseconds 1.45MFLOPS
    FLOAT_MUL(30000) bench...= 1883 microseconds 39.79MFLOPS
    DOUBLE_MUL(30000) bench...= 16173 microseconds 1.99MFLOPS

#From Budvar10
  Arduino-PRO 1284 (ATmega1284P 24MHz)
  INT_LOOP(30000) bench...= 5024 microseconds 5.97MIPS
  LONG_LOOP(30000) bench...= 8992 microseconds 3.34MIPS
  FLOAT_DIV(30000) bench...= 96789 microseconds 0.34MFLOPS
  DOUBLE_DIV(30000) bench...= 96800 microseconds 0.34MFLOPS
  FLOAT_MUL(30000) bench...= 98058 microseconds 0.34MFLOPS
  DOUBLE_MUL(30000) bench...= 98059 microseconds 0.34MFLOPS
 
#From gdsports
  Adafruit Metro M4 Express (samd51 @120MHz) cache on
  INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
  LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
  FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
  DOUBLE_DIV(30000) bench...= 14022 microseconds 2.26MFLOPS
  FLOAT_MUL(30000) bench...= 1002 microseconds 120.48MFLOPS
  DOUBLE_MUL(30000) bench...= 10516 microseconds 3.07MFLOPS




The code is attached.

Very soon I will have a comparison of the typical relative noise in the A/D converters of the different platforms. The Teensy platform does seem to have more muscle, and its integer performance per MHz is also very solid. However, in terms of cost/performance, the STM32 board, a generic clone bought for around $2, is difficult to beat.

Cheers!



Trycage





Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: ron_sutherland on Oct 23, 2016, 09:48 pm
I've not had a chance to try the Teensy 3.5 I got from Paul, but your bench test looks like an easy thing to try (thanks for posting it).

Teensy 3.5 (MK64FX512 Cortex-M4 120 MHz)
 INT_LOOP(30000) bench...= 765 microseconds 39.22MIPS
 LONG_LOOP(30000) bench...= 757 microseconds 39.63MIPS
 FLOAT_DIV(30000) bench...= 3762 microseconds 9.98MFLOPS
 DOUBLE_DIV(30000) bench...= 26316 microseconds 1.17MFLOPS
 FLOAT_MUL(30000) bench...= 1257 microseconds 60.00MFLOPS
 DOUBLE_MUL(30000) bench...= 10534 microseconds 3.07MFLOPS

NOTE: used Arduino IDE 1.6.12 with Teensyduino 1.31 beta1
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: hansibull on Oct 23, 2016, 11:02 pm
EDIT: benchmark with v1.01:
ESP8266 (Wemos D1 mini PRO) @ 80 MHz

 INT_LOOP(30000) bench...= 1504 microseconds 19.95MIPS
 LONG_LOOP(30000) bench...= 1500 microseconds 20.00MIPS
 FLOAT_DIV(30000) bench...= 15001 microseconds 2.22MFLOPS
 DOUBLE_DIV(30000) bench...= 16126 microseconds 2.05MFLOPS
 FLOAT_MUL(30000) bench...= 19876 microseconds 1.63MFLOPS
 DOUBLE_MUL(30000) bench...= 21376 microseconds 1.51MFLOPS



ESP8266 (Wemos D1 mini PRO) @ 160 MHz

 INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
 LONG_LOOP(30000) bench...= 750 microseconds 40.00MIPS
 FLOAT_DIV(30000) bench...= 7500 microseconds 4.44MFLOPS
 DOUBLE_DIV(30000) bench...= 8063 microseconds 4.10MFLOPS
 FLOAT_MUL(30000) bench...= 9938 microseconds 3.27MFLOPS
 DOUBLE_MUL(30000) bench...= 10688 microseconds 3.02MFLOPS



If anyone has an ESP32, please post your benchmarks! :)
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: westfw on Oct 24, 2016, 07:37 am
Quote
Arduino Due (Atmel SAM3X8E 84 MHz Cortex-M3)
 INT_LOOP(30000) bench...= 5268 microseconds 5.69MIPS
 LONG_LOOP(30000) bench...= 6712 microseconds 4.47MIPS

Teensy 3.2 (MK20DX256 Cortex-M4 96 MHz)
 INT_LOOP(30000) bench...= 956 microseconds 31.38MIPS
 LONG_LOOP(30000) bench...= 947 microseconds 31.68MIPS
Those are very strange results for the Due...
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: Budvar10 on Oct 24, 2016, 08:09 am
@rduino-PRO 1284 (ATmega1284P 24MHz) (http://forum.arduino.cc/index.php?topic=277260.0)
INT_LOOP(30000) bench...= 9354 microseconds 3.21MIPS
LONG_LOOP(30000) bench...= 14240 microseconds 2.11MIPS
FLOAT_DIV(30000) bench...= 103296 microseconds 0.34MFLOPS
DOUBLE_DIV(30000) bench...= 103307 microseconds 0.34MFLOPS
FLOAT_MUL(30000) bench...= 104565 microseconds 0.33MFLOPS
DOUBLE_MUL(30000) bench...= 104576 microseconds 0.33MFLOPS
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: Riva on Oct 24, 2016, 08:30 am
Nice to have some more benchmarking tests, but the test will give wrong values for the loop times because you are printing several lines of text before calculating the loop duration.
I think the elapsed=micros()-elapsed; should come directly after the for loop.
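
A minimal sketch of the ordering described here, written as host-side C++ so it can be compiled without a board. `now_us()` is a hypothetical stand-in for Arduino's micros(), and `sink` only gives the loop a body; neither is taken from the attached benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for Arduino's micros(): microseconds since an
// arbitrary epoch, truncated to 32 bits.
static uint32_t now_us() {
  using namespace std::chrono;
  return static_cast<uint32_t>(
      duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count());
}

volatile uint8_t sink;  // volatile store keeps the loop from being optimized away

// Times n iterations and returns the elapsed microseconds.
uint32_t time_loop(uint32_t n) {
  assert(n > 0);
  uint32_t elapsed = now_us();
  for (uint32_t i = 0; i < n; ++i) {
    sink = 0;
  }
  elapsed = now_us() - elapsed;  // stop the clock directly after the loop

  // Reporting happens only after the interval has been captured,
  // so the time spent printing is not counted.
  std::printf("LOOP(%lu) bench...= %lu microseconds\n",
              static_cast<unsigned long>(n), static_cast<unsigned long>(elapsed));
  return elapsed;
}
```

Calling `time_loop(30000)` prints one result line and returns the measured interval; the print call sits outside the timed region.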
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Oct 24, 2016, 08:40 am
Quote
Those are very strange results for the Due...

I agree, it seems quite a bit slower. The specific integer loop we are using is:

"
 for (ic=ie; ic<(ie+30000); ic++)
 {
     
 }
"
This specific syntax avoids compiler simplifications (at least with the default compiler flags), because "ie" cannot be evaluated before execution. It should compile to an integer increment and a comparison. However, the Cortex-M4 has a more extensive instruction set and supports saturating arithmetic, so the speed difference is probably just due to different generated code. The Due may simply not be fast at this specific task. Still, this kind of iteration is very common in code.


Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: ron_sutherland on Oct 24, 2016, 09:13 am
Any ideas why the DOUBLE_DIV() test is slow on the Teensy 3.5?
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: westfw on Oct 24, 2016, 09:13 am
Due, slightly modified; seems more reasonable.

Time (ms)...= 12083 ms
INT_LOOP(30000) bench...= 1151 microseconds 26.06MIPS
LONG_LOOP(30000) bench...= 1131 microseconds 26.53MIPS
FLOAT_DIV(30000) bench...= 28098 microseconds 1.11MFLOPS
DOUBLE_DIV(30000) bench...= 36951 microseconds 0.84MFLOPS
FLOAT_MUL(30000) bench...= 19788 microseconds 1.61MFLOPS
DOUBLE_MUL(30000) bench...= 24436 microseconds 1.29MFLOPS
-------------------------------------------

It turns out that Due compiles with optimization flag "-Os", while Teensy3 compiles with just "-O"
On AVR, -Os seems to incorporate nearly all of the useful optimizations from -O, but that doesn't seem to be the case for ARM.   With -Os, Due produces code like this for the integer loop:


Code: [Select]
 for (ic=ie; ic<(ie+30000); ic++) //this syntax avoids compiler simplifications
   8018a:    460b          mov    r3, r1
   8018c:    f501 42ea     add.w    r2, r1, #29952    ; 0x7500
   80190:    322f          adds    r2, #47    ; 0x2f
   80192:    429a          cmp    r2, r3
   80194:    db01          blt.n    8019a <loop+0x52>
   80196:    3301          adds    r3, #1
   80198:    e7f8          b.n    8018c <loop+0x44>


Notice that the branch at the end goes back to 8018c (the "add.w" instruction), so there are six instructions in the loop.
With -O, it does:

Code: [Select]
 for (ic=ie; ic<(ie+30000); ic++) //this syntax avoids compiler simplifications
   80186:       6823            ldr     r3, [r4, #0]
   80188:       4aa0            ldr     r2, [pc, #640]  ; (8040c <loop+0x2c4>)
   8018a:       6013            str     r3, [r2, #0]
   8018c:       f503 42ea       add.w   r2, r3, #29952  ; 0x7500
   80190:       3230            adds    r2, #48 ; 0x30
   80192:       4293            cmp     r3, r2
   80194:       da05            bge.n   801a2 <loop+0x5a>
   80196:       f247 5330       movw    r3, #30000      ; 0x7530
   8019a:       3b01            subs    r3, #1
   8019c:       d1fd            bne.n   8019a <loop+0x52>


This reduces the loop to a decrement and a branch that it executes an "equivalent" number of times, but it has to check the initial condition separately, so it's a bit bigger.  (Why they can't use the same 30000 for the add and the loop counter, I'm not sure...)  The sketch uses 28,792 bytes, versus 28,044 bytes with -Os, so it is about 3% larger.

Just for kicks, using -O3 makes for 29,528 bytes, and succeeds in completely optimizing the loops away, giving a speed of up to 769 MIPS  :-)
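
Since the default level differs per platform, one option is to pin it in the source rather than rely on the board package. The sketch below assumes GCC: the `#pragma GCC` lines are GCC-specific and other compilers may ignore them. `count_loop` and `tick` are hypothetical names, and this is not necessarily how the attached benchmark does it.

```cpp
#include <cassert>
#include <cstdint>

volatile uint32_t tick;  // hypothetical side effect so the loop survives optimization

#pragma GCC push_options
#pragma GCC optimize ("O1")  // GCC-specific: compile the code below at -O1

// Runs n iterations of a loop with a volatile store and returns how many ran.
uint32_t count_loop(uint32_t n) {
  uint32_t done = 0;
  for (uint32_t i = 0; i < n; ++i) {
    tick = i;  // volatile write: must be performed on every iteration
    ++done;
  }
  return done;
}

#pragma GCC pop_options  // restore the command-line optimization level

// Example check with the thread's iteration count.
void count_loop_example() {
  assert(count_loop(30000) == 30000u);
}
```

The volatile store matters as much as the pragma: without an observable side effect, higher levels are free to remove the loop entirely, as the -O3 result above shows.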






Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Oct 24, 2016, 09:19 am
Thanks Riva,

I will implement your modification.

trycage
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Oct 24, 2016, 11:24 am
@ron_sutherland, @hansibull and @Budvar10: if it is not too much trouble, could you re-run the latest version of the benchmark (which now fixes the compiler options on every platform)? I will publish your results at the top of the post.

Thanks

Trycage
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: westfw on Oct 24, 2016, 11:26 am
I suggest modifying a volatile variable inside the loop.


Code: [Select]
volatile byte dosomething;
  :
 for (lc=le; lc<(le+30000); lc++) //this syntax avoids compiler simplifications
  {
      dosomething = 0;
  }

Because null loops are pretty boring.  Then you won't need to be so tricky with your loops, either...

Various "long" variables used to hold timestamps should be "unsigned long"
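
The point about timestamp types can be shown without hardware. On Arduino, micros() returns an unsigned long that eventually wraps around. A small sketch, using the hypothetical helper `elapsed_us`, of why unsigned subtraction still yields the right interval across the wrap:

```cpp
#include <cassert>
#include <cstdint>

// Interval between two wrapping 32-bit microsecond timestamps.
// Unsigned arithmetic is modulo 2^32, so the result stays correct even when
// 'now' has wrapped past zero, provided the true interval is below 2^32 us.
uint32_t elapsed_us(uint32_t start, uint32_t now) {
  return now - start;
}

// Example: the counter wrapped between the two readings.
void elapsed_us_example() {
  const uint32_t start = 0xFFFFFFF0u;  // 16 us before the wrap
  const uint32_t now = 0x00000010u;    // 16 us after the wrap
  assert(elapsed_us(start, now) == 32u);
}
```

With signed variables the same subtraction can overflow, which is undefined behavior in C++, so the unsigned type is the safer choice for timestamps.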
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: Budvar10 on Oct 24, 2016, 11:36 am
...or put NOP instruction there
Code: [Select]
{
  __asm__ __volatile__("nop"); // AVR
}


EDIT: Forget this. There are several different processors involved, which I totally missed; it was a stupid idea. :)
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: westfw on Oct 24, 2016, 11:39 am
"nop" isn't guaranteed to be the right assembly on all chips.  (mind you, you'd have to be out of your mind as a chip designer not to have a "nop" instructions, but it could happen...)
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: Budvar10 on Oct 24, 2016, 11:43 am
@westfw
Yes, yes, while I was realizing my mistake, you posted...
 :-*
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Oct 24, 2016, 12:11 pm
Thanks westfw. Our initial version of the code included some operations in the INT loop, but we reasoned that the FOR statement already contains an increment operation. The code uses the INT loop to calibrate the speed of the FLOAT loop, and that is probably good enough for a rough comparison between the platforms we have.
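
The posted figures are consistent with this calibration scheme: the integer loops match iterations per microsecond, while the floating-point rates match the iteration count divided by the loop time minus the LONG_LOOP time. The sketch below shows that arithmetic. The function names are hypothetical, and the formula is inferred from the posted numbers, not taken from the attached code.

```cpp
#include <cassert>
#include <cstdint>

// Raw rate: iterations per microsecond, i.e. millions of loops per second.
double raw_rate(uint32_t iterations, uint32_t elapsed_us) {
  assert(elapsed_us > 0);
  return static_cast<double>(iterations) / elapsed_us;
}

// Net rate: subtract the empty calibration loop so that only the
// operation inside the loop body is counted.
double net_rate(uint32_t iterations, uint32_t elapsed_us, uint32_t calib_us) {
  assert(elapsed_us > calib_us);
  return static_cast<double>(iterations) / (elapsed_us - calib_us);
}
```

With the Teensy 3.6 @ 180 MHz figures from the first post, `raw_rate(30000, 502)` gives about 59.76 and `net_rate(30000, 667, 502)` about 181.82, matching the LONG_LOOP and FLOAT_MUL lines.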

I could probably code a WHILE statement in which the comparison and the increment appear as separate, recognizable operations, but I have the feeling that it would not make much difference to the compiler.

Thanks a lot for the input.
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: mantoui on Nov 10, 2016, 07:51 pm
FYI, here are some more results v1.01 (pragma -O1) on Teensy 3.5/3.6/3.2 and on dragonfly (https://www.tindie.com/products/onehorse/dragonfly-stm32l4-development-board/) (STM32L4@80MHz, hardware float)
Code: [Select]
       t3.6 @180mhz
         INT_LOOP(30000) bench...= 500 microseconds 60.00MIPS
         LONG_LOOP(30000) bench...= 502 microseconds 59.76MIPS
         FLOAT_DIV(30000) bench...= 2503 microseconds 14.99MFLOPS
         DOUBLE_DIV(30000) bench...= 9343 microseconds 3.39MFLOPS
         FLOAT_MUL(30000) bench...= 667 microseconds 181.82MFLOPS
         DOUBLE_MUL(30000) bench...= 7008 microseconds 4.61MFLOPS

     t3.6 @120mhz
        INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
        LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
        FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
        DOUBLE_DIV(30000) bench...= 14019 microseconds 2.26MFLOPS
        FLOAT_MUL(30000) bench...= 1001 microseconds 120.97MFLOPS
        DOUBLE_MUL(30000) bench...= 10514 microseconds 3.07MFLOPS

       t3.5@120mhz
        INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
        LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
        FLOAT_DIV(30000) bench...= 3758 microseconds 9.99MFLOPS
        DOUBLE_DIV(30000) bench...= 18797 microseconds 1.66MFLOPS
        FLOAT_MUL(30000) bench...= 1003 microseconds 120.97MFLOPS
        DOUBLE_MUL(30000) bench...= 10529 microseconds 3.07MFLOPS

      t3.2@120mhz
        INT_LOOP(30000) bench...= 751 microseconds 39.95MIPS
        LONG_LOOP(30000) bench...= 755 microseconds 39.74MIPS
        FLOAT_DIV(30000) bench...= 8784 microseconds 3.74MFLOPS
        DOUBLE_DIV(30000) bench...= 17559 microseconds 1.79MFLOPS
        FLOAT_MUL(30000) bench...= 6771 microseconds 4.99MFLOPS
        DOUBLE_MUL(30000) bench...= 10533 microseconds 3.07MFLOPS

    dragonfly@80MHz      
       INT_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
       LONG_LOOP(30000) bench...= 1129 microseconds 26.57MIPS
       FLOAT_DIV(30000) bench...= 5641 microseconds 6.65MFLOPS
       DOUBLE_DIV(30000) bench...= 21813 microseconds 1.45MFLOPS
       FLOAT_MUL(30000) bench...= 1883 microseconds 39.79MFLOPS
       DOUBLE_MUL(30000) bench...= 16173 microseconds 1.99MFLOPS

Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Apr 05, 2018, 11:24 pm
-Updated
Added Arduino Zero and Arduino Pro 1284 (Thanks Budvar10)
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: gdsports on Apr 28, 2018, 03:54 am
Adafruit Metro M4 Express (samd51 @120MHz) cache on
 INT_LOOP(30000) bench...= 752 microseconds 39.89MIPS
 LONG_LOOP(30000) bench...= 753 microseconds 39.84MIPS
 FLOAT_DIV(30000) bench...= 3756 microseconds 9.99MFLOPS
 DOUBLE_DIV(30000) bench...= 14022 microseconds 2.26MFLOPS
 FLOAT_MUL(30000) bench...= 1002 microseconds 120.48MFLOPS
 DOUBLE_MUL(30000) bench...= 10516 microseconds 3.07MFLOPS
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on May 01, 2018, 07:08 pm
@gdsports Thanks!!!!

Then:

-Update
Added Adafruit Metro M4 Express (Thanks gdsports)
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: moises1953 on Jul 25, 2018, 06:26 am
Operations taking less time than the calibration loop? Not possible. Maybe the time values are formatted incorrectly.

Arduino Zero (Atmel ATSAMD21G18 48MHz Cortex-M0+)
INT_LOOP(30000) bench...= 116898 microseconds 11.92MIPS
LONG_LOOP(30000) bench...= 116898 microseconds 11.93MIPS
FLOAT_DIV(30000) bench...= 116898 microseconds 0.38MFLOPS
DOUBLE_DIV(30000) bench...= 113126 microseconds 0.27MFLOPS
FLOAT_MUL(30000) bench...= 92387 microseconds 0.33MFLOPS
DOUBLE_MUL(30000) bench...= 116898 microseconds 0.26MFLOPS

At high speed the results are imprecise:
Teensy 3.6 (Cortex-M4 @ 180 MHz). The reported FLOAT_MUL result is 181.82 MFLOPS.
The empty reference loop performs the following high-level operations on every iteration:
1) increment
2) compare
3) jump
It takes 502 microseconds for 30000 iterations, so 59.76 Mloops per second. In high-level operations that is 59.76*3 = 179.28 MIPS.
How is it possible to achieve 181.82 with FLOAT_MUL? Without optimizations it should be 180 MIPS, or maybe 179.28.
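
The estimate above can be written out as a small helper. This is only a sketch of the arithmetic in this post: `loop_mips` is a hypothetical name, and counting three operations per iteration is the post's assumption, not a measured instruction count.

```cpp
#include <cassert>
#include <cstdint>

// Millions of high-level operations per second, given how many operations
// are attributed to each loop iteration.
double loop_mips(uint32_t iterations, uint32_t ops_per_iteration,
                 uint32_t elapsed_us) {
  assert(elapsed_us > 0);
  return static_cast<double>(iterations) * ops_per_iteration / elapsed_us;
}
```

For the figures above, `loop_mips(30000, 3, 502)` comes out at about 179.28.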

Each measured operation consists of the operation itself plus an assignment, and maybe the assignment time was negligible. Including an assignment of a constant in the LONG calibration loop may be a better approach, as suggested by westfw.

It may be interesting to measure the assignment time (and MIPS) for different data types.

The attachment contains a comparative table of operation MIPS, assigning 3 operations to each loop iteration.
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: moises1953 on Aug 05, 2018, 08:33 am
This is the code of FDIV loop:
Code: [Select]
  fa=(float)random(1,2);
  fb=(float)random(1,1000);
  fb=0; // this line must be removed
  fg=0;
  le=random(1,2);
  elapsed=micros();
  for (lc=le; lc<(le+30000); lc++)
  {
    fb=fb/fa;
  }
  elapsed=micros()-elapsed;


If fb is initialized to 0, then every operation is 0./fa, so this initialization must be removed, in the DDIV loop as well.
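
A sketch of why the zero initialization matters: once fb is 0, every iteration divides 0 by fa and the operand never changes, so the loop only ever times that one special case. `repeated_div` is a hypothetical helper, not part of the attached code.

```cpp
#include <cassert>

// Repeats fb = fb / fa n times, as in the FDIV loop, and returns the final fb.
float repeated_div(float fb, float fa, int n) {
  assert(fa != 0.0f);
  for (int i = 0; i < n; ++i) {
    fb = fb / fa;
  }
  return fb;
}
```

`repeated_div(0.0f, 3.0f, 30000)` returns 0 no matter how many iterations run, whereas `repeated_div(8.0f, 2.0f, 3)` returns 1 because the operand actually changes on each pass.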

The variables fg and dg are unused and can be removed.

The addition inside the for condition is unnecessary overhead.

Proposed code for FDIV:
Code: [Select]
  fa=(float)random(1,2);
  fb=(float)random(1,1000);
  le=random(1,2);
  lg=le+30000;
  elapsed=micros();
  for (lc=le; lc<lg; lc++)  //this syntax avoids compiler simplifications?
  {
    fb=fb/fa;
  }
  elapsed=micros()-elapsed;
  // compute MIPS and display


The int loop could be an ISUM:
Code: [Select]
  ia=random(1,2);
  ib=random(1,1000);
  le=random(1,2);
  lg=le+30000;
  elapsed=micros();
  for (lc=le; lc<lg; lc++) //this syntax avoids compiler simplifications?
  {
    ib=ib+ia;
  }
  elapsed=micros()-elapsed;
// compute MIPS and display
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: trycage on Aug 23, 2018, 05:57 pm
Thanks Moises. I am grateful you took the time to look at the code.

I wrote the code a while ago (180 MHz microcontrollers were not exactly a target back then).

If I recall correctly, I tried to make all the loops look similar in structure to the calibration loop (so that I could subtract the loop overhead). A float multiply should give about 180 MFLOPS on a 180 MHz Cortex-M4 with FPU. I see your points; however, the accuracy is undermined by the use of the micros() function, whose granularity is several microseconds on some boards (4 microseconds on 16 MHz AVR boards, 8 microseconds on 8 MHz ones), and a loop of 30000 iterations is probably too short. Actually, I think 181.82 MFLOPS is quite close, but that number of digits is definitely pointless.
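
To put a number on how much the timer resolution matters, here is a rough sketch that bounds the computed rate when the net time is only known to within the timer granularity. `rate_bounds` and `RateRange` are hypothetical names, and the 4 microsecond granularity used in the note after the code is an assumed example value; substitute whatever your core's micros() actually provides.

```cpp
#include <cassert>
#include <cstdint>

struct RateRange {
  double low;   // rate if the net time was really 'granularity_us' longer
  double high;  // rate if the net time was really 'granularity_us' shorter
};

// Bounds on iterations per microsecond when the net time
// (elapsed - calibration) is only known to within +/- granularity_us.
RateRange rate_bounds(uint32_t iterations, uint32_t elapsed_us,
                      uint32_t calib_us, uint32_t granularity_us) {
  assert(elapsed_us > calib_us);
  const uint32_t net = elapsed_us - calib_us;
  assert(net > granularity_us);
  return RateRange{static_cast<double>(iterations) / (net + granularity_us),
                   static_cast<double>(iterations) / (net - granularity_us)};
}
```

For the Teensy 3.6 FLOAT_MUL line (667 microseconds against a 502 microsecond calibration loop), `rate_bounds(30000, 667, 502, 4)` gives roughly 177.5 to 186.3, so the digits after the decimal point carry no information. A longer loop shrinks this range, since the same timing error is spread over a larger net time.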

The "DUMMY" assignments were made (if I recall correctly) because they had some effect on the compiled code. A better programmer would probably have coded directly in assembler, taking care to make all the loops exactly the same (and I am also a lazy programmer most of the time!).

I recall testing the different suggestions (looking at the compiled code), but I did not have time to improve the benchmark for high-speed parts (without affecting the old results).

:)



Marco


Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: phantom_ts on Aug 29, 2018, 06:34 pm

 ESP32
 INT_LOOP(30000) bench...= 1 microseconds 30000.00MIPS
 LONG_LOOP(30000) bench...= 1 microseconds 30000.00MIPS
 FLOAT_DIV(30000) bench...= 6420 microseconds 4.67MFLOPS
 DOUBLE_DIV(30000) bench...= 5036 microseconds 5.96MFLOPS
 FLOAT_MUL(30000) bench...= 501 microseconds 60.00MFLOPS
 DOUBLE_MUL(30000) bench...= 5544 microseconds 5.41MFLOPS
Title: Re: Benchmark STM32 vs ATMega328 (nano) vs SAM3X8E (due) vs MK20DX256 (teensy 3.2)
Post by: dsyleixa on Apr 20, 2019, 08:55 pm
I also wrote a benchmark for different MCUs, both AVRs and ARMs.
The benchmark performs low- and high-level tests for integers, floats, doubles, bit shifts, random numbers, sorting, matrix algebra, GPIO r/w, and graphics.
The test will run even without a TFT attached; you may keep the #included Adafruit libs or optionally replace them with proprietary ones.

Update: the test for the Raspberry Pi has now also been completed.

As AVRs don't feature 64-bit doubles, the 32-bit float test is performed twice on them, without issuing penalty points though (which admittedly is not fair to the ARM boards ;) )

( ... to be continued ... )

Code: [Select]

test design:
  0   int_Add     50,000,000 int +,- plus counter
  1   int_Mult    10,000,000 int *,/  plus counter
  2   fp32_ops    2,500,000 fp32 mult, transc.  plus counter
  3   fp64_ops    2,500,000 fp64 mult, transc.  plus counter (if N/A: 32bit)
  4   randomize   2,500,000 Mersenne PRNG (+ * & ^ << >>)
  5   matrx_algb  150,000 2D Matrix algebra (mult, det)
  6   arr_sort    1500 shellsort of random array[500]
  7   GPIO toggle 6,000,000 toggle GPIO r/w  plus counter
  8   Graphics    10*8 textlines + 10*8 shapes + 20 clrscr

.


Reference values (update: now also completed for the Raspberry Pi):

Arduino MEGA + ILI9225 + Karlson UTFT + Arduino GPIO-r/w
  0     90244  int_Add
  1    237402  int_Mult
  2    163613  fp32_ops(float)
  3    163613  fp32_ops(float=double)
  4    158567  randomize
  5     46085  matrx_algb
  6     23052  arr_sort
  7     41569  GPIO toggle
  8     62109  Graphics   
runtime total: 986254
benchmark:     51




Arduino MEGA + ILI9225 + Karlson UTFT + Register bitRead/Write
  0     90238  int_Add
  1    237387  int_Mult
  2    163602  fp32_ops (float)
  3    163602  fp32_ops (float=double)
  4    158557  randomize
  5     45396  matrx_algb
  6     23051  arr_sort
  7      4528  GPIO_toggle bit r/w
  8     62106  Graphics   
runtime total: 948467
benchmark:     53 


Arduino MEGA + adafruit_ILI9341 Hardware-SPI  Arduino GPIO r/w
  0     90244  int_Add
  1    237401  int_Mult
  2    163612  fp32_ops (float)
  3    163612  fp32_ops (float=double)
  4    158725  randomize
  5     46079  matrx_algb
  6     23051  arr_sort
  7     41947  GPIO toggle
  8      6915  Graphics   
runtime total: 931586
benchmark:     54 
 
 



Arduino/Adafruit M0 + adafruit_ILI9341 Hardware-SPI 
  0      7746  int_Add
  1     15795  int_Mult
  2     89054  fp32_ops
  3    199888  fp64_ops(double)
  4     17675  randomize
  5     18650  matrx_algb
  6      6328  arr_sort
  7      9944  GPIO_toggle
  8      6752  Graphics
runtime total: 371832
benchmark:     134



Arduino DUE + adafruit_ILI9341 Hardware-SPI 
  0      4111  int_Add
  1      1389  int_Mult
  2     29124  fp32_ops(float)
  3     57225  fp64_ops(double)
  4      3853  randomize
  5      4669  matrx_algb
  6      2832  arr_sort
  7     11859  GPIO_toggle
  8      6142  Graphics   
runtime total: 121204
benchmark:     413   



Arduino/Adafruit M4 + adafruit_HX8357 Hardware-SPI 
  0      2253  int_Add
  1       872  int_Mult
  2      2773  fp32_ops (float)
  3     24455  fp64_ops (double)
  4      1680  randomize
  5      1962  matrx_algb
  6      1553  arr_sort
  7      2395  GPIO_toggle
  8      4600  Graphics   
runtime total: 39864
benchmark:     1254   



Arduino/Adafruit ESP32 + adafruit_HX8357 Hardware-SPI 
  0      2308  int_Add
  1       592  int_Mult
  2      1318  fp32_ops
  3     14528  fp64_ops
  4       825  randomize
  5      1101  matrx_algb
  6       687  arr_sort
  7       972  GPIO_toggle
  8      3053  Graphics   
runtime total: 25384
benchmark:     1969


Raspberry Pi:

Raspi 2 (v1): 4x 900MHz,  GPU 400MHz, no CPU overclock, full-HD, openVG:
  0     384  int_Add
  1     439  int_Mult
  2     346  fp32_ops(float)
  3     441  fp64_ops(double)
  4     399  randomize
  5     173  matrx_algb
  6     508  arr_sort
  7     823  GPIO_toggle
  8    2632  graphics
runtime total: 6145
benchmark: 8137   




edit: updated for
Arduino/Adafruit Feather ESP32