Zero Floating Point Benchmark vs. Uno

I ran a small 5x5 matrix-multiplication floating-point benchmark created by Texas Instruments (code below). Since the Zero runs at 48MHz compared to the Uno’s 16MHz and has a 32-bit core, I expected an improvement of well over 3x. The numbers are the microseconds the sketch prints for 1000 passes, so lower is better.

Zero: 915,756
Uno: 2,557,896

This is a 2.79x improvement.

Replacing the multiplication with division (a bit tougher operation), the results were:

Zero: 1,665,925
Uno: 4,805,320

This is a 2.88x improvement.

I guess floating-point emulation can only be so-so efficient, eh?
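To set those ratios against the clock difference, here is a small host-side sketch in plain C++ (not an Arduino sketch; the function and constant names are only for illustration). It divides the Uno time by the Zero time and then normalizes by the 48/16 clock ratio:

```cpp
// Host-side arithmetic only; all names here are illustrative.

// Speedup = slower time / faster time (both in microseconds).
constexpr double speedup(double slow_us, double fast_us) {
    return slow_us / fast_us;
}

// Speedup divided by the clock ratio: 1.0 means "exactly as much faster
// as the clock", below 1.0 means less work done per clock cycle.
constexpr double per_clock(double speedup_x, double fast_mhz, double slow_mhz) {
    return speedup_x / (fast_mhz / slow_mhz);
}

// Figures taken from the results above.
constexpr double mul_x = speedup(2557896.0, 915756.0);          // about 2.79
constexpr double div_x = speedup(4805320.0, 1665925.0);         // about 2.88
constexpr double mul_per_clock = per_clock(mul_x, 48.0, 16.0);  // about 0.93
constexpr double div_per_clock = per_clock(div_x, 48.0, 16.0);  // about 0.96
```

Both per-clock values come out slightly below 1, so on this benchmark the Zero gets a little less floating-point work done per clock cycle than the Uno.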

const float m1[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };
const float m2[5][5] = { {0.0001, 0.001, 0.01, 0.1, 1},{0.001, 0.01, 0.1, 1, 10},{0.01, 0.1, 1, 10, 100},{0.1, 1.0, 10, 100, 1000},{1, 10, 100, 1000, 10000} };
float m3[5][5];

void setup() {
  Serial.begin(115200);
  delay(100);
}

void loop() {
  int j, m, n, p;
  unsigned long starttime;
  unsigned long endtime;
  unsigned long looptime;

  // Repeat forever: time 1000 passes of the 5x5 multiply and print the total.
  while (1) {
    starttime = micros();
    for (j = 0; j < 1000; j++) {
      for (m = 0; m < 5; m++) {
        for (p = 0; p < 5; p++) {
          m3[m][p] = 0;
          for (n = 0; n < 5; n++) {
            m3[m][p] += m1[m][n] * m2[n][p];
          }
        }
      }
    }
    endtime = micros();
    looptime = endtime - starttime;  // elapsed microseconds for 1000 passes
    Serial.println(looptime);
  }
}

I once wrote a benchmark that tests low- and high-level computing algorithms on several platforms. Because it also sorts large arrays, it needs at least the memory of a Mega. I would be very curious how the Zero performs, so I would appreciate it if you ran the following code on your Zero (skip or adjust the display benchmarks if you wish):

(The code is over 9000 characters, too long to post inline.)

(Remove the code tags before compiling; they were only meant for posting the code.)

benchmark.zip (3.07 KB)

The results for the Mega and Due are as follows (UTFT 220x176 LCD TFT; times in ms):

Type                                  Mega2560         Due
MCU, CPU clock                       AVR/16MHz     ARM M3/84MHz
Firmware / vers.                     Sketch 1.5    Sketch 1.5
============================================================
 0 100000 integer add/subtr                129           8
 1 52000 integer multiply/division        1219           7
 2 5000 float operations                   326         107
 3 Mersenne Tw. PRNG (&,^,>>,<<)           313           7
------------------------------------------------------------
 4 matrix algebra (prod/determ.)           244          23
 5 int array sort [500]                   1366         171
-----------------------------------------------------------
 6 display text (N/A:penalty=max+)*)     80618        9110
 7 display graphics                     224505       40675
===========================================================
execution time                          308720       50108
===========================================================

Perhaps you should consider running the loops without the multiplication / division in the middle and subtracting that time from the loops with it to determine how much time the operation actually takes versus the loops.

Jumps for loops can often be expensive operations that might eclipse the time spent doing the multiplication. Especially when you're doing one for every single multiplication.
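The subtraction suggested above could look roughly like this. It is a host-side sketch in plain C++ under stated assumptions: std::chrono stands in for micros(), and time_us, net_per_op_us, loops and measure_multiply_us are hypothetical helper names, not part of the original benchmark. The volatile variables are there so the compiler cannot delete the otherwise empty baseline loop.

```cpp
#include <chrono>

// Run `body` once and return the elapsed time in microseconds.
// On an Arduino, two micros() calls would play the same role.
template <typename F>
double time_us(F body) {
    auto t0 = std::chrono::steady_clock::now();
    body();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

// Net cost of one operation: (time with op - time without op) / op count.
constexpr double net_per_op_us(double with_op_us, double loops_only_us,
                               double op_count) {
    return (with_op_us - loops_only_us) / op_count;
}

// volatile keeps the compiler from removing or hoisting the work.
volatile float g_a = 1.5f, g_b = 2.5f, g_sink = 0.0f;

// Same loop nest as the benchmark: 1000 x 5 x 5 x 5 iterations.
// with_multiply selects a multiply or a plain copy as the inner statement.
void loops(bool with_multiply) {
    for (int j = 0; j < 1000; j++)
        for (int m = 0; m < 5; m++)
            for (int p = 0; p < 5; p++)
                for (int n = 0; n < 5; n++)
                    g_sink = with_multiply ? g_a * g_b : g_a;
}

// Estimated microseconds per multiply after removing the loop overhead.
double measure_multiply_us() {
    double with_op = time_us([] { loops(true); });
    double without = time_us([] { loops(false); });
    return net_per_op_us(with_op, without, 1000.0 * 5 * 5 * 5);
}
```

The baseline still pays for one volatile load and store per iteration, and the multiply variant adds a second load, so the result is an estimate rather than an exact instruction cost. On a fast host the difference can be lost in timer noise; on a microcontroller with software floating point it should be much larger than the loop overhead.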

The UTFT library I have requires special include files for each ARM processor (DUE, teensy 3*), so is there such an include file for the ZERO? I could delete all the UTFT stuff, I guess. Here are the ZERO results without UTFT:

Type                                  Mega2560         Due         ZERO    teensy3.1
MCU, CPU clock                       AVR/16MHz     ARM M3/84MHz    48MHz     96MHz
Firmware / vers.                     Sketch 1.5    Sketch 1.5
============================================================
 0 100000 integer add/subtr                129           8          15          5
 1 52000 integer multiply/division        1219           7         102          5
 2 5000 float operations                   326         107         397         92
 3 Mersenne Tw. PRNG (&,^,>>,<<)           313           7          36          4
------------------------------------------------------------
 4 matrix algebra (prod/determ.)           244          23          92         19
 5 int array sort [500]                   1366         171         379        110
-----------------------------------------------------------
 6 display text (N/A:penalty=max+)*)     80618        9110                   1667  ILI9327 NC
 7 display graphics                     224505       40675                   3194      
===========================================================
execution time                          308720       50108                   5096
===========================================================

some other simple MCU performance numbers, including ZERO

Thank you very much! Indeed, the UTFT score is not essential. (Edit: I just found ILI9327 on the Teensy, but what does "NC" mean?)

In the UTFT library I had (v2.72), it wouldn't compile with QD220A, so I just picked a driver at random. NC means "Not Connected": since the I/O is just digitalWrite(), no device needs to be attached. For comparison, using ILI9327 on DUE (TextOut 2681us, Graphics 3631) and on Mega2560 (TextOut 10300, Graphics 17527).

The ILI9327 is a 16-bit device; the ILI9325C (8-bit) is closer to the original results. Here are TFT results from the 1.6.4 IDE:

           mega2560   DUE    teensy3.1
TextOut     14451     3441       2314
Graphics    27382     5391       4964

Since UTFT performance depends on device driver and how much effort someone has put in to providing the fast digitalWrite, it may not be such a useful benchmark for an MCU (IMHO).

thank you very much for your reply!

[OT] BTW, although the TFT tests are off-topic: do you have an ILI9341 display? I just tested one with this lib: https://github.com/marekburiak/ILI9341_due

It just uses 4-pin hardware SPI, resolution 320x240.

  • It's amazingly fast on the Due (Mega tests will follow):
ILI9341_due tft, board: Arduino DUE:
TextOut          973 
Graphics        3255 
gesamt ms:      4228

[/OT]

I do have the ILI9341. The examples/graphicstest/ sketch in the library is an extensive benchmark; a lot depends on how fast you can clock your SPI interface. There is a long discussion of teensy 3.1 performance here

This is the isolated graphics test that produced the results above.
I’m curious how a Teensy or the Zero would perform.
I may try the Mega myself tomorrow.
Anyway, all the “dumb” TFTs show about the same performance, whether with the UTFT or the Adafruit libs.
The marekburiak/ILI9341_due test is by far the fastest I have ever observed, and not just for an SPI interface. Since none of these TFTs has a graphics processor, raising the SPI clock is presumably the only way to speed them up substantially.
On the Due, it’s about 15 times as fast as UTFT or Adafruit_ILI9340.

 TFT Brickbench

#include <SPI.h>

// ILI9341_due NEW lib by Marek Buriak http://marekburiak.github.io/ILI9341_due/
#include "ILI9341_due_config.h"
#include "ILI9341_due.h"
#include "SystemFont5x7.h"
//#include "Streaming.h"

// For the Adafruit shield, these are the default.
/*
#define TFT_RST 8
#define TFT_DC 9
#define TFT_CS 10
// Use hardware SPI (on Uno, #13, #12, #11) and the above for CS/DC
*/

#define    tft_cs     50
#define    tft_dc     49
#define    tft_rst     0


ILI9341_due tft = ILI9341_due(tft_cs, tft_dc, tft_rst);

char textBuff[20];

// Color set
#define   BLACK           0x0000
#define   RED             0xF800
#define   GREEN           0x07E0
//#define   BLUE            0x001F
#define   BLUE            0x102E
#define CYAN            0x07FF
#define MAGENTA         0xF81F
#define YELLOW          0xFFE0
#define ORANGE          0xFD20
#define GREENYELLOW     0xAFE5
#define DARKGREEN       0x03E0
#define WHITE           0xFFFF

uint16_t  color;
uint16_t  colorFONDO = BLACK;


#define  TimerMS() millis()

unsigned long runtime[8];


void TFTprint(char sbuf[], int16_t x, int16_t y) {
  tft.cursorToXY(x, y);
  tft.print(sbuf);
}
   

inline void displayValues() {

  char buf[120];
  tft.fillScreen(BLACK); // clrscr()

    sprintf (buf, "%3d %9ld  int_Add",    0, runtime[0]); TFTprint(buf, 0,10);
    sprintf (buf, "%3d %9ld  int_Mult",   1, runtime[1]); TFTprint(buf, 0,20);
    sprintf (buf, "%3d %9ld  float_op",   2, runtime[2]); TFTprint(buf, 0,30);
    sprintf (buf, "%3d %9ld  randomize",  3, runtime[3]); TFTprint(buf, 0,40);
    sprintf (buf, "%3d %9ld  matrx_algb", 4, runtime[4]); TFTprint(buf, 0,50);
    sprintf (buf, "%3d %9ld  arr_sort",   5, runtime[5]); TFTprint(buf, 0,60);
    sprintf (buf, "%3d %9ld  TextOut",    6, runtime[6]); TFTprint(buf, 0,70);
    sprintf (buf, "%3d %9ld  Graphics",   7, runtime[7]); TFTprint(buf, 0,80);
   
   
}


int32_t test_TextOut(){
  int  y;
  char buf[120];
 
  for(y=0;y<20;++y) {   
    tft.fillScreen(BLACK); // clrscr()
    sprintf (buf, "%3d %9d  int_Add",    y, 1000);  TFTprint(buf, 0,10);
    sprintf (buf, "%3d %9d  int_Mult",   0, 1010);  TFTprint(buf, 0,20);
    sprintf (buf, "%3d %9d  float_op",   0, 1020);  TFTprint(buf, 0,30);
    sprintf (buf, "%3d %9d  randomize",  0, 1030);  TFTprint(buf, 0,40);
    sprintf (buf, "%3d %9d  matrx_algb", 0, 1040);  TFTprint(buf, 0,50);
    sprintf (buf, "%3d %9d  arr_sort",   0, 1050);  TFTprint(buf, 0,60);
    sprintf (buf, "%3d %9d  displ_txt",  0, 1060);  TFTprint(buf, 0,70);
    sprintf (buf, "%3d %9d  testing...", 0, 1070);  TFTprint(buf, 0,80);

  }
  return y;
}


int32_t test_graphics(){
  int y;
  char buf[120];
 
 
  for(y=0;y<100;++y) {
    tft.fillScreen(BLACK);
    sprintf (buf, "%3d", y);  TFTprint(buf, 0,80); // comment out for backward compatibility

    tft.drawCircle(50, 40, 10, WHITE);
    tft.fillCircle(30, 24, 10, WHITE);
    tft.drawLine(10, 10, 60, 60, WHITE);
    tft.drawLine(50, 20, 90, 70, WHITE);
    tft.drawRect(20, 20, 40, 40, WHITE);
    tft.fillRect(65, 25, 20, 30, WHITE);
    //tft.drawEllipse(70, 30, 15, 20); //  original test
    tft.drawCircle(70, 30, 15, WHITE); // fallback when the Arduino graphics lib has no drawEllipse

  }
  return y;
}



int test(){

  unsigned long time0, x, y;
  double s;
  char  buf[120];
  int   i;
  float f;

 
 
  // lcd display text / graphs
 
  time0=TimerMS();
  s=test_TextOut();
  runtime[6]=TimerMS()-time0;
  sprintf (buf, "%3d %9ld  TextOut", 6, runtime[6]); Serial.println( buf);
  TFTprint(buf, 0,70);
 
  time0=TimerMS();
  s=test_graphics();
  runtime[7]=TimerMS()-time0;
  sprintf (buf, "%3d %9ld  Graphics", 7, runtime[7]); Serial.println( buf);
  TFTprint(buf, 0,80);
 

  Serial.println();
 
  y = 0;
  for (x = 0; x < 8; ++x) {
      y += runtime[x];
  }
 
  displayValues();
 
  sprintf (buf, "gesamt ms: %9ld ", y);  // "gesamt" is German for "total"
  Serial.println( buf);
  TFTprint(buf, 0,110);
 
  x=50000000.0/y;
  sprintf (buf, "benchmark: %9ld ", x);
  Serial.println( buf);
  TFTprint(buf, 0,120);

  return 1;

}



void setup() {
 
  Serial.begin(115200);
 
  // Setup the LCD

  tft.begin();
  tft.setRotation(iliRotation270);
  tft.fillScreen(colorFONDO);
  tft.setFont(SystemFont5x7); 

  tft.setTextColor(WHITE);

  Serial.println("tft started");
 
}

void loop() {
  char  buf[120];
  test();
 
  sprintf (buf, "Ende HaWe brickbench");   
  Serial.println( buf);
  TFTprint(buf, 0, 140);
 
  while(1);
}

I have a comparison of MCU SPI speeds that includes the ZERO, DUE, and teensy; see SPIperf.txt. The graphics performance is guaranteed not to be faster than the underlying SPI clock!

Thanks, but I was just curious about the Zero and Teensy results with the marekburiak/ILI9341_due test.

Here are the times in microseconds for the graphicstest example from the ILI9341 libs.

                                  DUE  teensy3.1     mega     ZERO
      Screen fill              148311     224767  2416012  2004456
      Text                      31836      14786   267536   259034
      Lines                    116353      58644  2494424  2700486
      Horiz/Vert Lines          14926      18474   202868   174972
      Rectangles (outline)       8839      11714   133980   117306
      Rectangles (filled)      311035     467409  5018684  4163854
      Circles (filled)          60764      74296   994000   952879
      Circles (outline)         84131      63726  1090360  1178594
      Triangles (outline)       29263      14230   791332   856384
      Triangles (filled)       118001     156737  2029780  1628356
      Rounded rects (outline)   33152      28353   416496   430619
      Rounded rects (filled)   316887     511531  5552360  4645896

Notes:

  • DUE optimized lib, SPI CLK 42MHz
  • Teensy 3.1 optimized lib, SPI CLK 30MHz
  • mega2560, SPI CLK 8 MHz
  • ZERO, SPI CLK 12 MHz

No one has written an ILI9341 lib for the ZERO, so I hacked the Adafruit library, using sercom4.writeDataSPI() for data and digitalWrite() for CS. As noted above in this thread, the SPI speed is the limiting factor. The ZERO could profit from an SPI+DMA boost, but its max SPI clock is only 12 MHz.
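The SPI-clock limit can be put into numbers with a rough lower bound for one full-screen fill. This is a host-side sketch under two assumptions: a fill sends 320 x 240 pixels at 16 bits each, and command bytes, chip-select toggling and software overhead are ignored. The names min_transfer_us and kFillBits are illustrative only.

```cpp
#include <cstdint>

// Minimum time in microseconds to clock `bits` over SPI at `spi_hz`,
// ignoring command bytes, chip-select toggling and software overhead.
constexpr double min_transfer_us(uint64_t bits, double spi_hz) {
    return static_cast<double>(bits) / spi_hz * 1e6;
}

// Assumption: one full-screen fill sends 320 x 240 pixels, 16 bits each.
constexpr uint64_t kFillBits = 320ULL * 240ULL * 16ULL;  // 1,228,800 bits

// SPI clocks taken from the notes above.
constexpr double due_us    = min_transfer_us(kFillBits, 42e6);  // about 29,257
constexpr double teensy_us = min_transfer_us(kFillBits, 30e6);  // about 40,960
constexpr double zero_us   = min_transfer_us(kFillBits, 12e6);  // about 102,400
constexpr double mega_us   = min_transfer_us(kFillBits, 8e6);   // about 153,600
```

The DUE's "Screen fill" figure of 148311 us is almost exactly five times its per-fill bound, which would fit a test that fills the screen five times at close to the SPI limit (the fill count is an inference from the numbers, not something stated in this thread). The ZERO and mega figures are roughly 20 and 16 times their bounds, which hints at per-transfer software overhead on top of the clock limit.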

your mileage may vary

Perhaps you should consider running the loops without the multiplication / division in the middle and subtracting that time from the loops with it to determine how much time the operation actually takes versus the loops.

Jumps for loops can often be expensive operations that might eclipse the time spent doing the multiplication. Especially when you're doing one for every single multiplication.

True. The other posters have presented benchmarks that show the Zero has big speed gains. However, every one of them shows abysmal floating-point performance. The Zero actually did better on my benchmark (about 2.8x the Uno) than in the others in this thread, where the Zero did worst of the platforms tested:

Type                        Mega2560    Due    ZERO   teensy3.1

 2 5000 float operations         326    107     397          92

This looks like a poorly optimized floating-point emulation library for the Zero. Since I work with a lot of floating-point state-space systems, slow matrix operations are a big hit for me. Note: some things can be done with integer math, and multidimensional non-homogeneous non-linear Kalman systems can be sped up with the Bierman-Thornton equations, but sometimes floating-point performance is necessary and can't be programmed away.
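As an illustration of the "integer math" route mentioned above, here is a minimal fixed-point sketch. It assumes a Q16.16 format (16 integer bits, 16 fraction bits); the type and function names are hypothetical and do not come from any library discussed in this thread.

```cpp
#include <cstdint>

// Q16.16 fixed point: 16 integer bits, 16 fraction bits in an int32_t.
using q16_16 = int32_t;

constexpr q16_16 to_q(double x) { return static_cast<q16_16>(x * 65536.0); }
constexpr double from_q(q16_16 x) { return x / 65536.0; }

// Multiply via a 64-bit intermediate, then drop the extra 16 fraction bits.
// No overflow check: results must stay within roughly +/-32768.
constexpr q16_16 q_mul(q16_16 a, q16_16 b) {
    return static_cast<q16_16>((static_cast<int64_t>(a) * b) >> 16);
}

// One dot product of length 5, the inner loop of a 5x5 matrix multiply.
constexpr q16_16 q_dot5(const q16_16 (&row)[5], const q16_16 (&col)[5]) {
    q16_16 acc = 0;
    for (int n = 0; n < 5; n++) acc += q_mul(row[n], col[n]);
    return acc;
}
```

For example, q_mul(to_q(1.5), to_q(2.5)) yields the Q16.16 encoding of 3.75 using only integer multiply and shift. The format has a resolution of about 0.000015 and a ceiling near 32768, so data spanning 0.0001 to 10000 like the benchmark matrices earlier in the thread would lose precision or overflow, which is the kind of case where floating point cannot be programmed away.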

Here are speedTest results, described here

                              Teensy 3.1   Teensy 3.1    Teensy LC         ZERO          DUE
  speedTest                       @96MHz       @48MHz       @48MHz       @48MHz       @84MHz
  nop                       :      0.010 :      0.021 :      0.021 :      0.021 :      0.012 us
  Arduino digitalRead       :      0.146 :      0.292 :      0.503 :      0.901 :      1.033 us
  Arduino digitalWrite      :      0.466 :      0.867 :      1.113 :      1.366 :      1.263 us
  pinMode                   :      0.238 :      0.469 :      1.151 :      1.938 :      3.068 us
  multiply volatile byte    :      0.062 :      0.124 :      0.173 :      0.201 :      0.118 us
  divide volatile byte      :      0.088 :      0.167 :      0.551 :      0.606 :      0.138 us
  multiply volatile integer :      0.062 :      0.124 :      0.150 :      0.168 :      0.083 us
  divide volatile integer   :      0.068 :      0.147 :      0.786 :      0.906 :      0.093 us
  multiply volatile long    :      0.063 :      0.124 :      0.151 :      0.168 :      0.083 us
  multiply single float     :      0.456 :      0.879 :      2.503 :      2.796 :      0.903 us
  multiply double float     :      0.686 :      1.362 :      3.871 :      4.183 :      1.158 us
  divide double float       :     11.923 :     21.222 :     41.096 :     43.546 :     19.118 us
  random()                  :      0.373 :      0.697 :      9.021 :      9.771 :      1.368 us
  bitSet() with volatile    :      0.052 :      0.104 :      0.108 :      0.124 :      0.070 us
  analogRead()              :      8.448 :      9.147 :     12.846 :    423.096 :     39.543 us
  analogWrite() PWM         :      1.678 :      2.562 :      3.871 :      8.291 :      3.508 us
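Because the boards run at different clocks, the table is easier to compare once the times are converted to clock cycles (time in us multiplied by clock in MHz). A small host-side sketch, using the "multiply single float" row and illustrative names:

```cpp
// Cycles = time (us) x clock (MHz). Host-side arithmetic only.
constexpr double cycles(double time_us, double clock_mhz) {
    return time_us * clock_mhz;
}

// "multiply single float" row from the table above.
constexpr double fmul_teensy31  = cycles(0.456, 96.0);  // about 43.8
constexpr double fmul_teensy_lc = cycles(2.503, 48.0);  // about 120.1
constexpr double fmul_zero      = cycles(2.796, 48.0);  // about 134.2
constexpr double fmul_due       = cycles(0.903, 84.0);  // about 75.9

// "nop" row for the ZERO: close to one cycle, as a sanity check.
constexpr double nop_zero = cycles(0.021, 48.0);        // about 1.0
```

On these figures a single-precision multiply costs the two Cortex-M0+ boards roughly three times as many cycles as the Teensy 3.1, so the gap is not explained by clock speed alone.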

The Teensy LC is also a Cortex-M0+; note that this architecture has no hardware integer divide instruction.
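Without a divide instruction, every integer `/` turns into a call to a runtime helper (on ARM EABI toolchains the unsigned one is named __aeabi_uidiv). The following is only a textbook shift-and-subtract sketch of what such a helper has to do, not the actual library code; soft_udiv32 is a made-up name.

```cpp
#include <cstdint>

// Restoring (shift-and-subtract) division of two unsigned 32-bit values.
// Illustrative only: real runtime routines are hand-optimized.
// Division by zero is not handled; the caller must pass d != 0.
constexpr uint32_t soft_udiv32(uint32_t n, uint32_t d) {
    uint32_t q = 0;  // quotient, built one bit at a time
    uint32_t r = 0;  // running remainder
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);  // bring down the next dividend bit
        if (r >= d) {                    // divisor fits: subtract, set bit
            r -= d;
            q |= (1u << i);
        }
    }
    return q;
}
```

Up to 32 loop iterations per division, each with a shift, a compare and sometimes a subtract, goes some way toward explaining why the divide rows for the two Cortex-M0+ boards are several times slower than their multiply rows.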

Any chance someone could give this simple Print benchmark a try on Zero?

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint32_t t1 = micros();
  Serial.println(100);
  Serial.println(0xFFFFFFFFul);
  Serial.println(12345);
  Serial.println(-97847383l);
  uint32_t t2 = micros();
  Serial.print("microseconds = ");
  Serial.println(t2 - t1);
  Serial.println();
  delay(1000);
}

(or maybe change it to SerialUSB, if Zero's Serial isn't buffered yet...)

Here's your sketch output, run on the programming port:

100
4294967295
12345
-97847383
microseconds = 2876

100
4294967295
12345
-97847383
microseconds = 2875

100
4294967295
12345
-97847383
microseconds = 2877

100
4294967295
12345
-97847383
microseconds = 2875

Here's the same sketch, run on the native USB port:

100
4294967295
12345
-97847383
microseconds = 780

100
4294967295
12345
-97847383
microseconds = 806

100
4294967295
12345
-97847383
microseconds = 930

100
4294967295
12345
-97847383
microseconds = 830
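The programming-port figure is close to what the UART bit rate alone would predict. A rough host-side estimate, assuming 8N1 framing (10 bits on the wire per byte) and that each println() appends "\r\n"; the names are illustrative:

```cpp
// Time on the wire for `bytes` characters at `baud`, assuming 8N1 framing
// (1 start + 8 data + 1 stop = 10 bits per byte).
constexpr double uart_time_us(int bytes, double baud) {
    return bytes * 10.0 / baud * 1e6;
}

// Characters printed by the four timed println() calls, each ending in
// "\r\n": "100" (3+2), "4294967295" (10+2), "12345" (5+2), "-97847383" (9+2).
constexpr int kBytes = 5 + 12 + 7 + 11;  // 35

constexpr double wire_us = uart_time_us(kBytes, 115200.0);  // about 3038
```

The measured 2875 to 2877 us corresponds to about 33 byte times, which would be consistent with the last couple of bytes still sitting in the transmitter when the second micros() is read. The native USB port is not tied to that bit rate, which fits its much lower 780 to 930 us.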

AFAIK, the Atmel ARM compiler is using the standard GNU soft floating-point code, which means that instead of getting highly optimized 32-bit routines mostly written in AVR assembly language, you're getting very generic and careful (and large!) 64-bit (or perhaps 80-bit) code written in C. :-(

Ah. I'm wrong! It looks like there are nice optimized assembler floating-point functions for ARM cores other than armv6-m (i.e. M0, M0+). IIRC, armv6-m is the only ARM sub-architecture that has essentially ONLY the 16-bit Thumb instructions.

I've been looking at an alternative software single precision floating point library that is very tiny (about 1k for basic math plus trig and log) and supposedly quicker than the gcc libraries (especially for the "scientific" functions.) It's still a bit not-ready-for-arduino, but looks promising. (also, by virtue of the way that gcc works, it should be pretty easy to add to any sketch, REPLACING the default libraries...) ( http://www.quinapalus.com/qfplib.html )
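To give a feel for what any software floating-point routine has to do per operation, here is a deliberately simplified single-precision multiply working on raw IEEE-754 bit patterns. It is a sketch under stated limits (normal, finite inputs and results only; truncation instead of rounding) and is not how qfplib or the gcc libraries are written; soft_fmul_bits is a hypothetical name.

```cpp
#include <cstdint>

// Multiply two IEEE-754 single-precision values given as raw bit patterns.
// Simplified: handles normal, finite inputs and results only (no zeros,
// subnormals, infinities or NaNs) and truncates instead of rounding.
constexpr uint32_t soft_fmul_bits(uint32_t a, uint32_t b) {
    uint32_t sign = (a ^ b) & 0x80000000u;             // sign of the product
    int32_t exp_a = (a >> 23) & 0xFF;                  // biased exponents
    int32_t exp_b = (b >> 23) & 0xFF;
    uint64_t man_a = (a & 0x007FFFFFu) | 0x00800000u;  // restore hidden 1
    uint64_t man_b = (b & 0x007FFFFFu) | 0x00800000u;

    uint64_t prod = man_a * man_b;      // 48-bit product, in [2^46, 2^48)
    int32_t exp = exp_a + exp_b - 127;  // remove one copy of the bias

    if (prod & (1ULL << 47)) {          // product in [2, 4): renormalize
        prod >>= 1;
        exp += 1;
    }
    uint32_t man = static_cast<uint32_t>(prod >> 23) & 0x007FFFFFu;
    return sign | (static_cast<uint32_t>(exp) << 23) | man;
}
```

For instance, the bit patterns of 1.5f (0x3FC00000) and 2.5f (0x40200000) produce 0x40700000, which is 3.75f. A production routine also needs correct rounding plus the zero, subnormal, infinity and NaN cases, and that extra handling is a large part of why the generic code is bigger and slower than a tuned one.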