 Arduino Due Benchmark - Newton Approximation for Pi

After receiving my Due, I wanted to benchmark its numeric processing power as compared to the Mega. I was also interested in understanding which of my standard Arduino shields and gadgets were compatible with the 3.3v I/O of the Due.

As a recovering physicist, I could think of no better test than to approximate pi and periodically show the running approximation on a 1638 display module!

I used the slowly converging Newton approximation for pi (the alternating series more commonly attributed to Leibniz and Gregory), which calculates pi / 4 from the infinite series 1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + …
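For anyone who wants to try the series away from the board, here is a minimal host-side sketch in plain C++. The function name `approximate_pi` and the variable names are my own choices for illustration; the loop bounds mirror the Arduino loop in the sketch posted in this thread (the first term is preloaded, then i runs from 2 to niter-1).

```cpp
// Sums 1 - 1/3 + 1/5 - ... and scales by 4 to estimate pi.
// Follows the convention of the Arduino sketch: the first term (1.0) is
// preloaded and the loop runs i = 2 .. niter-1, so niter-1 terms in total.
double approximate_pi(unsigned long niter) {
    double sign = 1.0;        // alternates between +1 and -1
    double quarter_pi = 1.0;  // first term of the series
    for (unsigned long i = 2; i < niter; ++i) {
        sign = -sign;
        quarter_pi += sign / (2.0 * static_cast<double>(i) - 1.0);
    }
    return 4.0 * quarter_pi;
}
```

Calling `approximate_pi(100000)` on a desktop compiler gives a value that is too high by about 1e-5, which is the kind of error to expect after 100,000 iterations.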

This is simple to express in Arduino C, and uses the floating point libraries to further assess performance. I tested a Due and a Mega, connected to a JY-LKM1638 V1.2 display module (which is based on the TM1638 controller chip). The times required to traverse 100,000 iterations were as follows:

Due: 1785 ms
Mega: 6249 ms
(roughly 3.5x performance difference)

Upping ITERATIONS to 10000000 (ten million) gets pi accurate to about 8 significant digits!
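A quick sanity check on that digit count, using the standard remainder bound for an alternating series (a textbook result, not something measured here). With N terms summed, the error in pi is at most four times the first omitted term:

```latex
\left|\,\pi - 4\sum_{k=1}^{N}\frac{(-1)^{k-1}}{2k-1}\,\right| \;\le\; \frac{4}{2N+1} \;\approx\; \frac{2}{N}
```

For N = 10^7 this bounds the error at about 2 x 10^-7, so the estimate should agree with pi to roughly 7 significant digits, and each additional digit costs about ten times more iterations.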

Photo attached, and sketch follows:

//
// Pi_2
//
// Steve Curd
// December 2012
//
// This program approximates pi utilizing the Newton's approximation.  It quickly
// converges on the first 5-6 digits of precision, but converges verrrry slowly
// after that.  For example, it takes over a million iterations to get to 7-8
// significant digits.
//
// For demonstration purposes, drives a JY-LKM1638 display module to show the
// approximated value after each 1,000 iterations, and toggles the pin13 LED for a
// visual "sign of life".
//
// I wrote this to evaluate the performance difference between the 8-bit Arduino Mega,
// and the 32-bit Arduino Due.
//
// Benchmark results for 100,000 iterations (pi accurate to 5 significant digits):
//
// Due: 1785 ms
// Mega: 6249 ms
//
// 1638 display module connections:
// VCC -> 3.3v
// GND -> GND
// DIO -> Pin 8
// CLK -> Pin 9
// STB0 -> Pin 7
//
//
#define ITERATIONS 20000000L    // number of iterations
#define FLASH 1000            // blink LED every 1000 iterations

#include <TM1638.h>

// TM1638 module(DIO, CLK, STB0)
TM1638 module(8, 9, 7);

void setup() {
pinMode(13, OUTPUT);
Serial.begin(57600);
}

void loop() {

unsigned long start, time;
unsigned long niter=ITERATIONS;
int LEDcounter = 0;
boolean alternate = false;
unsigned long i, count=0;        /* loop index; count is unused in this sketch */
double x = 1.0;
double temp, pi=1.0;

start = millis();

Serial.print("Beginning ");
Serial.print(niter);
Serial.println(" iterations...");
Serial.println();

count=0;
for ( i = 2; i < niter; i++) {
x *= -1.0;
pi += x / (2.0*(double)i-1);
if (LEDcounter++ > FLASH) {
LEDcounter = 0;
if (alternate) {
digitalWrite(13, HIGH);
alternate = false;
} else {
digitalWrite(13, LOW);
alternate = true;
}
temp = 40000000.0 * pi;
module.setDisplayToDecNumber( temp, 0x80);
}
}

time = millis() - start;

pi = pi * 4.0;

Serial.print("# of trials = ");
Serial.println(niter);
Serial.print("Estimate of pi = ");
Serial.println(pi, 10);

Serial.print("Time: "); Serial.print(time); Serial.println(" ms");

delay(10000);
}


There might be an imbalance between your tests as on a Uno/Mega double is only really a float (32 bits) but on the Due I think it is full double precision (64 bits). Try it again using float on the Due, you should still get 7 digits of precision out of it.
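A quick way to see what a given toolchain does is to ask the compiler directly. The helper names below are hypothetical, and this is plain C++ for a desktop compiler or the Due's ARM toolchain (the AVR toolchain does not ship `<limits>`). On a Mega, printing `sizeof(double)` over Serial gives the same information.

```cpp
#include <cstddef>
#include <limits>

// Size in bytes of each floating point type on the current target.
constexpr std::size_t float_bytes()  { return sizeof(float); }
constexpr std::size_t double_bytes() { return sizeof(double); }

// Decimal digits each type can represent without loss.
constexpr int float_digits()  { return std::numeric_limits<float>::digits10; }
constexpr int double_digits() { return std::numeric_limits<double>::digits10; }

// True when double really carries more precision than float on this target.
constexpr bool double_is_wider() { return double_digits() > float_digits(); }
```

If `double_is_wider()` is false, as it would be where double is an alias for a 32-bit float, switching the sketch between float and double cannot change the timing.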

Ok, well, this is strange. I changed all of the doubles to floats. The Mega had exactly the same time (which reinforces your point). However the time for the Due actually went UP.

For the Due:
double: 1785 ms
float: 2056 ms

I haven’t had time to go through the floating point library, but do you suppose the Due handles all floating points as doubles by default?

Steve.

No, it's just a 'feature' of C++ which makes it quite hard to get it to do calculations as floats. Basically any floating point constant without a suffix is considered to be a double, and because the calculation then has one double in it, the whole thing gets done in double precision. To get round it you have to put f after the constants, i.e.:

pi += x / (2.0f*(float)i-1.0f);

I am getting 675 ms for 100000 iterations (3.1416058540), although I have removed the display driver code.
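The promotion rule can be checked at compile time. This is a small sketch for a compiler with the standard `<type_traits>` header (a desktop compiler or the Due's ARM toolchain; the AVR toolchain does not ship it). The function name `term_denominator` is hypothetical.

```cpp
#include <type_traits>

// An unsuffixed literal such as 2.0 has type double, so a float operand
// is converted and the product is computed in double precision.
static_assert(std::is_same<decltype(2.0 * 1.0f), double>::value,
              "double literal promotes the whole expression");

// With f-suffixed literals every operand is float and so is the result.
static_assert(std::is_same<decltype(2.0f * 1.0f - 1.0f), float>::value,
              "float literals keep the expression in single precision");

// Denominator of the i-th series term, evaluated entirely in float.
inline float term_denominator(unsigned long i) {
    return 2.0f * static_cast<float>(i) - 1.0f;
}
```

If either `static_assert` failed, the file would not compile, so a clean build is the confirmation.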

This reminds me of another pi tester, one that also gives an indication of the quality of the random generator:

Throw a dart at a square board with the largest possible circle drawn on it. The chance that it lands inside the circle is pi/4.
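The pi/4 figure is just a ratio of areas. For a circle of radius r inscribed in a square of side 2r:

```latex
P(\text{hit}) = \frac{\pi r^{2}}{(2r)^{2}} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4\,\frac{\text{hits}}{\text{throws}}
```

The sketch that follows works in one quadrant instead (a quarter circle of radius 100 in a 100 x 100 square), which gives the same ratio; its `in` counter holds the hits and `out` the throws.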

//
//    FILE: pi.pde
//  AUTHOR: Rob Tillaart
//    DATE: 2011
//
// PURPOSE: approx pi by 'darting' randomly
//
float pi = 0;
long in = 0;
long out = 0;

int x = 0;
int y = 0;
int xx = 0;
int yy = 0;

void setup()
{
Serial.begin(115200);
}

void loop()
{
x = y; //random(0,101);
xx = yy;
y = random(0,101);
yy = y * y;
if (xx + yy <= 10000) in++;
out++;
if ((out % 10000) == 0) Serial.println(4.0 * in / out, 7);
}

Hi Stimmer,
That was it – thank you! The new benchmark numbers, with the LED blink and display logic enabled:

Using float precision, 100,000 iterations with display:
Mega: 6249 ms
Due: 821 ms (Due is about 7.5x faster)

Using double precision, 100,000 iterations with display:
Mega: 6249 ms (obviously still using float)
Due: 1780 ms (actually using double)

(Using double precision on the Due, 8 million iterations will yield about 8 digits of precision in 143,000 ms. Wow, Newton converges slowly for high levels of precision…)

# of trials = 8000000
Estimate of pi = 3.1415927786
Time: 143102 ms
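Those numbers line up with a simple rule of thumb: the error of this series after N terms is close to 1/N, and the estimate above is off by about 1.25e-7 after 8 million iterations. A rough planning sketch follows. The function names are hypothetical, and the run-time estimate simply scales the 143102 ms measurement linearly, so treat it as a ballpark for this particular sketch on the Due with double precision.

```cpp
#include <cmath>

// Approximate number of iterations needed for a target absolute error,
// using the rule of thumb error ~ 1/N for this series.
long long iterations_for_error(double target_error) {
    return std::llround(1.0 / target_error);
}

// Estimated run time in ms, scaled linearly from a measured run
// (measured_ms for measured_iters iterations).
double estimated_ms(long long iters, double measured_ms, long long measured_iters) {
    return measured_ms * static_cast<double>(iters) / static_cast<double>(measured_iters);
}
```

For example, `estimated_ms(iterations_for_error(1e-8), 143102.0, 8000000)` suggests that one more digit would take roughly half an hour on the Due.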

Below is the code using float, which runs on both the Mega and the Due for an apples-to-apples performance comparison:

//
// Pi_2
//
// Steve Curd
// December 2012
//
// This program approximates pi utilizing the Newton's approximation.  It quickly
// converges on the first 5-6 digits of precision, but converges verrrry slowly
// after that.  For example, it takes over a million iterations to get to 7-8
// significant digits.
//
// For demonstration purposes, drives a JY-LKM1638 display module to show the
// approximated value after each 1,000 iterations, and toggles the pin13 LED for a
// visual "sign of life".
//
// I wrote this to evaluate the performance difference between the 8-bit Arduino Mega,
// and the 32-bit Arduino Due.
//
// Benchmark results for 100,000 iterations (pi accurate to 5 significant digits):
//   Mega: 6249 ms
//   Due: 821 ms (Due is about 7.5x faster)
//
// 1638 display module connections:
// VCC -> 3.3v
// GND -> GND
// DIO -> Pin 8
// CLK -> Pin 9
// STB0 -> Pin 7
//
//

#define ITERATIONS 100000L    // number of iterations
#define FLASH 1000            // blink LED every 1000 iterations

#include <TM1638.h>           // include the display library

// TM1638 module(DIO, CLK, STB0)
TM1638 module(8, 9, 7);

void setup() {
pinMode(13, OUTPUT);        // set the LED up to blink every 1000 iterations
Serial.begin(57600);
}

void loop() {

unsigned long start, time;
unsigned long niter=ITERATIONS;
int LEDcounter = 0;
boolean alternate = false;
unsigned long i, count=0;
float x = 1.0;
float temp, pi=1.0;

Serial.print("Beginning ");
Serial.print(niter);
Serial.println(" iterations...");
Serial.println();

start = millis();
for ( i = 2; i < niter; i++) {
x *= -1.0;
pi += x / (2.0f*(float)i-1.0f);
if (LEDcounter++ > FLASH) {
LEDcounter = 0;
if (alternate) {
digitalWrite(13, HIGH);
alternate = false;
} else {
digitalWrite(13, LOW);
alternate = true;
}
temp = 40000000.0 * pi;
module.setDisplayToDecNumber( temp, 0x80);
}
}
time = millis() - start;

pi = pi * 4.0;

Serial.print("# of trials = ");
Serial.println(niter);
Serial.print("Estimate of pi = ");
Serial.println(pi, 10);

Serial.print("Time: "); Serial.print(time); Serial.println(" ms");

delay(10000);
}


Cool project, maybe adding something that measures the RAM in use would enrich your benchmark. Greetings

Ran the last example on an Arduino Edison...

11 ms

Stan

securd: Ok, well, this is strange. I changed all of the doubles to floats. The Mega had exactly the same time (which reinforces your point).

The official Arduino docs clearly state that on AVR, double is the same as float: 32-bit, with 6-7 digits of precision.

However the time for the Due actually went UP.

For due... double: 1785 ms float: 2056 ms

I haven't had time to go through the floating point library, but do you suppose the Due handles all floating points as doubles by default?

As the Cortex-M3 core (on which the SAM3X used in the Due is based) doesn't have an FPU (only the M4 has an optional 32-bit FPU, and the M7 an optional 32-bit or 64-bit FPU), and in order not to implement all the functions needed to emulate FP operations in both 32-bit and 64-bit, it is probably safe to assume that all FP operations are indeed 64-bit by default and eventually truncated to 32-bit. That would explain the relatively small increase when using float vs double on the Due, as each parameter and result has to be converted to 64-bit and back to 32-bit...
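One way to see what the compiler itself assumes about floating point hardware is the `__ARM_FP` macro from the ARM C Language Extensions (ACLE). This is a minimal sketch, assuming a GCC or Clang toolchain that follows ACLE; the function name is hypothetical, and on a non-ARM host the macro is simply undefined.

```cpp
// Reports which floating point widths the target supports in hardware,
// based on the ACLE __ARM_FP bitmask (0x4: single precision, 0x8: double).
inline const char* arm_hardware_fp() {
#if defined(__ARM_FP)
  #if (__ARM_FP & 0x8) && (__ARM_FP & 0x4)
    return "hardware single and double precision";
  #elif (__ARM_FP & 0x4)
    return "hardware single precision only";
  #elif (__ARM_FP & 0x8)
    return "hardware double precision only";
  #else
    return "no single or double precision hardware floating point";
  #endif
#else
    // Soft-float ARM builds and non-ARM targets end up here.
    return "__ARM_FP not defined (no ARM hardware floating point reported)";
#endif
}
```

I would expect a Due build to take the last branch, since the Cortex-M3 has no FPU and all float and double arithmetic goes through library routines.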

Ralf

Hi all,
As PCWorxLA rightly pointed out (some 3 years ago), ARM Cortex M4…M7 MCUs have SP and DP FPUs that do speed up securd's code.
On an STM32 Nucleo F767 board with an ARM M7 MCU (@216 MHz), without the LCD display module code, the float version (SP, 32-bit FPU) makes the 100k run in less than 15 ms:

Beginning  100000
iterations...
# of trials = 100000

Estimate of pi = 3.1416058540 10
Time:   14 ms

Beginning  100000
iterations...
# of trials = 100000

Estimate of pi = 3.1416058540 10
Time:   15 ms

I modified that code so its results can be compared with a PC (single-threaded C++) and an ESP32 (NodeMCU-32 board). The number of iterations was increased to 10M, and the LED blinking is either omitted or done every 100k-1M iterations.
You can find more mbed code for the STM32 Nucleo F7 at:
https://developer.mbed.org/users/JovanEps/code/Newton_s_approximation_bench/

The 10M-iteration version gives the following on the PC, ARM M7 and ESP32:

----------------------
----- Nucleo M7 SP FPU -10M
----------------------
Beginning  10000000
iterations...
# of trials = 10000000

Estimate of pi = 3.1415936536 10
Time:   1204 ms

----------------------
----- Nucleo M7 DP FPU -10M
----------------------
Beginning  10000000
iterations...
# of trials = 10000000

Estimate of pi = 3.1415936536 10
Time:   2061 ms

-----------------------------------------
----------------------
----- ESP 32 SP FPU -10M
----------------------
Beginning 10000000 iterations...

# of trials = 10000000
Estimate of pi = 3.1415936536
Time: 2810 ms

----------------------
----- ESP 32 DP FPU -10M
----------------------
Beginning 10000000 iterations...

# of trials = 10000000
Estimate of pi = 3.1415936536
Time: 33497 ms

---------------------------------
----------------------
----- PC 1x thread on AMD Phenom II X6 1100T - 3.3GHz- SP/DP FPU -10M
----------------------

Beginning  10000000
iterations...
# of trials = 10000000

Estimate of pi = 3.1415927536 10
time: 0.094005 sec.
state: 0

Process returned 0 (0x0)   execution time : 0.146 s
Press any key to continue.

Arduino IDE code for the ESP32:

#define ITERATIONS 10000000L    // number of iterations 100k-10M
#define FLASH 100000            // blink LED every 100k-1M iterations

void setup() {
pinMode(13, OUTPUT);        // set the LED up to blink every 100k-1M iterations
pinMode(LED_BUILTIN, OUTPUT);
Serial.begin(9600);
}

void loop() {

unsigned long start, ttime;
unsigned long niter = ITERATIONS;
int LEDcounter = 0;
boolean alternate = false;
unsigned long i, count = 0;
double x = 1.0;       //double
double temp, pi = 1.0; //double
digitalWrite(LED_BUILTIN, LOW);

Serial.print("Beginning ");
Serial.print(niter);
Serial.println(" iterations...");
Serial.println();
digitalWrite(LED_BUILTIN, HIGH);

start = millis();
for ( i = 2; i < niter; i++) {
x *= -1.0;
pi += x / (2.0f * (double)i - 1.0f); //double

if (LEDcounter++ > FLASH) {
LEDcounter = 0;
if (alternate) {
//digitalWrite(13, HIGH);
digitalWrite(LED_BUILTIN, HIGH);
alternate = false;
} else {
//digitalWrite(13, LOW);
digitalWrite(LED_BUILTIN, LOW);
alternate = true;
}
temp = (float) 40000000.0 * pi;
}
}
ttime = millis() - start;
pi = pi * 4.0;
digitalWrite(LED_BUILTIN, LOW);

Serial.print("# of trials = ");
Serial.println(niter);
Serial.print("Estimate of pi = ");
Serial.println(pi, 16);
Serial.print("Time: "); Serial.print(ttime); Serial.println(" ms");

delay(3000);
}

If you want to read more about the performance of new 32-bit IoT-ready MCUs, check this paper:
http://www.eventiotic.com/eventiotic/library/paper/326?event=0
or
https://www.researchgate.net/publication/316173015_Analysis_of_the_performance_of_the_new_generation_of_32-bit_Microcontrollers_for_IoT_and_Big_Data_Application