Arduino Due digitalWrite vs. direct port manipulation speed

Currently, I am working on a very speed-constrained application using the Arduino Due to send data to my PC from a high-ish speed ADC (24-bit, 128kHz). Unfortunately, using the built-in libraries I was only able to sample from the ADC at ~44kHz, and it looks like there’s a lot of dead time that isn’t SPI or USB transactions. I was curious how much of that dead time was due to the overhead of the built-in Arduino libraries, so I did a test.

Here’s the result on pin 53 using nothing but alternating HIGH and LOW digitalWrite calls (see code below):

As you can see from the scope trace, it takes about 2.24us to complete a digitalWrite (high or low). That’s a LOT of time in real-time digital land (about 188 cycles with the Due’s 84MHz clock).

And here’s the result using a direct write to the port:

This takes about 23ns to complete a write (high or low), or 2 clock cycles. This is exactly what is specified in the SAM3X datasheet in Section 31.1 (it should take 2 clock cycles to do a register read or write). It’s so fast my poor 50MHz scope can barely keep up.

In other words, I got a speedup of ~100x using direct writing to registers over using the Arduino’s built-in libraries. I was actually so surprised by this I decided to post it here.
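For anyone who wants to check the arithmetic, here is a rough C++ sketch that converts the two scope readings into clock cycles and a speedup factor. The function names are made up for illustration; the 84MHz clock and the 2.24us and 23ns figures are the ones quoted above.

```cpp
#include <cassert>

// Convert a measured duration into CPU clock cycles.
// seconds: measured time, clockHz: CPU clock (84e6 on the Due).
double secondsToCycles(double seconds, double clockHz) {
    assert(seconds >= 0.0 && clockHz > 0.0);
    return seconds * clockHz;
}

// Ratio of the slow measurement to the fast one.
double speedup(double slowSeconds, double fastSeconds) {
    assert(slowSeconds >= 0.0 && fastSeconds > 0.0);
    return slowSeconds / fastSeconds;
}
```

With these, secondsToCycles(2.24e-6, 84e6) comes out at roughly 188 cycles, secondsToCycles(23e-9, 84e6) at roughly 1.9, and speedup(2.24e-6, 23e-9) at roughly 97.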

[EDIT: code]

void setup() {
  // put your setup code here, to run once:
  pinMode(53, OUTPUT);
}

void loop() {
  // put your main code here, to run repeatedly:
  /*
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  */
  digitalWrite(53, HIGH);
  digitalWrite(53, LOW);
  digitalWrite(53, HIGH);
  digitalWrite(53, LOW);
  digitalWrite(53, HIGH);
  digitalWrite(53, LOW);
  digitalWrite(53, HIGH);
  digitalWrite(53, LOW);
  digitalWrite(53, HIGH);
  digitalWrite(53, LOW);
}

There was a recent topic on AvrFreaks. I wrote this sketch to generate 50 toggles, i.e. 100 edges, and measured the start and finish times with a Logic Analyser.

The interesting result is that the Due's digitalWrite() is appalling.
M0 and M3 are not that impressive.
M4 gives a very fast result.

In practice, many Arduino apps know the wiring at compile time, so they can use the appropriate register writes.
If you have to determine the GPIO at run time, it is well worth writing a mask to the port address.

// Time for 100 edges in us, followed by CPU cycles per write
//                  F072 @ 48MHz  F103 @ 64MHz  L476 @ 80MHz  F446 @ 180MHz  SAM3X @ 84MHz  SAMD21 @ 48MHz  UNO @ 16MHz
//digitalWrite:     98.75   47    101.6   65    41.33   33    18.5    33     207.2   174    154.8   74      356.5   57
//PortAddress:      14.83    7    17.29   11     7.67    6     3.96    7      13.17   11     25.29  12       31.92   5
//ReadModifyWrite:  10.58    5    14.21    9     7.67    6     2.83    5      10.79    9     20.46  10       12.63   2
//WriteOnly:         4.33    2     3.29    2     1.33    1     0.58    1       2.38    2      8.96   4       12.63   2

#define TGL  { TGL_RMW; }

#if 0
#elif defined(ARDUINO_SAM_DUE)
#define P8 PIOC
#define B8 22
#define P9 PIOC
#define B9 21
#define PIN_HIGH(port, pin)   (port)-> PIO_SODR = (1<<(pin))
#define PIN_LOW(port, pin)    (port)-> PIO_CODR = (1<<((pin)))
#define PIN_HIGHX(port, pin)   (port)-> PIO_ODSR |= (1<<(pin))
#define PIN_LOWX(port, pin)    (port)-> PIO_ODSR &= ~(1<<((pin)))

#elif defined(ARDUINO_ARCH_SAMD)
#define P8 REG_PORT_OUT0
#define B8 6
#define P9 REG_PORT_OUT0
#define B9 7
#define PIN_HIGH(port, pin)   REG_PORT_OUTSET0 = (1<<(pin))
#define PIN_LOW(port, pin)    REG_PORT_OUTCLR0 = (1<<((pin)))
#define PIN_HIGHX(port, pin)   (port) |= (1<<(pin))
#define PIN_LOWX(port, pin)    (port) &= ~(1<<((pin)))

#elif defined(ARDUINO_ARCH_SAMD)   // IOBUS variant; never reached as written, because the identical test above matches first
#define P8 PORT_IOBUS->Group[0]
#define B8 6
#define P9 PORT_IOBUS->Group[0]
#define B9 7
#define PIN_HIGH(port, pin)   (port).OUTSET.reg = (1<<(pin))
#define PIN_LOW(port, pin)    (port).OUTCLR.reg = (1<<((pin)))
#define PIN_HIGHX(port, pin)   (port).OUT.reg |= (1<<(pin))
#define PIN_LOWX(port, pin)    (port).OUT.reg &= ~(1<<((pin)))

#elif defined(ARDUINO_ARCH_STM32)
#define P8 GPIOA
#define B8 7
#define P9 GPIOC
#define B9 7
#define PIN_HIGH(port, pin)   (port)-> BSRR = (1<<(pin))
//#define PIN_LOW(port, pin)    (port)-> BSRR = (1<<((pin)+16))
#define PIN_LOW(port, pin)   (port)-> BRR = (1<<(pin))
#define PIN_HIGHX(port, pin)   (port)-> ODR |= (1<<(pin))
#define PIN_LOWX(port, pin)    (port)-> ODR &= ~(1<<((pin)))

#elif defined(ARDUINO_AVR_UNO)
#define P8 PORTB
#define B8 0
#define P9 PORTB
#define B9 1
#define PIN_HIGH(port, pin)   (port) |= (1<<(pin))
#define PIN_LOW(port, pin)    (port) &= ~(1<<(pin))
#define PIN_HIGHX(port, pin)   (port) |= (1<<(pin))
#define PIN_LOWX(port, pin)    (port) &= ~(1<<(pin))
#endif

#define TGL_ARD { digitalWrite(8, HIGH); digitalWrite(8, LOW); }
//#define TGL_ADS { *d8Port |= d8PinSet; *d8Port &= ~d8PinSet; }
#define TGL_ADS { *d8Port |= d8PinSet; *d8Port &= d8PinClr; }
#define TGL_RMW { PIN_HIGHX(P8, B8); PIN_LOWX(P8, B8); }
#define TGL_WO  { PIN_HIGH(P8, B8); PIN_LOW(P8, B8); }

#define TGL2 { TGL; TGL; }
#define TGL4 { TGL2; TGL2; }
#define TGL8 { TGL4; TGL4; }
#define TGL16 { TGL8; TGL8; }
#define TGL32 { TGL16; TGL16; }
#define TGL50 { TGL32; TGL16; TGL2; }

#if defined(__AVR__)
volatile uint8_t *d8Port;
uint8_t d8PinSet, d8PinClr;
#else
volatile uint32_t *d8Port;
uint32_t d8PinSet, d8PinClr;
#endif

void setup()
{
    Serial.begin(9600);
    Serial.print("toggle GPIO with OUTSET @ F_CPU = ");
    Serial.print(F_CPU / 1000000);
    Serial.println("MHz");
    pinMode(13, OUTPUT);
    pinMode(8, OUTPUT);  //toggle signal
    d8Port = portOutputRegister(digitalPinToPort(8));
    d8PinSet = digitalPinToBitMask(8);
    d8PinClr = ~d8PinSet;
    pinMode(9, OUTPUT);  //start, end signal
}

void loop()
{
    PIN_HIGH(P9, B9);  //digital#9 PC21
    TGL50;   //100 edges digital#8 PA7 
    PIN_LOW(P9, B9);
    digitalWrite(13, HIGH);
    delay(500);
    digitalWrite(13, LOW);
    delay(500);
}

David.

So what is the frequency of the square wave in the case of digitalWrite? ~223kHz?
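As a rough check of that estimate: one period of the square wave is one HIGH write plus one LOW write, so the frequency is 1 / (2 x time per write). A minimal C++ sketch, assuming both writes take the same time and nothing else runs in between; the function name is hypothetical and the 2.24us figure is the one from the first post.

```cpp
#include <cassert>

// Frequency in Hz of a square wave produced by alternating HIGH and LOW
// writes, where each write takes secondsPerWrite.
double squareWaveHz(double secondsPerWrite) {
    assert(secondsPerWrite > 0.0);
    return 1.0 / (2.0 * secondsPerWrite);
}
```

squareWaveHz(2.24e-6) gives a little over 223kHz, which is where the ~223kHz guess comes from.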

If you were asking about my numbers: 100 edges in 207.2us is 50 full square wave cycles, i.e. about 241kHz.
But using PIO_SODR and PIO_CODR, 50 full cycles in 2.38us is about 21MHz.

No one is going to use digitalWrite() for millions of operations.
I was expecting 84MHz Due digitalWrite() to be much faster than the 16MHz Uno.

David.

I would use this code for testing.

void loop() {
  while (1) {
    digitalWrite(53, HIGH);
    digitalWrite(53, LOW);
  }
}

In other words, you do not have any practical application.

If you wanted "fast" you could use a hardware timer: 42MHz squarewave
If you wanted "swift" you could use PIO_SODR: 21MHz squarewave
If you wanted "flexible" you could write to PortAddress: 3.8MHz
If you wanted "snail pace" you could use digitalWrite(): 0.24MHz

The hardware timer solution would give a perfect squarewave.
All of the other methods would have a glitch wherever the software while() loop jumps back to the start.

Hey-ho. Just try it for yourself.

David.

Well, there is "digitalWrite() is embarassingly slow on Due..." (Issue #4030, arduino/Arduino on GitHub) for the Due.
It claims that it was integrated, but I don’t see it in the current core... :frowning:

DigitalWriteFast() for SAMD is here: fastdigitalIO_samd.h (master branch, WestfW/Duino-hacks on GitHub)

As David implies, no one seems to care ...

Edit: I misread. The Due issue was moved, not incorporated. It is now here: "digitalWrite() is embarassingly slow on Due..." (Issue #16, arduino/ArduinoCore-sam on GitHub)

Hi David, what is that option 3, write to PortAddress?

If you look at my sketch:

volatile uint32_t *d8Port;
uint32_t d8PinSet, d8PinClr;
    ...
    d8Port = portOutputRegister(digitalPinToPort(8));
    d8PinSet = digitalPinToBitMask(8);
    ...

All Arduinos can derive the port and pin mask from a digital pin number.
Note that d8Port is uint8_t* on some targets.
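To make the PortAddress style concrete, here is a small host-side model in C++. It is only a sketch: PinHandle is a hypothetical name, and an ordinary variable stands in for the output register. On a real board the pointer and mask would come from portOutputRegister(digitalPinToPort(pin)) and digitalPinToBitMask(pin), as in the sketch above.

```cpp
#include <stdint.h>
#include <cassert>

// Run-time pin handle: pointer to the port's output register plus
// precomputed set and clear masks. Reg would be uint8_t on AVR and
// uint32_t on the ARM boards.
template <typename Reg>
struct PinHandle {
    volatile Reg* port;
    Reg setMask;
    Reg clrMask;

    PinHandle(volatile Reg* p, unsigned bit)
        : port(p),
          setMask(static_cast<Reg>(Reg(1) << bit)),
          clrMask(static_cast<Reg>(~setMask)) {
        assert(bit < 8 * sizeof(Reg));
    }

    // Both operations are read-modify-write on the output register.
    void high() const { *port = static_cast<Reg>(*port | setMask); }
    void low()  const { *port = static_cast<Reg>(*port & clrMask); }
};
```

The masks are computed once, so each toggle costs only a load, an OR or AND, and a store through the pointer. That is why this style sits between digitalWrite() and the compile-time register writes in the table.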

Seriously, most apps don't digitalWrite() very often. Who cares about 10 - 1000 calls.
But other apps do care. e.g. writing to a TFT

As westfw has suggested, there might be a "better" digitalWriteFast()
But if you just want a steady square wave, the hardware timer will be best.

David.

So it writes PIO_ODSR instead of PIO_SODR and PIO_CODR, right?

Go on. You have been a member for 5 years. You are hardly a newcomer.

Yes, you can use Read-Modify-Write if you want (PIO_ODSR). I provide the macros for 4 styles.
In fact, the sketch as posted will build the RMW version, which is considerably slower than the PIO_SODR version.
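A rough way to see why the two styles differ is to count register accesses. The following C++ is a host-side model only, not real hardware: FakePio and its members are hypothetical names, and the counter simply tallies how many times the software touches a register.

```cpp
#include <stdint.h>
#include <cassert>

// Host-side model of a port with separate set and clear registers.
struct FakePio {
    uint32_t odsr = 0;          // modelled pin state
    unsigned busAccesses = 0;   // register reads + writes issued by software

    uint32_t readODSR()          { ++busAccesses; return odsr; }
    void writeODSR(uint32_t v)   { ++busAccesses; odsr = v; }
    // Like PIO_SODR: 1 bits set the pin, 0 bits leave it alone.
    void writeSODR(uint32_t m)   { ++busAccesses; odsr |= m; }
    // Like PIO_CODR: 1 bits clear the pin, 0 bits leave it alone.
    void writeCODR(uint32_t m)   { ++busAccesses; odsr &= ~m; }
};

// Write-only style: one store per edge, other pins are untouched.
void toggleWriteOnly(FakePio& pio, uint32_t mask) {
    assert(mask != 0);
    pio.writeSODR(mask);
    pio.writeCODR(mask);
}

// Read-modify-write style: one load and one store per edge.
void toggleRMW(FakePio& pio, uint32_t mask) {
    assert(mask != 0);
    pio.writeODSR(pio.readODSR() | mask);
    pio.writeODSR(pio.readODSR() & ~mask);
}
```

In this model one toggle costs 2 accesses in the write-only style and 4 in the RMW style. One caveat that I believe applies to the SAM3X but have not verified here: direct writes to PIO_ODSR only take effect for pins enabled through PIO_OWER, so check the datasheet before relying on the RMW macros on a Due.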

Seriously. If you want a Square Wave Generator, use a hardware Timer.

David.

No, I don't want to use a square wave generator. I just wanted to know the maximum frequency when using digitalWrite.

Surely I answered that in #5, i.e. about 241kHz (50 full cycles in 207.2us).

You could have tested it yourself, e.g. by writing 100 digitalWrite()s like I did and timing them with micros().
Or write 100k digitalWrite()s and time them with millis().
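The arithmetic for the micros() approach might look like the C++ sketch below. It assumes micros() returns a 32-bit unsigned count of microseconds, and nsPerCall is a hypothetical helper name. On the board you would read micros() into startUs, run the digitalWrite() calls back to back, read micros() into endUs, and pass both in.

```cpp
#include <stdint.h>
#include <cassert>

// Average time per call in nanoseconds, from two micros() readings taken
// before and after 'calls' back-to-back calls.
// Unsigned subtraction stays correct if the 32-bit counter wrapped once.
double nsPerCall(uint32_t startUs, uint32_t endUs, uint32_t calls) {
    assert(calls > 0);
    uint32_t elapsedUs = endUs - startUs;
    return (elapsedUs * 1000.0) / calls;
}
```

For example, 100 calls that take 207us in total work out to 2070ns per call. The more calls you time, the less the 1us granularity of the readings matters.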

No need for a Logic Analyser. (which is what I used)

David.

David’s code in the posting above compiles and flashes, but does not toggle digital pin 8 on my Due.
Changing the Blink example sketch to write to the same pin works (the logic analyzer shows the signal).

edmundsj’s code from the initial posting in this thread works on my Due and shows a 47.5ns period (21.05MHz) on my 400Msps logic analyzer. Because of “void loop()”, the delay between bursts is longer than 2us. Looping with “goto” inside setup() reduces the time between bursts to 117.5ns:

void setup() {
  pinMode(53, OUTPUT);
loop:
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  
goto loop; // 117.5ns
}

void loop() {
}

Surely the goto is no faster than a while (1) loop...

My example used D8 which is PORTC.22 on a Due.
I am sure that I measured on a Due.

Your example uses D53 which is PORTB.14 on a Due.

I was just trying to show the use of the PIOx -> PIO_SODR instructions.

I was not concerned with how C++ implements a while(1), for(;;) or goto loop.
I would expect them to generate exactly the same code.

Seriously, if you want to generate perfect pulse sequences you would use the Due hardware.

Incidentally, I find a cheap Saleae clone LA works very well. But obviously has a limited sampling frequency. When debugging, I can just reduce the CPU or Peripheral speed to get my logic correct. Then test at full speed.

It would be interesting to hear your views on a "faster" Logic Analyser.

David.

westfw:
Surely the goto is no faster than a while (1) loop…

The logic analyzer says otherwise.
You have seen the 117.5ns delay between bursts in the previous screenshot with “goto”.
Now using edmundsj’s code with the burst pattern inside loop() results in a 2.285us delay (factor 19.4!):

void setup() {
  pinMode(53, OUTPUT);
}

void loop() {
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
  PIOB -> PIO_SODR = 1 << 14;
  PIOB -> PIO_CODR = 1 << 14;
}

You can examine the LSS file if you really want. Or even control the Due via SWD/JTAG in AS7.0

There is oodles of Flash memory in a Due. So you could inline several hundred PIO_SODR instructions.
Eventually you will want to start again e.g. with while(1), goto, ...

If you want a glitch-free sequence you use the Timer hardware. Just like you would on any microcontroller.

David.

david_prentice:
It would be interesting to hear your views on a “faster” Logic Analyser.

I own three logic analyzers and have no experience with others.

I started years ago with a 5$ Saleae clone that does 24Msps:
https://www.aliexpress.com/wholesale?SearchText=logic%2Banalyzer%2B24M

In 8/2016 I bought the next one, 100Msps, for 24$ at that time (35$ today).

Finally in 12/2016 I ordered 400Msps logic analyzer for 68$, similar price today:
https://www.aliexpress.com/wholesale?SearchText=400MHz+logic+analyzer

Besides 400 vs. 100 Msps, the cable quality is different. Each channel cable of the 400Msps logic analyzer is a thicker cable containing both GND and signal; only at the end does it split into two wires. That helps avoid electrical noise when measuring high-frequency signals.

You can capture up to 4 channels at 400Msps, and since it captures to internal RAM, it can capture 0.25s in total at 400Msps. That was a problem until I learned to select a rising or falling edge trigger on one of the channels. The capture then waits until the trigger happens and records 0.25 seconds including the trigger.

I really like the 2.5ns resolution the 400Msps analyzer provides – light does travel only 75cm in that time :wink:

Years ago I built the DSView software for the 400Msps analyzer on my wife's Ubuntu laptop, but that laptop has been dead for some time. I was not able to build it under RHEL due to Qt library issues. Compiling the software according to the instructions worked easily on a Raspberry Pi, though, and since then I have used DSView on the Raspberry Pi only:
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=270197&p=1640608&hilit=DSView+compile#p1640608

I did buy a 26$ in total 9" 1024x600 HDMI display for my PIs, and created my 2nd WoodenBoardPi
https://www.raspberrypi.org/forums/viewtopic.php?f=45&t=254059&p=1739924#p1717730
from that. That is more than good enough for portable logic analyzer with Anker power bank:
https://www.raspberrypi.org/forums/viewtopic.php?f=45&t=222525&p=1721926#p1721926

Last, Raspberry pigpio library nanoplse.c example is really able to create nanosecond resolution pulses, starting with 4ns (3ns for Pi4B). I used the 400Msps logic analyzer to verify that (values get rounded to next multiple of 2.5ns with the analyzer):
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=270197&p=1640608&hilit=DSView+compile#p1641091

Summary:
As you have seen, I am a very big fan of the 400Msps logic analyzer, but the 100Msps analyzer is good as well for less money. The 5$ analyzer works fine, but at 24Msps the 41.66ns steps between measurements were too often not sufficient for what I wanted to measure.
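The resolution figures follow directly from the sample rate: the time between samples is the reciprocal of the rate. A tiny C++ sketch, with a function name chosen just for illustration:

```cpp
#include <cassert>

// Time between two samples in nanoseconds for a rate given in Msps.
double samplePeriodNs(double megaSamplesPerSecond) {
    assert(megaSamplesPerSecond > 0.0);
    return 1000.0 / megaSamplesPerSecond;
}
```

That gives 2.5ns at 400Msps, 10ns at 100Msps and a little under 42ns at 24Msps.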

Your 400Msps link does not point to a specific device.

The 100Msps seems to do 3 channels at 100MHz.
The 400Msps seems to do 4 channels at 400MHz.

I have never wanted more than 8 channels. I did have a 4 channel LA that was woefully inadequate.

In real life, which device do you go to first?

David.