Mega vs Due execution time comparison

Hello,

I have code that I run on a Mega using DDRx and PORTx manipulation commands. It works fine and does everything I have in there.

Recently, I got a Due (84MHz internal clock), thinking that it might run faster than the Mega at 16MHz. So I went ahead and wrote the simple code below.
On Mega
void setup() {
  // put your setup code here, to run once:
  pinMode(23, OUTPUT);
  pinMode(24, OUTPUT);
  DDRA = 0xFF;
  PORTA = 0x00;
}

void loop() {
  // put your main code here, to run repeatedly:
  PORTA = 0x2;
  PORTA = 0x4;
  delayMicroseconds(10);
  PORTA = 0;
  delayMicroseconds(10);
}

and on Due
void setup() {
  pinMode(23, OUTPUT);
  pinMode(24, OUTPUT);
  PIOA->PIO_OWER |= 0xC000;
  PIOA->PIO_ODSR = 0;
}
void loop() {
  PIOA->PIO_ODSR = 0x4000;
  PIOA->PIO_ODSR = 0x8000;
  delayMicroseconds(10);
  PIOA->PIO_ODSR = 0;
  delayMicroseconds(10);
}

The results are very different.
On the Mega, the positive pulse width on pin 23 is 62.5ns vs 59.5ns on the Due, which should give an idea of how single-command execution times compare. Not much faster on the Due.
The positive pulse width on pin 24 is strange on the Mega: for a 10us delay it gives 8.817us, while the Due is always almost exactly the delay I request, in this case 10.05us (if I use 1us, it gives 1.05us).
The total loop time for both
Mega - 17.95us
Due - 22.36us (60ns*5 + 10us*2 + loop restart time)
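
To put the two pin 23 pulse widths in terms of clock cycles, here is a rough back-of-the-envelope sketch in plain C++ (meant for a desktop compiler, not for the boards). The helper name cyclesForPulse is made up for illustration, and the clock rates are assumed to be the nominal 16MHz and 84MHz mentioned above:

```cpp
#include <cstdint>

// Hypothetical helper: convert a measured pulse width (in picoseconds)
// into CPU clock cycles at the given clock rate, rounded to nearest.
constexpr std::uint64_t cyclesForPulse(std::uint64_t pulse_ps, std::uint64_t cpu_hz) {
    return (pulse_ps * cpu_hz + 500000000000ULL) / 1000000000000ULL;
}

// Mega: 62.5ns at 16MHz is exactly one clock cycle.
static_assert(cyclesForPulse(62500, 16000000) == 1, "Mega pulse is 1 cycle");
// Due: 59.5ns at 84MHz is about five clock cycles.
static_assert(cyclesForPulse(59500, 84000000) == 5, "Due pulse is ~5 cycles");
```

So the Mega write takes a single cycle, while the Due spends about five of its (shorter) cycles between the two writes.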

One thing I noticed is that on the Mega, delayMicroseconds always ends up giving a shorter delay than requested until I get to millisecond intervals, while on the Due the delay always matches what I pass to delayMicroseconds.

Any thoughts or insights about what I am seeing here?

thanks

I think you'd get more "impressive" results if you take out the delay and run a for-loop that counts up to 10,000 or 1 million (or something) and measure how long it takes (with the millis() timer).

Or, do the opposite and run a while() loop for a certain number of seconds and count the number of loops.

I would expect the micros() timer to be more precise with a faster clock.

Followed your suggestion and ran a for loop as follows and got very interesting results
unsigned long i = 0;
void setup() {
  Serial.begin(9600);
}
void loop() {
  Serial.println("Time: ");
  Serial.println(micros());
  for (i=0; i<10000000; i++);
  Serial.println(micros());
  delay(1000);
}

On the Mega the execution time is 344us, while on the Due it's 715,248us. Unless micros() is not properly defined on the Mega, this is very strange.
[screenshots: due, mega]

Empty loops are frequently optimized away... The AVR code becomes essentially

void loop() {
  Serial.println("Time: ");
  Serial.println(micros());
  i=10000000;
  Serial.println(micros());
  delay(1000);
}

(but the Due compile does not seem to be doing so. It will if I switch from -Os to -O2...)

I haven't tried it, but declaring "i" with the qualifier "volatile" will probably prevent the compiler from optimizing out the for loop as per post #4.
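
A minimal sketch of that idea, written as plain C++ so it can be checked on a desktop compiler. The name spinCount is just illustrative; on the Arduino you would take micros() before and after the call:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: count up to n through a volatile variable.
// Because i is volatile, the compiler must perform every read and
// write, so it cannot collapse the loop into a single assignment.
std::uint32_t spinCount(std::uint32_t n) {
    volatile std::uint32_t i = 0;
    while (i < n) {
        i = i + 1;  // each pass is a real load, add, and store
    }
    return i;
}
```

The loop still does no useful work, but its run time now scales with n, which is what the timing test needs.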


Thanks for the tips and suggestions. I updated the code by declaring i as volatile, and now I can see a huge difference between Due and Mega speeds, as shown in the screenshots.
[screenshots: due, mega]

Basically the Mega takes 22s to complete the loop while the Due takes 3.7s to do the same. Does this make sense to someone familiar with these boards?
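
As a rough sanity check on those two numbers, the per-iteration cost can be worked out as below. This assumes the rounded 22s and 3.7s figures above, the 10,000,000 iterations from the test loop, and nominal 16MHz/84MHz clocks; cyclesPerIteration and ratioPercent are hypothetical helper names (plain C++, desktop compiler):

```cpp
#include <cstdint>

// Hypothetical helper: average CPU cycles per loop iteration, rounded
// to nearest, from total run time in microseconds.
constexpr std::uint64_t cyclesPerIteration(std::uint64_t total_us,
                                           std::uint64_t iterations,
                                           std::uint64_t cpu_hz) {
    return (total_us * cpu_hz / 1000000ULL + iterations / 2) / iterations;
}

// Hypothetical helper: ratio a/b expressed in percent (truncated).
constexpr std::uint64_t ratioPercent(std::uint64_t a, std::uint64_t b) {
    return a * 100ULL / b;
}

// Mega: 22s for 10,000,000 iterations at 16MHz -> about 35 cycles each.
static_assert(cyclesPerIteration(22000000, 10000000, 16000000) == 35, "Mega");
// Due: 3.7s for 10,000,000 iterations at 84MHz -> about 31 cycles each.
static_assert(cyclesPerIteration(3700000, 10000000, 84000000) == 31, "Due");
// Measured speedup is about 5.9x; the clock ratio alone is 5.25x.
static_assert(ratioPercent(22000000, 3700000) == 594, "speedup");
static_assert(ratioPercent(84000000, 16000000) == 525, "clock ratio");
```

Both boards spend roughly 31 to 35 cycles per iteration of the volatile loop, so the measured speedup is only a little better than the clock ratio by itself.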

In the end, for our code, which is heavily based on PORT output data manipulation and continuous interrupt handling, I am not seeing good speeds on the Due for these commands.

Would like to hear any suggestions to speed up these commands.

thanks

I've duplicated these results. Unfortunately, it makes some sense.
The SAM3x8e (and ARM chips in general) is not really optimized for fast IO.

First of all, while your AVR example ends up compiling to two consecutive OUT instructions (62.5ns each), an ARM generally takes 3 to 4 instructions to do a single output:

00080174 <loop>:
   80174:       4b09            ldr     r3, [pc, #36]  ; get address of port
   80176:       f44f 4280       mov.w   r2, #16384      ; get value to output
   8017a:       639a            str     r2, [r3, #56]   ; output value to register

Second, the IO registers for the PIO are off on a relatively slow "peripheral bus", so each write takes two cycles.

Thirdly, the flash memory in the SAM chip does not operate at the full 84MHz, and has to be configured for 4 "wait states." That means up to 4 additional cycles for each flash read. There is a "flash accelerator" of some kind in there, but the way the code is structured in your original example, with the port manipulation at the very start of a function, it's not going to be very helpful. (If I put the PIO stuff in a separate while loop, I can get the pin 23 on-time down to about 25ns.)
Theoretically, you can get more deterministic behavior by moving code into RAM, but... it seems to be slower than the cached flash version (~50ns). (There's no instruction cache on RAM?)


our code ... is heavily based on PORT output data manipulation and continuous interrupt handling, I am not seeing good speeds on Due for these commands.

Interrupts are also higher-overhead on an ARM (at least, a "minimal interrupt" function on an ARM has a latency of 12 cycles (not including wait states) and similar overhead on exit, vs ~8 cycles on AVR).
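
For scale, those cycle counts can be converted to time. This is only a rough sketch assuming the nominal 84MHz and 16MHz clocks and ignoring flash wait states (so the ARM figure is a best case); cyclesToNs is a made-up helper name, in plain C++:

```cpp
#include <cstdint>

// Hypothetical helper: convert a cycle count to nanoseconds at the
// given clock rate, rounded to nearest.
constexpr std::uint64_t cyclesToNs(std::uint64_t cycles, std::uint64_t cpu_hz) {
    return (cycles * 1000000000ULL + cpu_hz / 2) / cpu_hz;
}

// 12 cycles at 84MHz is about 143ns (before any wait states).
static_assert(cyclesToNs(12, 84000000) == 143, "ARM entry latency");
// 8 cycles at 16MHz is 500ns.
static_assert(cyclesToNs(8, 16000000) == 500, "AVR entry latency");
```

So the overhead is higher in cycles, but the faster clock can still leave the ARM ahead in absolute time, depending on how much the wait states add.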

You might want to look at one of the newer ARM chips (after vendors realized that embedded programmers really like their fast IO.) A SAMD51 (Adafruit "Grand Central") or RP2040 has a special "fast IOBUS", tightly coupled RAM, and/or "real" cache, and a faster clock rate as well.


Thanks westfw for taking the time to replicate my results. I will stop spending more time on the Due, do some reading on the Grand Central, and possibly get one to try out.

I tried the simple case on a Grand Central and got about 17ns...
(However, I was wrong about it having a fast IOBUS. That's on the SAMD21 and rp2040, but not on the SAMD51. I think I've had that discussion before: the fast IOBUS and actual cache memory don't get along...)

#define SerialUSB Serial
void setup() {
  pinMode(53, OUTPUT);
  SerialUSB.begin(9600);
  while (!SerialUSB)
    ;
  delay(1000);
  SerialUSB.print("Configured flash wait states: ");
  SerialUSB.println(NVMCTRL->CTRLA.reg, HEX);
}
#define PORTD (PORT->Group[3])

//__attribute__ ((section(".ramfunc")))
void myloop() {
  while (1) {
    PORTD.OUT.reg = 1 << 10;
    PORTD.OUT.reg = 1 << 11;
    delayMicroseconds(1);
    PORTD.OUT.reg = 0;
    delayMicroseconds(1);
  }
}

void loop() {
  myloop();
}

rp2040 (on a RPi Pico) is also about 17ns.

#define SerialUSB Serial
void setup() {
  pinMode(9, OUTPUT);
  SerialUSB.begin(9600);
  while (!SerialUSB)
    ;
  delay(1000);
  SerialUSB.println("Starting");
}

//__attribute__ ((section(".ramfunc")))
void myloop() {
  while (1) {
    sio_hw->gpio_out = 1 << 9;
    sio_hw->gpio_out = 1 << 11;
    delayMicroseconds(1);
    sio_hw->gpio_out = 0;
    delayMicroseconds(1);
  }
}

void loop() {
  myloop();
}

Thanks again westfw. It looks like both of these are much faster than the Mega. Which one would you prefer to try out if you had none of these boards on hand?

I kinda like the Grand Central for its board layout, which can be easily adapted to the add-on board we use with the Mega.

I think I have a preference for the SAMD51. It's a more "traditional" chip, I like the CM4 better than the CM0 CPU, it's got that hardware floating point, and the development/runtime environment is "shallower" (rp2040 Arduino support has mBed on top of the Pico SDK.)

OTOH, the rp2040's extra features (dual core, PIO "io coprocessor") might be exactly what you need...

Great, thanks for the feedback. The only thing that's keeping me away from the RP2040 is the number of standalone GPIO pins; there aren't enough for our project.

I feel compelled to note that, sticking to AVRs, you could go to an AVR128DA or AVR128DB (available with pin counts of 28, 32, 48, and 64 (54 usable I/O pins)). They are spec'ed to run at 24 MHz. If you're only flipping single bits at a time in the port registers, using the VPORTx.OUT register on the new parts is single cycle (SBI and CBI are single cycle on post-2016 AVRxt, or "modern AVRs"!). If you're setting the whole thing, though, and doing direct assignment as shown in that example, setting a register in the I/O space to 0 is always going to be 1 cycle (since it can write r1, the known zero). Depending on how clever the compiler is with the loop, the others would be either 1 or 2 clocks (it ought to be putting the two LDIs outside of the loop, so it just needs to OUT them); that's the same on any AVR.
BTW, is that what you wanted? Your code would produce a 1 system-clock pulse on PA1, then a ~10us pulse on PA2, then 10us with nothing. The 1/16th us pulse seems an unusual thing to want.

delayMicroseconds also doesn't always run in the correct length of time on official Arduino cores; it never has. This is corrected in DxCore (which supports the AVR DA and DB, and will support the DD-series upon its release). It's better for the case of a compile-time-unknown delay (the default core does not prevent delayMicroseconds from being inlined, but relies on the call overhead to make an accurate delay!). If the delay is known at compile time, my cores instead use the avr-libc _delay_us(), which gives exact delays as long as the delay is compile-time known (and falls to pieces if it's not, which is why we don't use it in that case, and resort to something much like what the official core does, only we don't inline it).

If you want even more speed, the DA and DB parts can be run at 32 MHz at room temperature without issue (never met a part that couldn't, at least), even from the internal oscillator (and they run at the same speed over the whole voltage range; apparently the core runs at 1.8V or something supplied by an internal regulator). I've been able to get apparently correct behavior on an AVR128DB64 E-spec (extended temperature range) part with a 48 MHz external clock and a 40 MHz external crystal at room temperature. The 48 MHz clock didn't work on a normal I-spec one, though. I think the 40 MHz crystal did. These are of course sample sizes of 1 currently; process variation will result in some parts being more or less capable of such speeds. For example, I have one (just one out of many tested) tinyAVR 1-series part that will run at 32 MHz off an aggressively tuned internal oscillator at the top end of the operating voltage range and room temp, at least well enough to pass my cursory checks. All the 2-series parts I've tested have done 32 MHz from the tuned internal oscillator at 5V/room temp.


Thanks DrAzzy for your detailed reply. The above sketch is just a simple test script to test the IO command execution delay. In our actual sketch we use about 36 I/Os (of which 2 are interrupt inputs) plus SPI pins (MOSI/SCLK/MISO), and 20 of those I/Os must have low latency (including the interrupts).