Arduino DUE assembly language clock cycles

So I have an application that requires rapid reading of port C. I need 320 samples of that 32-bit port as fast as possible. The relevant bits of code are:

int dataX[321], *p = dataX, t1, t2;
t1 = micros();
#include "extra.c"
t2 = micros();

extra.c was generated by a BASH shell script and it contains:

*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
.....
.....
.....
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;

So, looking at the generated assembly language using arm-objdump -S, I see these:

*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #0]
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #4]
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #8]
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #12]
....etc....

So far so good. I run this code (320 iterations of load port C / store in integer array) and it consistently takes 25 microseconds according to the difference in calls to micros() made by t2 minus t1.

In 25 µs at 84 MHz, there are 25*84 = 2100 clock cycles. This means each LDR/STR above is taking on average 6.5 clock cycles. This is the source of my confusion.
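For anyone checking the arithmetic, here is a minimal sketch of the calculation as a plain C helper. The function name cycles_per_sample is made up for illustration, and it assumes the elapsed time from micros() is exact, which it is not (it only reports whole microseconds).

```c
#include <stdint.h>

/* Average CPU cycles per sample, given the elapsed time in microseconds,
   the core clock in MHz, and the number of samples taken.
   Hypothetical helper, not part of the Arduino core. */
double cycles_per_sample(uint32_t elapsed_us, uint32_t cpu_mhz, uint32_t samples)
{
    /* One microsecond at N MHz is N clock cycles. */
    uint32_t total_cycles = elapsed_us * cpu_mhz;
    return (double)total_cycles / (double)samples;
}
```

With the numbers above, cycles_per_sample(25, 84, 320) works out to 2100 / 320 = 6.5625 cycles for each load/store pair.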

Are the LDR and STR not single clock cycle instructions? Am I looking at the effect of interrupts adding delay that is unaccounted for?

Any help understanding this would be appreciated.

You may be saturating one of the internal busses - worth delving around the datasheet to see
how fast the data can be fed from the PIO unit to the CPU.

I think you're on to something with that. I have been poring over:

http://www.atmel.com/images/atmel-11057-32-bit-cortex-m3-microcontroller-sam3x-sam3a_datasheet.pdf

Specifically, section 31.5.9 Input Glitch and Debouncing Filters. Even if PIO_IFSR is zero for my GPIOs of interest, it looks like it could take 1.5 to 2 MCLK cycles for the data to be valid in PIO_PDSR.

Still looking for a reference to internal bus bandwidth as well.

There will ALWAYS be at least two clocks of delay on any input signal used to trigger an interrupt, or any other internal operation, to synchronize the signal to the internal clock. Without that synchronization, interrupts would be very unreliable.

Regards,
Ray L.

This means each LDR/STR above is taking on average 6.5 clock cycles.

Total for both instructions? That doesn't sound too unlikely:

  1. The SAM3X has a bus matrix with multiple masters (like the CPU or the DMA controller) and slaves (like the PIOs). Contention can occur.
  2. I think the PIO ports are on the "low speed" peripheral bus, so reads or writes to them take more than one cycle.
  3. The flash memory doesn't run at full speed either, and can introduce wait states. However, it usually has some sort of acceleration technology (wide access, prefetches, etc.), so calculating exactly when wait states are added is difficult. (You can try copying your code to RAM to see if it runs faster.)
  4. It's not clear that "str r2, [r4, #N]" is always the same instruction; IIRC you go from a thumb (16bit) instruction to a thumb2 (32bit) instruction depending on the value of N. There are auto-increment and doubly indexed versions of "str", but I think they're all thumb2, so I don't know whether that would help overall. It might be important if you need your samples to be evenly spaced.
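To make point 4 concrete, here is a rough sketch in C of how the encoding size of "str Rt, [Rn, #offset]" depends on the offset. It is a simplification based on the Thumb/Thumb-2 immediate-offset encodings: the 16-bit form needs both registers in r0-r7 and a word-aligned offset of at most 124, and the 32-bit form takes offsets up to 4095. The function name is made up, the SP-relative 16-bit form is ignored, and it assumes the compiler always picks the narrowest encoding, which it may not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size in bytes of the narrowest plain immediate-offset encoding of
   "str Rt, [Rn, #offset]" (no writeback).
   Assumptions: Rn is not SP or PC, and the SP-relative 16-bit form is
   not modelled. Returns 0 if the offset fits neither form.
   Hypothetical helper for illustration only. */
int str_imm_encoding_bytes(unsigned rt, unsigned rn, uint32_t offset)
{
    bool low_regs = (rt <= 7) && (rn <= 7);

    /* 16-bit Thumb form: 5-bit immediate scaled by 4, low registers only. */
    if (low_regs && (offset % 4 == 0) && offset <= 124)
        return 2;

    /* 32-bit Thumb-2 form: 12-bit unsigned byte offset. */
    if (offset <= 4095)
        return 4;

    return 0;
}
```

Under those assumptions, with r2 and r4 as in the listing, only the stores at offsets 0 through 124 (the first 32 samples) would be 16-bit; from offset 128 up to 1276 (the 320th sample) they would be 32-bit, so the instruction stream would not be uniform across the whole run.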