So I have an application that requires rapid reading of port C. I need 320 samples of that 32-bit port as fast as possible. The relevant bits of code are:
int dataX[321], *p = dataX, t1, t2;
t1 = micros();
#include "extra.c"
t2 = micros();
extra.c was generated by a Bash shell script and contains 320 copies of this line:
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
.....
.....
.....
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
*p++ = REG_PIOC_PDSR;
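For context, here is roughly how the whole thing is wired together. This is a simplified sketch, not the exact sketch I'm running: I've assumed it all lives in setup() on the Due, switched the timestamps to unsigned long (which is what micros() returns), and the Serial output is only there for reporting.

#include <Arduino.h>        // Due core: provides micros() and REG_PIOC_PDSR

uint32_t dataX[321];        // room for the 320 samples (plus a spare slot)

void setup() {
  Serial.begin(115200);     // reporting only

  uint32_t *p = dataX;
  unsigned long t1, t2;     // micros() returns unsigned long

  t1 = micros();
  #include "extra.c"        // pastes in the 320 lines of  *p++ = REG_PIOC_PDSR;
  t2 = micros();

  Serial.print("elapsed us: ");
  Serial.println(t2 - t1);
}

void loop() { }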
So, looking at the generated assembly with arm-objdump -S, I see this:
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #0]
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #4]
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #8]
*p++ = REG_PIOC_PDSR;
ldr r2, [r3, #0]
str r2, [r4, #12]
....etc....
So far so good. I run this code (320 load-port-C / store-to-array pairs) and it consistently takes 25 microseconds, according to the difference t2 - t1 between the two micros() calls.
In 25 µs at 84 MHz there are 25 * 84 = 2100 clock cycles, so each LDR/STR pair above is taking on average 2100 / 320 ≈ 6.5 clock cycles. This is the source of my confusion.
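In case micros() resolution or rounding is muddying the numbers, I'm thinking of repeating the measurement with the Cortex-M3 DWT cycle counter instead of micros(). Here is a sketch of the timed region, dropped into setup() in place of the micros() calls; it assumes the DWT cycle counter is implemented on the SAM3X8E (I believe it is) and uses the standard CMSIS register names:

// One-time setup: enable the DWT cycle counter (CMSIS names, Cortex-M3).
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // turn on the trace/debug block
DWT->CYCCNT = 0;                                  // zero the cycle counter
DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             // start it counting

uint32_t *p = dataX;
uint32_t c1 = DWT->CYCCNT;
#include "extra.c"                                // the same 320 unrolled reads/stores
uint32_t c2 = DWT->CYCCNT;

Serial.print("cycles: ");
Serial.println(c2 - c1);                          // ~2100 would agree with the 25 us figure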
Are the LDR and STR not single-clock-cycle instructions? Or am I looking at interrupts adding delay that the measurement can't account for?
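To test the interrupt theory directly, I suppose I can mask interrupts around the measured region, something like the sketch below (over a ~25 µs window the Due's millis/micros timekeeping shouldn't be disturbed):

uint32_t *p = dataX;
unsigned long t1, t2;

noInterrupts();            // mask IRQs so SysTick/USB handlers can't run mid-measurement
t1 = micros();
#include "extra.c"         // the unrolled 320 reads/stores
t2 = micros();
interrupts();              // re-enable straight away

Serial.println(t2 - t1);   // if this still reads ~25 us, interrupts aren't the explanation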
Any help understanding this would be appreciated.