Disappointing execution times for Due

Hi,
I want to use a Due for a hard real-time job (an oscilloscope) driven by interrupts, so I ran some speed tests with the following setup:
• On Pin7 (TICPIN) an interrupt pulse is generated
• Pin7 is connected to Pin4
• On Pin6 the interrupt handler generates a pulse
• Pulses are measured with an oscilloscope

volatile uint16_t cnt_hdl=0;
volatile uint32_t tim=0;

void Pin4Intr() {
  REG_PIOC_SODR = 0x01000000;  // set D6
  cnt_hdl++;
  REG_PIOC_CODR = 0x01000000;  // clear D6
}
//
void setup() {
  //
  Serial.begin(115200);

  REG_PIOC_PER = 0x00800000;   // assign TICPIN (7, PC23) to peripheral IO controller PIOC
  REG_PIOC_OER = 0x00800000;   // set TICPIN for output

  REG_PIOC_PER = 0x01000000;   // assign D6 (PC24) to peripheral IO controller PIOC
  REG_PIOC_OER = 0x01000000;   // set D6 for output

  // Test interrupt timing:
  // Pin7 generates the interrupt signal, which is wired to Pin4
  attachInterrupt(digitalPinToInterrupt(4), Pin4Intr, RISING);
  while (1) {
    REG_PIOC_SODR = 0x00800000;  // set TICPIN -> Intr.
    // here interrupt should fire
    // tim++;
    REG_PIOC_CODR = 0x00800000;  // clear TICPIN
    tim += 10;
  }
}

void loop() {}  // never reached: setup() spins forever in the test loop

The following figure shows the signals on an oscilloscope while interrupts are firing. Channel 2 (yellow) is the signal on Pin7, channel 1 (red) is the signal on Pin6 (from the interrupt handler Pin4Intr).

  1. It takes 912 ns from the rising edge of the interrupt signal until the code of the interrupt handler (pulse on Pin6) is executed.
  2. The time between two interrupt pulses is 2500 ns.
  3. Executing the code of the interrupt handler takes 176 ns (two register writes and one variable increment).

[Figure: TestMitIntr, oscilloscope capture with interrupts enabled]

With interrupts deactivated, the following figure shows the time between the pulses on Pin7 for three different code sequences inside the while loop; the code for each case is shown beneath the images.

[Figure: TestOhneIntr, oscilloscope captures with interrupts disabled]

Observations (one clock period of a Due with 84 MHz is 12 ns):
• entering and leaving the interrupt handler takes 2016 ns (2500 – 308 – 176) corresponding to 169 clock periods :unamused:
• incrementing a variable takes 155 ns (pulse widths of cases 1 and 2) corresponding to 13 clock periods :unamused:
• adding a constant takes 48 ns (pulse widths of cases 2 and 3) corresponding to 4 clock periods
• execution of very short interrupt handler code takes 176 ns corresponding to 15 clock periods :unamused:
• changing the sequence of statements changes execution time (pulse distances of case 2 and 3) :unamused:
• overhead of a while loop takes 200 ns (pulse distance – pulse width in case 3) corresponding to 17 clock periods :unamused:

With a Due I expected to have a speedy controller. The measurement results are disappointing, especially the long interrupt and while-loop overheads. I also don't understand how incrementing a variable can take 13 clock periods.

Is there a possibility to speed up the code?

The Arduino Due runs at 84 MHz, which means each clock cycle is approximately 11.9 ns.
176 ns is then 176/11.9 ≈ 15 clock cycles, and the two pin writes are likely 1 cycle each, so that leaves about 13 for the increment.
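
If you want to redo that arithmetic for the other scope readings, here is a minimal sketch. It assumes the 84 MHz clock quoted above; the names kCpuHz, nsToCycles and cyclesToNs are made up for illustration and are not part of any Arduino API.

```cpp
#include <cassert>

// Assumption: the core runs at 84 MHz, as stated in the thread.
constexpr double kCpuHz = 84e6;

// Convert a duration measured on the scope (in ns) into CPU clock cycles.
constexpr double nsToCycles(double ns) { return ns * kCpuHz / 1e9; }

// Convert a cycle count back into nanoseconds.
constexpr double cyclesToNs(double cycles) { return cycles * 1e9 / kCpuHz; }
```

Calling nsToCycles(176.0) gives a little under 15 cycles, which matches the estimate above. The results are fractional because scope readings rarely land on exact multiples of the clock period.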

Your variable is volatile, so cnt_hdl++ actually involves

  • fetching cnt_hdl's 2 bytes (it is a uint16_t) from RAM
  • actually incrementing
  • storing the result in RAM

That could take from 4 to 7 cycles, so it does not account for the 13 you see.

Can you try to see if a wrong alignment (not likely) could cause a pipeline stall and define your variable as

volatile uint32_t cnt_hdl __attribute__((aligned(4))) = 0;

Looking at the actual assembly code generated for the ISR could also help in understanding this. There might be additional issues as well, such as interrupt handling causing pipeline flushes or stalls... I don't know enough about the architecture.

You really need to look at the (assembly) code produced to analyze this sort of thing.
Fetching the variable includes loading the variable's address into a register. Probably at least two cycles.
Fetching the value from that address: at least two more cycles.
Probably the increment of the register is only one cycle.
Storing is two cycles.
(Except that flash memory doesn't run at 84 MHz, so add wait states for each fetch from flash, including the instruction fetches. There is some sort of cache or "flash accelerator", but it's hard to say exactly how it works, especially in an ISR rather than sequential execution.)

Just out of curiosity I did a similar test with an ESP32-C3 (160 MHz, RISC-V) and told PlatformIO to produce the asm code.
Focusing just on the interrupt function:

volatile uint16_t cnt_hdl = 0;
void IRAM_ATTR pin_tgl_isr() {
  REG_WRITE(GPIO_OUT_W1TS_REG, BIT9);
  cnt_hdl++;
  REG_WRITE(GPIO_OUT_W1TC_REG, BIT9);
}

Without the cnt_hdl++ instruction the pulse width is 50ns in the oscilloscope.
With the cnt_hdl++ instruction it takes 100ns.

The asm code:

# src/main.cpp:12:   REG_WRITE(GPIO_OUT_W1TS_REG, BIT9);
	li	a3,1610629120		# tmp75,
	li	a2,512		# tmp76,
# src/main.cpp:13:   cnt_hdl++;
	lui	a4,%hi(.LANCHOR0)	# tmp78,
# src/main.cpp:12:   REG_WRITE(GPIO_OUT_W1TS_REG, BIT9);
	sw	a2,8(a3)	# tmp76, MEM[(volatile uint32_t *)1610629128B]
# src/main.cpp:13:   cnt_hdl++;
	addi	a4,a4,%lo(.LANCHOR0)	# tmp77, tmp78,
	lhu	a5,0(a4)	#, cnt_hdl
	addi	a5,a5,1	#, tmp80, cnt_hdl
	slli	a5,a5,16	#, _2, tmp80
	srli	a5,a5,16	#, _2, _2
	sh	a5,0(a4)	# _2, cnt_hdl
# src/main.cpp:14:   REG_WRITE(GPIO_OUT_W1TC_REG, BIT9);
	sw	a2,12(a3)	# tmp76, MEM[(volatile uint32_t *)1610629132B]

So, it seems that the instruction cnt_hdl++ is doing quite a few things.
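
Part of that extra work comes from the variable being only 16 bits wide while the registers are 32 bits wide. Here is a minimal sketch of what the increment has to compute, written as plain C++ so it can be tried on a PC. The function names increment16 and narrowByShifts are hypothetical, chosen just for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// What the compiler must compute for `cnt++` on a uint16_t:
// the operand is promoted to int, incremented, then narrowed to 16 bits.
inline uint16_t increment16(uint16_t cnt) {
  int widened = cnt;                      // integer promotion to register width
  widened = widened + 1;                  // the actual add
  return static_cast<uint16_t>(widened);  // narrowing: keeps only the low 16 bits
}

// The same narrowing written as the shift pair seen in the RISC-V listing
// (slli/srli by 16): shift the upper half out, then shift zeros back in.
inline uint32_t narrowByShifts(uint32_t value) {
  return (value << 16) >> 16;
}
```

With a uint32_t counter the narrowing step has nothing to do, so the compiler would presumably not need to emit the shift pair at all.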

But, if I remove the volatile modifier of the variable declaration, then the cost of the instruction is only about 14ns, instead of 50ns.
The asm code:

# src/main.cpp:13:   cnt_hdl++;
	lui	a5,%hi(.LANCHOR0)	# tmp78,
	addi	a5,a5,%lo(.LANCHOR0)	# tmp77, tmp78,
	lhu	a4,0(a5)	#, cnt_hdl
# src/main.cpp:12:   REG_WRITE(GPIO_OUT_W1TS_REG, BIT9);
	li	a3,1610629120		# tmp75,
	li	a2,512		# tmp76,
	sw	a2,8(a3)	# tmp76, MEM[(volatile uint32_t *)1610629128B]
# src/main.cpp:13:   cnt_hdl++;
	addi	a4,a4,1	#, tmp82, cnt_hdl
	sh	a4,0(a5)	# tmp82, cnt_hdl
# src/main.cpp:14:   REG_WRITE(GPIO_OUT_W1TC_REG, BIT9);
	sw	a2,12(a3)	# tmp76, MEM[(volatile uint32_t *)1610629132B]

And the time between pulses, so from one interrupt to the next, is about 2300ns.

As already suggested, look at the assembly code to really see what happens. With microcontrollers though, if you need more speed, it's generally very inexpensive to simply buy it, rather than pulling your hair out trying to optimize for possibly small gains.

The Due is certainly fast compared to the earliest Arduinos but there's plenty of higher clocked hardware around - look at the Teensy range for example.

Unless there is an actual increment instruction, BOTH operands must be fetched from memory and stored in separate registers before the increment can be executed.

Thank you all for your explanations and suggestions.
But now I got some more questions:

  1. The Due controller is an ARM type, i.e. RISC architecture. Does it really take 2 cycles to access memory?

  2. The interrupt overhead (entering and leaving the handler) of an UNO or MEGA2560, clocked at 16 MHz, is about 5 us. The Due is clocked 5 times faster, so I expected an overhead of about 1 us. My measurements show 2 us. Is the architecture of the Due worse?

  3. If the flash is the bottleneck does it really help to buy a controller with higher clock rate?

  4. Finally, how can I get an assembler listing for my code with Arduino IDE 1.8.19?

Yes. If my memory serves me well, there is no instruction that says "load this 32-bit register with the 4 consecutive bytes at this RAM address". Instead you load the address into a register and then load another register with the content pointed to by the first one. If the compiler is smart (it is), it will notice you need the same address to write the incremented value, so it won't throw that register away.

You need to save the context during the ISR. On more complex architectures the context is richer, so there is more stuff to stack up.

If the compiler is smart (it is) ...

From a smart compiler I would expect that it checks what registers are really used within the interrupt handler and push only the used ones on the stack.
Is the compiler so smart?

They could be smarter. They usually play it safe and save a standard set of registers in interrupt handlers to ensure compliance with the ABI, which dictates which registers must be preserved across function calls to maintain program stability. It's also due to the complexity of making that decision and the risk of violating ABI conventions, especially with nested interrupts or re-entrant code, if I remember well some explanations I read long ago.

Apparently: Documentation – Arm Developer (But see also the footnote: "Neighboring load and store single instructions can pipeline their address and data phases. This enables these instructions to complete in a single execution cycle.")
And to load a particular variable (given the address) into a register takes two instructions and two actual memory accesses.
There are additional complications in that memory is usually behind "bus controllers" that can have contention (ie from DMA) and introduce additional latency.
Determining timing of a particular C statement, even given the assembly listing, is non-trivial. (but there are cycle counters, and there's always the trusty scope.)
Trying to write code with deterministic timing is ... frustrating.
(peripherals can be behind SEVERAL layers of bus controllers, perhaps clocked at lower speeds, so they're even slower to access.)

the ARM hardware pushes a bunch of context (8 words) on the stack, compared to just the PC on an AVR. On a good day, this means the ISR does not have to save any additional context. But it does mean more latency than most 8bit cpus, in terms of cycle count (it takes 12 cycles, if all memory is zero-wait state.) (12 cycles for return from ISR, too.)

This is actually pretty comparable at the C level, since the AVR compiler always saves some additional context in the ISR, even if it doesn't actually need to.

(Hmm. 5us for an AVR sounds higher than I'd expect. Is that using the "attachInterrupt()" Arduino code (which ... is pretty awful.))
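
To compare the two boards in cycles rather than microseconds, here is a rough back-of-the-envelope sketch. All the figures are the ones quoted in this thread (not datasheet values), and the names cyclesAt and dueSoftwareCycles are invented for this example.

```cpp
#include <cassert>

// Convert a measured time (ns) into clock cycles for a given clock (MHz).
constexpr double cyclesAt(double ns, double clockMHz) {
  return ns * clockMHz / 1000.0;
}

// Figures quoted in the thread (assumptions, not datasheet values):
constexpr double kDueOverheadNs = 2016.0;  // measured enter + leave on the Due
constexpr double kAvrOverheadNs = 5000.0;  // "about 5 us" quoted for UNO/MEGA2560
constexpr double kHwEntryExitCycles = 12.0 + 12.0;  // stacking + unstacking, zero wait states

// Cycles left over for the attachInterrupt() dispatcher, flash wait states, etc.
constexpr double dueSoftwareCycles() {
  return cyclesAt(kDueOverheadNs, 84.0) - kHwEntryExitCycles;
}
```

With these numbers the AVR figure works out to 80 cycles and the Due figure to roughly 169, of which only 24 are the hardware stacking and unstacking. If the measurement is right, most of the overhead is therefore software and memory latency rather than the exception mechanism itself.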

  • If the flash is the bottleneck does it really help to buy a controller with higher clock rate?

I dunno. RISC architectures are predicated on ideas like "accessing memory is uncommon; most operations will be done on registers (so we give you a lot of registers!)", and "the instruction fetch path, at least, will be significantly accelerated by caches." I've often had the thought that RISC architectures move a lot of the complexity that used to be in the CPU into the memory/cache system.

  • Finally, how can I get an assembler listing for my code with Arduino IDE 1.8.19?

Find the "objdump" utility, something like:

c:\Users\westf\AppData\Local\Arduino15\packages\arduino\tools\arm-none-eabi-gcc\4.8.3-2014q1\arm-none-eabi\bin\objdump -SC due_speed.ino.elf

(make a symlink, or move a copy into your path...)

Ps: the Due attachInterrupt() is pretty ugly too. The PIO interrupt is one per port, so it has to decode which pin in the port caused the interrupt. That means saving an extra 4 registers and doing a bit of work... (this disassembly does not include the user-provided ISR)

void PIOA_Handler(void) {
   80920:	b538      	push	{r3, r4, r5, lr}
  uint32_t isr = PIOA->PIO_ISR;
   80922:	4b0b      	ldr	r3, [pc, #44]	; (80950 <PIOA_Handler+0x30>)
   80924:	6cdc      	ldr	r4, [r3, #76]	; 0x4c
 */
__attribute__( ( always_inline ) ) static __INLINE uint8_t __CLZ(uint32_t value)
{
  uint8_t result;
  
  __ASM volatile ("clz %0, %1" : "=r" (result) : "r" (value) );
   80926:	fab4 f384 	clz	r3, r4
   8092a:	b2db      	uxtb	r3, r3
  uint8_t leading_zeros;
  while((leading_zeros=__CLZ(isr))<32)
   8092c:	2b1f      	cmp	r3, #31
   8092e:	d80d      	bhi.n	8094c <PIOA_Handler+0x2c>
  {
    uint8_t pin=32-leading_zeros-1;
   80930:	f1c3 031f 	rsb	r3, r3, #31
    if(callbacksPioA[pin]) callbacksPioA[pin]();
   80934:	b2dd      	uxtb	r5, r3
   80936:	4b07      	ldr	r3, [pc, #28]	; (80954 <PIOA_Handler+0x34>)
   80938:	f853 3025 	ldr.w	r3, [r3, r5, lsl #2]
   8093c:	b103      	cbz	r3, 80940 <PIOA_Handler+0x20>
   8093e:	4798      	blx	r3
    isr=isr&(~(1<<pin));
   80940:	2301      	movs	r3, #1
   80942:	fa03 f505 	lsl.w	r5, r3, r5
   80946:	ea24 0405 	bic.w	r4, r4, r5
   8094a:	e7ec      	b.n	80926 <PIOA_Handler+0x6>
  }
}
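
If the interleaved listing is hard to follow, here is the same dispatch logic as a minimal PC-side sketch. It assumes GCC or Clang (for __builtin_clz); the callbacks table, clz32, dispatch and onPin5 are hypothetical stand-ins, not the core's actual symbols.

```cpp
#include <cassert>
#include <cstdint>

using Callback = void (*)();

// Hypothetical stand-in for the core's per-port callback table.
static Callback callbacks[32] = {};

// Count leading zeros; returns 32 for an input of 0, like the ARM clz instruction.
static inline uint8_t clz32(uint32_t value) {
  return value ? static_cast<uint8_t>(__builtin_clz(value)) : 32;
}

// Walk the pending-interrupt bitmask from the highest set bit down and call
// the registered callback (if any) for each pin. Returns the number of
// pending bits visited.
static int dispatch(uint32_t isr) {
  int visited = 0;
  uint8_t leadingZeros;
  while ((leadingZeros = clz32(isr)) < 32) {
    uint8_t pin = 32 - leadingZeros - 1;
    if (callbacks[pin]) callbacks[pin]();
    isr &= ~(1u << pin);  // clear the bit that was just serviced
    ++visited;
  }
  return visited;
}

// Illustration only: a user callback that counts its invocations.
static int pin5Hits = 0;
static void onPin5() { ++pin5Hits; }
```

Every interrupt on the port goes through this loop before your own handler runs, which is where part of the extra latency comes from.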

Here's your user ISR (no additional context saving!). It looks like you could save a cycle by making cnt_hdl a uint32_t, since the compiler adds the (unnecessary) uxth instruction.

00080148 <Pin4Intr()>:
volatile uint16_t cnt_hdl = 0;
volatile uint32_t tim = 0;

void Pin4Intr() {
  REG_PIOC_SODR = 0x01000000;  // set D6
   80148:	4a05      	ldr	r2, [pc, #20]	; (80160 <Pin4Intr()+0x18>)
   8014a:	f04f 7380 	mov.w	r3, #16777216	; 0x1000000
   8014e:	6013      	str	r3, [r2, #0]
  cnt_hdl++;
   80150:	4a04      	ldr	r2, [pc, #16]	; (80164 <Pin4Intr()+0x1c>)
   80152:	8811      	ldrh	r1, [r2, #0]
   80154:	3101      	adds	r1, #1
   80156:	b289      	uxth	r1, r1
   80158:	8011      	strh	r1, [r2, #0]
  REG_PIOC_CODR = 0x01000000;  // clear D6
   8015a:	4a03      	ldr	r2, [pc, #12]	; (80168 <Pin4Intr()+0x20>)
   8015c:	6013      	str	r3, [r2, #0]
   8015e:	4770      	bx	lr
   80160:	400e1230 	.word	0x400e1230
   80164:	200704b8 	.word	0x200704b8
   80168:	400e1234 	.word	0x400e1234

Thanks again for so many explanations.
Now I have a more realistic view what can be expected from a Due.

(alas, any ARM. Or any RISC processor in general, I suppose. Tensilica (ESP), RISC-V, MIPS (PIC32)...)

Vendor flash and cache implementations seem to vary a lot (ARM provides a cache option, but I think most of the chips I've seen with cache do their own implementation.) Nearly all the ARMs have some form of "Flash acceleration", but trying to figure out exactly how those operate is ... annoying.

(also, some of the new chips (ESP, RP2xxx) normally execute from SPI flash, which is slower than on-chip flash, especially for non-contiguous access. Some of the ISR response times people were seeing with rp2040 were pretty awful. Fortunately, there's usually an option to put time-critical code in RAM, which is somewhat more deterministic (and faster.))