8 nop operations take 9 Arduino Due clock cycles

In another thread posting I stated that 8 nop operations take 9 clock cycles on Arduino Due.

I thought this fact might be interesting on its own, because I read elsewhere that asm volatile("nop"); takes exactly one clock cycle. That is not correct for many nops in a row: I measured 888 nops and they took 999 clock cycles.

In that posting I used a different statement for very short delay additions, one that takes 1.25 clock cycles, simply because as a human I find 5/4 easier to calculate with than 9/8.

I would be interested in any C/C++/asm statement that would take exactly one clock cycle per statement in case of many statements because that would allow something like "delay for exactly X clock cycles" more easily.
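
A small sketch of the arithmetic behind the 9/8 ratio, in case it helps when planning a delay. The function name estimated_nop_cycles is made up for illustration, and the ratio is only what was measured above on one board, not a documented guarantee:

```cpp
#include <cstdint>

// Empirical ratio from the measurement above: 888 NOPs took 999 cycles (9/8).
// Assumption: the same ratio holds for other NOP counts on the same board.
constexpr uint32_t estimated_nop_cycles(uint32_t nops) {
  return (nops * 9u + 7u) / 8u;  // 9/8 cycles per NOP, rounded up to whole cycles
}

// The two data points quoted in this post.
static_assert(estimated_nop_cycles(8) == 9, "8 NOPs -> 9 cycles");
static_assert(estimated_nop_cycles(888) == 999, "888 NOPs -> 999 cycles");
```

Going the other way (how many NOPs for X cycles) only gives exact results where X is a multiple of 9, which is why a true 1-cycle statement would be nicer.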

Hermann.

Flash wait states

You are right, I learned that from ard_newbie; he provided a sketch that used only 1003 clock cycles for 1000 NOPs:
https://forum.arduino.cc/index.php?topic=406165.msg2797704#msg2797704

But the reduction from "4+1" to "3+1" wait states that his sketch performs is really dangerous; at least the Due I tested the sketch with just hung most of the time.

Hermann.

If you find yourself needing very precise timing and for NOP instructions to take 1 clock per instruction you can declare a function containing the NOPs as being RAM resident. Then the code will run from RAM instead of FLASH and you won't have to worry about wait states.

Add this line before the function that should be RAM resident:
__attribute__ ((long_call, section (".ramfunc")))

I guess I don't know for sure that this will fix it but I don't see why it wouldn't. However, function calls have overhead so you'd have to compensate for that.

Perhaps I misunderstood you, but what you propose increased the runtime.

The sketch below reports 1134 with your line commented out; with the line active it reports 1505 for the 1000 NOP statements:

//__attribute__ ((long_call, section (".ramfunc")))
void setup() {
  uint32_t t0,t1,t2;
  
  t0 = SysTick->VAL;
  asm volatile(".rept 1000\n\tNOP\n\t.endr"); 
  t1 = SysTick->VAL;
  
  Serial.begin(57600);
  while (!Serial) {}
  Serial.println( ticks_diff(t0, t1) );
}

void loop() {}

// only for durations <1ms
//
uint32_t ticks_diff(uint32_t t0, uint32_t t1) {
  return ((t0 < t1) ? 84000 + t0 : t0) - t1;
}
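
For anyone wondering about the 84000 in ticks_diff, here is the same calculation as a standalone sketch with the assumptions spelled out. The names kTicksPerMs and ticks_elapsed are only for illustration; the assumption is that SysTick counts down and reloads once per millisecond at the Due's 84 MHz clock:

```cpp
#include <cstdint>

// Assumption: SysTick reloads every 84000 ticks (84 MHz core clock, 1 ms period).
constexpr uint32_t kTicksPerMs = 84000;

// SysTick->VAL counts DOWN, so the earlier reading t0 is normally the larger one.
// If t0 < t1, the counter wrapped past zero between the two reads, so one full
// period is added back. Only valid for intervals shorter than one period (1 ms).
constexpr uint32_t ticks_elapsed(uint32_t t0, uint32_t t1) {
  return ((t0 < t1) ? kTicksPerMs + t0 : t0) - t1;
}

// No wrap: 50000 -> 48866 is 1134 ticks.
static_assert(ticks_elapsed(50000, 48866) == 1134, "plain countdown");
// Wrap: 100 -> 83900 means 100 ticks down to zero, then 100 more after reload.
static_assert(ticks_elapsed(100, 83900) == 200, "countdown with reload");
```

This is also why the comment says "only for durations <1ms": two reads more than one period apart cannot be told from a shorter interval.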

Hermann.

Well I did say I wasn't sure that it'd fix the issue. :wink: It was just a guess. I knew that you could run code from RAM instead of the FLASH so I thought that maybe it'd be possible to use that ability to prevent any flash wait states. But, I also don't know for sure how RAM based functions are executed. Are they copied to RAM immediately upon start up, the first time they're run, or every time? I don't know. If it has to copy the function to RAM every time then that would explain the long time. You could try doing the test twice to see if it only does it the first time. Or, look at the assembly dump. I'll try to do that later on when I get a chance to see how RAM based functions are implemented.

I didn't look this up previously because the only use I've had for RAM functions was to mess with GPNVM bits and FLASH memory and you can't do those things while your code is executing from FLASH. So, you set them as RAM functions and then you can do whatever you want to FLASH and the GPNVM bits.

Another idea would be to determine the proper machine code bytes for the NOPs and a return and place them somewhere in RAM. Then manually execute a long jump to that address. It doesn't really need the standard function prologue and epilogue since you won't ever be changing any registers (well, except for the program counter). Essentially you could have an array that is stored in RAM and then jump to the beginning of it. That seems kind of overcomplicated but it would ensure that nothing is messing with what you're trying to do.
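
A minimal, untested sketch of the array idea, assuming Thumb encodings 0xBF00 for NOP and 0x4770 for BX LR (the return). All names here (nop_buffer, build_nop_buffer, run_nop_buffer) are made up for illustration. The call part is only compiled on ARM Thumb targets, and whether it actually gets closer to one cycle per NOP would still have to be measured:

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint16_t kThumbNop  = 0xBF00;  // Thumb NOP
constexpr uint16_t kThumbBxLr = 0x4770;  // BX LR: return to the caller

constexpr size_t kNopCount = 1000;

// A non-const global array ends up in RAM; one extra slot for the return.
alignas(4) uint16_t nop_buffer[kNopCount + 1];

// Fill the buffer with kNopCount NOPs followed by a single BX LR.
void build_nop_buffer() {
  for (size_t i = 0; i < kNopCount; ++i) {
    nop_buffer[i] = kThumbNop;
  }
  nop_buffer[kNopCount] = kThumbBxLr;
}

#if defined(__arm__) && defined(__thumb__)
// Call into the buffer like a normal function, so BX LR brings us back.
void run_nop_buffer() {
  build_nop_buffer();
  // Make sure the stores are complete before instructions are fetched.
  __asm__ volatile("dsb\n\tisb" ::: "memory");
  // Bit 0 of the target address must be set to stay in Thumb state.
  auto fn = reinterpret_cast<void (*)()>(
      reinterpret_cast<uintptr_t>(nop_buffer) | 1u);
  fn();
}
#endif
```

Calling it through a function pointer rather than a bare jump means the compiler emits the branch-with-link, so no hand-written prologue or epilogue is needed. The call and return overhead would still have to be subtracted from any measurement.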

Sounds interesting. I placed the NOPs into a separate function (other than "setup()") and called it twice. The result is better than 1505 (1125 and 1132), but not better than without the long call:

__attribute__ ((long_call, section (".ramfunc")))
uint32_t f(void) {
  uint32_t t0,t1;

  t0 = SysTick->VAL;
  asm volatile(".rept 1000\n\tNOP\n\t.endr"); 
  t1 = SysTick->VAL;

  return ticks_diff(t0, t1);  
}

void setup() {
  Serial.begin(57600);
  while (!Serial) {}
  
  Serial.println( f() );
  Serial.println( f() );
}

void loop() {}

// only for durations <1ms
//
uint32_t ticks_diff(uint32_t t0, uint32_t t1) {
  return ((t0 < t1) ? 84000 + t0 : t0) - t1;
}

Your idea of jumping into machine code in RAM sounds interesting.
I did not know how to find the opcode for "nop", but a Google search helped. It is "BF 00". But what should come after the 1000 NOPs to return to normal program flow? Since you want to jump rather than call, a return is unlikely to help.

A NOP (or any other assembler statement) taking exactly 1 clock cycle would be really helpful for me. I captured the rising edges of two laser sensors on pins D2 (B.25) and D22 (B.26); the time from one value of port B to the next (column) is only 15.6ns(!) on a (temporarily) 192MHz overclocked Due. I adjust to the start of the rising edge (F9... both LOW, FD... D22 high, FF... D2 and D22 high) with asm volatile(".rept 78\n\tNOP\n\t.endr"); but incrementing the number of NOPs by one sometimes increases the measured clock ticks by more than one. A NOP taking exactly 1 clock cycle is what would really help to adjust.

Hermann.