Inaccuracy of delayMicroseconds()

I have a timing-critical application, and I'm trying to use delayMicroseconds() to control a pulse width. It works wonderfully for larger delays, but at around 10 us the pulse comes out a few clock cycles short.

#define STROBE_PORT PORTE
const int strobePin = 4;
unsigned int strobeTime = 10;

void setup()
{
  pinMode(2, OUTPUT);
  digitalWrite(2, LOW);
}

void loop()
{
  if ( correctSerialInput )
  {
    powerPulse();
  }
}

void powerPulse()
{
  if ( strobeTime <= 10000 )
  {
    cli();                           // no interrupts while the pulse is high
    sbi(STROBE_PORT, strobePin);     // pulse start
    delayMicroseconds(strobeTime);
    cbi(STROBE_PORT, strobePin);     // pulse end
    sei();
  }
}

Pulse width: 9.675us

The critical code is everything between sbi() and cbi(), as that controls the width of the pulse. So I thought I’d take a look at the assembly code:

  sbi(STROBE_PORT, strobePin);
  36:      74 9a             sbi      0x0e, 4      ; 14
  delayMicroseconds(strobeTime);
  38:      c9 01             movw      r24, r18
  3a:      0e 94 00 00       call      0      ; 0x0 <_Z10powerPulsev>
                  3a: R_AVR_CALL      delayMicroseconds
  cbi(STROBE_PORT, strobePin);
  3e:      74 98             cbi      0x0e, 4      ; 14

Between the sbi and the cbi there is a movw (1 cycle) and a call (5 cycles): 1 + 5 = 6 cycles. Now, taking a look at the function itself:

delayMicroseconds:

#if F_CPU >= 16000000L
      // for the 16 MHz clock on most Arduino boards

      // for a one-microsecond delay, simply return.  the overhead
      // of the function call yields a delay of approximately 1 1/8 us.
      if (--us == 0)
   0:      01 97             sbiw      r24, 0x01      ; 1
   2:      01 f0             breq      .+0            ; 0x4 <delayMicroseconds+0x4>
                  2: R_AVR_7_PCREL      .text.delayMicroseconds+0x12
            return;

      // the following loop takes a quarter of a microsecond (4 cycles)
      // per iteration, so execute it four times for each microsecond of
      // delay requested.
      us <<= 2;
   4:      88 0f             add      r24, r24
   6:      99 1f             adc      r25, r25
   8:      88 0f             add      r24, r24
   a:      99 1f             adc      r25, r25

      // account for the time taken in the preceeding commands.
      us -= 2;
   c:      02 97             sbiw      r24, 0x02      ; 2
      // we can't subtract any more than this or we'd overflow w/ small delays.
      us--;
#endif

      // busy wait
      __asm__ __volatile__ (
   e:      01 97             sbiw      r24, 0x01      ; 1
  10:      01 f4             brne      .+0            ; 0x12 <delayMicroseconds+0x12>
                  10: R_AVR_7_PCREL      .text.delayMicroseconds+0xe
  12:      08 95             ret

Instruction - cycles
sbiw - 2
breq - 1 (condition is false)
add - 1
adc - 1
add - 1
adc - 1
sbiw - 2
Total before the loop: 9 cycles.
(The us--; belongs to the other branch of the #if and isn't actually compiled.)

Once in the loop, it counts the us variable down to zero, but the value has changed since it was passed in:
r25/r24 = 10
sbiw 1 → 9
us <<= 2 → 36
sbiw 2 → 34

For us = 34 down to us = 2 (33 iterations), sbiw takes 2 clock cycles and brne takes 2 (branch taken), so each pass costs 4 cycles.

33*4 = 132

On the last iteration us reaches 0, so brne takes only 1 cycle (branch not taken), followed by the ret: 2+1+5 = 8.

Summing up the cycles in delayMicroseconds: 9 + 132 + 8 = 149.
Add the 6 cycles for the movw and call from before: 155 clock cycles.

155 cycles at 16 MHz = 9.6875us, which is approximately what I'm seeing.
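
As a sanity check, the count above can be replayed on a PC. This is only a minimal sketch in plain C, not code for the Arduino: the helper name pulse_cycles and the cycle constants are assumptions taken from the figures in this post (5-cycle call and ret, 16 MHz clock), and it models the listing rather than measuring anything.

```c
#include <assert.h>

/* Cycle costs as used in this post (assumed, not measured). */
enum { CYC_MOVW = 1, CYC_CALL = 5, CYC_RET = 5, CYC_SBIW = 2,
       CYC_ADD = 1, CYC_ADC = 1, CYC_BR_TAKEN = 2, CYC_BR_NOT_TAKEN = 1 };

/* Hypothetical helper: cycles between sbi and cbi for
 * delayMicroseconds(us), following the 16 MHz listing step by step. */
static unsigned long pulse_cycles(unsigned int us)
{
    unsigned long cycles;

    assert(us >= 2);                          /* us == 1 returns early; not modeled */

    cycles  = CYC_MOVW + CYC_CALL;            /* caller: movw + call */
    cycles += CYC_SBIW + CYC_BR_NOT_TAKEN;    /* if (--us == 0), branch not taken */
    us--;
    cycles += 2 * (CYC_ADD + CYC_ADC);        /* us <<= 2 */
    us <<= 2;
    cycles += CYC_SBIW;                       /* us -= 2 */
    us -= 2;

    do {                                      /* busy-wait loop: sbiw + brne */
        cycles += CYC_SBIW;
        us--;
        cycles += (us != 0) ? CYC_BR_TAKEN : CYC_BR_NOT_TAKEN;
    } while (us != 0);

    return cycles + CYC_RET;                  /* ret */
}
```

Under these assumptions pulse_cycles(10) comes out at 155, and 155 / 16 = 9.6875 us.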

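Adding up all of the clock cycles involved with the function itself (call, the 9 cycles before the loop, the full loop iterations, the final iteration, and ret; the caller's movw is setup overhead and is left out), with us being the requested delay, you get:

5 + 9 + 4( 4(us-1) - 2 - 1 ) + 3 + 5

Simplifying yields 16*us - 6 cycles, where 16*us cycles were requested. This is an error of -6 clock cycles, or -0.375us. At 10us, this is a -3.75% error, which is too high for my application, and quite possibly for others.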
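
The formula is easy to check numerically. The following is a rough sketch in plain C for a PC; the function names are made up for illustration, and the constants are the cycle counts assumed in this post (the caller's movw is not included).

```c
#include <assert.h>

/* Term by term: call + pre-loop code + full loop passes + last pass + ret. */
static long actual_cycles(long us)
{
    long count;

    assert(us >= 2);                 /* us == 1 takes the early return */
    count = 4 * (us - 1) - 2;        /* loop counter on entry to the busy-wait */
    return 5 + 9 + 4 * (count - 1) + 3 + 5;
}

/* Difference from the 16 cycles per microsecond that were asked for. */
static long error_cycles(long us)
{
    return actual_cycles(us) - 16 * us;
}

/* Same error as a percentage of the requested delay. */
static double error_percent(long us)
{
    return 100.0 * (double)error_cycles(us) / (16.0 * (double)us);
}
```

For us = 10 this gives 154 cycles, an error of -6 cycles, or -3.75%. For us = 100 the same -6 cycles is only -0.375%, which fits with the function looking fine at higher values.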

A simple solution would be as follows:
Replace the "us -= 2;" with "us -= 1;". This adds one more pass through the loop (4 cycles), which reduces the error to -2 clock cycles, and that could be remedied by adding 2 nops.
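
To see what the proposed change does to the count, here is a small model in plain C for a PC. It is a sketch under assumptions: the names delay_cycles and error_cycles and the parameters subtract and nops are hypothetical, the cycle costs are the ones used in this post, and it assumes the subtraction still compiles to a single 2-cycle sbiw and that the nops sit outside the loop.

```c
#include <assert.h>

/* Cycles spent in the function (call through ret) when the line
 * "us -= subtract;" is used and `nops` extra nop instructions are added. */
static long delay_cycles(long us, long subtract, long nops)
{
    long count;

    assert(us >= 2);                     /* us == 1 takes the early return */
    count = 4 * (us - 1) - subtract;     /* loop counter on entry */
    assert(count >= 1);                  /* must not wrap to 0 or below */
    return 5 + 9 + 4 * (count - 1) + 3 + nops + 5;
}

/* Difference from the requested 16 cycles per microsecond. */
static long error_cycles(long us, long subtract, long nops)
{
    return delay_cycles(us, subtract, nops) - 16 * us;
}
```

With these assumptions, error_cycles(10, 2, 0) is -6 (the current code), error_cycles(10, 1, 0) is -2, and error_cycles(10, 1, 2) is 0.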

Alternatively, it might be wise to just document the 6-cycle error in the reference. There is always some overhead in setting up the call, and it can vary. Reducing the function's own error to 0 means the setup overhead will cause an unavoidable positive error. A user who knows about the error can easily compensate by working out the setup overhead and adding nops to bring it up to 6 cycles.

a difference of exactly 5 cycles. What gives?

Call/return overhead?

Just a guess.

Still working out the math. Hold on :P

Alright, the math is worked out. There's an inherent error of 6 clock cycles in the function which could easily be solved.