Accurate microsecond delays for all clocks...


I’ve been working with an Arduino at 1 MHz (using the internal 8 MHz clock and the divide-by-8 option) … all was working well until I tried to use the OneWire library … digging around I found that the problem was the delayMicroseconds() function, which only catered for 8 MHz and 16 MHz clocks. Also, given that most uses of this call pass constants, I felt there was a way to be much more accurate.

So I’ve now got it all working nicely with the following …

A macro determines whether the call is using a constant. If it is, we can use some inline assembly to delay correctly (accurate to the clock cycle), given that the compiler will optimise away all of the logic at compile time.

If it’s not a constant then we call a standard function: for a delay of less than 15 clock cycles it will delay 15ish, for 15 - 20 it’s about 20, and beyond that it should be pretty correct (to within 4 cycles). I believe this is better than the existing routines when using 8 or 16 MHz clocks, and it obviously caters for any other clock frequency by using F_CPU.

So in summary, this gives you clock-cycle-accurate delays when a constant is used, and not bad accuracy for non-constants. My OneWire device is now working perfectly on a 1 MHz Arduino. (It also works at 16 MHz, but that’s the only other testing I’ve done so far.)

The code … the first bit in wiring.h …

#define delay_cycles(x) if(__builtin_constant_p(x)) { delay_cycles_CONST(x); } else { delay_cycles_VAR(x); }
#define delayMicroseconds(x) delay_cycles((x) * clockCyclesPerMicrosecond())

#define delay_cycles_CONST(cycles) { \
   if(cycles&1) asm volatile("nop"); \
   if(cycles&2) asm volatile("nop \n\t nop"); \
   if(cycles&4) asm volatile("push r26 \n\t pop r26"); \
   if(cycles > 7) { \
      unsigned int __n = ((cycles&0xfff8)-4)>>2; /* loop count, 4 cycles per pass */ \
      asm volatile( \
          "1:    sbiw %0, 1"      "\n\t" \
          "      brne 1b"         "\n\t" \
          "      nop"             "\n\t" \
          "      nop"             "\n\t" \
          "      nop"             "\n\t" \
          : "+w" (__n) /* sbiw needs an upper register pair and modifies it */ \
     ); \
   } \
}

void delay_cycles_VAR(unsigned int cycles) __attribute__ ((noinline));

Plus you need to comment out the existing delayMicroseconds prototype.

Then in wiring.c …

void delay_cycles_VAR(unsigned int cycles) {
  if(cycles < 15) return;
  cycles >>= 2;                      // 4 cycles per loop pass
  asm volatile(
    "1:  sbiw %0, 1"    "\n\t"      // 2 cycles
    "    brne 1b"                   // 2 cycles if branch taken
    : "+w" (cycles)                 // counter is modified, so it is an in/out operand
  );
}

And comment out the existing delayMicroseconds routine.

It does cost a few extra bytes of code for a constant-based delay; this could probably be tidied up a little, but I assumed it’s better to be accurate.


I removed my comments when I put it into a macro (was fighting gcc avoiding my always_inline directive), but I can happily supply the thought process behind it all if it’s not obvious.
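For readers who want to check the arithmetic behind the constant case, here is a small host-side model of the cycle budget. It is only a sketch, not part of the posted code: the function name is made up, and it assumes the usual AVR timings (nop = 1, push = 2, pop = 2, sbiw = 2, brne = 2 taken / 1 not taken) plus two single-cycle ldi instructions to load the 16-bit loop counter, which is what the "-4" in the loop count appears to account for.

```c
/* Host-side model of delay_cycles_CONST's cycle budget (hypothetical helper).
   Assumed AVR timings: nop 1, push 2, pop 2, ldi 1, sbiw 2,
   brne 2 when taken and 1 when it falls through. */
static unsigned const_delay_cycles_model(unsigned cycles)
{
    unsigned total = 0;

    if (cycles & 1) total += 1;   /* one nop */
    if (cycles & 2) total += 2;   /* two nops */
    if (cycles & 4) total += 4;   /* push r26 + pop r26 */

    if (cycles > 7) {
        unsigned n = ((cycles & 0xfff8u) - 4) >> 2;  /* loop passes */
        total += 2;           /* assumed: two ldi to load the counter */
        total += 4 * n - 1;   /* sbiw + brne per pass, last brne not taken */
        total += 3;           /* three trailing nops */
    }
    return total;
}
```

Under those assumptions the model returns exactly the requested count for every value that fits in 16 bits, which matches the "accurate to the clock cycle" claim for constants.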


I’m curious, are you using the new 2.0 version of OneWire, where I fixed the interrupt safety issues?

Teensyduino has similar code for delayMicroseconds. I used an inline function and I inlined the var case in asm too. Your macro approach nicely eliminates a lot of the special handling for each speed. I do like how you inline non-looping code that doesn’t require any registers for under 8 cycles. Nice! I might try that idea someday. I made 2 looped copies, so the short delays only require a single register, but then zero registers is even better.

Have you looked at how the compiler implements the var case multiplication by clockCyclesPerMicrosecond if the variable is a signed char, signed int or signed long?
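To make the concern concrete, here is a host-side sketch of what can happen to the cycle count before it reaches delay_cycles_VAR(unsigned int). It assumes a 16-bit unsigned int as on AVR; the helper names are invented for illustration and the truncation is modelled with explicit casts so the example behaves the same on a PC.

```c
#include <stdint.h>

/* Models the value that arrives in a 16-bit unsigned int parameter:
   only the low 16 bits of the product survive. (Hypothetical helper.) */
static uint16_t cycles_as_received(int16_t us, uint16_t cycles_per_us)
{
    return (uint16_t)((uint32_t)(uint16_t)us * cycles_per_us);
}

/* The same product kept in 32 bits, for comparison. */
static uint32_t cycles_widened(uint16_t us, uint16_t cycles_per_us)
{
    return (uint32_t)us * cycles_per_us;
}
```

With 16 cycles per microsecond, a 5000 us request needs 80000 cycles but only 14464 would survive the truncation, and a negative argument such as -1 turns into 65535. A cast before the multiply handles the sign question, but long delays at high clock rates would still need a wider counter or an outer loop.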

Hi Paul,

Yes, I am using version 2.0 and it works flawlessly (at least with my DS18B20) on a 1 MHz Arduino (and on one I built with virtually no external components on a prototyping board).

Basically I want to run a remote temperature sensor (for a pool) that transmits the temperature back to a display. Battery life is a major concern as I'd like to get 1yr+ out of a couple of AAs. So the 1 MHz clock makes a big difference, as does using the WDT with nothing else powered, and not having those LEDs scattered around :-)

I hadn't seen the teensy, it may have done the job I needed - still, it's been fun building it myself!

Interesting point on the var case, at least with small signed variables, but I suspect this would be easily solved with a cast prior to the multiplication ... I'll look at this at the weekend.

Thanks for the input,


I would recommend a tiny change: that

#define delay_cycles(x) if(__builtin_constant_p(x)) { delay_cycles_CONST(x); } else { delay_cycles_VAR(x); }

should be that

#define delay_cycles(x) do if(__builtin_constant_p(x)) { delay_cycles_CONST(x); } else delay_cycles_VAR(x); while (0)

so that you can write:

if (...) delayMicroseconds(5); else ...;
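A quick way to see why the wrapper matters: with the brace-only macro, the semicolon after the call ends the if statement, so a following else no longer has an if to attach to and the code fails to compile. The sketch below uses the do/while(0) form with stub functions standing in for the real delay routines, so it can be compiled on a PC; every name in it is made up for the demonstration. It relies on __builtin_constant_p, so it needs GCC or Clang.

```c
/* Hypothetical stand-ins for the real delay routines. */
static int const_calls, var_calls;
static void delay_const_stub(unsigned c) { (void)c; const_calls++; }
static void delay_var_stub(unsigned c)   { (void)c; var_calls++; }

/* do { ... } while (0) makes the macro behave like a single statement. */
#define delay_cycles_demo(x) \
    do { if (__builtin_constant_p(x)) delay_const_stub(x); \
         else delay_var_stub(x); } while (0)

/* Returns 1 if the else branch ran, 0 if the delay ran. */
static int delay_or_else(int flag)
{
    if (flag) delay_cycles_demo(5); else return 1;
    return 0;
}
```

Replacing the do/while(0) body with a bare { ... } block makes delay_or_else() a syntax error at the else.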

What about interrupts? Is the programmer supposed to disable them before using delayMicroseconds()?


is the programmer supposed to disable them before using delayMicroseconds()?

Yes. With version 0018 the Arduino library no longer disables interrupts in delayMicroseconds.

Why not just go for avr-libc?

I have no clue why arduino wraps avr-libc but drops a lot of useful functionality while doing so. The most striking example is the eeprom library. It introduces some overhead and drops almost all of the underlying functionality.

When in doubt, I would always recommend having a look at avr-libc before reinventing the wheel.

With regard to disabling the interrupts: the only reasonable way to disable them is for the programmer to control it. If the subroutine controlled it, the pending interrupts would be executed as soon as it enabled interrupts again (which typically happens before it returns). So anyone who wants exact delays has to control them anyway.


Ah ... I hadn't noticed these functions in avr-libc. Although they still don't cater for the small-delay case at 1 MHz ... basically anything less than about 5us won't work properly.

I'll have a look in more detail though, as I'm not convinced about the value of using floating point ... I want to check how the compiler optimises it away, plus the minimum non-constant delay is going to be quite significant using this approach.

I actually used a lot of the WDT and Sleep functionality from this library and it worked really nicely ... can someone explain the relationship between the various libraries and environments? It does seem a little disjointed.



I'm hoping to switch to something like this version of delayMicroseconds() in Arduino 0019. It would be great to have a cleaned-up version.

RIDDICK and Udo Klein: you both make good points.

Lee, if you check on the possibility of using the built-in _delay_us(), can you post your findings? It only works when the argument is a compile-time constant, so you'd still need a variable-argument implementation. That implementation might still use _delay_loop_1() and _delay_loop_2(), though.
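As a rough illustration of what a variable-argument path built on _delay_loop_2() would have to compute: that routine takes a 16-bit count and spends 4 cycles per iteration, and a count of 0 means 65536 iterations. The helper below is a hypothetical host-side sketch of the conversion only; it ignores call and setup overhead, and the clamping policy is an assumption.

```c
#include <stdint.h>

/* Iteration count for a 4-cycles-per-pass loop such as _delay_loop_2().
   Hypothetical helper: ignores call and setup overhead. */
static uint16_t loop2_iterations(uint16_t us, uint32_t f_cpu_hz)
{
    uint32_t cycles = (uint32_t)us * (f_cpu_hz / 1000000UL);
    uint32_t n = cycles / 4;

    if (n == 0) n = 1;             /* 0 would mean 65536 passes */
    if (n > 65535UL) n = 65535UL;  /* clamp to the 16-bit counter */
    return (uint16_t)n;
}
```

At 1 MHz this gives one iteration (4 cycles) for anything up to 7 us, which shows why short variable delays at low clock rates need separate handling.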

The Arduino libraries sit on top of avr-libc, but don't always expose as much functionality as it does (or use it in their implementation as much as they should). You should be able to use any avr-libc functions in your Arduino sketches.

One more comment on this code part:

#define delay_cycles_CONST(cycles) { \
   if(cycles&1) asm volatile("nop"); \
   if(cycles&2) asm volatile("nop \n\t nop"); \
   if(cycles&4) asm volatile("push r26 \n\t pop r26"); \

For cycles == 3 (and == 7) the result will consume 3 (5) words of memory. However, if I understand the datasheet correctly, a delay of 3 cycles should be possible with only 2 words by combining two branch instructions like so


depending on the state of the zero flag this will consume 1+2 or 2+1 cycles, thus always 3 cycles. But it will consume only 2 words.
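The snippet for this post did not survive, so the following is only a guess at the idea, written as a host-side model rather than real AVR code. It assumes the pair is something like "brne .+0" followed by "breq .+0" (each branch targeting the instruction right after itself), and that a conditional branch costs 2 cycles when taken and 1 when not. The function names are invented.

```c
/* Assumed AVR timing: a conditional branch takes 2 cycles if taken, 1 if not. */
static int branch_cycles(int taken) { return taken ? 2 : 1; }

/* Models an assumed pair:  brne .+0   followed by   breq .+0
   Exactly one of the two is taken, whatever the zero flag holds. */
static int branch_pair_cycles(int zero_flag)
{
    int brne = branch_cycles(!zero_flag);  /* taken when Z is clear */
    int breq = branch_cycles(zero_flag);   /* taken when Z is set */
    return brne + breq;
}
```

Either way the total is 3 cycles in 2 instruction words, without touching any register or the stack.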

The same trick should be applicable in the loop

      asm volatile( \
          "1:    sbiw %0, 1"      "\n\t" \
          "      brne 1b"         "\n\t" \      
          "      nop"             "\n\t" \
          "      nop"             "\n\t" \
          "      nop"             "\n\t" \
          : : "r" (((cycles&0xfff8)-4)>>2) \

Thus removing another word from memory.


Udo: I'm not sure we have to worry about squeezing every last bit of space out of these functions. I'd prefer clarity here (along with accurate delays).

Mellis: I understand your point. But have a look at the 3rd line.

   if(cycles&1) asm volatile("nop"); \
   if(cycles&2) asm volatile("nop \n\t nop"); \
   if(cycles&4) asm volatile("push r26 \n\t pop r26"); \

This code is already pushing for efficiency. It is even willing to trade this for some side effect. Namely: it will temporarily allocate some stack space. Thus a 4 cycle delay might push a program that is running out of ram over the edge.

So I understand your argument, but looking at the code I concluded that the author wants to push for maximum flash memory efficiency. Otherwise the 4th line should read

   if(cycles&4) asm volatile("nop \n\t nop \n\t nop \n\t nop"); \

Of course it is your decision but I suspect that most of the users will never ever look up the code. Those that do and can read assembler will most probably understand this trick immediately.


Are you guys aware of the builtin AVR GCC function:
extern void __builtin_avr_delay_cycles(unsigned long __n);

Why not use that instead of trying to figure out how to generate code for the desired cycles?

The compiler will generate inline code, in fact better than some of these examples above by using a collection of various instructions to create a delay for the requested number of clock cycles.

That way the higher level macros simply have to do the math to convert wall clock time to CPU cycles and then call the builtin function to generate the code for the cycle delays.
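A minimal sketch of that conversion step, assuming integer maths and rounding up so the delay is never shorter than requested. The macro names are invented here; only __builtin_avr_delay_cycles() comes from avr-gcc, and it is guarded so the arithmetic can be checked on a PC.

```c
/* Nanoseconds to CPU cycles, rounded up. Hypothetical macro; stays a
   constant expression when both arguments are constants. */
#define NS_TO_CYCLES(ns, f_cpu) \
    ((unsigned long)((((unsigned long long)(ns)) * (f_cpu) + 999999999ULL) \
                     / 1000000000ULL))

#ifdef __AVR__
/* On avr-gcc the builtin needs an argument that folds at compile time. */
#define delay_ns_demo(ns) __builtin_avr_delay_cycles(NS_TO_CYCLES((ns), F_CPU))
#endif
```

For example, 500 ns at 16 MHz comes out as 8 cycles and 100 ns as 2 cycles (1.6 rounded up), while 1000 ns at 1 MHz is a single cycle.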

Not wanting to do that?
Then consider using Hans Heinrichs' delay routines.
Which are already done and fully debugged.

They work and are being used in the newest version of the ks0108/GLCD library for low level sub microsecond (100s of nanoseconds range) delays.

And for those that may ask “why not use the AVR libC <util/delay.h>”?
The reason is that the <util/delay.h> routines have terrible rounding errors in them, which make them totally unusable when needing delays down in the single microsecond or less than microsecond range.
They give you delays in multiples of 3 clocks, so you end up with delays that are potentially -2/+2 clocks of what delay is really possible given what you ask for with respect to your CPU clock frequency.

Hans' routines do not suffer from these rounding errors and are much more accurate for longer delays as well.

— bill