The delayMicroseconds() function fails when it is called with a value of 0.
This is not something you would normally do on purpose, but it can happen when the argument is a variable holding the result of some expression or calculation.
The bug can be traced back to the if (--us == 0) statement. When the parameter us is 0, the decrement wraps around to 65535 (unsigned underflow) instead of reaching 0, so the early return is skipped.
The delay is then far longer than requested (about 16384 microseconds).
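The wrap-around can be reproduced on a PC with a rough model of the unpatched 16 MHz arithmetic. This is only a sketch, not part of the Arduino core: the names trunc16, unpatchedLoopCount16MHz and loopMicros16MHz are made up for illustration, unsigned int is assumed to be 16 bits wide (as on AVR), and a result of 0 is used as a convention for "returned before the busy-wait loop".

```cpp
#include <stdint.h>

// Truncate to 16 bits, mimicking unsigned int arithmetic on AVR (assumption).
constexpr uint16_t trunc16(uint32_t v) { return static_cast<uint16_t>(v & 0xFFFFu); }

// Model of the unpatched 16 MHz branch:
//   if (--us == 0) return;  us <<= 2;  us -= 2;
// Returns the number of busy-wait loop iterations, or 0 for the early return.
constexpr uint16_t unpatchedLoopCount16MHz(uint16_t us)
{
  return (trunc16(us - 1u) == 0)
           ? 0
           : trunc16(trunc16(trunc16(us - 1u) << 2) - 2u);
}

// Each loop iteration takes 4 cycles, i.e. 0.25 microseconds at 16 MHz.
constexpr double loopMicros16MHz(uint16_t iterations) { return iterations * 0.25; }
```

With this model, unpatchedLoopCount16MHz(0) is 65530 iterations, and loopMicros16MHz(65530) is 16382.5 microseconds of loop time, which matches the roughly 16384 micros observed.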
Bug seen in IDE 0.22, 1.0.0 and 1.0.2.
Find the test program and patch below. NOTE: the patch does not include the 20 MHz addition from 1.0.2, as I cannot test it.
The patched version of delayMicroseconds() returns as soon as possible when called with 0. This still takes about 1.2 µs, but that is far better than roughly 16,000 µs.

Q: can anybody confirm this bug on a 20 MHz Arduino?
Program to show the bug:
//
// FILE: delayMicrosecondsBUG.pde
// AUTHOR: Rob Tillaart
// DATE: 2012-11-18
//
// PURPOSE: test delayMicroseconds()
//
void setup()
{
  Serial.begin(9600);
  Serial.println("start...");
  unsigned long m = micros();
  for (uint8_t i=0; i<100; i++)
  {
    delayMicroseconds(0);
  }
  Serial.println(micros()- m);
  m = micros();
  for (uint8_t i=0; i<100; i++)
  {
    delayMicroseconds(1);
  }
  Serial.println(micros()- m);
  m = micros();
  for (uint8_t i=0; i<100; i++)
  {
    delayMicroseconds(2);
  }
  Serial.println(micros()- m);
  m = micros();
  delayMicroseconds(20);
  Serial.println(micros()- m);
  m = micros();
  delayMicroseconds(200);
  Serial.println(micros()- m);
  m = micros();
  delayMicroseconds(2000);
  Serial.println(micros()- m);
}
void loop()
{}
====================
Below is a patch for the function (for the 0.22 and 1.0.0 versions):
void delayMicroseconds(unsigned int us)
{
  // calling avrlib's delay_us() function with low values (e.g. 1 or
  // 2 microseconds) gives delays longer than desired.
  //delay_us(us);
#if F_CPU >= 16000000L
  // for the 16 MHz clock on most Arduino boards
  // for a one-microsecond delay, simply return. the overhead
  // of the function call yields a delay of approximately 1 1/8 us.
  // PATCH
  // if (--us == 0)
  // return;
  if (us < 2) return;
  us--;
  // the following loop takes a quarter of a microsecond (4 cycles)
  // per iteration, so execute it four times for each microsecond of
  // delay requested.
  us <<= 2;
  // account for the time taken in the preceding commands.
  us -= 2;
#else
  // for the 8 MHz internal clock on the ATmega168
  // for a one- or two-microsecond delay, simply return. the overhead of
  // the function calls takes more than two microseconds. can't just
  // subtract two, since us is unsigned; we'd overflow.
  // PATCHED LINES
  // if (--us == 0)
  // return;
  // if (--us == 0)
  // return;
  if (us < 3) return;
  us -= 2;
  // the following loop takes half of a microsecond (4 cycles)
  // per iteration, so execute it twice for each microsecond of
  // delay requested.
  us <<= 1;
  // partially compensate for the time taken by the preceding commands.
  // we can't subtract any more than this or we'd overflow w/ small delays.
  us--;
#endif
  // busy wait
  __asm__ __volatile__ (
    "1: sbiw %0,1" "\n\t" // 2 cycles
    "brne 1b" : "=w" (us) : "0" (us) // 2 cycles
  );
}
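To sanity-check the patched arithmetic without a board, the loop counts of both branches can be modelled on a PC. This is a rough sketch only: trunc16, patchedLoopCount16MHz and patchedLoopCount8MHz are hypothetical helper names, unsigned int is assumed to be 16 bits wide (as on AVR), and a result of 0 is used as a convention for "returned before the busy-wait loop".

```cpp
#include <stdint.h>

// Truncate to 16 bits, mimicking unsigned int arithmetic on AVR (assumption).
constexpr uint16_t trunc16(uint32_t v) { return static_cast<uint16_t>(v & 0xFFFFu); }

// Model of the patched 16 MHz branch:
//   if (us < 2) return;  us--;  us <<= 2;  us -= 2;
constexpr uint16_t patchedLoopCount16MHz(uint16_t us)
{
  return (us < 2) ? 0 : trunc16(trunc16(trunc16(us - 1u) << 2) - 2u);
}

// Model of the patched 8 MHz branch:
//   if (us < 3) return;  us -= 2;  us <<= 1;  us--;
constexpr uint16_t patchedLoopCount8MHz(uint16_t us)
{
  return (us < 3) ? 0 : trunc16(trunc16(trunc16(us - 2u) << 1) - 1u);
}
```

In this model a request of 0 never reaches the loop in either branch, while larger values keep the same counts as before the patch: patchedLoopCount16MHz(1000) is 3994 iterations (998.5 µs at 0.25 µs each) and patchedLoopCount8MHz(1000) is 1995 iterations (997.5 µs at 0.5 µs each).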
TODO: not yet posted on the issue list ...