Also, micros() is only approximate anyway: on a 16MHz board it has a resolution of 4µs.
At 16MHz each clock cycle lasts 1/16000000 of a second, which is 62.5ns. 20µs is just 320 clocks. Some of those are used for getting the micros(), some by the serial ISR, etc.
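The arithmetic above can be written down as a small host-side C++ sketch. It is only an illustration: the names `kCpuHz`, `cyclesInMicros` and `cyclesToNanos` are made up for this example, and the 16MHz clock is an assumption that matches a stock Uno-class board.

```cpp
#include <cstdint>

// Assumed CPU clock: 16 MHz (hypothetical constant, adjust for your board).
constexpr std::uint32_t kCpuHz = 16000000UL;

// Number of CPU clock cycles that fit into a window of `us` microseconds.
constexpr std::uint32_t cyclesInMicros(std::uint32_t us) {
    return (kCpuHz / 1000000UL) * us;
}

// Duration of `cycles` clock cycles in nanoseconds.
constexpr double cyclesToNanos(std::uint32_t cycles) {
    return cycles * 1e9 / kCpuHz;
}
```

With these, `cyclesInMicros(20)` gives the 320-clock budget and `cyclesToNanos(1)` gives the 62.5ns cycle time.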
HardwareSerial::write is quite a heavyweight function when seen from the point of view of the generated assembly:
00000562 <_ZN14HardwareSerial5writeEh>:
size_t HardwareSerial::write(uint8_t c)
562: cf 93 push r28
564: df 93 push r29
566: ec 01 movw r28, r24
{
int i = (_tx_buffer->head + 1) % SERIAL_BUFFER_SIZE;
568: ee 85 ldd r30, Y+14 ; 0x0e
56a: ff 85 ldd r31, Y+15 ; 0x0f
56c: e0 5c subi r30, 0xC0 ; 192
56e: ff 4f sbci r31, 0xFF ; 255
570: 20 81 ld r18, Z
572: 31 81 ldd r19, Z+1 ; 0x01
574: e0 54 subi r30, 0x40 ; 64
576: f0 40 sbci r31, 0x00 ; 0
578: 2f 5f subi r18, 0xFF ; 255
57a: 3f 4f sbci r19, 0xFF ; 255
57c: 2f 73 andi r18, 0x3F ; 63
57e: 30 70 andi r19, 0x00 ; 0
// If the output buffer is full, there's nothing for it other than to
// wait for the interrupt handler to empty it a bit
// ???: return 0 here instead?
while (i == _tx_buffer->tail)
580: df 01 movw r26, r30
582: ae 5b subi r26, 0xBE ; 190
584: bf 4f sbci r27, 0xFF ; 255
586: 8d 91 ld r24, X+
588: 9c 91 ld r25, X
58a: 11 97 sbiw r26, 0x01 ; 1
58c: 28 17 cp r18, r24
58e: 39 07 cpc r19, r25
590: d1 f3 breq .-12 ; 0x586 <_ZN14HardwareSerial5writeEh+0x24>
;
_tx_buffer->buffer[_tx_buffer->head] = c;
592: e0 5c subi r30, 0xC0 ; 192
594: ff 4f sbci r31, 0xFF ; 255
596: 80 81 ld r24, Z
598: 91 81 ldd r25, Z+1 ; 0x01
59a: e0 54 subi r30, 0x40 ; 64
59c: f0 40 sbci r31, 0x00 ; 0
59e: e8 0f add r30, r24
5a0: f9 1f adc r31, r25
5a2: 60 83 st Z, r22
_tx_buffer->head = i;
5a4: ee 85 ldd r30, Y+14 ; 0x0e
5a6: ff 85 ldd r31, Y+15 ; 0x0f
5a8: e0 5c subi r30, 0xC0 ; 192
5aa: ff 4f sbci r31, 0xFF ; 255
5ac: 31 83 std Z+1, r19 ; 0x01
5ae: 20 83 st Z, r18
sbi(*_ucsrb, _udrie);
5b0: ee 89 ldd r30, Y+22 ; 0x16
5b2: ff 89 ldd r31, Y+23 ; 0x17
5b4: 20 81 ld r18, Z
5b6: 81 e0 ldi r24, 0x01 ; 1
5b8: 90 e0 ldi r25, 0x00 ; 0
5ba: 0f 8c ldd r0, Y+31 ; 0x1f
5bc: 02 c0 rjmp .+4 ; 0x5c2 <_ZN14HardwareSerial5writeEh+0x60>
5be: 88 0f add r24, r24
5c0: 99 1f adc r25, r25
5c2: 0a 94 dec r0
5c4: e2 f7 brpl .-8 ; 0x5be <_ZN14HardwareSerial5writeEh+0x5c>
5c6: 28 2b or r18, r24
5c8: 20 83 st Z, r18
// clear the TXC bit -- "can be cleared by writing a one to its bit location"
transmitting = true;
5ca: 81 e0 ldi r24, 0x01 ; 1
5cc: 89 a3 std Y+33, r24 ; 0x21
sbi(*_ucsra, TXC0);
5ce: ec 89 ldd r30, Y+20 ; 0x14
5d0: fd 89 ldd r31, Y+21 ; 0x15
5d2: 80 81 ld r24, Z
5d4: 80 64 ori r24, 0x40 ; 64
5d6: 80 83 st Z, r24
return 1;
}
5d8: 81 e0 ldi r24, 0x01 ; 1
5da: 90 e0 ldi r25, 0x00 ; 0
5dc: df 91 pop r29
5de: cf 91 pop r28
5e0: 08 95 ret
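That's 64 instructions of 2 bytes each. As a rough estimate, assuming one clock cycle per byte, that is 128 clock cycles. (AVR timing actually goes per instruction, not per byte: simple ALU operations take 1 cycle, loads, stores, push and pop take 2, and ret takes 4, so the true figure depends on the instruction mix and on how often the loops run.) 128 × 62.5ns = 8µs, and that's just the Serial.write(). Add to that the Serial.flush() and micros() calls and assignments, plus the overhead of the ISR that will become active the moment you do the Serial.write(), and you can easily fill 20µs.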
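For a slightly finer estimate than "one cycle per byte", the listing can be tallied per instruction. The sketch below is a hand count under stated assumptions, not a measurement: the cycle costs are those of a classic ATmega328P-style core (1 cycle for ALU and move operations and a not-taken branch, 2 for ld/ldd/st/std, push/pop, sbiw, rjmp and a taken branch, 4 for ret), the transmit buffer is assumed not to be full so the wait loop runs once, and the bit index held in `_udrie` is assumed to be 5 (UDRIE0 on that chip). All names in the code are hypothetical.

```cpp
#include <cstdint>

// Hand-counted from the listing, excluding the 4-instruction shift loop.
constexpr std::uint32_t kOneCycleOps = 30;  // ALU/move ops plus the not-taken breq
constexpr std::uint32_t kTwoCycleOps = 29;  // ld/ldd/st/std, push/pop, sbiw, rjmp
constexpr std::uint32_t kRetCycles   = 4;   // ret on a core with a 16-bit PC

// The loop that builds (1 << bit): each of the `bit` passes costs
// add + adc + dec + taken brpl = 5 cycles; the final dec + not-taken brpl costs 2.
constexpr std::uint32_t shiftLoopCycles(std::uint32_t bit) {
    return bit * 5 + 2;
}

// Estimated cycles for one call to write() when the buffer has room.
constexpr std::uint32_t writeCycles(std::uint32_t udrieBit) {
    return kOneCycleOps + 2 * kTwoCycleOps + kRetCycles + shiftLoopCycles(udrieBit);
}

// The same estimate in microseconds, assuming a 16 MHz clock.
constexpr double writeMicros(std::uint32_t udrieBit) {
    return writeCycles(udrieBit) / 16.0;
}
```

Under those assumptions `writeCycles(5)` comes to 119 cycles, or about 7.4µs, which is in the same ballpark as the 8µs estimate above.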