digitalWrite() is incredibly slow.
Why?
Most of the slowness of digitalWrite() comes from the requirement that it handle a variable for both the pin number and the value. The "PORTD |= 0x8" form compiles into a single instruction, but only because the port and the bit are both built into the instruction. As soon as you want to do a "|=" where the port and the bit are variables, you're looking at perhaps 10 instructions. By the time you add translating the pin number to a port/bit pair and HIGH/LOW to a bit value, and dealing with the timer or A2D that might be in use on the pin, it adds up...
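To make the difference concrete, here is a rough sketch that runs on a PC. It is not the real Arduino core: sim_ports, pin_to_port, pin_to_mask, fast_set_pd3() and slow_digital_write() are all made-up names, and the pin layout is only modelled on the Uno. The real digitalWrite() also deals with PWM on the pin, which is left out here.

```c
#include <stdint.h>

/* Stand-ins for the AVR port registers so this compiles on a PC.
   Hypothetical layout: index 0 = port D, 1 = port B, 2 = port C. */
static volatile uint8_t sim_ports[3];

/* Hypothetical lookup tables, modelled on the Uno pin layout:
   pins 0-7 on port D, 8-13 on port B, 14-19 on port C. */
static const uint8_t pin_to_port[20] = {
    0, 0, 0, 0, 0, 0, 0, 0,   1, 1, 1, 1, 1, 1,   2, 2, 2, 2, 2, 2
};
static const uint8_t pin_to_mask[20] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20,
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20
};

/* Fast form: port and bit are compile-time constants, so on a real AVR
   the compiler can encode both directly in one instruction. */
static inline void fast_set_pd3(void)
{
    sim_ports[0] |= 0x08;
}

/* General form: port and bit are only known at run time, so the code
   needs two table lookups, a pointer and a read-modify-write. */
void slow_digital_write(uint8_t pin, uint8_t value)
{
    if (pin >= 20) {
        return;                       /* not a valid pin in this model */
    }
    volatile uint8_t *port = &sim_ports[pin_to_port[pin]];
    uint8_t mask = pin_to_mask[pin];

    if (value) {
        *port |= mask;                /* set the bit */
    } else {
        *port &= (uint8_t)~mask;      /* clear the bit */
    }
}
```

Calling slow_digital_write(3, 1) and fast_set_pd3() leave the simulated port in the same state; the point is how much more work the first one has to do to get there.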
loop() seems to take about 6 us to go round. Why? It's only a jump to a label.
Not any more. In between loops, the code calls serialEventRun(). (6us sounds a bit excessive, though; serialEventRun() shouldn't do much if you haven't defined serialEvent().)
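Roughly, the shape of what happens between two passes of loop() looks like the sketch below. This is a PC-runnable model, not the core's actual main(): loop_model(), serial_event_run_model(), serial_event_run_hook and run_main_model() are hypothetical names, and a plain function pointer stands in for the optional serialEventRun hook.

```c
#include <stddef.h>

static unsigned long loop_calls;           /* how often the loop body ran   */
static unsigned long serial_event_checks;  /* how often the hook was called */

/* Stand-in for the user's loop(). */
static void loop_model(void)
{
    loop_calls++;
}

/* Stand-in for serialEventRun(). */
static void serial_event_run_model(void)
{
    serial_event_checks++;
}

/* Hypothetical hook: may be NULL if no serial event handling is present. */
static void (*serial_event_run_hook)(void) = serial_event_run_model;

/* Model of the outer loop: each pass is a call to loop(), a test of the
   hook and (if present) a call through it, not just a jump to a label. */
void run_main_model(unsigned long iterations)
{
    for (unsigned long i = 0; i < iterations; i++) {
        loop_model();
        if (serial_event_run_hook != NULL) {
            serial_event_run_hook();
        }
    }
}
```

So every trip round pays for a call and return for loop(), the hook test, and whatever the hook itself does.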
There's a recurring processor stop every millisecond or so - presumably to correct the millis() count. But again it takes about 8 us - it's a simple interrupt incrementing a counter - that doesn't take 100 instructions!
Alas, it increments two counters. Two 32bit counters. So, load 4 bytes, add to each byte, store 4 bytes, times 2. That's 40 cycles. Plus the save/restore of ~8 registers. Plus logic to handle the extra 24us (the timer overflows every 1024us, not every 1000us). The timer overflow ISR is ... just about 50 instructions long, and many of them are 2-cycle instructions.
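The "extra 24us" bookkeeping is easier to see in plain C. Below is a PC-runnable model of the ISR's arithmetic, using the variable names from the listing. The constants 3 and 125 are the 0x03 and 0x7D that show up in the disassembly; MILLIS_INC = 1 is inferred from the adiw instructions. The function name timer0_overflow_model() is made up, and the 16 MHz clock with a /64 prescaler is an assumption.

```c
#include <stdint.h>

/* Assumed 16 MHz clock, timer0 prescaler 64: one overflow = 1024 us. */
#define MILLIS_INC 1     /* whole milliseconds per overflow           */
#define FRACT_INC  3     /* leftover 24 us, counted in units of 8 us  */
#define FRACT_MAX  125   /* 1000 us in units of 8 us                  */

static volatile uint32_t timer0_millis;
static volatile uint8_t  timer0_fract;
static volatile uint32_t timer0_overflow_count;

/* Model of what the overflow ISR does on each timer0 overflow. */
void timer0_overflow_model(void)
{
    /* Copy to locals so the volatile variables are read only once. */
    uint32_t m = timer0_millis;
    uint8_t  f = timer0_fract;

    m += MILLIS_INC;
    f += FRACT_INC;
    if (f >= FRACT_MAX) {
        /* The leftover fractions have added up to a full millisecond. */
        f -= FRACT_MAX;
        m += 1;
    }

    timer0_fract  = f;
    timer0_millis = m;
    timer0_overflow_count++;   /* the second 32-bit counter */
}
```

Under those assumptions, 125 overflows cover 125 x 1024us = 128ms, and the model ends up with timer0_millis at 128 and timer0_fract back at 0.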
Could it be better? Some, probably. You could do the 32bit adds one byte at a time, and probably get rid of at least 4 of the register save/restores (16 cycles). Really not worth the loss in clarity, though... (I DID get rid of the second 32bit counter, but it got added back in for compatibility reasons. Or something. Sigh.)
ISR(TIMER0_OVF_vect)
{
45a: 1f 92 push r1
45c: 0f 92 push r0
45e: 0f b6 in r0, 0x3f ;;; status
460: 0f 92 push r0
462: 11 24 eor r1, r1 ;;; known zero register
464: 2f 93 push r18
466: 3f 93 push r19
468: 8f 93 push r24
46a: 9f 93 push r25
46c: af 93 push r26
46e: bf 93 push r27
unsigned long m = timer0_millis;
470: 80 91 1d 01 lds r24, 0x011D ; 0x80011d <timer0_millis>
474: 90 91 1e 01 lds r25, 0x011E ; 0x80011e <timer0_millis+0x1>
478: a0 91 1f 01 lds r26, 0x011F ; 0x80011f <timer0_millis+0x2>
47c: b0 91 20 01 lds r27, 0x0120 ; 0x800120 <timer0_millis+0x3>
unsigned char f = timer0_fract;
480: 30 91 1c 01 lds r19, 0x011C ; 0x80011c <timer0_fract>
m += MILLIS_INC;
f += FRACT_INC;
484: 23 e0 ldi r18, 0x03
486: 23 0f add r18, r19
if (f >= FRACT_MAX) {
488: 2d 37 cpi r18, 0x7D
48a: 20 f4 brcc .+8 ; 0x494 <__vector_16+0x3a>
m += MILLIS_INC;
48c: 01 96 adiw r24, 0x01
48e: a1 1d adc r26, r1
490: b1 1d adc r27, r1
492: 05 c0 rjmp .+10 ; 0x49e <__vector_16+0x44>
f += FRACT_INC;
if (f >= FRACT_MAX) {
f -= FRACT_MAX;
494: 26 e8 ldi r18, 0x86
496: 23 0f add r18, r19
m += 1;
498: 02 96 adiw r24, 0x02
49a: a1 1d adc r26, r1
49c: b1 1d adc r27, r1
}
timer0_fract = f;
49e: 20 93 1c 01 sts 0x011C, r18 ; 0x80011c <timer0_fract>
timer0_millis = m;
4a2: 80 93 1d 01 sts 0x011D, r24 ; 0x80011d <timer0_millis>
4a6: 90 93 1e 01 sts 0x011E, r25 ; 0x80011e <timer0_millis+0x1>
4aa: a0 93 1f 01 sts 0x011F, r26 ; 0x80011f <timer0_millis+0x2>
4ae: b0 93 20 01 sts 0x0120, r27 ; 0x800120 <timer0_millis+0x3>
timer0_overflow_count++;
4b2: 80 91 21 01 lds r24, 0x0121 ; 0x800121 <timer0_overflow_count>
4b6: 90 91 22 01 lds r25, 0x0122 ; 0x800122 <timer0_overflow_count+0x1>
4ba: a0 91 23 01 lds r26, 0x0123 ; 0x800123 <timer0_overflow_count+0x2>
4be: b0 91 24 01 lds r27, 0x0124 ; 0x800124 <timer0_overflow_count+0x3>
4c2: 01 96 adiw r24, 0x01 ; 1
4c4: a1 1d adc r26, r1
4c6: b1 1d adc r27, r1
4c8: 80 93 21 01 sts 0x0121, r24 ; 0x800121 <timer0_overflow_count>
4cc: 90 93 22 01 sts 0x0122, r25 ; 0x800122 <timer0_overflow_count+0x1>
4d0: a0 93 23 01 sts 0x0123, r26 ; 0x800123 <timer0_overflow_count+0x2>
4d4: b0 93 24 01 sts 0x0124, r27 ; 0x800124 <timer0_overflow_count+0x3>
}
4d8: bf 91 pop r27
4da: af 91 pop r26
4dc: 9f 91 pop r25
4de: 8f 91 pop r24
4e0: 3f 91 pop r19
4e2: 2f 91 pop r18
4e4: 0f 90 pop r0
4e6: 0f be out 0x3f, r0
4e8: 0f 90 pop r0
4ea: 1f 90 pop r1
4ec: 18 95 reti