Why is micros() so much slower than millis()?

No. Reading a 32-bit microsecond count over a pin would typically be MUCH slower than the current micros() function.

Look, your test code demonstrates that micros() is much slower than millis(), but that's been explained: millis() is a very simple function that doesn't do much, so it is VERY fast. micros() does more work - it combines two separate counters (millisecond interrupts and individual timer ticks) with some moderate math. That's easily several times more complicated, and millis() may get inlined as well, which would widen the gap further.

But that doesn't mean that micros() is actually SLOW! Here's the object code generated for micros() on an Uno:

unsigned long micros() {
        unsigned long m;
        uint8_t oldSREG = SREG, t;
     3da:       3f b7           in      r19, 0x3f       ; 63
        
        cli();
     3dc:       f8 94           cli
        m = timer0_overflow_count;
     3de:       80 91 ad 01     lds     r24, 0x01AD
     3e2:       90 91 ae 01     lds     r25, 0x01AE
     3e6:       a0 91 af 01     lds     r26, 0x01AF
     3ea:       b0 91 b0 01     lds     r27, 0x01B0     ;timer0_overflow_count
        t = TCNT0;
     3ee:       26 b5           in      r18, 0x26       ; 38
        if ((TIFR0 & _BV(TOV0)) && (t < 255))
     3f0:       a8 9b           sbis    0x15, 0 ; 21
     3f2:       05 c0           rjmp    .+10            ; 0x3fe <micros+0x24>
     3f4:       2f 3f           cpi     r18, 0xFF       ; 255
     3f6:       19 f0           breq    .+6             ; 0x3fe <micros+0x24>
                m++;
     3f8:       01 96           adiw    r24, 0x01       ; 1
     3fa:       a1 1d           adc     r26, r1
     3fc:       b1 1d           adc     r27, r1
        SREG = oldSREG;
     3fe:       3f bf           out     0x3f, r19       ; 63
        
        return ((m << 8) + t) * (64 / clockCyclesPerMicrosecond());
     400:       ba 2f           mov     r27, r26
     402:       a9 2f           mov     r26, r25
     404:       98 2f           mov     r25, r24
     406:       88 27           eor     r24, r24
     408:       bc 01           movw    r22, r24
     40a:       cd 01           movw    r24, r26
     40c:       62 0f           add     r22, r18
     40e:       71 1d           adc     r23, r1
     410:       81 1d           adc     r24, r1
     412:       91 1d           adc     r25, r1
     414:       42 e0           ldi     r20, 0x02       ; 2
     416:       66 0f           add     r22, r22
     418:       77 1f           adc     r23, r23
     41a:       88 1f           adc     r24, r24
     41c:       99 1f           adc     r25, r25
     41e:       4a 95           dec     r20
     420:       d1 f7           brne    .-12
}
     422:       08 95           ret

(and yes, if you're going to be this nit-picky about the speed of code, you DO need to learn how to interpret the object code!)
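For anyone who'd rather read C than AVR assembly, here's a rough host-side model of what that listing computes. It's a sketch, not the Arduino core: micros_model and its parameter names are made up for illustration, and the hardware state (timer0_overflow_count, TCNT0, the TOV0 flag) is passed in as plain arguments so it compiles on a PC.

```c
#include <stdint.h>

/* Host-side model of the arithmetic in the micros() listing.
 * Hypothetical helper, not part of the Arduino core.
 *
 * overflow_count   - Timer0 overflows counted so far (timer0_overflow_count)
 * tcnt             - current 8-bit timer value (TCNT0)
 * overflow_pending - nonzero if TOV0 is set, i.e. the timer has wrapped
 *                    but the interrupt has not bumped the count yet
 * cycles_per_us    - CPU clock cycles per microsecond (16 on an Uno)
 */
uint32_t micros_model(uint32_t overflow_count, uint8_t tcnt,
                      int overflow_pending, uint32_t cycles_per_us)
{
    /* Same correction as the listing: account for an overflow that
     * happened after interrupts were disabled. */
    if (overflow_pending && tcnt < 255)
        overflow_count++;

    /* Each overflow is 256 ticks; each tick is 64 CPU cycles (the
     * prescaler), so 64 / cycles_per_us microseconds per tick. */
    return ((overflow_count << 8) + tcnt) * (64 / cycles_per_us);
}
```

With cycles_per_us = 16, each timer tick is worth 64/16 = 4 microseconds, so micros_model(1, 0, 0, 16) gives 1024: one overflow is 256 ticks of 4 microseconds each.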

I count about 50 cycles for the whole thing; at the Uno's 16 MHz clock that's about 3 microseconds of execution time. Reading 32 bits via a pin would take at least 128 cycles (read a bit, merge into count, loop, times 32...), and that doesn't include clocking or waiting for a slower serial protocol. I2C, used by many clocks, runs at ~400 kHz, so that would be 80 microseconds (1280 cycles) just waiting for the bits to arrive.
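If you want to check those numbers yourself, the arithmetic can be sketched in a few lines of plain C. The function names are hypothetical, and the 16 MHz clock is my assumption for a stock Uno; plug in your own values for a different board or bus speed.

```c
#include <stdint.h>

/* Convert a CPU cycle count to microseconds at a clock given in MHz. */
double cycles_to_us(uint32_t cycles, double cpu_mhz)
{
    return cycles / cpu_mhz;
}

/* CPU cycles that go by while `bits` bits arrive on a serial bus
 * clocked at bus_khz, with the CPU running at cpu_mhz.
 * Transfer time in microseconds is bits * 1000 / bus_khz;
 * multiplying by cpu_mhz turns that into CPU cycles. */
uint32_t cycles_waiting_for_bits(uint32_t bits, uint32_t bus_khz,
                                 uint32_t cpu_mhz)
{
    return bits * 1000u * cpu_mhz / bus_khz;
}
```

cycles_to_us(50, 16.0) comes out at 3.125, and cycles_waiting_for_bits(32, 400, 16) at 1280, which is 80 microseconds at 16 MHz.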