Timer0 misconfiguration?

I'm using Timer0 on an ATmega328P with a 16 MHz external clock to get 1 µs ticks in an interrupt vector. Here is the timer configuration:

TCCR0A  = _BV(WGM01); // CTC
TCCR0B  = _BV(CS00); // No prescaler. /8 /64 gives same final result!?! 
OCR0A   = 0;
TIMSK0  = _BV(OCIE0A); // Enable (TIMER0_COMPA_vect) compare match interrupt

The ISR simply increments an unsigned long counter:

ISR(TIMER0_COMPA_vect)
{
    microseconds++;
}

I expect this interrupt to fire at 8 MHz (16E6 / 2). So if I print microseconds every second, the difference from the previous value should be 125.000 ticks. However, I see that 250.000 ticks are counted!

Not sure if it's related, but changing the prescaler to /8 or /64 yields the same 250.000 ticks.

Where is my miscalculation?

Do you have an inkling of what has to happen to service an interrupt? Are you seriously expecting an ISR to run 8,000,000 times a second?

In one microsecond, your 16 MHz processor gets just 16 clock ticks. About five of those are needed just to enter the interrupt service routine, leaving 11 clock ticks for your code inside it.

At least one clock tick is needed to load each byte of microseconds into a register, four in total, leaving seven ticks at best.

One clock tick is needed to increment each byte, four in total, leaving three ticks.

At least one clock tick is needed to store each byte of microseconds back in memory, four in total, leaving -1 ticks. Your interrupt service routine has used too much time: the next interrupt is already pending before you have finished processing this one (an "overrun" has occurred).
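The tally above can be written out as a small calculation. This is only a sketch using the estimates given in this answer (5 cycles to enter the ISR, one cycle per byte each for load, increment and store); the function name is made up for illustration, and real instruction timings can be higher (on AVR, LDS and STS take two cycles each).

```c
#include <assert.h>

/* Clock cycles left in one timer period after the ISR's minimum work.
   A negative result means the next interrupt is due before this one ends.
   The one-cycle-per-byte figures are this answer's estimates, not measured. */
int isr_cycles_remaining(int cycles_per_period, int entry_cycles, int counter_bytes)
{
    assert(counter_bytes > 0);
    int load_cycles  = counter_bytes;  /* estimate: 1 cycle per byte loaded      */
    int add_cycles   = counter_bytes;  /* estimate: 1 cycle per byte incremented */
    int store_cycles = counter_bytes;  /* estimate: 1 cycle per byte stored      */
    return cycles_per_period - entry_cycles
           - load_cycles - add_cycles - store_cycles;
}
```

With the numbers from this answer, `isr_cycles_remaining(16, 5, 4)` evaluates to -1, which is the overrun described above.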

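On top of that, the compiler adds several instructions of overhead to every interrupt service routine, such as saving and restoring the status register and any registers it uses, and the return from the interrupt costs cycles as well.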

In other words, your interrupt service routine takes considerably longer than 1 microsecond to execute. From what you've described, I would guess it takes at least 2 microseconds. Because of the "overrun" problem, your interrupt service routine will execute after every single machine instruction in the rest of your program.
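To pick a setting the ISR can keep up with, it may help to compute how often the compare match fires and how many CPU cycles lie between two interrupts. The sketch below assumes CTC mode, where one match occurs every (OCR0A + 1) timer clocks. The helper names and the 100 µs example configuration are illustrative assumptions, not something taken from the question; the AVR-only part is compiled only when building with avr-gcc.

```c
#include <assert.h>
#include <stdint.h>

/* Compare-match interrupt rate in CTC mode: the counter runs 0..ocr and
   then restarts, so one match occurs every (ocr + 1) timer clocks. */
uint32_t ctc_interrupt_rate_hz(uint32_t f_cpu_hz, uint32_t prescaler, uint8_t ocr)
{
    assert(prescaler != 0);
    return f_cpu_hz / (prescaler * ((uint32_t)ocr + 1u));
}

/* CPU clock cycles available between two consecutive interrupts. */
uint32_t cycles_between_interrupts(uint32_t prescaler, uint8_t ocr)
{
    return prescaler * ((uint32_t)ocr + 1u);
}

#ifdef __AVR__
#include <avr/io.h>
#include <avr/interrupt.h>

/* Hypothetical alternative setup (an assumption, not the asker's code):
   /8 prescaler and OCR0A = 199 give a 100 us tick at 16 MHz, which leaves
   1600 CPU cycles between interrupts. */
void timer0_init_100us(void)
{
    TCCR0A = _BV(WGM01);    /* CTC mode */
    TCCR0B = _BV(CS01);     /* clk/8 */
    OCR0A  = 199;           /* 200 timer clocks per compare match */
    TIMSK0 = _BV(OCIE0A);   /* enable TIMER0_COMPA_vect */
}
#endif
```

By this formula, OCR0A = 0 with no prescaler asks for an interrupt on every CPU clock (16 MHz, one cycle per interrupt), while OCR0A = 0 with the /64 prescaler works out to 250,000 interrupts per second with 64 cycles between them. A coarser tick, such as the 100 µs example, leaves far more room for the ISR and the main program.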

Got it, I will use a bigger prescaler and take the ISR cycles into account.
Thanks.