Your problem is that an interrupt service routine takes a finite time to execute: roughly 2.5 µs of overhead, plus the time for your digitalWrite and whatever else is in the handler.
So you can't execute an ISR every 3 or 4 µs and expect to get a 1 µs pulse out of it.
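To put those times in terms of clock cycles, here is a rough sketch of the arithmetic in plain C++. It assumes a 16 MHz processor clock (which matches the figure of 8 clock cycles per 500 ns used further down); the function name `cyclesIn` is made up for illustration and is not part of any Arduino API.

```cpp
#include <cstdint>

// Hypothetical helper: how many CPU clock cycles fit into a duration
// given in nanoseconds. Integer arithmetic only, so the result is exact
// whenever the duration is a whole number of cycles.
constexpr uint32_t cyclesIn(uint32_t nanoseconds, uint32_t cpuHz) {
    return static_cast<uint32_t>(
        static_cast<uint64_t>(nanoseconds) * cpuHz / 1000000000ULL);
}

// Assumption: 16 MHz clock.
constexpr uint32_t kAssumedCpuHz = 16000000UL;

// A 1 µs pulse is only 16 cycles long, while about 2.5 µs of ISR
// overhead is already 40 cycles before the handler body does anything.
static_assert(cyclesIn(1000, kAssumedCpuHz) == 16, "1 us is 16 cycles");
static_assert(cyclesIn(2500, kAssumedCpuHz) == 40, "2.5 us is 40 cycles");
```

In other words, under that assumption the interrupt overhead alone is more than twice as long as the pulse you are trying to generate.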
However, this sketch (which uses the hardware timer output, so no ISR is involved) reliably outputs 1 MHz on pin 9:
#define myOutputPin 9

void setup ()
  {
  pinMode (myOutputPin, OUTPUT);
  TCCR1A = 0;
  TCCR1B = 0;
  TCNT1 = 0;
  OCR1A = 7;                 // toggle after counting to 8
  TCCR1A |= (1 << COM1A0);   // Toggle OC1A on Compare Match.
  TCCR1B |= (1 << WGM12);    // CTC mode
  TCCR1B |= (1 << CS10);     // clock on, no pre-scaler
  }

void loop () { }
Note that OCR1A is 7, not 16. For one thing, the compare value is zero-relative, so counting 16 clock cycles would need 15, not 16. For another, it takes two toggles to make one full output cycle, so we really want to toggle every 8 clock cycles of the processor (i.e. every 500 ns), which means counting from 0 to 7.
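The same reasoning works for other frequencies. As a sketch (plain C++, not code from the sketch above), the compare value follows from the usual relation for toggle-on-compare in CTC mode, f_out = f_clk / (2 × prescaler × (1 + OCR1A)). The function name `compareValueFor` is hypothetical, and the 16 MHz clock is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper: the value to load into OCR1A so that toggling
// OC1A on compare match (CTC mode) produces outputHz.
// The "2 *" accounts for two toggles per output cycle; the "- 1" is
// there because the counter is zero-relative.
// The caller must make sure outputHz is non-zero and that the result
// fits in 16 bits, since Timer 1 is a 16-bit timer.
constexpr uint32_t compareValueFor(uint32_t cpuHz, uint32_t prescaler,
                                   uint32_t outputHz) {
    return cpuHz / (2UL * prescaler * outputHz) - 1;
}

// Assumption: 16 MHz clock, no prescaler. 1 MHz out gives OCR1A = 7.
static_assert(compareValueFor(16000000UL, 1, 1000000UL) == 7,
              "1 MHz needs OCR1A = 7");
```

For example, with the same assumptions a 1 kHz output would need `compareValueFor(16000000UL, 1, 1000UL)`, which is 7999. Frequencies that don't divide evenly into the clock are truncated by the integer division, so the actual output would be slightly off.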