how to generate less than microsecond delay?

Nothing discussed actually creates a pulse tho.
To do that, you need to make an output pin go Hi & Lo. (or Lo and back Hi)

Usual fast way is Direct Port Manipulation (vs digitalWrite(pinX, HIGH); and digitalWrite(pinX, LOW); because the IDE adds some smarts to the writing to protect the user:

after setting the bits as an output with pinMode statements in setup.

PORTC = PORTC | B00000001; // sets bit 0, leave rest alone
// add your NOPs here if needed
PORTC = PORTC & B11111110; // clears bit 0, leave rest alone

I usually use it like this to load up shift registers, way quicker than shiftout ( )

PORTC = PORTC & B11111110; // clears bit 0, leave rest alone -- latchClock low
SPI.transfer(upperDataByte); // send byte to shift register
SPI.transfer(lowerDataByte); // send byte to shift register
PORTC = PORTC  |  B00000001; // sets bit 0, leave rest alone -- "latchClock High