Change shiftOut() Clock Speed

I've put the following code in my sketch to speed up the shiftOut() function. I'm using the typical MC74HC595 shift register, but the maximum clock frequency for my application is (conservatively) 4 MHz. If my code works, my clock will be considerably faster than that (see the thread linked below; I'm using Stimmer's code, which takes about 24 ns to write a pin high). I need a delay of, say, 250 ns between clock writes. So, can I modify an existing function (e.g. micros()) or write my own to get a resolution of 100 ns or better? Or is there a better way to accomplish what I want?

inline void digitalWriteDirect(int pin, boolean val){
  if(val) g_APinDescription[pin].pPort -> PIO_SODR = g_APinDescription[pin].ulPin;
  else    g_APinDescription[pin].pPort -> PIO_CODR = g_APinDescription[pin].ulPin;
}

inline int digitalReadDirect(int pin){
  return !!(g_APinDescription[pin].pPort -> PIO_PDSR & g_APinDescription[pin].ulPin);
}

//High speed shift register function
void shiftOutFast( uint32_t ulDataPin, uint32_t ulClockPin, uint32_t ulBitOrder, uint32_t ulVal )
{
  uint8_t i ;
  for ( i=0 ; i < 8 ; i++ ){
    if ( ulBitOrder == LSBFIRST ){
      digitalWriteDirect( ulDataPin, !!(ulVal & (1 << i)) );
    }
    else{
      digitalWriteDirect( ulDataPin, !!(ulVal & (1 << (7 - i))) );
    }
    digitalWriteDirect( ulClockPin, HIGH );
    // need to delay 200 ns
    digitalWriteDirect( ulClockPin, LOW );
    // need to delay 200 ns
  }
}

http://arduino.cc/forum/index.php/topic,129868.15.html

I think the best you can do is use SPI.transfer() instead of shiftOut().
There are some clock divisors you can use; the fastest is 1/2 the processor clock speed, or 125 ns per bit on a 16 MHz board.
~~http://arduino.cc/en/Reference/SPI~~
[Edit - right answer, wrong processor]

Here's a useful macro:

#define MNOP(x) asm volatile (" .rept " #x "\n\t nop \n\t .endr \n\t")

It outputs nop instructions; each nop wastes approximately 1 clock cycle, which is about 12 ns on the Due's 84 MHz clock. So use MNOP(16); for about 200 ns.

Beware that delays done this way aren't very deterministic: you might add a line somewhere completely different in the code and find that it changes the length of the delay!
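
To make the nop arithmetic concrete, here is a small sketch of the cycle-count calculation. It assumes the Due's 84 MHz core clock and exactly one cycle per nop, which is only approximately true. The helper names (cyclePicoseconds, nopsForDelayNs, kDueClockHz) are made up for illustration and are not part of any Arduino API.

```cpp
#include <cstdint>

// Hypothetical helper: length of one CPU cycle in picoseconds (rounded down).
constexpr uint32_t cyclePicoseconds(uint32_t cpuHz) {
  return static_cast<uint32_t>(1000000000000ULL / cpuHz);
}

// Hypothetical helper: smallest nop count lasting at least delayNs,
// assuming each nop costs exactly one CPU cycle (an approximation).
constexpr uint32_t nopsForDelayNs(uint32_t delayNs, uint32_t cpuHz) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(delayNs) * cpuHz + 999999999ULL) / 1000000000ULL);
}

constexpr uint32_t kDueClockHz = 84000000UL;  // assumed Due core clock

// Compile-time checks of the numbers discussed in the thread.
static_assert(cyclePicoseconds(kDueClockHz) == 11904, "about 12 ns per cycle");
static_assert(nopsForDelayNs(200, kDueClockHz) == 17, "200 ns rounds up to 17 nops");
static_assert(nopsForDelayNs(125, kDueClockHz) == 11, "125 ns rounds up to 11 nops");
```

MNOP(16) works out to roughly 190 ns, slightly under 200 ns; rounding up gives 17. For a 4 MHz clock (250 ns period) each half period is 125 ns, or 11 nops, before counting the time the pin writes themselves take. Note that MNOP stringizes its argument with #x, so the count has to be written as a literal number (MNOP(11);) rather than as a call to a helper like the one above.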

How is 1 clock cycle 12 ns when ...
Ah, never mind, didn't realize this was a Due question.

Thanks Stimmer. Saved me twice today! The digitalWriteDirect() is super handy and the MNOP is exactly what I needed.