I think your figure includes some of the loop time. I measured the duration of a digitalWrite HIGH followed by a digitalWrite LOW at 4.5 microseconds.
This picture shows the timing difference compared to direct port I/O, which is about 35 times faster.
The markers at the top are at 1 µs intervals.
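
For a sense of scale, here is a rough sketch that converts those numbers into CPU clock cycles. The 16 MHz clock is my assumption (it is common on AVR-based boards, but the board is not stated here), and the names `kAssumedClockMHz`, `usToCycles` and `directPortEstimateUs` are made up for illustration. It simply takes the 4.5 µs measurement and the roughly 35x ratio at face value.

```cpp
// ASSUMPTION: 16 MHz CPU clock. Change this to match the actual board.
constexpr double kAssumedClockMHz = 16.0;

// Values taken from the measurement described above.
constexpr double kDigitalWritePairUs = 4.5;  // digitalWrite HIGH then LOW
constexpr double kApproxSpeedup = 35.0;      // "about 35 times faster"

// Convert a duration in microseconds to CPU clock cycles.
// At f MHz the CPU executes f cycles per microsecond.
constexpr double usToCycles(double microseconds, double clockMHz) {
    return microseconds * clockMHz;
}

// Estimate the direct port I/O time implied by the measured ratio.
constexpr double directPortEstimateUs(double slowUs, double speedup) {
    return slowUs / speedup;
}
```

Under that assumption, `usToCycles(kDigitalWritePairUs, kAssumedClockMHz)` works out to 72 cycles for the digitalWrite pair, while `directPortEstimateUs(kDigitalWritePairUs, kApproxSpeedup)` is about 0.13 µs, or roughly 2 cycles.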