What exactly is this code supposed to accomplish? IMHO 4 microseconds = 64 cycles (i.e. a 16 MHz clock) is plenty of time. However, it would help a lot to understand what you are doing and what your constraints actually are. Judging by the comments, it looks like some kind of combustion engine control, but I could be wrong.
With regard to the timing: depending on how far you want to push it, even a latency of 1 microsecond = 16 cycles is not too hard to achieve. For one very specific project I once implemented something with a guaranteed latency of 8 cycles. Until it is clear what you really need, it remains unclear whether you need techniques similar to what I used or whether something much simpler would suffice (simpler == much easier to debug).
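To make the cycle budgets above easy to recompute, here is a minimal sketch of the arithmetic. It assumes a 16 MHz CPU clock (which is what 4 microseconds = 64 cycles implies); `CPU_HZ`, `us_to_cycles` and `cycles_to_ns` are names I made up for illustration, not anything from your code.

```c
#include <stdint.h>

/* Assumption: 16 MHz CPU clock, inferred from "4 microseconds = 64 cycles". */
#define CPU_HZ 16000000UL

/* Hypothetical helper: CPU cycles available in a given number of microseconds. */
static inline uint32_t us_to_cycles(uint32_t us)
{
    return us * (uint32_t)(CPU_HZ / 1000000UL);
}

/* Hypothetical helper: duration of a given number of CPU cycles in nanoseconds. */
static inline uint32_t cycles_to_ns(uint32_t cycles)
{
    return (cycles * 1000UL) / (uint32_t)(CPU_HZ / 1000000UL);
}
```

With these, `us_to_cycles(4)` gives the 64-cycle budget, `us_to_cycles(1)` gives 16, and `cycles_to_ns(8)` shows that the 8-cycle case corresponds to 500 ns. If your board runs at a different clock, change `CPU_HZ` and the budgets scale accordingly.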