Have I made this hardware SPI transfer as fast as possible?

In the code in the zip file you uploaded:

#define NOP __asm__ __volatile__ ("nop\n");
#define WAIT NOP NOP NOP NOP NOP NOP NOP NOP NOP NOP  // 11 NOPs

That is 10 NOPs, not 11, so the comment is wrong.
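
One way to keep a count like this from drifting out of sync is to build the delay from smaller pieces and let the compiler check the total. This is only a sketch, not what your code has to look like: the helper names `NOP_STR`, `NOP2_STR`, `NOP5_STR` and `NOP10_STR` are made up for illustration, and it assumes a C11 compiler with GCC-style inline assembly.

```c
/* Sketch only: the *_STR helper names are hypothetical, not from the original code. */
#define NOP_STR   "nop\n"
#define NOP2_STR  NOP_STR NOP_STR            /* 2 NOPs */
#define NOP5_STR  NOP2_STR NOP2_STR NOP_STR  /* 5 NOPs */
#define NOP10_STR NOP5_STR NOP5_STR          /* 10 NOPs */

/* One asm statement holding all ten instructions (adjacent string literals are concatenated). */
#define WAIT __asm__ __volatile__ (NOP10_STR)

/* Compile-time check: the string must be exactly ten "nop\n" pieces long. */
_Static_assert(sizeof(NOP10_STR) - 1 == 10 * (sizeof(NOP_STR) - 1),
               "WAIT must expand to exactly 10 NOPs");
```

Note that this `WAIT` has no trailing semicolon inside the macro, so it is used as `WAIT;`. If the file is compiled as C++ rather than C, `static_assert` would take the place of `_Static_assert`.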