This brings the Due into line with the Teensy 3.0 performance -- about twice the throughput at twice the clock speed.
But is the Teensy using DMA? If not, then almost all of the speedup comes from side-stepping the slow standard SPI library implementation on the Due, rather than from DMA per se.
In other words, if the standard SPI library were fixed for the Due, you might expect to see similar performance from non-DMA code. No?