Massive Parallel SPI

I don't think I can even get that amount of data through the FTDI.

While the serial connection itself is a bottleneck, I suspect the one-byte-at-a-time interrupt in the Arduino core will be the biggest problem.

I agree with @orangeLearner. A few Teensy boards should work.

In any case, you will want to size the outbound data so that it divides evenly into full USB frames. I believe partial frames are held for a few milliseconds before being sent (Nagle's algorithm meets USB).
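
For what it's worth, here is a rough host-side sketch of the padding idea in Python. The 64-byte size is my assumption (it is the maximum bulk packet size on full-speed USB; your adapter may differ), and `pad_to_full_packets` is just a name I made up for illustration. The receiving sketch would also need to know how many bytes are real data and ignore the filler.

```python
# Assumption: 64 bytes, the max bulk packet size on full-speed USB.
# Check what your adapter actually uses before relying on this.
USB_PACKET_SIZE = 64


def pad_to_full_packets(payload: bytes,
                        packet_size: int = USB_PACKET_SIZE,
                        fill: bytes = b"\x00") -> bytes:
    """Return payload padded with `fill` up to a multiple of packet_size."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    if len(fill) != 1:
        raise ValueError("fill must be exactly one byte")

    remainder = len(payload) % packet_size
    if remainder == 0:
        # Already a whole number of packets; nothing to add.
        return payload

    # Append just enough filler bytes to complete the last packet.
    return payload + fill * (packet_size - remainder)


if __name__ == "__main__":
    data = bytes(range(100))            # 100 bytes of example data
    padded = pad_to_full_packets(data)
    print(len(data), "->", len(padded))
```

You would then write `padded` to the serial port in one call instead of `data`, so the last packet goes out full rather than waiting on a timeout.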