Even with DMA the transfer times add up, unless there is more than one DMA channel.
It always adds up.
As far as I know, the only way to retrieve data in parallel is to use extra hardware. I mentioned an FPGA, but perhaps more than one Arduino board would work, or perhaps there is an SPI-to-parallel chip so the Arduino can read the data in parallel.
I think that with 5 MHz SPI, reading multiple sensors at a sample rate of about 1 kHz should work. Does retrieving all the data with the SPI code that you showed take more than 10% of the CPU time?
I don't know how many µs digitalWrite() takes on an Arduino Zero; perhaps you can optimize that.
I have come to the limit of my knowledge. If no one else has a better idea, try to continue with your SPI code.