Custom SD SPI driver: slow multiple block read speed?

Hello,

I'm developing a commercial project, so I had to write my own SD card driver. It's all working fine, but compared to available libraries like SdFat it's really slow: roughly half the speed of SdFat!

I'm running the code on a Due, and I'm using the multiple block read command. Everything works great, it's just slow.

What I'm trying to do is load an image onto a TFT LCD. Load time with SdFat is around 300 ms, and with my code it's around 500 ms. The crucial part of the code doesn't seem to contain anything that eats up processing time: since it's a multiple block read, it just keeps reading 512-byte blocks and pushing them to the TFT.

The SPI clock is set to 20 MHz after the SD SPI initialization, and I'm using the Due's hardware SPI port.
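For reference, here's a rough sketch of the lower bound the bus itself imposes at that clock. It assumes 8 clock cycles per byte with no gaps between bytes, and it ignores the wait for the 0xFE start token and the time spent pushing to the TFT. The helper name spiFloorMicros is made up for this example, and it's plain C++ so it can be checked on a PC rather than on the Due.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of any library): minimum time in
// microseconds needed to clock `bytes` bytes over SPI at `clockHz`,
// assuming 8 clock cycles per byte and no idle gaps between bytes.
double spiFloorMicros(std::uint32_t bytes, std::uint32_t clockHz) {
    assert(clockHz > 0);  // guard against division by zero
    return static_cast<double>(bytes) * 8.0 * 1e6 / static_cast<double>(clockHz);
}
```

Calling spiFloorMicros(514, 20000000) (512 data bytes plus 2 CRC bytes) gives about 205.6 us per block. Assuming, purely for illustration, a 320x240 16-bit image (153600 bytes, so 300 blocks), that adds up to roughly 62 ms of raw SD transfer time, well below both the 300 ms and the 500 ms load times.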

Here's the code, if I may share it:

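byte lcdBuffer[512];

for (int b = 0; b < numBlocks; b++) {
  // wait for the data start token (0xFE) that precedes each block
  while (SPI.transfer(CSPIN, 0xFF, SPI_CONTINUE) != 0xFE) {}
  // read the 512 data bytes of this block, one byte per call
  for (int i = 0; i < 512; i++) {
    lcdBuffer[i] = SPI.transfer(CSPIN, 0xFF, SPI_CONTINUE);
  }
  SPI.transfer(CSPIN, 0xFF, SPI_CONTINUE); // receive CRC byte 1
  SPI.transfer(CSPIN, 0xFF, SPI_CONTINUE); // receive CRC byte 2
  // 512 bytes = 256 16-bit pixels; first is declared earlier, outside this snippet
  tft.pushColors(lcdBuffer, 256, first);
  first = false;
}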

What could be so slow in this read-and-push-to-the-TFT part of the code?
Or do I have to dig deep into the DMA stuff? (scared)
Any ideas?
Thank you!