Faster SPI on the Zero?

I modified the raw function like so, so that it reads just one block from the card:
(Note the two f.read(buffer, 512) lines: the one before the loop is live, the one inside the loop is commented out.)

// This function blits a full screen, raw, 16 bit 565 RGB color image to the display from the SD card.
void rawFullSPI(char *filename) { 
  
  File f;
  uint8_t buffer[512]; // Buffer two full rows at a time - 512 bytes.  This is the same size as an SD card block.
  uint8_t *b, *bmax; // Pointers into the buffer.
  
  // Specify size of region to be drawn.

    tft.writeCommand(SSD1351_CMD_SETCOLUMN);
    tft.writeData(0);
    tft.writeData(127);
    
    tft.writeCommand(SSD1351_CMD_SETROW);
    tft.writeData(0);
    tft.writeData(127);


  // Draw bitmap.
    
    tft.writeCommand(SSD1351_CMD_WRITERAM); // Tell display we're going to send it image data in a moment. (Not sure if necessary.) 
    digitalWrite(my_dc, HIGH); // Set DATA/COMMAND pin to DATA.    
    
    f = SD.open(filename); // Open file for reading.
    f.read(buffer, 512);

    for (byte row = 0; row < 128; row+=2) { // 2.79FPS without SPI_FULL_SPEED in SPI.begin, 3.75FPS with it.
      
      //f.read(buffer, 512); // Read the next two rows from the card into the image buffer. 
      // 2.79FPS when doing this read. 6.42 FPS when not doing this read.  (2.3x as fast)    
      // With new block transfer optimization, 7.15 FPS when doing this read, and 20.18 FPS when not doing this read.
      // The reason the screen goes white when doing this is because the buffer we're using to transmit is also the receive buffer, so it is overwritten on the first go round.
      
      /*
      b = buffer;
      bmax = b+512; // Calculate when we should stop and read the next two rows.   
      
      digitalWrite(my_cs, LOW); // Tell display to pay attention to the incoming data.
             
      while (b < bmax) { // Write both rows to the display.
        SPI.transfer(*b); // Write the next byte.
        b++;
      }
      */

      // Moving all the extra stuff here outside the for loop and getting rid of SD reads gives 24.75 FPS, which is still slower than expected. 24.23 FPS with these in the loop.
      // Skipping the file opening and closing for three different images by using a loop inside this function does not improve performance much.  Still 24.71 FPS.
      // Unrolling the transfer loop didn't seem to improve things at all.  
         
      digitalWrite(my_cs, LOW); // Tell display to pay attention to the incoming data.

      //noInterrupts(); // 7.65 -> 7.70 FPS
      SPI.beginTransaction(SPISettings(12000000, MSBFIRST, SPI_MODE0)); 
      SPI.transfer(buffer, 512); 
      SPI.endTransaction();
      //interrupts();
                 
      digitalWrite(my_cs, HIGH); // Tell display we're done talking to it for now, so the next SD read doesn't corrupt the screen.
      
    }  
  
    f.close(); // Close the file.
        
}

With that change I get over 24 fps, but of course the images don't display properly. They don't need to, though; I'm just testing how fast I can output a full screen of data.

But even though 24 fps seems fast, it's only about half of what the bus should be capable of at 12 MHz, and I can't for the life of me figure out why. I've checked everything I can think of. I even went so far as to make sure the sercom lib was calculating the baud rate correctly. And unlike the AVR, the SPI here has no 2x multiplier bit, so that can't be set wrong either.