Faster SPI on the Zero?

So in testing out the SPI library on the Zero with my OLED display I've noticed it's slow. Really slow.

When blitting full-screen images on an ATmega328 running at 16MHz, with a highly optimized SPI routine (direct register access) transmitting the data as fast as possible (8MHz), I was getting almost 9 FPS.

On the Zero, with its 48MHz clock speed and an SPI bus capable of running at 12MHz, but using the built-in SPI library block transfer function, I'm getting a whole 3.75 FPS.

So rather than being 1.5x faster, it's 2.4x slower. :frowning:

Now in investigating this, the first issue I found was that using SPI.setClockDivider() no longer works. I guess we have to use SPI.beginTransaction now. Perhaps that is because the SD library now uses that and it messes with it.

I also noticed that the SD library, for reasons that seem questionable (because people might connect their SD cards with long wires) forces half-speed on you, with no way to override it without editing the library. And of course those edits will inevitably be reverted when you upgrade the IDE, and most people won't even know to make that change if they're not getting the performance they expect.

But anyway, I changed that to default to SPI_FULL_SPEED, and I put this code into my demo:
SPI.beginTransaction(SPISettings(12000000, MSBFIRST, SPI_MODE0));

These two changes increased the speed to what I quoted above. Not exactly lightning quick.

I did notice however that when I removed the SD card streaming from the equation, the blitting sped up 3x. So for some reason, streaming the data from the SD card is 2x as slow as writing the same amount of data to the screen. I believe this may have been the case with the 328P as well. I don't know why it's the case, and I can't see any reason it should be the case, but I'm just putting that out there.

I know the data has to be copied to an intermediate 512 byte buffer when it's read from the card, and that data is then copied to the buffer I supply the library with, but seeing as the Zero runs at 48MHz, it doesn't seem like that extra copy operation should double the time it takes to stream the data off the card.

But anyway, back to the SPI library...

I have not made any progress here yet myself, but I thought I'd post what information I have now so if anyone has any suggestions they can provide them and I have all the info in one spot.

So, first, we have this function, which is the one that needs to be optimized:

void SPIClass::transfer(void *buf, size_t count)
{
  // TODO: Optimize for faster block-transfer
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);
  for (size_t i=0; i<count; i++)
    buffer[i] = transfer(buffer[i]);
}

That calls this function:

byte SPIClass::transfer(uint8_t data)
{
  // Writing the data
  _p_sercom->writeDataSPI(data);

  // Read data
  return _p_sercom->readDataSPI() & 0xFF;
}

And that calls these functions:

void SERCOM::writeDataSPI(uint8_t data)
{
  while( sercom->SPI.INTFLAG.bit.DRE == 0 )
  {
    // Waiting Data Registry Empty
  }

  sercom->SPI.DATA.bit.DATA = data; // Writing data into Data register

  while( sercom->SPI.INTFLAG.bit.TXC == 0 || sercom->SPI.INTFLAG.bit.DRE == 0 )
  {
    // Waiting Complete Transmission
  }
}

uint16_t SERCOM::readDataSPI()
{
  while( sercom->SPI.INTFLAG.bit.DRE == 0 || sercom->SPI.INTFLAG.bit.RXC == 0 )
  {
    // Waiting Complete Reception
  }

  return sercom->SPI.DATA.bit.DATA;  // Reading data
}

It took me a while to track it all down across the various directories: some of it is in the hidden Arduino15 user directory, and some is in the libraries directory under the main Arduino folder (not to be confused with the libraries directory in your documents). Anyway, I just linked to the GitHub repository to make things easier for everyone. That's the newest version of the code anyway. Plus I get syntax highlighting, unlike when I open the code in WordPad because Atmel Studio takes two minutes to open. :stuck_out_tongue:

But anyway... that code looks fairly simple. It shouldn't be too difficult to optimize it, by doing something along the lines of what we achieved here:

http://forum.arduino.cc/index.php?topic=129824.0

I'm not suggesting we add NOPs to the code of course. And I don't think it's necessary, since the Zero runs at 48MHz while its SPI can only run at 12MHz. There should be sufficient spare time for the extra comparison and jump not to affect the final speed.

Just something along the lines of this:

 SPDR = *thisLED--; // Initiate first byte transfer and decrement address of *thisLED.

   do {        
     while (!(SPSR & _BV(SPIF))); SPDR = *thisLED--; // Wait for transfer of byte over SPI bus to complete, then transfer *thisLED into SPDR register, and decrement address of *thisLED.
   } while (thisLED != lastLED);
   
   while (!(SPSR & _BV(SPIF))); // Wait for last byte to finish transfer.

I'm not sure it will be that straightforward though, since I have been doing a bit of googling and I saw something about the SPI having a double buffer on the receiver. But maybe that was just an issue when trying to write only when not reading.

Anyway, I've said all I've got to say for now. Now I'm gonna figure out exactly what that code is doing and see if it's possible to interleave the reads and the writes first of all. Then if that helps I'll look into using pointers to speed up the buffer access. Ultimately I hope to speed this up around 4.5X from what I'm getting right now. Hopefully that speedup will carry over to the SD library as well if it's using the transfer function, which I haven't checked. That would give an even greater speedup.

Did you know about these thoughts on SPI with DMA?
http://forum.arduino.cc/index.php?topic=344029.msg2371173#msg2371173

I assume with DMA you could do SPI writes in the background, which would free up a lot of CPU time, though I don't think it would directly speed up SPI transfers much, and it would probably be more difficult to work with since you then probably need to deal with race conditions where you start a transfer, do something else, then begin another transfer while the first is still completing. Adding that might require major changes to the SPI library. The change I'm proposing should be fairly simple, just requiring that one transfer function to be replaced.

Thanks for pointing that out though, I'll have a look at it.

So, I've got something that seems to work up and running:

I added the following to SERCOM.h and SERCOM.cpp, right after the readDataSPI() functions in both:

// SERCOM.h (declaration)
void transferDataSPI(void *buf, uint32_t count);

// SERCOM.cpp (definition)
void SERCOM::transferDataSPI(void *buf, uint32_t count)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);

  for (size_t i=0; i<count; i++) {
    sercom->SPI.DATA.bit.DATA = *buffer;            // Initiate byte transfer.
    while(sercom->SPI.INTFLAG.bit.RXC == 0);        // Wait for data to be available in the receive buffer.
    *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF;   // Read received byte, then increment pointer into buffer.
  }
}

With that code, displaying the images went from 3.75 FPS to 4.8 FPS.

If I remove the code that reads the SD card however, I get 19.69 FPS.

Since approximately the same amount of data should be transferred between screen and SD card, if the SD library were operating at 100% I should be getting 9.8 FPS with the block transfer code as currently written, though I expect I may see some performance improvement once I change that for loop to pointer arithmetic.
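As a rough check on that estimate, here is a minimal sketch of the arithmetic, assuming the SD read and the display write run strictly one after the other so that their per-frame times simply add. The function and constant names are made up for illustration; the FPS figures are the ones measured above.

```cpp
// Frame-time model for two stages that run back to back
// (SD read, then display write). Assumption: the stages never overlap.

// Combined frame rate when every frame pays for both stages in sequence.
constexpr double combinedFps(double stageAFps, double stageBFps) {
  return 1.0 / (1.0 / stageAFps + 1.0 / stageBFps);
}

// Frame rate the unknown stage would reach on its own, given the measured
// total and the measured rate of the other stage.
constexpr double impliedStageFps(double totalFps, double knownStageFps) {
  return 1.0 / (1.0 / totalFps - 1.0 / knownStageFps);
}

constexpr double kDisplayOnlyFps = 19.69;  // measured without the SD read
constexpr double kWithSdFps = 4.8;         // measured with SD read + display write

// If the SD read were exactly as fast as the display write: about 9.8 FPS.
constexpr double kBestCaseFps = combinedFps(kDisplayOnlyFps, kDisplayOnlyFps);

// What the SD read alone appears to manage: about 6.3 FPS.
constexpr double kSdAloneFps = impliedStageFps(kWithSdFps, kDisplayOnlyFps);
```

Under that assumption the SD read accounts for roughly three times as much of each frame as the display write does.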

I believe the reason the SD lib is so slow is that it's not using the block transfer function. If it were, then back when I had the block transfer code screwed up, the non-block-transfer BMP test code that I run before the fast raw bitmap benchmark would not have worked; but it did.

Very small optimization (zero experience with the Zero):

void SERCOM::transferDataSPI(void *buf, uint32_t count)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);
  while(count-- > 0)
  {
    sercom->SPI.DATA.bit.DATA = *buffer;            // Initiate byte transfer. 
    while (sercom->SPI.INTFLAG.bit.RXC == 0);       // Wait for data to be available in the receive buffer.
    *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF;   // Read received byte, then increment pointer into buffer. 
  } 
}

It removes the i from the loop, and comparing with zero is faster than comparing with a non-zero value.
The footprint might be smaller too.

Thanks!

It doesn't appear to have changed the speed at all, but it makes the code cleaner, so I'll go with that. I tried storing an end pointer as well but that just made the code messier and also didn't improve the speed any.

Most of the speed losses seem to be in the SD library. I think maybe the default SPI functions got slower, which slowed that down. But it looks like it will be fairly straightforward to implement the block transfers in its block read function. That should give a major speed boost.

This is very interesting.
Let us know if you find out more about that.

The SERCOM works fine. Albeit with a 12MHz maximum SCK.

The SPI library is what slows things down. An individual SPI.transfer() ends up with a lot of overhead. i.e. gaps between bytes.
There should be overloaded methods for SPI.transfer() of a block. And if that is written efficiently, it will make a huge difference to SD.

Look at Rob's example. Mind you, this can be made faster by rearranging the loop.

Alternatively, if the SD uses DMA, you end up with the best of both worlds.

David.

The SERCOM works fine. Albeit with a 12MHz maximum SCK.

The SPI library is what slows things down. An individual SPI.transfer() ends up with a lot of overhead. i.e. gaps between bytes.

I don't see how you can say the SERCOM works fine and separate the two like that. The SPI library relies heavily on the SERCOM library, and the reason the SPI library has so much overhead is that SERCOM has no built-in block transfer function, so the SPI transfer function has to call the write and read functions repeatedly.

I mean, one could write the SPI library to do the SPI setup and transfers on its own and never touch the SERCOM library, but it seems to be written so that everything else is done in SERCOM. The SPI library now operates on transactions, so there are a lot of variables behind the scenes to keep track of that and the state of the SPI, and to handle SPI during interrupts as well. Trying to figure out how to move all the code into the SPI library without messing something else up was making my head spin, so I just went along with it. I figured it had a better chance of becoming an official change this way as well.

Look at Rob's example. Mind you, this can be made faster by rearranging the loop.

Uh, Rob's example is my code with one minor change to how the loop is incremented. :slight_smile:

There should be overloaded methods for SPI.transfer() of a block. And if that is written efficiently, it will make a huge difference to SD.

It won't, because the SD library as written does not call SPI.transfer() with the number of bytes to transfer, it just calls it in a loop. But I'll be fixing that today and will let you know how it goes.

So I just wanted to throw out there that when looking into the SD card library I noticed it seems to be missing a lot of SPI.beginTransaction() / SPI.endTransaction() calls. This is probably a bug, since it's my understanding you now need to put those around each block of SPI transfers so that SPI transactions in interrupts can work properly.

The main effect this would have on most people though is that if you change the speed with beginTransaction() somewhere in your program after the SD library is initialized, it will change the speed at which SD communication occurs. And it would also mess things up if your code uses a different SPI mode. But that's not anything new really, since that's how it's been since before these new commands were introduced.
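One way to make sure every begin is matched by an end, even on early returns, is a small scope guard. This is only a sketch: TransactionGuard, FakeBus, FakeSettings and readBlock are hypothetical names for illustration and are not part of the Arduino SPI or SD libraries. The fakes are there so the example compiles off-target; on the board the intent is that Bus would be SPIClass and Settings would be SPISettings, using their existing beginTransaction() and endTransaction() calls.

```cpp
#include <cassert>

// Hypothetical helper: opens a transaction on construction and closes it on
// destruction, so every exit path from a scope releases the bus.
template <typename Bus, typename Settings>
class TransactionGuard {
 public:
  TransactionGuard(Bus &bus, const Settings &settings) : bus_(bus) {
    bus_.beginTransaction(settings);
  }
  ~TransactionGuard() { bus_.endTransaction(); }
  TransactionGuard(const TransactionGuard &) = delete;
  TransactionGuard &operator=(const TransactionGuard &) = delete;

 private:
  Bus &bus_;
};

// Host-side stand-ins (assumptions, for testing the pattern only).
struct FakeSettings {
  unsigned long clockHz;
};

struct FakeBus {
  int depth = 0;              // 1 while a transaction is open
  int completed = 0;          // number of finished transactions
  unsigned long clockHz = 0;  // clock requested by the last transaction

  void beginTransaction(const FakeSettings &s) {
    assert(depth == 0 && "nested transaction");
    ++depth;
    clockHz = s.clockHz;
  }
  void endTransaction() {
    assert(depth == 1 && "endTransaction without beginTransaction");
    --depth;
    ++completed;
  }
};

// Example: the early return still closes the transaction.
inline bool readBlock(FakeBus &bus, bool cardReady) {
  TransactionGuard<FakeBus, FakeSettings> guard(bus, FakeSettings{12000000UL});
  if (!cardReady) return false;  // guard's destructor runs here too
  // ... block transfers would go here ...
  return true;
}
```

Because each transaction re-applies its own settings, a speed or mode chosen for the display would no longer leak into the SD transfers.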

Anyway, I'm having a little trouble getting my changes to work. I kept it simple and just sped up full block reads, using the original slow method to handle the start and end of partial blocks (because right now I don't have a function to do block SPI transfers that simply toss the data away), but something's wrong because my images aren't displaying. I'm not sure if that's because my code isn't leaving the SPI bus in the right state, or because the data being read isn't being put into the buffer though.

I commented where I added new lines:

//------------------------------------------------------------------------------
/**
 * Read part of a 512 byte block from an SD card.
 *
 * \param[in] block Logical block to be read.
 * \param[in] offset Number of bytes to skip at start of block
 * \param[out] dst Pointer to the location that will receive the data.
 * \param[in] count Number of bytes to read
 * \return The value one, true, is returned for success and
 * the value zero, false, is returned for failure.
 */
uint8_t Sd2Card::readData(uint32_t block,
        uint16_t offset, uint16_t count, uint8_t* dst) {
  uint16_t n;
  if (count == 0) return true;
  if ((count + offset) > 512) {
    goto fail;
  }
  if (!inBlock_ || block != block_ || offset < offset_) {
    block_ = block;
    // use address if not SDHC card
    if (type()!= SD_CARD_TYPE_SDHC) block <<= 9;
    if (cardCommand(CMD17, block)) {
      error(SD_CARD_ERROR_CMD17);
      goto fail;
    }
    if (!waitStartBlock()) {
      goto fail;
    }
    offset_ = 0;
    inBlock_ = 1;
  }

#ifdef OPTIMIZE_HARDWARE_SPI
  // start first spi transfer
  SPDR = 0XFF;

  // skip data before offset
  for (;offset_ < offset; offset_++) {
    while (!(SPSR & (1 << SPIF)))
      ;
    SPDR = 0XFF;
  }
  // transfer data
  n = count - 1;
  for (uint16_t i = 0; i < n; i++) {
    while (!(SPSR & (1 << SPIF)))
      ;
    dst[i] = SPDR;
    SPDR = 0XFF;
  }
  // wait for last byte
  while (!(SPSR & (1 << SPIF)))
    ;
  dst[n] = SPDR;

#else  // OPTIMIZE_HARDWARE_SPI

  SPI.beginTransaction(settings); // *** NEW ***

  // skip data before offset
  for (;offset_ < offset; offset_++) {
    spiRec();
  }

  // transfer data
  //for (uint16_t i = 0; i < count; i++) {
  //  dst[i] = spiRec();
  //}

  SPI.transfer(dst, count); // *** NEW ***

#endif  // OPTIMIZE_HARDWARE_SPI

  offset_ += count;
  if (!partialBlockRead_ || offset_ >= 512) {
    // read rest of data, checksum and set chip select high
    readEnd();
  }

#ifndef OPTIMIZE_HARDWARE_SPI
  SPI.endTransaction(); // *** NEW ***
#endif  // OPTIMIZE_HARDWARE_SPI

  return true;

 fail:
  chipSelectHigh();
  return false;
}

I added some checks before and after the block transfer loop in SERCOM to make sure the transfer was completing correctly, but to no avail.

while( sercom->SPI.INTFLAG.bit.DRE == 0 )
{
    // Waiting for Data Register Empty
}

Ah, damn I just realized what the problem is. I think.

I noticed the spiRec() function in the SD library is sending 0xFF to get data back. I seem to recall it's required to transmit that to get data back from the card, but I've just been sending it whatever data is already in the buffer, thinking it just needed clocks to send data back.

So I've gotta roll a new function for SERCOM to send a fixed value but return the data in the specified buffer.

Yup, that was the problem all right!

So right now, here are the benchmarks:

Default libraries, drawing 128x128 16-bit images:
BMPs - 0.46 FPS
RAW - 2.79 FPS

After changing SPI_HALF_SPEED to SPI_FULL_SPEED in the SD library:
BMPs - 0.56 FPS
RAW - 3.75 FPS

After adding a block transfer function to the SERCOM + SPI libraries:
BMPs - 1.06 FPS
RAW - 4.69 FPS

After adding full block transfer acceleration to the SD library:
BMPs - 1.46 FPS
RAW - 7.15 FPS

And when drawing RAW images without streaming them from the SD card, I get 20.18 FPS.

It should be noted that when transferring 128x128x16 bit images, I should theoretically be getting 12,000,000 / 262,144 = 45.78 FPS.

Or am I mistaken? Half that is 22.89 FPS, which is extremely close to what I am getting. Is the 12MHz max SPI bus speed the number of times per second the pin can change, in which case it can only transmit half that many bits per second, or is it the number of full high/low cycles, meaning the pin changes 24M times a second and 12M bits can be transferred? I'm not sure.
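For reference, SPI shifts one bit per SCK cycle, so a 12MHz clock corresponds to at most 12Mbit/s, and a figure of about 45.8 FPS is the ceiling for bytes sent back to back with no gaps. The sketch below redoes that calculation and also converts a measured frame rate into CPU cycles spent per byte. The constant and function names are illustrative; the 48MHz CPU clock, the 12MHz SCK and the 20.18 FPS measurement come from the posts above.

```cpp
#include <cstdint>

// Assumption: one bit is transferred per SCK cycle and bytes are sent with
// no gaps between them.
constexpr std::uint32_t kSckHz = 12000000;
constexpr std::uint32_t kCpuHz = 48000000;
constexpr std::uint32_t kBitsPerFrame = 128u * 128u * 16u;   // 262144 bits
constexpr std::uint32_t kBytesPerFrame = kBitsPerFrame / 8;  // 32768 bytes

// Highest frame rate the wire can carry.
constexpr double maxFps(std::uint32_t sckHz, std::uint32_t bitsPerFrame) {
  return static_cast<double>(sckHz) / bitsPerFrame;
}

// CPU cycles that elapse per transferred byte at a given frame rate.
constexpr double cpuCyclesPerByte(double fps) {
  return kCpuHz / (fps * kBytesPerFrame);
}

constexpr double kWireLimitFps = maxFps(kSckHz, kBitsPerFrame);         // about 45.8
constexpr double kCyclesAtWireSpeed = cpuCyclesPerByte(kWireLimitFps);  // 32
constexpr double kCyclesMeasured = cpuCyclesPerByte(20.18);             // about 72.6
```

On those numbers the measured 20.18 FPS works out to roughly 73 CPU cycles per byte against 32 at wire speed, which would point at gaps between bytes rather than at the clock setting.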

There's still one last optimization I need to make and that's support for accelerating partial block transfers in the SD lib. That will require yet another transfer function so I can transmit X bytes, while discarding those received. That'll be easy enough to implement. I don't know if it's going to have any effect on the speed though because I don't actually know if any partial block transfers are going on here.
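A minimal sketch of what such a discard function might look like, written against a host-side mock so it can be tried off-target. MockSpiPort and skipDataSPI are hypothetical names, not part of the SERCOM library. The mock assumes, as I read the SAMD21 datasheet, that RXC is set when a byte has been received and cleared by reading DATA, which is why the loop still reads DATA even though the value is thrown away.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mock of the few SERCOM SPI operations the transfer loops use.
// Assumption: RXC is set on reception and cleared by reading DATA.
struct MockSpiPort {
  bool rxc = false;
  std::uint8_t rxData = 0;
  std::size_t bytesSent = 0;
  std::uint8_t lastSent = 0;

  void writeData(std::uint8_t value) {
    assert(!rxc && "previous byte was never read");
    lastSent = value;
    ++bytesSent;
    rxData = static_cast<std::uint8_t>(bytesSent);  // dummy reply from the "card"
    rxc = true;                                     // mock completes instantly
  }
  bool rxComplete() const { return rxc; }
  std::uint8_t readData() {
    assert(rxc && "read with nothing received");
    rxc = false;
    return rxData;
  }
};

// Hypothetical skipDataSPI(): clock out `count` copies of `fill` and discard
// whatever comes back.
template <typename Port>
void skipDataSPI(Port &port, std::size_t count, std::uint8_t fill = 0xFF) {
  while (count-- > 0) {
    port.writeData(fill);  // Initiate byte transfer.
    while (!port.rxComplete()) {
      // Wait for the byte to finish shifting.
    }
    (void)port.readData();  // Discard the reply; the read clears RXC.
  }
}
```

On the real peripheral the three port calls would map onto the DATA write, the INTFLAG.bit.RXC poll and the DATA read used in the existing transferDataSPI() loops.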

Anyway, here are the two functions I have so far:

void SERCOM::transferDataSPI(void *buf, uint32_t count)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);

  while(count-- > 0) {
     sercom->SPI.DATA.bit.DATA = *buffer; // Initiate byte transfer.
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
  }

}


void SERCOM::transferDataSPI(void *buf, uint32_t count, uint8_t transmit)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);

  while(count-- > 0) {
     sercom->SPI.DATA.bit.DATA = transmit; // Initiate byte transfer.
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
  }

}

Hi Shawn,

Good work on these optimizations!

Also, even though I think you already found it, there is a delayMicroseconds(1) in the original Adafruit library for the OLED display that could be removed without causing any issue. You can then simply replace the "spitransfer" calls with "SPI.transfer".

I've been able to go from 16 FPS to 23 FPS while running my 3D Vector demo with those simple modifications coupled with SPI.setClockDivider(4).

Could you please provide your source code so I can test it on my similar setup? :slight_smile:

Well, after adding the partial block read upgrade, I'm not seeing any improvement, so that optimization was a bust. I guess I'll remove that bit from the SD lib and just leave the full block upgrade in place since I'm not 100% certain about the values I plugged in for the block sizes and offsets on the partial block stuff and I'd rather not have strange bugs pop up later.

Not sure where to go from here. I'm not super-strong at C++ so I don't know if maybe there's some overhead with the SPI library function wrapping the sercom library function, and I don't know if I could safely move the code from the sercom library function into the SPI library.

I'm also not certain the SD library is now actually running at full speed despite my change to have it default to SPI_FULL_SPEED.

AloyseTech:
Hi Shawn,

Good work on these optimizations!

Also, even though I think you already found it, there is a delayMicroseconds(1) in the original Adafruit library for the OLED display that could be removed without causing any issue. You can then simply replace the "spitransfer" calls with "SPI.transfer".

Heh, the delayMicroseconds() was the first thing to go. That also only would affect the BMP drawing code, if I were in fact using drawPixel as they did in the original example, which was also one of the first things I got rid of.

Here's the code I'm using for the raw image drawing:

// This function blits a full screen, raw, 16 bit 565 RGB color image to the display from the SD card.
void rawFullSPI(char *filename) { 
  
  File f;
  uint8_t buffer[512]; // Buffer two full rows at a time - 512 bytes.  This is the same size as an SD card block.
  uint8_t *b, *bmax; // Pointers into the buffer.
  
  // Specify size of region to be drawn.

    tft.writeCommand(SSD1351_CMD_SETCOLUMN);
    tft.writeData(0);
    tft.writeData(127);
    
    tft.writeCommand(SSD1351_CMD_SETROW);
    tft.writeData(0);
    tft.writeData(127);


  // Draw bitmap.
    
    tft.writeCommand(SSD1351_CMD_WRITERAM); // Tell display we're going to send it image data in a moment. (Not sure if necessary.) 
    digitalWrite(my_dc, HIGH); // Set DATA/COMMAND pin to DATA.    
    
    f = SD.open(filename); // Open file for reading.
    //f.read(buffer, 512);
    
    for (byte row = 0; row < 128; row+=2) { // 2.79FPS without SPI_FULL_SPEED in SPI.begin, 3.75FPS with it.
      
      f.read(buffer, 512); // Read the next two rows from the card into the image buffer. 
      // 2.79FPS when doing this read. 6.42 FPS when not doing this read.  (2.3x as fast)    
      // With new block transfer optimization, 7.15 FPS when doing this read, and 20.18 FPS when not doing this read.
      // The reason the screen goes white when doing this is because the buffer we're using to transmit is also the receive buffer, so it is overwritten on the first go round.
      
      /*
      b = buffer;
      bmax = b+512; // Calculate when we should stop and read the next two rows.   
      
      digitalWrite(my_cs, LOW); // Tell display to pay attention to the incoming data.
             
      while (b < bmax) { // Write both rows to the display.
        SPI.transfer(*b); // Write low byte.
        b++;
      }
      */

      digitalWrite(my_cs, LOW); // Tell display to pay attention to the incoming data.

      SPI.beginTransaction(SPISettings(12000000, MSBFIRST, SPI_MODE0)); // Adding this boosts speed to over 9FPS when not reading from SD card.  So reading from SD card is 2x as slow as this?
      SPI.transfer(buffer, 512);  
      SPI.endTransaction();
                 
      digitalWrite(my_cs, HIGH); // Tell display we're done talking to it for now, so the next SD read doesn't corrupt the screen.
      
    }  
    
    f.close(); // Close the file.
        
}

I've been able to go from 16 FPS to 23 FPS while running my 3D Vector demo with those simple modifications coupled with SPI.setClockDivider(4).

Could you please provide your source code so I can test it on my similar setup? :slight_smile:

I'll zip everything up as it stands right now so you can try it out for yourself and post that shortly.

As for SPI.setClockDivider(4) I found that no longer works reliably, because for example, the SD library changes the settings with SPI.beginTransaction() and they aren't reverted after. You're supposed to use beginTransaction() before you start sending data now so you can use multiple SPI devices with different data rates and such.

Here ya go, all the modified source files, plus the demo code, the bitmaps, and the raw images:
http://rabidprototypes.com/wp-content/uploads/2015/11/pixel_speedtest.zip

Note that I'm not really trying to optimize the BMP drawing code here, just the RAW images. There's a ton of optimization that could be done to that BMP reader I'm sure, but what I'm concerned with is getting SPI block transfers and SD card reading as fast as possible.

I found one more small optimization, which increases the speed of RAW image reading to 7.36 FPS, but it breaks the correct behavior of the SPI library: instead of the buffer you passed to the transfer function ending up with the received data in it, I simply discard the data:

// This function transmits a buffer but discards the received data.
void SERCOM::transferDataSPI(void *buf, uint32_t count)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);

  while(count-- > 0) {
	//sercom->SPI.DATA.bit.DATA = *buffer; // Initiate byte transfer.
     sercom->SPI.DATA.bit.DATA = *buffer++; // Initiate byte transfer.
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     //*buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
  }

}

Also I checked whether that & 0xFF is needed. Seems it is not, but it also doesn't affect the speed if removed.

I think I'm gonna try to modify the loop so that the byte transfer is initiated just before the jump instead of jumping after reading the received byte. The processor is probably fast enough that this optimization won't have much if any effect, but it helped a lot on the AVR.

Thanks for the code, I'll try that ASAP.

Have you tried replacing each "spiwrite(c)" with "SPI.transfer(c)" in SSD1351.cpp? Since we use hardware SPI, the test on _sid is not necessary, I guess, so we always use the SPI.transfer method anyway. It may not speed up the blitting from SD, but it could help speed things up with the standard graphics functions.

Off topic: what tool do you use to convert a BMP or any image into a raw 16-bit file?

Huh, so by rearranging the loop so the jump happens while the byte transfer is happening, I get that same speed boost to 7.33 FPS without discarding the received data:

void SERCOM::transferDataSPI(void *buf, uint32_t count)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);

/*
  while(count-- > 0) {
     sercom->SPI.DATA.bit.DATA = *buffer; // Initiate byte transfer.
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
  }
*/

  if (count == 0) return; // Nothing to transfer. Without this check the code below would still send and store one byte.

  sercom->SPI.DATA.bit.DATA = *buffer; // Initiate first byte transfer.

  while(count-- > 1) {
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
     sercom->SPI.DATA.bit.DATA = *buffer; // Initiate next byte transfer, so the jump happens while it is in flight.
  }

  while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for the last byte to be available in the receive buffer.
  *buffer = sercom->SPI.DATA.bit.DATA & 0xFF; // Read the last received byte.

}
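The loop rearrangement can be checked off-target with a small mock. This is a sketch under assumptions: LoopbackPort and pipelinedTransfer are hypothetical names, the mock answers every byte with its bitwise inverse so that sent and received data are easy to tell apart, and it assumes RXC is cleared by reading DATA. The sketch returns early for count == 0 so that an empty transfer never touches the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mock: each written byte is "answered" with its bitwise inverse.
// Assumption: RXC is set once a byte arrives and cleared by reading DATA.
struct LoopbackPort {
  bool rxc = false;
  std::uint8_t rxData = 0;

  void writeData(std::uint8_t value) {
    assert(!rxc && "wrote before reading the previous byte");
    rxData = static_cast<std::uint8_t>(~value);
    rxc = true;  // mock completes instantly
  }
  bool rxComplete() const { return rxc; }
  std::uint8_t readData() {
    assert(rxc && "read with nothing received");
    rxc = false;
    return rxData;
  }
};

// Same loop shape as the rearranged transferDataSPI(), written against the mock.
template <typename Port>
void pipelinedTransfer(Port &port, void *buf, std::size_t count) {
  if (count == 0) return;  // nothing to send, so do not touch buf[0]
  std::uint8_t *p = static_cast<std::uint8_t *>(buf);

  port.writeData(*p);  // start the first byte
  while (count-- > 1) {
    while (!port.rxComplete()) {
      // Wait for the reply to the byte in flight.
    }
    *p++ = port.readData();  // store the reply over the byte just sent
    port.writeData(*p);      // start the next byte before looping
  }
  while (!port.rxComplete()) {
    // Wait for the reply to the final byte.
  }
  *p = port.readData();
}
```

Each buffer slot is overwritten only after its original value has already been sent, so the in-place exchange behaves the same as the simple one-byte-at-a-time loop.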

And if I discard the received bytes, which again, is not compatible with how the SPI.transfer() function is supposed to work I get 7.58 FPS:

// This function transmits a buffer but discards the received data.
void SERCOM::transferDataSPI(void *buf, uint32_t count)
{
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);

/*
  while(count-- > 0) {
     sercom->SPI.DATA.bit.DATA = *buffer; // Initiate byte transfer.
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     *buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
  }
*/

  if (count == 0) return; // Nothing to transfer. Without this check the code below would still send one byte.

  sercom->SPI.DATA.bit.DATA = *buffer++; // Initiate first byte transfer.

  while(count-- > 1) {
     while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
     //*buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; // Read received byte, then increment pointer into buffer.
     sercom->SPI.DATA.bit.DATA = *buffer++; // Initiate byte transfer.
  }

  while(sercom->SPI.INTFLAG.bit.RXC == 0); // Wait for data to be available in the receive buffer.
  //*buffer++ = sercom->SPI.DATA.bit.DATA & 0xFF; ; // Read received byte, then increment pointer into buffer.

}

Of course, part of the reason I'm not seeing as big a speed increase as I expect is the SD card reading is slower. When I remove that from the equation, I get 24.23 FPS, which is a full 4 FPS faster than the previous best. But it's also still half as fast as what I should be getting if my calculations are correct, so I'm not sure what's going on there.