So in testing out the SPI library on the Zero with my OLED display I've noticed it's slow. Really slow.
When blitting full-screen images on an ATmega328 running at 16MHz, with a highly optimized SPI routine (direct register access) transmitting the data as fast as possible at 8MHz, I was getting almost 9 FPS.
On the Zero, with its 48MHz clock speed and an SPI bus capable of running at 12MHz, but using the built-in SPI library block transfer function, I'm getting a whole 3.75 FPS.
So rather than being 1.5x faster (12MHz vs 8MHz on the bus), it's roughly 2.4x slower (9 FPS vs 3.75 FPS).
Now in investigating this, the first issue I found was that SPI.setClockDivider() no longer works. I guess we have to use SPI.beginTransaction() now. Perhaps that's because the SD library now uses transactions itself and overrides whatever divider was set.
I also noticed that the SD library, for reasons that seem questionable (because people might connect their SD cards with long wires) forces half-speed on you, with no way to override it without editing the library. And of course those edits will inevitably be reverted when you upgrade the IDE, and most people won't even know to make that change if they're not getting the performance they expect.
But anyway, I changed that to default to SPI_FULL_SPEED, and I put this code into my demo:
SPI.beginTransaction(SPISettings(12000000, MSBFIRST, SPI_MODE0));
These two changes increased the speed to what I quoted above. Not exactly lightning quick.
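For anyone following along, here is a minimal sketch of how that call might wrap a block transfer. blitFrame is a made-up helper name, chip-select handling is left out, and the #else branch is only a host-side stand-in so the snippet compiles off the board; it is not the real SPI library.

```cpp
#ifdef ARDUINO
#include <SPI.h>
#else
// Host-only stand-ins (hypothetical) so this compiles without the Arduino core.
#include <cassert>
#include <stddef.h>
#include <stdint.h>
enum { MSBFIRST = 1, SPI_MODE0 = 0 };
struct SPISettings {
    uint32_t clock; int order; int mode;
    SPISettings(uint32_t c, int o, int m) : clock(c), order(o), mode(m) {}
};
struct FakeSPI {
    uint32_t clock = 0;         // last clock requested
    bool inTransaction = false; // true between begin/endTransaction
    size_t bytesSent = 0;       // bytes passed to transfer()
    void beginTransaction(SPISettings s) { clock = s.clock; inTransaction = true; }
    void transfer(void *, size_t n) { if (inTransaction) bytesSent += n; }
    void endTransaction() { inTransaction = false; }
} SPI;
#endif

// Hypothetical helper: push one frame buffer to the display at 12 MHz.
void blitFrame(uint8_t *frame, size_t len) {
    SPI.beginTransaction(SPISettings(12000000, MSBFIRST, SPI_MODE0));
    SPI.transfer(frame, len);   // note: the buffer is overwritten with received bytes
    SPI.endTransaction();
}
```

Because the block transfer writes the received bytes back into the same buffer, a frame that needs to be sent again has to be kept in a separate copy.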
I did notice however that when I removed the SD card streaming from the equation, the blitting sped up 3x. That means the screen write accounts for only a third of each frame and the SD read for the other two thirds, so for some reason streaming the data from the SD card takes twice as long as writing the same amount of data to the screen. I believe this may have been the case with the 328P as well. I don't know why it's the case, and I can't see any reason it should be the case, but I'm just putting that out there.
I know the data has to be copied to an intermediate 512 byte buffer when it's read from the card, and that data is then copied to the buffer I supply the library with, but seeing as the Zero runs at 48MHz, it doesn't seem like that extra copy operation should double the time it takes to stream the data off the card.
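To put rough numbers on that split, here is a back-of-the-envelope sketch. It assumes the SD read and the screen write run back to back with no overlap and that the 3x figure is exact; FrameSplit and splitFrameTime are names made up for this example.

```cpp
#include <cassert>

// Per-frame time split, assuming SD read and screen write do not overlap.
struct FrameSplit {
    double totalMs;   // whole frame: SD read + screen write
    double screenMs;  // screen write only
    double sdMs;      // SD read only (total minus screen)
};

// fpsWithSd: measured frame rate while streaming from the card.
// speedupWithoutSd: how much faster the blit gets with the SD read removed.
FrameSplit splitFrameTime(double fpsWithSd, double speedupWithoutSd) {
    assert(fpsWithSd > 0.0 && speedupWithoutSd >= 1.0);
    FrameSplit s;
    s.totalMs  = 1000.0 / fpsWithSd;
    s.screenMs = s.totalMs / speedupWithoutSd;
    s.sdMs     = s.totalMs - s.screenMs;
    return s;
}
```

Calling splitFrameTime(3.75, 3.0) gives about 267 ms per frame, of which about 89 ms is the screen write and about 178 ms is the SD read.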
But anyway, back to the SPI library...
I have not made any progress here yet myself, but I thought I'd post what information I have now so if anyone has any suggestions they can provide them and I have all the info in one spot.
So, first, we have this function, which is the one that needs to be optimized:
void SPIClass::transfer(void *buf, size_t count)
{
  // TODO: Optimize for faster block-transfer
  uint8_t *buffer = reinterpret_cast<uint8_t *>(buf);
  for (size_t i=0; i<count; i++)
    buffer[i] = transfer(buffer[i]);
}
That calls this function:
byte SPIClass::transfer(uint8_t data)
{
  // Writing the data
  _p_sercom->writeDataSPI(data);
  // Read data
  return _p_sercom->readDataSPI() & 0xFF;
}
And that calls these functions:
void SERCOM::writeDataSPI(uint8_t data)
{
  while( sercom->SPI.INTFLAG.bit.DRE == 0 )
  {
    // Waiting Data Registry Empty
  }
  sercom->SPI.DATA.bit.DATA = data; // Writing data into Data register
  while( sercom->SPI.INTFLAG.bit.TXC == 0 || sercom->SPI.INTFLAG.bit.DRE == 0 )
  {
    // Waiting Complete Transmission
  }
}

uint16_t SERCOM::readDataSPI()
{
  while( sercom->SPI.INTFLAG.bit.DRE == 0 || sercom->SPI.INTFLAG.bit.RXC == 0 )
  {
    // Waiting Complete Reception
  }
  return sercom->SPI.DATA.bit.DATA; // Reading data
}
It took me a while to track it all down in all the various directories. Some of it is in the hidden Arduino15 user directory, and some is in the libraries directory under the main Arduino folder, not to be confused with the libraries directory in your documents. Anyway, I just linked to the GitHub repository to make things easier for everyone. That's the newest version of the code anyway. Plus I get syntax highlighting, unlike when I open the code in WordPad because Atmel Studio takes two minutes to open.
But anyway... that code looks fairly simple. It shouldn't be too difficult to optimize it, by doing something along the lines of what we achieved here:
http://forum.arduino.cc/index.php?topic=129824.0
I'm not suggesting we add NOPs to the code of course. And I don't think it's necessary, since the Zero runs at 48MHz while its SPI can only run at 12MHz. There should be sufficient spare time for the extra comparison and jump not to affect the final speed.
Just something along the lines of this:
SPDR = *thisLED--;               // Start the first byte transfer, then step the thisLED pointer down.
do {
  while (!(SPSR & _BV(SPIF)));   // Wait for the byte in flight to finish.
  SPDR = *thisLED--;             // Load the next byte into SPDR, then step the thisLED pointer down.
} while (thisLED != lastLED);
while (!(SPSR & _BV(SPIF)));     // Wait for the last byte to finish.
I'm not sure it will be that straightforward though, since I have been doing a bit of googling and I saw something about the SPI having a double buffer on the receiver. But maybe that was only an issue when trying to write without also reading.
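On the spare-time point, the cycle budget can be worked out from the clock figures quoted in this post. This is only a sketch: cpuCyclesPerSpiByte is a made-up helper, and it assumes 8-bit frames with no gap inserted between bytes by the peripheral.

```cpp
#include <cassert>
#include <stdint.h>

// CPU cycles that elapse while one 8-bit SPI frame is shifted out,
// assuming back-to-back bytes with no inter-byte gap.
constexpr uint32_t cpuCyclesPerSpiByte(uint64_t cpuHz, uint64_t spiHz) {
    return static_cast<uint32_t>((cpuHz * 8u) / spiHz);
}

// Zero: 48 MHz core, 12 MHz SPI clock.
static_assert(cpuCyclesPerSpiByte(48000000u, 12000000u) == 32u, "Zero budget");
// 328P: 16 MHz core, 8 MHz SPI clock.
static_assert(cpuCyclesPerSpiByte(16000000u, 8000000u) == 16u, "328P budget");
```

So the Zero has about 32 CPU cycles per byte to test a flag, branch, and load the next byte, which is twice the 16 cycles the 328P had for the same work.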
Anyway, I've said all I've got to say for now. Now I'm gonna figure out exactly what that code is doing and see if it's possible to interleave the reads and the writes first of all. Then if that helps I'll look into using pointers to speed up the buffer access. Ultimately I hope to speed this up around 4.5x from what I'm getting right now. Hopefully that speedup will carry over to the SD library as well if it's using the transfer function, which I haven't checked. That would give an even greater speedup.
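As a starting point, here is a rough sketch of what interleaving the reads and writes with pointers might look like. It has not been run on a Zero. transferBlock, the Port interface, and FakePort are all names invented for this example; on the real chip, dataRegisterEmpty(), receiveComplete(), write() and read() would presumably map onto the INTFLAG DRE and RXC bits and the DATA register used in the SERCOM code above, and whether the hardware tolerates this ordering is an assumption. FakePort is a host-only loopback that echoes each byte back inverted so the logic can be checked off the board.

```cpp
#include <cassert>
#include <deque>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Pipelined full-duplex block transfer: queue the next byte as soon as the
// data register is free, and collect received bytes as they arrive.
template <typename Port>
void transferBlock(Port &port, uint8_t *buf, size_t count) {
    if (count == 0) return;
    uint8_t *tx = buf;                 // next byte to send
    uint8_t *rx = buf;                 // next slot to fill with a received byte
    uint8_t *const end = buf + count;

    while (!port.dataRegisterEmpty()) {}
    port.write(*tx++);                 // start the first byte
    while (rx != end) {
        // Keep the transmitter fed; rx never passes tx, so unsent data is safe.
        if (tx != end && port.dataRegisterEmpty()) port.write(*tx++);
        if (port.receiveComplete()) *rx++ = port.read();
    }
}

// Host-only stand-in for the peripheral (hypothetical, not SAMD21 behaviour).
struct FakePort {
    std::vector<uint8_t> wire;         // bytes "sent", in order
    std::deque<uint8_t> rxQueue;       // bytes waiting to be read
    bool txFull = false;
    uint8_t txByte = 0;

    void tick() {                      // shift out the pending byte, if any
        if (!txFull) return;
        wire.push_back(txByte);
        rxQueue.push_back(static_cast<uint8_t>(~txByte));
        txFull = false;
    }
    bool dataRegisterEmpty() { bool empty = !txFull; tick(); return empty; }
    bool receiveComplete()   { tick(); return !rxQueue.empty(); }
    void write(uint8_t b)    { txByte = b; txFull = true; }
    uint8_t read()           { uint8_t b = rxQueue.front(); rxQueue.pop_front(); return b; }
};
```

Templating on the port type keeps the loop itself free of register names, so the same function can be exercised against FakePort on a PC and against a thin wrapper around the SERCOM registers on the board.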