Happiness is SdFat with DMA SPI

I decided to implement DMA SPI for SdFat on Due. I couldn't be more pleased with the results!

Here are bench.ino results for very large (32 KB) reads and writes with a class 4 SanDisk 8 GB card:

Type is FAT32

File size 40MB
Buffer size 32768 bytes
Starting write test. Please wait up to a minute
Write 3835.82 KB/sec
Maximum latency: 45560 usec, Minimum Latency: 7815 usec, Avg Latency: 8538 usec

Starting read test. Please wait up to a minute
Read 4324.64 KB/sec
Maximum latency: 12697 usec, Minimum Latency: 7495 usec, Avg Latency: 7576 usec

The Arduino standard SD.h library is based on a very old version of SdFat that I tried to keep small enough to run on 168 AVR processors.

After some patching to make the standard SD.h library work with 32 KB reads I get this result on Due with the above SD card:

Type is FAT32

File size 40MB
Buffer size 32768 bytes
Starting write test. Please wait up to a minute
Write 229.87 KB/sec
Maximum latency: 805951 usec, Minimum Latency: 130704 usec, Avg Latency: 142545 usec

Starting read test. Please wait up to a minute
Read 434.30 KB/sec
Maximum latency: 77666 usec, Minimum Latency: 72686 usec, Avg Latency: 75448 usec

Happiness is a factor of 16 in write speed and a factor of nearly 10 in read speed over the standard SD library.
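For anyone who wants to check those factors, here is a minimal sketch of the arithmetic. The four throughput values are copied from the two bench.ino runs above; the constant and function names are made up for illustration.

```cpp
#include <cstdio>

// Throughput figures copied from the two bench.ino runs above (KB/sec).
constexpr double kDmaWrite = 3835.82;  // SdFat with DMA SPI, write
constexpr double kDmaRead  = 4324.64;  // SdFat with DMA SPI, read
constexpr double kSdWrite  = 229.87;   // patched standard SD.h, write
constexpr double kSdRead   = 434.30;   // patched standard SD.h, read

// Ratio of the faster throughput to the slower one.
constexpr double speedup(double fast, double slow) { return fast / slow; }

// Print both ratios with one decimal place.
void printSpeedups() {
  std::printf("write speedup: %.1fx\n", speedup(kDmaWrite, kSdWrite));
  std::printf("read speedup:  %.1fx\n", speedup(kDmaRead, kSdRead));
}
```

The write ratio lands a little above 16 and the read ratio just under 10, which matches the rounded figures quoted.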

I am running SPI at 42 MHz. I built an SD shield with very short wires connecting the SD socket directly to the ISP connector.

Your work with SD is always astounding! Your work with Teensy 3.0 was very good too!

Are the results from Due DMA about 50% faster than you have got from Teensy 3.0 so far?

This brings the Due into line with the Teensy 3.0 performance -- about twice the throughput at twice the clock speed.

But is the Teensy using DMA? If not, then almost all the speedup is due to side-stepping the slow standard SPI library implementation on the Due, rather than to DMA per se.

In other words, if the standard SPI library were fixed for the Due, you might expect to see similar performance with non-DMA code. No?

Please share the modifications with us.

Can't wait for that.

pico,

You need DMA to get the speedup on Due. There is no FIFO in the SPI controller. The Teensy FIFO allows SPI to go at almost full speed, 24 MHz, without DMA.

For Due, I wrote non-DMA SPI optimized for SD cards and got about twice the standard SPI library speed. The standard SPI library trades speed for features which is probably a good thing.

I will post this version of SdFat soon. I need to do a lot more stress testing.

I have one puzzle: sometimes a DMA read hangs at 42 MHz. To drive the read, I use one DMA channel to send a stream of 0XFF bytes to the SPI controller. Data is read from the SPI controller using a second DMA channel. I sometimes get a hang when the 0XFF byte is in the same SRAM bank as the receive buffer.

I need to investigate options. I am using byte transfers for SPI. Other high speed interfaces like Ethernet and HSMCI can use 32-bit transfers. I am using a DMA channel for receive that has an eight deep FIFO. I will try a channel with a 32 entry FIFO.

Looks like the problem with reads at 42 MHz only occurs if the data buffer for the read is in the top 32 KB of memory.

SRAM for the SAM3X is two banks, SRAM0 - 64 KB, and SRAM1 - 32 KB. The stack is at the top of SRAM1 so when the interrupt for sysTick happens, registers are pushed on the stack in SRAM1. This blocks access to SRAM1 by the DMA controller and I get an SPI overrun error. If I disable interrupts during the 512 byte DMA transfer, no error occurs. Too bad the SPI controller doesn't have a FIFO. At 42 MHz, the SPI bus delivers a byte about every 200 ns.

It takes about 100 usec for a 512 byte transfer so it's too long to disable interrupts.
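The two timing figures above follow directly from the SPI clock. A minimal sketch of the calculation, assuming back-to-back bytes with no gaps between them (the function names are illustrative only):

```cpp
#include <cstdint>

// Time for one 8-bit byte on the SPI bus, in nanoseconds, at a given SCK rate.
constexpr double byteTimeNs(double sckHz) { return 8.0 * 1e9 / sckHz; }

// Time for a block of nBytes, in microseconds, assuming no idle time
// between bytes (an idealisation, real transfers may be slightly slower).
constexpr double blockTimeUs(double sckHz, uint32_t nBytes) {
  return byteTimeNs(sckHz) * nBytes / 1000.0;
}
```

At 42 MHz this gives roughly 190 ns per byte and a little under 98 usec for a 512 byte block. With no FIFO in the SPI controller, the DMA controller has about one byte time to collect each received byte before an overrun.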

Most people will have their buffers in the first 64 KB, SRAM0, and access to this bank works fine with 42 MHz SPI. Also, SPI at 28 MHz works with buffers in SRAM1.
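If you want to know which bank a buffer lands in, a check along these lines might help. This is only a sketch: the SRAM1 start address 0x20080000 is my assumption from the SAM3X8E memory map and is not stated in this thread, so verify it against the datasheet. The helper names are hypothetical.

```cpp
#include <cstdint>

// ASSUMPTION: SRAM1 on the SAM3X8E starts at 0x20080000 and is 32 KB long.
// Verify both values against the datasheet before relying on them.
constexpr uint32_t kSram1Start = 0x20080000u;
constexpr uint32_t kSram1Size  = 32u * 1024u;

// True if a single address falls inside the assumed SRAM1 range.
constexpr bool inSram1(uint32_t addr) {
  return addr >= kSram1Start && (addr - kSram1Start) < kSram1Size;
}

// True if the first or last byte of a buffer lies in the assumed SRAM1 range.
// Nothing is mapped above SRAM1 in this model, so checking both ends is enough.
constexpr bool bufferTouchesSram1(uint32_t addr, uint32_t len) {
  return len != 0 && (inSram1(addr) || inSram1(addr + len - 1));
}
```

On the Due itself you would pass `reinterpret_cast<uint32_t>(buf)` as the address. A buffer that touches SRAM1 is a candidate for the slower 28 MHz clock.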

DMA at 28 MHz is still fast:

Type is FAT16
File size 40MB
Buffer size 61440 bytes
Starting write test. Please wait up to a minute
Write 2834.49 KB/sec
Maximum latency: 90293 usec, Minimum Latency: 20612 usec, Avg Latency: 21666 usec

Starting read test. Please wait up to a minute
Read 3056.74 KB/sec
Maximum latency: 20401 usec, Minimum Latency: 19993 usec, Avg Latency: 20098 usec

Sending 0XFF bytes to drive the SPI clock has nothing to do with the problem.

Could you post the sections of code you use to set up and trigger the DMA? I have ideas but I don't want to make a load of suggestions you have probably already tried. I'd experiment with priorities to try and get the DMA above the Cortex read/write, and also experiment with locking options.

Hi.

I'm struggling to raise speed in SPI.

I've bought a SanDisk Extreme Pro for these tests.

Even with this card, I can't go faster than clock/8.

From what I understand from your reports, this is a limitation of the Due and the standard SD library, because the SPI port has no buffers.

Am I understanding things right?

I've currently modified the SDClass::begin function to do the init in low speed, and then raise speed.

boolean SDClass::begin(uint8_t csPin) {
  /*
    Performs the initialisation required by the sdfatlib library.

    Return true if initialization succeeds, false otherwise.
   */
  boolean state;
  state = card.init(SPI_QUARTER_SPEED, csPin);  // initialise at low speed
  state &= volume.init(card);
  state &= root.openRoot(volume);
  card.setSckRate(2);  // then raise the SPI clock

  return state;
}

Only when I pass 2 to card.setSckRate() do I get the slideshow running.

With DMA, are you able to init the lib at 42 MHz?

stimmer,

Edit: after posting this I changed RX FIFO handling to DMAC_CFG_FIFOCFG_ASAP_CFG and things seem to work better but still errors.

I attached a zip file with a test library and a sketch that demonstrates the problem (you must be logged in to see the attachment).

Put the DueSPI folder in libraries and run the due_spi sketch to see the problem.

I will be interested to see if changing Bus Matrix arbitration can make this work.

alvesjc,

You may be able to increase the speed of the Standard SD library. I wrote the base code for the Standard SD library about three years ago so I have no interest in that old code.

DueSPI.zip (2.32 KB)

Ok, just to confirm I am understanding your example correctly: is the return value of spiRec an error count? Running the example as is, I get no "SRAM0 rtn:" lines and screenfuls of "SRAM1 rtn:" lines. After making the change to ASAP in spiDmaRX I get two "SRAM1 rtn:" error lines on average.

Edit: If I also make the change DMAC->DMAC_GCFG = DMAC_GCFG_ARB_CFG_FIXED; in spiBegin as well as ASAP in spiDmaRX I am getting no errors.

Update: Still no errors after 750 loops. 100 loops take 228 seconds, which, if I've calculated correctly, is a data rate of 4491 KB/s or 35.9 Mbit/s (looks about right).
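To see why that rate "looks about right", it helps to convert it to bits and compare it with the 42 MHz SPI clock. A small sketch, assuming 1 KB = 1000 bytes as the quoted conversion implies (function names are illustrative):

```cpp
// Convert a rate in KB/s (1 KB = 1000 bytes) to Mbit/s.
constexpr double kbytesToMbits(double kBytesPerSec) {
  return kBytesPerSec * 8.0 / 1000.0;
}

// Fraction of the raw SPI bit rate that is carrying data.
constexpr double busUtilization(double mbitsPerSec, double sckMHz) {
  return mbitsPerSec / sckMHz;
}
```

4491 KB/s works out to about 35.9 Mbit/s, which is roughly 85% of what a 42 MHz clock could carry if it never paused.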

The return value from spiRec is a set of error bits: 0X2 means a timeout waiting for DMA to finish and 0X1 means SPI overrun was set.
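A test sketch could decode those bits along the following lines. The two bit values come from the description above; the constant names and the describeSpiRecStatus helper are hypothetical and are not part of the DueSPI test library.

```cpp
#include <cstdint>
#include <string>

// Bit values as described above; the names are made up for this sketch.
constexpr uint8_t kSpiOverrun = 0x1;  // SPI overrun was set
constexpr uint8_t kDmaTimeout = 0x2;  // timed out waiting for DMA to finish

// Turn a spiRec-style return value into a readable message.
std::string describeSpiRecStatus(uint8_t rtn) {
  if (rtn == 0) return "ok";
  std::string msg;
  if (rtn & kSpiOverrun) msg += "SPI overrun";
  if (rtn & kDmaTimeout) {
    if (!msg.empty()) msg += ", ";
    msg += "DMA timeout";
  }
  // Any other bit is outside what the thread describes.
  if (rtn & ~(kSpiOverrun | kDmaTimeout)) {
    if (!msg.empty()) msg += ", ";
    msg += "unknown bits";
  }
  return msg;
}
```

Because the bits are independent, a single return value can report both conditions at once.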

I tried both changes with SdFat and it seems to work. I will stress test with some long file reads.

I can't say I feel confident; I wish I understood the Bus Matrix and DMA controller better.

At least a test with 100 MB which is about 200,000 block reads ran O.K.

Type is FAT16
File size 100MB
Buffer size 81920 bytes
Starting write test.
Write 4019.24 KB/sec
Maximum latency: 83141 usec, Minimum Latency: 19453 usec, Avg Latency: 20376 usec

Starting read test.
Read 4387.86 KB/sec
Maximum latency: 18959 usec, Minimum Latency: 18598 usec, Avg Latency: 18667 usec

OK - at least a 100MB test running OK is a good sign.

I reverted back to the original (ALAP and ROUND_ROBIN) configuration and tried to solve the issue purely through the matrix controller. These settings appear to be working:

(in spiBegin:)
  MATRIX->MATRIX_WPMR=0x4d415400;
  MATRIX->MATRIX_MCFG[1]=1;
  MATRIX->MATRIX_MCFG[2]=1;
  MATRIX->MATRIX_SCFG[0]=0x01000010;
  MATRIX->MATRIX_SCFG[1]=0x01000010;
  MATRIX->MATRIX_SCFG[7]=0x01000010;

(What this does is to stop masters from hogging the buses too much and keeps the bus slaves connected to their last master. There are even stronger options if this is not enough)

It may be an idea to apply both fixes (I tried, no errors after 100 loops) in the hope that if there's an obscure situation where one fix doesn't work, the other one does.

stimmer,

Thanks, I will probably put the Bus Matrix stuff in. Maybe with conditionals so it's there if problems occur when I post this for test.

I was about to try the Bus Matrix next.

The SAM3X is new for me, I mostly work with STM32.

Configuring these things when you're new feels like trying all possible combinations until something works. Trouble is, the number of combinations for n choose m is mighty large. I tried both DMAC_CFG_FIFOCFG_ASAP_CFG and DMAC_GCFG_ARB_CFG_FIXED but not at the same time.

Edit: I have now run a lot of tests with 2 GB files using just the two DMAC changes. I have probably written and read 20 GB of data without an error. It only takes a bit over eight minutes to read a 2 GB file.

Fantastic - this will be the obvious choice for logging applications!

Impressive!

fat16lib, do you think that DMA-based transfer could be integrated in current SPI lib? for instance by adding a method for block transfers?
I guess we should forget about setBitOrder feature...

I think a block transfer function could be added. Maybe something like

  bool SPIClass::transfer(uint8_t* rxBuf, uint8_t* txBuf, uint16_t size);

You need to allow either rxBuf or txBuf to be null. The transfer size is limited to a 16-bit field unless you use chained buffers.
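To make the null-buffer semantics concrete, here is a host-side sketch of how such a function could behave. It is not the SPI library's code and uses no DMA: the ByteExchange callback, transferBlock, and loopbackExchange are hypothetical stand-ins for the hardware, and txBuf is taken as const here, unlike the signature above. A null txBuf sends 0XFF filler bytes, as is done for SD reads earlier in the thread; a null rxBuf discards the received bytes.

```cpp
#include <cstdint>

// Hypothetical stand-in for the hardware: sends one byte, returns the byte
// received during the same eight clocks.
using ByteExchange = uint8_t (*)(uint8_t out);

// Demo "device" that simply echoes whatever it is sent.
inline uint8_t loopbackExchange(uint8_t out) { return out; }

// Block transfer where either buffer (but not both) may be null.
bool transferBlock(ByteExchange exchange, uint8_t* rxBuf,
                   const uint8_t* txBuf, uint16_t size) {
  if (exchange == nullptr) return false;
  if (rxBuf == nullptr && txBuf == nullptr) return false;
  for (uint16_t i = 0; i < size; ++i) {
    // With no transmit buffer, clock the bus with 0XFF filler bytes.
    const uint8_t out = txBuf ? txBuf[i] : 0xFF;
    const uint8_t in = exchange(out);
    // With no receive buffer, the incoming byte is dropped.
    if (rxBuf) rxBuf[i] = in;
  }
  return true;
}
```

A DMA version would replace the loop with one channel for transmit and one for receive, but the handling of the null cases would stay the same.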

The main problem is that SPI.h uses the variable chip select mode. This mode can be used with DMA but doesn't seem like what users would want.

How would this be integrated into the standard SPI lib so that you could use multiple SPI devices at once without them all or none having to use DMA?

Strikes me as a pity that you need to go the DMA route to get performance that is equivalent to a simple FIFO buffer as found on the Teensy chip. Seems like a lot of added complexity with all the attendant issues just to get to a baseline of reasonably efficient SPI performance.

I gather from what you've written there would be little or no point in terms of a speed benefit in doing DMA SPI for the Teensy 3, for example?

Do you think there is really no hope to get the poor Due SPI performance improved to any significant degree without resorting to DMA?

You could use DMA or not on any device or any transfer to a given device. SAM3X DMA is just an engine to move memory to/from a device register with the appropriate handshake/flow control.

I get twice the performance on SAM3X as on Teensy 3.0. I run SAM3X at 42 MHz and Teensy 3.0 at 24 MHz. I use a lot of tricks on Teensy so DMA on Due is simpler.

I suspect there would be a slight improvement on Teensy 3.0 with DMA.

DMA could be a real advantage with a RTOS. If you run the SD task at low priority, more CPU would be available. Even better a task could wait on a DMA done interrupt.

Finally, I have the option in my SPI layer of using optimized SPI without DMA. It runs much faster than the standard SPI library but has none of the options.

I don't see what you have against SAM3X DMA.

fat16lib:
I don't see what you have against SAM3X DMA.

I have nothing against DMA per se, it's just that by your own account, it seems to be adding a great deal of complexity to the SPI implementation. Of course, the simpler the solution the more robust, generally speaking.

But if you say the Teensy 3.0 FIFO code is even more complex, it's probably a better way of going -- and perhaps even means the Teensy would benefit from a DMA implementation after all, if only to make things less complex.

So I'm planning to try out your improved DUE SPI / SD card library this weekend on my laser tag system. I'll let you know how it goes. I currently use the ATmega328P with your old SD library, but use file indexing to open files quickly. Not very user friendly for coding, but it works. Maybe I can retire the file indexing if this works much faster?

What is the status of this being included in the non-beta Due library?

http://code.google.com/p/beta-lib/downloads/list