New SdFat optimized for Mega and Teensy 3.0

I have posted a new SdFat beta, SdFatBeta20121020.zip (Google Code Archive).

This beta supports AVR boards and the new ARM Teensy 3.0 board (Teensy 3.0 + header, ID 1044, Adafruit Industries).

Performance is greatly improved for large reads and writes.

The following benchmarks were done using this ATP 1GB Industrial Grade SD card http://www.newegg.com/Product/Product.aspx?Item=9SIA12K0CT6829.

The results are in increasing order of performance.

Mega 2560 using the standard Arduino SD.h library with 4096 byte reads and writes.

File size 5MB
Buffer size 4096 bytes
Starting write test. Please wait up to a minute
Write 265.95 KB/sec
Maximum latency: 84184 usec, Minimum Latency: 14144 usec, Avg Latency: 15388 usec

Starting read test. Please wait up to a minute
Read 314.07 KB/sec
Maximum latency: 14752 usec, Minimum Latency: 13020 usec, Avg Latency: 13035 usec

Mega 2560 using the new SdFat library with 4096 byte reads and writes.

File size 5MB
Buffer size 4096 bytes
Starting write test. Please wait up to a minute
Write 658.38 KB/sec
Maximum latency: 65816 usec, Minimum Latency: 6036 usec, Avg Latency: 6210 usec

Starting read test. Please wait up to a minute
Read 616.40 KB/sec
Maximum latency: 7624 usec, Minimum Latency: 6628 usec, Avg Latency: 6638 usec

Teensy 3.0 at 96 MHz using the new SdFat library with 4096 byte reads and writes.

File size 5MB
Buffer size 4096 bytes
Starting write test. Please wait up to a minute
Write 1776.44 KB/sec
Maximum latency: 65790 usec, Minimum Latency: 2146 usec, Avg Latency: 2300 usec

Starting read test. Please wait up to a minute
Read 2037.15 KB/sec
Maximum latency: 2356 usec, Minimum Latency: 1999 usec, Avg Latency: 2008 usec

Here is my best Teensy 3.0 result using an old Corsair 1GB SD (no longer available) with 8192 byte reads and writes.

File size 10MB
Buffer size 8192 bytes
Starting write test. Please wait up to a minute
Write 2002.05 KB/sec
Maximum latency: 6777 usec, Minimum Latency: 4007 usec, Avg Latency: 4089 usec

Starting read test. Please wait up to a minute
Read 2121.47 KB/sec
Maximum latency: 4231 usec, Minimum Latency: 3853 usec, Avg Latency: 3860 usec

Great! Would it be possible to publish the SPI clock speed used, please? Thanks..

pito,

The SPI clock for SD.h on the Mega was 4 MHz. The standard Arduino SD library has no option for SPI clock speed.

I did a test at 8 MHz with SD.h by editing the source for begin(). Doubling the SPI clock speed increased the write speed from 265.95 KB/sec to 376.29 KB/sec and the read speed from 314.07 KB/sec to 463.86 KB/sec.

File size 5MB
Buffer size 4096 bytes
Starting write test. Please wait up to a minute
Write 376.29 KB/sec
Maximum latency: 78644 usec, Minimum Latency: 9808 usec, Avg Latency: 10873 usec

Starting read test. Please wait up to a minute
Read 463.86 KB/sec
Maximum latency: 9940 usec, Minimum Latency: 8784 usec, Avg Latency: 8823 usec

The SPI clock speed for Mega was 8 MHz with the new version of SdFat.

The SPI clock speed for Teensy 3.0 was 24 MHz.

The key to high-speed writes is to use a record size that is a multiple of 512 bytes. This ensures that very little data needs to be copied to the cache. In these tests with the new SdFat, only directory entries and the FAT need to be cached.

Using a record size that is a power of two increases performance slightly. This ensures that writes are aligned with file clusters.
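The two record-size conditions above are easy to check up front. The following is only a sketch; the helper names are made up for illustration and are not part of SdFat.

```cpp
#include <cstddef>

// SD cards are read and written in 512-byte blocks.
constexpr std::size_t kBlockSize = 512;

// True when every record fills whole 512-byte blocks.
constexpr bool isBlockMultiple(std::size_t recordSize) {
  return recordSize != 0 && recordSize % kBlockSize == 0;
}

// True when the size is a power of two (a single bit set).
constexpr bool isPowerOfTwo(std::size_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}
```

A 1536-byte record passes the first test but not the second, while the 4096- and 8192-byte records used in the benchmarks pass both.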

It is very important to use a freshly formatted SD card so that a contiguous file is created. SD cards perform very poorly for random writes since the internal flash in most cards has a very large page size. The entire page must be rewritten to newly erased flash for a random write.

The new SdFat still has only one block of cache, so that block must be used for both write data and the FAT. Overhead is therefore dramatically increased if write data must be cached.

I am considering options to use more cache but this adds complexity and there is still the overhead of copying data to the cache. Adding cache would not improve the above results but would help when record size is not a multiple of 512.
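To see why record size matters, here is a rough model (not SdFat's internal code) that splits a write into the whole 512-byte blocks it covers and the partial head and tail pieces, which are the parts that would have to pass through a cache block. `WriteSplit` and `splitWrite` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct WriteSplit {
  std::size_t headBytes;   // partial block at the start of the write
  std::size_t fullBlocks;  // whole 512-byte blocks
  std::size_t tailBytes;   // partial block at the end of the write
};

// Split a write of len bytes starting at file position filePos.
WriteSplit splitWrite(std::uint32_t filePos, std::size_t len) {
  const std::size_t kBlock = 512;
  WriteSplit s = {0, 0, 0};
  const std::size_t offset = filePos % kBlock;
  if (offset != 0) {
    // The write starts inside a block: fill what is left of that block.
    const std::size_t room = kBlock - offset;
    s.headBytes = len < room ? len : room;
    len -= s.headBytes;
  }
  s.fullBlocks = len / kBlock;
  s.tailBytes = len % kBlock;
  return s;
}
```

A 4096-byte write at position 0 is 8 whole blocks with no partial pieces. A 1000-byte write at position 100 is a 412-byte head, one whole block, and a 76-byte tail.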

I've seen the K20 ref manual and the SPI includes a 4-word-deep FIFO for transmit and receive; moreover, it supports 32-bit SPI transfers. Try to use the FIFO. It helped us a lot with the pic32 retrobsd SD card driver (I wrote a small routine for tx/rx with "enhanced buffering", as it is called at MCHP; DMA was not used, but it got similar performance in the end).

I am using the FIFO on Teensy 3.0.

The maximum SPI frame size for the K20 is 16 bits, not 32 bits.

I have not tried 16 bit frames yet since the Freescale examples were a mess. I suspect 16 bit frames could help a lot.

I plan to try DMA in the future.

The absolute limit for SPI at 24 MHz is 3000 KB/sec but I don't think I will get close to that for 4 KB writes, even with DMA.
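The 3000 KB/sec figure follows from one data bit per SPI clock: 24,000,000 bits/sec divided by 8 is 3,000,000 bytes/sec. A small sketch of that arithmetic, assuming 1 KB = 1000 bytes (which is what the 3000 KB/sec figure implies) and no gaps between bytes; the function names are illustrative only.

```cpp
// Peak payload rate in KB/sec for an SPI clock in Hz,
// assuming one bit per clock, no inter-byte gaps, 1 KB = 1000 bytes.
constexpr double spiPeakKBps(double clockHz) {
  return clockHz / 8.0 / 1000.0;
}

// Fraction of the peak rate that a measured throughput achieves.
constexpr double busUtilization(double measuredKBps, double clockHz) {
  return measuredKBps / spiPeakKBps(clockHz);
}
```

By this measure, the 1776.44 KB/sec write result on Teensy 3.0 is a little under 60% of the theoretical limit at 24 MHz.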

pito,

I was able to try 16-bit frames. The write rate increased from 1776.44 KB/sec to 2013.34 KB/sec.

The overhead is increased since a byte swap is required. I form the 16-bit word to be sent like this:

    uint16_t w = *src++ << 8;
    w |= *src++;
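For context, the same idea applied to a whole buffer might look like the sketch below. It only shows the byte packing (first byte in the high half of each 16-bit frame); it does not touch any SPI registers, and `packFrames16` is a made-up name.

```cpp
#include <cstddef>
#include <cstdint>

// Pack 2 * frames bytes from src into 16-bit frames in dst.
// The first byte of each pair goes in the high half so that a shifter
// sending the most significant bit first puts it on the wire first.
void packFrames16(const std::uint8_t* src, std::uint16_t* dst,
                  std::size_t frames) {
  for (std::size_t i = 0; i < frames; ++i) {
    std::uint16_t w = static_cast<std::uint16_t>(*src++ << 8);
    w |= *src++;
    dst[i] = w;
  }
}
```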

..the ref manual says it can do 32-bit transfers as well.. :)

The RX FIFO appears to be 32-bits wide but:

3.9.2.5 RX FIFO Size
SPI supports up to 16-bit frame size during reception.

The TX FIFO is 32-bits wide but the high 16-bits are command bits.

43.3.7 DSPI PUSH TX FIFO Register In Master Mode (SPIx_PUSHR)
PUSHR provides the means to write to the TX FIFO. Data written to this register is
transferred to the TX FIFO . 8- or 16-bit write accesses to the Data Field of PUSHR
transfers the 16 bit Data field of PUSHR to the TX FIFO. Write accesses to the
Command Field of PUSHR transfers the 16 bit Command Field of PUSHR to the TX
FIFO. The register structure is different in Master and Slave modes. In Master mode, the
register provides 16-bit command and data to the TX FIFO. In Slave mode, the 16 bit
Command Field of PUSHR is reserved.
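As a rough illustration of the layout quoted above (command in the high 16 bits, data in the low 16 bits), a PUSHR value could be assembled as follows. `makePushr` is a hypothetical helper, not a Teensy or Freescale API; it only does the bit arithmetic and does not write to any hardware register.

```cpp
#include <cstdint>

// Combine a 16-bit command field and a 16-bit data field into one
// 32-bit value: command in bits 31..16, data in bits 15..0.
constexpr std::uint32_t makePushr(std::uint16_t command, std::uint16_t data) {
  return (static_cast<std::uint32_t>(command) << 16) | data;
}
```

For example, `makePushr(0x041F, 0xABCD)` yields `0x041FABCD`. Only the low 16 bits carry data for the shift register.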

Even if you could send 32 bits, the bytes in memory are not in the correct order, due to the nature of a little-endian fetch from memory to a 32-bit register.
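The byte-order issue can be shown on any host. The sketch below models a little-endian 32-bit load with explicit shifts, so it behaves the same regardless of the machine it runs on; both function names are illustrative.

```cpp
#include <cstdint>

// Model of a 32-bit load on a little-endian CPU:
// the byte at the lowest address lands in the least significant byte.
std::uint32_t loadLittleEndian32(const std::uint8_t* p) {
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Reverse the four bytes of a 32-bit value.
std::uint32_t byteSwap32(std::uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

For bytes 01 02 03 04 in memory, the load gives 0x04030201. A shifter that sends the most significant bit first would put 04 on the wire first, so a byte swap to 0x01020304 would be needed to preserve the memory order.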

See above for the problem with 16-bit transfers.

The datasheet is not too clear in spots. See point 8 in this list:

TOP TEN THINGS ENGINEERING SCHOOL DIDN'T TEACH YOU (from Rich Ries via Embedded Muse)
10. There are at least 10 types of capacitors.
9. Theory tells you how a circuit works, not why it does not work.
8. Not everything works according to the specs in the databook.
7. Anything practical you learn will be obsolete before you use it,
except the complex math, which you will never use.
6. Always try to fix the hardware with software.
5. Engineering is like having an 8 a.m. class and a late afternoon lab
every day for the rest of your life.
4. Overtime pay? What overtime pay?
3. Managers, not engineers, rule the world.
2. If you like junk food, caffeine and all-nighters, go into software.
1. Dilbert is not a comic strip, it's a documentary.

:~
Figure 43-1. DSPI Block Diagram - shows 32-bit data path to the shift register (and an extra for command)
43.1.2 "SPI frames longer than 16 bits can be supported using the continuous selection format"..
43.3.9 "Eight- or sixteen-bit read accesses to the POPR have the same effect on the RX FIFO as 32-bit read accesses"..
43.4.2 "The SPI frames can be 32 bits long."..

  3. Sales Managers, not engineers, rule the world. :)

Yes you can read SPIx_POPR as a 32 bit register but only 16 bits are data. See 3.9.2.5

Figure 43-1. DSPI Block Diagram - shows 32-bit data path to the shift register (and an extra for command)

Figure 43-1 is wrong. See rule 8 in my previous post. See 43.3.7 for the format of PUSHR.

43.1.2 "SPI frames longer than 16 bits can be supported using the continuous selection format"..

Continuous selection format just ensures that CS remains low; it has nothing to do with the FIFOs.

CONT
Continuous Peripheral Chip Select Enable
Selects a continuous selection format. The bit is used in SPI Master mode. The bit enables the selected
PCS signals to remain asserted between transfers.
0 Return PCSn signals to their inactive state between transfers.
1 Keep PCSn signals asserted between transfers.

43.4.2 "The SPI frames can be 32 bits long."..

Frames can be any size, but a maximum of 16 bits can be transferred to/from the FIFOs. Unfortunately, the datasheet uses "frame size" for two different things.

In the CTAR (43.3.3) it is how many bits wide the data field is in a FIFO.

In other places it is how many bits are sent in a transfer while CS is low.

Again, the datasheet is only a clue for Kinetis, and point 8 above applies to many sections.

I am seriously thinking about drawing up a small board for the 64-pin stm32f407. The chip itself is 2x the price of that Kinetis, but offers ~4x the stock CPU clock, 1MB flash, 192kB RAM, SDIO, an FPU, etc.

Were you able to use the chip-select feature too? I set all the PCS bits in PUSHR but it never touches the CS state. This is my current code:

    inline uint16_t RF12_T3::rf12_xfer(uint16_t data) {
    //  digitalWriteFast(10, LOW);
      SPI0_PUSHR = (1<<26) | (B11111 << 16) | data;    // send data (clear transfer counter, select all CS)
      while (! SPI0_TCR) ; // loop until transfer is complete
    //  digitalWriteFast(10, HIGH);
      return SPI0_POPR;
    }

Configuration is like this:

    // enables and configures SPI module
    // 16MHz 16bit transfers on CTAR0
    SPI0_CTAR0 = 0xF8010000;

Sorry for being a bit off-topic...

I don't use the chip select feature in the SPI module since SD transfers are so large.

To use this feature you must select the proper pin multiplexing mode with the Port Control Module.

See chapters 10 and 11 of the chip data sheet.

Chapter 10
Signal Multiplexing and Signal Descriptions
10.1 Introduction
To optimize functionality in small packages, pins have several functions available via
signal multiplexing. This chapter illustrates which of this device's signals are multiplexed
on which external pin.
The Port Control block controls which signal is present on the external pin. Reference
that chapter to find which register controls the operation of a specific pin.

Chapter 11
Port control and interrupts (PORT)
11.1 Introduction
NOTE
For the chip-specific implementation details of this module's
instances see the chip configuration chapter.
11.1.1 Overview
The port control and interrupt (PORT) module provides support for port control, and
external interrupt functions. Most functions can be configured independently for each pin
in the 32-bit port and affect the pin regardless of its pin muxing state.
There is one instance of the PORT module for each port. Not all pins within each port are
implemented on a specific device.
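Selecting a pin's function comes down to writing the multiplexer field of that pin's control register. The sketch below only models the bit manipulation on a plain integer. It assumes the MUX field sits in bits 10..8 of the pin control register, which should be confirmed against the PORT chapter, and `withMux` is a made-up helper name. Which alternate-function number routes a PCS signal to a given pin has to be looked up in the chapter 10 tables.

```cpp
#include <cstdint>

// Assumption: the pin function (MUX) field occupies bits 10..8
// of a pin control register value.
constexpr std::uint32_t kMuxShift = 8;
constexpr std::uint32_t kMuxMask = 0x7u << kMuxShift;

// Return pcr with its MUX field replaced by alt (0..7),
// leaving all other bits unchanged.
constexpr std::uint32_t withMux(std::uint32_t pcr, std::uint32_t alt) {
  return (pcr & ~kMuxMask) | ((alt & 0x7u) << kMuxShift);
}
```

For example, `withMux(0x00000144, 2)` gives `0x00000244`: only the three MUX bits change.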

I am not very knowledgeable about the Kinetis chip; Paul Stoffregen helped me configure the SPI to be compatible with his software for Teensy 3.0.

The pin multiplexing mode – of course! That has to be the solution. Thank you!

  1. Great work on the awesome library!

  2. I may be out to lunch here, but I have a pickle of a problem with SPI that is driving me nuts:
    --If I understand correctly, the Teensy 3.0 has multiple SPI ports (yay/nay?). They are numbered on the pinout sheet but there isn't a lot of data available yet, and the proper datasheet is WAY above my understanding (I appreciate the patience with my noobtacular questions).
    --Can the SdFat library be pointed to a secondary SPI port (1 or 2 instead of 0)?

Long story short, I'm interfacing an SD card and a non-standard SPI device (an addressable LED strand that accepts clock/data) with the FastSPI library. It worked on Arduino but I needed more speed, so I moved to Teensy. I fake chip/slave select with a transistor interrupting the clock line while the SD reads. So far so good: the new library reads the files exceptionally well (an order of magnitude faster!), but when I go to use the SPI port while I'm not reading or writing anything, I still get garbled data out. I was hoping to move one of the competing libraries to a separate SPI port, or (gasp) bit-bang it. :( I don't think I can maintain the immense data rate of hardware SPI by bit-banging it.

Thanks in advance and again great work!
-Jamie