SdFat tests on GIGA R1

I did a test of SdFat on GIGA R1.

The default SdFat config has very poor performance.

I used the bench example.

I used SPI1 with pins 10-13.

This definition is required in the bench example:

#define SD_CONFIG SdSpiConfig(SD_CS_PIN, DEDICATED_SPI, SPI_CLOCK, &SPI1)

SPI_CLOCK is 50 MHz.

Here is the result:

Type is exFAT
Card size: 64.09 GB (GB = 1E9 bytes)

Manufacturer ID: 0X1B
OEM ID: SM
Product: EC1S5
Revision: 3.0
Serial number: 0X158D576A
Manufacturing date: 11/2020

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
387.99,16777,1310,1317
387.75,1632,1312,1318

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
392.53,1330,1289,1303
392.66,1328,1289,1302

I did a test at 60 MHz. Yes, SPI1 seems to support 60 MHz. This slightly improved the performance.

I got about 420 KB/sec.

I then edited SdFatConfig.h and selected array SPI transfer. Here are the changes.

#define SPI_DRIVER_SELECT 1  // was zero

#define USE_SPI_ARRAY_TRANSFER 1  // was zero

Here are the results:

Type is exFAT
Card size: 64.09 GB (GB = 1E9 bytes)

Manufacturer ID: 0X1B
OEM ID: SM
Product: EC1S5
Revision: 3.0
Serial number: 0X158D576A
Manufacturing date: 11/2020

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1763.56,7335,285,289
1769.17,2506,285,288

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1855.17,281,272,275
1855.17,281,272,275

Better, but here are results at 62.5 MHz on a Pi Pico RP2040 with a PIO SPI driver I wrote:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
7173.60,91,70,71
7173.60,91,70,71

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
7142.86,108,70,71
7173.60,94,70,71

SPI, the port on the ISP-like connector, seems to support a maximum rate of 30 MHz.

Since there are huge gaps of dead time between bytes, the lower clock doesn't matter much.

At 30 MHz you get about 1350 KB/sec.

Here is the program I used to investigate SPI on Giga with results for SPI write.

#include "SPI.h"
uint8_t buf[512];

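// Need this since SD is full duplex: fill the buffer with 0XFF so
// MOSI stays high while the received data overwrites the buffer.
void rxArray(SPIClass* spi, uint8_t* b, size_t n) {
  memset(b, 0XFF, n);  // memset(dest, value, count)
  spi->transfer(b, n);
}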

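// Need this too, since the send data is const and
// transfer(buf, count) overwrites the buffer it is given.
void txArray(SPIClass* spi, const uint8_t* b, size_t n) {
  // Fixed-size scratch buffer avoids a non-standard variable-length array.
  uint8_t tmp[512];
  while (n > 0) {
    size_t k = n < sizeof(tmp) ? n : sizeof(tmp);
    memcpy(tmp, b, k);
    spi->transfer(tmp, k);
    b += k;
    n -= k;
  }
}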

void rxBytes(SPIClass* spi, uint8_t* b, size_t n) {
  for (size_t i = 0; i < n; i++) {
    b[i] = spi->transfer(0XFF);
  }
}

void txBytes(SPIClass* spi, const uint8_t* b, size_t n) {
  for (size_t i = 0; i < n; i++) {
    spi->transfer(b[i]);
  }
}

void test(SPIClass* spi, uint32_t mhz) {
  spi->begin();
  spi->beginTransaction(SPISettings(mhz * 1000000UL, MSBFIRST, SPI_MODE0));
  uint32_t m1 = micros();
  txBytes(spi, buf, sizeof(buf));
  m1 = micros() - m1;
  uint32_t m2 = micros();
  txArray(spi, buf, sizeof(buf));
  m2 = micros() - m2;
  Serial.print((float)mhz);
  Serial.print(',');
  Serial.print(8.0*sizeof(buf)/m1);
  Serial.print(',');
  Serial.println(8.0*sizeof(buf)/m2);
  spi->endTransaction();
  spi->end();
}
void setup() {
  Serial.begin(9600);
  while (!Serial) {}
  Serial.println("ready");
  while (!Serial.available()) {}
  Serial.println("SPI requested rate MHz vs actual");
  Serial.println("request,byte,array");
  for (uint32_t mhz = 1; mhz <= 60; mhz++) {
    test(&SPI1, mhz);
  }
}

void loop() {
}

Here is the result. The first column is requested rate, second is actual rate for transfer(data) and the third is actual for transfer(array, count).

Rates in MHz.

request,byte,array
1.00,0.62,0.68
2.00,1.07,1.36
3.00,1.07,1.36
4.00,1.65,2.49
5.00,1.66,2.49
6.00,1.65,2.49
7.00,1.66,2.48
8.00,2.31,4.57
9.00,2.30,4.57
10.00,2.29,4.57
11.00,2.29,4.57
12.00,2.29,4.58
13.00,2.29,4.57
14.00,2.29,4.57
15.00,2.89,7.66
16.00,2.90,7.68
17.00,2.88,7.70
18.00,2.88,7.67
19.00,2.87,7.64
20.00,2.88,7.68
21.00,2.87,7.71
22.00,2.89,7.67
23.00,2.90,7.64
24.00,2.88,7.68
25.00,2.89,7.66
26.00,2.88,7.63
27.00,2.87,7.68
28.00,2.87,7.68
29.00,2.88,7.66
30.00,3.30,11.54
31.00,3.32,11.47
32.00,3.31,11.60
33.00,3.32,11.54
34.00,3.30,11.47
35.00,3.30,11.64
36.00,3.31,11.54
37.00,3.30,11.57
38.00,3.30,11.57
39.00,3.29,11.51
40.00,3.30,11.64
41.00,3.30,11.57
42.00,3.30,11.47
43.00,3.30,11.60
44.00,3.30,11.57
45.00,3.31,11.44
46.00,3.32,11.57
47.00,3.30,11.47
48.00,3.30,11.60
49.00,3.30,11.54
50.00,3.31,11.51
51.00,3.27,11.60
52.00,3.28,11.57
53.00,3.32,11.51
54.00,3.29,11.64
55.00,3.30,11.54
56.00,3.30,11.47
57.00,3.28,11.60
58.00,3.30,11.41
59.00,3.31,11.47
60.00,3.57,15.88

Here is a logic analyzer trace of the 60 MHz array transfer:

The logic analyzer clock is 500 MHz, so the samples are 2 ns apart. The SPI clock pulses are a bit distorted since they are not a multiple of 2 ns.

The time between bytes is 550 ns. At 60 MHz with no gaps the time would be 133.3 ns. So the effective rate is about 18.2 MHz.

Here is the program I used:

#include <SPI.h>
const uint8_t CS = 10;
uint8_t buf[] = {0xaa, 0x55, 0, 0xff};
void setup() {
  pinMode(CS, OUTPUT);
  digitalWrite(CS, HIGH);
  SPI1.begin();
  SPI1.beginTransaction(SPISettings(60000000UL, MSBFIRST, SPI_MODE0));
  Serial.begin(9600);
  while (!Serial) {}
  Serial.println("ready");
  while (!Serial.available()) {}
  digitalWrite(CS, LOW);
  SPI1.transfer(buf, sizeof(buf));
  digitalWrite(CS, HIGH); 
  Serial.println("Done");
}
void loop() {}

Hi @fat16lib - I need to reread some of this and see what things to try changing.

I have been playing around some with the different File systems on the GIGA and Portenta H7 boards and I have also found some issues with their speed.

Also the Portenta has an SDIO setup for an SDCard, which works slightly faster:
As I showed in the other thread:

And you confirmed something I saw on the Portenta: SdFat on SPI did not work above 30 MHz. Note: the SDIO is running using their own library and not SdFat. Not sure how hard it would be to integrate it into SdFat.

Side note: I have a display driver for the ILI9341 that works on these two boards, where I bypassed the SPI library and output directly to the hardware registers, which removed those gaps. Also optionally using DMA...

It is not worth using an SDIO driver with such a small gain. A good DMA SPI driver would be about four times faster than the current SPI driver.

An SDIO driver does not help much if it does transfers like the Teensy DMA driver.

Only huge transfers go fast with modern cards. I was able to do that on Teensy with the FIFO driver.

The Teensy FIFO driver starts a transfer and keeps the SD in block transfer mode as long as possible.

Modern SD cards have huge flash pages and pipe-lined buffering so they need huge transfers to go fast.

That is the reason for dedicated SPI mode. In shared SPI mode block mode ends when I raise chip select.

Here is an example that demonstrates how important correct handling of block transfers is.

This is a Due with a DMA SPI driver I wrote.

In dedicated SPI mode where I maintain block transfer mode:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4524.89,2806,110,111
4528.99,127,110,111

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4508.57,113,111,111
4516.71,113,111,111

Here is shared SPI mode with same SPI driver and SPI clock but small block transfers because of shared SPI.

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
456.70,33478,949,1119
457.16,31790,949,1118

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1056.86,1798,338,482
1058.87,2369,338,481

About a factor of ten faster for write and over four for read.

One more thing: note that shared mode can have a write latency of over 30 ms. Dedicated-mode write latency is much shorter.


Let me emphasize how important it is to select the proper SPI driver mode for SdFat performance.

You will get a factor of four better performance if you edit SdFatConfig.h and select these options.

Make this change here

#define SPI_DRIVER_SELECT 1 // was zero

And this change here

#define USE_SPI_ARRAY_TRANSFER 1 // was zero

I can no longer select the best mode for every board since there are too many Arduino style boards.

It is sad that an Arduino Due with a DMA driver is so much faster than the GIGA R1.


Thanks, I tried making these changes and ran the image viewer sketch I mentioned and it did drop down the time on the larger images from maybe 30 seconds to 11-13 seconds.

I am using shared SPI, as both the display and the SD are on the only SPI bus defined on the Portenta.

I see better utilization of the SPI bus. This is when it is reading the image in from the SD card (channels are MOSI, SCLK, MISO, CS TFT, DC TFT, CS SD). It shows some glitches on the CS pins...

Here is the display update, where I am not using DMA...

Using DMA does not look too different:

I have not studied their SDIO code much, I know the header files for it are:

#ifdef ARDUINO_PORTENTA_H7_M7
#include "SDMMCBlockDevice.h"
#include "FATFileSystem.h"
#endif

Also it does not create normal c++ classes like SDFat, but instead like their USB File system code, it uses an SDIO c interface, like: fsopen, fsclose, fsread... I would assume that it should support DMA operations as well. I have not tried that yet. Took me awhile to figure out how to get anything to work with DMA as I did not find any examples in any of the code... I also verified that with simple Arduino sketches, neither the DMA1 nor DMA2 objects are enabled, when the setup() code is called, nor are any of the DMA Channels showing any usage.

May try looking at it at some point. But I am simply just doing this for the curiosity.


I am using shared SPI, as both the display and the SD are on the only SPI bus defined on the Portenta.

Too bad - it's costing you a huge factor of performance.

I never seem to be able to explain the SD transfer problem. It's not the rate to send a 512 byte block.
The 512 bytes go at SPI or SDIO speed. Then, if you use shared SPI or a typical SDIO implementation without the correct block transfer mode, the SD goes busy for a long time before allowing another block to be transferred.

So even if the SDIO transfer is at 50 MHz for a 512 byte block, the busy time will kill performance for almost all the SDIO implementations I know.

I have not seen a STM32 SDIO driver that can handle this problem. I could not do it with Teensy DMA.

It is extreme with Teensy DMA write. Read is not much better.

Here is Teensy 4.1 with 512 byte DMA SDIO at 50 MHz:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
177.76,21173,1516,2879
177.12,21376,1602,2890

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2165.30,348,197,236
2164.36,347,197,236

A factor of a hundred slower for write with DMA. The average transfer time for a block is about 2885 usec. Almost all is SD busy programming a 32-64 KB flash page for each 512 byte write then copying to a new page for the next write.

Maybe a 100 MB/sec internal rates and burning flash.

Here is Teensy 4.1 with 512 byte FIFO SDIO at 50 MHz:

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
21095.70,3495,22,23
21185.08,3662,22,23

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
22829.59,133,22,22
22829.59,133,22,22

Note the average write latency is 23 usec, more than a factor of 100 faster.


Started playing along with this. With a SanDisk Ultra A1 64 GB card, 60 MHz gives me an error but 58 MHz works, though I am not seeing much of a performance improvement over 30 MHz.

Type any character to start
FreeStack: -12032
Type is exFAT
Card size: 63.86 GB (GB = 1E9 bytes)

Manufacturer ID: 0X3
OEM ID: SD
Product: SC64G
Revision: 8.0
Serial number: 0XCE8B50DD
Manufacturing date: 12/2019

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.
-----------------------------------------------------------------------
    write speed and latency read speed and latency
    speed,max,min,avg       speed,max,min,avg
CLK KB/Sec,usec,usec,usec   KB/Sec,usec,usec,usec
--- ----------------------  ---------------------
16  339.98,37734,1475,1503  341.67,1526,1469,1497
    341.00,13939,1472,1498  341.28,1526,1484,1499

30  388.29,9825,1296,1316   392.25,1331,1288,1304
    387.18,37172,1297,1320  392.29,1327,1289,1303
    
50  388.93,10025,1291,1313  392.29,1328,1285,1303
    387.00,14520,1291,1314  388.90,1337,1290,1315
    
55  387.27,37628,1297,1320  391.33,1334,1285,1307
    388.60,2059,1295,1315   391.33,1331,1285,1307

58  384.62,45491,1294,1318  388.78,1341,1287,1315
    388.81,9960,1293,1314   391.95,1335,1285,1305

Tried to repeat this test and again not much difference
30 MHz

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1356.77,1233,372,376
1343.64,36133,372,379

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1377.70,377,367,370
1377.32,378,367,370

55 MHz

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1353.83,5088,372,376
1344.00,31641,372,378

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1378.84,377,367,370
1378.46,377,367,370

But definitely a lot better than without SPI array transfers.

Started playing along with this. With a SanDisk Ultra A1 64 GB card, 60 MHz gives me an error but 58 MHz works, though I am not seeing much of a performance improvement over 30 MHz.

That's because the SPI library uses a simple power-of-two divisor.

If you request a given clock rate, it uses the divisor of 60 MHz that gives the highest rate less than or equal to the requested rate.

When you request 58 MHz it uses a divisor of two, so you get 30 MHz for all requests in this range:

30 MHz <= request < 60 MHz
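
The rule can be sketched as a small model. This is only an illustration that assumes a 60 MHz base clock and power-of-two divisors, which matches the steps in the measured table earlier in the thread; it is not taken from the Arduino or mbed SPI source, and the names and the divisor cap are hypothetical.

```cpp
#include <cstdint>

// Model only: 60 MHz base clock with power-of-two divisors.
// kMaxDivisor is a hypothetical cap so the loop always terminates.
constexpr double kBaseHz = 60000000.0;
constexpr std::uint32_t kMaxDivisor = 256;

// Return the fastest modeled clock that does not exceed the requested rate.
double actualSpiHz(double requestedHz) {
  std::uint32_t divisor = 1;
  while (divisor < kMaxDivisor && kBaseHz / divisor > requestedHz) {
    divisor *= 2;
  }
  return kBaseHz / divisor;
}
```

Under this model actualSpiHz(58e6) and actualSpiHz(30e6) both give 30 MHz, actualSpiHz(29e6) drops to 15 MHz, and only a request of exactly 60 MHz runs at 60 MHz, which is the same staircase seen in the request,byte,array table.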

If the SPI driver was properly implemented with DMA, 30 MHz would give something near this:

Rate KB/sec = 30,000,000/8 = 3,750 KB/sec. Probably a bit less like 3,400 KB/sec

With my Pi Pico SPI driver at 62.5 MHz I get 7,170 KB/sec instead of the theoretical 7,800 KB/sec.

Thanks for the explanation - was wondering - kind of disappointing.

Been doing a bit of googling and poking around mBed and came across a couple of interesting tidbits (note this is more just to document this DMA with SPI that I found):
Information on Asynchronous SPI limitation on High Speeds #14689

### Asynchronous SPI limitation

The current Asynchronous SPI implementation will not be able to support high speeds (MHz Range).
The maximum speed supported depends on

  • core operating frequency
  • depth of SPI FIFOs (if available).

For application that require optimized maximum performance, the recommendation is to implement the DMA-based SPI transfer.
The SPI DMA transfer support shall be implemented on a case-by-case based on below example
GitHub - ABOSTM/mbed-os at I2C_SPI_DMA_IMPLEMENTATION_FOR_STM32L4

and yes SPI_ASYNCH is enabled for both cores from what I found.

That link would lead you to the following commits
Comparing ARMmbed:master...ABOSTM:I2C_SPI_DMA_IMPLEMENTATION_FOR_STM32L4 · ARMmbed/mbed-os (github.com)
which actually came from here: STM32 DMA support · Issue #10057 · ARMmbed/mbed-os (github.com)

Haven't done much else, but there may be some things in common with the ILI9341 DMA code. Started giving me a headache looking at it :)


@Merlin513

Been doing a bit of googling and poking around mBed and came across a couple of interesting tidbits

mbed seemed like a good idea when it was selected by Arduino for earlier boards. I was using it on a LPC project.

mbed has turned into a nightmare for many companies. This is part of a post that mirrors my experience.

winston_orwell_smith

mbed was great in the early days when it only supported a couple of LPC microcontroller targets. It was a higher abstraction layer that was well thought out. It was more intuitive, versatile and easy to use than the arduino libraries in my mind.

And then they started adding hundreds of other microcontroller targets. This was the first point of failure for mbed. Every microcontroller that was ported to mbed had varying degrees of support. But peripheral support on each microcontroller wasn't documented. So you'd end up buying an mbed supported board thinking it was fully supported only to find out it isn't. You then end up wrestling with lower level APIs to get your task done....which defeats the purpose of using mbed in the first place.

You then had this library ecosystem evolve that only supported some targets but not others...the whole thing became a hot mess real quick.

Adding further insult to injury, almost every port of mbed ran on top of, and relied on the target microcontroller's vendor HAL Libraries. In some cases the vendor HAL itself had 2-3 layers of code abstraction. This made mbed very bloated and if the HAL libraries where buggy, the higher level mbed just didn't function as intended despite being bug free, because of their reliance on the lower level HALs. Finding these bugs is not easy and again spending hours hunting for them, defeats the purpose of what mbed was initially trying to be....an easy to use, fast prototyping, bug-free, high level API for microcontrollers that still offered a high degree of granularity/control.

You could try to add STM32 DMA SPI but then why not start over for all other broken mbed features.

My answer is to buy Arduino mbed boards so I can support users of my libraries but never use them in one of my personal projects.

Think I will agree with you. The H7 seems to be really nice, but with all the layers of Mbed that I am seeing and what is or is not implemented, it's a bit frustrating.

What I have also noticed is that some folks are building with STMCube and using that code to run their projects but don't know much about how to do that.


@Merlin513

What I have also noticed is that some folks are building with STMCube and using that code to run their projects but don't know much about how to do that.

STM32Cube is a little better but I have fought a battle to get DMA SPI support in the STMicroelectronics Arduino, STM32duino, package which is based on STM32Cube and failed.

STM32Cube is helpful getting clocks and startup code working.

It has the layers of HAL problem since there are so many STM32 chips and the peripherals vary between chip families. Plan on a headache going through the layers.

Edit: there is generic support in STM32duino for STM32H747XI but I wouldn't use it if you want an improvement over Arduino.

Bare metal STM32Cube might be O.K. for an important custom project.

I have built some really successful STM32 projects using ChibiOS with its HAL. Don't know if it has any support for STM32H747XI.

The author of ChibiOS works for ST and designed a great OS and HAL for STM32.

The STM32H747XI is an incredible chip with some of the best peripherals of any chip.

The ADCs can automatically sequence through a list of channels with custom settings for each channel and deliver the results over DMA at over 3 MSPS.

The SDMMC controller is capable of up to 104 MB/sec from an SD card.

Almost every other peripheral has the possibility of extreme performance.

The cost/resources to deliver this performance are probably beyond what Arduino can afford.


Yes, as far as I can tell, they have not implemented DMA for anything yet. As I mentioned, I ran a sketch that showed that none of the DMA stuff was configured when my sketch started up. To be fair, my sketch only did that... But my query of the code bases did not find any examples either.

I was taking a look at what is happening on the SDIO port when I try to load an image:
The Red channel is the CLK...

When you zoom in some, there are pretty good gaps between things. My quick and dirty debug showed that the code is calling their block driver to read mostly 1K reads, some 512...


Each of those smaller sections appears to be about 1K clocks.

May try something similar on another board with SDIO to see how they compare...

@KurtE

I was taking a look at what is happening on the SDIO port when I try to load an image:
The Red channel is the CLK...

I keep trying to tell you the handling of the SD by this driver will cause the SD to go busy for about 2 ms.

Looks like that is what I see on your trace.

I have no interest in this driver. It will never be really fast unless you do 32 KB transfers on 32KB boundaries.
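
To make the 32 KB condition concrete, here is a small check an application could run before issuing a transfer. It is only a sketch: the helper name is hypothetical, offsets are taken as byte offsets from the start of the card, and the 32 KB figure is simply the one quoted above.

```cpp
#include <cstdint>

constexpr std::uint64_t kChunkBytes = 32 * 1024;  // 32 KB, figure from the post

// True when a transfer starts on a 32 KB boundary and is a whole
// number of 32 KB chunks long.
bool isAlignedChunkTransfer(std::uint64_t byteOffset, std::uint64_t byteCount) {
  return byteCount > 0 && byteOffset % kChunkBytes == 0 &&
         byteCount % kChunkBytes == 0;
}
```

A 512 byte or 1K read, like the ones the block driver issues in the trace, fails this check regardless of where it starts.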

Here is a SDIO driver I wrote for Pi Pico RP2040. It runs at 62.5 MHz clock. A 512 byte transfer takes 17.1 μs and the gap between blocks is 2.4 μs.

Notice the gaps in clock, Channel 0, with my driver. That allows me to use a multi-block read or write to the SD with pauses between blocks. I keep the pipeline going without sending commands for each 512 byte block.

Notice I have no command/response on CMD, channel 1. Your traces have activity on CMD.

Here is the output from bench for 512 byte reads and writes.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
25641.03,4755,38,39
25706.94,4709,38,39

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
25239.78,903,39,40
25239.78,1235,39,40


Yep - as you said about a 2ms gap. Which obviously is slow... Will punt for now. Hard to know how much time I want to spend on this, when I am just doing it for the fun of it. So far, I am not seeing very much interest or feedback from those who use or sell these boards.
