Uno R4 Poor SPI Performance

I just received an R4 Minima and an R4 WiFi. Since I am the author of the SdFat library, I ran some performance tests. The results were very disappointing. I had hoped the R4 would perform better than the R3, since it has a max SPI clock rate of 24 MHz vs 8 MHz for the R3.

Here are the results:

R3 single byte transfer standard SPI library:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
417.64,1244,1216,1219
417.64,1244,1216,1219

R4 single byte transfer standard SPI library:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
82.98,7164,5160,6162
82.98,7176,5160,6163

R4 array transfer standard SPI library:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
429.96,2650,634,1182
430.40,2638,634,1181

R3 array transfer with my custom functions:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
689.18,760,728,736
689.18,760,728,736

I also tried the SD.h library which is based on a modified 2009 version of SdFat. It also has very poor performance and has no array transfer option.

I then ran the following SPI test program on R3 and R4 with the standard SPI library.

#include "SPI.h"
#define CS_PIN 10
#define SPI_CLOCK 24000000

void send(bool loop) {
  uint8_t data[] = {'A', 'B', 'C'};
  SPI.beginTransaction(SPISettings(SPI_CLOCK, MSBFIRST, SPI_MODE0));
  digitalWrite(CS_PIN, LOW);
  if (loop) {
    for (size_t i = 0; i < sizeof(data); i++) {
      SPI.transfer(data[i]);
    }
  } else {
    SPI.transfer(data, sizeof(data));
  }
  digitalWrite(CS_PIN, HIGH);
  SPI.endTransaction();
}

void setup() {
  Serial.begin(9600);
  pinMode(CS_PIN, OUTPUT);
  digitalWrite(CS_PIN, HIGH);
  SPI.begin();
  send(true);
  send(false);
}
void loop() {
}

Here are the results.

Uno R3 byte transfer:
Clock is 8 MHz.
1.75 us/byte or 571 KB/sec

This is what I expected, with a fair gap between bytes.

Uno R4 byte transfer:
Clock is 24 MHz.
11.87 us/byte or 84.2 KB/sec
Huge gaps between bytes.

Uno R4 array transfer:
Clock is 24 MHz.
2.75 us/byte or 364 KB/sec

Strange clock. The first byte has eight pulses evenly spaced. After the first byte there are seven pulses, a big gap and the final pulse.
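To put the measured figures in perspective, here is a minimal sketch of the arithmetic, assuming 1 KB = 1000 bytes (which matches the numbers above) and 8 clock pulses per byte. The function names are made up for illustration and are not part of any library.

```cpp
// Ideal time on the wire for one byte: 8 clock pulses at the SPI clock rate.
constexpr double wireUsPerByte(double clockHz) {
  return 8.0 * 1e6 / clockHz;
}

// Throughput in KB/sec (1 KB = 1000 bytes) for a measured time per byte in us.
constexpr double kbPerSec(double usPerByte) {
  return 1000.0 / usPerByte;
}

// Fraction of the measured time during which the clock is actually running.
constexpr double busUtilization(double clockHz, double measuredUsPerByte) {
  return wireUsPerByte(clockHz) / measuredUsPerByte;
}
```

With these, the R3 at 8 MHz keeps the bus busy about 57% of the time (1.0 us ideal vs 1.75 us measured). The R4 at 24 MHz has an ideal time of about 0.33 us/byte, so the byte transfer (11.87 us) uses under 3% of the bus and the array transfer (2.75 us) about 12%.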

I hope Arduino has some fix for the SPI library. This is a huge disappointment. The R4 seemed like a great replacement for the R3.

Guess I should post an issue on Github.

hey @fat16lib

Thank you for reporting this.
I'm a huge fan of your lib, and getting these boards out in the wild will definitely bring out issues such as this one.
For the R4 our FW team worked with backwards compatibility as a main priority, but performance is indeed important.
Please post to GitHub if you have the chance, but I'll be forwarding this to our FW team.
Your captures are the best kind of bug reporting :slight_smile:

will get someone to chime in on this thread
:v:


@fat16lib

was chatting with one of my team mates and he pointed out that while the AVR SPI directly manipulates registers, for the R4 core we tend to wrap our own API around Renesas FSP.

There is a transfer array, though.
Wanna take a shot at it?

Any feedback is more than welcome :pray:

It looks like there is now a related report here:


The R4 results above use array transfer to get a big improvement over single byte transfer.

It is still very slow for 24 MHz SPI. I would expect SdFat to deliver about 2,500 KB/sec. I get about 430 KB/sec.

I implemented an array transfer for SAMD21 that gets 1360 KB/sec at 12 MHz. This is about three times the Arduino Core SAMD21 result.

Here is a link to that mod:

I posted some ideas and comments on the Arduino Github site here:

I also posted comments about array transfer here.


hey @fat16lib

that's an interesting issue and I'm happy one of our Firmware Team members is a part of it.
I'm sure any input will be looked at closely.
Interesting to see also Rudolph in the conversation, he works on the FTDI FT81x library which has a SPI API and benefits from high speeds.

Alex is very skilled, hope he can chime in and support the discussion.
I'm not part of the Firmware Team but having worked on the R4 I'm trying to keep some conversations alive and highlighted :wink:

Thank you for the feedback and the investigation.
Fingers crossed it gets waaaaaay better :slight_smile:

There is a version of the SPI library here.

It proves it is possible to get about 2,500 KB/Sec transfer rate at 24MHz, the max supported SPI clock.

I don't know if it will ever be accepted by Arduino since it is not based on FSP and the best API in this version for fast transfers to devices like SD cards is:

void transfer(const void* txBuf, void* rxBuf, size_t count)

This API is clearly unacceptable to Arduino even though it is available in FSP and in many third-party board support packages. Also the functionality is in MicroPython and most RTOSs.

I just ran the bench test (with the clock changed to 24 MHz) using the SPI library update here:
Reworked SPI class with direct register access to fix #28 for R_SPI by RudolphRiedel · Pull Request #45 · arduino/ArduinoCore-renesas (github.com)

Seems like a big improvement:

FreeStack: 27604
Type is FAT32
Card size: 31.91 GB (GB = 1E9 bytes)

Manufacturer ID: 0X3
OEM ID: SD
Product: ACLCD
Revision: 8.0
Serial number: 0X882B3DA3

FILE_SIZE_MB = 5
BUF_SIZE = 512 bytes
Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
601.40,4294967133,836,840
603.57,4294967133,836,847

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
584.25,4294967162,865,871
584.25,4294967162,865,871

Done

Notice something strange in the performance numbers Merlin513 posted? A max latency of 4294967133 μs.

Turns out micros() has a bug that randomly causes an added 1000μs. See this.
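The huge max latency is what an unsigned 32-bit subtraction produces when the end timestamp is smaller than the start timestamp. Here is a minimal host-side sketch of that arithmetic. The helper names are hypothetical, and the idea that the start reading was about 1000 μs too high is only one reading that fits the numbers, not a confirmed diagnosis.

```cpp
#include <cstdint>

// Elapsed time the way a benchmark typically computes it from two micros()
// readings: plain unsigned 32-bit subtraction, which wraps modulo 2^32.
std::uint32_t elapsedMicros(std::uint32_t start, std::uint32_t end) {
  return end - start;
}

// Reinterpret an implausibly large elapsed value as a signed difference.
// (Two's complement conversion; guaranteed since C++20, universal in practice.)
std::int32_t asSigned(std::uint32_t elapsed) {
  return static_cast<std::int32_t>(elapsed);
}
```

For example, `elapsedMicros(1163, 1000)` wraps to 4294967133, and `asSigned(4294967133u)` gives -163. That would be consistent with a write of roughly 837 μs whose start timestamp came back about 1000 μs too high.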

Also this could be the performance for SD read/write if Arduino didn't deny this proposed common SPI API: void transfer(const void* txBuf, void* rxBuf, size_t count)

Ignore max latency due to micros bug.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2245.17,4294966524,218,218
2245.17,4294966515,218,218

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2261.42,4294966513,216,217
2262.44,4294966513,216,217

Bill
Hope that the micros issue is fixed for the next release as it is used in many places, i.e., sketches and libraries.

Also in your last post you are showing high transfer rates:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2245.17,4294966524,218,218
2245.17,4294966515,218,218

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
2261.42,4294966513,216,217
2262.44,4294966513,216,217

How did you get those rates?

I wrote this function for SPI R4 with help from RudolphRiedel :

void ArduinoSPI::transfer(const void* txBuf, void* rxBuf, size_t count) {
  size_t n32 = count / 4;
  if (n32) {
    // Save regs
    uint32_t spcr = _spi_ctrl.p_regs->SPCR;
    uint32_t spdcr = _spi_ctrl.p_regs->SPDCR;
    uint32_t spcmd0 = _spi_ctrl.p_regs->SPCMD[0];
    uint32_t spdcr2 = _spi_ctrl.p_regs->SPDCR2;

    _spi_ctrl.p_regs->SPCR = R_SPI0_SPCR_SPE_Msk | R_SPI0_SPCR_MSTR_Msk;
    _spi_ctrl.p_regs->SPDCR = R_SPI0_SPDCR_SPLW_Msk;
    _spi_ctrl.p_regs->SPDCR2 = 1;
    _spi_ctrl.p_regs->SPCR2 = R_SPI0_SPCR2_SCKASE_Msk;
    // 32-bit transfer - can't find a symbol for 2
    _spi_ctrl.p_regs->SPCMD[0] = (spcmd0 & ~R_SPI0_SPCMD_SPB_Msk) | (2 << R_SPI0_SPCMD_SPB_Pos);
    const uint32_t* tx32 = (const uint32_t*)txBuf;
    uint32_t* rx32 = (uint32_t*)rxBuf;
    size_t ir = 0;
    size_t it = 0;
    while (it < 2 && it < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPTEF) {
        _spi_ctrl.p_regs->SPDR = txBuf ? tx32[it] : 0XFFFFFFFF;
        it++;
      }
    }
    while (it < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPRF) {
        uint32_t spdr = _spi_ctrl.p_regs->SPDR;
        _spi_ctrl.p_regs->SPDR = txBuf ? tx32[it] : 0XFFFFFFFF;
        if (rxBuf) {
          rx32[ir] = spdr;
        }
        ir++;
        it++;
      }
    }
    while (ir < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPRF) {
        uint32_t spdr = _spi_ctrl.p_regs->SPDR;
        if (rxBuf) {
          rx32[ir] = spdr;
        }
        ir++;
      }
    }
    _spi_ctrl.p_regs->SPCR = spcr;
    _spi_ctrl.p_regs->SPDCR = spdcr;
    _spi_ctrl.p_regs->SPCMD[0] = spcmd0;
    _spi_ctrl.p_regs->SPDCR2 = spdcr2;
  }
  if (count != 4 * n32) {
    uint8_t* rx = (uint8_t*)rxBuf;
    const uint8_t* tx = (const uint8_t*)txBuf;
    for (size_t i = 4 * n32; i < count; i++) {
      uint8_t tmp = transfer(txBuf ? tx[i] : 0XFF);
      if (rxBuf) {
        rx[i] = tmp;
      }
    }
  }
}

You won't be seeing it in the official SPI driver since it does not conform to the "Standard API".

You won't see this speed. I put it in my private version of SPI.h.
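For anyone who wants to try the txBuf/rxBuf calling convention without the register-level code, here is a minimal portable sketch built on a single-byte transfer. It mirrors the conventions visible in the function above (a null txBuf sends 0XFF, a null rxBuf discards received data) but has none of its speed. `ByteTransferFn`, `transferFallback` and `invertingDevice` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for a single-byte full-duplex transfer such as SPI.transfer(byte).
using ByteTransferFn = std::uint8_t (*)(std::uint8_t out);

// Byte-at-a-time version of transfer(txBuf, rxBuf, count).
// txBuf == nullptr: send 0xFF for every byte (pure read).
// rxBuf == nullptr: discard received bytes (pure write).
void transferFallback(ByteTransferFn xferByte, const void* txBuf, void* rxBuf,
                      std::size_t count) {
  const auto* tx = static_cast<const std::uint8_t*>(txBuf);
  auto* rx = static_cast<std::uint8_t*>(rxBuf);
  for (std::size_t i = 0; i < count; i++) {
    std::uint8_t in = xferByte(tx ? tx[i] : 0xFF);
    if (rx) {
      rx[i] = in;
    }
  }
}

// Fake device for host-side testing: answers with the inverted byte it received.
std::uint8_t invertingDevice(std::uint8_t out) {
  return static_cast<std::uint8_t>(~out);
}
```

A read is then `transferFallback(fn, nullptr, dst, n)` and a write is `transferFallback(fn, src, nullptr, n)`, with no buffer filling or copying on the caller's side.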

You can get better performance by using the transfer(buf, count) that will be accepted. Because that call overwrites buf with the received data, a read requires filling buf with 0XFF first, and a write requires a memcpy of buf to a tmp buffer.
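The fill and copy steps can be sketched as follows. This is an illustration under stated assumptions, not the SdFat implementation: `TransferFn` stands in for an in-place array transfer like transfer(buf, count), `kChunk` is an assumed scratch size of one 512-byte block, and `fakeCard` is a made-up test double so the logic can be checked on a PC.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for an in-place array transfer such as SPI.transfer(buf, count):
// it sends buf and overwrites it with the bytes received.
using TransferFn = void (*)(void* buf, std::size_t count);

constexpr std::size_t kChunk = 512;  // assumed scratch size (one SD block)

// Read: preload the buffer with 0xFF so only 0xFF is sent, then transfer in place.
void spiReceive(TransferFn xfer, std::uint8_t* dst, std::size_t count) {
  std::memset(dst, 0xFF, count);
  xfer(dst, count);
}

// Write: send a scratch copy so the caller's data is not overwritten.
void spiSend(TransferFn xfer, const std::uint8_t* src, std::size_t count) {
  std::uint8_t tmp[kChunk];
  while (count > 0) {
    std::size_t n = count < kChunk ? count : kChunk;
    std::memcpy(tmp, src, n);
    xfer(tmp, n);
    src += n;
    count -= n;
  }
}

// Fake device for host-side testing: records what was sent, "receives" zeros.
std::uint8_t lastMosi[kChunk];
void fakeCard(void* buf, std::size_t count) {
  std::size_t n = count < kChunk ? count : kChunk;
  std::memcpy(lastMosi, buf, n);
  std::memset(buf, 0x00, count);
}
```

The extra memset and memcpy passes are the price of the in-place API. A call with separate tx and rx buffers avoids both.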

SdFatConfig.h has options to select this API. Set these two defines:

#define SPI_DRIVER_SELECT 1
#define USE_SPI_ARRAY_TRANSFER 1

You may need the latest version of the SPI library here. It is still in development.

Edit: Here is the performance I get with the RudolphRiedel version that may be the basis for the standard SPI.h.

Ignore max latency due to the micros() bug. It's quite good.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1627.07,4294966613,304,304
1627.60,4294966601,304,303

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1698.37,4294966588,291,294
1698.37,4294966588,291,294

Just had a chance to get back to this - was busy getting the USB_HOST_SHIELD_2.0 library working with these boards.

Anyway, I did as you said, and with a 32 GB SanDisk Ultra I am getting very similar results:

Starting write test, please wait.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1611.86,4294966601,304,305
1555.69,4294966601,304,316

Starting read test, please wait.

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1711.74,4294966586,289,290
1713.50,4294966586,289,289

Big difference in performance.

They really should consider making this part of the SPI library. It could come in handy for some of the display stuff.

It does? I thought a pure read could have anything in the tx buffer. Or is this just an SD thing?

Or is this just an SD thing?

In SPI mode, SD cards and some other storage devices look for commands while sending read data.

If you put an SD card in multiple block read mode, you stop by sending a stop transmission command.

You can abort a read at any point. This means if you send junk you will have occasional mysterious failures.

For a while the Arduino SD.h library had this bug. SD.h is based on a 2009 version of my SdFat library. Someone modified it to read without filling the buffer.

I wish Arduino would realize SPI needs an improved SPI API for some devices. They used the FSP call write_read(txbuf, rxbuf, count) for the Arduino transfer(buf, count) by calling write_read(buf, buf, count).

The same is true on STM32, the vendor supplies the needed API but Arduino blocks access on the STM32 pro boards.

Edit: Probably doesn't matter for Portenta H7. Single byte SPI transfer is about half as fast as an Uno R3.

For the fun of it, I thought I would try an Adafruit ILI9341 shield and see if it works with the new board. In this case I am trying it with the Minima.

First I had to deal with the compiler error that wiring_private.h does not exist, which has been discussed before. I simply added an empty header file to the UNO R4 install.

Then built and ran it. And the screen was crawling! I am running the stock graphictest example sketch. I only slightly modified it and added:

while (!Serial && millis() < 5000);

This makes it wait for the Serial port to be ready so that the initial output is not lost.
Here are the timings:

ILI9341 Test!
Display Power Mode: 0x94
MADCTL Mode: 0x48
Pixel Format: 0x5
Image Format: 0x80
Self Diagnostic: 0xC0
Benchmark                Time (microseconds)
Screen fill              9127137
Text                     369105
Lines                    3557961
Horiz/Vert Lines         739891
Rectangles (outline)     470810
Rectangles (filled)      18944595
Circles (filled)         2014517
Circles (outline)        1557067
Triangles (outline)      823753
Triangles (filled)       6083081
Rounded rects (outline)  884688
Rounded rects (filled)   18826490
Done!

So I tried it on some other boards, including my old Rev 1 UNO. Sorry, it is the only UNO I have, except for another older one from Seeeduino, which I am not sure you can build for anymore.

The output from the original UNO:

ILI9341 Test!
Display Power Mode: 0x94
MADCTL Mode: 0x48
Pixel Format: 0x5
Image Format: 0x80
Self Diagnostic: 0xC0
Benchmark                Time (microseconds)
Screen fill              1496760
Text                     154492
Lines                    1268800
Horiz/Vert Lines         125416
Rectangles (outline)     82808
Rectangles (filled)      3107196
Circles (filled)         465876
Circles (outline)        541560
Triangles (outline)      281660
Triangles (filled)       1339884
Rounded rects (outline)  243744
Rounded rects (filled)   3132648
Done!

A quick calculation shows the screen fill was about 6 times faster on the original UNO.
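The ratio can be checked per benchmark from the two tables above. This is just a small helper for the division; `speedup` is a name chosen for this illustration.

```cpp
// How many times faster the second timing is than the first (both in microseconds).
constexpr double speedup(unsigned long slowerUs, unsigned long fasterUs) {
  return static_cast<double>(slowerUs) / static_cast<double>(fasterUs);
}
```

For the screen fill, `speedup(9127137UL, 1496760UL)` comes out at about 6.1, and the filled rectangles give nearly the same ratio. Text is hit less hard, at about 2.4.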

Note: this was run with the released SPI library. I will try updating to the newer one to see how much it helps.

Update: The new library fixed the major slow down:

ILI9341 Test!
Display Power Mode: 0x94
MADCTL Mode: 0x48
Pixel Format: 0x5
Image Format: 0x80
Self Diagnostic: 0xC0
Benchmark                Time (microseconds)
Screen fill              1235494
Text                     77757
Lines                    746807
Horiz/Vert Lines         101685
Rectangles (outline)     65691
Rectangles (filled)      2565266
Circles (filled)         318439
Circles (outline)        324945
Triangles (outline)      168301
Triangles (filled)       858611
Rounded rects (outline)  159260
Rounded rects (filled)   2564182
Done!


As an aside, a 4-wire 32-bit SPI transfer at an 8 MHz bit rate (the RA4M1's max slave rate) using SPI1 as a master and SPI0 as a slave takes about 7 us.
That's timed with my inline direct register accessing code - without a function call/return.
The RA4M1's peripheral status signalling etc. seems to have several sets of clock delays in them, testing with pin-state direct high/low writes shows that the status flags take a while to propagate - it's taken me a few days to work the wrinkles out! :slight_smile:

Hi, how did you capture these pulse timings? Did you use an app or other software?

Does the latest Renesas Core SPI update (July 2023: ArduinoCore-renesas/libraries/SPI at main · arduino/ArduinoCore-renesas · GitHub) improve performance?

If not, is there code that can be used to get better performance than the 328P 8MHz SPI?

Hi KurtE

I am trying to do just this: get the Adafruit ILI9341 library to work properly on my R4 WiFi. I am currently compiling with the latest Arduino IDE (2.3.2), so I assume, based on the pull request "Reworked SPI class with direct register access to fix #28 for R_SPI by RudolphRiedel · Pull Request #45 · arduino/ArduinoCore-renesas · GitHub", that this includes the fixed driver. However, I am getting even worse results than your original ones.

Could you give me some pointers on how/where to start with modifying to get it fixed?

Regards