UNO R4 SPI Performance and 16 bit transfers

DrN · August 15, 2024, 10:26pm

I hope to use the UNO R4 SPI to read an external ADC, and I would like to be able to clock the reads as fast as the part can possibly do it.

For completeness, I'll mention that the ADC is an MCP33131D. It requires a 700 nsec sampling time, and the transfer time for 16 bits at 12 MHz should be 1.3 usec. Altogether that is 2 usec, so we might expect to clock the ADC at something close to 500 kSPS.
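The arithmetic behind that estimate can be written out as a small sketch. The constant and function names below are illustrative only and not part of any library; the 700 nsec and 12 MHz figures are the ones quoted above, and software overhead is assumed to be zero.

```cpp
// Figures quoted in the post; names are hypothetical, chosen for illustration.
constexpr double kSamplingTimeNs = 700.0;   // CNVST high time, as stated above
constexpr double kSpiClockHz     = 12.0e6;  // SPI clock
constexpr int    kBitsPerSample  = 16;

// Time to shift one sample across the bus, in nanoseconds.
constexpr double transferTimeNs(int bits, double clockHz) {
    return bits * 1.0e9 / clockHz;
}

// Best-case sample rate if sampling and transfer run back to back
// with no software overhead at all (an idealization).
constexpr double maxSampleRateHz(double samplingNs, int bits, double clockHz) {
    return 1.0e9 / (samplingNs + transferTimeNs(bits, clockHz));
}
```

With these inputs the transfer takes about 1.33 usec and the ceiling is roughly 492 kSPS, so any per-call setup time comes straight out of the achievable rate.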

Okay, here is the code for a (perhaps idealized) single 16 bit read, followed by an oscilloscope trace for the CNVST pin (lower trace) and the SPI clock (upper).

Notice that there are 2.5 usec from the trailing edge of CNVST before the transfer starts, and then the transfer occurs not as a single 16 bit transfer but as two 8 bit transfers separated by yet another 1.2 usec. That is a huge problem.

digitalWrite(CNVSTPIN,HIGH);
DELAYNANOSECONDS(700);
digitalWrite(CNVSTPIN,LOW);

dataword = SPI.transfer16(0xFFFF);

UNOR4SPI01

Now for comparison, here is what that same operation looks like on the Teensy 4.0. The second image is with the scope set to a faster time scale to show that it really is a single 16 bit transfer, and that the setup time is just a little under 180 nsec.

TeensySPI01

TeensySPI02

As it turns out, the NXP part has a setting to control the number of bits per transfer. Do we have that in the RA4M1?

The setup time might be consistent with the processor clock speeds (600 MHz / 48 MHz = 12.5, and 2.5 usec / 180 nsec is about 14).
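As a rough check on that comparison, the Teensy setup time can be scaled by the clock ratio. This is only a sketch: it assumes both boards spend the same number of CPU cycles in setup, which different cores (Cortex-M7 vs Cortex-M4) and different library code do not guarantee. The names are hypothetical.

```cpp
// Clock rates and the measured Teensy setup time, as quoted in the post.
constexpr double kTeensyClockHz = 600.0e6;
constexpr double kUnoR4ClockHz  = 48.0e6;
constexpr double kTeensySetupNs = 180.0;   // "a little under 180 nsec" on the scope

// Scale a measured time from one CPU clock to another, assuming the
// same cycle count on both (an assumption, not a measured fact).
constexpr double scaledSetupNs(double measuredNs, double fromHz, double toHz) {
    return measuredNs * fromHz / toHz;
}

// Predicted UNO R4 setup time under that assumption.
constexpr double kPredictedR4SetupNs =
    scaledSetupNs(kTeensySetupNs, kTeensyClockHz, kUnoR4ClockHz);
```

The prediction comes out at 2250 nsec, in the same range as the 2.5 usec seen on the scope, which is consistent with the setup cost being mostly CPU cycles rather than something inherent to the SPI peripheral.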

But then with the slower clock, that means that for loops, we might want to factor out some of the SPI setup and just do whatever is the minimum to trigger each transfer.

Recall (at the top), we need to toggle a line between transfers, so we cannot just do this as a long contiguous block transfer.

Any help?

DrN · August 15, 2024, 11:00pm

Thanks

Actually, this looks interesting.

On quick perusal it looks like for more than 4 bytes it does 32 bit transfers. So that seems to show how to do it. Aside, I found it by searching for the reference to the command register where the transfer length is set.

That said, it looks like too much work to ramp up on the part and implement the extension. The author or maintainer would likely do it in far less time.

github.com

arduino/ArduinoCore-renesas/blob/731b943a4b2d6973e05d4cb303e7d8356dd29d51/libraries/SPI/SPI.cpp#L221C51-L317C1


      
          void ArduinoSPI::transfer(void *buf, size_t count)
          {
              if (NULL == buf) {
                  return;
              }
          
              if (_is_sci) {
                  _spi_cb_event[_cb_event_idx] = SPI_EVENT_TRANSFER_ABORTED;
          
                  _write_then_read(&_spi_sci_ctrl, buf, buf, count, SPI_BIT_WIDTH_8_BITS);
          
                  for (auto const start = millis();
                      (SPI_EVENT_TRANSFER_COMPLETE != _spi_cb_event[_cb_event_idx]) && (millis() - start < 1000); )
                  {
                      __NOP();
                  }
                  if (SPI_EVENT_TRANSFER_ABORTED == _spi_cb_event[_cb_event_idx])
                  {
                      end();
                  }

This file has been truncated.

ptillisch · August 16, 2024, 12:21am

In order to make all relevant information available to any who are interested in this subject, I'll share a link to the formal feature request @DrN submitted to the Arduino developers:

github.com/arduino/ArduinoCore-renesas

Need true 16 bit SPI transfer

opened 11:25PM - 15 Aug 24 UTC

drmcnelson

type: enhancement topic: code

Hi, I hope to use your library to read an ADC, and I need to be able to do the transfers with the best possible speed. One application, for example, is the readout for a linear CCD (4k of 16 bit samples). In my lab we read long records at 500 kSPS to study slow charge mobility at microampere currents in organic electronics. We are now trying to adapt this to the UNO R4 (an RA4M1).

I found some issues with transfer16(). Here is the code and a scope image for transfer16() on the UNO R4. The upper trace is the SPI clock. The lower trace is the CS pin, which I am using to assert the CNVST of the ADC. Data is available 700 nsec after the rising edge of CNVST, and the falling edge enables the SPI transfer.

Notice that there is a 2.5 usec setup time for the transfer, and then the transfer occurs as two 8 bit transfers separated by 1.2 usec rather than as a single 16 bit transfer. I checked your code, and indeed it is done as two 8 bit transfers.

So, I think what is needed is a true 16 bit transfer16(). You already do something like that in the buffered transfer (the next function in the source file after transfer16()). Also, for reading a long record from the ADC, it would be great if some of that setup time could be moved outside the loop.

I hesitate to do this myself. Looking at your code, you did a great job and you are already quite expert in the part. Thank you.

```
digitalWrite(CNVSTPIN,HIGH);
DELAYNANOSECONDS(700);
digitalWrite(CNVSTPIN,LOW);
dataword = SPI.transfer16(0xFFFF);
```

![UNOR4SPI01](https://github.com/user-attachments/assets/c9712d1f-6ceb-40f7-91eb-a19e4ea59934)

DrN · August 16, 2024, 2:49am

Was that a faux pas?

It is not really a feature request, it's a correction. The 16 bit transfer is not working correctly the way it is coded. And ultimately, the patch has to get into the library. So, posting the issue to the owner seems like the right thing to do.

Sorry if it was a misstep.

DrN · August 16, 2024, 2:57am

It is not clear how DMA would work for a series of 16 bit transfers to read some number of words from the ADC.

Recall that the CNVST pin has to be pulsed before each 16 bit transfer.

And most often the reads are clocked anyway.

GolamMostafa · August 16, 2024, 5:07am

Figure-1:

1. Fig-1 says that any host can read the 16-bit data (for MCP33131D-10 device) by generating 16 SCLK pulses. This is a 16-bit operation.

2. If UNOR3 is used and SPI.transfer() is executed, then the user must issue two consecutive (optional delay in-between) transfer() methods.

3. If UNOR3 is used and SPI.transfer16() is executed, the UNOR3 breaks the command into two consecutive SPI.transfer() instructions with a delay in between (as seen in your scope traces). This is because the SPI peripheral of the ATmega328P MCU on the Arduino UNO is byte (8-bit) oriented.

4. Steps 2 and 3 apply equally to the UNOR4, as the high-level SPI methods were designed for the 8-bit UNOR3 and then carried over to the UNOR4 unchanged to maintain compatibility (post #10 @Delta_G).

5. To avoid the delay in between, one can bit-bang the bus and complete the 16-bit transfer in just 1.3 us.
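Points 2 and 3 describe a 16-bit read built from two 8-bit transfers. A minimal sketch of the byte handling follows, assuming the high byte goes first (the actual bit order should be checked against the ADC data sheet and the SPI settings in use). The helper names are hypothetical and are not the library's own.

```cpp
#include <cstdint>

// High byte of a 16-bit word: the first byte on the wire when sending MSB first.
constexpr std::uint8_t firstByte(std::uint16_t word) {
    return static_cast<std::uint8_t>(word >> 8);
}

// Low byte of a 16-bit word: the second byte on the wire when sending MSB first.
constexpr std::uint8_t secondByte(std::uint16_t word) {
    return static_cast<std::uint8_t>(word & 0xFF);
}

// Reassemble the two received bytes into one 16-bit sample.
constexpr std::uint16_t joinBytes(std::uint8_t hi, std::uint8_t lo) {
    return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

The byte shuffling itself costs almost nothing. The gap visible on the scope comes from issuing two separate transfers, each with its own software overhead.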

ptillisch · August 16, 2024, 6:09am

Submitting the issue definitely is not.

I do prefer for people to cross-link when they post about the same subject matter to multiple platforms. The reason is that the information that comes from the resulting discussion on one platform might be very valuable to the interested parties who see the post on another platform (keep in mind that forum posts and GitHub issues often serve as valuable references to many people who find them in search results during research on related subjects for years to come).

If you don't cross-link between the two, the interested parties on one platform are likely to be unaware of the existence of relevant discussion on other platforms. This often results in duplication of effort. That is the sole reason why I added a link to the GitHub issue here, as I stated.

DrN · August 16, 2024, 1:02pm

My hope is that what I am doing will be generally useful, at least for aspiring and underfunded scientists. And it is important to use official libraries as far as possible.

For the 8 bit platforms it will be what it is, but the way the 16 bit transfer is coded is really a serious misstep in the SPI library. Look again at the traces: it takes a solid 8 usec per 16 bit transfer! That is pretty terrible. If the R4 is meant to be a toy, then fine, but I think the R4 deserves better than that.

Imagine specs or readmes that read something like "1 MSPS on the Teensy 4 and 100 kSPS on the Arduino R4", or "for this sensor input board, you need a T4, not an R4". It seems like the R4 should be fast enough, and this kind of constraint should be totally unnecessary.

Practical use cases:

A) For the CCD interfaces, the popular Toshiba for example, you need at least 500 kSPS.

B) I am also working on a photon counting input (with a SiPM). It tags each photon with the arrival time down to 25 picosecond resolution. For reasons of quantum statistics, the count rate is typically about 10% of the trigger rate.

And so forth....

DrN · August 16, 2024, 1:14pm

Okay, great thank you. I am trying to make the case that this really needs to be fixed.

As an aside, the digital write and read also seem quite (and unnecessarily) slow. Is there a digitalWriteFast() etc. for the R4? Is it much better?

For those who are interested, there is an update to my tool for measuring timing, here

DrN · August 16, 2024, 3:07pm

Touché.

The API would be a nice way to make code at least somewhat portable across platforms, if it was done well.

For example, that benchmark runs on both the R4 and the T4 using the same calls for most of it. There is a conditional compile section for the part that only the T4 does.

At this moment I am setting up a baseline program to talk to the board, run scripts, and record the outputs, with the intent of dumping outputs for the R4 and T4. We'll see how things compare in a moment.

GolamMostafa · August 16, 2024, 3:30pm

This has nothing to do with the RA4M1 MCU. It is a feature of the SPI.h library that introduces the delay in between 8-bit transactions.

To avoid the delay, you may proceed with direct bitbang.

DrN · August 16, 2024, 4:48pm

Okay, here are the results, please see the attached files.

All of this is at the above github link, see post #15. The source code there is updated also, and there is a python host side program and scripts to generate the results.

Notice that the R4 spends about 72 cycles for digitalRead(), 21 for digitalWrite() and 120 for the improvised version of digitalToggle().

The Teensy 4 does it in 26 cycles for digitalRead(), 12 cycles for digitalWrite() and 54 for digitalToggle().

Interrupt latency for the R4 is 182 cycles; for the T4 it is 120 cycles. The worst case latencies are 312 vs 198. In other words, they fluctuate by 71% and 65% respectively.

Now for the SPI, the R4 actually spends fewer cycles in the SPI transfer call at 327, compared to 515 for the T4. But the T4 MCU clock is about 12 times faster. See the scope images for the actual timing.
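To put those cycle counts on a common footing, they can be converted to wall time. This is a sketch only, assuming the 48 MHz and 600 MHz clock rates mentioned earlier in the thread and taking the cycle counts above at face value. The function names are hypothetical.

```cpp
// CPU clock rates as quoted earlier in the thread.
constexpr double kUnoR4ClockHz  = 48.0e6;
constexpr double kTeensyClockHz = 600.0e6;

// Convert a CPU cycle count to microseconds at a given clock rate.
constexpr double cyclesToUs(double cycles, double clockHz) {
    return cycles * 1.0e6 / clockHz;
}

// Worst-case excess over the typical value, as a fraction of the typical value.
constexpr double worstCaseExcess(double typicalCycles, double worstCycles) {
    return (worstCycles - typicalCycles) / typicalCycles;
}

// SPI transfer call duration on each board, from the benchmark cycle counts.
constexpr double kR4SpiCallUs     = cyclesToUs(327.0, kUnoR4ClockHz);
constexpr double kTeensySpiCallUs = cyclesToUs(515.0, kTeensyClockHz);
```

Under those assumptions the R4 call takes about 6.8 usec and the Teensy call about 0.86 usec, so the R4 is slower by a factor of roughly 8 in wall time despite its lower cycle count. The same helper gives latency spreads of about 71% (312 vs 182) and 65% (198 vs 120).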

There are some other interesting quirks in when TimerOne launches the first iteration of the attached function one platform vs the other, see the scope images on the github.

So, there you have it.

Teensy4benchmark.log (3.2 KB)
UNOR4benchmark.log (2.6 KB)

GolamMostafa · August 16, 2024, 5:03pm

When SPI.transfer(0x38) is executed, 0011 1000 are transmitted over the MOSI line while the clocking pulses are asserted on the SCK line -- is it not a bitbang (carried out by hardware)?

DrN · August 16, 2024, 6:30pm

The Renesas code though would mean another conditional compile section for the UNOR4. It's okay, I would rather they fix the SPI library.

Actually, there is one trick I could use. With the manufacturer libraries for the T4 and for the R4 as they stand now, that setup time is just lost time, so I could run the CNVST pin toggles in that time instead.

For the T4, it's not needed because the setup is less than the conversion time and pin writes are fast, so I only need to adjust the wait to get the SPI going at the earliest allowed time.

For the R4, the pin write by itself is longer than the conversion time.

DrN · August 16, 2024, 6:47pm

Well, yes, I am using digitalWrite().

You are advocating abandoning large parts of the Arduino API to make it work.

That's not a great testimonial to the API.

Having a uniform API for many different boards was one of the attractions of doing this.

So far it looks like things are moving towards just getting it going on the R4.

DrN · August 16, 2024, 8:06pm

Again, I am not b-tching about it, I am trying to make the case for fixing it.

You are saying it is pointless, they did a fast hack job, they don't care, and no matter what, they will not fix it.

The kind of error in that spi::transfer16() does look like that. The very next function does do 32 bit transfers. So the author certainly knows how to do it. That makes them look terrible.

It is widely known that software sells hardware. It would be in Arduino's interest to be more concerned about the api.

DrN · August 16, 2024, 8:34pm

Okay, I give up. Can I use the FSP in the Arduino IDE?

What do I need to download or install? And, where is the documentation?

Thank you

Here is what did it for me. Following is a picture of the plastic base plate and UNO R4. Notice where the cut-outs are relative to the board outline.

When I first got the R4, it was not working. Then I saw that the base plate was pushing on the cable. After removing it from the baseplate it worked.

I am wondering whether writing up all of the above, in a review and posting it widely, would help move things in a better direction at the manufacturer, or just turn everybody in the community against the writer.

UNOR4_base_plate_mismatch.p600

DrN · August 16, 2024, 8:35pm

Great, thank you. Where is the documentation? Are there some example codes?

DrN · August 16, 2024, 8:40pm

Yes, I know how to google.

Even if I had found that, I would still have to ask if that matches what is inside the IDE environment.

Asking somebody who has experience with it, is the most efficient first step.

This goes to why other primates are not building rocket ships.

(I wonder how apes sign "do you know how to google", probably something like hand on tuchish).

DrN · August 16, 2024, 9:58pm

Great that sounds like a time-saver, thank you.
