UNO R4 SPI Performance and 16 bit transfers

I hope to use the UNO R4 SPI to read an external ADC, and I would like to be able to clock the reads as fast as the part can possibly do it.

For completeness, I'll mention that the ADC is an MCP33131D. It requires a 700 nsec sampling time, and the transfer time for 16 bits at 12 MHz should be 1.3 usec. So altogether that is 2 usec, and we might expect to clock the ADC at something close to 500 kSPS.

Okay, here is the code for a (perhaps idealized) single 16 bit read, followed by an oscilloscope trace for the CNVST pin (lower trace) and the SPI clock (upper).

Notice that from the trailing edge of CNVST there are 2.5 usec before the transfer starts, and then the transfer occurs not as a single 16 bit transfer but rather as two 8 bit transfers separated by yet another 1.2 usec. That is a huge problem.

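digitalWrite(CNVSTPIN, HIGH);       // raise CNVST
DELAYNANOSECONDS(700);              // hold it for 700 nsec
digitalWrite(CNVSTPIN, LOW);        // lower CNVST before the read

dataword = SPI.transfer16(0xFFFF);  // read the 16 bit result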

UNOR4SPI01

Now for comparison, here is what that same operation looks like on the Teensy 4.0. The second image is with the scope set to a faster time scale, to show that it really is a single 16 bit transfer and that the setup time is just a little under 180 nsec.

TeensySPI01

TeensySPI02

As it turns out, the NXP part has a setting to control the number of bits per transfer. Do we have that in the RA4M1?

The setup time might be consistent with the processor clock speeds (600 MHz/48 MHz = 12.5, and 2.5 usec/180 nsec is about 14).

But with the slower clock, that means that for loops we might want to factor the SPI setup out of the loop and do only the minimum needed to trigger each transfer.

Recall (at the top), we need to toggle a line between transfers, so we cannot just do this as a long contiguous block transfer.

Any help?

Thanks

Actually, this looks interesting.

On quick perusal it looks like for more than 4 bytes it does 32 bit transfers. So that seems to show how to do it. Aside, I found it by searching for the reference to the command register where the transfer length is set.

That said, it looks like too much work to ramp up on the part and implement the extension. The author or maintainer would likely do it in far less time.

In order to make all relevant information available to any who are interested in this subject, I'll share a link to the formal feature request @DrN submitted to the Arduino developers:


Was that a faux pas?

It is not really a feature request, it's a correction. The 16 bit transfer is not working correctly the way it is coded. And ultimately, the patch has to get into the library. So, posting the issue to the owner seems like the right thing to do.

Sorry if it was a misstep.

It is not clear how DMA would work for a series of 16 bit transfers to read some number of words from the ADC.

Recall that the CNVST pin has to be pulsed before each 16 bit transfer.

And most often the reads are clocked anyway.


Figure-1:

1. Fig-1 says that any host can read the 16-bit data (for the MCP33131D-10 device) by generating 16 SCLK pulses. This is a 16-bit operation.

2. If the UNO R3 is used and SPI.transfer() is executed, then the user must issue two consecutive transfer() calls (with an optional delay in between).

3. If the UNO R3 is used and SPI.transfer16() is executed, the UNO R3 breaks the command into two consecutive SPI.transfer() calls with a delay in between (as seen in your scope traces). That is because the SPI peripheral of the ATmega328P MCU on the Arduino UNO is byte (8-bit) oriented.

4. Steps 2 and 3 apply equally to the UNO R4, because the high-level SPI methods were designed for the 8-bit UNO R3 and then carried over to the UNO R4 unchanged to maintain compatibility (post #10 @Delta_G ).

5. To avoid the delay in between, one can just bit-bang the bus and complete the 16-bit transfer in just 1.3 us.
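The bit-bang idea in item 5 can be sketched in a hardware-independent way. This is only an illustration: bitbangRead16, setClock and readData are hypothetical names, the two callables stand in for whatever fast pin access the board offers, and sampling each bit while the clock is high is an assumption that should be checked against the ADC datasheet. Timing is not modeled, so whether it reaches 1.3 us depends entirely on how fast the pins can be driven.

```cpp
#include <stdint.h>

// Reads one 16-bit word, MSB first, by toggling the clock by hand.
// setClock(bool) drives SCLK high or low; readData() returns the current
// level of the ADC data output. Both are hypothetical callables supplied
// by the caller, e.g. thin wrappers around direct port-register access.
template <typename SetClock, typename ReadData>
uint16_t bitbangRead16(SetClock setClock, ReadData readData) {
    uint16_t word = 0;
    for (int bit = 0; bit < 16; ++bit) {
        setClock(true);   // rising edge
        // Assumption: the data line is valid while the clock is high.
        word = static_cast<uint16_t>((word << 1) | (readData() ? 1u : 0u));
        setClock(false);  // falling edge, device presents the next bit
    }
    return word;
}
```

On real hardware the two callables would be small lambdas around the pin writes and reads, and the CNVST pulse would still be issued before each call.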

Submitting the issue definitely is not.

I do prefer for people to cross-link when they post about the same subject matter to multiple platforms. The reason is that the information that comes from the resulting discussion on one platform might be very valuable to the interested parties who see the post on another platform (keep in mind that forum posts and GitHub issues often serve as valuable references for years to come, for the many people who find them in search results while researching related subjects).

If you don't cross-link between the two, the interested parties on one platform are likely to be unaware of the existence of relevant discussion on other platforms. This often results in duplication of effort. That is the sole reason why I added a link to the GitHub issue here, as I stated.

My hope is that what I am doing will be generally useful, at least for aspiring and underfunded scientists. And it is important to use official libraries as far as possible.

For the 8 bit platforms it will be what it is, but the way the 16 bit transfer is coded is really a serious misstep in the SPI library. Look again at the traces, it takes a solid 8usec per 16 bit transfer!!! That is pretty terrible. If the R4 is to be a toy, well then fine. I think the R4 deserves better than that.

Imagine specs or READMEs that read something like "1 MSPS on the Teensy 4 and 100 kSPS on the Arduino R4". Or, "for this sensor input board, you need a T4, not an R4". It seems like the R4 should be fast enough and this kind of constraint should be totally unnecessary.

Practical use cases:

A) For the CCD interfaces, the popular Toshiba for example, you need at least 500 kSPS.

B) I am also working on a photon counting input (with a SiPM). It tags each photon with the arrival time down to 25 picosecond resolution. For reasons of quantum statistics, the count rate is typically about 10% of the trigger rate.

And so forth....

Okay, great thank you. I am trying to make the case that this really needs to be fixed.

Aside, the digital write and read also seem quite (and unnecessarily) slow. Is there a digitalWriteFast() etc. for the R4? Is it much better?

For those who are interested, there is an update to my tool for measuring timing, here

Touché.

The API would be a nice way to make code at least somewhat portable across platforms, if it was done well.

For example, that benchmarking code runs on both the R4 and the T4 using the same calls for most of it. There is a conditional compile section for the part that only the T4 does.

I am at this moment setting up a baseline program to talk to the board, run scripts, and record the outputs, with the intent of dumping outputs for the R4 and T4. We'll see how things compare in a moment.

This has nothing to do with the RA4M1 MCU. It is a feature of the SPI.h library that introduces the delay between the 8-bit transactions.

To avoid the delay, you may proceed with direct bitbang.

Okay, here are the results, please see the attached files.

All of this is at the above github link, see post #15. The source code there is updated also, and there is a python host side program and scripts to generate the results.

Notice that the R4 spends about 72 cycles for digitalRead(), 21 for digitalWrite() and 120 for the improvised version of digitalToggle().

The Teensy 4 does it in 26 cycles for digitalRead(), 12 cycles for digitalWrite() and 54 for digitalToggle().

Interrupt latency for the R4 is 182 cycles, for the T4 it is 120 cycles. The worst case latencies are 312 vs 198. In other words, they fluctuate by 71% and 65% respectively.

Now for the SPI, the R4 actually spends fewer cycles in the SPI transfer call at 327, compared to 515 for the T4. But, the T4 MCU is 12 times faster. See the scope images for the actual timing.

There are some other interesting quirks in when TimerOne launches the first iteration of the attached function on one platform versus the other. See the scope images on the GitHub page.

So, there you have it.

Teensy4benchmark.log (3.2 KB)
UNOR4benchmark.log (2.6 KB)

When SPI.transfer(0x38) is executed, 0011 1000 are transmitted over the MOSI line while the clocking pulses are asserted on the SCK line -- is it not a bitbang (carried out by hardware)?

The Renesas code though would mean another conditional compile section for the UNOR4. It's okay, I would rather they fix the SPI library.

Actually there is one trick I could use with the manufacturer libraries for the T4 and for the R4. As it stands now, that setup time is just lost time, so I could run the CNVST pin toggles in that time instead.

For the T4, it's not needed because the setup is less than the conversion time and pin writes are fast, so I only need to adjust the wait to get the SPI going at the earliest allowed time.

For the R4, the pin write by itself is longer than the conversion time.

Well, yes, I am using digitalWrite().

You are advocating abandoning large parts of the Arduino API to make it work.

That's not a great testimonial to the API.

Having a uniform API for many different boards was one of the attractions of doing this.

So far it looks like things are moving towards just getting it going on the R4.

Again, I am not b-tching about it, I am trying to make the case for fixing it.

You are saying it is pointless, they did a fast hack job, they don't care, and no matter what, they will not fix it.

The kind of error in that spi::transfer16() does look like that. The very next function does do 32 bit transfers. So the author certainly knows how to do it. That makes them look terrible.

It is widely known that software sells hardware. It would be in Arduino's interest to be more concerned about the api.

Okay, I give up. Can I use the FSP in the Arduino IDE?

What do I need to download or install? And, where is the documentation?

Thank you

Here is what did it for me. Following is a picture of the plastic base plate and UNO R4. Notice where the cut-outs are relative to the board outline.

When I first got the R4, it was not working. Then I saw that the base plate was pushing on the cable. After removing it from the baseplate it worked.

I am wondering whether writing up all of the above, in a review and posting it widely, would help move things in a better direction at the manufacturer, or just turn everybody in the community against the writer.

UNOR4_base_plate_mismatch.p600

Great, thank you. Where is the documentation? Are there some example codes?

Yes, I know how to google.

Even if I had found that, I would still have to ask if that matches what is inside the IDE environment.

Asking somebody who has experience with it, is the most efficient first step.

This goes to why other primates are not building rocket ships.

(I wonder how apes sign "do you know how to google", probably something like hand on tuchish).

Great that sounds like a time-saver, thank you.