Ethernet2 (UDP) SPI transfers have a lot of dead time

Are you communicating with other hosts on your local ethernet? Or will you be talking to hosts on the internet, accessed through routers and high-latency links?

I just did a comparison between 28MHz and 42MHz. Although the clocking was indeed faster, the deadtime increased to keep the 100 byte transfer at about 53us in both cases. I couldn't get 84MHz to work; it seemed to still clock at 42 :frowning:

Yes, that matches my findings in reply 26. The SAM3X SPI hardware works without deadtime at up to 16.8MHz clock. Beyond this rate, the deadtime increases to cancel out the byte transfer time improvement because a wall has been hit.

According to the datasheet, DMA will optimize transfer rate ... I guess the DMA improvement would be noticed only if the SPI clock is set higher than 16.8MHz.

This looks interesting...

32.7.3.9 Peripheral Deselection with DMAC
When the Direct Memory Access Controller is used, the chip select line will remain low during the whole transfer since the TDRE flag is managed by the DMAC itself. The reloading of the SPI_TDR by the DMAC is done as soon as TDRE flag is set to one.

Did you try the TurboSPI library (turbospi.h) for the SAM3X? It is on GitHub as anydream/TurboSPI: another Arduino SPI library, separated from the SdFat library, that uses SPI registers and DMA (Arduino Due only) to accelerate SPI communication.

Yeah, now that's more like it!

SPI SCK @ 42MHz, no deadtime, 100 bytes transferred in 19.04µs!

5.25 MBps (42Mbps)
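For anyone who wants to check those figures, here is a minimal sketch of the arithmetic. The function names are made up for illustration; the only inputs are the numbers quoted in this thread (100 bytes, 42MHz SCK, 19.04µs measured).

```cpp
// Ideal time, in microseconds, to clock out `bytes` bytes at `sck_hz`
// when there is no gap at all between bytes (8 clocks per byte).
constexpr double ideal_transfer_us(unsigned bytes, double sck_hz) {
    return bytes * 8.0 / sck_hz * 1e6;
}

// Effective throughput in MB/s for a measured transfer time.
// Bytes per microsecond is numerically the same as megabytes per second.
constexpr double throughput_MBps(unsigned bytes, double measured_us) {
    return bytes / measured_us;
}
```

ideal_transfer_us(100, 42e6) comes out at about 19.05µs, so the measured 19.04µs really is gap-free, and throughput_MBps(100, 19.04) gives about 5.25. The earlier 53µs result works out to roughly 1.9 MB/s by the same calculation.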

#include <TurboSPI.h>

TurboSPI    g_SPI;
DigitalPin  g_PinCS, g_PinRS;
uint8_t     g_Buffer[100];  // some data buffer to transfer
uint8_t     g_Divisor = 2;  // transfer speed set to MCU's clock divide by 2

void setup()
{
  // setup pins
  g_PinCS.Begin(45);
  g_PinRS.Begin(47);

  g_PinCS.PinMode(OUTPUT);
  g_PinRS.PinMode(OUTPUT);

  // setup SPI
  g_SPI.Begin();

  // fill the buffer with data
  for (uint8_t i = 0; i < sizeof(g_Buffer); i++) {
    g_Buffer[i] = i + 1;
  }
}

void loop()
{
  // setup speed and select slave
  g_SPI.Init(g_Divisor);
  g_PinCS.Low();

  // set some pins
  g_PinRS.High();

  // transfer data to slave
  g_SPI.Send(g_Buffer, sizeof(g_Buffer));

  // unselect slave
  g_PinCS.High();
}

Neither at that time; this was an SPI-only test of 100 bytes.

dlloyd:
Yes, that matches my findings in reply 26. The SAM3X SPI hardware works without deadtime at up to 16.8MHz clock. Beyond this rate, the deadtime increases to cancel out the byte transfer time improvement because a wall has been hit.

Oops, I forgot to include my 21MHz result. It was slower than 28MHz, about 65us if memory serves...
Have you done this?

// Edit occurrences of SPI_CSR_DLYBCT(1) to SPI_CSR_DLYBCT(0) in
//  C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.cpp
// and C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.h
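As a rough idea of what that edit is worth: as far as I recall from the SAM3X datasheet, the delay between consecutive transfers is 32 × DLYBCT / MCK. Treat that formula and the 84MHz MCK as assumptions to check against the datasheet; the function names below are invented for this sketch.

```cpp
// Assumed from the SAM3X datasheet: delay between consecutive transfers
// = 32 * DLYBCT / MCK. Returns the gap after each byte in nanoseconds.
constexpr double dlybct_gap_ns(unsigned dlybct, double mck_hz) {
    return 32.0 * dlybct / mck_hz * 1e9;
}

// Total gap time in microseconds over a transfer of `bytes` bytes,
// assuming one gap per byte.
constexpr double total_gap_us(unsigned bytes, unsigned dlybct, double mck_hz) {
    return bytes * dlybct_gap_ns(dlybct, mck_hz) / 1000.0;
}
```

With DLYBCT(1) and MCK at 84MHz that is roughly 381ns per byte, or about 38us over 100 bytes; with DLYBCT(0) the term drops to zero. It says nothing about the software overhead of feeding the data register, so it won't reproduce the measured times in this thread exactly.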

dlloyd:
Yeah, now that's more like it!

SPI SCK @ 42MHz, no deadtime, 100 bytes transferred in 19.04µs!

That's rather good. It isn't obvious to me whether it receives at the same time as transmitting. If it does, then this improves the SPI part of my project (I'm effectively building an Ethernet to FPGA bridge; I don't want to control the wiznet directly, hence the Due. I have the FPGA and Due talking via SPI full duplex).
The real task now is to try and modify the Ethernet library to use TurboSPI. Not a task I'm looking forward to; there goes the weekend :smiley:

Yes, it should receive at the same time. The MISO line is on pin 47.

Been doing a LOT of work and experimentation with the Due SPI/DMA and finally decided to implement all SPI transfers via DMA.

The throughput is... great!

I have an OLED1351 @ 128*128 rgb565 and can fill it at > 20 fps off SD card.

The SD card has a DIV=4 and OLED DIV=5 and they both play nicely.

I found in the low level routines that called write8() it made a large difference as to whether the function was inline or not.

I got probably a 20-30% speedup by forcing inline.

I also broke the write8() functions down into 2 to cut down on overhead of setting the same registers multiple times.

 __INLINE__ uint8_t cDMA_spi_send_do_wait_buffer() 
{
 while (!due_dma_dmac_channel_transfer_done(DUE_DMA_SPI_TX_CH)) {}

 while ((SPI0->SPI_SR & SPI_SR_TXEMPTY) == 0) {}

 // leave RDR empty
 return  SPI0->SPI_RDR;
}

// new routines 8 bit send -- DMA --

__INLINE__ void cDMA_spi_send_again(uint8_t b, bool wait)
{
 __src8 = b;

 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_SADDR = (uint32_t)&__src8;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CTRLA = DMAC_CTRLA_BTSIZE(1) | DMAC_CTRLA_SRC_WIDTH_BYTE | DMAC_CTRLA_DST_WIDTH_BYTE;
 due_dma_dmac_channel_enable(DUE_DMA_SPI_TX_CH);
 
 if (wait)
 {
 cDMA_spi_send_do_wait_buffer();
 }
}

__INLINE__ void cDMA_spi_send(uint8_t b, bool wait)
{
// due_dma_dmac_channel_disable(DUE_DMA_SPI_TX_CH);
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_DSCR = 0;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_DADDR = (uint32_t)&SPI0->SPI_TDR;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CTRLB = DMAC_CTRLB_SRC_INCR_INCREMENTING | DMAC_CTRLB_SRC_DSCR | DMAC_CTRLB_DST_DSCR | DMAC_CTRLB_FC_MEM2PER_DMA_FC | DMAC_CTRLB_DST_INCR_FIXED;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CFG = DMAC_CFG_DST_PER(DUE_DMA_SPI_TX_IDX) | DMAC_CFG_DST_H2SEL | DMAC_CFG_FIFOCFG_ALAP_CFG | DMAC_CFG_SOD; 

 cDMA_spi_send_again(b, wait);
}

cDMA_spi_send(b1, false);

// do some other work here, as there are spare cycles before the request ends


cDMA_spi_send_do_wait_buffer(); // now wait

cDMA_spi_send_again(b2, true);
cDMA_spi_send_again(b3, true);

With the ability to control whether to wait or not it allows some work to be done "for free". For the video streamer it means the pixel processing and looping is basically done for free as it's done while I would usually be waiting for the DMA request to end.

Also have the 16 bit send functions that don't require changing modes etc which also made a huge difference.

Just waiting on an AD5330 DAC so I can test the sound output with video. ATM the video is running 2x - 4x the normal speed so confident I should be able to support video with sound.

Made a SPIDevice class I use for all my projects. It will do things like check the DIV every time the chip is selected. However, it's important to only reset the DIV if necessary as it's a costly operation.

bool cDMA_spi_check_div(uintX_t sckDivisor, bool dma)  // check .. really need to do before each send to make sure each device is at correct speed etc
{
	// may be SPI lib or DMA 

	if (dma && last_div_dma != sckDivisor) 
	{
		last_div_dma = sckDivisor;

		SPI0->SPI_CR = SPI_CR_SPIDIS;   //  disable SPI
		SPI0->SPI_CR = SPI_CR_SWRST; // reset SPI
		SPI0->SPI_MR = SPI_PCS(DUE_DMA_SPI_CHIP_SEL) | SPI_MR_MODFDIS | SPI_MR_MSTR; // no mode fault detection, set master mode
		SPI0->SPI_CSR[DUE_DMA_SPI_CHIP_SEL] = SPI_CSR_SCBR((uint8_t)sckDivisor) | SPI_CSR_NCPHA; // mode 0, 8-bit

		SPI0->SPI_CR |= SPI_CR_SPIEN; // enable SPI
		
		return true;
	}

	return false;	
}
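The "only reset the DIV if necessary" idea can be tried off-target with a small mock. This is not the poster's SPIDevice class, just a sketch of the same caching pattern; FakeSpi and SpiDevice are hypothetical names, and the fake bus only counts how often the expensive disable/reset/reconfigure sequence would run.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the SPI peripheral. divisor == 0 means
// "not configured yet"; reconfigure() represents the costly reset sequence.
struct FakeSpi {
    uint8_t divisor = 0;
    unsigned reconfigurations = 0;
    void reconfigure(uint8_t div) { divisor = div; ++reconfigurations; }
};

// Hypothetical per-device wrapper: each device remembers the divisor it
// needs, and the bus is reprogrammed only when that differs from the
// divisor currently set.
class SpiDevice {
public:
    SpiDevice(FakeSpi& bus, uint8_t divisor) : bus_(bus), divisor_(divisor) {
        assert(divisor != 0);  // 0 is reserved here for "not configured"
    }

    // Call when the chip is selected. Returns true if the bus was reconfigured.
    bool select() {
        if (bus_.divisor == divisor_) return false;  // already at the right speed
        bus_.reconfigure(divisor_);
        return true;
    }

private:
    FakeSpi& bus_;
    uint8_t divisor_;
};
```

With an SD card at DIV=4 and an OLED at DIV=5 sharing one bus, back-to-back selects of the same device cost nothing extra; only switching between the two devices triggers a reconfiguration.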

I suppose the other thing worth mentioning is I totally gutted SdFat to use an external library for all SPI and made it FAT32 only. It's about as lean and mean as I can get.

AFAIK inlining prevents the compiler from adding a prologue (push {registers}) and an epilogue (pop {registers}) around a function call, so logically it should be faster, at the price of a larger code size.
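The definition of the __INLINE__ macro used earlier isn't shown in this thread. A minimal sketch of one way to force inlining, assuming GCC or Clang (which the Due toolchain is), could look like this. FORCE_INLINE, fake_tdr and fake_writes are hypothetical names so the snippet compiles off-target.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: GCC or Clang. always_inline asks the compiler to inline the
// function even where its own heuristics (for example at -Os) would not.
#define FORCE_INLINE inline __attribute__((always_inline))

// Hypothetical stand-ins for a data register and a write counter.
static volatile uint8_t fake_tdr = 0;
static unsigned fake_writes = 0;

// Small low-level write, forced inline so callers pay no call overhead.
FORCE_INLINE void write8(uint8_t b) {
    fake_tdr = b;
    ++fake_writes;
}

// With write8 inlined, the loop body is a plain store plus an increment,
// with no branch-and-link or register save/restore per byte.
inline unsigned send_all(const uint8_t* data, unsigned n) {
    assert(data != nullptr || n == 0);
    for (unsigned i = 0; i < n; ++i) write8(data[i]);
    return n;
}
```

Whether this actually gives a 20-30% gain depends on the optimization level and how hot the call site is, so it is worth measuring rather than assuming.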

PDC DMA and AHB DMA are surely the best options for speeding things up on a Due whenever they are available.

Just a quick followup to this very old thread...

I recently released Ethernet library version 2.0.0, which brings my many optimizations originally written for Teensy to all Arduino boards, including Arduino Due.

W5200 & W5500 now utilize SPI.transfer(buffer, size). I made many other optimizations, including native register I/O to avoid the slow digitalWrite on Due, and important higher level optimizations. Details and benchmarks here:

https://www.pjrc.com/arduino-ethernet-library-2-0-0/

To get version 2.0.0, just use the library manager to update your Ethernet lib.

Do you have any like for like timing comparisons with the official wiznet library?

This library, right?

I installed the "Arduino IDE 1.5.x" version just now. It doesn't work with my Seeed W5500 shield (with w5100.h edited to select W5500 - this lib doesn't auto-detect which chip you have).

It does work with my Arduino Ethernet R3 shield. The speed measures 9.83 kbytes/sec. Ethernet 2.0.0 gets 109.73 kbytes/sec.

I switched to Arduino Uno. Wiznet's library does work with the Seeed W5500 shield when using Uno. The speed is 139.99 kbytes/sec. For comparison, Ethernet 2.0.0 gets 329.00 kbytes/sec on the same test with Uno, and 689.69 kbytes/sec with Due.

I also retested W5100 (Arduino Ethernet R3). Wiznet's library gets 10.17 kbytes/sec (yes, slightly faster than the 9.83 kbytes/sec it gets with Arduino Due). Ethernet 2.0.0 gets 82.66 kbytes/sec when using W5100 on Uno, and 109.73 kbytes/sec with Due.

Without a doubt, Ethernet 2.0.0 is much faster than Wiznet's library.

For one final test, I put the Arduino.org Ethernet2 shield on Arduino Due. Wiznet's library does work with this shield. I don't know why it fails on the Seeed W5500 shield. Both work on Due with Ethernet 2.0.0.

Arduino Due with the W5500-based Arduino.org Ethernet2 speed is 394.80 kbytes/sec. Ethernet 2.0.0 gets 695.35 kbytes/sec with that shield on Due.

Thanks for sharing those results. Is there a result for the wiznet library on the 5500/Due?

weird_dave:
Thanks for sharing those results. Is there a result for the wiznet library on the 5500/Due?

Yes.

"Arduino Due with the W5500-based Arduino.org Ethernet2 speed is 394.80 kbytes/sec. Ethernet 2.0.0 gets 695.35 kbytes/sec with that shield on Due."

With all these optimizations in Ethernet 2.0.0 (removing the many prior bottlenecks in the Ethernet library), I believe these benchmarks would at least double on Arduino Due if someone were to optimize SPI.transfer(buffer, length) well. Much of the hard work for those SPI optimizations has been done in the messages on this thread. But hardly anyone will ever benefit until someone goes to the trouble of actually updating Due's SPI library.

OK, I took the Arduino.org Ethernet2 library not to mean the Wiznet library since they are different (or were last time I checked).
When I get some spare time, I'll give it a go, thanks for the effort :slight_smile:

I recently released Ethernet library version 2.0.0

Wait - you got all that improvement WITHOUT implementing write-only SPI functions?
Wow.
(Hmm. Not that I'm sure that a write-only SPI would be much faster. Mostly just ... easier? No more overwriting your output buffer (?))

Yup, the old Ethernet lib was horribly inefficient on every level.

Due's SPI library is still very inefficient, which holds back Due's performance to ~700 kbytes/sec. If someone were to improve the SPI lib, I believe Due could probably even outperform Teensy (where the SPI lib is highly optimized) on these tests, because Due is the only board that can actually produce a 14 MHz SPI clock. Pretty much all the others use 8 or 12 MHz when SPISettings asks for 14 MHz max.
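A rough sketch of why the achievable clock differs between boards, under simplifying assumptions: the Due is modelled as an 84MHz clock with any integer divider from 1 to 255, and an AVR-style board as a clock with power-of-two dividers from 2 to 128. The function names are invented for this example, and real SPI libraries may choose dividers differently.

```cpp
#include <cstdint>

// Highest clock not exceeding max_hz when any integer divider 1..255 is
// allowed (assumed model of the Due's SPI clock divider).
constexpr uint32_t best_clock_integer_div(uint32_t f_clk, uint32_t max_hz) {
    for (uint32_t d = 1; d <= 255; ++d) {
        if (f_clk / d <= max_hz) return f_clk / d;
    }
    return f_clk / 255;  // slowest available if nothing fits
}

// Highest clock not exceeding max_hz when only power-of-two dividers
// 2..128 are allowed (assumed model of an AVR-style SPI peripheral).
constexpr uint32_t best_clock_pow2_div(uint32_t f_clk, uint32_t max_hz) {
    for (uint32_t d = 2; d <= 128; d *= 2) {
        if (f_clk / d <= max_hz) return f_clk / d;
    }
    return f_clk / 128;  // slowest available if nothing fits
}
```

Asking for 14MHz max, the 84MHz integer-divider case lands exactly on 14MHz (84 / 6), a 16MHz power-of-two case falls back to 8MHz, and a 48MHz power-of-two case gives 12MHz, which lines up with the 8 or 12MHz figures mentioned above.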

The SPI lib on Due isn't my project. My dev cycles are funded by Teensy sales. All this optimization work came from Teensy's fork of Ethernet. Occasionally I try to contribute Teensy's improvements back to the rest of the Arduino community, so everyone can benefit. Hope everyone gets some good use from it. :slight_smile: