Ethernet2 (UDP) SPI transfers have a lot of dead time

I’ve tidied up my test code so it can be posted here.
It’s currently set up to use Paul’s library (and the wiznet one, by renaming the folder). There are commented-out includes at the start for switching to Ethernet2, and lines 42-45 hold the timeout and retry configuration.

The wiznet library seems the fastest of the three, though oddly it increases the dead time on other SPI transfers. None of the libraries seems to use “transfer (buffer, size);”, which would boost the speed enormously without resorting to bare-metal code.

#include <SPI.h>
//#include <Ethernet2.h>
#include "Ethernet.h"
#include "w5100.h"
//#include <utility/w5500.h>
// Edit the SPI speed in C:\Users\[USERNAME]\Documents\Arduino\libraries\Ethernet2\src\utility\w5500.cpp
//  to 28000000, 28MHz, line 25: (copy/paste the following is easiest)
//  SPISettings wiznet_SPI_settings(28000000, MSBFIRST, SPI_MODE0);
//
// Edit occurrences of SPI_CSR_DLYBCT(1) to SPI_CSR_DLYBCT(0) in
//  C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.cpp
// and C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.h
//

byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
const IPAddress MyIP( 192, 168, 0, 10 );
const IPAddress Cab1_IP (192, 168, 0, 2); //this address exists
const IPAddress Cab2_IP (192, 168, 0, 1); //this doesn't, for testing retry and timeout
unsigned int local_Port = 12345;
unsigned int Cab1_Port = 12345;
unsigned int Cab2_Port = 12345;
EthernetUDP Udp;

const int EthTxBuf_Size = 100;
byte EthTxBuf[EthTxBuf_Size];
byte EthRxBuf[EthTxBuf_Size];
unsigned long current_micros;
const unsigned long looptime = 2000;
const int Testpin = 52;
const int Errorpin = 22;
const int Framepin = 32;
const int FPGA_SPIpin = 4;

void setup()
{
  Ethernet.begin(mac,MyIP);
  Udp.begin(local_Port);
  pinMode(Testpin, OUTPUT);
  pinMode(Errorpin, OUTPUT);
  pinMode(Framepin, OUTPUT);
  pinMode(FPGA_SPIpin, OUTPUT);
  //w5500.setRetransmissionCount(1);
  //w5500.setRetransmissionTime(1);
  W5100.setRetransmissionCount(uint8_t(1));
  W5100.setRetransmissionTime(uint16_t(1));
}

void loop()
{
  current_micros = micros();
  for (byte j=0; j<EthTxBuf_Size; j++)
  {
    EthTxBuf[j] = j;
  }
  digitalWrite(Framepin, HIGH);
  digitalWrite(Framepin, LOW);
  digitalWrite(Testpin, HIGH);

  Udp.beginPacket(Cab1_IP, Cab1_Port);
  Udp.write(EthTxBuf, EthTxBuf_Size);
  if (Udp.endPacket() == 0)
  {
    digitalWrite(Errorpin, HIGH);
    digitalWrite(Errorpin, LOW);
  }
  
  digitalWrite(Testpin, LOW);

  for (byte j=0; j<EthTxBuf_Size; j++)
  {
    EthTxBuf[j] = j;
  }

  SPI.beginTransaction(SPISettings(28000000, MSBFIRST, SPI_MODE0));
  digitalWrite(FPGA_SPIpin, LOW);
  SPI.transfer(EthTxBuf, EthTxBuf_Size);  // full duplex: received bytes overwrite EthTxBuf
  digitalWrite(FPGA_SPIpin, HIGH);
  SPI.endTransaction();
  
  digitalWrite(Testpin, HIGH);
  Udp.beginPacket(Cab2_IP, Cab2_Port);
  Udp.write(EthTxBuf, EthTxBuf_Size);
  if (Udp.endPacket() == 0)
  {
    digitalWrite(Errorpin, HIGH);
    digitalWrite(Errorpin, LOW);
  }
  digitalWrite(Testpin, LOW);
  while ((micros()- current_micros)<looptime)
  {
    
  }
}

weird_dave:
None of the libraries seems to use “transfer (buffer, size);”, which would boost the speed enormously without resorting to bare-metal code.

I decided to try a quick sanity check for this theory. I ran this code on an Arduino Due:

#include <SPI.h>

void setup() {
  SPI.begin();
  pinMode(10, OUTPUT);
}

void loop() {
  uint8_t data[5] = {0x55, 0x5A, 0x49, 0xAA, 0x96};
  digitalWrite(10, LOW);
  SPI.beginTransaction(SPISettings(25000000, MSBFIRST, SPI_MODE0));
  SPI.transfer(data, 5);
  SPI.endTransaction();
  digitalWrite(10, HIGH);
  delay(100);
}

Here is the rather disappointing result:

file.png

Then again, those 50% dead times between bytes are a LOT better than the overhead of calling SPI.transfer(byte) five times.

Here’s how bad that is:

file2.png

For comparison, here is Arduino Uno running the five SPI.transfer(byte) sketch:

file3.png

Even with only an 8-bit CPU running at one fifth the clock speed, the Uno at only 8 Mbit/sec transfers the 5 bytes in approximately the same total time as the Due at 21 Mbit/sec.

For completeness, here is how Uno performs with SPI.transfer(buf, 5):

file4.png

Your results with the Due match what I'm seeing. The Uno results put the Due to shame really given the core speed. I don't own an Uno to play with unfortunately, so thanks for sharing that research.

Does your library buffer the transfers? My results suggest it doesn't, but I recall reading that it did (or was supposed to). I suspect a buffered transfer could get a 100-byte transmission (that's the whole UDP SPI transfer) done in well under 100us at 28MHz. Also, could you confirm whether the timeout and retry count work with your library? They didn't seem to work for me :(

Some bare metal sprinkled in and no buffering …

#include <SPI.h>
uint8_t data[5] = {0x55, 0x5A, 0x49, 0xAA, 0x96};
byte count;

void setup() {
  SPI.begin(10);
  SPI.setClockDivider(10, 5);  // 16.8MHz Clock
  REG_SPI0_CSR &= 0x00FFFFFF;  // DLYBCT = 0
}

void loop() {
  while (1) {
    if ((REG_SPI0_SR & 2) != 0) { // transmit when data register empty
      REG_SPI0_TDR = data[count];
      count++;
      if (count == 5) count = 0;
    }
  }
}

SPI clock at 16.8 MHz: Without while loop, 2.17µs delay between transfers

With while loop: no delay between transfers, 0.5µs/byte, 50µs/100bytes

SPI clock at 21 MHz: 0.12µs delay, 0.5µs/byte including delay, 50µs/100bytes

SPI clock at 28 MHz: 0.22µs delay, 0.5µs/byte including delay, 50µs/100bytes
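
As a rough cross-check of the figures above, the per-byte cost can be modelled as eight SCK periods plus the measured gap between bytes. This is only a sketch that runs on a PC, not on the Due. The function names are made up for illustration, and it assumes nothing else adds time between bytes.

```cpp
#include <cassert>

// Time to clock one 8-bit frame onto the wire, in microseconds.
constexpr double wireTimeUs(double sckMHz) {
    return 8.0 / sckMHz;
}

// Per-byte cost = wire time + dead time between bytes (microseconds).
constexpr double perByteUs(double sckMHz, double deadUs) {
    return wireTimeUs(sckMHz) + deadUs;
}

// Total time for a block of nBytes, in microseconds.
double blockTimeUs(double sckMHz, double deadUs, int nBytes) {
    assert(sckMHz > 0.0 && nBytes >= 0);
    return perByteUs(sckMHz, deadUs) * nBytes;
}
```

With the delays quoted above, this gives roughly 0.48 µs/byte at 16.8 MHz (no gap), 0.50 µs/byte at 21 MHz and 0.51 µs/byte at 28 MHz, which is why all three settings end up near 50 µs per 100 bytes.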

Uno USART in MSPIM mode …

uint8_t data[5] = {0x55, 0x5A, 0x49, 0xAA, 0x96};
byte count;

void setup() {
  UBRR0H = 0;
  UBRR0L = 0;
  DDRD |= _BV (4);                         // XCK as output enables master mode
  UCSR0C = (1 << UMSEL01) | (1 << UMSEL00) | (0 << UCPHA0) | (0 << UCPOL0); // Master SPI, mode 0
  UCSR0B = (1 << RXEN0) | (1 << TXEN0);    // Enable receiver and transmitter
  UBRR0L = 1;                              // 4MHz XCK on pin 4
  SPCR = (1 << SPE);                       // enable SPI
  TIMSK0 = 0;                              // disable timer0
}

void loop() {
  while (1) {
    if ((UCSR0A & 32) != 0) { // transmit when data register empty
      UDR0 = data[count];
      count++;
      if (count == 5) count = 0;
    }
  }
}

SPI clock at 4 MHz, no delay between bytes, 2µs/byte, 200µs/100bytes

I have a variation of the bare metal code:

  for (byte i = 0; i < BFSIZE; i++) {
    while ((myspi->SPI_SR & SPI_SR_TDRE) == 0)
      ; // spin
    myspi->SPI_TDR = SPI_PCS(3) | i;
    if (myspi->SPI_SR & SPI_SR_RDRF) {
      *inptr++ = (byte) myspi->SPI_RDR;
    }
  }

And it has some mysterious aspects. Most mysterious: timing doesn’t seem to change between using SPI_SR_TDRE and SPI_SR_TXEMPTY, even though the former SHOULD have a full byte-time worth of leeway…

DUE_SPI_RAW.ino (1.69 KB)

Using DMA is supposed to optimize SPI transfers (haven't tried it), but here's an example on GitHub.

Seems the SPI library on AVR has received careful optimization work, but the SPI library on Due... not so much. :(

I decided to write a separate piece of sample code to play around with SPI, without any of the Ethernet stuff. Problem is, it doesn’t work!
It gets to the SPI transfer and then hangs: I see “Transferring” on the serial monitor but never “Done” (lines 26 and 28, wrapped around the SPI.transfer on line 27).
The Ethernet2 shield is attached to the SPI but nothing else (well, scope probes are). The transfers don’t happen, the SPI clock doesn’t change state.
If anyone can spot the idiot mistake I’ve made that would be great!
The Ethernet/SPI test code I posted previously still works, so it’s not a short or other daft hardware issue.

#include <SPI.h>

const int SPIbuf_Size = 100;
byte SPIbuf [SPIbuf_Size];
const int FPGA_SPI_CSpin = 4;
unsigned long current_micros;
const unsigned long looptime = 2000000;
const int Serial_Baud = 115200;

void setup() {
  pinMode(FPGA_SPI_CSpin, OUTPUT);
  Serial.begin(Serial_Baud);
  Serial.println(F("Setup Complete"));
}

void loop() {
  current_micros = micros();
  for (int j = 0; j<256; j++)
  {
    Serial.print(F("Doing : "));
    Serial.println(j);
    SPIbuf[0] = j;
    SPIbuf[SPIbuf_Size-1] = j;
    SPI.beginTransaction(SPISettings(1000000, MSBFIRST, SPI_MODE0));
    digitalWrite(FPGA_SPI_CSpin, LOW);
    Serial.println(F("Transferring"));
    SPI.transfer (&SPIbuf, SPIbuf_Size);
    Serial.println(F("Done"));
    digitalWrite(FPGA_SPI_CSpin, HIGH);
    SPI.endTransaction();
    Serial.print(j);
    Serial.print(F(" : "));
    Serial.print(SPIbuf[0]);
    Serial.print(F(" : "));
    Serial.println(SPIbuf[SPIbuf_Size-1]);
    while ((micros()- current_micros)<looptime)
    {
      
    }
  }
}

It seems SPI.begin(); is required: I added it to setup() and it now works. I guess the Ethernet libraries call it and never call SPI.end(); when they are finished, which is why my other code worked.

I just did a comparison between 28MHz and 42MHz. Although the clocking was indeed faster, the deadtime increased, keeping the 100-byte transfer at about 53us in both cases. I couldn't get 84MHz to work; it seemed to still clock at 42 :(
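
One possible explanation for an 84MHz request still clocking at 42: the SAM3X derives SCK by dividing its 84 MHz master clock by an integer, and the Due core's SPISettings appears to clamp that divider to a minimum of 2. The sketch below models that selection on a PC. The function names are hypothetical and the rounding rule is an assumption reconstructed from the SAM core's SPI.h, so it should be checked against the installed version.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMasterClockHz = 84000000;  // Due (SAM3X8E) master clock

// ASSUMED divider selection, modelled on the Due core's SPISettings:
// 255 for very slow requests, 2 for anything at or above MCK/2,
// otherwise the smallest divider that does not exceed the request.
constexpr uint32_t pickDivider(uint32_t requestedHz) {
    return (requestedHz < kMasterClockHz / 255) ? 255
         : (requestedHz >= kMasterClockHz / 2)  ? 2
         : (kMasterClockHz / (requestedHz + 1)) + 1;
}

// SCK that would actually be produced for a given request.
constexpr uint32_t actualSckHz(uint32_t requestedHz) {
    return kMasterClockHz / pickDivider(requestedHz);
}

static_assert(actualSckHz(84000000) == 42000000, "84 MHz request is clamped to 42 MHz");
```

Under this model a 28 MHz request gives divider 3 (28 MHz), a 25 MHz request gives divider 4 (21 MHz, consistent with the 21 Mbit/sec seen on the Due earlier in the thread), and anything from 42 MHz upwards gives 42 MHz.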

Are you communicating with other hosts on your local ethernet? Or will you be talking to hosts on the internet, accessed through routers and high-latency links?

weird_dave: I just did a comparison between 28MHz and 42MHz. Although the clocking was indeed faster, the deadtime increased, keeping the 100-byte transfer at about 53us in both cases. I couldn't get 84MHz to work; it seemed to still clock at 42 :(

Yes, that matches my findings in reply 26. The SAM3X SPI hardware works without deadtime at up to 16.8MHz clock. Beyond this rate, the deadtime increases to cancel out the byte transfer time improvement because a wall has been hit.

According to the datasheet, DMA will optimize transfer rate ... I guess the DMA improvement would be noticed only if the SPI clock is set higher than 16.8MHz.

This looks interesting...

32.7.3.9 Peripheral Deselection with DMAC
When the Direct Memory Access Controller is used, the chip select line will remain low during the whole transfer since the TDRE flag is managed by the DMAC itself. The reloading of the SPI_TDR by the DMAC is done as soon as TDRE flag is set to one.

Did you try the TurboSPI.h library for the SAM3X (https://github.com/anydream/TurboSPI)?

Yeah, now that’s more like it!

SPI SCK @ 42MHz, no deadtime, 100 bytes transferred in 19.04µs!

5.25 MBps (42Mbps)

#include <TurboSPI.h>

TurboSPI    g_SPI;
DigitalPin  g_PinCS, g_PinRS;
uint8_t     g_Buffer[100];  // some data buffer to transfer
uint8_t     g_Divisor = 2;  // transfer speed set to MCU's clock divide by 2

void setup()
{
  // setup pins
  g_PinCS.Begin(45);
  g_PinRS.Begin(47);

  g_PinCS.PinMode(OUTPUT);
  g_PinRS.PinMode(OUTPUT);

  // setup SPI
  g_SPI.Begin();

  // fill the buffer with data
  for (uint8_t i = 0; i < sizeof(g_Buffer); i++) {
    g_Buffer[i] = i + 1;
  }
}

void loop()
{
  // setup speed and select slave
  g_SPI.Init(g_Divisor);
  g_PinCS.Low();

  // set some pins
  g_PinRS.High();

  // transfer data to slave
  g_SPI.Send(g_Buffer, sizeof(g_Buffer));

  // unselect slave
  g_PinCS.High();
}
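
A quick check of the numbers quoted above, as plain arithmetic that runs on a PC (the helper names are illustrative only): 100 bytes is 800 bits, and at 42 MHz with no gaps between bytes that should take just over 19 µs.

```cpp
#include <cassert>

// Ideal time to shift nBytes out back-to-back, in microseconds.
double idealTransferUs(int nBytes, double sckMHz) {
    assert(sckMHz > 0.0);
    return nBytes * 8.0 / sckMHz;
}

// Throughput in megabytes per second from a measured duration.
double throughputMBps(int nBytes, double measuredUs) {
    assert(measuredUs > 0.0);
    return nBytes / measuredUs;  // bytes per microsecond equals MB/s
}
```

idealTransferUs(100, 42.0) comes to about 19.05 µs, in line with the measured 19.04 µs, and throughputMBps(100, 19.04) to about 5.25 MB/s. Against the roughly 53 µs reported for the same 100 bytes with SPI.transfer, that is about 2.8 times faster.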

Paul Stoffregen: Are you communicating with other hosts on your local ethernet? Or will you be talking to hosts on the internet, accessed through routers and high-latency links?

Neither that time; this was an SPI-only test of 100 bytes.

dlloyd: Yes, that matches my findings in reply 26. The SAM3X SPI hardware works without deadtime at up to 16.8MHz clock. Beyond this rate, the deadtime increases to cancel out the byte transfer time improvement because a wall has been hit.

Oops, I forgot to include my 21MHz result: it was slower than 28MHz, about 65us if memory serves... Have you made this change?

// Edit occurrences of SPI_CSR_DLYBCT(1) to SPI_CSR_DLYBCT(0) in
//  C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.cpp
// and C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.h

dlloyd: Yeah, now that's more like it!

SPI SCK @ 42MHz, no deadtime, 100 bytes transferred in 19.04µs!

That's rather good. It isn't obvious to me whether it receives at the same time as transmitting; if it does, this improves the SPI part of my project. (I'm effectively building an Ethernet-to-FPGA bridge. I don't want to control the wiznet directly, hence the Due, and I have the FPGA and Due talking via full-duplex SPI.) The real task now is to modify the Ethernet library to use TurboSPI, not a task I'm looking forward to. There goes the weekend :D

Yes, it should receive at the same time. The MISO line is on pin 47.

Been doing a LOT of work and experimentation with the Due SPI/DMA and finally decided to implement all SPI transfers via DMA.

The throughput is... great!

I have an OLED1351 @ 128*128 rgb565 and can fill it at > 20 fps off SD card.

The SD card has a DIV=4 and OLED DIV=5 and they both play nicely.

I found that in the low-level routines that call write8(), it made a large difference whether the function was inlined or not.

I got probably a 20-30% speedup by forcing inline.

I also broke the write8() functions down into 2 to cut down on overhead of setting the same registers multiple times.

 __INLINE__ uint8_t cDMA_spi_send_do_wait_buffer() 
{
 while (!due_dma_dmac_channel_transfer_done(DUE_DMA_SPI_TX_CH)) {}

 while ((SPI0->SPI_SR & SPI_SR_TXEMPTY) == 0) {}

 // leave RDR empty
 return  SPI0->SPI_RDR;
}

// new routines 8 bit send -- DMA --

__INLINE__ void cDMA_spi_send_again(uint8_t b, bool wait)
{
 __src8 = b;

 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_SADDR = (uint32_t)&__src8;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CTRLA = DMAC_CTRLA_BTSIZE(1) | DMAC_CTRLA_SRC_WIDTH_BYTE | DMAC_CTRLA_DST_WIDTH_BYTE;
 due_dma_dmac_channel_enable(DUE_DMA_SPI_TX_CH);

 if (wait)
 {
  cDMA_spi_send_do_wait_buffer();
 }
}

__INLINE__ void cDMA_spi_send(uint8_t b, bool wait)
{
// due_dma_dmac_channel_disable(DUE_DMA_SPI_TX_CH);
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_DSCR = 0;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_DADDR = (uint32_t)&SPI0->SPI_TDR;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CTRLB = DMAC_CTRLB_SRC_INCR_INCREMENTING | DMAC_CTRLB_SRC_DSCR | DMAC_CTRLB_DST_DSCR | DMAC_CTRLB_FC_MEM2PER_DMA_FC | DMAC_CTRLB_DST_INCR_FIXED;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CFG = DMAC_CFG_DST_PER(DUE_DMA_SPI_TX_IDX) | DMAC_CFG_DST_H2SEL | DMAC_CFG_FIFOCFG_ALAP_CFG | DMAC_CFG_SOD; 

 cDMA_spi_send_again(b, wait);
}

cDMA_spi_send(b1, false);

// do some other work here, as there are spare cycles before the request ends


cDMA_spi_send_do_wait_buffer(); // now wait

cDMA_spi_send_again(b2, true);
cDMA_spi_send_again(b3, true);

With the ability to control whether to wait or not it allows some work to be done "for free". For the video streamer it means the pixel processing and looping is basically done for free as it's done while I would usually be waiting for the DMA request to end.

Also have the 16 bit send functions that don't require changing modes etc which also made a huge difference.

Just waiting on an AD5330 DAC so I can test the sound output with video. ATM the video is running 2x - 4x the normal speed so confident I should be able to support video with sound.

Made a SPIDevice class I use for all my projects. It will do things like check the DIV every time the chip is selected. However, it's important to only reset the DIV if necessary as it's a costly operation.

bool cDMA_spi_check_div(uintX_t sckDivisor, bool dma)  // check .. really need to do before each send to make sure each device is at correct speed etc
{
    // may be SPI lib or DMA 

    if (dma && last_div_dma != sckDivisor) 
    {
        last_div_dma = sckDivisor;

        SPI0->SPI_CR = SPI_CR_SPIDIS;   //  disable SPI
        SPI0->SPI_CR = SPI_CR_SWRST; // reset SPI
        SPI0->SPI_MR = SPI_PCS(DUE_DMA_SPI_CHIP_SEL) | SPI_MR_MODFDIS | SPI_MR_MSTR; // no mode fault detection, set master mode
        SPI0->SPI_CSR[DUE_DMA_SPI_CHIP_SEL] = SPI_CSR_SCBR((uint8_t)sckDivisor) | SPI_CSR_NCPHA; // mode 0, 8-bit

        SPI0->SPI_CR |= SPI_CR_SPIEN; // enable SPI

        return true;
    }

    return false;   
}
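
The idea behind cDMA_spi_check_div() is to remember the last divisor and only touch the hardware when it changes. A minimal host-side sketch of that pattern follows, with a hypothetical CachedDivider class standing in for the SPI registers; it counts reconfigurations instead of performing them.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for a bus whose clock reconfiguration is expensive.
class CachedDivider {
public:
    // Returns true only when the hardware would actually be reprogrammed.
    bool select(uint8_t divisor) {
        assert(divisor != 0);            // 0 is used here as "nothing configured yet"
        if (divisor == last_) return false;
        last_ = divisor;
        ++reconfigs_;                    // real code would reset and re-enable SPI here
        return true;
    }

    unsigned reconfigs() const { return reconfigs_; }

private:
    uint8_t  last_ = 0;
    unsigned reconfigs_ = 0;
};
```

Alternating between the SD card (DIV=4) and the OLED (DIV=5) would reconfigure on every switch, while repeated transfers to the same device cost nothing extra.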

I suppose the other thing worth mentioning is that I totally gutted SdFat to use an external library for all SPI and made it FAT32 only. It's about as lean and mean as I can get.