
Topic: Ethernet2 (UDP) SPI transfers have a lot of dead time

westfw

#15
Dec 02, 2016, 02:06 am Last Edit: Dec 02, 2016, 02:08 am by westfw Reason: Add images
Quote
changing the occurrences of SPI_CSR_DLYBCT(1) to SPI_CSR_DLYBCT(0)
This does help significantly for the bare-metal SPI case...
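
For a sense of scale: as far as I recall from the SAM3X datasheet, DLYBCT sits in the top byte of the SPI chip-select register and inserts an idle time of 32 × DLYBCT / MCK between consecutive bytes. Below is a minimal host-side sketch of that arithmetic, not code from the SPI library. The helper names are made up for illustration, and the 84 MHz master clock is an assumption (the Due's usual MCK).

```cpp
#include <cassert>
#include <cstdint>

// DLYBCT occupies bits 31..24 of the SAM3X SPI chip-select register (SPI_CSR).
const uint32_t kDlybctShift = 24;
const uint32_t kDlybctMask = 0xFFu << kDlybctShift;

// Read the DLYBCT field out of a raw CSR value.
uint32_t dlybctField(uint32_t csr) {
  return (csr & kDlybctMask) >> kDlybctShift;
}

// Clear DLYBCT while leaving the other CSR bits untouched
// (the same effect as masking the register with 0x00FFFFFF).
uint32_t clearDlybct(uint32_t csr) {
  return csr & ~kDlybctMask;
}

// Idle time inserted between consecutive bytes, in nanoseconds:
// 32 * DLYBCT / MCK. mckHz defaults to an assumed 84 MHz master clock.
double dlybctDelayNs(uint32_t dlybct, double mckHz = 84e6) {
  assert(dlybct <= 0xFFu && mckHz > 0.0);
  return 32.0 * dlybct / mckHz * 1e9;
}
```

Under those assumptions, dlybctDelayNs(1) comes out at roughly 381 ns of idle time per byte, which is longer than the roughly 286 ns needed to shift eight bits at 28 MHz.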

With DLYBCT(1):

With DLYBCT set to zero after starttransaction:

westfw

Ah hah!  I get it!  Since the SPI single-byte "transaction" routines are all "return the read value", they MUST wait for the complete transmission to happen, so you get no benefit from the potential overlap of the transmission with other code.  We might as well be bit-banging :-(
The counted version of write() does a little better, but it's still not taking advantage of both the shift-register AND the TX buffer register.
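
To illustrate the overlap argument, here is a rough two-stage timing model, run on a PC rather than on the Due. It is only a sketch: the function names are hypothetical, the per-byte CPU overhead is a number you would have to measure, and real hardware has extra effects (DLYBCT, interrupts) that this ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Blocking style: every byte pays the CPU overhead AND the full shift time,
// because the routine waits for the received byte before returning.
uint64_t blockingNs(uint32_t bytes, uint32_t shiftNs, uint32_t cpuNs) {
  assert(shiftNs > 0);  // shifting a byte always takes some time
  return static_cast<uint64_t>(bytes) * (static_cast<uint64_t>(shiftNs) + cpuNs);
}

// Overlapped style: while one byte is in the shift register, the CPU loads
// the next one into the TX buffer register. After the first byte, each byte
// costs only the slower of the two stages.
uint64_t overlappedNs(uint32_t bytes, uint32_t shiftNs, uint32_t cpuNs) {
  assert(shiftNs > 0);
  if (bytes == 0) return 0;
  return static_cast<uint64_t>(cpuNs) + shiftNs +
         static_cast<uint64_t>(bytes - 1) * std::max(shiftNs, cpuNs);
}
```

For example, with a 286 ns shift time (eight bits at 28 MHz) and a purely hypothetical 200 ns of CPU overhead per byte, 100 bytes cost 48600 ns in the blocking model and 28800 ns in the overlapped one.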


pjrc

This is a longshot, but you might try my optimized Ethernet library.

https://github.com/PaulStoffregen/Ethernet

Hopefully if you just put it into Documents/Arduino/libraries it'll override Arduino's version.  Pay attention to the messages about duplicate libraries and which one the Arduino IDE is really using.

This doesn't fix the slowness of Due's SPI library, but it does eliminate the redundant accesses to those Wiznet index registers.  It also uses transactions at the socket level, rather than needlessly starting and stopping the SPI transaction over and over again at the W5500 read/write level.

I have been mostly testing with TCP, and there are still many unsolved mysteries of slowness.  I'm waiting for delivery of a network tap this weekend before I continue work on this... so I can see what the Wiznet chip is really doing with the packets.

weird_dave

It seems there is a SPI_CSR_DLYBCT(1) hiding in SPI.h; I had only changed the ones in SPI.cpp, so that explains it a bit more :)
Having made this change, buffer transfers have improved as well: 100 bytes now takes about 48us. It has also knocked 50us off the UDP time (450us down to 400us).
I shall try the wiznet library (this was Ethernet2 from the library manager) and see if there's an improvement.
Currently, I can see it is doing byte-by-byte transfers for all of it, so there are massive gaps.

I notice that the wiznet site says 15Mbps, but the datasheet says 80 for SPI and 15 for the Ethernet link, which makes sense given the wiznet library sets the SPI speed to 42MHz :)

Paul, you posted while I was still writing this post :)
I did try your library yesterday (I posted in this thread: https://forum.arduino.cc/index.php?topic=438559.0)
I couldn't get the retry and timeout to work. I really need these, as I need to 'give up early' and carry on; I'm trying to get my looptime down.

weird_dave

I've tried the Wiznet library (at 28MHz) and it does seem to be faster: the UDP packet is down to about 270us. Clearly the wiznet library isn't using the faster buffer transfer :(

For testing, I've got a 100 byte SPI buffer transfer (not to the W5500) followed by the UDP transfer, just so I can compare on the fly. Here's the really odd thing: the wiznet library is causing a minor slowdown in the SPI time! A 100 byte transfer over SPI takes 48.6us with the Ethernet2 library in use and 52us when I use the Wiznet library. It's not a huge amount, but it is very noticeable when you are looking at 10us/div on the scope, as it crosses the 5 div boundary. Looking at the waveform, it is a small increase in dead time between the SPI bursts. Very strange!

weird_dave

I've tidied up my test code so it can be posted here.
It's currently set up to use Paul's library (and wiznet by folder renaming). There are commented-out includes at the start for changing to Ethernet2, along with lines 42-45 for the timeout and retry configuration.

The wiznet library seems the fastest of the 3, though there is the oddity of it increasing other SPI dead time. None of the libraries seem to be using "transfer (buffer, size);", this would boost the speed enormously without resorting to bare-metal code.

Code: [Select]
#include <SPI.h>
//#include <Ethernet2.h>
#include "Ethernet.h"
#include "w5100.h"
//#include <utility/w5500.h>
// Edit the SPI speed in C:\Users\[USERNAME]\Documents\Arduino\libraries\Ethernet2\src\utility\w5500.cpp
//  to 28000000, 28MHz, line 25: (copy/paste the following is easiest)
//  SPISettings wiznet_SPI_settings(28000000, MSBFIRST, SPI_MODE0);
//
// Edit occurrences of SPI_CSR_DLYBCT(1) to SPI_CSR_DLYBCT(0) in
//  C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.cpp
// and C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.h
//

byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
const IPAddress MyIP( 192, 168, 0, 10 );
const IPAddress Cab1_IP (192, 168, 0, 2); //this address exists
const IPAddress Cab2_IP (192, 168, 0, 1); //this doesn't, for testing retry and timeout
unsigned int local_Port = 12345;
unsigned int Cab1_Port = 12345;
unsigned int Cab2_Port = 12345;
EthernetUDP Udp;

const int EthTxBuf_Size = 100;
byte EthTxBuf[EthTxBuf_Size];
byte EthRxBuf[EthTxBuf_Size];
unsigned long current_micros;
const unsigned long looptime = 2000;
const int Testpin = 52;
const int Errorpin = 22;
const int Framepin = 32;
const int FPGA_SPIpin = 4;

void setup()
{
  Ethernet.begin(mac,MyIP);
  Udp.begin(local_Port);
  pinMode(Testpin, OUTPUT);
  pinMode(Errorpin, OUTPUT);
  pinMode(Framepin, OUTPUT);
  pinMode(FPGA_SPIpin, OUTPUT);
  //w5500.setRetransmissionCount(1);
  //w5500.setRetransmissionTime(1);
  W5100.setRetransmissionCount(uint8_t(1));
  W5100.setRetransmissionTime(uint16_t(1));
}

void loop()
{
  current_micros = micros();
  for (byte j=0; j<EthTxBuf_Size; j++)
  {
    EthTxBuf[j] = j;
  }
  digitalWrite(Framepin, HIGH);
  digitalWrite(Framepin, LOW);
  digitalWrite(Testpin, HIGH);

  Udp.beginPacket(Cab1_IP, Cab1_Port);
  Udp.write(EthTxBuf, EthTxBuf_Size);
  if (Udp.endPacket() == 0)
  {
    digitalWrite(Errorpin, HIGH);
    digitalWrite(Errorpin, LOW);
  }
 
  digitalWrite(Testpin, LOW);

  for (byte j=0; j<EthTxBuf_Size; j++)
  {
    EthTxBuf[j] = j;
  }

  SPI.beginTransaction(SPISettings(28000000, MSBFIRST, SPI_MODE0));
  digitalWrite(FPGA_SPIpin, LOW);
  SPI.transfer(EthTxBuf, EthTxBuf_Size);  // in-place transfer: received bytes overwrite EthTxBuf
  digitalWrite(FPGA_SPIpin, HIGH);
  SPI.endTransaction();
 
  digitalWrite(Testpin, HIGH);
  Udp.beginPacket(Cab2_IP, Cab2_Port);
  Udp.write(EthTxBuf, EthTxBuf_Size);
  if (Udp.endPacket() == 0)
  {
    digitalWrite(Errorpin, HIGH);
    digitalWrite(Errorpin, LOW);
  }
  digitalWrite(Testpin, LOW);
  while ((micros()- current_micros)<looptime)
  {
   
  }
}


pjrc

#21
Dec 05, 2016, 09:57 pm Last Edit: Dec 05, 2016, 09:57 pm by Paul Stoffregen
Quote
None of the libraries seem to be using "transfer (buffer, size);", this would boost the speed enormously without resorting to bare-metal code.
I decided to try a quick sanity check for this theory.  I ran this code on an Arduino Due:

Code: [Select]

#include <SPI.h>

void setup() {
  SPI.begin();
  pinMode(10, OUTPUT);
}

void loop() {
  uint8_t data[5] = {0x55, 0x5A, 0x49, 0xAA, 0x96};
  digitalWrite(10, LOW);
  SPI.beginTransaction(SPISettings(25000000, MSBFIRST, SPI_MODE0));
  SPI.transfer(data, 5);
  SPI.endTransaction();
  digitalWrite(10, HIGH);
  delay(100);
}


Here is the rather disappointing result:


pjrc

#22
Dec 05, 2016, 10:02 pm Last Edit: Dec 05, 2016, 10:03 pm by Paul Stoffregen
Then again, those 50% dead times between bytes are a LOT better than the overhead of calling SPI.transfer(byte) five times.

Here's how bad *that* is:


pjrc

#23
Dec 05, 2016, 10:12 pm Last Edit: Dec 05, 2016, 10:12 pm by Paul Stoffregen
For comparison, here is Arduino Uno running the five SPI.transfer(byte) sketch:



Even with only an 8-bit CPU running at about one fifth the clock speed, Uno transfers the 5 bytes at only 8 Mbit/sec in approximately the same total time Due takes at 21 Mbit/sec.
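
On the 21 Mbit/sec figure: the Due sketch above asks for 25 MHz, but the SPI clock is derived from the master clock by an integer divider, so the request gets rounded down to the nearest achievable rate. Here is a minimal host-side sketch of that rounding. It assumes an 84 MHz master clock and a divider chosen so the clock never exceeds the request; the function names are hypothetical and the SPI library's own rounding and clamping may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t kDueMckHz = 84000000u;  // assumed Due master clock

// Smallest integer divider that keeps MCK / divider at or below the request,
// clamped to the range an 8-bit divider field can hold (1..255).
uint32_t spiDivider(uint32_t requestedHz, uint32_t mckHz = kDueMckHz) {
  assert(requestedHz > 0);
  uint64_t div = (static_cast<uint64_t>(mckHz) + requestedHz - 1) / requestedHz;
  if (div < 1) div = 1;
  if (div > 255) div = 255;
  return static_cast<uint32_t>(div);
}

// SPI clock that actually results from a requested frequency.
uint32_t achievedSpiHz(uint32_t requestedHz, uint32_t mckHz = kDueMckHz) {
  return mckHz / spiDivider(requestedHz, mckHz);
}
```

With these assumptions a 25 MHz request lands on divider 4 (21 MHz), while 28 MHz and 42 MHz are hit exactly with dividers 3 and 2.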

pjrc

#24
Dec 05, 2016, 10:15 pm Last Edit: Dec 05, 2016, 10:16 pm by Paul Stoffregen
For completeness, here is how Uno performs with SPI.transfer(buf, 5):


weird_dave

Your results with the Due match what I'm seeing. The Uno results put the Due to shame really given the core speed. I don't own an Uno to play with unfortunately, so thanks for sharing that research.

Does your library buffer the transfers? My results suggest it doesn't, but I recall reading that it did (or was supposed to). I suspect it's possible to get a 100-byte transmission (that's the whole UDP SPI transfer) done in well under 100us with a buffer transfer at 28MHz.
Also, could you confirm if the timeout and retry count work with your library? They didn't seem to work for me :(

dlloyd

Some bare metal sprinkled in and no buffering ...

Code: [Select]
#include <SPI.h>
uint8_t data[5] = {0x55, 0x5A, 0x49, 0xAA, 0x96};
byte count;

void setup() {
  SPI.begin(10);
  SPI.setClockDivider(10, 5);  // 16.8MHz Clock
  REG_SPI0_CSR &= 0x00FFFFFF;  // DLYBCT = 0
}

void loop() {
  while (1) {
    if ((REG_SPI0_SR & 2) != 0) { // transmit when data register empty
      REG_SPI0_TDR = data[count];
      count++;
      if (count == 5) count = 0;
    }
  }
}

SPI clock at 16.8 MHz: Without while loop, 2.17µs delay between transfers

With while loop: no delay between transfers, 0.5µs/byte, 50µs/100bytes


SPI clock at 21 MHz: 0.12µs delay, 0.5µs/byte including delay, 50µs/100bytes


SPI clock at 28 MHz: 0.22µs delay, 0.5µs/byte including delay, 50µs/100bytes
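
The three measurements above all land near 0.5µs/byte because the inter-byte delay grows as the shift time shrinks. A small host-side sketch of the arithmetic, using only the clock rates and delays quoted above (function names are hypothetical):

```cpp
#include <cassert>

// Time for one byte on the wire: eight clock periods plus the idle gap.
// clockMHz is the SPI clock in MHz, gapUs the measured inter-byte delay.
double byteTimeUs(double clockMHz, double gapUs) {
  assert(clockMHz > 0.0 && gapUs >= 0.0);
  return 8.0 / clockMHz + gapUs;
}

// Time for a whole buffer, assuming every byte sees the same gap.
double bufferTimeUs(unsigned bytes, double clockMHz, double gapUs) {
  return bytes * byteTimeUs(clockMHz, gapUs);
}
```

byteTimeUs(16.8, 0.0), byteTimeUs(21.0, 0.12) and byteTimeUs(28.0, 0.22) all come out between about 0.48 and 0.51 µs, so raising the clock past 16.8 MHz buys nothing until the loop itself gets faster.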


Uno USART in MSPIM mode ...

Code: [Select]
uint8_t data[5] = {0x55, 0x5A, 0x49, 0xAA, 0x96};
byte count;

void setup() {
  UBRR0H = 0;
  UBRR0L = 0;
  DDRD |= _BV (4);                         // XCK as output enables master mode
  UCSR0C = (1 << UMSEL01) | (1 << UMSEL00) | (0 << UCPHA0) | (0 << UCPOL0); // Master SPI, mode 0
  UCSR0B = (1 << RXEN0) | (1 << TXEN0);    // Enable receiver and transmitter
  UBRR0L = 1;                              // 4MHz XCK on pin 4
  SPCR = (1 << SPE);                       // enable SPI
  TIMSK0 = 0;                              // disable timer0
}

void loop() {
  while (1) {
    if ((UCSR0A & 32) != 0) { // transmit when data register empty
      UDR0 = data[count];
      count++;
      if (count == 5) count = 0;
    }
  }
}

SPI clock at 4 MHz, no delay between bytes, 2µs/byte, 200µs/100bytes
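
For reference, the 4 MHz figure follows from the AVR USART synchronous-master baud formula, fosc / (2 × (UBRR + 1)). A minimal host-side sketch, assuming the Uno's 16 MHz oscillator; the function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// XCK frequency in MSPIM mode: fosc / (2 * (UBRR + 1)).
// foscHz defaults to an assumed 16 MHz Uno oscillator.
uint32_t mspimClockHz(uint16_t ubrr, uint32_t foscHz = 16000000u) {
  assert(foscHz > 0 && ubrr <= 4095u);  // UBRR is a 12-bit register
  return foscHz / (2u * (static_cast<uint32_t>(ubrr) + 1u));
}

// Time to shift one byte back-to-back (no inter-byte gap), in microseconds.
double mspimByteTimeUs(uint16_t ubrr, uint32_t foscHz = 16000000u) {
  return 8.0 * 1e6 / mspimClockHz(ubrr, foscHz);
}
```

UBRR = 1 gives 4 MHz and 2 µs per byte, matching the measurement above; UBRR = 0 would give 8 MHz on paper, though whether the loop can keep the transmitter fed at that rate is a separate question.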


westfw

https://community.atmel.com/forum/getting-back-back-spi-transfers-sam3x
I have a variation of the bare metal code:

Code: [Select]
  for (byte i = 0; i < BFSIZE; i++) {
    while ((myspi->SPI_SR & SPI_SR_TDRE) == 0)
      ; // spin
    myspi->SPI_TDR = SPI_PCS(3) | i;
    if (myspi->SPI_SR & SPI_SR_RDRF) {
      *inptr++ = (byte) myspi->SPI_RDR;
    }
  }

And it has some mysterious aspects.   Most mysterious: timing doesn't seem to change between using SPI_SR_TDRE and SPI_SR_TXEMPTY, even though the former SHOULD have a full byte-time worth of leeway...


dlloyd

Using DMA is supposed to optimize SPI transfers (haven't tried it), but here's an example on GitHub.

pjrc

Seems the SPI library on AVR has received careful optimization work, but the SPI library on Due... not so much.  :(
