
Topic: Ethernet2 (UDP) SPI transfers have a lot of dead time

weird_dave

I decided to write a separate piece of sample code to play around with SPI, without any of the Ethernet stuff. Problem is, it doesn't work!
It gets to the SPI transfer part then falls over: I see "Transferring" on the serial monitor but never "Done" (lines 26 and 28, wrapped around the SPI.transfer on line 27).
The Ethernet2 shield is the only thing attached to the SPI bus (well, apart from scope probes). The transfers don't happen; the SPI clock never changes state.
If anyone can spot the idiot mistake I've made that would be great!
The Ethernet/SPI test code I posted previously still works, so it's not a short or other daft hardware issue.

Code: [Select]
#include <SPI.h>

const int SPIbuf_Size = 100;
byte SPIbuf [SPIbuf_Size];
const int FPGA_SPI_CSpin = 4;
unsigned long current_micros;
const unsigned long looptime = 2000000;
const int Serial_Baud = 115200;

void setup() {
  pinMode(FPGA_SPI_CSpin, OUTPUT);
  Serial.begin(Serial_Baud);
  Serial.println(F("Setup Complete"));
}

void loop() {
  current_micros = micros();
  for (int j = 0; j<256; j++)
  {
    Serial.print(F("Doing : "));
    Serial.println(j);
    SPIbuf[0] = j;
    SPIbuf[SPIbuf_Size-1] = j;
    SPI.beginTransaction(SPISettings(1000000, MSBFIRST, SPI_MODE0));
    digitalWrite(FPGA_SPI_CSpin, LOW);
    Serial.println(F("Transferring"));
    SPI.transfer (&SPIbuf, SPIbuf_Size);
    Serial.println(F("Done"));
    digitalWrite(FPGA_SPI_CSpin, HIGH);
    SPI.endTransaction();
    Serial.print(j);
    Serial.print(F(" : "));
    Serial.print(SPIbuf[0]);
    Serial.print(F(" : "));
    Serial.println(SPIbuf[SPIbuf_Size-1]);
    while ((micros()- current_micros)<looptime)
    {
     
    }
  }
}

weird_dave

It seems
SPI.begin();
is required. I shoved it in setup() and it now works. I guess the Ethernet library calls it and doesn't call
SPI.end();
when it has finished, which is why my other code works.

weird_dave

I just did a comparison between 28MHz and 42MHz, although the clocking was indeed faster, the deadtime increases to keep the 100 byte transfer at about 53us in both cases. I couldn't get 84MHz to work, it seemed to still clock at 42 :(

pjrc

Are you communicating with other hosts on your local ethernet?  Or will you be talking to hosts on the internet, accessed through routers and high-latency links?

dlloyd

Quote
I just did a comparison between 28MHz and 42MHz, although the clocking was indeed faster, the deadtime increases to keep the 100 byte transfer at about 53us in both cases. I couldn't get 84MHz to work, it seemed to still clock at 42 :(
Yes, that matches my findings in reply 26. The SAM3X SPI hardware works without deadtime at up to 16.8MHz clock. Beyond this rate, the deadtime increases to cancel out the byte transfer time improvement because a wall has been hit.

According to the datasheet, DMA will optimize transfer rate ... I guess the DMA improvement would be noticed only if the SPI clock is set higher than 16.8MHz.

This looks interesting...

32.7.3.9 Peripheral Deselection with DMAC
When the Direct Memory Access Controller is used, the chip select line will remain low during the whole transfer since the TDRE flag is managed by the DMAC itself. The reloading of the SPI_TDR by the DMAC is done as soon as TDRE flag is set to one.

ard_newbie



Did you try the TurboSPI library (TurboSPI.h) for the SAM3X (https://github.com/anydream/TurboSPI)?


dlloyd

Yeah, now that's more like it!

SPI SCK @ 42MHz, no deadtime, 100 bytes transferred in 19.04µs!

5.25 MBps (42Mbps)


Code: [Select]
#include <TurboSPI.h>

TurboSPI    g_SPI;
DigitalPin  g_PinCS, g_PinRS;
uint8_t     g_Buffer[100];  // some data buffer to transfer
uint8_t     g_Divisor = 2;  // transfer speed set to MCU's clock divide by 2

void setup()
{
  // setup pins
  g_PinCS.Begin(45);
  g_PinRS.Begin(47);

  g_PinCS.PinMode(OUTPUT);
  g_PinRS.PinMode(OUTPUT);

  // setup SPI
  g_SPI.Begin();

  // fill the buffer with data
  for (uint8_t i = 0; i < sizeof(g_Buffer); i++) {
    g_Buffer[i] = i + 1;
  }
}

void loop()
{
  // setup speed and select slave
  g_SPI.Init(g_Divisor);
  g_PinCS.Low();

  // set some pins
  g_PinRS.High();

  // transfer data to slave
  g_SPI.Send(g_Buffer, sizeof(g_Buffer));

  // unselect slave
  g_PinCS.High();
}

weird_dave

Quote
Are you communicating with other hosts on your local ethernet?  Or will you be talking to hosts on the internet, accessed through routers and high-latency links?
Neither that time; this was an SPI-only test of 100 bytes.

Quote
Yes, that matches my findings in reply 26. The SAM3X SPI hardware works without deadtime at up to 16.8MHz clock. Beyond this rate, the deadtime increases to cancel out the byte transfer time improvement because a wall has been hit.
Oops, I forgot to include my 21MHz result, it was slower than 28MHz, about 65us if memory serves...
Have you done this?
Code: [Select]
// Edit occurrences of SPI_CSR_DLYBCT(1) to SPI_CSR_DLYBCT(0) in
//  C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.cpp
// and C:\Users\[USERNAME]\AppData\Local\Arduino15\packages\arduino\hardware\sam\1.6.9\libraries\SPI\src\SPI.h


Quote
Yeah, now that's more like it!

SPI SCK @ 42MHz, no deadtime, 100 bytes transferred in 19.04µs!
That's rather good. It isn't obvious to me whether it receives at the same time as transmitting. If it does, this improves the SPI part of my project (I'm effectively building an Ethernet to FPGA bridge; I don't want to control the WIZnet chip directly, hence the Due. I have the FPGA and Due talking via SPI full duplex).
The real task now is to try and modify the Ethernet library to use TurboSPI, not a task I'm looking forward to, there goes the weekend :D

dlloyd

Yes, it should receive at the same time. The MISO line is on pin 47.

Hoek

Been doing a LOT of work and experimentation with the Due SPI/DMA and finally decided to implement all SPI transfers via DMA.

The throughput is... great!

I have an OLED1351 @ 128*128 rgb565 and can fill it at > 20 fps off SD card.

The SD card has a DIV=4 and OLED DIV=5 and they both play nicely.

I found in the low level routines that called write8() it made a large difference as to whether the function was inline or not.

I got probably a 20-30% speedup by forcing inline.


I also broke the write8() functions down into 2 to cut down on overhead of setting the same registers multiple times.

Code: [Select]


 __INLINE__ uint8_t cDMA_spi_send_do_wait_buffer()
{
 while (!due_dma_dmac_channel_transfer_done(DUE_DMA_SPI_TX_CH)) {}

 while ((SPI0->SPI_SR & SPI_SR_TXEMPTY) == 0) {}

 // leave RDR empty
 return  SPI0->SPI_RDR;
}

// new routines 8 bit send -- DMA --

__INLINE__ void cDMA_spi_send_again(uint8_t b, bool wait)
{
 __src8 = b;

 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_SADDR = (uint32_t)&__src8;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CTRLA = DMAC_CTRLA_BTSIZE(1) | DMAC_CTRLA_SRC_WIDTH_BYTE | DMAC_CTRLA_DST_WIDTH_BYTE;
 due_dma_dmac_channel_enable(DUE_DMA_SPI_TX_CH);
 
 if (wait)
 {
  cDMA_spi_send_do_wait_buffer();
 }
}

__INLINE__ void cDMA_spi_send(uint8_t b, bool wait)
{
// due_dma_dmac_channel_disable(DUE_DMA_SPI_TX_CH);
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_DSCR = 0;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_DADDR = (uint32_t)&SPI0->SPI_TDR;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CTRLB = DMAC_CTRLB_SRC_INCR_INCREMENTING | DMAC_CTRLB_SRC_DSCR | DMAC_CTRLB_DST_DSCR | DMAC_CTRLB_FC_MEM2PER_DMA_FC | DMAC_CTRLB_DST_INCR_FIXED;
 DMAC->DMAC_CH_NUM[DUE_DMA_SPI_TX_CH].DMAC_CFG = DMAC_CFG_DST_PER(DUE_DMA_SPI_TX_IDX) | DMAC_CFG_DST_H2SEL | DMAC_CFG_FIFOCFG_ALAP_CFG | DMAC_CFG_SOD;

 cDMA_spi_send_again(b, wait);
}




Code: [Select]


cDMA_spi_send(b1, false);

// do some stuff here, as there are some spare cycles before the request ends


cDMA_spi_send_do_wait_buffer(); // now wait

cDMA_spi_send_again(b2, true);
cDMA_spi_send_again(b3, true);


With the ability to control whether to wait or not it allows some work to be done "for free". For the video streamer it means the pixel processing and looping is basically done for free as it's done while I would usually be waiting for the DMA request to end.

Also have the 16 bit send functions that don't require changing modes etc which also made a huge difference.

Just waiting on an AD5330 DAC so I can test the sound output with video. ATM the video is running 2x - 4x the normal speed so confident I should be able to support video with sound.

Made a SPIDevice class I use for all my projects. It will do things like check the DIV every time the chip is selected. However, it's important to only reset the DIV if necessary as it's a costly operation.

Code: [Select]


bool cDMA_spi_check_div(uintX_t sckDivisor, bool dma)  // check .. really need to do before each send to make sure each device is at correct speed etc
{
  // may be SPI lib or DMA

  if (dma && last_div_dma != sckDivisor)
  {
    last_div_dma = sckDivisor;

    SPI0->SPI_CR = SPI_CR_SPIDIS;  // disable SPI
    SPI0->SPI_CR = SPI_CR_SWRST;   // reset SPI
    SPI0->SPI_MR = SPI_PCS(DUE_DMA_SPI_CHIP_SEL) | SPI_MR_MODFDIS | SPI_MR_MSTR;  // no mode fault detection, set master mode
    SPI0->SPI_CSR[DUE_DMA_SPI_CHIP_SEL] = SPI_CSR_SCBR((uint8_t)sckDivisor) | SPI_CSR_NCPHA;  // mode 0, 8-bit

    SPI0->SPI_CR |= SPI_CR_SPIEN;  // enable SPI

    return true;
  }

  return false;
}
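
The "only reset the DIV if necessary" idea can be tried off-target too. The following is a minimal host-side sketch, not my actual SPIDevice class: FakeSpiPort and DivisorCache are hypothetical names, and the fake port only counts how often the costly reconfiguration would have run.

```cpp
#include <cstdint>

// Hypothetical stand-in for the costly SPI reset and reconfiguration;
// here it only counts how often it would have run.
struct FakeSpiPort {
  uint32_t reconfigCount = 0;
  void reconfigure(uint8_t /*sckDivisor*/) { ++reconfigCount; }
};

// Remembers the last divisor applied and skips redundant reconfiguration.
class DivisorCache {
 public:
  explicit DivisorCache(FakeSpiPort& port) : port_(port) {}

  // Returns true if the port had to be reconfigured.
  bool ensure(uint8_t sckDivisor) {
    if (valid_ && sckDivisor == last_) return false;
    port_.reconfigure(sckDivisor);
    last_ = sckDivisor;
    valid_ = true;
    return true;
  }

 private:
  FakeSpiPort& port_;
  uint8_t last_ = 0;
  bool valid_ = false;  // false until the first divisor has been applied
};
```

Calling ensure(4) for the SD card and ensure(5) for the OLED only pays for a reconfiguration when the bus actually switches between the two devices; back-to-back accesses to the same device cost a single comparison.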


I suppose the other thing worth mentioning is that I totally gutted SdFat to use an external library for all SPI and made it FAT32 only. It's about as lean and mean as I can get.



ard_newbie


AFAIK inlining prevents the compiler from adding a prologue (push {registers}) and an epilogue (pop {registers}) to a function call, so logically it should be faster at the price of a larger code size.
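
The definition of the __INLINE__ macro used above isn't shown, so as a sketch only: with GCC (which the Due toolchain uses) inlining can be forced with the always_inline attribute. FORCE_INLINE and packRgb565 below are illustrative names, not taken from the code in this thread.

```cpp
#include <cstdint>

// Force inlining on GCC/Clang; fall back to a plain inline hint elsewhere.
#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Small hot-path helper of the kind that benefits from inlining:
// pack 8-bit R, G, B into one rgb565 pixel (5 bits red, 6 green, 5 blue).
FORCE_INLINE uint16_t packRgb565(uint8_t r, uint8_t g, uint8_t b) {
  return static_cast<uint16_t>(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}
```

Whether this helps depends on how small the function is and how often it is called; for something called once per pixel or per byte the call overhead can be a noticeable share of the work.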

PDC DMA and AHB DMA are surely the best options for speeding things up on a Due whenever they are available.

pjrc

Just a quick followup to this very old thread...

I recently released Ethernet library version 2.0.0, which brings my many optimizations originally written for Teensy to all Arduino boards, including Arduino Due.

W5200 & W5500 now utilize SPI.transfer(buffer, size).  I made many other optimizations, including native register I/O to avoid the slow digitalWrite on Due, and important higher level optimizations.  Details and benchmarks here:

https://www.pjrc.com/arduino-ethernet-library-2-0-0/

To get version 2.0.0, just use the library manager to update your Ethernet lib.



weird_dave

Do you have any like-for-like timing comparisons with the official WIZnet library?

pjrc

This library, right?

https://github.com/Wiznet/WIZ_Ethernet_Library

I installed the "Arduino IDE 1.5.x" version just now.  It doesn't work with my Seeed W5500 shield (with w5100.h edited to select W5500 - this lib doesn't auto-detect which chip you have).

It does work with my Arduino Ethernet R3 shield.  The speed measures 9.83 kbytes/sec.  Ethernet 2.0.0 gets 109.73 kbytes/sec.

pjrc

I switched to Arduino Uno.  Wiznet's library *does* work with the Seeed W5500 shield when using Uno.  The speed is 139.99 kbytes/sec.  For comparison, Ethernet 2.0.0 gets 329.00 kbytes/sec on the same test with Uno, and 689.69 kbytes/sec with Due.

I also retested W5100 (Arduino Ethernet R3).  Wiznet's library gets 10.17 kbytes/sec (yes, slightly faster than the 9.83 kbytes/sec it gets with Arduino Due).  Ethernet 2.0.0 gets 82.66 kbytes/sec when using W5100 on Uno, and 109.73 kbytes/sec with Due.

Without a doubt, Ethernet 2.0.0 is much faster than Wiznet's library.
