Faster Ethernet Anyone?

In another forum section I recently posted about the actual throughput as measured for the Ethernet shield, using a Mega2560. In a nutshell, without any changes to library code, you get around 15 KBytes per second. Changing the SPI clock from 4 MHz to 8 MHz bumps that up to about 20 KBytes per second. Eliminating any usage of the print() and println() functions makes a huge improvement, topping out around 70 KBytes per second. Note that these rates are measured from the first byte of sent data to the last byte, and don't account for other program delays.

With these tweaks, maybe 70 KBytes per second is good enough. But in some situations it would be nice if faster rates were possible. I am evaluating the Arduino platform for some more extreme applications, so faster Ethernet is a requirement for me, and it can be done with the right hardware.

The simplest approach is to use the Wiznet W5500 Ethernet chip instead of the W5100. Although the W5100 is good, its SPI interface is pretty inefficient because it requires that a 16 bit address and 8 bit command be sent ahead of each data byte. The W5500 has cured that inefficiency by allowing data transfers to specify the address once, then follow with just data bytes afterward, automatically incrementing an internal hardware pointer, until the SS line is raised and the SPI transfer is ended. The W5500 is also about half the cost of the W5100, and comes in a smaller package.
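
To put that overhead in numbers, here is a rough sketch of how many bytes cross the SPI bus for an n-byte buffer write under each scheme, going only by the framing described above. The function names are made up for illustration, and the 3-byte W5500 header (16 bit address plus one control byte, sent once per transfer) is an assumption rather than something measured here.

```cpp
#include <cstdint>

// Bytes clocked over SPI to write n data bytes to a W5100:
// every data byte is preceded by an 8 bit command and a 16 bit address.
constexpr std::uint32_t w5100_spi_bytes(std::uint32_t n) {
    return n * (1 + 2 + 1);
}

// Bytes clocked over SPI to write n data bytes to a W5500 in one burst:
// assumed 3-byte header (16 bit address + 8 bit control) sent once,
// after which the chip auto-increments its internal pointer.
constexpr std::uint32_t w5500_spi_bytes(std::uint32_t n) {
    return 3 + n;
}
```

For a 10-byte write that is 40 bytes on the bus versus 13; for a full 2048-byte buffer it is 8192 versus 2051.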

Using the above best case benchmark, and an average buffer write of say, 10 bytes, the expected throughput should be comfortably above 200 KBytes per second. Just keep in mind, each buffer write becomes a single socket send command at the socket level. That means additional overhead associated with sending lots of little packets. Bigger buffers are always better whenever it is practical to do so.

I'm interested in pushing the limits even higher. Fast Ethernet transfers aren't necessary for HTTP client server stuff. They are good for streaming data, however. And if one wanted to stream data in both directions, faster Ethernet would be a necessity.

To go even faster, a faster SPI interface is needed. Now, the specification for this chip indicates a theoretical 80 MHz SPI clock frequency, but a guaranteed frequency of 33.3 MHz. Although not stated, I suspect that 33.3 MHz is more of a guaranteed throughput kind of number for larger SPI transfers. Trying to design around an 80 MHz clock can be quite challenging. 20 MHz isn't so bad, and increasing a custom SPI interface from 8 MHz to 20 MHz would bump the estimated single buffer transfer speed into the 500 KBytes per second range. That is a considerable increase in performance.

To be fair, these numbers are fairly raw. For example, the custom SPI interface would need to be loaded 8 bits at a time. Using a bit-banged approach and avoiding the standard library digital I/O calls, how fast can we wiggle bits? You would need to load new data every 10 processor clock cycles to run at that rate. That could probably be done for a single buffer transfer. But inline assembly code would quite likely be required in that part of the Ethernet library.

One other change for a faster Ethernet shield would be to implement the interrupt line for the W5500. When waiting for a socket connection, the current shield handles this through polling. For simple applications, this is quite fine. But eliminating the polling frees up clock cycles, which may be needed for other things.

Faster Ethernet has advantages. The faster you can get your data to or from the Ethernet chip, the faster it can move the data. This reduces time spent waiting, and increases the time available in your sketch to do other things. If you are bumping into limits on what you are able to do with the existing Ethernet shield, would you be interested in a faster one?

I remember when I was really happy to bump FTP performance from about 70kbps to 300kbps on the 3Mbps ethernet (yeah, those are all lower-case "b" for bits.) It seems that much before that, 56kbps was about the max link speed anyway, so no one had really noticed how slow the internet code was :slight_smile:

would you be interested in a faster [ethernet]?

I dunno. I'd rather have a board that had a better overall design for networking. If it's going to use SPI, it should at least have DMA for its SPI interfaces. (some XMEGA chips can do that, I think.) Overall speed is probably less interesting than "overhead"; I have to generate the data I'm sending somehow...

Did you investigate using the parallel interface mode of the W5100? That seems more natural than putting a custom SPI controller in front of the SPI interface. You could even consider using a more intensive parallel interface to some other ethernet controller (say, like a Yun with higher speed interconnect between the ARM and AVR.)

There are only so many "fixes" you can make to an AVR-based networking device before you have to ask yourself "why was I not using a Raspberry Pi or BeagleBone Black, again?" (or even: connect up the Ethernet on a Due-like design...)

The parallel interface on the W5100 would certainly be better, but that would require using the external memory access capability of the Mega2560. The expansion interface I am designing will open the door to doing 8 bit I/O. And I have left hooks in place for a DMA controller. I agree that DMA solves a number of overhead issues. I've used that in the past on other products with embedded micro-processors. Having a couple of DMA channels to handle communication and signal output or input really simplified things nicely.

Custom SPI dedicated solely to the W5500 will fit in a 44 pin CPLD with a cost of $1.18. It is the easiest way of cranking up the SPI speed and the easiest way of getting faster data transfers. Plus, anytime I see a newer version of an IC costing half as much as a prior version, I begin to wonder about obsolescence issues. I've just seen it happen often enough. Anyway, this is one option that is compatible with the existing Arduino hardware, and only requires minor changes in the Ethernet library. I would definitely utilize the interrupt capabilities of the W5500 for this one.

Now, if one were to use the expansion shield that I am developing as an I/O interface for a different version of a fast Ethernet card, that opens up some interesting doors, especially once the DMA controller is designed. That allows DMA transfers to be set up between Ethernet and expansion memory. And those transfers are not limited to 2 MBytes/sec. Based on the specs for the W5500, it looks like 4 MBytes/sec is a guaranteed transfer rate. Of course, you are limited to transferring no more than what is currently available in the W5500 transmit or receive buffers. But filling up a buffer in a little over 2 ms is pretty attractive. :slight_smile:
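
As a quick sanity check on that last figure, the sketch below computes how long a transfer takes at a given rate. The 8 KByte buffer size used in the example is an assumption for illustration (the post does not say which buffer size was meant), and the function name is hypothetical.

```cpp
#include <cstdint>

// Time in microseconds to move `bytes` at `bytes_per_sec`, rounded down.
// bytes_per_sec must be non-zero.
constexpr std::uint64_t transfer_time_us(std::uint64_t bytes,
                                         std::uint64_t bytes_per_sec) {
    return bytes * 1000000ULL / bytes_per_sec;
}
```

With an assumed 8192-byte buffer, transfer_time_us(8192, 4000000) gives 2048 microseconds, which lines up with "a little over 2 ms"; at 2 MBytes/sec the same buffer would take 4096 microseconds.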

I have done some optimizations of the print library last year (together with some others) that really improved the printing of numbers including floats substantially. Should work for the ethernet shield too.

If you use the print.cpp and print.h posted at the end (replace in the core file) you can test it on the net.

(have no ethershield free to test myself)

I would be happy to run some tests with these libraries and see how they perform.

attached my print.cpp and print.h
you need to rename the print.cpp and print.h in the core and place these there.

you need also to put the fastmath.h in the core folder
(core == e.g. C:\Program Files (x86)\arduino-1.0.4\hardware\arduino\cores\arduino)

Print.zip (6.97 KB)

Hope there is still some interest in this! W5100 is still the cheapest card available and stocked by local dealers. And I think many want a good upload speed.

I did some performance tests, and indeed I am getting only 10.3 KBytes/sec in receive mode. Send mode is fast, at least 70 KBytes/sec. I measured the exact delays, and here are my findings. (My setup is a dedicated Ethernet link between a laptop and an Arduino Mega. I modified TinyWebServer to remove unnecessary calls in put_handler.)

  • A 2048-byte buffer can be fetched from the card (using SPI) in about 10-12 ms.
  • But after you fetch this much, there is an inexplicable delay of about 170 ms. During this time, even
    if you keep checking for data, none is returned. This is also reflected in the LEDs going blank during this time, i.e. no activity by the card.

And this pattern repeats. So in 1 sec, you have 5 repeats of above scenario, and hence, you will get only 10 KBytes of data.

Maximum possible data fetch, assuming you have no delays will be about 200 KBytes per second, which is quite decent.
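
The arithmetic behind both figures can be sketched as follows, using only the numbers measured above (2048 bytes per burst, 10-12 ms of SPI fetch time, about 170 ms of dead time). The function name is made up for illustration.

```cpp
#include <cstdint>

// Average receive rate in bytes/second when each burst of `burst_bytes`
// costs `fetch_ms` of SPI time plus `stall_ms` of dead time.
// fetch_ms + stall_ms must be non-zero.
constexpr std::uint32_t receive_rate(std::uint32_t burst_bytes,
                                     std::uint32_t fetch_ms,
                                     std::uint32_t stall_ms) {
    return burst_bytes * 1000u / (fetch_ms + stall_ms);
}
```

receive_rate(2048, 12, 170) comes to 11252 bytes/sec, close to the measured 10.3 KBytes/sec. Without the stall, receive_rate(2048, 12, 0) is 170666 and receive_rate(2048, 10, 0) is 204800, which is where the "about 200 KBytes per second" ceiling comes from.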

To verify that the delay is in the card and not in any part of the driver, I put a delay of about 170 ms in the main receive loop of the application. As expected, this delay runs in parallel with the delay caused by the card, and hence the data transfer rate is still the same 10.3 KBytes/sec.

Does anyone with deep experience of the W5100 have an explanation for this behavior?

Have you got a packet trace?

170ms+12ms is suspiciously close to the 200ms "recommendation" for TCP features like "delayed ack" and gratuitous window updates. (TCP is a complicated protocol with built-in flow-control mechanisms that probably don't interact all that well with an implementation that has very limited buffer memory (about 2k per connection per direction, if it's all statically allocated. Coincidence with your "delay after reading 2048 bytes"? I doubt it!))

Do you get the same performance using something like UDP ? (of course, you'd probably end up dropping a lot of packets...)

Want to thank you for the clues to proceed further.

I did a quick UDP check using TFTP. Seems like it gets stuck at UDP.parsePacket() after about 64KBytes of transfer. I am now able to use tcpdump and see the flow. Let me explore further and share the results.

I finally got a tcpdump of the connection from laptop to Arduino over a dedicated Ethernet connection. I used curl to transfer a file of about 450 KBytes, and it has always taken about 45 seconds. Here is a trace taken from somewhere in between:

00:03:03.745287 IP arduino.http > laptop.41472: Flags [.], ack 316534, win 2048, length 0
00:03:03.745358 IP laptop.41472 > arduino.http: Flags [P.], seq 316534:317558, ack 20, win 29200, length 1024
00:03:03.745365 IP laptop.41472 > arduino.http: Flags [.], seq 317558:318582, ack 20, win 29200, length 1024
00:03:03.948151 IP arduino.http > laptop.41472: Flags [.], ack 318582, win 2048, length 0
00:03:03.948222 IP laptop.41472 > arduino.http: Flags [P.], seq 318582:319606, ack 20, win 29200, length 1024
00:03:03.948228 IP laptop.41472 > arduino.http: Flags [.], seq 319606:320630, ack 20, win 29200, length 1024
00:03:04.150957 IP arduino.http > laptop.41472: Flags [.], ack 320630, win 2048, length 0
00:03:04.151028 IP laptop.41472 > arduino.http: Flags [P.], seq 320630:321654, ack 20, win 29200, length 1024
00:03:04.151034 IP laptop.41472 > arduino.http: Flags [.], seq 321654:322678, ack 20, win 29200, length 1024
00:03:04.353723 IP arduino.http > laptop.41472: Flags [.], ack 322678, win 2048, length 0
00:03:04.353758 IP laptop.41472 > arduino.http: Flags [P.], seq 322678:323702, ack 20, win 29200, length 1024
00:03:04.353763 IP laptop.41472 > arduino.http: Flags [P.], seq 323702:324726, ack 20, win 29200, length 1024
00:03:04.556663 IP arduino.http > laptop.41472: Flags [.], ack 324726, win 2048, length 0
00:03:04.556730 IP laptop.41472 > arduino.http: Flags [.], seq 324726:325750, ack 20, win 29200, length 1024
00:03:04.556736 IP laptop.41472 > arduino.http: Flags [P.], seq 325750:326774, ack 20, win 29200, length 1024
00:03:04.759463 IP arduino.http > laptop.41472: Flags [.], ack 326774, win 2048, length 0
00:03:04.759504 IP laptop.41472 > arduino.http: Flags [P.], seq 326774:327798, ack 20, win 29200, length 1024
00:03:04.759509 IP laptop.41472 > arduino.http: Flags [.], seq 327798:328822, ack 20, win 29200, length 1024
00:03:04.962329 IP arduino.http > laptop.41472: Flags [.], ack 328822, win 2048, length 0
00:03:04.962397 IP laptop.41472 > arduino.http: Flags [P.], seq 328822:329846, ack 20, win 29200, length 1024
00:03:04.962404 IP laptop.41472 > arduino.http: Flags [.], seq 329846:330870, ack 20, win 29200, length 1024
00:03:05.165186 IP arduino.http > laptop.41472: Flags [.], ack 330870, win 2048, length 0
00:03:05.165251 IP laptop.41472 > arduino.http: Flags [P.], seq 330870:331894, ack 20, win 29200, length 1024
00:03:05.165257 IP laptop.41472 > arduino.http: Flags [.], seq 331894:332918, ack 20, win 29200, length 1024
00:03:05.368028 IP arduino.http > laptop.41472: Flags [.], ack 332918, win 2048, length 0
00:03:05.368096 IP laptop.41472 > arduino.http: Flags [P.], seq 332918:333942, ack 20, win 29200, length 1024
00:03:05.368103 IP laptop.41472 > arduino.http: Flags [.], seq 333942:334966, ack 20, win 29200, length 1024
00:03:05.570882 IP arduino.http > laptop.41472: Flags [.], ack 334966, win 2048, length 0
00:03:05.570953 IP laptop.41472 > arduino.http: Flags [P.], seq 334966:335990, ack 20, win 29200, length 1024
00:03:05.570960 IP laptop.41472 > arduino.http: Flags [.], seq 335990:337014, ack 20, win 29200, length 1024
00:03:05.773694 IP arduino.http > laptop.41472: Flags [.], ack 337014, win 1600, length 0
00:03:05.773767 IP laptop.41472 > arduino.http: Flags [P.], seq 337014:338038, ack 20, win 29200, length 1024

If you look closely, my Linux laptop processes each packet almost immediately. The W5100, however, takes about 200 ms to respond, every single time. So this may indeed be the time the W5100 needs to process a receive stream. I wonder why it takes so long, when it can transmit much faster.

The TCP protocol conformance itself seems to be pretty accurate. It correctly advertises 2048 as its buffer size.
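
The effective rate can be read straight off the trace. As a rough sketch (the function name is made up), take the first Arduino ACK in the excerpt (00:03:03.745287, ack 316534) and the last one (00:03:05.773694, ack 337014): 20480 bytes were acknowledged in 2.028407 seconds.

```cpp
#include <cstdint>

// Throughput in bytes/second from the change in ACK number over a time span.
// elapsed_us must be non-zero.
constexpr std::uint64_t acked_rate(std::uint64_t first_ack,
                                   std::uint64_t last_ack,
                                   std::uint64_t elapsed_us) {
    return (last_ack - first_ack) * 1000000ULL / elapsed_us;
}
```

acked_rate(316534, 337014, 2028407) comes to 10096 bytes/sec, which agrees with 450 KBytes taking about 45 seconds: two 1024-byte segments per roughly 203 ms cycle.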

W5100 takes about 200ms to respond. Every single time.

But you also said that you see pauses in your application level AFTER you read each 2048 bytes, which implies that the W5100 is going reasonably fast for the purposes of the actual interface. It's just not promptly re-opening the TCP window when the buffers are returned to it. I.e., it's doing "delayed ack" but isn't doing the window management as well as it could.

It looks like the w5100 has an option to turn off delayed acks. Try changing the calls to socket() in EthernetClient.cpp or EthernetServer.cpp from

  socket(_sock, SnMR::TCP, _srcport, 0);

To

  socket(_sock, SnMR::TCP, _srcport, SnMR::ND);

Not using delayed ACKs typically increases the number of ACKs sent; I suspect that if it works, the Arduino will be sending four ACKs instead of one for each window full of data (one pure ACK for each packet received, one window update for each packet read.) I'd like to see a new packet trace (I don't have an active w5100 here) if this results in any change. (It's SO nice to have data flowing in both directions, on the forums...) In any case, with modern network infrastructure, and a tiny Arduino as the source, it's unlikely that the extra ACK would cause any problems.
There's still a chance it will send ACKs immediately, but still not send the window updates promptly :frowning:

(This conversation SO reminds me why I'm reluctant to get into Arduino networking. Someone else's network stack, with no visible code, no instrumentation, and no ability to FIX anything. Sigh.)

WestFW, I would really like to thank you. I think we really yearn to extract every bit from hardware!

So I tried the ND flag as you suggested. I am attaching the full file, so you can see the complete negotiations. Yes, the flag is respected, but arduino is unable to keep up with the traffic. It will try to negotiate but finally give up and close the socket. Needless to say, the download is not successful - only about 25kb came through.

tmp-ND.txt (10.7 KB)

Well, that's a weird trace. I'm not sure what's going on. It looks like the Arduino is always advertising a 2048 byte window, even when its buffers are full. Then, of course, it ends up dropping packets.

I'd expect to see

Arduino: window 2048
laptop: 1024 bytes 
Arduino: window 1024
laptop: 1024 bytes
Arduino: window 0
 (sketch reads data)
Arduino: window 1024
 (sketch reads more data)
Arduino: window 2048
laptop: 1024 bytes ...
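
That expected sequence is just "advertise whatever buffer space is free". A minimal model of it is sketched below; RxBuffer is a hypothetical type for illustration and has nothing to do with how the W5100 implements its window internally.

```cpp
#include <cstdint>

// Hypothetical model of a fixed-size TCP receive buffer (not W5100 code).
struct RxBuffer {
    std::uint16_t capacity;  // total receive buffer, e.g. 2048 bytes
    std::uint16_t buffered;  // bytes received but not yet read by the sketch

    // Window the receiver should advertise: the free space that is left.
    constexpr std::uint16_t window() const {
        return static_cast<std::uint16_t>(capacity - buffered);
    }
    // State after the peer delivers n bytes (n must not exceed window()).
    constexpr RxBuffer receive(std::uint16_t n) const {
        return RxBuffer{capacity, static_cast<std::uint16_t>(buffered + n)};
    }
    // State after the sketch reads n bytes (n must not exceed buffered).
    constexpr RxBuffer read(std::uint16_t n) const {
        return RxBuffer{capacity, static_cast<std::uint16_t>(buffered - n)};
    }
};
```

Starting from RxBuffer{2048, 0}, two receive(1024) calls take window() from 2048 to 1024 to 0, and each read(1024) opens it back up by 1024, matching the list above.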

Another idea is to have the sketch-level code send a keepalive packet after it has read "significant" data. In theory, the keepalive would contain updated ACK values, but "invalid" data that would be ignored by the other side (I'm not entirely sure it will process the ACK in that case...) But I don't see an existing API for sending keepalives, so this would be more involved (and I'm not quite sure how to tell when to try to send it.)
Presumably, this would use W5100.execCmdSn(s, Sock_SEND_KEEP);

I added a KEEP_ALIVE call every 20 ms while no data was being received. I am attaching the tcpdump file.

As you can see, it initially seems to advertise decreasing window sizes. However, the 200 ms delay still remains, after which it advertises 2048 again. And in the later part of the trace it no longer does even that: it always advertises 2048.

The file was however not saved. There was a timeout error from application side (i.e. no data for more than 30sec.).

tmp-ND-KA.txt (147 KB)

Subsequent to my earlier attempts, I used an ENC28J60 card with the UIPEthernet library. This is a still cheaper card, but with only basic functionality, so the TCP/IP stack runs on the Arduino.

TCP:

With some tuning, I could get 20-25KBytes/sec receive performance on UIPEthernet's TCP stack. The processing of each packet takes about 4-5ms of CPU time. And delay is almost always due to TCP protocol interaction i.e. how acknowledgement packets are sent. For example, there is no delayed ack.

UDP:

When I used UDP from UIPEthernet, and with appropriate tuning, I could get near to 90-100 KBytes/sec. I used a TFTP server for the same, with one ack packet for every 512-byte packet. (I used a Mega at 16 MHz.) So I very much recommend using TFTP for large transfers on Arduino.

With more experimentation, I could get 53 KBytes/sec on ethernet using ENC28J60 card. It all has to do with experimenting with already provided configurations for IP. (This is for UIP Ethernet stack.)

There are three configuration settings: UIP_CONF_BUFFER_SIZE, UIP_CONF_TCP_MSS, UIP_CONF_RECEIVE_WINDOW. All of these are set in ./utility/uip-conf.h. In the distribution, some of these values are already set, and with them I got only 7-8 KBytes/sec. When I simply commented out all of these values, the defaults from ./utility/uip-opt.h took effect: only UIP_BUFSIZE is set (to 400), and the others are derived from this value. With this truly default configuration, I could get more than 50 KBytes/sec.
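
For reference, here is a rough sketch of how the other values would fall out of a 400-byte buffer. It assumes the stock uIP derivation (MSS is the buffer size minus a 14-byte Ethernet header and a 40-byte TCP/IP header, and the receive window defaults to one MSS); whether the UIPEthernet copy of the headers does exactly this is an assumption, and the names below are made up rather than taken from the library.

```cpp
// Assumed stock uIP header sizes (not verified against UIPEthernet):
// 14-byte Ethernet link header, 40-byte combined IP + TCP header.
constexpr unsigned kLinkHeaderLen  = 14;
constexpr unsigned kTcpIpHeaderLen = 40;

// MSS derived from the buffer size when no explicit MSS is configured.
// buf_size must be larger than the two headers combined.
constexpr unsigned derived_mss(unsigned buf_size) {
    return buf_size - kLinkHeaderLen - kTcpIpHeaderLen;
}

// Receive window assumed to default to one MSS when not configured.
constexpr unsigned derived_window(unsigned buf_size) {
    return derived_mss(buf_size);
}
```

Under those assumptions, a 400-byte buffer gives an MSS and receive window of 346 bytes, so every segment the peer sends fits the buffer exactly.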

And looking at the tcpdump this time, it was perfect, with no duplicates and no obvious opportunity to tune further. Nevertheless, UDP performance was almost 80-90 KBytes/sec, so perhaps we should try further.

Attaching the file with which I measured the bandwidth.

UIPTcpServer.ino (2.7 KB)

I am now happy to report that my receive performance problem with W5100 is also solved. Now I am able to get 60KBytes/sec TCP performance.

This is what I discovered: Earlier, I was uploading a file to web server (using Curl) running on arduino. Here there were no packets sent back during the transfer.

But when I tried the interactive TCP code (attached to the previous post), I could see 60KBytes/sec. So the delay of 200ms was being caused by W5100 waiting for something to be sent back to client. When you explicitly send short lines, I think W5100 doesn't do extra wait, and so, no more delay.

Perhaps there are some timers or something similar in the Ethernet client codebase; I will check it out.

In case anyone's following this old thread, I've released Ethernet 2.0.0 today. It features greatly improved performance, especially on the older W5100 chip.

Here's an article I wrote about the changes, which includes benchmark testing on 15 different boards and a variety of shields with all 3 Wiznet chips.

https://www.pjrc.com/arduino-ethernet-library-2-0-0/

Cool. Those benchmarks really show the difference between the W5100 chips and the "next generation" chips!

Are you running on 10Mbps ethernet or 100Mbps? (Now that the numbers are getting big enough where it might make a difference.)

Really great work Paul.

Just tested it on my project:

  • Arduino DUE
  • W5500
  • webserver HTTP
  • webserver UDP
  • webserver webSocket

First, HTTP seems much faster. With the Ethernet2 library, a simple GET request took about 370 ms for 1.2K of data.
Now I get 195 ms!

I use a UDP server to call an NTP getTime routine. It was unstable before: it got no response 20% of the time. Now it seems more stable. It looks to be related...

I have a WebSocket library modified from Links2004/arduinoWebSockets.
It works well too. It's great that I could remove some #ifdef blocks that selected the chip. Maybe more could be removed if the Ethernet library supported the ESP8266... but there I'm neither equipped nor qualified to help...
Also, Ethernet2 did not have client.remoteIP(); great to have it now.

Again, congratulations on the good work. Perhaps a new thread would be a good way to follow your further development.

Regards.

Nitrof