
Topic: Faster Ethernet Anyone?

rkessing01

In another forum section I recently posted about the actual throughput as measured for the Ethernet shield, using a Mega2560. In a nutshell, without any changes to library code, you get around 15 KBytes per second. Changing the SPI clock from 4 MHz to 8 MHz bumps that up to about 20 KBytes per second. Eliminating any usage of the print() and println() functions makes a huge improvement, topping out around 70 KBytes per second. Note that these rates are based on measuring from the first byte of sent data to the last byte, and don't account for other program delays.

With these tweaks, maybe 70 KBytes per second is good enough. But in some situations, it would be nice if faster rates were possible. I am evaluating the Arduino platform for some more extreme applications, and faster Ethernet is a requirement for me. It can be done with the right hardware.

The simplest approach is to use the Wiznet W5500 Ethernet chip instead of the W5100. Although the W5100 is good, its SPI interface is pretty inefficient because it requires that a 16 bit address and 8 bit command be sent ahead of each data byte. The W5500 has cured that inefficiency by allowing data transfers to specify the address once, then follow with just data bytes afterward, automatically incrementing an internal hardware pointer, until the SS line is raised and the SPI transfer is ended. The W5500 is also about half the cost of the W5100, and comes in a smaller package.

Using the above best case benchmark, and an average buffer write of say, 10 bytes, the expected throughput should be comfortably above 200 KBytes per second. Just keep in mind, each buffer write becomes a single socket send command at the socket level. That means additional overhead associated with sending lots of little packets. Bigger buffers are always better whenever it is practical to do so.
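A rough way to sanity-check that estimate is to count bytes on the SPI wire. The sketch below assumes the SPI link is the only bottleneck (a simplification), that the W5100 needs 4 wire bytes per data byte (command, two address bytes, data), and that a W5500 burst costs a 3-byte header (two address bytes plus one control byte) followed by the data. The function names are made up for illustration.

```cpp
#include <cstdio>

// Bytes clocked over SPI to move `payload` data bytes on a W5100:
// every data byte is its own transaction (1 command + 2 address + 1 data).
constexpr unsigned w5100WireBytes(unsigned payload) {
    return payload * 4u;
}

// Bytes clocked over SPI for the same payload in one W5500 burst:
// 2 address bytes + 1 control byte, then the data with auto-increment.
constexpr unsigned w5500WireBytes(unsigned payload) {
    return 3u + payload;
}

// Scale a measured W5100 rate by the reduction in bytes on the wire.
// Assumes the SPI link is the bottleneck, which ignores socket overhead.
constexpr double estimateW5500KBps(double measuredW5100KBps, unsigned payload) {
    return measuredW5100KBps * w5100WireBytes(payload) / w5500WireBytes(payload);
}

// Prints the estimate for the 70 KBytes/s benchmark and 10-byte writes.
void printEstimate() {
    std::printf("%.0f KBytes/s\n", estimateW5500KBps(70.0, 10));
}
```

With the 70 KBytes per second benchmark and 10-byte writes this comes out at roughly 215 KBytes per second (40 wire bytes shrink to 13), which is consistent with "comfortably above 200". Larger writes push the ratio toward 4x, smaller ones toward 1x.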

I'm interested in pushing the limits even higher. Fast Ethernet transfers aren't necessary for HTTP client server stuff. They are good for streaming data, however. And if one wanted to stream data in both directions, faster Ethernet would be a necessity.

To go even faster, a faster SPI interface is needed. Now, the specification for this chip indicates a theoretical 80 MHz SPI clock frequency, but a guaranteed frequency of 33.3 MHz. Although not stated, I suspect that 33.3 MHz is more of a guaranteed throughput kind of number for larger SPI transfers. Trying to design around an 80 MHz clock can be quite challenging. 20 MHz isn't so bad, and increasing a custom SPI interface from 8 MHz to 20 MHz would bump the estimated single buffer transfer speed into the 500 KByte/second range. That is a considerable increase in performance.
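The clock scaling behind those figures can be sketched as below. It assumes throughput scales linearly with the SPI clock, which ignores any fixed per-transfer cost on the processor side, and it takes roughly 200 KBytes per second at 8 MHz as the starting point. Both helper names are illustrative.

```cpp
// Scale a throughput estimate linearly with the SPI clock.
// Assumption: nothing but the SPI clock limits the transfer.
constexpr double scaleWithClock(double kBps, double fromMHz, double toMHz) {
    return kBps * toMHz / fromMHz;
}

// Hard ceiling on bytes per second for a given SPI clock:
// one byte needs 8 clock cycles, with no gaps between bytes.
constexpr double rawBytesPerSecond(double clockHz) {
    return clockHz / 8.0;
}
```

Going from 8 MHz to 20 MHz turns 200 KBytes per second into 500, and a 33.3 MHz clock caps out a little above 4 MBytes per second before any protocol overhead.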

To be fair, these numbers are fairly raw. For example, the custom SPI interface would need to be loaded 8 bits at a time. Using a bit-banged approach and avoiding the standard library digital I/O calls, how fast can we wiggle bits? You would need to load new data every 10 processor clock cycles to run at that rate. That could probably be done for a single buffer transfer. But inline assembly code would quite likely be required in that part of the Ethernet library.

One other change for a faster Ethernet shield would be to implement the interrupt line for the W5500. When waiting for a socket connection, the current shield handles this through polling. For simple applications, this is quite fine. But eliminating the polling frees up clock cycles, which may be needed for other things.

Faster Ethernet has advantages. The faster you can get your data to or from the Ethernet chip, the faster it can move the data. This reduces time spent waiting, and increases the time available in your sketch to do other things. If you are bumping into limits on what you are able to do with the existing Ethernet shield, would you be interested in a faster one?


westfw

I remember when I was really happy to bump FTP performance from about 70kbps to 300kbps on the 3Mbps ethernet (yeah, those are all lower-case "b" for bits.)  It seems that much before that, 56kbps was about the max link speed anyway, so no one had really noticed how slow the internet code was :-)

Quote
would you be interested in a faster [ethernet]?

I dunno.  I'd rather have a board that had a better overall design for networking.  If it's going to use SPI, it should at least have DMA for its SPI interfaces.  (some XMEGA chips can do that, I think.)  Overall speed is probably less interesting than "overhead"; I have to generate the data I'm sending somehow...

Did you investigate using the parallel interface mode of the W5100 ?  That seems more natural than putting a custom SPI controller in front of the SPI interface.  You could even consider using a more intensive parallel interface to some other ethernet controller (say, like a Yun with higher speed interconnect between the ARM and AVR.)

There are only so many "fixes" you can make to an AVR-based networking device before you have to ask yourself "why was I not using a Raspberry Pi or BeagleBone Black, again?"  (or even: connect up the Ethernet on a Due-like design...)

rkessing01

The parallel interface on the W5100 would certainly be better, but that would require using the external memory access capability of the Mega2560. The expansion interface I am designing will open the door to doing 8 bit I/O. And I have left hooks in place for a DMA controller.  I agree that DMA solves a number of overhead issues. I've used that in the past on other products with embedded micro-processors. Having a couple of DMA channels to handle communication and signal output or input really simplified things nicely.

Custom SPI dedicated solely to the W5500 will fit in a 44 pin CPLD with a cost of $1.18. It is the easiest way of cranking up the SPI speed and the easiest way of getting faster data transfers. Plus, anytime I see a newer version of an IC costing half as much as a prior version, I begin to wonder about obsolescence issues. I've just seen it happen often enough. Anyway, this is one option that is compatible with the existing Arduino hardware, and only requires minor changes in the Ethernet library. I would definitely utilize the interrupt capabilities of the W5500 for this one.

Now, if one were to use the expansion shield that I am developing as an I/O interface for a different version of a fast Ethernet card, that opens up some interesting doors, especially once the DMA controller is designed. That allows DMA transfers to be set up between Ethernet and expansion memory. And those transfers are not limited to 2 MBytes/sec. Based on the specs for the W5500, it looks like 4 MBytes/sec is a guaranteed transfer rate. Of course, you are limited to transferring no more than what is currently available in the W5500 transmit or receive buffers. But filling up a buffer in a little over 2 ms is pretty attractive.  :)




robtillaart

I have done some optimizations of the print library last year (together with some others) that really improved the printing of numbers including floats substantially. Should work for the ethernet shield too.

- http://forum.arduino.cc/index.php?topic=179111 - (warning long thread ahead)
- http://forum.arduino.cc/index.php?topic=167414.0 - (warning ^2 even longer thread ahead)

If you use the print.cpp and print.h posted at the end (replace in the core file) you can test it on the net.

(have no ethershield free to test myself)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

rkessing01

I would be happy to run some tests with these libraries and see how they perform.

robtillaart

Attached are my print.cpp and print.h.
You need to rename the existing print.cpp and print.h in the core and put these in their place.

you need also to put the fastmath.h in the core folder
(core == e.g. C:\Program Files (x86)\arduino-1.0.4\hardware\arduino\cores\arduino)

pv_bgm

Hope there is still some interest in this! W5100 is still the cheapest card available and stocked by local dealers. And I think many want a good upload speed.

I did some performance tests, and I am indeed getting only 10.3 KBytes/sec in receive mode. Send mode is fast, at least 70 KBytes per sec. I measured the exact delays, and here are my findings. (My setup is a dedicated ethernet link between a laptop and an Arduino Mega. I modified TinyWebServer to remove unnecessary calls in put_handler.)

- A 2048-byte buffer can be fetched from the card (using SPI) in about 10-12ms.
- But after you fetch this much, there is an inexplicable delay of about 170ms. During this time, even if you keep checking whether there is data, no data is returned. This is also reflected in the LEDs going blank during this time, i.e. no activity by the card.

And this pattern repeats. So in 1 sec you get about 5 repeats of the above scenario, and hence only about 10 KBytes of data.

Maximum possible data fetch, assuming you have no delays will be about 200 KBytes per second, which is quite decent.
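The arithmetic behind both numbers can be written down as a tiny helper. This is only a model of the burst-then-stall pattern described above, using the measured 2048 bytes, 10-12 ms read time and roughly 170 ms stall; the function name is made up for illustration.

```cpp
// Average rate in bytes per second when `bytes` are read in `readMs`
// and the link then sits idle for `idleMs` before the next burst.
constexpr double cycleBytesPerSecond(double bytes, double readMs, double idleMs) {
    return bytes * 1000.0 / (readMs + idleMs);
}
```

With a 12 ms read and a 170 ms stall, 2048 bytes per cycle works out to a bit over 11,000 bytes per second, close to the measured 10.3 KBytes/sec. With no stall and a 10 ms read it is 204,800 bytes per second, which is where the 200 KBytes figure comes from.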

To verify that the delay is in the card and not in any part of the driver, I put a delay of about 170ms inside the application's main receive loop. As expected, this delay runs in parallel with the delay caused by the card, so the data transfer rate stays at 10.3KBytes/sec.

Does anyone with deep experience of the W5100 have an explanation for this behavior?

westfw

Have you got a packet trace?

170ms+12ms is suspiciously close to the 200ms "recommendation" for TCP features like "delayed ack" and gratuitous window updates.  (TCP is a complicated protocol with built-in flow-control mechanisms that probably don't interact all that well with an implementation that has very limited buffer memory (about 2k per connection per direction, if it's all statically allocated.  Coincidence with your "delay after reading 2048 bytes"?  I doubt it!))

Do you get the same performance using something like UDP ?  (of course, you'd probably end up dropping a lot of packets...)


pv_bgm

Want to thank you for the clues to proceed further.

I did a quick UDP check using TFTP.  Seems like it gets stuck at UDP.parsePacket() after about 64KBytes of transfer.  I am now able to use tcpdump and see the flow. Let me explore further and share the results.

pv_bgm

I finally got a tcpdump of the connection from the laptop to the Arduino over a dedicated ethernet connection. I used curl to transfer a file of about 450KBytes, and the transfer has always taken about 45 seconds. Here is a trace taken from somewhere in the middle:

<code>
00:03:03.745287 IP arduino.http > laptop.41472: Flags [.], ack 316534, win 2048, length 0
00:03:03.745358 IP laptop.41472 > arduino.http: Flags [P.], seq 316534:317558, ack 20, win 29200, length 1024
00:03:03.745365 IP laptop.41472 > arduino.http: Flags [.], seq 317558:318582, ack 20, win 29200, length 1024
00:03:03.948151 IP arduino.http > laptop.41472: Flags [.], ack 318582, win 2048, length 0
00:03:03.948222 IP laptop.41472 > arduino.http: Flags [P.], seq 318582:319606, ack 20, win 29200, length 1024
00:03:03.948228 IP laptop.41472 > arduino.http: Flags [.], seq 319606:320630, ack 20, win 29200, length 1024
00:03:04.150957 IP arduino.http > laptop.41472: Flags [.], ack 320630, win 2048, length 0
00:03:04.151028 IP laptop.41472 > arduino.http: Flags [P.], seq 320630:321654, ack 20, win 29200, length 1024
00:03:04.151034 IP laptop.41472 > arduino.http: Flags [.], seq 321654:322678, ack 20, win 29200, length 1024
00:03:04.353723 IP arduino.http > laptop.41472: Flags [.], ack 322678, win 2048, length 0
00:03:04.353758 IP laptop.41472 > arduino.http: Flags [P.], seq 322678:323702, ack 20, win 29200, length 1024
00:03:04.353763 IP laptop.41472 > arduino.http: Flags [P.], seq 323702:324726, ack 20, win 29200, length 1024
00:03:04.556663 IP arduino.http > laptop.41472: Flags [.], ack 324726, win 2048, length 0
00:03:04.556730 IP laptop.41472 > arduino.http: Flags [.], seq 324726:325750, ack 20, win 29200, length 1024
00:03:04.556736 IP laptop.41472 > arduino.http: Flags [P.], seq 325750:326774, ack 20, win 29200, length 1024
00:03:04.759463 IP arduino.http > laptop.41472: Flags [.], ack 326774, win 2048, length 0
00:03:04.759504 IP laptop.41472 > arduino.http: Flags [P.], seq 326774:327798, ack 20, win 29200, length 1024
00:03:04.759509 IP laptop.41472 > arduino.http: Flags [.], seq 327798:328822, ack 20, win 29200, length 1024
00:03:04.962329 IP arduino.http > laptop.41472: Flags [.], ack 328822, win 2048, length 0
00:03:04.962397 IP laptop.41472 > arduino.http: Flags [P.], seq 328822:329846, ack 20, win 29200, length 1024
00:03:04.962404 IP laptop.41472 > arduino.http: Flags [.], seq 329846:330870, ack 20, win 29200, length 1024
00:03:05.165186 IP arduino.http > laptop.41472: Flags [.], ack 330870, win 2048, length 0
00:03:05.165251 IP laptop.41472 > arduino.http: Flags [P.], seq 330870:331894, ack 20, win 29200, length 1024
00:03:05.165257 IP laptop.41472 > arduino.http: Flags [.], seq 331894:332918, ack 20, win 29200, length 1024
00:03:05.368028 IP arduino.http > laptop.41472: Flags [.], ack 332918, win 2048, length 0
00:03:05.368096 IP laptop.41472 > arduino.http: Flags [P.], seq 332918:333942, ack 20, win 29200, length 1024
00:03:05.368103 IP laptop.41472 > arduino.http: Flags [.], seq 333942:334966, ack 20, win 29200, length 1024
00:03:05.570882 IP arduino.http > laptop.41472: Flags [.], ack 334966, win 2048, length 0
00:03:05.570953 IP laptop.41472 > arduino.http: Flags [P.], seq 334966:335990, ack 20, win 29200, length 1024
00:03:05.570960 IP laptop.41472 > arduino.http: Flags [.], seq 335990:337014, ack 20, win 29200, length 1024
00:03:05.773694 IP arduino.http > laptop.41472: Flags [.], ack 337014, win 1600, length 0
00:03:05.773767 IP laptop.41472 > arduino.http: Flags [P.], seq 337014:338038, ack 20, win 29200, length 1024
</code>
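To put a number on the gap in the trace above, the timestamps of two consecutive ACKs from the Arduino can be subtracted. The helper below is a minimal sketch that assumes the default tcpdump timestamp format (HH:MM:SS.ffffff at the start of each line) and that both timestamps fall on the same day; the function names are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Convert a leading tcpdump "HH:MM:SS.ffffff" timestamp to microseconds
// since midnight. Returns -1 if the line does not start with one.
long long timestampMicros(const std::string& line) {
    int h = 0, m = 0, s = 0, us = 0;
    if (std::sscanf(line.c_str(), "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4) {
        return -1;
    }
    return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

// Bytes per second implied by `bytes` being acknowledged once per `gapMicros`.
double bytesPerSecond(long long bytes, long long gapMicros) {
    return bytes * 1e6 / static_cast<double>(gapMicros);
}
```

The first two ACKs in the trace (03.745287 and 03.948151) are 202,864 microseconds apart and cover 2048 bytes (ack 316534 to ack 318582). That is about 10.1 KBytes per second, which lines up with 450 KBytes taking about 45 seconds.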

If you look closely, my Linux laptop processes each packet in no time. However, the W5100 takes about 200ms to respond. Every single time. So this may indeed be the time the W5100 needs to process a receive stream. I am wondering why it takes so long, when it can transmit much faster.

The TCP protocol conformance itself seems to be pretty accurate. It correctly advertises 2048 as its buffer size.

westfw

Quote
W5100 takes about 200ms to respond. Every single time.
But you also said that you see pauses in your application level AFTER you read each 2048 bytes, which implies that the W5100 is going reasonably fast for the purposes of the actual interface.  It's just not promptly re-opening the TCP window when the buffers are returned to it.  I.e., it's doing "delayed ack" but isn't doing the window management as well as it could.

It looks like the W5100 has an option to turn off delayed ACKs.  Try changing the calls to socket() in EthernetClient.cpp or EthernetServer.cpp from
Code:
  socket(_sock, SnMR::TCP, _srcport, 0);
to
Code:
  socket(_sock, SnMR::TCP, _srcport, SnMR::ND);

Not using delayed ACKs typically increases the number of ACKs sent; I suspect that if it works, the Arduino will be sending four ACKs instead of one for each window full of data (one pure ACK for each packet received, one window update for each packet read.)  I'd like to see a new packet trace (I don't have an active w5100 here) if this results in any change.  (It's SO nice to have data flowing in both directions, on the forums...)  In any case, with modern network infrastructure, and a tiny Arduino as the source, it's unlikely that the extra ACK would cause any problems.
There's still a chance it will send ACKs immediately, but still not send the window updates promptly :-(

(This conversation SO reminds me why I'm reluctant to get into Arduino networking.  Someone else's network stack, with no visible code, no instrumentation, and no ability to FIX anything.  Sigh.)


pv_bgm

WestFW, I would really like to thank you. I think we really yearn to extract every bit from hardware!

So I tried the ND flag as you suggested. I am attaching the full file, so you can see the complete negotiations. Yes, the flag is respected, but arduino is unable to keep up with the traffic.  It will try to negotiate but finally give up and close the socket. Needless to say, the download is not successful - only about 25kb came through.

westfw

Well, that's a weird trace.  I'm not sure what's going on.   It looks like the Arduino is always advertising a 2048 byte window, even when its buffers are full.  Then, of course, it ends up dropping packets.

I'd expect to see
Arduino: window 2048
laptop: 1024 bytes
Arduino: window 1024
laptop: 1024 bytes
Arduino: window 0
 (sketch reads data)
Arduino: window 1024
 (sketch reads more data)
Arduino: window 2048
laptop: 1024 bytes ...
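The expected sequence above is just "advertised window = free space in the socket buffer". A minimal model of that rule, not the W5100's actual firmware logic (which is not visible), might look like this. The class and method names are made up for illustration.

```cpp
#include <cstddef>

// Minimal model of a receiver with a fixed-size socket buffer.
// The advertised window is the free space left in that buffer.
class ReceiveBuffer {
public:
    explicit ReceiveBuffer(std::size_t capacity) : capacity_(capacity), used_(0) {}

    // A segment arrives from the peer. Returns false if it does not fit;
    // a real stack would drop it and rely on the peer to retransmit.
    bool onSegment(std::size_t length) {
        if (length > capacity_ - used_) return false;
        used_ += length;
        return true;
    }

    // The sketch consumes up to `length` buffered bytes.
    void onRead(std::size_t length) {
        used_ -= (length < used_) ? length : used_;
    }

    // Window the receiver should advertise in its next ACK.
    std::size_t window() const { return capacity_ - used_; }

private:
    std::size_t capacity_;
    std::size_t used_;
};
```

Feeding a 2048-byte buffer two 1024-byte segments and then two 1024-byte reads gives windows of 2048, 1024, 0, 1024, 2048, matching the list. A receiver that keeps advertising 2048 while `used_` is non-zero invites segments it has no room for, which would explain the dropped packets.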


Another idea is to have the sketch-level code send a keepalive packet after it has read "significant" data.  In theory, the keepalive would contain updated ACK values, but "invalid" data that would be ignored by the other side (I'm not entirely sure it will process the ACK in that case...)  But I don't see an existing API for sending keepalives, so this would be more involved (and I'm not quite sure how to tell when to try to send it.)
Presumably, this would use W5100.execCmdSn(s, Sock_SEND_KEEP);


pv_bgm

I added a KEEP_ALIVE call every 20 ms whenever no data had been received.  I am attaching the tcpdump file.

As you can see, initially it seems to advertise decreasing window sizes. However, the delay of 200ms still remains, after which it advertises 2048 again. And in the later part it doesn't even do that: it always advertises 2048.

The file was not saved, however. There was a timeout error on the application side (i.e. no data for more than 30 sec).


pv_bgm

Subsequent to my earlier attempts, I used an ENC28J60 card with the UIPEthernet library.  This is a still cheaper card, but with only basic functionality, so the TCP/IP stack runs on the Arduino.

TCP:

With some tuning, I could get 20-25KBytes/sec receive performance on UIPEthernet's TCP stack. The processing of each packet takes about 4-5ms of CPU time.  And delay is almost always due to TCP protocol interaction i.e. how acknowledgement packets are sent. For example, there is no delayed ack.

UDP:

When I used UDP from UIPEthernet, with appropriate tuning, I could get close to 90-100KBytes/sec. I used a TFTP server for this, with one ack packet for every 512-byte packet.  (I used a Mega at 16MHz.)  So I very much recommend using TFTP for large transfers on Arduino.
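For a feel of what lock-step TFTP means for a whole file, here is a small sketch. It assumes the standard 512-byte TFTP block size and a per-block time of about 5 ms, a figure inferred here from 512 bytes at roughly 100 KBytes/sec rather than measured. The names are illustrative.

```cpp
#include <cstdint>

constexpr std::uint32_t kTftpBlockSize = 512;  // standard TFTP block size

// Number of DATA packets for a file: every full block, plus one final
// short (possibly empty) block that signals the end of the transfer.
constexpr std::uint32_t tftpBlockCount(std::uint32_t fileBytes) {
    return fileBytes / kTftpBlockSize + 1;
}

// Transfer time in seconds if each DATA/ACK exchange takes `perBlockMs`.
// Assumption: no retransmissions and a constant per-block time.
constexpr double tftpTransferSeconds(std::uint32_t fileBytes, double perBlockMs) {
    return tftpBlockCount(fileBytes) * perBlockMs / 1000.0;
}
```

A 450 KByte file (460,800 bytes) needs 901 DATA packets, so at about 5 ms per exchange it would move in roughly 4.5 seconds, compared with the 45 seconds seen over the W5100's TCP.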


