Dealing with Lost Arduino Ethernet Connectivity/Stuck Sockets

For some time my Arduino webserver application has been having internet connectivity issues.

Basically after a while the Arduino application stops operating as a webserver and ignores (or cannot receive/recognise) incoming URL web page requests. The application continues to run - it is just that its calls to EthernetServer.available() never return new EthernetClient connections (sockets).

I have previously discussed this problem over here http://forum.arduino.cc/index.php?topic=239403.msg because sometimes the lost ethernet connectivity was preventing my daily UDP NTP automatic time reset.

I have researched this quite a bit on the world wide web and found a few discussion forums - but nothing definitive that solves the problem comprehensively for everyone. I don’t promise that here either.

Eventually my research led me to question the four internet sockets on the W5100 chip of my Freetronics Ethermega card. Was it possible that the sockets were getting permanently locked and lost to my application in some way?

I found some code to dump the status of each of the sockets and embedded my own version of the procedure into my application to dump the socket status information every hour into my application’s SD card daily activity log. Here is my implementation of the status dump function:

void ShowSocketStatus() {

  ActivityWriteSPICSC("ETHERNET SOCKET LIST");
  ActivityWriteSPICSC("#:Status Port Destination DPort");
  ActivityWriteSPICSC("0=avail,14=waiting,17=connected,22=UDP");
  ActivityWriteSPICSC("1C=close wait");
  String l_line = "";
  l_line.reserve(64);
  char l_buffer[10] = "";
  for (uint8_t i = 0; i < MAX_SOCK_NUM; i++) {
    l_line = "#" + String(i);
    uint8_t s = W5100.readSnSR(i); //status
    l_line += ":0x";
    sprintf(l_buffer,"%x",s);
    l_line += l_buffer;
    l_line += " ";
    l_line += String(W5100.readSnPORT(i)); //port
    l_line += " D:";
    uint8_t dip[4];
    W5100.readSnDIPR(i, dip); //IP Address
    for (int j=0; j<4; j++) {
      l_line += int(dip[j]);
      if (j<3) l_line += ".";
    }
    l_line += " (";
    l_line += String(W5100.readSnDPORT(i)); //port on destination
    l_line += ") ";
    if (G_SocketConnectionTimes[i] != 0)
      l_line += TimeToHHMM(G_SocketConnectionTimes[i]);
    //Serial.println(l_line);

    ActivityWriteSPICSC(l_line);
  }
}

By reviewing my application activity logs I was able to observe that the sockets were getting into a permanent “connected” (hex 17) status and never being released. When all four sockets appeared in the log file as “connected” I could no longer access the application via a web browser. (I could not even display the SD card log files with the evidence since that required a web connection - until I restarted my application.)
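
For anyone reading logs like these, the status bytes can be turned into names. The following is a minimal sketch rather than code from the application above: socketStatusName is a hypothetical helper, and the values are the socket status register (Sn_SR) codes as listed in the W5100 datasheet and the library's w5100.h.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the Ethernet library): maps a W5100
// socket status register (Sn_SR) value to a short label for log output.
// It names the values that appear in the logs above, plus two other
// states from the W5100 datasheet; anything else is reported as "other".
const char* socketStatusName(uint8_t status) {
  switch (status) {
    case 0x00: return "closed (available)";
    case 0x13: return "init";
    case 0x14: return "listen (waiting)";
    case 0x17: return "established (connected)";
    case 0x18: return "fin wait";
    case 0x1C: return "close wait";
    case 0x22: return "UDP";
    default:   return "other";
  }
}
```

In the dump function above, the label could be appended to each line next to the hex value.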

Once I found this evidence of the problem I extensively reviewed my code to make sure that every incoming ethernet connection was being correctly terminated via a call to EthernetClient.stop(). I found the odd problem there and the reliability of my application improved - but still I was losing occasional sockets.

I suspected (and still suspect) that there is a bug in the W5100 microcode associated with multiple requests from the same IP address coming in too quickly and not being assigned correctly to unique sockets that are then correctly managed. So I decided to see if I could force the stuck (“connected”) sockets to close and I was pleased to find that there is a socket disconnect function in the W5100 library.

So I set about implementing application functionality to check the status of every web socket every five minutes, to record the time when each socket was observed as “connected” for the first time and to disconnect the sockets after ten minutes. However for testing purposes I am running a seventy minute timeout for “connected” sockets so I can observe socket statuses in my application activity logs between when the stuck connection is first detected and when it is disconnected.
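
The bookkeeping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the application: StuckSocketTracker and its members are hypothetical names, times are plain seconds, and the status values would come from W5100.readSnSR() on the board. It is written as plain C++ so the logic can be tried on a PC.

```cpp
#include <cassert>
#include <cstdint>

const uint8_t kMaxSockets = 4;            // the W5100 has four sockets
const uint8_t kStatusEstablished = 0x17;  // "connected" in the logs

// Hypothetical tracker: remembers when each socket was first seen in the
// connected state and reports sockets that stay there too long.
struct StuckSocketTracker {
  // Time (in seconds) each socket was first seen connected; 0 = not tracked.
  uint32_t firstSeen[kMaxSockets] = {0, 0, 0, 0};

  // Call at every periodic check with the current time and the status of
  // each socket. Returns a bitmask (bit i = socket i) of sockets that were
  // connected at every check for at least timeoutSeconds.
  uint8_t check(uint32_t nowSeconds, const uint8_t status[kMaxSockets],
                uint32_t timeoutSeconds) {
    assert(nowSeconds != 0);  // 0 is reserved as the "not tracked" marker
    uint8_t overdue = 0;
    for (uint8_t i = 0; i < kMaxSockets; i++) {
      if (status[i] != kStatusEstablished) {
        firstSeen[i] = 0;            // socket moved on, so reset its timer
      } else if (firstSeen[i] == 0) {
        firstSeen[i] = nowSeconds;   // first sighting as connected
      } else if (nowSeconds - firstSeen[i] >= timeoutSeconds) {
        overdue |= static_cast<uint8_t>(1u << i);
        firstSeen[i] = 0;            // caller disconnects it; start afresh
      }
    }
    return overdue;
  }
};
```

With a check every 300 seconds and a 4200 second (seventy minute) timeout, a socket first seen connected at 01:52:45 is reported at the 03:02:45 check. A stricter version could also compare the destination IP address and port between checks, so that two different short connections on the same socket are not mistaken for one long one.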

My application ran for more than four days until today when the first stuck “connected” socket was observed at 1:52AM this morning. Here is a portion of my application’s activity log for today:

01:00:00 ETHERNET SOCKET LIST
01:00:00 #:Status Port Destination DPort
01:00:00 0=avail,14=waiting,17=connected,22=UDP
01:00:00 #0:0x0 80 D:130.89.212.77 (56170)
01:00:00 #1:0x0 80 D:130.89.212.77 (56171)
01:00:00 #2:0x14 80 D:130.89.212.77 (56156)
01:00:00 #3:0x0 80 D:89.238.250.188 (59162)
01:00:00 Climate Update
- FREE RAM: 2847
02:00:00 ETHERNET SOCKET LIST
02:00:00 #:Status Port Destination DPort
02:00:00 0=avail,14=waiting,17=connected,22=UDP
02:00:00 #0:0x14 80 D:130.89.212.77 (60803)
02:00:00 #1:0x17 80 D:130.89.212.77 (60802) 01:52
02:00:00 #2:0x0 80 D:207.46.13.108 (11705)
02:00:00 #3:0x0 80 D:89.238.250.188 (59162)
02:00:00 Climate Update
- FREE RAM: 2847
03:00:00 ETHERNET SOCKET LIST
03:00:00 #:Status Port Destination DPort
03:00:00 0=avail,14=waiting,17=connected,22=UDP
03:00:00 #0:0x0 80 D:202.46.48.22 (24018)
03:00:00 #1:0x17 80 D:130.89.212.77 (60802) 01:52
03:00:00 #2:0x14 80 D:180.76.5.169 (20875)
03:00:00 #3:0x0 80 D:89.238.250.188 (59162)
03:00:00 Climate Update
- FREE RAM: 2847
03:02:45 Socket #1 - Disconnected
04:00:00 ETHERNET SOCKET LIST
04:00:00 #:Status Port Destination DPort
04:00:00 0=avail,14=waiting,17=connected,22=UDP
04:00:00 #0:0x0 80 D:85.212.109.147 (50341)
04:00:00 #1:0x14 80 D:85.212.109.147 (50339)
04:00:00 #2:0x0 80 D:85.212.109.147 (50342)
04:00:00 #3:0x0 80 D:85.212.109.147 (50330)

At 1:00am I had three available sockets and one in a wait status (which seems normal - there is always one socket in that status.)

At 2:00am the socket status list shows socket #1 in a “connected” status and the time this was first observed (1:52) is listed. It is apparently connected to IP address 130.89.212.77 using its destination port 60802.

At 3:00am the socket status list still shows socket #1 in a “connected” status from 1:52am. It is the same IP address and same destination port. Because seventy minutes had not yet elapsed, my application had made no attempt to disconnect socket #1. And the socket must have remained in the “connected” status at every five minute check since 1:52am or my application would have reset the connection timer.

And at 03:02:45am my application issued the W5100 socket disconnection command after the seventy minute timeout. Here is the command from my sketch:

W5100.execCmdSn(l_sock, Sock_DISCON);

And it seems to have worked. The 4:00am socket status list shows socket #1 as no longer connected and having been used by another IP address using a different destination port. By 6:00am socket #1 was in the available status as shown here:

06:27:00 #0:0x14 80 D:157.55.39.27 (12208)
06:27:00 #1:0x0 80 D:218.77.79.43 (39460)
06:27:00 #2:0x0 80 D:85.212.109.147 (50342)
06:27:00 #3:0x0 80 D:85.212.109.147 (50330)
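
The logs above only show that Sock_DISCON on its own was enough in this case. As a cautious variation, the sketch below re-reads the status after the disconnect and falls back to Sock_CLOSE if the socket never closes. The fallback, the SocketChip wrapper and the FakeChip stand-in are all assumptions added for illustration, written as plain C++ so the logic can be tried on a PC.

```cpp
#include <cassert>
#include <cstdint>

const uint8_t kMaxSockets = 4;
const uint8_t kCmdDisconnect = 0x08;  // Sock_DISCON in w5100.h
const uint8_t kCmdClose = 0x10;       // Sock_CLOSE in w5100.h
const uint8_t kStatusClosed = 0x00;   // SnSR::CLOSED

// Hypothetical wrapper around the chip. On the board, readStatus() would
// call W5100.readSnSR(sock) and command() would call
// W5100.execCmdSn(sock, cmd).
struct SocketChip {
  virtual uint8_t readStatus(uint8_t sock) = 0;
  virtual void command(uint8_t sock, uint8_t cmd) = 0;
  virtual ~SocketChip() {}
};

// Requests an orderly disconnect, re-reads the status up to maxPolls
// times (a real sketch would pause briefly between reads), and forces
// the socket closed if it never reaches CLOSED.
// Returns true if the orderly disconnect was enough.
bool releaseSocket(SocketChip& chip, uint8_t sock, int maxPolls) {
  assert(sock < kMaxSockets);
  chip.command(sock, kCmdDisconnect);
  for (int i = 0; i < maxPolls; i++) {
    if (chip.readStatus(sock) == kStatusClosed) return true;
  }
  chip.command(sock, kCmdClose);
  return false;
}

// Stand-in chip for trying the logic on a PC: the socket reports
// "connected" (0x17) until it has been polled pollsUntilClosed times,
// and reports closed at once after a forced close.
struct FakeChip : SocketChip {
  int pollsUntilClosed;
  int polls;
  bool forced;
  explicit FakeChip(int n) : pollsUntilClosed(n), polls(0), forced(false) {}
  uint8_t readStatus(uint8_t) override {
    if (forced) return kStatusClosed;
    polls++;
    return polls >= pollsUntilClosed ? kStatusClosed : 0x17;
  }
  void command(uint8_t, uint8_t cmd) override {
    if (cmd == kCmdClose) forced = true;
  }
};
```

Logging which of the two paths was taken would show over time whether the forced close is ever needed.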

So if anyone else is still having problems with lost Arduino ethernet connectivity I suggest you start checking the socket status periodically within your application. If you see evidence of stuck “connected” sockets see if you can use the above socket disconnection command to solve the problem.

It is early days for me - my application has only been running for four days and has only had to deal with one stuck socket. I will let it run continuously for about a month to see if my solution correctly deals with other stuck sockets and allows the application to run without error for a full month.

If anyone wants to chase or test this solution I am happy to publish other code fragments from my current solution.

Cheers

Catweazle NZ

I can usually crash server code using PuTTY. If the client does not send a blank line (double CR/LF), the normal example server code will lock up.

zoomkat’s server code is the exception. It does not read the entire header, just the first line, so not sending the blank line does not crash it. I need to use the Arduino as a client to crash his code. I just send a few characters without a CR/LF. You can’t do that with PuTTY. It always sends the CR/LF. Neither of those attempts crash mine. It is more fault tolerant than that.

I have not checked for a timeout if a client connects and doesn’t send anything. The library doesn’t respond to this if the client doesn’t send anything, and it could “disable” the socket.
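
For illustration, here is a minimal sketch of the kind of guard that avoids the lock-up when the blank line never arrives. It is not the actual code of any poster in this thread. readHeader, HeaderResult, FakeClient and FakeClock are hypothetical names; the client only needs connected(), available() and read() in the style of EthernetClient, and the clock stands in for millis(). It is plain C++ so it can be tried on a PC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum HeaderResult { HEADER_COMPLETE, HEADER_TIMEOUT, HEADER_TOO_LONG, HEADER_CLOSED };

// Reads a request header into buf until the blank line (CR/LF CR/LF) that
// ends it. Gives up if the client is silent for idleTimeoutMs or if the
// header does not fit in buf, so a client that never sends the blank line
// cannot hold the loop forever.
template <typename Client, typename Clock>
HeaderResult readHeader(Client& client, Clock nowMs, uint32_t idleTimeoutMs,
                        char* buf, size_t bufSize) {
  assert(bufSize >= 1);
  size_t n = 0;
  buf[0] = '\0';
  uint32_t lastActivity = nowMs();
  while (client.connected()) {
    if (client.available() > 0) {
      if (n + 1 >= bufSize) return HEADER_TOO_LONG;
      buf[n++] = static_cast<char>(client.read());
      buf[n] = '\0';
      lastActivity = nowMs();
      if (n >= 4 && std::memcmp(buf + n - 4, "\r\n\r\n", 4) == 0) {
        return HEADER_COMPLETE;
      }
    } else if (nowMs() - lastActivity >= idleTimeoutMs) {
      return HEADER_TIMEOUT;
    }
  }
  return HEADER_CLOSED;
}

// Stand-in for EthernetClient: serves a fixed string, then stays
// "connected" and silent, like a client that never finishes its request.
struct FakeClient {
  const char* data;
  size_t pos;
  bool connected() const { return true; }
  int available() const { return static_cast<int>(std::strlen(data) - pos); }
  int read() { return static_cast<unsigned char>(data[pos++]); }
};

// Stand-in for millis(): advances 10 ms every time it is read.
struct FakeClock {
  uint32_t t;
  uint32_t operator()() { t += 10; return t; }
};
```

On any result other than HEADER_COMPLETE the server would send an error (or nothing) and call stop() on the client, so the socket is released either way.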

SurferTim

In a round about way you are confirming my suspicions. I suspect that when an IP address sends multiple http requests, particularly rapid and/or incomplete requests the W5100 microcode is getting confused and sockets are stuck in a connected status.

For example what happens when a user presses their browser refresh key rapidly several times? Does every refresh result in a complete http request across all browsers or do later refreshes overrun the earlier ones leaving them in an incomplete state and locking up the W5100 sockets in a connected status?

Likewise a web crawler could operate in such a way (by design or by error) as to send incomplete http requests that cause sockets to lock up in a connected status.

This would seem to be a way to attack an Arduino W5100 card for those who want to launch such attacks. For me the net effect of losing all the sockets to locked connected statuses would be to prevent me from accessing the application remotely and from opening my garage door from just outside when I pull up in the car.

I have not had a socket lock since the last one referred to here. My application is still running along very well after nearly nine days now.

Regards

Catweazle NZ

I tested the "connect and not send anything" scenario and it times out after a few seconds, freeing up the socket. I tried connecting, not sending anything, and breaking the connection to ensure it was not the client disconnecting, and it also timed out. All sockets still available except the socket listening for the client connection.

After eleven and a half days of uptime my application got another locked (permanently connected) ethernet socket.

My system identified it, listed it in the application log and disconnected it after 70 minutes. Subsequent port status listings show the port has been used by other IP addresses.

Interestingly the locked connected socket was caused by myself. I logged into my application from a server in the Netherlands (I am in New Zealand) using a remote desktop connection. I only displayed a couple of web pages and closed the browser (sometime later). I did nothing out of the normal so it seems the occurrence of these locked "connected" sockets is random and does not require a particular set of irregular steps to occur.

I am still going to allow my application to run for a month to observe any other locked connections that come along and confirm that my application is dealing with them satisfactorily. Then I will no doubt start work on the next tranche of application development.

Regards

Catweazle NZ

Maybe if you posted your code I could help you. I haven't had my server code lock up a socket in a long time, and I try to get it to lock up.

Well I did not let my system run for a month. I applied more updates last weekend to provide an online realtime display of the W5100 ethernet sockets on an application web page.

During the week since I got a number of stuck connected sockets that I was able to generate myself by remotely connecting to my application in New Zealand using a remote desktop connection from the Netherlands. However I was not able to identify an exact sequence of steps to reproduce stuck connected sockets. Fortunately my stuck socket disconnection functionality worked just fine in every case.

Yesterday I implemented more functionality to record full HTTP header data in my application logs as well as the socket number used for each HTTP request connection. With this I was trying to trap the specific HTTP requests responsible for specific stuck connected ethernet sockets. I got the functionality working and then left for my evening work.

While at work I discovered that my application was again not serving internet HTTP requests. I suspected four stuck sockets and a problem with my latest tranche of application development. But it was not to be. After returning home and trying many things my conclusion is that my Freetronics Ethermega card's ethernet connectivity (W5100 chip etc) has failed. The ethernet lights on the ethernet plug just don't light up - my application's logs are reporting failed ethernet connectivity.

Occasionally (e.g. after a power down and a fiddle with things here and there) I can get the ethernet connectivity to work OK for a few seconds - but then the LEDs on the ethernet plug go out permanently again.

My application is still running - I have just lost my user interface and I cannot get any data from it, control any switches, etc. I can power the app down, remove the SD card and extract any of the logged data that is still being recorded by the running app.

I have tested the ethernet ports of my modem/router - I can still access the internet via a cable on my PC - just not with my Ethermega card.

So tomorrow, Monday, I will order a new Ethermega card, wait a few days for arrival and swap out my failed card for a new one. Maybe my stuck connected socket problem will also go away.

Cheers

Catweazle NZ

Hi, I have the same problem with the Mega2560 R3 and the Ethernet shield. After some hours or 2-3 days the ethernet connection hangs...

I now check the status of the 4 sockets every minute, and after a timeout of 5 minutes I reset the hanging socket. I also write a log file to the SD card for debugging.

It has now worked for over 1 week without any problem. Thank you for your nice work.

paulinchen (sorry for my poor English)

paulinchen

Please provide more information about this if you can including your experiences going forward with stuck connected sockets and whether disconnecting the sockets within your software is always a solution.

There may be an intermittent problem with the W5100 ethernet chip and we need more evidence.

Regards

Catweazle NZ

All

I swapped out my Freetronics Ethermega card last night and my application's web site is back up at http://www.2wg.co.nz.

My latest enhancements to record full HTTP header information and internet socket numbers in my application's SD card log files seem to be working just fine. But the log files are going to be big - helped along by the Google, Bing, Baidu, et al. web crawlers.

Today I have created two stuck sockets using a remote desktop connection from the Netherlands back to New Zealand. They were for two different pages so I don't think there is anything that is page specific.

I need to also log the client destination port for these connections to be sure I am correctly matching up the sockets that get stuck and need to be disconnected with the correct HTTP request.

But the fact that I am already getting stuck sockets for a second (later revision) Freetronics Ethermega card suggests that the recent hardware failure of the first card was unrelated.

I will continue on in this area while further developing my application and report back on this issue from time to time.

Regards

Catweazle NZ

All

I am now logging data like this for every HTTP request received, so I should be able to match up stuck sockets with HTTP requests and possibly determine any common factors for stuck sockets:

22:05:31 ** HTML REQUEST **
- IP: 192.168.1.55
- Socket #: 1
- Dest Port: 50095
- GET / HTTP/1.1
- Host: 192.168.1.177
- [HOST 192.168.1.177]
- [PAGE Dashboard]
- User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Accept-Language: en-nz,th;q=0.5
- Accept-Encoding: gzip, deflate
- DNT: 1
- Referer: http://192.168.1.177/Menu/
- Connection: keep-alive

It may be an intermittent problem related to the distance from New Zealand of each request. I do not get stuck sockets for my LAN and internet connections within New Zealand - but I can get them frequently when I log in from the Netherlands using remote desktop. It may also be a browser issue since I use Firefox and Safari locally and Internet Explorer from the Netherlands.

More testing to come ...

Catweazle NZ

All

Just a final update on this:

I tidied up my HTTP request logging within my Arduino web application. Here is an example:

03:40:02 ** HTML REQUEST **
- GET /13285/ HTTP/1.1
- Host: 2wg.co.nz
- Connection: Keep-alive
- Accept: */*
- From: googlebot(at)googlebot.com
- User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Accept-Encoding: gzip,deflate
- If-Modified-Since: Sun, 03 Aug 2014 20:42:29 GMT
-
- << CONNECTION >>
- IP: 66.249.73.92
- Socket #: 1
- Dest Port: 46386
- << PARSE RESULT >>
- HOST 2WG.CO.NZ
- PAGE Settings

For every connection I capture the time, IP address, socket number and destination port. These values uniquely identify every connection. IP address, socket number and destination port are captured using customisation to the EthernetClient class/unit/sketch.

I then compare this information to my system's ethernet socket disconnections - any active connection is disconnected and logged after ten minutes. Here is a disconnection example:

09:07:32 ETHERNET SOCKET LIST
09:07:32 #:Status Port Destination DPort
09:07:32 0=avail,14=waiting,17=connected,22=UDP
09:07:32 1C=close wait
09:07:32 #0:0x17 80 D:5.153.43.3 (60680) 08:57
09:07:32 #1:0x14 80 D:5.153.43.3 (60718)
09:07:32 #2:0x0 80 D:202.46.48.29 (18776)
09:07:32 #3:0x0 80 D:119.63.193.196 (56563)
09:07:32 Socket #0 - Disconnected
09:07:32 Email SOCKET DISCONNECTION
09:07:36 Email OK

With all this in place I thought I would know exactly which HTTP requests result in stuck ethernet sockets that need to be disconnected. But alas it was not so. The stuck ethernet connections that I suffer from never pass through to my application via EthernetServer.available(). They are stuck before my application gets them, and as a result my application never processes them and never records their details in my application log files.

Then I found out something about EthernetServer.available() for which here is the source code:

EthernetClient EthernetServer::available()
{
  accept();

  for (int sock = 0; sock < MAX_SOCK_NUM; sock++) {
    EthernetClient client(sock);
    if (EthernetClass::_server_port[sock] == _port &&
        (client.status() == SnSR::ESTABLISHED ||
         client.status() == SnSR::CLOSE_WAIT)) {
      if (client.available()) {
        // XXX: don't always pick the lowest numbered socket.
        return client;
      }
    }
  }

  return EthernetClient(MAX_SOCK_NUM);
}

It seems that EthernetServer.available() will not return ethernet clients for which there is no available data. So if an ethernet connection is established but no data is ever sent from the remote browser it seems that the connected socket will stay stuck forever. This subject is discussed in various locations across the world wide web.
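
The gap can be expressed as a small model of the test in the source code quoted above. This is an illustration only: the two function names are hypothetical, the server port comparison is left out for brevity, and rxBytesWaiting stands for whatever client.available() would report for that socket.

```cpp
#include <cstdint>

const uint8_t kEstablished = 0x17;  // SnSR::ESTABLISHED
const uint8_t kCloseWait = 0x1C;    // SnSR::CLOSE_WAIT

// Model of the condition inside EthernetServer::available(): a socket is
// handed to the sketch only if it is connected (or half closed) AND has
// received bytes waiting to be read.
bool returnedByAvailable(uint8_t status, uint16_t rxBytesWaiting) {
  return (status == kEstablished || status == kCloseWait) && rxBytesWaiting > 0;
}

// The gap: connected with nothing received, so the sketch never sees the
// socket and never gets the chance to call stop() on it.
bool invisibleToSketch(uint8_t status, uint16_t rxBytesWaiting) {
  return status == kEstablished && rxBytesWaiting == 0;
}
```

A socket for which invisibleToSketch() stays true across several periodic checks is exactly the case that a status based timeout has to catch, because no code path through EthernetServer.available() will ever reach it.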

Anyway, as far as my application is concerned I am using my stuck socket disconnection functionality and as a result I do not expect my application will ever again lose connectivity to the world wide web because of stuck sockets. I still don’t know exactly what the problem is and I won’t bother testing EthernetServer.available() with the " if (client.available()) {" line removed.

I am now moving forward with my application’s other functionality and won’t be discussing this subject further.

Cheers for now.

CatweazleNZ

p.s. CWZ’s Home Automation can be found at http://www.2wg.co.nz