Ethernet client connect failure

I'm hoping that someone is familiar enough with the Ethernet library and the W5100 chip to suggest some reasons why this may be happening. I'm having failures with the connect method. There are four places where connect() returns a failure indication (0). I changed the code to return unique values and found that the failing one is the return I've marked with a comment below. I'm hoping to understand what conditions might lead to this particular return being taken.

Some background: The project sends data once per minute (to ThingSpeak), which consists of connecting, posting the data, waiting for the server response, and disconnecting. It runs flawlessly for days and weeks on end at my house. I made a copy for my brother-in-law. At his house it might run a few hours, or maybe a day or so, then at some point the connect starts failing and continues to do so until the MCU (and the W5100) is reset.

I've swapped all the hardware around and have convinced myself that there are no hardware issues.

We both have AT&T as our ISP, but he has U-Verse with the AT&T-supplied 2Wire modem/router. Mine is the "regular" DSL: I just have a Motorola DSL modem that AT&T supplied, and I use my own router (D-Link).

Any help is appreciated!

PS: The project also uses UDP (for NTP time synchronization), and this continues to operate.

uint8_t Client::connect() {
  if (_sock != MAX_SOCK_NUM)
    return 0;

  for (int i = 0; i < MAX_SOCK_NUM; i++) {
    uint8_t s = W5100.readSnSR(i);
    if (s == SnSR::CLOSED || s == SnSR::FIN_WAIT) {
      _sock = i;
      break;
    }
  }

  if (_sock == MAX_SOCK_NUM)
    return 0;                //  <--- TAKING THIS RETURN

  _srcport++;
  if (_srcport == 0) _srcport = 1024;
  socket(_sock, SnMR::TCP, _srcport, 0);

  if (!::connect(_sock, _ip, _port)) {
    _sock = MAX_SOCK_NUM;
    return 0;
  }

  while (status() != SnSR::ESTABLISHED) {
    delay(1);
    if (status() == SnSR::CLOSED) {
      _sock = MAX_SOCK_NUM;
      return 0;
    }
  }

  return 1;
}

There is a reported (compiler?) error in the reading of 16-bit socket registers. It's reportedly fixed in the upcoming 1.0.1.

See: Google Code Archive - Long-term storage for Google Code Project Hosting.

You can insert the fix at line 253 of w5100.h

Thanks, John.

Is that a Linux-specific issue? For this project, I'm using 0022 on Windows.

I can't tell; would you know whether it's the sort of thing that would be sensitive to network differences? (That's my operating theory, since the thing works fine at my place, but not at my brother-in-law's.)

Is that a Linux-specific issue?

No, it is not. It is SurferTim's fix, and he really seems to know what he is talking about.

Thanks, PaulS.

There are two symptoms of this 605 bug. The first you might notice is that the DHCP function Ethernet.begin(mac); will not return. If you assign the IP address manually, that will work, but just here. See the second symptom below.

The second symptom is the client.available() function. It will return the incorrect value, causing the code to read data that doesn't exist, over and over forever. Here is an example of what it returns. Normally it is just garbage.
http://arduino.cc/forum/index.php/topic,105228.0.html

OK, thanks. Yes I have noticed that about SurferTim :slight_smile:

Doesn't match my symptoms. I'm using Arduino 0022, so no DHCP, and have not seen any garbage coming back or any reason to suspect client.available().

Looking at the Client.cpp module some more (and I don't begin to understand it all), I'm thinking that second error return is because it looked for an available socket to make the connection and didn't find one. Can you verify that?

Not sure why I'd see that on one network and not another, but looking at my code, there could be a scenario where I miss doing client.stop() and therefore all the sockets get used up. Might be a timing thing. I need to test and validate that.

Hi Jack! I was clarifying the symptoms for exactly this reason. If those are not the symptoms, then it is something else.

I would suspect connections are not being closed, or are being closed and the code is not picking up on that.
That is normally the cause of the MAX_SOCK_NUM fail.
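To make the MAX_SOCK_NUM failure concrete, here is a simplified off-device model of the socket search loop from Client::connect(). The SocketTable type and both function names are hypothetical, and the numeric status values are what I understand the W5100 datasheet to list for Sn_SR. It only models sockets left in CLOSE_WAIT (server closed, client never did), not everything the library does.

```cpp
#include <array>
#include <cstdint>

// Sn_SR status values (assumed from the W5100 datasheet).
constexpr uint8_t SOCK_CLOSED     = 0x00;
constexpr uint8_t SOCK_FIN_WAIT   = 0x18;
constexpr uint8_t SOCK_CLOSE_WAIT = 0x1C;
constexpr int MAX_SOCK_NUM = 4;

// Hypothetical stand-in for the chip's four socket status registers.
using SocketTable = std::array<uint8_t, MAX_SOCK_NUM>;

// Same search as in Client::connect(): first socket that is CLOSED or FIN_WAIT.
// Returns MAX_SOCK_NUM when none qualifies, which is the failing return.
int findFreeSocket(const SocketTable& table) {
    for (int i = 0; i < MAX_SOCK_NUM; i++) {
        if (table[i] == SOCK_CLOSED || table[i] == SOCK_FIN_WAIT) {
            return i;
        }
    }
    return MAX_SOCK_NUM;
}

// Counts how many connects succeed if every socket is abandoned in
// CLOSE_WAIT instead of being closed by the client.
int connectsUntilExhausted() {
    SocketTable table{};              // zero-initialized: all SOCK_CLOSED
    int succeeded = 0;
    for (;;) {
        int s = findFreeSocket(table);
        if (s == MAX_SOCK_NUM) {
            return succeeded;         // no socket left: connect() would return 0
        }
        table[s] = SOCK_CLOSE_WAIT;   // leaked: never returned to CLOSED
        succeeded++;
    }
}
```

In this model the fifth connect is the first to fail, which would fit a sketch that runs for a while and then fails on every attempt until the chip is reset.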

To properly close the connection, you must first empty the rx buffer. The server will not close the connection until your code does that. Don't just close the connection after you empty the buffer either. There may be another packet on the way. The server will close the connection on its end, signalling it is finished sending packets.

When you detect the server has closed the connection, you must close your end with client.stop(). That notifies the server you received everything, and closes your socket.

while(client.connected()) {
   while(client.available())   {
      Serial.write(client.read());
   }
}
client.stop();

Hi Tim, great, that's exactly what I was hoping to verify. Thanks also for outlining the proper approach. That was my understanding, but very good to have it verified as well. I may have discovered a hole in the code that may be timing related (i.e. how fast the close comes from the server) so I'm off to check that out next chance I get.

Thanks again, much appreciated. Will report back.

I may have discovered a hole in the code that may be timing related (i.e. how fast the close comes from the server) so I'm off to check that out next chance I get.

Let me know what you find. There should be no timing related problems with the example I posted above. The server should be able to close the connection immediately with no problem.

edit: There is one little bug if the connection breaks (not closed by the server). It will lock up in the "while(client.connected())" loop. Here is the thread (and bug patch) on that:

This link has a telnet-type sketch that has a timeout capability to close the connection properly if the connection breaks.
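For illustration only (this is not the linked sketch), a read loop with an inactivity timeout might look roughly like the following. The function name drainWithTimeout is hypothetical, and FakeClient and FakeClock are stand-ins so the logic can be exercised off the board; the client is only assumed to offer connected(), available(), read() and stop().

```cpp
// Hypothetical stand-in for the Ethernet client, for off-device testing.
struct FakeClient {
    int bytesLeft;      // bytes still waiting in the rx buffer
    bool serverCloses;  // true if the server closes once the buffer is drained
    bool stopped;
    FakeClient(int n, bool closes) : bytesLeft(n), serverCloses(closes), stopped(false) {}
    bool connected() const { return !stopped && (bytesLeft > 0 || !serverCloses); }
    int available() const { return bytesLeft; }
    int read() { if (bytesLeft <= 0) return -1; bytesLeft--; return 'x'; }
    void stop() { stopped = true; }
};

// Hypothetical clock that advances one "millisecond" per call.
struct FakeClock {
    unsigned long t;
    FakeClock() : t(0) {}
    unsigned long operator()() { return t++; }
};

// Drains the rx buffer until the server closes the connection, or gives up
// after timeoutMs without any received data. Always calls stop() at the end.
// Returns true if the server closed normally, false on timeout.
template <typename ClientT, typename ClockT>
bool drainWithTimeout(ClientT& client, ClockT nowMs, unsigned long timeoutMs) {
    unsigned long lastActivity = nowMs();
    bool timedOut = false;
    while (client.connected()) {
        while (client.available()) {
            client.read();            // a real sketch would process the byte here
            lastActivity = nowMs();   // data arrived, restart the timer
        }
        if (nowMs() - lastActivity >= timeoutMs) {
            timedOut = true;          // connection looks broken, stop waiting
            break;
        }
    }
    client.stop();                    // release the socket in either case
    return !timedOut;
}
```

On the Arduino the idea would be to pass the real client and millis in place of the two stand-ins, with a timeout of a few seconds.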

Sure will. My code is maybe a bit more tangled up than the example, but I have a fix in mind, should be straightforward. It's looking a lot like the user shot himself in the foot :blush: XD

It's looking a lot like the user shot himself in the foot

C not only lets you do that, it loads the gun and aims for you!

You say that like it's a bad thing! XD

That was not a shot in the foot. From my point of view, you just barely missed it. :slight_smile:

Just a tip. When you modify that, try not to dally in the "while(client.available())" loop. If you wait until the rx buffer is empty, you can process that packet with the CPU while the w5100 gets another packet.

Thanks to everyone for the feedback. I've deployed modified code both at my house and the bro-in-law's. If they both run for at least three days, I'll consider it fixed. I'll report back, hopefully not until the first of the week!

SurferTim, thanks again, I do have the "while(client.available())" loop pretty tight I think.

Basically, I was picking up the connection status via client.connected() after doing the POST, similar to the PachubeClient example. My theory is that on the Uverse network (being faster?), the server had actually disconnected at that point, and the way I was testing for the disconnect, I would not have recognized it properly because the assumption was that it was still connected. So now I pick up the connection status before the post. Time will tell...

I'm still struggling with this. Hope y'all don't mind looking at my code, maybe spot what I'm doing wrong. I did pare my sketch back to the bare minimum and it still exhibits the same symptom. The code below simply posts a single number once a minute to ThingSpeak.

I've run it at two other locations and both have the same issue. Both locations have 2Wire routers supplied by AT&T. At my house it works fine, I have a D-Link router. So it seems something is different but I don't know what. I hesitate to blame the routers, because obviously PCs etc. work OK.

I'm starting to fiddle with resetting the WIZ811MJ when a failure occurs. That's probably code I will retain anyway, but I really want to understand what's going on, else I'm just treating the symptom.

Thanks again!

#include <Ethernet.h>
#include <SPI.h>         
#include <Streaming.h>

#define WIZ811MJ_RESET 9       //to WIZ811MJ reset pin
#define HB_LED A2              //heartbeat LED
#define SYNC_LED A3            //illuminated when data is sent, extinguished when response received
#define TX_INTERVAL 60         //seconds between data transmissions
#define HB_INTERVAL 1000       //blink interval for heartbeat LED, ms

byte mac[6] = {2, 0, 192, 168, 0, 202};
byte ip[4] = {192, 168, 0, 202};
byte gateway[4] = {192, 168, 0, 1};
byte server[4] = {184, 106, 153, 149};    //ThingSpeak API IP Address
Client client(server, 80);                //http client

unsigned long ms;              //current time from millis()
unsigned long msLastSend;      //ms when last data was sent
boolean lastConnected = false;
char apiKey[] = "16-byte-api-key.";    //Thingspeak API key

//Store the larger parts of the fixed post text in progmem.
PROGMEM prog_char post0[] = "POST /update HTTP/1.1\nHost: api.thingspeak.com\nX-THINGSPEAKAPIKEY: ";
PROGMEM prog_char post1[] = "\nContent-Type: application/x-www-form-urlencoded\nContent-Length: ";
PROGMEM prog_char post2[] = "\nConnection: close\n\n";
PROGMEM char *post[] = {post0, post1, post2};

void setup(void) 
{
    pinMode(HB_LED, OUTPUT);
    pinMode(SYNC_LED, OUTPUT);
    pinMode(WIZ811MJ_RESET, OUTPUT);

    digitalWrite(WIZ811MJ_RESET, LOW);    //reset the WIZ811MJ
    delay(1000);
    digitalWrite(WIZ811MJ_RESET, HIGH);
    delay(3000);

    Ethernet.begin(mac, ip, gateway);
    Serial.begin(115200);
    delay(3000);
    Serial << _DEC(millis()) << " Starting" << endl;
}

void loop(void)
{
    static long count;
    
    ms = millis();
    heartBeat();
    readServerResponse();
    if (msLastSend == 0 || ms - msLastSend >= TX_INTERVAL * 1000UL) {
        msLastSend = ms;
        sendData(++count);
    }
}

void sendData(long data)
{
    char cData[8];
    int dataLen;
    uint8_t connectStatus;
    
    digitalWrite(SYNC_LED, HIGH);
    Serial << endl << _DEC(millis()) << " SEND" << endl;
    ltoa(data, cData, 10);
    dataLen = strlen(cData) + 7;    //7 for "field1="
        
    if (!client.connected()) {
        if ( (connectStatus = client.connect()) == 0 ) {    //0 = success here, assuming the modified connect() that returns unique nonzero failure codes
            lastConnected = true;
            Serial << _DEC(millis()) << " CONNECTED" << endl;    //connected to server
            pmToClient(post[0]);
            client << apiKey;
            pmToClient(post[1]);
            client << _DEC(dataLen);
            pmToClient(post[2]);
            client << "field1=" << cData << '\n';
            Serial << _DEC(millis()) << " POST: " << _DEC(data) << endl;
        } 
        else {
            Serial << _DEC(millis()) << " CONNECT FAIL: " << _DEC(connectStatus) << endl;
        }
    }
    else {
        Serial << _DEC(millis()) << " WAIT FOR DISCONNECT " << endl;
    }
}

void readServerResponse(void)
{
    char c;
    boolean connected;
    
    if (client.available()) {
        Serial << _DEC(millis()) << " SERVER RESP" << endl;
        while (client.available()) {
            c = client.read();
            Serial << c;
        }
        digitalWrite(SYNC_LED, LOW);
        Serial << endl;
    }

    connected = client.connected();
    if (!connected && lastConnected) {
        Serial << _DEC(millis()) << " DISCONNECT" << endl << endl;
        client.stop();
    }
    
    if (connected != lastConnected) {
       Serial << _DEC(millis()) << " Connected=" << _DEC(connected) << endl;
       lastConnected = connected;
    }
}

//Send a progmem string to Ethernet client
void pmToClient(char *s)
{
    while (char c = pgm_read_byte_near(s++)) {
       client << c;
    } 
}

void heartBeat(void)
{
    static boolean ledState;
    static unsigned long msHB;            //ms when HB_LED state was changed
    
    if (ms - msHB >= HB_INTERVAL) {
        msHB = ms;
        digitalWrite(HB_LED, ledState = !ledState);
    }
}

Have you tried running a sniffer (such as WireShark) on the same network (the one where you get the failures), looking only at packets from or to your Arduino box (so as not to log too many packets)?

I can imagine many reasons for hanging connections (summing up to your problem) on DSL lines, starting with low-timeout stateful inspection and probably ending in changing MTUs. Many of them are more or less well handled by today's PC operating systems, but simple implementations like the WizNet chips may have problems with them.

Once you have the capture, try to find the SYN packets and their corresponding FIN packets. My guess is that you will have some SYNs left over at the end. If this is the case, compare the last successful connection (both SYN and FIN) with the first failing connection. You may see timing issues (server side).

@pylon, thanks for the reply and ideas. I did just download WireShark the other day, I'm still trying to figure it out, TCP/IP internals not being my strong suit :smiley: Sounds like that's a good tree to bark up, though, we'll see what we can see!

Use a capture filter (Capture -> Options) of:

host 192.168.1.65

given that 192.168.1.65 is the IP of your Arduino. This way you get all packets originating from or destined for your Arduino. You may have to read a bit about TCP/IP though to be able to analyze the output you get :slight_smile:

I don't know exactly how this will send, but it appears to be sending one character per packet.

void pmToClient(char *s)
{
    while (char c = pgm_read_byte_near(s++)) {
       client << c;
    } 
}

How many character/packets are you sending?
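If each single-character write really does go out on its own, one possible way around it is to collect the text in a small RAM buffer and hand it to the client in chunks. The sketch below is an off-device illustration: CountingSink and sendChunked are hypothetical names, the 32-byte chunk size is arbitrary, and on the AVR the source byte would come from pgm_read_byte_near(s + i) rather than s[i].

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the Ethernet client: records data and counts writes.
struct CountingSink {
    std::string data;
    int calls = 0;
    void write(const char* buf, std::size_t len) {
        data.append(buf, len);
        calls++;
    }
};

// Copies a NUL-terminated string into a small buffer and passes it to the
// sink in chunks of up to 32 bytes, instead of one write per character.
template <typename SinkT>
void sendChunked(SinkT& sink, const char* s) {
    char buf[32];
    std::size_t n = 0;
    for (std::size_t i = 0; s[i] != '\0'; i++) {
        buf[n++] = s[i];              // on AVR: pgm_read_byte_near(s + i)
        if (n == sizeof buf) {        // buffer full, flush it in one write
            sink.write(buf, n);
            n = 0;
        }
    }
    if (n > 0) {
        sink.write(buf, n);           // flush the remainder
    }
}
```

With the first two header lines of the post (47 characters) this makes two write calls instead of 47. Whether that maps one-to-one onto packets on the wire is something a capture would have to confirm.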

A typical post would be like below, 179 characters or so if I counted right. Previously I had all the text for the post coded as character string literals, I moved them to progmem to save RAM. I retained it that way here, although it's not much of a concern with this pared-down sketch.

I was seeing the same behavior prior to moving the text to progmem, so that had no effect that I can tell. Thanks!

POST /update HTTP/1.1
Host: api.thingspeak.com
X-THINGSPEAKAPIKEY: abcdefghijklmnop
Content-Type: application/x-www-form-urlencoded
Content-Length: 9
Connection: close

field1=42