Ethernet stability

Hi all,
An outline of what I have is....
An RF network of RFM12 nodes with 1-wire temperature sensors, plus an RFM12-enabled 4-output mains power switch (for lights or whatever). These communicate with a 'gateway' that collects the data from all the RF nodes and handles commands/responses for the power-switch box.
The gateway links to an Ethermega by RS485.

On the Ethermega, I run a webserver that displays some of the temperatures and lets you turn on/off the 4 power outputs.
Also on the Ethermega, a client logs the temperatures to both Nimbits and Thingspeak (every 60 seconds for Thingspeak, every 90 seconds for Nimbits).
The time is retrieved from an NTP timeserver on startup and at each midnight thereafter.

What I notice is that when the client connects using a name lookup (i.e. cloud.nimbits.com) rather than a fixed IP, the client side of the Ethermega eventually hangs. This can happen after as little as 20 minutes or take as long as 6 hours, but the server side continues to work after the client side stops.

I am currently running the client side (logging) with fixed IP addresses (which is not a long-term solution), and at the same time the Ethermega server is being exercised by an auto-refresh request from Firefox every 10 seconds.
Since moving to fixed IP addresses for the logging, the system has been stable for 24 hours (longest ever!), logging to two sites and serving up a webpage every 10 seconds without a hitch (over 8600 page refreshes).

While trying to investigate this I came across a Wiznet errata document (W3150A+/W5100 Errata Sheet) describing possible errors on the W5100 chip relating to ARP requests. I am very inexperienced at C, and I lack the understanding to figure out whether the fix for the errata has been included in the Ethernet library, or even whether the client failure while using lookups could be caused by this. I'm hoping that someone with a better understanding of C/C++ has looked into this and may be willing to share their experience.

I include the logging code here so perhaps someone may spot something I'm doing wrong. It is only a section of the complete code.

When the client side accumulates 5 successive failures, the whole system is reset by the WDT. I have never had a problem with the ethernet not coming back after a reboot.

Is there any way to check how many W5100 sockets are currently available?

Any help/suggestions gratefully received :slight_smile:

Thanks.

/*
posts to nimbits

Uses the PString.h library.
Instead of making many connections to the server using the usual Arduino ethernet functions
like 'client.print' (every statement is a separate transaction), all the data to send in the 
POST transaction is 'assembled' into one large string using the PString library, 
which then gets sent in one hit. This reduces the IP traffic considerably.
*/



void do_weblog() {
    // if 'postinginterval' secs have passed since
    // your last connection, then connect again and send data:
    if((millis() - lastCloudTime > postingInterval)) {
        line();
        showTimeDate();
        showRunTime();
        sendData();    // the POST happens here
        delay(2);
        }
    }



// this makes a HTTP connection to the logging server:

void sendData() {

byte cloud[] = { 74, 125, 31, 121 }; // cloud.nimbits.com

        str.begin();    // reset the into-string pointer. This is the 'main' string being assembled.
        cont.begin();   // and for the actual content ie payload string

// Create the 'content' of the string to send. It's assembled from the user details, then the sensor data
// 1st part of content is access details
        cont.print("email=XXXXXXXXXXXXXXXX&key=YYYYYYYYYYYYYY");    // store in the string 'cont'

// the sensor data is acquired from the RFM12 gateway that receives all the assorted temp sensor data
// get the data and store it in the 'cont' string. Data is requested from the RFM12 'gateway' using RS485.
        cont.print("&p1=hot-water&v1=");
        GetTemperature(0);        // hot water address
        cont.print(temptemp);    // 
        cont.print("&p2=outside1&v2=");
        GetTemperature(5);        // turning circle sensor address
        cont.print(temptemp);    // 
        cont.print("&p3=highest&v3=");
        GetTemperature(7);        // library sensor address
        cont.print(temptemp);    // 
        cont.print("&p4=freezer1&v4=");
        GetTemperature(6);        // freezer in pantry address
        cont.print(temptemp);    // 
        cont.print("&p5=workshop&v5=");
        GetTemperature(1);        // workshop address
        cont.print(temptemp);    //
        
 // now get the length of the assembled content string. This string is the complete 'content'
        int contlen = (cont.length());
        
 // Got the content assembled so try and connect to the web-site

    EthernetClient Logging_client;      // client for both logging services
    Serial2.println("Attempting to connect to Nimbits...");

    if (Logging_client.connect(cloud, 80)) 
//    if (Logging_client.connect("cloud.nimbits.com", 80)) 
    { 
        str.println("POST /service/batch HTTP/1.1");
        str.println("Host: cloud.nimbits.com");
        str.println("Connection: close");
        str.println("Cache-Control: max-age=0");
        str.print("Content-Length: ");
        str.println(contlen,DEC);
        str.println("Content-Type: application/x-www-form-urlencoded");
        str.println();
        str.print(cont);      // the actual content (data points); print, not println, so the body matches Content-Length

// the total string ('post' headers and content) is sent to the ethernet connection in one hit
        Logging_client.print(str);  // ethernet send to Nimbits

        Serial2.println();        // for debug
        Serial2.print(str);       // this is a copy of whats sent to the ethernet (same string)
        Serial2.println();        // for debug
        Serial2.println();        // for debug

      bufindex = 0;
      content[0] = 0;    // clear any previous response so a stale "200 OK" cannot match

// look for response
    while (Logging_client.connected()) {
      if (Logging_client.available()) {
        char c = Logging_client.read();
        Serial2.print(c);
        if (bufindex < 198) {
          //store characters to string 
         content[bufindex] = c;
         bufindex = bufindex+1;
       }
      }
      content[bufindex] = 0;
     }

    if (strstr(content, "200 OK") != 0) {
       failcount = 0;    // reset after every successful post
   }
    else {
        // connected, but the response did not contain "200 OK"
        failcount = failcount + 1;
         }    
    }
// The above is when the connection succeeds and data is sent/received    



// if the connection fails then here is next
    else {
        // if you couldn't make a connection:
        failcount = failcount + 1;
        Serial2.print("Connection failed ");
        } 
 
 
// and after here for either good or bad connection 
      Logging_client.flush();    // ensure no data left in buffer (wont allow close if present)
      Logging_client.stop();     // and finish the socket

      Serial2.print("failcount ");
      Serial2.println(failcount);

        if (failcount >= 5) {    // reboot after 5 successive failures
          WDTreboot();        // sets up an 8-sec timeout
        }
 
    lastCloudTime = millis();
}
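
A side note on the response check: `strstr(content, "200 OK")` matches that text anywhere in the first 198 bytes, including the body. A stricter option is to read the status code from the status line itself, which also makes it possible to log a 502 differently from a 200. This is only a minimal sketch in plain C++, and `parseHttpStatus` is a hypothetical helper name, not part of the sketch above or of any library.

```cpp
#include <cstring>
#include <cctype>

// Returns the three-digit status code from a response that starts with
// "HTTP/x.y NNN ...", or -1 if the text does not look like a status line.
int parseHttpStatus(const char* response) {
    if (response == nullptr || std::strncmp(response, "HTTP/", 5) != 0) return -1;
    const char* sp = std::strchr(response, ' ');   // space before the code
    if (sp == nullptr) return -1;
    const char* d = sp + 1;
    for (int i = 0; i < 3; ++i) {
        // stops at the first non-digit, so a short string is never over-read
        if (!std::isdigit(static_cast<unsigned char>(d[i]))) return -1;
    }
    return (d[0] - '0') * 100 + (d[1] - '0') * 10 + (d[2] - '0');
}
```

In the posted code this would replace the `strstr` test with something like `parseHttpStatus(content) == 200`.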

Maybe you should look up the IP address with DNS once, then use that IP for a while. I do this with NTP and it works well. Here is the code snippet I use for that.

  IPAddress timeServer;
  DNSClient dns;

  dns.begin(Ethernet.dnsServerIP());
  
  if(dns.getHostByName("pool.ntp.org",timeServer)) {
    Serial.print(F("NTP server ip :"));
    Serial.println(timeServer);
  }
  else Serial.print(F("dns lookup failed"));
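
The "look it up once, then reuse it for a while" idea can be sketched in plain C++ so the logic is visible without hardware. This is an illustration under assumptions, not tested on a board: `CachedHost` and `Resolver` are hypothetical names, the resolver would wrap `dns.getHostByName()` on an Arduino, and `now` would come from `millis()`.

```cpp
#include <cstdint>

// Hypothetical resolver signature: returns true and fills `ip` on success.
typedef bool (*Resolver)(const char* host, uint8_t ip[4]);

struct CachedHost {
    const char* host;      // name to resolve
    uint32_t refreshMs;    // how long a resolved address is trusted
    uint8_t ip[4];         // last good address
    bool valid;            // true once any lookup has succeeded
    uint32_t resolvedAt;   // timestamp of the last good lookup

    CachedHost(const char* h, uint32_t refresh)
        : host(h), refreshMs(refresh), ip{0, 0, 0, 0}, valid(false), resolvedAt(0) {}

    // Returns true if `ip` holds a usable address. A failed refresh keeps
    // the previous address instead of discarding it.
    bool get(uint32_t now, Resolver resolve) {
        // Unsigned subtraction stays correct across a millis() rollover.
        bool stale = !valid || (uint32_t)(now - resolvedAt) >= refreshMs;
        if (stale) {
            uint8_t fresh[4];
            if (resolve(host, fresh)) {
                for (int i = 0; i < 4; ++i) ip[i] = fresh[i];
                valid = true;
                resolvedAt = now;
            }
        }
        return valid;
    }
};
```

The design choice here is that a failed lookup never invalidates an address that worked before, so a flaky DNS server costs nothing until the very first lookup.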

You're referring to errata 2 and 3, which deal with ARP traffic. As ARP traffic is also used in non-DNS requests, I don't think this is the source of your problems. But erratum 1 is a possible source. It describes a race condition where the reception of a UDP datagram occurs almost simultaneously with the sending of a packet. The code of the Ethernet library only checks for the SEND_OK bit in the interrupt register, which may never be set (at least the erratum says that). The recommendation in the errata document is NOT in the Ethernet library yet.

I'm not sure if your problem is related to that erratum, because unfortunately the erratum says nothing about the TIMEOUT interrupt. If the TIMEOUT occurs as specified in the datasheet, the code in the Ethernet library is correct.
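
To make the concern concrete, here is a rough sketch of a send-completion wait that cannot spin forever. It is neither the Ethernet library's code nor the workaround the errata sheet recommends. The two flag values are the SEND_OK and TIMEOUT bits of the W5100 socket interrupt register (Sn_IR); `waitForSend`, `readIR` and `maxPolls` are hypothetical names for illustration.

```cpp
#include <cstdint>

// W5100 Sn_IR (socket interrupt register) flag bits.
const uint8_t IR_SEND_OK = 0x10;
const uint8_t IR_TIMEOUT = 0x08;

enum SendResult { SEND_DONE, SEND_TIMED_OUT, SEND_GAVE_UP };

// readIR is a hypothetical callback returning the current Sn_IR value.
// maxPolls bounds the loop, so a flag that never appears cannot hang the caller.
template <typename ReadIR>
SendResult waitForSend(ReadIR readIR, uint32_t maxPolls) {
    for (uint32_t i = 0; i < maxPolls; ++i) {
        uint8_t ir = readIR();
        if (ir & IR_SEND_OK) return SEND_DONE;       // chip reports the send completed
        if (ir & IR_TIMEOUT) return SEND_TIMED_OUT;  // chip reports its own timeout
    }
    return SEND_GAVE_UP;                             // neither flag ever showed up
}
```

The third outcome is the interesting one: it covers the case where neither SEND_OK nor TIMEOUT is ever raised.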

Thanks for your input :slight_smile:
I have tweaked and adjusted and tried many things but the end result is the same: after one or two days, it locks up.
To keep things running, the watchdog reboots the board and it all seems happy again. It's transparent to the user; the only way I know it has happened is an uptime timer on the web page.
Perhaps I'm just trying to do too much on the one board:
NTP lookup, web server for control/status and client for posting logging data.
I just don't understand C or C++ well enough to dig deep down. In time perhaps.....
In the meantime, I have written a server and client for the Wiznet in assembly language. It's a fraction of the size, but not as comprehensive. DNS and NTP still to do....
Driving the W5100 at register level has been quite educational. I certainly learned a lot about the structure of web transactions.
With the low-level access, it's easier to see what's going wrong. This highlights the lack of useful diagnostic and status info in the Ethernet library, which is basically a go/no-go affair with nothing available to show what's going on in the background, or at what point something failed.

Now that I know a lot more about the chip, I can have another try at the C-code and see if it makes more sense.
:slight_smile:

Your problem seems to be a bit like mine, but I'm not too sure about the race-condition cause. I'd like to know how you get on with coding for the W5100 chip.

My setup is an Uno and Ethernet shield with SD card and relay outputs. The sketch converts the resistance of analogue PT1000 sensors into temperatures, which then set the relays. Every 10 minutes all the data is dumped to the SD card for record purposes, and a web server section allows this data to be viewed over TCP/IP on another computer. It is also possible to alter the thermostat settings from the PC. The clock is set by NTP via UDP at startup and then supposedly once a day thereafter. All the bits of the programme work on their own (thermostat settings, A/D conversion, data to SD card, SD card to web page, NTP update), but when they are all together something causes a crash after a few hours.

I've looked at power supplies, SD card formatting, the SD card malloc bug, EthernetUDP memory creep and FreeRam(). Nothing seems to make any difference, and it's been driving me crazy for 6 months.

Hi book_woorm,
I have never been able to find a definitive answer, but I have noticed that the time before a lockup/crash is longest when I avoid DNS lookups, so I use hard-coded IP addresses.
It's not good as a long-term fix, but it turns a crash rate of hours into days.
The only place I still use a lookup is once every 24 hours, for a timeserver at oceania.pool.ntp.org. If my home router had a timeserver function I could hard-code that IP address as well.
I have read many theories about the reasons for this; lack of memory often comes up as a possibility. (Have you seen the Goldilocks board? 16K RAM!)
To get around my lockups, I use the watchdog timer to reboot the board. It always seems to come back up OK.
On my Arduino server, I have an uptime timer (days/hrs/min/sec), so when I look at the page I know how long it's been since the last reset.
What I don't differentiate (yet) is whether a reboot is due to an Ethernet lockup or a string of five consecutive logging-service post failures. I will remedy that soon.
It seems that the logging sites can miss quite a few posts in busy times; at least Thingspeak can, Nimbits seems more reliable in that sense.
This is only my personal experience.
One thing you are doing that I don't is writing to the SD card. I just read from one. I'm not sure if writing uses more RAM.
Good luck with your coding, I hope you find some answers :slight_smile:
Stewie
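
On telling the two kinds of reboot apart: since both end in a watchdog reset, the reset-cause flags alone would presumably not distinguish them. One approach is to write a marker somewhere that survives the reset just before the deliberate reboot, then read and clear it in setup(). The sketch below models that in plain C++. `PersistentByte` is a hypothetical stand-in for one byte of EEPROM, and the other names are made up for illustration.

```cpp
#include <cstdint>

enum RebootReason : uint8_t {
    REASON_UNKNOWN = 0,        // power-on, or a hang the sketch never saw coming
    REASON_POST_FAILURES = 1   // sketch chose to reboot after repeated failed posts
};

// Stand-in for one byte of storage that survives a reset (EEPROM on an AVR).
struct PersistentByte {
    uint8_t value = 0;
    uint8_t read() const { return value; }
    void write(uint8_t v) { value = v; }
};

// Call just before deliberately triggering the watchdog.
inline void noteDeliberateReboot(PersistentByte& store) {
    store.write(REASON_POST_FAILURES);
}

// Call once in setup(): returns why the last reboot happened and clears the
// marker, so an unplanned watchdog reset next time reads as REASON_UNKNOWN.
inline RebootReason takeRebootReason(PersistentByte& store) {
    uint8_t raw = store.read();
    store.write(REASON_UNKNOWN);
    return raw == REASON_POST_FAILURES ? REASON_POST_FAILURES : REASON_UNKNOWN;
}
```

Any value other than the marker (including the 0xFF of a never-written EEPROM cell) is treated as unknown, which is the case that points at a genuine lockup.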

This code can cause a lockup if the connection breaks (hardware fail).

// look for response
    while (Logging_client.connected()) {

      // if the connection breaks while in this loop
      // this while loop will never exit

      if (Logging_client.available()) {
        char c = Logging_client.read();
        Serial2.print(c);
        if (bufindex < 198) {
          //store characters to string 
         content[bufindex] = c;
         bufindex = bufindex+1;
       }
      }
      content[bufindex] = 0;
     }

This is what I use to prevent those ugly lockups. Maybe it will help you. It has a timeout that will close the connection if no packets received for 10 seconds.

  // connectLoop controls the hardware fail timeout
  int connectLoop = 0;

  while(client.connected())
  {
    while(client.available())
    {
      inChar = client.read();
      Serial.write(inChar);
      // set connectLoop to zero if a packet arrives
      connectLoop = 0;
    }

    connectLoop++;

    // if more than roughly 10000 milliseconds (10000 passes of the delay(1) below) since the last packet
    if(connectLoop > 10000)
    {
      // then close the connection from this end.
      Serial.println();
      Serial.println(F("Timeout"));
      client.stop();
    }
    // this is a delay for the connectLoop timing
    delay(1);
  }

  Serial.println();
  Serial.println(F("disconnecting."));
  // close client end
  client.stop();
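
A variant of the same idea measures elapsed time directly instead of counting loop passes, and keeps the response buffer terminated at all times. This is a sketch in plain C++ under assumptions: `readResponse` is a hypothetical helper, `Client` is anything with `connected()`, `available()`, `read()` and `stop()` (as EthernetClient has), and `nowMs` would be `millis` on an Arduino.

```cpp
#include <cstdint>
#include <cstddef>

// Reads a response into buf (always NUL-terminated) until the peer closes
// or no byte has arrived for idleMs. Returns true on a clean close,
// false if the idle timeout fired. The client is stopped in both cases.
template <typename Client, typename Clock>
bool readResponse(Client& client, Clock nowMs, uint32_t idleMs,
                  char* buf, size_t bufSize) {
    size_t n = 0;
    bool timedOut = false;
    uint32_t lastByteAt = nowMs();
    if (bufSize > 0) buf[0] = '\0';
    while (client.connected()) {
        while (client.available()) {
            char c = (char)client.read();
            if (n + 1 < bufSize) { buf[n++] = c; buf[n] = '\0'; }
            lastByteAt = nowMs();   // any received byte restarts the idle timer
        }
        // Unsigned subtraction stays correct across a millis() rollover.
        if ((uint32_t)(nowMs() - lastByteAt) >= idleMs) { timedOut = true; break; }
    }
    client.stop();
    return !timedOut;
}
```

A call along the lines of `readResponse(client, millis, 10000UL, content, sizeof(content))` would give the same 10 second limit as the loop above.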

Hi Stewie, I was going to go down the 'watchdog timer' route, thinking that the one-second drumbeat on the interrupt would stop when the programme hangs. Using the drumbeat to continuously retrigger a 555 monostable is simple enough, though it would involve a new master PCB for the system. But I've discovered that the web I/O can hang by itself while the drumbeat carries on dutifully measuring temperatures and recording data. Other times the data recording stops but the web server functions don't.

It all seems to vary with how many Serial.print statements I've put in a particular version while trying to track the problem. That smacks of memory overload, but FreeRam() is returning between 630 and 750 bytes depending on where in the programme I ask the question.

Thanks to SurferTim for the 'time out' code I'll try that when the current test falls over.

Here is the original test of the timeout. Almost a year ago. I did not find the bug, just provided the fix after it was pointed out to me.
http://arduino.cc/forum/index.php/topic,102879
It may not be your problem today, but it isn't really a matter of "if", only "when". The fails that happen once every couple weeks or months are the tough ones to find.

Thanks Tim,
I have incorporated your timeout into my code and will see if it makes a difference.
To date, about 3.5 days of uptime is my best. I'll be interested to see if this now changes (my fingers are crossed....)
Stewie

@Stewie

Your posted code does not compile:

sketch_mar25a.ino: In function 'void do_weblog()':
sketch_mar25a:17: error: 'lastCloudTime' was not declared in this scope
sketch_mar25a:17: error: 'postingInterval' was not declared in this scope
sketch_mar25a:18: error: 'line' was not declared in this scope
sketch_mar25a:19: error: 'showTimeDate' was not declared in this scope
sketch_mar25a:20: error: 'showRunTime' was not declared in this scope
sketch_mar25a.ino: In function 'void sendData()':
sketch_mar25a:34: error: 'str' was not declared in this scope
sketch_mar25a:35: error: 'cont' was not declared in this scope
sketch_mar25a:44: error: 'GetTemperature' was not declared in this scope
sketch_mar25a:45: error: 'temptemp' was not declared in this scope
sketch_mar25a:64: error: 'EthernetClient' was not declared in this scope
sketch_mar25a:64: error: expected `;' before 'Logging_client'
sketch_mar25a:65: error: 'Serial2' was not declared in this scope
sketch_mar25a:67: error: 'Logging_client' was not declared in this scope
sketch_mar25a:88: error: 'bufindex' was not declared in this scope
sketch_mar25a:97: error: 'content' was not declared in this scope
sketch_mar25a:101: error: 'content' was not declared in this scope
sketch_mar25a:104: error: 'content' was not declared in this scope
sketch_mar25a:105: error: 'failcount' was not declared in this scope
sketch_mar25a:109: error: 'failcount' was not declared in this scope
sketch_mar25a:119: error: 'failcount' was not declared in this scope
sketch_mar25a:125: error: 'Logging_client' was not declared in this scope
sketch_mar25a:129: error: 'failcount' was not declared in this scope
sketch_mar25a:132: error: 'WDTreboot' was not declared in this scope
sketch_mar25a:135: error: 'lastCloudTime' was not declared in this scope

Hi Nick,
No, it won't on its own. In my original post I said:

"I include the logging code here so perhaps someone may spot something I'm doing wrong. It is only a section of the complete code".

I was hoping that someone may spot an obvious error in the section of code that does the POST.

The complete code is spread through five modules and is fairly large now.
One for the server and temperature retrieval, one for power switching, one each for Thingspeak & Nimbits and one for NTP

While you are running this test, are you using a static ip or dhcp to set your ip for the Arduino?

It's a static IP.
Also, in my attempt to reduce the potential for memory leaks/allocation problems, I made almost everything global-scope variables/arrays while monitoring free memory to see if it was getting eaten up by something. Bad practice from what I understand, but if it nails everything down and removes a possibility then I can live with it for now. Currently 4473 bytes free.
Strings are in Flash, and IP addresses are static except for an NTP lookup on boot and every 24 hours.

I just noticed that your timeout code did its thing on a Nimbits post :). I'm watching the status as I tinker...

If you saw a "Timeout" message, it would have probably locked up then. During my tests, mine locks up and does not recover from it if the timeout code wasn't there.

I thought that the hang your timeout section guards against, in the read-from-client part, should not be able to happen (in theory...).
The Ethernet library has a timeout and retries built into it, so if the connection hangs while waiting for data, shouldn't the library force-close the socket, thus allowing an exit from the wait/read loop?
Perhaps this force-close fails if a flush is not done first? Or does it?
I'm just speculating here as I don't understand the internals of the library, which is what prompted me to do it in assembly.
I'd still rather understand what's happening in the high-level code :~

Do you know if it's possible to read the W5100 socket status register?

That would certainly help me with trying to pinpoint where things are going wrong, or at least perhaps rule some things out.
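
For what it's worth, decoding the status bytes is the easy part once they have been read. The values below are the W5100 socket status register (Sn_SR) codes. How to read them from a sketch is the assumption here: I believe the 1.0.x Ethernet library exposes it as `W5100.readSnSR(n)` after including `utility/w5100.h`, but please verify that against your library version. `socketStateName` and `countFreeSockets` are hypothetical helper names.

```cpp
#include <cstdint>

const int MAX_SOCKETS = 4;   // the W5100 has four hardware sockets

// Maps a W5100 Sn_SR value to a readable name.
const char* socketStateName(uint8_t sr) {
    switch (sr) {
        case 0x00: return "CLOSED";
        case 0x13: return "INIT";
        case 0x14: return "LISTEN";
        case 0x15: return "SYNSENT";
        case 0x16: return "SYNRECV";
        case 0x17: return "ESTABLISHED";
        case 0x18: return "FIN_WAIT";
        case 0x1A: return "CLOSING";
        case 0x1B: return "TIME_WAIT";
        case 0x1C: return "CLOSE_WAIT";
        case 0x1D: return "LAST_ACK";
        case 0x22: return "UDP";
        default:   return "OTHER";
    }
}

// Counts sockets whose status byte reads CLOSED (0x00), i.e. free for reuse.
int countFreeSockets(const uint8_t status[MAX_SOCKETS]) {
    int freeCount = 0;
    for (int i = 0; i < MAX_SOCKETS; ++i) {
        if (status[i] == 0x00) ++freeCount;
    }
    return freeCount;
}
```

Printing the four states every time a post fails would show whether sockets are being left in CLOSE_WAIT or similar instead of returning to CLOSED, and it also answers the earlier question about how many sockets are available.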

The w5100 has a problem with hardware fails. If the connection is closed (not a fail) by the server, all goes well. If the connection breaks, the server close message never gets to the client, and that while loop never exits.

If the ethernet firmware/library is supposed to timeout on its own during a receive, it doesn't. At least last I checked, and I think you are rechecking it right now. :slight_smile:

Yes, you are right that there is no receive timeout (in the W5100 anyway). I don't understand the Ethernet library well enough to know whether there is one in the software or not.
I was getting mixed up with the connect/transmit timeout. I implemented a receive timer in my assembly language version for just that reason; otherwise, as you say, you never get out of the loop.
I kept a stack of testing logs when I was working on the client side, so I just had a look through them and, right enough, the receive timeout error crops up several (2~5) times daily for Thingspeak. It tends to happen around the same period as 502 Bad Gateway conditions, also a daily occurrence. I put it down to server loading during busy periods, where it could not keep up (?)
The other errors that repeatedly come up are...
While waiting for a response to the connect request, a close arrives from the remote end, or...
Connection request simply times out.
At least both of these are simple to deal with, but as you pointed out, the receive side can just hang forever. I suspect that will be the source of at least one of my problems, so thanks for highlighting that one.

The timeout/bad-gateway errors seem not to happen for ages, then a block of them appears off and on for perhaps 20 minutes, then it's all back to plain sailing again. It's quite possible that sometimes the Arduino would get 'stuck' at this point, but also possible for it to get lucky, sail through, and not get caught till another time.

I'll try with the timeouts in place and see how it goes.

I have an Ethernet shield that checks a couple of other servers, roughly every 10 seconds. It ran OK for weeks but eventually seemed to hang occasionally. I added a watchdog reset and so far, no problems. In setup I added:

  // watchdog setup in case shield hangs
  wdt_enable(WDTO_8S);  // reset after eight seconds, if no "pat the dog" received

Before each client connection I have:

    wdt_reset();  // give me eight seconds to do stuff (pat the dog)

And that's it! (Plus an include at the start of the file):

#include <avr/wdt.h>

I do not use a watchdog timer in any of my code. But that is just me...

OTOH, Nick, I am using IDE v1.0.4, and I know how you are about upgrades. :wink:
Do you use that timeout code? Like I told the OP, the fails that happen once every couple weeks or months are the tough ones to find.