
Topic: Ethernet stability

Stewie

Hi all,
An outline of what I have is....
An RF network of RFM12 nodes with 1-wire temp sensors and an RFM12-enabled 4-output mains power switch (for lights or whatever). These communicate with a 'gateway' that collects the info from all the RF nodes and deals with commands/responses for the power-switch box.
The gateway links to an Ethermega by RS485.

On the Ethermega, I run a webserver that displays some of the temperatures and lets you turn on/off the 4 power outputs.
Also on the Ethermega, a client logs the temperatures to both nimbits and thingspeak (every 60 sec for Thingspeak, 90 sec for nimbits).
The time is retrieved from an NTP timeserver on startup and at each midnight thereafter.

What I notice is that when the client connects using a name lookup (i.e. cloud.nimbits.com, NOT a fixed IP), the client side of the Ethermega eventually hangs. This can take up to 6 hours or happen after as little as 20 minutes, but the server side continues to work when the client side stops.

I am currently trying the client side (logging) with fixed IP addresses (which is not a long-term solution), and at the same time the Ethermega server is being exercised by an auto-refresh request from Firefox every 10 sec.
Since moving to fixed IP addresses for the logging, the system has been stable for 24 hours (longest ever!), logging to two sites and serving up a webpage every 10 seconds without a hitch (over 8600 page refreshes).

While trying to investigate this I came across a Wiznet errata document (W3150A+/W5100 Errata Sheet) for the W5100 chip about possible errors relating to ARP requests. I am very inexperienced at C, though, and lack the understanding to figure out whether the fix for the errata has been included in the ethernet library, or even whether the client failure while using lookups could be caused by this. I'm hoping someone with a better C/C++ understanding has looked into this and may be willing to share their experiences.

I include the logging code here so perhaps someone may spot something I'm doing wrong. It is only a section of the complete code.

When the client side accumulates 5 successive failures, the whole system is reset by the WDT. I have never had a problem with the ethernet not coming back after a reboot.

Is there any way to check how many W5100 sockets are currently available?
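
On the socket question, here is a minimal sketch rather than a tested answer. The W5100 has four hardware sockets, and each one has a status register (Sn_SR) that reads 0x00 (SOCK_CLOSED) when the socket is free. The helper below only counts closed sockets from four status bytes. The name `countFreeSockets` is made up for illustration, and how you fill the array on real hardware (reading Sn_SR for sockets 0 to 3) depends on what register access your Ethernet library version exposes.

```cpp
#include <stdint.h>

// Sn_SR value for a closed (free) socket, from the W5100 datasheet.
const uint8_t SOCK_CLOSED = 0x00;
const int W5100_SOCKETS = 4;   // the W5100 has four hardware sockets

// Hypothetical helper: status[] holds the Sn_SR byte of sockets 0..3.
// Returns how many of them are closed, i.e. available for a new connection.
int countFreeSockets(const uint8_t status[W5100_SOCKETS]) {
    int freeCount = 0;
    for (int n = 0; n < W5100_SOCKETS; n++) {
        if (status[n] == SOCK_CLOSED) {
            freeCount++;
        }
    }
    return freeCount;
}
```

Printing this count on each logging cycle would show whether sockets are gradually being used up before a hang.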

Any help/suggestions gratefully received  :)

Thanks.


Code:


/*
posts to nimbits

Uses the PString.h library.
Instead of sending the request piecemeal with the usual Arduino ethernet functions
like 'client.print' (every statement is a separate transaction), all the data to send in the
POST transaction is 'assembled' into one large string using the PString library,
which then gets sent in one hit. This reduces the IP traffic considerably.
*/



void do_weblog() {
    // if 'postinginterval' secs have passed since
    // your last connection, then connect again and send data:
    if((millis() - lastCloudTime > postingInterval)) {
        line();
        showTimeDate();
        showRunTime();
        sendData();    // the POST happens here
        delay(2);
        }
    }



// this makes a HTTP connection to the logging server:

void sendData() {

byte cloud[] = { 74, 125, 31, 121 }; // cloud.nimbits.com

        str.begin();    // reset the into-string pointer. This is the 'main' string being assembled.
        cont.begin();   // and for the actual content ie payload string

// Create the 'content' of the string to send. Its assembled from the user details, then the sensor data
// 1st part of content is access details
        cont.print("email=XXXXXXXXXXXXXXXX&key=YYYYYYYYYYYYYY");    // store in the string 'cont'

// the sensor data is acquired from the RFM12 gateway that receives all the assorted temp sensor data
// get the data and store it in the 'cont' string. Data is requested from the RFM12 'gateway' using RS485.
        cont.print("&p1=hot-water&v1=");
        GetTemperature(0);        // hot water address
        cont.print(temptemp);    //
        cont.print("&p2=outside1&v2=");
        GetTemperature(5);        // turning circle sensor address
        cont.print(temptemp);    //
        cont.print("&p3=highest&v3=");
        GetTemperature(7);        // library sensor address
        cont.print(temptemp);    //
        cont.print("&p4=freezer1&v4=");
        GetTemperature(6);        // freezer in pantry address
        cont.print(temptemp);    //
        cont.print("&p5=workshop&v5=");
        GetTemperature(1);        // workshop address
        cont.print(temptemp);    //
       
// now get the length of the assembled content string. This string is the complete 'content'
        int contlen = (cont.length());
       
// Got the content assembled so try and connect to the web-site

    EthernetClient Logging_client;      // client for both logging services
    Serial2.println("Attempting to connect to Nimbits...");

    if (Logging_client.connect(cloud, 80))
//    if (Logging_client.connect("cloud.nimbits.com", 80))
    {
        str.println("POST /service/batch HTTP/1.1");
        str.println("Host: cloud.nimbits.com");
        str.println("Connection: close");
        str.println("Cache-Control: max-age=0");
        str.print("Content-Length: ");
        str.println(contlen,DEC);
        str.println("Content-Type: application/x-www-form-urlencoded");
        str.println();
        str.println(cont);    // the actual content (data points)

// the total string ('post' headers and content) is sent to the ethernet connection in one hit
        Logging_client.print(str);  // ethernet send to Nimbits

        Serial2.println();        // for debug
        Serial2.print(str);       // this is a copy of whats sent to the ethernet (same string)
        Serial2.println();        // for debug
        Serial2.println();        // for debug

      bufindex = 0;

// look for response
    while (Logging_client.connected()) {
      if (Logging_client.available()) {
        char c = Logging_client.read();
        Serial2.print(c);
        if (bufindex < 198) {
          //store characters to string
         content[bufindex] = c;
         bufindex = bufindex+1;
       }
      }
      content[bufindex] = 0;
     }

    if (strstr(content, "200 OK") != 0) {
        failcount = 0;    // reset after every successful post
    }
    else {
        // connected, but the response did not contain "200 OK"
        failcount = failcount + 1;
    }
    }
// The above is when the connection succeeds and data is sent/received   



// if the connection fails then here is next
    else {
        // if you couldn't make a connection:
        failcount = failcount + 1;
        Serial2.print("Connection failed ");
        }


// and after here for either good or bad connection
      Logging_client.flush();    // ensure no data left in buffer (wont allow close if present)
      Logging_client.stop();     // and finish the socket

      Serial2.print("failcount ");
      Serial2.println(failcount);

        if (failcount >= 5) {    // 5 successive failures: connection consistently failing, so reboot.
          WDTreboot();        // sets up an 8-sec timeout
        }

    lastCloudTime = millis();
}
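
A small illustration of the "assemble everything, send once" idea, written in plain desktop C++ with `std::string` rather than PString so it can be tried off the board. `buildPost` is a hypothetical helper, not part of the sketch above. The point it shows is that Content-Length has to match the body byte for byte, so the body is appended without a trailing line ending (a `println` of the content would add two extra bytes that the header does not count).

```cpp
#include <string>

// Hypothetical helper: build a complete HTTP/1.1 POST request in one buffer.
// host, path and body come from the caller; nothing here is Nimbits-specific.
std::string buildPost(const std::string& host,
                      const std::string& path,
                      const std::string& body) {
    std::string req;
    req += "POST " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Connection: close\r\n";
    req += "Content-Type: application/x-www-form-urlencoded\r\n";
    req += "Content-Length: " + std::to_string(body.size()) + "\r\n";
    req += "\r\n";   // blank line ends the headers
    req += body;     // no trailing CRLF, so the declared length matches
    return req;
}
```

On the Arduino the same layout applies: headers, one blank line, then exactly `contlen` bytes of content.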







SurferTim

Maybe you should look up the IP address with DNS once, then use that IP for a while. I do this with NTP and it works well. Here is the code snippet I use for that.
Code:
  // needs #include <Dns.h> (from the Ethernet library) at the top of the sketch
  IPAddress timeServer;
  DNSClient dns;

  dns.begin(Ethernet.dnsServerIP());
 
  if(dns.getHostByName("pool.ntp.org",timeServer)) {
    Serial.print(F("NTP server ip :"));
    Serial.println(timeServer);
  }
  else Serial.print(F("dns lookup failed"));
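
The "look it up, then reuse it for a while" idea could be sketched roughly as below. This is plain C++ with the lookup passed in as a callback so the caching logic can be tested off the board; `CachedHost`, `getCachedIp` and `ResolveFn` are made-up names, and on the Arduino the callback would wrap a DNS lookup like the snippet above. One design choice to note: if a refresh fails, the old address is kept instead of being thrown away.

```cpp
#include <stdint.h>

// Hypothetical resolver callback: writes the resolved IPv4 address into out
// and returns true on success.
typedef bool (*ResolveFn)(const char* host, uint8_t out[4]);

struct CachedHost {
    uint8_t  ip[4];
    uint32_t resolvedAtMs;   // millis()-style timestamp of the last good lookup
    bool     valid;
};

// Returns true if cache.ip holds a usable address after the call.
// A lookup is attempted only when the cache is empty or older than maxAgeMs.
// If that lookup fails, any previously cached address is kept.
bool getCachedIp(CachedHost& cache, const char* host, uint32_t nowMs,
                 uint32_t maxAgeMs, ResolveFn resolve) {
    // Unsigned subtraction stays correct across millis() rollover.
    bool stale = !cache.valid || (nowMs - cache.resolvedAtMs >= maxAgeMs);
    if (stale) {
        uint8_t fresh[4];
        if (resolve(host, fresh)) {
            for (int i = 0; i < 4; i++) {
                cache.ip[i] = fresh[i];
            }
            cache.resolvedAtMs = nowMs;
            cache.valid = true;
        }
    }
    return cache.valid;
}
```

With a refresh interval of a few hours, the logging client would do only a handful of DNS lookups per day instead of one per post.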


pylon

You're referring to erratum 2/3, which deals with ARP traffic. As ARP traffic is also used in non-DNS requests, I don't think this is the source of your problems. But erratum 1 is a possible source. It describes a race condition where the reception of a UDP datagram occurs almost simultaneously with the sending of a packet. The code of the Ethernet library only checks for the SEND_OK bit in the interrupt register, which may never be set (at least the erratum says that). The recommendation in the errata document is NOT in the Ethernet library yet.

I'm not sure if your problem is related to that erratum, because unfortunately the erratum says nothing about the TIMEOUT interrupt. If the TIMEOUT occurs as specified in the datasheet, the code in the Ethernet library is correct.
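
To illustrate the kind of wait loop being discussed, here is a rough sketch. It is neither the Ethernet library's code nor the workaround from the errata sheet; it only shows how a poll for SEND_OK can be bounded so that a flag which never arrives cannot hang the caller. The bit masks are the Sn_IR values from the W5100 datasheet. `waitForSend`, `ReadIrFn` and the `maxPolls` limit are assumptions made for the example.

```cpp
#include <stdint.h>

// Sn_IR (socket interrupt register) bit masks from the W5100 datasheet.
const uint8_t IR_SEND_OK = 0x10;
const uint8_t IR_TIMEOUT = 0x08;

enum SendResult { SEND_DONE, SEND_TIMED_OUT, SEND_GAVE_UP };

// Hypothetical callback: returns the current Sn_IR byte for the socket.
typedef uint8_t (*ReadIrFn)();

// Poll Sn_IR until SEND_OK or TIMEOUT is seen, but at most maxPolls times.
// SEND_GAVE_UP means neither flag appeared within the limit.
SendResult waitForSend(ReadIrFn readIr, uint32_t maxPolls) {
    for (uint32_t i = 0; i < maxPolls; i++) {
        uint8_t ir = readIr();
        if (ir & IR_SEND_OK) {
            return SEND_DONE;
        }
        if (ir & IR_TIMEOUT) {
            return SEND_TIMED_OUT;
        }
    }
    return SEND_GAVE_UP;
}
```

A caller that gets `SEND_GAVE_UP` could close the socket and count it as a failed post instead of blocking forever.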

Stewie

Thanks for your input :)
I have tweaked and adjusted and tried many things, but the end result is the same: after one or two days, it locks up.
To keep things happening, the watchdog reboots the board and it all seems happy again. It's transparent to the user; the only way I know it's happened is an uptime timer on the web page.
Perhaps I'm just trying to do too much on the one board:
NTP lookup, web server for control/status and client for posting logging data.
I just don't understand C or C++ well enough to dig deep down. In time perhaps.....
Meantime, I have written a server and client for the Wiznet in assembly language. It's a fraction of the size, but not as comprehensive. DNS and NTP still to do....
Driving the W5100 at register level has been quite educational. I certainly learned a lot about the structure of web transactions.
With the low-level access, it's easier to see what's going wrong. This highlights the lack of useful diagnostic and status info in the ethernet library, which is basically a go/no-go affair with nothing available to show what's going on in the background, or at what point something failed.

Now that I know a lot more about the chip, I can have another try at the C-code and see if it makes more sense.
:)


book_woorm

Your problem seems to be a bit like mine, but I'm not too sure about the race-condition cause. I'd like to know how you get on with coding for the W5100 chip.

My setup is an Uno and Ethernet shield with SD card and relay outputs. The sketch converts the resistance of analogue PT1000 sensors into temperatures, which then set the relays. Every 10 minutes all the data is dumped to the SD card for record purposes, and a web server section allows this data to be viewed over TCP/IP on another computer. It is also possible to alter the thermostat settings from the PC. The clock is set by NTP via UDP at startup and then supposedly once a day thereafter. All the bits of the programme work (thermostat settings, A/D conversion, data to SD card, SD card to web page, NTP update), but when they are all together something causes a crash after a few hours.

I've looked at power supplies, SD card formatting, the SD card malloc bug, EthernetUDP memory creep, and FreeRam(). Nothing seems to make any difference, and it's been driving me crazy for 6 months.

Stewie

Hi book_woorm,
I have never been able to find a definitive answer, but I have noticed that the length of time before lockups/crashes is maximised by not using DNS lookups, so I use hard-coded IP addresses.
It's not good as a long-term fix, but it makes a crash rate of hours turn into days.
The only place I still use a lookup is once every 24 hrs, when I do a lookup for a timeserver at oceania.pool.ntp.org. If my home router had a timeserver function I could even hard-code that IP address as well.
I have read many theories about the reasons for this; lack of memory often seems to come up as a possibility. (Have you seen the Goldilocks board? 16K RAM!)
To get around my lockups, I use the watchdog timer to reboot the board. It always seems to come back up OK.
On my Arduino server, I have an uptime timer (days/hrs/min/sec), so when I look at the page I know how long it's been since its last reset.
What I don't differentiate (yet) is whether a reboot is due to an ethernet lockup or a string of five consecutive logging-service post failures. I will remedy that soon.
It seems that the logging sites can miss quite a few posts in busy times. At least Thingspeak can; Nimbits seems more reliable in that sense.
This is only my personal experience.
One thing you are doing that I don't is writing to the SD card. I just read from one. I'm not sure if writing uses more RAM.
Good luck with your coding, I hope you find some answers :)
Stewie

SurferTim

This code can cause a lockup if the connection breaks (hardware fail).
Code:
// look for response
    while (Logging_client.connected()) {

      // if the connection breaks while in this loop
      // this while loop will never exit

      if (Logging_client.available()) {
        char c = Logging_client.read();
        Serial2.print(c);
        if (bufindex < 198) {
          //store characters to string
         content[bufindex] = c;
         bufindex = bufindex+1;
       }
      }
      content[bufindex] = 0;
     }


This is what I use to prevent those ugly lockups. Maybe it will help you. It has a timeout that will close the connection if no packets are received for about 10 seconds.
Code:
  // connectLoop controls the hardware fail timeout
  int connectLoop = 0;

  while(client.connected())
  {
    while(client.available())
    {
      inChar = client.read();
      Serial.write(inChar);
      // set connectLoop to zero if a packet arrives
      connectLoop = 0;
    }

    connectLoop++;

    // if more than 10000 milliseconds since the last packet
    if(connectLoop > 10000)
    {
      // then close the connection from this end.
      Serial.println();
      Serial.println(F("Timeout"));
      client.stop();
    }
    // this is a delay for the connectLoop timing
    delay(1);
  }

  Serial.println();
  Serial.println(F("disconnecting."));
  // close client end
  client.stop();


book_woorm

Hi Stewie, I was going to go down the 'watchdog timer' route, thinking that the one-second drumbeat on the interrupt would stop when the programme hangs. Using the drumbeat to continuously retrigger a 555 monostable is simple enough, though it would involve a new master PCB for the system. But I've discovered the web I/O can hang by itself while the drumbeat carries on dutifully measuring temperatures and recording data. Other times the data recording stops but the web server functions don't. It all seems to vary with how many Serial.print statements I've put in a particular version while trying to track the problem. That smacks of memory overload, but FreeRam() is returning between 630 and 750 bytes depending on where in the programme I ask the question.

Thanks to SurferTim for the 'timeout' code. I'll try that when the current test falls over.

SurferTim

Here is the original test of the timeout. Almost a year ago. I did not find the bug, just provided the fix after it was pointed out to me.
http://arduino.cc/forum/index.php/topic,102879
It may not be your problem today, but it isn't really a matter of "if", only "when". The fails that happen once every couple weeks or months are the tough ones to find.

Stewie

Thanks Tim,
I have incorporated your timeout into my code and will see if it makes a difference.
To date, about 3.5 days of uptime is my best. I'll be interested to see if this now changes (my fingers are crossed....)
Stewie




Nick Gammon

@Stewie

Your posted code does not compile:

Code:
sketch_mar25a.ino: In function 'void do_weblog()':
sketch_mar25a:17: error: 'lastCloudTime' was not declared in this scope
sketch_mar25a:17: error: 'postingInterval' was not declared in this scope
sketch_mar25a:18: error: 'line' was not declared in this scope
sketch_mar25a:19: error: 'showTimeDate' was not declared in this scope
sketch_mar25a:20: error: 'showRunTime' was not declared in this scope
sketch_mar25a.ino: In function 'void sendData()':
sketch_mar25a:34: error: 'str' was not declared in this scope
sketch_mar25a:35: error: 'cont' was not declared in this scope
sketch_mar25a:44: error: 'GetTemperature' was not declared in this scope
sketch_mar25a:45: error: 'temptemp' was not declared in this scope
sketch_mar25a:64: error: 'EthernetClient' was not declared in this scope
sketch_mar25a:64: error: expected `;' before 'Logging_client'
sketch_mar25a:65: error: 'Serial2' was not declared in this scope
sketch_mar25a:67: error: 'Logging_client' was not declared in this scope
sketch_mar25a:88: error: 'bufindex' was not declared in this scope
sketch_mar25a:97: error: 'content' was not declared in this scope
sketch_mar25a:101: error: 'content' was not declared in this scope
sketch_mar25a:104: error: 'content' was not declared in this scope
sketch_mar25a:105: error: 'failcount' was not declared in this scope
sketch_mar25a:109: error: 'failcount' was not declared in this scope
sketch_mar25a:119: error: 'failcount' was not declared in this scope
sketch_mar25a:125: error: 'Logging_client' was not declared in this scope
sketch_mar25a:129: error: 'failcount' was not declared in this scope
sketch_mar25a:132: error: 'WDTreboot' was not declared in this scope
sketch_mar25a:135: error: 'lastCloudTime' was not declared in this scope
http://www.gammon.com.au/electronics

Stewie

Hi Nick,
No, it won't on its own. In my original post I said:

"I include the logging code here so perhaps someone may spot something I'm doing wrong. It is only a section of the complete code".

I was hoping that someone may spot an obvious error in the section of code that does the POST.

The complete code is spread through five modules and is fairly large now.
One for the server and temperature retrieval, one for power switching, one each for Thingspeak and Nimbits, and one for NTP.

SurferTim

While you are running this test, are you using a static IP or DHCP to set the IP for the Arduino?


Stewie

It's a static IP.
Also, in my attempt to reduce the potential for memory leaks/allocation problems I made almost everything global-scope variables/arrays while monitoring free memory to see if it was getting eaten up by something. Bad practice from what I understand, but if it nails everything down and removes a possibility then I can live with it for now. Currently 4473 bytes free.
Strings are in Flash, and IP addresses are static except for an NTP access on boot and every 24 hrs.

- I just noticed that your timeout code just did its thing on a Nimbits post :). I'm watching the status as I tinker...



SurferTim

If you saw a "Timeout" message, it would probably have locked up at that point. During my tests, mine locked up and did not recover when the timeout code wasn't there.
