Ethernet stability

You're referring to erratum 2/3, which deals with ARP traffic. As ARP traffic is also used in non-DNS requests, I don't think this is the source of your problems. But erratum 1 is a possible source. It describes a race condition where the reception of a UDP datagram occurs almost simultaneously with the sending of a packet. The code of the Ethernet library only checks for the SEND_OK bit in the interrupt register, which may never be set in that case (at least the erratum says so). The recommendation in the errata document is NOT in the Ethernet library yet.

I'm not sure if your problem is related to that erratum, because unfortunately the erratum says nothing about the TIMEOUT interrupt. If the TIMEOUT occurs as specified in the datasheet, the code in the Ethernet library is correct.

Thanks for your input :)
I have tweaked and adjusted and tried many things but the end result is the same: after one or two days, it locks up.
To keep things running, the watchdog reboots the board and it all seems happy again. It's transparent to the user; the only way I know it's happened is an uptime timer on the web page.
Perhaps I'm just trying to do too much on the one board:
NTP lookup, web server for control/status and client for posting logging data.
I just don't understand C or C++ well enough to dig deep down. In time perhaps...
In the meantime, I have written a server and client for the WIZnet in assembly language. It's a fraction of the size, but not as comprehensive. DNS and NTP still to do...
Driving the W5100 at register level has been quite educational. I certainly learned a lot about the structure of web transactions.
With the low-level access, it's easier to see what's going wrong. This highlights the lack of useful diagnostic and status info in the Ethernet library, which is basically a go/no-go affair with nothing available to show what's going on in the background, or at what point something failed.

Now that I know a lot more about the chip, I can have another try at the C-code and see if it makes more sense.
:)

Your problem seems to be a bit like mine, but I'm not too sure about the race-condition cause. I'd like to know how you get on with coding for the W5100 chip.

My setup is an Uno and Ethernet shield with SD card and relay outputs. The sketch converts the resistance of analogue PT1000 sensors into temperatures, which then set the relays. Every 10 minutes all the data is dumped to the SD card for record purposes, and a web server section allows this data to be viewed over TCP/IP on another computer. It is also possible to alter the thermostat settings from the PC. The clock is set by NTP via UDP at startup and then supposedly once a day thereafter. All the bits of the programme work (thermostat settings, A/D conversion, data to SD card, SD card to web page, NTP update), but when they are all together something causes a crash after a few hours.

I've looked at power supplies, SD card formatting, the SD card malloc bug, EthernetUDP memory creep, and FreeRam(); nothing seems to make any difference, and it's been driving me crazy for 6 months.

Hi book_woorm,
I have never been able to find a definitive answer, but I have noticed that the length of time before lockups/crashes is maximised by not using DNS lookups, so I use hard-coded IP addresses.
It's not good as a long-term fix but it turns a crash rate of hours into days.
The only place I still use a lookup is once every 24 hrs, when I look up a timeserver at oceania.pool.ntp.org. If my home router had a timeserver function I could even hard-code that IP address as well.
I have read many theories about the reasons for this; lack of memory often seems to come up as a possibility. (Have you seen the Goldilocks board? 16K RAM!)
To get around my lockups, I use the watchdog timer to reboot the board. It always seems to come back up OK.
On my Arduino server, I have an uptime timer (days/hrs/min/sec) so when I look at the page I know how long it's been since its last reset.
What I don't differentiate (yet) is whether a reboot is due to an Ethernet lockup or a string of five consecutive logging-service post failures. I will remedy that soon.
It seems that the logging sites can miss quite a few posts in busy times; at least Thingspeak can, Nimbits seems more reliable in that sense.
This is only my personal experience.
One thing you are doing that I don't is writing to the SD card. I just read from one. I'm not sure if writing uses more RAM.
Good luck with your coding, I hope you find some answers :)
Stewie
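To tell the two reboot causes apart, the run of failed posts could be counted separately from any lockup detection. What follows is only a minimal sketch in plain C++: `PostFailureTracker` and the `RebootReason` values are hypothetical names, not part of the posted sketch or of any library, and the limit of five is taken from the description above.

```cpp
#include <stdint.h>

// Hypothetical reasons for letting the watchdog reboot the board.
enum RebootReason : uint8_t {
  REBOOT_NONE = 0,
  REBOOT_POST_FAILURES = 1,  // too many consecutive logging-post failures
  REBOOT_ETHERNET_HANG = 2   // would be set by whatever detects a stuck connection
};

// Counts consecutive failed posts (hypothetical helper, not library code).
struct PostFailureTracker {
  uint8_t limit;      // consecutive failures tolerated before a reboot
  uint8_t failcount;  // length of the current run of failures

  explicit PostFailureTracker(uint8_t maxFailures)
      : limit(maxFailures), failcount(0) {}

  // Record the outcome of one post and report whether a reboot is due.
  RebootReason record(bool postSucceeded) {
    if (postSucceeded) {
      failcount = 0;  // any success ends the run
      return REBOOT_NONE;
    }
    if (failcount < limit) {
      failcount++;
    }
    return (failcount >= limit) ? REBOOT_POST_FAILURES : REBOOT_NONE;
  }
};
```

The reason would still have to be stored somewhere that survives the reset (EEPROM, for example) before the watchdog fires, so that the status page can show it next to the uptime.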

This code can cause a lockup if the connection breaks (hardware fail).

  // look for response
  while (Logging_client.connected()) {

    // if the connection breaks while in this loop
    // this while loop will never exit

    if (Logging_client.available()) {
      char c = Logging_client.read();
      Serial2.print(c);
      if (bufindex < 198) {
        // store characters to string
        content[bufindex] = c;
        bufindex = bufindex + 1;
      }
    }
    content[bufindex] = 0;
  }

This is what I use to prevent those ugly lockups. Maybe it will help you. It has a timeout that will close the connection if no packets are received for about 10 seconds.

  // connectLoop controls the hardware fail timeout
  int connectLoop = 0;

  while(client.connected())
  {
    while(client.available())
    {
      inChar = client.read();
      Serial.write(inChar);
      // set connectLoop to zero if a packet arrives
      connectLoop = 0;
    }

    connectLoop++;

    // if more than 10000 milliseconds since the last packet
    if(connectLoop > 10000)
    {
      // then close the connection from this end.
      Serial.println();
      Serial.println(F("Timeout"));
      client.stop();
    }
    // this is a delay for the connectLoop timing
    delay(1);
  }

  Serial.println();
  Serial.println(F("disconnecting."));
  // close client end
  client.stop();
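The loop counter above measures roughly 10 seconds, since each pass also spends time in read() and Serial.write() on top of the delay(1). A variation, and only a sketch of one, is to compare timestamps instead. `IdleTimer` is a hypothetical helper written for illustration, not part of the Ethernet library; on the board the timestamps would come from millis().

```cpp
#include <stdint.h>

// Tracks how long it has been since the last packet arrived.
// Hypothetical helper; timestamps are in milliseconds, e.g. from millis().
struct IdleTimer {
  uint32_t limitMs;  // allowed idle time before giving up
  uint32_t lastMs;   // timestamp of the most recent activity

  IdleTimer(uint32_t limit, uint32_t now) : limitMs(limit), lastMs(now) {}

  // Call whenever data arrives.
  void touch(uint32_t now) { lastMs = now; }

  // Unsigned subtraction keeps the comparison correct when the
  // millisecond counter rolls over from 0xFFFFFFFF to 0.
  bool expired(uint32_t now) const {
    return (uint32_t)(now - lastMs) > limitMs;
  }
};

// Intended use inside the receive loop (Arduino side, not compiled here):
//   IdleTimer idle(10000UL, millis());
//   while (client.connected()) {
//     while (client.available()) { client.read(); idle.touch(millis()); }
//     if (idle.expired(millis())) { client.stop(); }
//   }
```

This keeps the same behaviour as the counter version (close the connection from this end after 10 seconds of silence) without relying on the loop taking exactly one millisecond per pass.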

Hi Stewie, I was going to go down the 'watchdog timer' route, thinking that the one-second drumbeat on the interrupt would stop when the programme hangs. Using the drumbeat to continuously retrigger a 555 monostable is simple enough, though it would involve a new master PCB for the system. But I've discovered the web I/O can hang by itself while the drumbeat carries on dutifully measuring temperatures and recording data. Other times the data recording stops but the web server functions don't. It all seems to vary with how many Serial.print statements I've put in a particular version while trying to track the problem. That smacks of memory overload, but FreeRam() is returning between 630 and 750 bytes depending on where in the programme I ask the question.

Thanks to SurferTim for the 'time out' code. I'll try that when the current test falls over.

Here is the original test of the timeout. Almost a year ago. I did not find the bug, just provided the fix after it was pointed out to me.
http://arduino.cc/forum/index.php/topic,102879
It may not be your problem today, but it isn't really a matter of "if", only "when". The fails that happen once every couple weeks or months are the tough ones to find.

Thanks Tim,
I have incorporated your timeout into my code and will see if it makes a difference.
To date, about 3.5 days of uptime is my best. I'll be interested to see if this now changes (my fingers are crossed....)
Stewie

@Stewie

Your posted code does not compile:

sketch_mar25a.ino: In function 'void do_weblog()':
sketch_mar25a:17: error: 'lastCloudTime' was not declared in this scope
sketch_mar25a:17: error: 'postingInterval' was not declared in this scope
sketch_mar25a:18: error: 'line' was not declared in this scope
sketch_mar25a:19: error: 'showTimeDate' was not declared in this scope
sketch_mar25a:20: error: 'showRunTime' was not declared in this scope
sketch_mar25a.ino: In function 'void sendData()':
sketch_mar25a:34: error: 'str' was not declared in this scope
sketch_mar25a:35: error: 'cont' was not declared in this scope
sketch_mar25a:44: error: 'GetTemperature' was not declared in this scope
sketch_mar25a:45: error: 'temptemp' was not declared in this scope
sketch_mar25a:64: error: 'EthernetClient' was not declared in this scope
sketch_mar25a:64: error: expected `;' before 'Logging_client'
sketch_mar25a:65: error: 'Serial2' was not declared in this scope
sketch_mar25a:67: error: 'Logging_client' was not declared in this scope
sketch_mar25a:88: error: 'bufindex' was not declared in this scope
sketch_mar25a:97: error: 'content' was not declared in this scope
sketch_mar25a:101: error: 'content' was not declared in this scope
sketch_mar25a:104: error: 'content' was not declared in this scope
sketch_mar25a:105: error: 'failcount' was not declared in this scope
sketch_mar25a:109: error: 'failcount' was not declared in this scope
sketch_mar25a:119: error: 'failcount' was not declared in this scope
sketch_mar25a:125: error: 'Logging_client' was not declared in this scope
sketch_mar25a:129: error: 'failcount' was not declared in this scope
sketch_mar25a:132: error: 'WDTreboot' was not declared in this scope
sketch_mar25a:135: error: 'lastCloudTime' was not declared in this scope

Hi Nick,
No, it won't on its own. In my original post I said:

"I include the logging code here so perhaps someone may spot something I'm doing wrong. It is only a section of the complete code".

I was hoping that someone may spot an obvious error in the section of code that does the POST.

The complete code is spread through five modules and is fairly large now.
One for the server and temperature retrieval, one for power switching, one each for Thingspeak & Nimbits and one for NTP

While you are running this test, are you using a static ip or dhcp to set your ip for the Arduino?

It's a static IP.
Also, in my attempt to reduce the potential for memory leaks/allocation problems, I made almost everything global-scope variables/arrays while monitoring free memory to see if it was getting eaten up by something. Bad practice from what I understand, but if it nails everything down and removes a possibility then I can live with it for now. Currently 4473 bytes free.
Strings in Flash, static IP addresses except for a NTP access on boot and every 24hrs.

I just noticed that your timeout code just did its thing on a Nimbits post :). I'm watching the status as I tinker...

If you saw a "Timeout" message, it would probably have locked up at that point. During my tests, mine locks up and does not recover if the timeout code isn't there.

I thought that the read-from-client part that you have inside your timeout section should not be able to happen (in theory...)
The Ethernet library has a timeout and retries built into it, and if the connection hangs while waiting for data then shouldn't the library force-close the socket, thus allowing exit from the wait/read loop?
Perhaps this force-close fails if a flush is not done first? Or is it?
I'm just speculating here as I don't understand the internals of the library, which is what prompted me to do it in assembly.
I'd still rather understand what's happening in the high-level code :~

Do you know if it's possible to read the W5100 socket status register?

That would certainly help me with trying to pinpoint where things are going wrong, or at least perhaps rule some things out.

The W5100 has a problem with hardware fails. If the connection is closed (not a fail) by the server, all goes well. If the connection breaks, the server close message never gets to the client, and that while loop never exits.

If the Ethernet firmware/library is supposed to time out on its own during a receive, it doesn't. At least it didn't last I checked, and I think you are rechecking it right now. :)

Yes, you are right about there being no receive timeout (in the W5100 anyway). I don't understand the Ethernet library enough to know if it is in the software or not.
I was getting mixed up with the connect/transmit timeout. I implemented a receive timer in my assembly language version for just that reason or, as you say, you never get out of the loop.
I kept a stack of testing logs when I was working on the client side so I just had a look through them and, right enough, the receive timeout error crops up several (2~5) times daily for Thingspeak. It tends to happen around the same period as 502 Bad Gateway conditions, also a daily occurrence. I put it down to server loading during busy periods, where it could not keep up (?)
The other errors that repeatedly come up are...
While waiting for a response to the connect request, a close arrives from the remote end, or...
Connection request simply times out.
At least both of these are simple to deal with, but as you pointed out, the receive side can just hang forever. I suspect that will be the source of at least one of my problems, so thanks for highlighting that one.

The timeouts/bad-gateway errors seem not to happen for ages, then a block of them appears off and on for perhaps 20 min, then it's all back to plain sailing again. It's quite possible that sometimes the Arduino would get 'stuck' at this point, but also possible for it to get lucky and sail through and not get caught till another time.

I'll try with the timeouts in place and see how it goes.

I have an Ethernet shield that checks a couple of other servers, roughly every 10 seconds. It ran OK for weeks but eventually seemed to hang occasionally. I added a watchdog reset and so far, no problems. In setup I added:

  // watchdog setup in case shield hangs
  wdt_enable(WDTO_8S);  // reset after eight seconds, if no "pat the dog" received

Before connecting to a client I have:

    wdt_reset();  // give me eight seconds to do stuff (pat the dog)

And that's it! (Plus an include at the start of the file):

#include <avr/wdt.h>

I do not use a watchdog timer in any of my code. But that is just me...

OTOH, Nick, I am using IDE v1.0.4, and I know how you are about upgrades. ;)
Do you use that timeout code? Like I told the OP, the fails that happen once every couple weeks or months are the tough ones to find.

Upgrade: Exchange your old bugs for new ones.

But as it turns out I am on 1.0.4 right now. :)

I don't fiddle around with Ethernet timeouts, they just seem to happen in a timely way. The watchdog lines are all I use, and to be honest I don't know if they kick in often because the board just seems to keep working.

I have one in my garage monitoring if the roller door is open or not, with no watchdog, and I've never had to reboot that except once I think after a brown-out of the house power. I'm talking a couple of years here.

AH HA!! I KNEW IT!! Deep down inside, you always wanted a reliable current version. :)

I bumped your karma 'cause I like your stuff.