Ethernet Shield Freeze

I have developed a small webserver largely based on the example code provided with 0016. I have patched the Ethernet library with etracer's client.close fix that was recently discussed on the mailing list. I am also using the WString library patched to have a dealloc method for strings. My arduino webserver services a simple ajax request and can easily respond to 5-10 requests/s for long periods of time. The code appears very solid. The hardware is a Duemilanove and official Ethernet shield powered via USB (I have also tried external power). The problem I encounter can occur with no additional hardware or circuitry connected.

At times, the ethernet module will stop responding to requests, including ICMP ping. I have verified that it is not a network, computer, or cabling problem. I understand that at least ICMP ping responses should be handled entirely within the wiznet module.

I have access to another arduino and another official ethernet shield. I have not yet tried to reproduce this problem on them, but my first inclination is that it may be wiznet hardware/firmware related or possibly a heat problem. Is anyone aware of any method by which code running on the arduino could cause the wiznet to stop responding to ping (short of issuing a reconfiguration)? Sometimes this problem will happen when the boards have been idle for hours; other times it will occur shortly after or during the servicing of requests. It does not seem to matter how many requests have been made or how long the arduino has been running.

Any thoughts on debugging this?

A similar issue happens with my ethernet shield. I think it is because of the heat.

I've experienced the same issue, but it's pretty infrequent. I solved it by controlling the wiznet reset with an arduino pin so that I could reset the wiznet anytime it stops responding. I'm using the Wiznet 812MJ standalone, so I'm not sure how easy it would be to implement on the shield.

The ethernet shield is not all it's cracked up to be, I guess.

This issue is happening pretty frequently on my board. I will try a heatsink, but it appears I might need to do something a little more robust to be sure I can count on it working when it's far away from me.

@Digger450: I'm sure I can tap the reset pin somewhere on the shield to control it with the arduino. Do you have sample code that you can share for how you detect that it is unresponsive in a sketch? I don't want to reinvent the wheel if you have already determined a reliable workaround.

I am also having the cold start bug, and I was planning on adding a capacitor to fix that. But if I have to control the reset pin from the arduino anyway, I might as well just put a pulldown resistor on the wiznet reset and have the arduino take complete control of it. After seeing the plethora of problems with the Ethernet shield, I have to say I'm kind of disappointed in it.

Gork - I can try to give you some ideas. My application is different from yours, as I am receiving messages at regular intervals. If I go a certain amount of time without receiving a message, I reset the wiznet. Because of where this is installed, it was very important that it be able to recover without my intervention. So, if after 5 attempts to reset the wiznet it still doesn't receive a message, I reset the Arduino as well. This process continues until communication is established. For me, this has worked flawlessly for several months.

You may be able to set up a scheduled connection to either a local machine or one on the internet to check your connection. I simply have the reset pin tied to a digital pin through a 10k resistor. At startup I set the pin high. For a reset I set the pin low for half a second and then return it to high. You'll have to re-initialize your settings afterward.
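The escalation logic described above can be sketched roughly as follows. This is only an illustration written as plain C++ so it can be tried off the board: the class name, the `Action` values and the 60-second timeout used in the usage note are assumptions, while the 5 reset attempts and the half-second reset pulse come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// What the caller should do after a poll (illustrative names).
enum class Action { None, ResetWiznet, ResetArduino };

class LinkWatchdog {
public:
    LinkWatchdog(uint32_t timeoutMs, uint8_t maxWiznetResets)
        : timeoutMs_(timeoutMs), maxResets_(maxWiznetResets) {
        assert(timeoutMs > 0);
    }

    // Call whenever a valid message arrives.
    void onMessage(uint32_t nowMs) {
        lastEventMs_ = nowMs;
        resetsTried_ = 0;
    }

    // Call from loop(); nowMs would be millis() on the Arduino.
    Action poll(uint32_t nowMs) {
        // Unsigned subtraction stays correct across millis() rollover.
        if (nowMs - lastEventMs_ < timeoutMs_) return Action::None;
        lastEventMs_ = nowMs;  // restart the timeout after each escalation step
        if (resetsTried_ < maxResets_) {
            ++resetsTried_;
            return Action::ResetWiznet;
        }
        resetsTried_ = 0;
        return Action::ResetArduino;
    }

private:
    uint32_t timeoutMs_;
    uint8_t maxResets_;
    uint32_t lastEventMs_ = 0;
    uint8_t resetsTried_ = 0;
};
```

In a sketch, something like `LinkWatchdog wd(60000, 5);` would be polled from loop(). On `ResetWiznet` the reset pin would be pulled low for 500 ms, returned high, and Ethernet.begin(mac, ip) called again. How the Arduino itself gets reset on `ResetArduino` is left open here.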

The problem could be heat related, but I'm guessing it could just as easily be software. I agree that it could be more robust, but considering the cost compared to other options I'm ok with the workaround.

Let me know if you need any more info.

I created a Google Code issue for this.

Is there a sample program you can post that has this problem?

What if you try a simpler program that just initializes the Wiznet but doesn't use it for communication? Does the shield still stop responding to pings after some time?

Thank you; I will be able to do some further testing hopefully in the next few days. I'm glad in some respect that the problem does not seem unique to me.

As I said, I do have another arduino and ethernet shield. I will try to reproduce the problem on the other hardware first, then I will cut my code down to the bare minimum to see what happens. It might take a while for testing to come up with anything.

I am currently using 0017-RC1 but will go to 0017-RC2 as this appears to have the patch from issue #34 already included.

So I set up a simple sketch that only initializes the shield, but does nothing with it.

#include <Ethernet.h>

// network configuration.  gateway and subnet are optional.
byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
byte ip[] = { 192, 168, 1, 191 }; // Local IP

Server ethServer(23);

void setup()
{
  Ethernet.begin(mac, ip);
  ethServer.begin();
}

void loop()
{
  
}

I then wrote a C# program to ping the shield at a set interval. The results are interesting. If a ping fails I log it to a file. Here are the results so far.

8/1/2009 4:48:38 PM:  PingLoop Started
8/1/2009 4:48:45 PM:  IP To Ping = 192.168.1.191
8/1/2009 4:48:46 PM:  Ping Interval = 60
8/1/2009 4:52:33 PM:  Failed - TimedOut
8/1/2009 4:53:34 PM:  Failed - TimedOut
8/1/2009 4:54:36 PM:  Failed - TimedOut
8/1/2009 4:55:37 PM:  Failed - TimedOut
8/1/2009 5:52:34 PM:  Failed - TimedOut
8/1/2009 5:53:35 PM:  Failed - TimedOut
8/1/2009 5:54:37 PM:  Failed - TimedOut
8/1/2009 5:55:38 PM:  Failed - TimedOut
8/1/2009 6:52:35 PM:  Failed - TimedOut
8/1/2009 6:53:36 PM:  Failed - TimedOut
8/1/2009 6:54:38 PM:  Failed - TimedOut
8/1/2009 6:55:39 PM:  Failed - TimedOut
8/1/2009 8:52:35 PM:  Failed - TimedOut
8/1/2009 9:52:34 PM:  Failed - TimedOut
8/1/2009 9:53:36 PM:  Failed - TimedOut
8/1/2009 10:52:34 PM:  Failed - TimedOut
8/1/2009 10:53:36 PM:  Failed - TimedOut
8/1/2009 10:54:37 PM:  Failed - TimedOut
8/1/2009 11:53:36 PM:  Failed - TimedOut
8/1/2009 11:54:37 PM:  Failed - TimedOut
8/1/2009 11:55:39 PM:  Failed - TimedOut
8/2/2009 12:53:36 AM:  Failed - TimedOut
8/2/2009 12:54:38 AM:  Failed - TimedOut
8/2/2009 12:55:39 AM:  Failed - TimedOut
8/2/2009 12:56:41 AM:  Failed - TimedOut

I'm not sure what is happening. It seems to fail around the same time in each hour, but not every hour. I'll let it keep running and see what happens.
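To make the pattern easier to see, the failures can be bucketed by minute of the hour. The sketch below is just one way to do that in plain C++; the function names are made up for illustration and it assumes the log line format shown above. Every failure in the log so far lands between minute 52 and minute 56.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Extract the minute-of-hour from a log line such as
// "8/1/2009 4:52:33 PM:  Failed - TimedOut". Returns -1 if no time is found.
int minuteOfHour(const std::string& line) {
    int month, day, year, hour, minute, second;
    if (std::sscanf(line.c_str(), "%d/%d/%d %d:%d:%d",
                    &month, &day, &year, &hour, &minute, &second) != 6) {
        return -1;
    }
    return minute;
}

// Count failed pings per minute-of-hour (index 0..59).
std::vector<int> failuresByMinute(const std::vector<std::string>& lines) {
    std::vector<int> counts(60, 0);
    for (const std::string& line : lines) {
        if (line.find("Failed") == std::string::npos) continue;  // skip info lines
        int m = minuteOfHour(line);
        if (m >= 0 && m < 60) ++counts[m];
    }
    return counts;
}
```

Feeding the whole log file through `failuresByMinute` and printing the non-zero buckets would show whether later failures keep clustering in the same few minutes.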

OK I have been running through this a little more. I burned the exact same sketch to a second set of identical hardware - a second ethernet shield and Duemilanove. I have been hammering it intermittently with HTTP requests of various load profiles and it is 100% solid for going on 2 days -- hasn't missed a single ping or faltered on hundreds of thousands of server connections. The wiznet chip on this unit seems to be operating cooler than the other one.

So what gives? I started taking a closer look at the shield. The build quality is visibly poor. There is flux all over the through-hole components on the back of the boards that has not been washed away. One of the stacking header pins has solder blobs on its entire length. On the front, a couple of the vias near the wiznet LQFP are filled with solder. There was a small blob of spatter stuck to the side of the crystal on the top. Under magnification there is some very small balling and spatter on the pads of the LQFP. I am guessing that this last thing is the problem; there is probably a bridge here but I can't tell for sure without a microscope.

So it looks like a hardware QA issue, but I'm going to give this a few more days of testing before I determine what to do about it. Does anyone know the production method of the ethernet shield? It looks like reflow then hand solder of the through-hole components, but the work on both appears rather sloppy. Is the paste stenciled and SMT placed by hand on these? What is being used for reflow? Even my "Good" board exhibits some build issues. Compared to an arduino, there is a large visible quality difference, although certainly they use different production methods.

I bought the shields from SparkFun only about a week ago. I am probably capable of fixing them (at least if the wiznet is not damaged) but maybe I should just see if they would rather exchange it.

Do I need to follow up on the google code issue? It's not looking like a software/firmware problem.

@Digger450: What version of the IDE are you using that exhibits this behavior? In your above situation it looks like when there is a missed ping, the chip starts responding again soon afterward -- depending on the network setup this may not even be abnormal. In my case the wiznet never came back after it stopped responding. At this point, I think you and I may have different issues, particularly since we're not using exactly the same hardware.

Where did you get the shield?

There have been reports of poor quality copies of the official Arduino hardware bought on ebay recently. The reports I recall were Duemilanove copies, but it's possible to copy any of them...

-j

Sparkfun, as I said - both ethernet shields and one of the Duemilanove boards came from them. I'm pretty sure I have "genuine" hardware all around.

"Sparkfun, as I said"

Oops, overlooked that. Your boards would be genuine, in that case.

-j

Hej,

Sorry it took me this long to respond on the HW QA issue. Could you please report your board number directly to team [at] arduino.cc? We will look into the production series you mention.

Also please let's keep this discussion open in case it is a software issue. And keep on using googlecode for bug reports.

/d

I'm also having my Ethernet shield freeze up, although I don't detect a heat issue.

It will run for 24-72 hours perfectly fine, and then freeze up. It is only the Ethernet that crashed, as the rest of my program continues to work properly.

I have tried 2 Ethernet shields, with the exact same result, as well as 3 separate Arduino bases (duemilanove). Sorry I can't try more hardware, that's all I own at the moment!

Have you solved this?

Is there code to (through software only) reset the Ethernet? I don't have any available I/Os on the arduino in my setup to dedicate to this.

Any help would be appreciated.
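One software-only option that might be worth trying: the W5100 datasheet describes a software reset bit (RST, bit 7) in the Mode Register at address 0x0000, which the chip clears by itself once the reset completes. The sketch below only shows the register sequence. The `Bus` type is a hypothetical wrapper around whatever register access your library version provides, and `FakeBus` exists only so the logic can be tried on a PC. Nobody in this thread has confirmed that a software reset recovers a frozen chip, so treat it as an experiment.

```cpp
#include <cassert>
#include <cstdint>

// W5100 Mode Register address and its software-reset bit, per the datasheet.
const uint16_t W5100_MR = 0x0000;
const uint8_t W5100_MR_RST = 0x80;

// Bus is any type providing write(addr, value) and read(addr).
// It is a hypothetical wrapper, not an Ethernet library API.
template <typename Bus>
bool softResetW5100(Bus& bus, int maxPolls = 1000) {
    assert(maxPolls > 0);
    bus.write(W5100_MR, W5100_MR_RST);
    // The chip clears RST on its own when the reset has finished.
    for (int i = 0; i < maxPolls; ++i) {
        if ((bus.read(W5100_MR) & W5100_MR_RST) == 0) return true;
    }
    return false;  // never acknowledged; a hardware reset may be the only option
}

// Stand-in bus for host testing: clears RST after a few reads.
struct FakeBus {
    uint8_t mr = 0;
    int readsUntilClear = 3;
    void write(uint16_t addr, uint8_t value) {
        if (addr == W5100_MR) mr = value;
    }
    uint8_t read(uint16_t addr) {
        if (addr != W5100_MR) return 0;
        if ((mr & W5100_MR_RST) && --readsUntilClear <= 0) {
            mr = static_cast<uint8_t>(mr & ~W5100_MR_RST);
        }
        return mr;
    }
};
```

A reset clears the chip's registers, so the network settings would have to be applied again afterward, for example by calling Ethernet.begin(mac, ip) once more.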

I have a similar problem but have a couple more clues. I posted this in another area but there were no replies so I'll copy it here:

I'm using an Atmega 328 and the wiznet ethernet module, with code similar to the SERVER code on this website. Everything works fine, BUT sometimes the wiznet module stops responding. Or at least I think it's the module, since the arduino program is still running.
I have a firewall called Zone Alarm on the windows laptops here. Zone Alarm pops up a window and says:

The firewall has blocked Internet access to your computer (ICMP Unreachable) from 0.0.0.0.
When this occurs the wiznet module stops responding to GET requests for the server running on the 328.
Does anyone have any idea where IP address 0.0.0.0 is coming from? And what the connection to the wiznet module is? This one has me baffled! The only thing that comes to mind is ARP? But that's at a lower level?
My ethernet setup is a DSL modem connected to an 8 port switch and all my computers and duino off the switch. The modem has a built in firewall and does not let 0.0.0.0 IP address through. The duino is listening on port 80. It responds to requests for data and just sends back an HTML OK and some data. I tried to make the server code robust so if an unidentified request comes in then it just sends back OK and closes the connection.
These 0.0.0.0 IP address things occur very rarely. Several days apart.

After reading these posts it looks like the Wiznet chip may have a problem. I have 2 new wiznet modules and they act the same way. Note that I'm using the wiznet module and not the ethernet shield.
Since I wrote the above post I put debugging code in my server function to log any connection that does not match an expected connection request. So far the Wiznet module has frozen up several times since I added this code, and nothing has been logged. I also tried changing the port from 80 to 6 and it still freezes up.
Could someone with more ethernet/tcp/arp experience comment on this?

Thanks,
Rich

I'm going to have to start blaming the hardware too. I've been experiencing random failures as well. I have a test running that listens for connections (there are none) and periodically sends out a poll (calls a remote cgi script) and collects some nonsense data. Other than printing the time periodically that's all it does, but it fails sometimes. After reading these posts and touching the Wiz (Ouch!) I decided maybe the chip was getting hot. I glued a LM335 to the chip and wired it to an analog port. I'm now monitoring the temperature. Guess what? It's not failing.

That leads me to three possibilities.

  1. The problem is still intermittent and I just haven't experienced it yet.
  2. The problem is heat and by adding a bit more plastic to the case, I've created additional area for heat dispersion.
  3. I have no idea what I'm doing and I should give up and go have a beer.

Since I'm trying to lose weight, I'm not going to entertain number 3, no matter how likely it is.

Time will eventually tell me if number 1 is correct.

Nothing will guarantee that number 2 is correct.

Anyway I'll keep this post going if anything comes up. By the way the temperature is hanging around 102-105 degrees F. I suspect the difference is due to bit wobble. I take 10 readings and average them, but I still see a variance. I'm not using a precision reference either, just the 5 volt line, which measures 4.87. I have a LM336 that I could hook up if I really wanted to be precise, but I'm looking for a change not necessarily an absolute value.
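For anyone who wants to check the arithmetic, here is a rough sketch of the conversion. It assumes the Arduino's 10-bit ADC referenced to the 5 volt line (measured at 4.87 V) and the LM335's nominal 10 mV per kelvin. The function names are just for illustration, and it is plain C++ so the numbers can be checked on a PC.

```cpp
#include <cassert>
#include <cstddef>

// Convert an (averaged) 10-bit ADC reading of an LM335 to degrees Fahrenheit.
// vref is the measured ADC reference voltage, e.g. 4.87 for the 5 volt line here.
double lm335Fahrenheit(double adcCounts, double vref) {
    double volts = adcCounts * vref / 1024.0;  // 10-bit converter
    double kelvin = volts / 0.010;             // LM335: 10 mV per kelvin
    return (kelvin - 273.15) * 9.0 / 5.0 + 32.0;
}

// Average a batch of raw readings, as done with the 10 samples above.
double averageCounts(const int* readings, std::size_t n) {
    assert(n > 0);
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += readings[i];
    return static_cast<double>(sum) / static_cast<double>(n);
}

// Temperature change represented by a single ADC count.
double fahrenheitPerCount(double vref) {
    return (vref / 1024.0) / 0.010 * 9.0 / 5.0;
}
```

With a 4.87 V reference one count works out to roughly 0.86 degrees F, so a wobble of three or four counts is enough to cover the whole 102-105 range. A reading of 656 counts corresponds to about 101.9 degrees F.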

The poll is occurring every 5 seconds so I'm keeping the chip busy.

Jim.

Hm, that didn't work. I don't think it's heat related after all. It died after just over 4 hours, at 102 degrees. I put a fan on it and the temp is now in the low 90's. It failed immediately after reset. It's now been running for 26 minutes and the temp is down to 87-89 degrees.

Jim.

Recompiling my application under Arduino-0018 improved the "uptime" to about 4 to 6 days. Previously, the Ethernet shield would freeze in under 2 days (using Arduino-0017).

I looked at the release notes, and the only things I see that might affect this are:

  • No longer disabling interrupts in delayMicroseconds().
  • Fixed bug w/ micros() returning incorrect values from within an interrupt.

I would be interested to know if anyone else sees an improvement under Arduino-0018?

I have a similar problem to those described in this post. I try to initialize the wiznet chip with some code (arduino-0018):

#include <Ethernet.h>

byte mac[] = { 0x90, 0xA2, 0xDA, 0x00, 0x01, 0x23 };
byte ip[] = { 10, 10, 10, 12 };

//Server server(80);

void setup()
{
  Ethernet.begin(mac, ip);
}

void loop()
{

}

And the Ethernet shield is completely unresponsive. I am using Ubuntu Linux 9.10 (Karmic Koala) and a Duemilanove board with an Arduino Ethernet Shield fitted with a wiznet W5100 chip. I checked my arp table, and the Ethernet library is not registering its MAC address. Here is my arp -a output after initializing the board with the code above:

qwestmodem.domain (10.10.10.1) at 00:24:7b:c6:1d:70 [ether] on eth2
? (10.10.10.12) at <incomplete> on eth2

In my Windows 7 environment, I get no arp entry at all. I've read about the Issue 34 fix, but that fix only applies after the interface has been completely initialized.

I bought the board from SparkFun last week and I live near Boulder, so it's not a big deal for me to exchange it. I feel like the board is faulty after reading this post, and I'm going to order another board on Friday and I'll post again with my results using different hardware. It would be nice if they would QA this thing a little better. It makes it hard for anyone messing with electronics who doesn't know the first thing about how network interfaces and tcp/ip work.

;D

I forgot to mention that I did get this to work intermittently last week, and I couldn't make sense of it because it would work sometimes and not others. When it did work, I was able to ping the interface but any other kind of communication did not work. I tried simple HTTP clients and servers, and none worked.

At a certain point sometime last week the chip completely stopped working and I haven't been able to get it working again no matter what I try. I went as far as hacking the libraries a bit and adding serial debug messages to all the ethernet.cpp library functions to see if there was a problem with the library, and the code seems to be pretty good. Each function does its job the way it should. It doesn't seem to be a problem in the software, and the code works on other chips of the same model.

As of now I would have to blame the hardware build of the board that I have, but I will know for sure when I get another board on Friday. If that one gives me the same headache I'll try exchanging one more time.

Has anyone else had these types of problems when ordering from different supply companies than SparkFun?