Atmega328 hangs. Even WDT fails to rescue!!

I have been working on a project where we use Atmega328 and Sim900A chips to transmit some data over GPRS. The first prototype has been running for 4 months, and the second for more than a month. Our code has evolved quite a bit during this time.

The next set of boards uses the same PCB as prototype 2 and the same components, except for two components (with almost identical specs) in the power supply of the Sim900. This set was built after prototype 2 had run successfully for a couple of weeks. Even though the latest set has practically the same h/w and s/w, the boards suffer from a very weird problem: the Atmega328 hangs during operation and stops working.

I have been struggling for more than a week to find the cause, without any success so far. Any help or suggestion will be highly appreciated. The relevant points are:

- The circuit is supposed to work 24x7. It hangs at various times: right at start-up, a few minutes after start-up, or a few hours after start-up. I tried to find the exact point where it hangs through the serial log, but there isn't any particular point at which it hangs.
- The boards have atmega328pu in place of atmega328p. I have always burned the Arduino bootloader before uploading code via USBASP. The sketch takes up 93% of available flash space.
- The fuse bits are set to the same values as when burning the bootloader from the Arduino IDE. I verified this by reading the fuse bits from one chip.
- Global variables take 79% of dynamic memory, but local variables are kept to a minimum. I observed the free space in the stack at various points as per this and found that at least 275 bytes were always free. Hence I ruled out a stack overflow.
- The circuit has an onboard SMPS. The Atmega works on 5V while the Sim900 works on 4.3V. I checked the supply voltage to both the Atmega328 and the Sim900 on an oscilloscope; the variation was less than 100mV.
- The reset pin has been pulled up, so there is no possibility of noise there. However, Rx and Tx remain open.
- As the code has evolved over a long time, with the previous version working well for more than a month, I don't suspect coding bugs like infinite loops. Still, it has been thoroughly checked for such possibilities.
- The watchdog timer has not been used. It's a possible workaround, but I would ideally like to find the reason why the board hangs in the first place.
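As a rough cross-check of the RAM figures above, here is a small sketch of the arithmetic. It assumes the 2048 bytes of SRAM of the ATmega328; the function names are made up for illustration, and since the IDE percentage is rounded the results are only good to within about ten bytes.

```c
#include <stdint.h>

#define ATMEGA328_SRAM_BYTES 2048u  /* assumption: 2 KB SRAM on the ATmega328 */

/* Bytes left for stack and heap after globals, from the IDE's rounded percentage. */
uint16_t ram_left_after_globals(uint8_t globals_percent)
{
    if (globals_percent >= 100u) {
        return 0u;
    }
    /* Round to the nearest byte instead of truncating. */
    uint32_t used = ((uint32_t)ATMEGA328_SRAM_BYTES * globals_percent + 50u) / 100u;
    return (uint16_t)(ATMEGA328_SRAM_BYTES - used);
}

/* Peak stack + heap use implied by the smallest free gap that was observed. */
uint16_t peak_stack_heap_bytes(uint16_t ram_left, uint16_t min_free_observed)
{
    if (min_free_observed >= ram_left) {
        return 0u;
    }
    return (uint16_t)(ram_left - min_free_observed);
}
```

With 79% and 275 bytes this gives roughly 430 bytes left after globals and a peak stack and heap use of about 155 bytes. That is consistent with ruling out a stack overflow, provided the free-space check ran at the deepest call points.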

After checking all the possibilities I can think of, I still don't have any clue about the reasons behind this weird problem. A sincere request to post any suggestion you may have.

Thanks in advance.

It is line 26 that is wrong. If not then my psychic powers have failed and you will have to :- https://www.youtube.com/watch?v=tdQL_7osGzc

I can understand the difficulty in understanding the problem without the code. However, I am still not sure whether it's a s/w bug or a h/w issue. I will put in some more effort before asking you to go through such long code.

How about the oscillator? Internal or external and what speed and load capacitance?

I am still struggling to find a solution to the above problem. We have tried many things in the meantime; the following are some that may lead to a conclusion:

- The uC hung at least once while running very minimal code. Thus we concluded the issue is with the hardware and NOT with the firmware.
- The uC hangs even with the watchdog timer in place.
- The two probable causes we have worked on so far are a glitch in the power supply and RF interference.
- We tested with a power supply of higher current rating, and the problem still occurred. Also, as the issue does not repeat regularly, we do not suspect a power supply issue.

Based on the above elimination, RF interference seems to be the only remaining cause. As the circuit also has a sim900 for the GPRS connection, we suspect some kind of RF interference is causing the problem.

The most perplexing question is why does WDT fail to reset the chip when it hangs? We use WDT to purposely reset the chip once in the code and hence are sure that we have enabled it correctly. Following is the code used to initialize WDT.

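#include <avr/interrupt.h>  /* cli(), sei() */
#include <avr/io.h>         /* MCUSR */
#include <avr/wdt.h>        /* wdt_disable(), wdt_enable(), WDTO_2S */

/* mcusr_mirror is a global declared elsewhere in the sketch. */
void wdt_init(void)
{
  /* Disable interrupts before enabling WDT. */
  cli();
  /* Save the reset flags, then clear them so the WDT can be reconfigured. */
  mcusr_mirror = MCUSR;
  MCUSR = 0;
  wdt_disable();
  wdt_enable(WDTO_2S);
  /* Re-enable interrupts. */
  sei();
}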

Any idea under what conditions the WDT can fail to reset the chip? (Apart from an infinite loop that contains a WDR, which we could spot from the serial log if the code were stuck in some loop.)
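One thing that might help narrow this down is logging why the chip last reset. Below is a minimal sketch that decodes a saved MCUSR value such as `mcusr_mirror`. The bit positions are the ATmega328P reset flags (PORF, EXTRF, BORF, WDRF); the helper names are hypothetical. Note that if a bootloader clears MCUSR before the sketch starts, the saved value may simply read as 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RESET_POWER_ON  0x01u  /* PORF,  bit 0 */
#define RESET_EXTERNAL  0x02u  /* EXTRF, bit 1 */
#define RESET_BROWN_OUT 0x04u  /* BORF,  bit 2 */
#define RESET_WATCHDOG  0x08u  /* WDRF,  bit 3 */

/* Append name to buf (joined with '+') if flag is set and there is room. */
static void append_if_set(uint8_t mcusr, uint8_t flag, const char *name,
                          char *buf, size_t len)
{
    if ((mcusr & flag) == 0u) {
        return;
    }
    size_t used = strlen(buf);
    size_t need = strlen(name) + (used ? 1u : 0u);
    if (used + need + 1u > len) {
        return;  /* not enough room: skip this name */
    }
    if (used) {
        strcat(buf, "+");
    }
    strcat(buf, name);
}

/* Hypothetical helper: describe the reset flags in a saved MCUSR value. */
const char *describe_reset_cause(uint8_t mcusr, char *buf, size_t len)
{
    if (len == 0u) {
        return buf;
    }
    buf[0] = '\0';
    append_if_set(mcusr, RESET_POWER_ON,  "power-on",  buf, len);
    append_if_set(mcusr, RESET_EXTERNAL,  "external",  buf, len);
    append_if_set(mcusr, RESET_BROWN_OUT, "brown-out", buf, len);
    append_if_set(mcusr, RESET_WATCHDOG,  "watchdog",  buf, len);
    if (buf[0] == '\0' && len > strlen("none")) {
        strcpy(buf, "none");
    }
    return buf;
}
```

Printing the result once at start-up would show whether the boards are seeing brown-out or watchdog resets at all. More than one flag can be set at once, which is why the names are joined rather than picking a single cause.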

How about the oscillator? Internal or external and what speed and load capacitance?

We are using an external 16MHz crystal. Sorry, but I do not understand what you mean by load capacitance.

22pF caps from crystal pins to Gnd.

Do Sim900A chips need a lot of current? What is the power source for the system?

It's a cellular radio! You can be damned sure it uses a lot of current: 200-400 mA average during data tx/rx, with peaks during transmit up to 2 A (no wonder using a cell connection eats battery life on your phone!)

Dollars to donuts, it's a power problem caused by above.

Which two parts did you replace with supposedly identical ones? Is the board layout unchanged between the non-working version and the one that works? What are the part numbers of these supposedly identical parts?

I had some fairly simple code which misbehaved until I added input protection to the input pins. If you are in the presence of RF that is probably a good start. However I agree with DrAzzy that power could also be an issue. Do you have lots of decoupling capacitors? Also a larger capacitor on the 5V line to absorb sudden power drains by other components?

The most perplexing question is why does WDT fail to reset the chip when it hangs? We use WDT to purposely reset the chip once in the code and hence are sure that we have enabled it correctly.

Possibly it is going into some brown-out state from which the processor does not recover. Do you have brown-out detection active?

CrossRoads:
Do Sim900A chips need a lot of current? What is the power source for the system?

Yes, we do have 22pF capacitors right next to the crystal.

CrossRoads: Do Sim900A chips need a lot of current? What is the power source for the system?

Yes, the Sim900 requires short bursts of 2A for 577 µs every 4.6 ms.

We are using an 800mA SMPS with an additional 2x220uF of capacitance to compensate for the Sim900 current bursts. We were concerned about how good this would be but decided to stick with it, as the prototypes worked fine for weeks without any problems.

We believed for a long time that it could not be anything but the power supply. However, I believe it's something different after trying the following:

- Put a 1000uF capacitor across the supply of the Atmega328, and it still hung at least once.
- Increased the capacitor for the Sim900 to 1000+220uF, still to no avail.
- Tested with a 2A supply, and the issue still repeated once.
- The Sim900 itself is very sensitive to supply voltage, and it has been posted on many forums that a glitch of even a few nanoseconds can cause it to restart. However, we have observed that when the Atmega hangs, the Sim900 doesn't even restart.
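For a rough feel of what the bulk capacitors have to cover, here is a back-of-the-envelope sketch using the figures from this thread (2 A bursts of 577 µs every 4.6 ms, an 800 mA supply). It assumes an ideal capacitor and a supply that delivers no more than its rating; capacitor ESR, wiring resistance, the supply's own output capacitance and any overload margin are ignored. The function names are made up for illustration.

```c
/* Fraction of the time the modem spends in a transmit burst. */
double burst_duty_cycle(double burst_s, double period_s)
{
    return burst_s / period_s;
}

/* Droop across an ideal capacitor that covers the part of the burst current
 * the supply cannot deliver: dV = (I_burst - I_supply) * t_burst / C. */
double burst_droop_volts(double burst_a, double supply_a,
                         double burst_s, double cap_f)
{
    double deficit_a = burst_a - supply_a;
    if (deficit_a <= 0.0) {
        return 0.0;  /* the supply alone can carry the burst */
    }
    return deficit_a * burst_s / cap_f;
}
```

Under those assumptions, 2x220uF gives a droop of about 1.57 V per burst and 1000+220uF about 0.57 V, with a duty cycle of about 12.5%. The ripple measured on the scope was under 100mV, so the real supply evidently does better than this; treat the numbers as a worst-case bound rather than a diagnosis.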

Nick Gammon: Possibly it is going into some brown-out state from which the processor does not recover. Do you have brown-out detection active?

The fuse settings are the same as for a standard Arduino, so we have brown-out detection at 2.7V. Can you please explain more about the processor going into brown-out and failing to recover?

Nick Gammon: I had some fairly simple code which misbehaved until I added input protection to the input pins. If you are in the presence of RF that is probably a good start.

Some of the input pins are left open on the board. I wish I had not done that, but I can't change it now. What do you mean by adding input protection to input pins?

We will be relieved even if we can get the watchdog reset to work whenever the controller hangs. Does enabling the WDT in code vs. setting the watchdog-always-on fuse make any difference?

I doubt if it would make much difference. You could try it.

What do you mean by adding input protection to input pins?

Floating pins could be an issue, but pins with wire runs connected are more likely to be, as the wires could act as an aerial and pick up RF. See, for example: http://www.digikey.com/en/articles/techzone/2012/apr/protecting-inputs-in-digital-electronics

Also http://www.thebox.myzen.co.uk/Tutorial/Protection.html

Failing that, please post your code (you can attach code). Just posting small snippets doesn't really cut it.

Thank you everybody for your suggestions. We tried reducing the chances of interference by getting rid of some cables coming out of the board. It has reduced the occurrence of atmega freezes but hasn't eliminated the problem.

It has now become very difficult to test the effect of any change, as some boards have run fine for more than a week before freezing.

Nick Gammon: Failing that, please post your code (you can attach code). Just posting small snippets doesn't really cut it.

As I mentioned earlier, even minimal code hung once. I am attaching that sketch.

We have concluded that adding an external watchdog is the only way out now. We are planning to use this chip from TI.

Meanwhile I found at least one other similar experience, with a probable insight into the cause of the issue.

However, I believe the problem of the Atmega328 freezing (due to a h/w issue) can be faced by anyone in the future, and we must find out why it happens and how to prevent it.

led_blink_test.ino (2.85 KB)

anandteke, I have the same type of problem. My circuit is very simple. The code runs very well, but when my hand is near the oscillator / MCU, the MCU hangs. Sometimes the hang persists even after resetting it by switching main power off and on.

I cannot understand why the hang does not clear after resetting it by switching main power off and on.

Does anyone have any idea?