Does an Arduino need an external hardware watchdog circuit?

There has been some discussion regarding the use of an Arduino for embedded systems. That discussion recently turned to hardware watchdogs.

Is an external hardware based watchdog circuit useful? Something that might be centred around a Maxim DS1232 for example. I've not found a thread here regarding watchdog hardware. All watchdog threads seem to dwell on the internal watchdog circuitry. This is clearly a software based approach, and might still be susceptible to software issues that an external hardware circuit would be immune to.

On the other hand, most Arduino projects don't make it into Jupiter's radiation belt, so is this all academic? My home server has been up 287 days (as of today) without a latch-up. It's somewhat faster and more complex than an Uno and doesn't have a hardware watchdog...

Actually the ATmega328P watchdog is a timer: a piece of hardware, clocked by its own on-chip oscillator, that can reset the MCU if it hangs, independent of software. The basic idea is to enable the watchdog and define a time-out for it. Look at this example:

#include <avr/wdt.h>
void setup()
{
  Serial.begin(9600);
  wdt_enable(WDTO_1S);
}
void loop()
{
  // the program is alive...for now. 
  wdt_reset();
  Serial.println("Hello");
  while (1)
    ; // do nothing. the program will lockup here. 
  Serial.println("Can't get here");
}

Source: Detect Lockups | Articles | MegunoLink

In the example above, you can see that if you don't call "wdt_reset()" within 1 second, the MCU will restart. The idea is that if your MCU hangs, the WD timer will expire and reset it for you.

An external supervisor IC works similarly: you need to keep sending a "ping" from one of the MCU's I/O pins to the supervisor. If the "ping" stops coming, the supervisor asserts the reset line for a short period (milliseconds to a few hundred milliseconds, depending on the part).
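The "ping" could look something like the sketch below. This is only an illustration under assumptions: the pin number and interval are hypothetical, and it assumes the supervisor's strobe input is triggered by a falling edge (check the datasheet of whatever part you use). `strobeDue()` is a made-up helper.

```cpp
#include <stdint.h>

#ifdef ARDUINO
#include <Arduino.h>
#endif

const uint8_t  kStrobePin        = 7;   // hypothetical wiring to the supervisor's strobe input
const uint32_t kStrobeIntervalMs = 50;  // assumption: well inside the supervisor's time-out

// True once intervalMs has elapsed since lastMs.
// Unsigned subtraction keeps this correct across millis() rollover.
bool strobeDue(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
  return (uint32_t)(nowMs - lastMs) >= intervalMs;
}

#ifdef ARDUINO
uint32_t lastStrobeMs = 0;

void setup() {
  pinMode(kStrobePin, OUTPUT);
  digitalWrite(kStrobePin, HIGH);  // idle high
}

void loop() {
  // ... the real work of the application goes here ...

  if (strobeDue(millis(), lastStrobeMs, kStrobeIntervalMs)) {
    digitalWrite(kStrobePin, LOW);   // falling edge = "I'm alive"
    digitalWrite(kStrobePin, HIGH);
    lastStrobeMs = millis();
  }
}
#endif
```

Note that the strobe is sent from the main loop on purpose. If you generate it from a timer interrupt, the interrupt can keep pinging happily while the main loop is stuck, which defeats the point.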

Supervisors normally come with other features such as voltage monitors: if the supply voltage goes outside the threshold, the supervisor triggers the reset and only releases it after the voltage stabilizes. Anyway, AVRs also have a built-in brown-out detector (BOD), which can be configured via fuses to perform a very similar function.

Both the internal watchdog and an external watchdog need a bit of software to keep resetting their counters, and both will trigger if your software hangs (or a piece of code takes longer to run than expected). Just don't do "development" with the WDT on; only turn it on in the final stages of your project to avoid frustration :wink:

Cheers,

This is clearly a software based approach

Please explain?

sterretje, I mean that if it's an on-chip watchdog, it's probably tweaking an NMI or RST line, which might then indirect to an interrupt handler routine. It doesn't toggle the main CPU power and thereby force a cold restart. I know that a desktop PC isn't a microcontroller, but I've had occasion to pull the power lead from my desktop when no amount of CTRL C, CTRL H or CTRL ALT DELs would help.

There are comments on an associated thread suggesting that a watchdog should toggle the power to the whole chip. That seems more robust than tweaking the RST pin.

cossoft:
There are comments on an associated thread suggesting that a watchdog should toggle the power to the whole chip. That seems more robust than tweaking the RST pin.

Reference?

Excuse me. Arduino embedded suitability replies #1 & #6 specifically...

There is a plethora of readily available material from reputable vendors that covers CMOS latch-up at length. Most of it is 15 or more years old. If you believe chip designers have not solved the problems by now, then an external watchdog is a reasonable choice. Bear in mind that, based on that premise, an external watchdog is also subject to latch-up and poses the same risk. In addition, adding more parts adds more points of failure. It can reasonably be argued that adding an external watchdog when an internal one is available makes a system less robust.

Is an external hardware based watchdog circuit useful?

Maybe.

Something that might be centred around a Maxim DS1232 for example.

I'll go ahead and say that a DS1232-like circuit is NOT useful. The functions of the DS1232 are mostly included on-chip in the AVRs on Arduino: brownout detection, somewhat flexible reset timing, and a watchdog timer. A lot of the "system monitor" chips date from a time when processors didn't have that sort of capability on-chip (we used a brownout/reset controller on some 68000-based systems, for example. Perhaps incorrectly.)

Watchdog timers are primarily a fail-safe for software problems, although "desired behavior" can be difficult to define. (A system that has some bug that causes a watchdog reset every 24 hours is probably useful. A system that has a watchdog reset during initialization - not so much.)

The prior discussion (Arduino vs embedded system 2.0 - Microcontrollers - Arduino Forum) was about latch-up and problems requiring a hard power cycle to fix. Those are much less common, and much harder to correct for, and a simple chip like the 1232 won't do it.

The one circumstance where an external watchdog might make sense on an AVR is if you're designing to a Safety Specification that requires features not present on the AVR itself. More subtle brownout or power fail detection, window watchdog timers, reset signal shaping caused by an external signal not within spec, etc.
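For anyone who hasn't met a window watchdog: unlike the plain AVR watchdog, it also objects if you kick it too early, which catches code that is stuck in a tight loop that happens to include the kick. The following is just a host-side simulation to show the rule, not any real device's API; `WindowWatchdog` and its members are hypothetical names.

```cpp
#include <stdint.h>

// Simulated window watchdog (illustrative only).
// A kick is valid only if it arrives between windowOpenMs and timeoutMs
// after the previous valid kick.
struct WindowWatchdog {
  uint32_t windowOpenMs;    // kicks sooner than this are a fault
  uint32_t timeoutMs;       // kicks later than this are a fault
  uint32_t lastKickMs = 0;  // time of the last accepted kick
  bool tripped = false;     // latched once a fault is seen

  WindowWatchdog(uint32_t openMs, uint32_t closeMs)
      : windowOpenMs(openMs), timeoutMs(closeMs) {}

  // Called by the supervised program. Returns true if the kick was accepted.
  bool kick(uint32_t nowMs) {
    uint32_t elapsed = nowMs - lastKickMs;
    if (elapsed < windowOpenMs || elapsed > timeoutMs) tripped = true;
    if (!tripped) lastKickMs = nowMs;
    return !tripped;
  }

  // Called by the simulation clock. Returns true if a reset would be issued.
  bool expired(uint32_t nowMs) {
    if (nowMs - lastKickMs > timeoutMs) tripped = true;
    return tripped;
  }
};
```

With a 20 ms to 100 ms window, a kick 50 ms after the last one is accepted, a kick 5 ms after that trips the watchdog, and so does no kick at all for more than 100 ms.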

CMOS latch-up is a difficult problem to deal with in high-availability systems. As westfw has said, a reset (from an internal or external watchdog) will not clear the condition; it requires removing power from the device (e.g. discharging any supply capacitance below about 0.4 V). Since it may be difficult to identify which CMOS device has the condition, it is often best to remove power from one side of a redundant system. Proper power supply bypassing and shielding should make it a very rare condition, but it can still happen from ionizing radiation sources like cosmic rays. Perhaps a simple MCU like an ATmega328 may operate over its lifetime without experiencing a latch-up. My understanding is that the rate of occurrence of latch-up is related to the die area, the number of parasitic thyristor elements on the silicon die, and the radiation flux. This means larger, more complex MCUs are more likely to experience latch-up during their operating life.

I have been a little obsessed with how redundant stuff might be done with a full duplex RS485 bus I've been tinkering with. Anyway, this is what I arrived at; the example idea is for flow meters.

The redundant power box would be a pain; I can easily see spending numerous iterations trying to sort it out. Notice that each MCU board controls a digital line that can turn off power to its partner, and it can hold the power off until all the capacitors have had time to discharge.
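As a rough feel for how long "had time to discharge" might be, here is a back-of-envelope estimate. It assumes the supply capacitance bleeds down through a plain resistance, so the voltage follows the usual RC decay; a real board's load is not a pure resistor, so treat the result as a starting point and measure. The function name and all the example values are assumptions for illustration.

```cpp
#include <math.h>

// Time for a capacitor to decay from startVolts to targetVolts through a
// resistance, from V(t) = V0 * exp(-t / (R * C)):
//   t = R * C * ln(V0 / Vtarget)
double holdOffSeconds(double bleedOhms, double capFarads,
                      double startVolts, double targetVolts) {
  return bleedOhms * capFarads * log(startVolts / targetVolts);
}
```

For example, 100 µF bleeding through 1 kΩ from 5 V down to 0.4 V works out to roughly a quarter of a second, so holding the partner's power off for a second or so would leave some margin under those assumptions.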

That's an impressive amount of work. Are you irrigating the whole state of Texas? :slight_smile:

This reply has nothing to do with this thread, but I wonder as to your reasons for building such a system. If it's for self gratification and as a hobby, have at it with gusto. If it's just for automating the irrigation, did you consider going pro and using PLCs? This is what would be done in a factory, and all of the RS-485 comms, I/O, level shifting, safety and flexibility have already been designed and refined. After all, PLCs run nuclear reactors. And you can get a decent entry level PLC for $300. The four CPUs in your Trustworthy Environment must have cost a few $$$ anyway.

Probably best to ignore this post though...

That was just an itch to scratch. I worked with telecom equipment for a while and it exposed me to some ideas. I got to ponder what a stuck bit in ECC memory was and why it would not clear until after a power cycle. It really did not make sense to me at the time.

The RS485 stuff was also an itch, but the transceivers are not that expensive, though the one I used seems to have gotten a little more pricey for some reason. Sparkfun had an RS485 board that used little power while resting, it is gone now but that was what I based this transceiver setup on. It sips power while a radio gulps. Each multidrop serial line has one power gulping WiFi radio. So I can run a CAT5 cable between several nodes (flow meter, irrigation controller, or whatever) with a small solar panel (3W) and battery (7AHr). Only one node needs the large (20W) solar panel and battery (40AHr) with the shared WiFi device. It turns out that power is the most expensive part of the game so I would like to minimize that as much as possible.

I was peripherally involved with (wrote software, didn't design hardware) a lot of rather complex computer networking gear (which should have high reliability), and I don't ever recall "latch-up" being a serious enough problem that the HW engineers worried about it or customers demanded "explicit self-power-cycle ability." By the time we were building systems with redundant power supplies and so on, that was more to meet "standards requirements" than actual need, and more to allow hot-swap of broken or upgraded cards than to address transitory HW issues like latch-up.

(Allowing the main CPU to manipulate the RESET signal on interface cards/chips, in case they stopped responding, was mandatory, though!)