Reliable and programmable Watchdog

Hi, I'm trying to build a reliable and programmable watchdog (think of it as an industrial watchdog).

I came to the conclusion that the best way to do it is to attach a second ATmega (which can also serve as a co-processor for other tasks) that waits for a signal on a given pin every N seconds. If it doesn't receive the signal in time, it executes the desired operation: reset, print a message, raise an alarm, etc.
The main processor will do the same in the other direction, so I can also detect when it is the co-processor that is down.
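A minimal sketch of the timeout check each processor could run on the other's heartbeat. All names here (heartbeat_monitor, hb_init, hb_feed, hb_check) and the millisecond timestamps are hypothetical; on a real ATmega the timestamp would come from a timer tick and hb_feed would be called when the heartbeat pin toggles. The fault is latched, on the assumption that a missed heartbeat should not clear itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical heartbeat monitor: one instance per watched peer. */
typedef struct {
    uint32_t timeout_ms;   /* longest allowed gap between heartbeats */
    uint32_t last_beat_ms; /* timestamp of the last heartbeat seen */
    bool     tripped;      /* latched once a timeout has been detected */
} heartbeat_monitor;

void hb_init(heartbeat_monitor *m, uint32_t now_ms, uint32_t timeout_ms)
{
    m->timeout_ms = timeout_ms;
    m->last_beat_ms = now_ms;
    m->tripped = false;
}

/* Call whenever the peer's heartbeat signal is seen on the pin. */
void hb_feed(heartbeat_monitor *m, uint32_t now_ms)
{
    m->last_beat_ms = now_ms;
}

/* Call periodically. Returns true once the peer is late, and stays
 * true afterwards so a late heartbeat cannot hide the fault. */
bool hb_check(heartbeat_monitor *m, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct when the tick counter wraps. */
    if ((uint32_t)(now_ms - m->last_beat_ms) > m->timeout_ms) {
        m->tripped = true;
    }
    return m->tripped;
}
```

Running the same check on both processors, each watching the other's pin, gives the mutual monitoring described above.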

I was also thinking that, depending on which part of the loop the program is in, I could write a value to a variable (e.g. a byte), so that if the processor or program freezes at that point, I can also print a matching error message.
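The checkpoint idea could look roughly like this. The stage names, codes and messages are made up for illustration, and how the watchdog side gets hold of the byte (shared line, serial link, memory that survives a reset) is left open.

```c
#include <stdint.h>

/* Hypothetical stage codes for the main loop. */
typedef enum {
    STAGE_IDLE        = 0,
    STAGE_READ_INPUTS = 1,
    STAGE_MOTION      = 2,
    STAGE_COMMS       = 3
} stage_t;

/* Last stage the main loop reported entering. */
static volatile uint8_t current_stage = STAGE_IDLE;

/* Main loop calls this at the start of each section. */
void stage_enter(stage_t s)
{
    current_stage = (uint8_t)s;
}

/* Watchdog side reads the last reported stage. */
uint8_t stage_last(void)
{
    return current_stage;
}

/* Translate a stage code into the message to print after a stall. */
const char *stage_message(uint8_t code)
{
    switch (code) {
    case STAGE_IDLE:        return "stalled while idle";
    case STAGE_READ_INPUTS: return "stalled while reading inputs";
    case STAGE_MOTION:      return "stalled in motion control";
    case STAGE_COMMS:       return "stalled in communications";
    default:                return "stalled at unknown stage";
    }
}
```

The main loop would call stage_enter(STAGE_MOTION) before the motion code, and the watchdog would print stage_message(stage_last()) when the heartbeat stops.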

Is there any better way to implement this? What I'm looking for is reliability.

A watchdog reset is a band-aid for poorly written code (not all of which will be your fault). Fixing the code is a much better solution.

The second processor more than doubles the chance of a hardware failure and something like triples the odds of a software problem.

Think about using a handshake from the PC. (Handshake - a regular message from one processor to another that must be responded to).


If what you are trying to detect is a temporary malfunction, use the built-in watchdog. If you are trying to detect a permanent malfunction, then you need two MCUs that monitor each other, just as you describe.

What are you going to do when a malfunction is detected? You need to be careful not to re-introduce a single point of failure.

Thanks all for the answers.

We use ATmegas in PLC and CNC machines, so everything is for industrial use. Programs are normally over 3000 lines and are normally OK, but problems can also happen on the hardware side.

Either way, I need to double-check whether an error occurred and, if it did, evaluate it and act accordingly.
To give you an example: remember how, on old Windows 98 machines (I think on newer ones too), if you suddenly pulled the power connector and restarted the computer, the OS knew you hadn't powered off correctly? This works because the OS stores a flag in non-volatile storage when it boots, and sets it back to "0" when you shut down properly.
The same happens with machines: if something strange happens, since it is a partially blind system, you don't know where the motors or other parts are positioned, or whether an operator's hand is in the path of an axis.
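The unclean-shutdown flag from the Windows 98 example could be sketched like this. The names and the 0xA5 marker value are assumptions, and nvm_cell is only a stand-in so the logic can be run anywhere; on an ATmega, nvm_read and nvm_write would presumably wrap eeprom_read_byte() and eeprom_update_byte() from <avr/eeprom.h>.

```c
#include <stdbool.h>
#include <stdint.h>

#define RUN_CLEAN 0x00u /* written on an orderly shutdown */
#define RUN_DIRTY 0xA5u /* written at boot, means "still running" */

/* Stand-in for one byte of non-volatile storage (assumption). */
static uint8_t nvm_cell = RUN_CLEAN;

static uint8_t nvm_read(void)     { return nvm_cell; }
static void nvm_write(uint8_t v)  { nvm_cell = v; }

/* Call once at boot. Returns true if the previous run never reached
 * its orderly shutdown. Only the exact marker counts as dirty, so a
 * freshly erased EEPROM byte (0xFF) is not reported as a crash. */
bool boot_check_and_mark(void)
{
    bool was_dirty = (nvm_read() == RUN_DIRTY);
    nvm_write(RUN_DIRTY);
    return was_dirty;
}

/* Call as the last step of an orderly shutdown. */
void shutdown_mark_clean(void)
{
    nvm_write(RUN_CLEAN);
}
```

If boot_check_and_mark() returns true, the machine knows its last known positions cannot be trusted and can refuse to move until it has been re-homed or checked by an operator.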

I don't want a watchdog reset; I can't do that. I want a watchdog that can evaluate the reported error and execute a program based on it, or at least a watchdog that halts the system completely and prints an error message.
We also don't use any PC; it is a standalone system.
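One way to picture "evaluate and act based on the reported error" is a small policy function on the watchdog side. The fault codes and actions below are invented for illustration; the only design choice made here is that anything unrecognised falls through to a full halt.

```c
#include <stdint.h>

/* Hypothetical actions the watchdog processor can take. */
typedef enum {
    ACT_LOG_ONLY, /* record and carry on */
    ACT_ALARM,    /* raise an alarm, keep the machine running */
    ACT_HALT      /* stop all motion and wait for an operator */
} action_t;

/* Hypothetical fault codes reported to the watchdog. */
typedef enum {
    FAULT_NONE           = 0,
    FAULT_HEARTBEAT_LOST = 1,
    FAULT_SENSOR_RANGE   = 2,
    FAULT_COMMS          = 3
} fault_t;

/* Map a reported fault code to an action. Unknown codes halt, so a
 * corrupted or unexpected report can never leave the machine moving. */
action_t fault_action(uint8_t fault)
{
    switch (fault) {
    case FAULT_NONE:
        return ACT_LOG_ONLY;
    case FAULT_COMMS:
        return ACT_ALARM;
    case FAULT_HEARTBEAT_LOST:
    case FAULT_SENSOR_RANGE:
    default:
        return ACT_HALT;
    }
}
```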

So I wanted to share my thoughts and collect any better ideas, because ideas distributed across several people work much, much better.

Thanks to all!