Hot Spare/Heartbeat Synched Backup

I have an idea I'd like to try, but it's doubtful I'm the first to think of it. That is why I'd like to see if anyone has tried to:

Set up two Arduinos (of any model) or bare ATMega chips side-by-side, sharing all inputs between both, except for one. Put one Arduino into a loop or low-power mode, and use an interrupt to wake it if the other one fails or gets locked up, at which point it would take over the duties of the first Arduino. Whether a "heartbeat" or an interrupt would be better, I do not know. But I'm curious if anyone has tried this or could offer any insight as to the best approach (if it's even possible!).

Looking forward to comments, Thanks!

Peter

If you don't want the project to fail, use a good quality official Arduino board, and let others check your sketch.

With two Arduino boards in parallel, the project gets more complex and thus has more chance to fail. A shared power supply could fail, and a sketch with extra code has more chance to fail. If an EMI pulse damages one board, it will likely damage the other board too.

Suppose you had one million Arduino boards in parallel. That arrangement would have a high failure rate, since a single faulty board could block a signal on a shared pin. If one million boards are less safe than one, so are two.

There are better ways to keep something going, for example using the watchdog timer inside the ATmega chip, or two complete and separate systems. But which is best depends on the project.

There are watchdog monitor chips (Microchip) that will force their output low if the watchdog reset period is exceeded.
IMO attempting to make a redundant Arduino is an exercise in futility, as the standby chip can't be in sync with whatever the main chip was doing at the point of failure. Nor can it be truly in parallel unless reset is asserted on the standby chip, which holds the I/O on the 'backup' microcontroller in a high-impedance condition.
The best method I can think of is just to write code that doesn't hang when/if faulty input causes undetermined states to occur.
Write the code once, but write it carefully, and those issues go away. Usually...

Doc

I think all of the versions of the Ardupilot boards implement a watchdog with two AVRs talking together. The schematics and firmware are all available online if you want to poke at them.

Here's an option:


http://www.crossroadsfencing.com/BobuinoRev17/

petermetzger:
I have an idea I'd like to try, but it's doubtful I'm the first to think of it.

Not even close. The phrase to search for is "redundant controller" or "redundant system".

Set up two Arduinos (of any model) or bare ATMega chips side-by-side, sharing all inputs between both, except for one.

How do you decide which is master at startup?

What happens if the "dead" processor has one of the pins stuck as an output?

How does the backup determine that the primary has died?

Do you imagine a system that supports a single fail-over or an arbitrary number of fail-overs?

How do you transfer state from the master to the backup?

...or gets locked up...

A "lock up" is nearly always the result of a software bug. Presumably both processors will be running nearly the same program, which means the backup is also going to lock up. It's a paradox of redundant systems: how do you avoid a dual software fault?

Whether a "heartbeat" or an interrupt would be better...

Heartbeat.

The more things that are included in a project, the more things can fail. In general, adding more (in this case a second processor with supporting circuitry and code) actually makes a system less reliable.
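For the heartbeat approach mentioned above, here is a minimal sketch of the timeout logic the backup might run. It assumes the primary toggles a dedicated heartbeat line and that the backup passes in its own `millis()` reading; the class name `HeartbeatMonitor` and the timeout value are hypothetical, not taken from any existing library. It does not address the other questions raised here (stuck output pins, state transfer, a shared software bug).

```cpp
#include <stdint.h>

// Hypothetical helper for the backup unit: remembers when the primary's
// heartbeat was last seen and reports when it has been silent too long.
class HeartbeatMonitor {
public:
    HeartbeatMonitor(uint32_t timeoutMs, uint32_t nowMs)
        : timeoutMs_(timeoutMs), lastBeatMs_(nowMs) {}

    // Call whenever an edge is seen on the heartbeat input.
    void beat(uint32_t nowMs) { lastBeatMs_ = nowMs; }

    // True once more than timeoutMs have passed since the last beat.
    // Unsigned subtraction keeps this correct across a millis() rollover.
    bool primaryLooksDead(uint32_t nowMs) const {
        return static_cast<uint32_t>(nowMs - lastBeatMs_) > timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t lastBeatMs_;
};
```

On the backup, `beat(millis())` would be called from the pin-change handler or a polling loop, and `primaryLooksDead(millis())` checked before taking over the outputs.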

I think ATmega chips operating in parallel would be less failsafe. I have checked the links, but so far I can't see any ATmega chips operating in parallel. Perhaps it could be a school project to learn from, and it would be nice to have a number for how much less failsafe it is. I'm guessing a 10% to 50% greater chance of failure.

The Ardupilot uses two different ATmega chips, each with its own specific task.
Look for the eagle files here: http://3drobotics.com/learn/

The crossroads dual ATmega board has two ATmega chips to get a large number of pins. They do not work in parallel.

Thanks to everyone for their opinions. I guess this only reinforces what I already know, but all knowledge is useful knowledge.

I think I heard/read somewhere that Airbus aircraft have triple parallel flight computers with different processors and different code doing the same thing. Presumably everything is OK if they all agree and I guess they go by the two that are the same (and land quickly) if one disagrees.

I suspect (and hope) they have more rigorous software development and testing regimes than I use with my Arduino.

If you have (say) a boat with twin engines, there is a greater probability of an engine failure than if you only had one of the same type of engine, but the probability of both engines failing will be lower than the probability of the single engine failing. Having two different engines would reduce the risk of being stranded due to a "type" fault. And a problem with a common fuel system (like running out of fuel) could cause both engines to fail. For example, the Boeing 777 that crashed at Heathrow lost thrust on both engines due to ice in the fuel system.
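To put rough numbers on the twin-engine argument, here is a small sketch. It assumes each unit fails independently with the same probability `p`, plus an optional common-cause probability `pc` (the shared fuel system); the function names and the example figures are made up for illustration.

```cpp
// Probability that at least one of two independent units fails.
double anyOfTwoFails(double p) { return 1.0 - (1.0 - p) * (1.0 - p); }

// Probability that both independent units fail.
double bothFail(double p) { return p * p; }

// Probability that both are lost when a common cause (probability pc)
// takes out both units at once, regardless of their individual health.
double bothFailWithCommonCause(double p, double pc) {
    return pc + (1.0 - pc) * p * p;
}
```

With `p = 0.01`, some failure becomes about twice as likely (roughly 0.02), while losing both drops to 0.0001. Add a common cause of `pc = 0.001` and the chance of losing both rises to about 0.0011, so the shared part dominates.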

The problem with "improving" the reliability of an Arduino based project is to identify and quantify what are the major risks. Without doing that it would be prudent to assume that duplication reduces reliability by introducing more connections and more complex code. And you would probably need triplication so that you have some basis for identifying which device is at fault. And some way to physically isolate the faulty device.
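As a rough illustration of why triplication helps identify the faulty device, here is a two-out-of-three voter sketch. It assumes the three units each report a comparable integer value; `VoteResult` and `voteTwoOfThree` are hypothetical names, and physically isolating the dissenting unit is left out.

```cpp
#include <stdint.h>

struct VoteResult {
    bool valid;      // true if at least two inputs agreed
    int32_t value;   // the agreed value (meaningful only if valid)
    int dissenter;   // index 0..2 of the odd one out, or -1 if none
};

// Majority vote over three readings; also reports which unit disagreed.
VoteResult voteTwoOfThree(int32_t a, int32_t b, int32_t c) {
    if (a == b && b == c) return {true, a, -1};
    if (a == b) return {true, a, 2};
    if (a == c) return {true, a, 1};
    if (b == c) return {true, b, 0};
    return {false, 0, -1};  // no two agree: no basis for a decision
}
```

With only two units, a disagreement tells you something is wrong but not which one; the third reading is what makes the `dissenter` field possible.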

I suspect it would be more productive to focus on arrangements that reduce the impact of an Arduino failure, and accept the very occasional failure.

...R

Robin,

You make many valid points. It makes me think of "Minority Report" (having a 3rd perspective as a tie breaker).

I'll grant everyone that it may seem like a very major undertaking with very little return, given all that can be done to prevent a hardware failure in the first place. And, I'm not planning on using it any time soon for that very reason. I only think it's something that shouldn't be completely dismissed if in fact it is possible. There is a reason people RAID their hard drives or have redundant power supplies in medical monitoring equipment. Failures can and do happen. Once you've dealt with all that you can to prevent a failure, it's only natural to look at what you can do next.

Again, I understand it might complicate what would otherwise be a simple system. I certainly don't want to spark an all-out debate about the virtues of redundancy. I was just curious about what others' experiences were when it came to this topic. Again, thanks for the advice and insight!

RAID protects against a relatively common event: mechanical failure in a hard drive. Unfortunately, if you get a virus/Trojan such as CryptoLocker, both or all mirrors are corrupted. There's a lesson there!

If you have multiple power supplies and multiple computing modules (blades), you have to have a "crossover bridge" which can cope with one module shorting out as well as one supply failing. It isn't as simple as just some diodes. The critical process is called "arbitration" and requires a whole extra system in itself.

Alright. Like I said, I didn't want to argue the merits of redundancy. I'm well aware of the limitations of RAID (which was only an example) and stacking power supplies (which are practically an industry standard in enterprise grade switches). If you look at it from a slightly broader perspective, perhaps, you see that every system in some way has a level of limitation and adds a layer of complexity.

It's easy to start sounding preachy or antagonistic, Paul__B, but I'm going to chalk that up as a pitfall of asynchronous communication. I've got plenty of experience with redundant network and server equipment, which is what piqued my curiosity in the first place. We could go around and around on this if nobody wants to talk about the actual topic of the thread, but I think all our time would be better spent elsewhere.

petermetzger:
I'm curious if anyone has tried this or could offer any insight as to the best approach (if it's even possible!)

It doesn't sound like many people have had cause to investigate backup/failover/redundant systems with their Arduino projects, but if I learn anything I'll make sure I pass it along.

Thanks again to all!