Possible Ways to Run Health Checks for Arduino

Hi guys. I'm currently working on a redundancy switching mechanism for the avionics part of our near space balloon project. The basic idea is that, because the entire system will be sent to near space (where the temperature drops below -60 Celsius, the ambient pressure is about 1% of sea level, and there is high exposure to cosmic rays), we need to make sure that our system has maximum survivability if things go wrong under such harsh conditions. Like the redundant avionics in airplanes, we want to have three identically programmed Arduino Nanos on board. In order to preserve the battery, the control of the entire system is handled by one Arduino at a time, i.e. Arduino A is in control while Arduinos B & C are powered off. If A fails (failure here means the complete or temporary death of an Arduino; a miscalculation or bit flip due to cosmic rays and such is tolerable), Arduino B will be powered on, control will be switched to B in software, and then Arduino A will be powered off. Then switch to C if B fails, etc.

So I've run into a problem: I have no idea how to comprehensively check the health of an Arduino board (and in turn the ATmega328 chip). Should it be done in software on each Arduino itself or by a separate circuit? Is there a chip or some kind of circuit that handles health checks for another microcontroller and that I can interface with the Arduinos? Really, all we need is to be able to tell if one Arduino is dead so that we can switch to another redundant one.

One thing to note is that the redundancy switching system should be as simple as possible, or at least no more complicated than each Arduino; otherwise the redundancy system will be more vulnerable to failure than the Arduino and the whole point of having it is lost. So I think the redundancy system shouldn't involve any microcontrollers (maybe simpler, slower controllers could be considered?).

P.S.
I've heard some ideas about how some people handle their redundancy switching mechanisms, such as a voting system based on logic gates. But I'm not sure how that works. What signal will be sent to the gates? Pulling voltages? Does just pulling and sending a voltage from a microcontroller necessarily tell you its health condition?

Designing failsafe systems, using redundancy or not, can be quite a challenge. You are right that in some cases all the added components can result in higher overall exposure to failures.

That aside, the usual mechanism to detect a fault condition in a running processor is a watchdog timer (WDT). This is like a dead man's switch: if a pulse sent to the WDT (as part of the normally running program) isn't detected within a given time period, the WDT forces a reset of the processor. The AVR chips do have such an internal WDT and may or may not be useful for what you are looking for. I kind of like external hardware WDT, where all the processors have to continuously send a sanity pulse out and if external logic circuitry detects a processor not alive, just switches to the redundant processor. There are a zillion details to protect a complete system so you really just need to start whiteboarding ideas.
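To make the "sanity pulse" idea concrete, here is a minimal sketch of the bookkeeping an external monitor would do. It is plain C++ rather than a real watchdog chip or library; `HeartbeatMonitor`, its fields, and the timeout value are made-up names and assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an external watchdog watching one processor.
struct HeartbeatMonitor {
    uint32_t timeoutMs;    // assumed maximum gap allowed between pulses
    uint32_t lastPulseMs;  // time of the most recent sanity pulse

    // Call whenever a sanity pulse arrives from the supervised processor.
    void pulse(uint32_t nowMs) { lastPulseMs = nowMs; }

    // True while the last pulse is recent enough. The unsigned subtraction
    // stays correct when the millisecond counter wraps around.
    bool alive(uint32_t nowMs) const {
        assert(timeoutMs > 0);  // a zero timeout would be a configuration error
        return static_cast<uint32_t>(nowMs - lastPulseMs) <= timeoutMs;
    }
};
```

The switching logic would poll `alive()` and move to the redundant processor the first time it returns false. In real hardware the same job could be done by a retriggerable timer, which is an assumption about the circuit, not something decided in this thread.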

You're looking for triple modular redundancy (see "Triple modular redundancy - Wikipedia"). It actually takes more than three units.

Standard microcontroller watchdog integrated circuits will read a pulse from the microcontroller and this resets the watchdog counter to zero. You can do that with a dedicated circuit or if you need some repeatability and precision I'd do it with a microcontroller. See the rant below.

Since the Arduino was probably not built with the highest rated parts for military and aerospace applications, and not made to class three soldering specifications, you are limited to testing units to see if they have some chance of surviving.

One of the simplest things you can do is to use one of the Arduino variants that have no socketed parts and then conformal coat the board, excluding connectors or areas that need to be soldered later.

The other is to run them across the temperature range they will be exposed to and see if they fail and if possible make corrections to the board or supply environmental controls to keep the temperature where they can survive. Thermal cycling usually kills solder connections but can kill components not rated for it.

The voting system can actually be a rigorously tested Arduino. It can also act as a watchdog for the other systems. You'd want an external watchdog on it to reboot it if needed. I'd be tempted to use an ATTINY as a watchdog rather than the available dedicated watchdog integrated circuits. That is due to the following experience.

.rant. I've worked with the Dallas external watchdog integrated circuit. DS1232. I can't spit on it hard enough to kill the engineer that built this piece of *t. It has no consistency across temperature, it has no consistency between parts and if you get one batch to work the next batch you buy might not. All of the clones made by other vendors of this chip have the exact same problems. That was their DIP8 part, their surface mount part is just as bad when I tried it. Instead of redesigning the system or touching the firmware which would cost us 10s of thousands of dollars we hand select parts that work. ./rant.

nor suffered from catastrophic failure from temperature

I believe that the typical strategy for surviving the low temperature is to provide enough insulation that the electronics won't get too cold. All the ones I've read about didn't need very much insulation.

My suspicion is the battery is much more vulnerable than the AVR processor.

Why three identically-programmed devices?
Surely you'd be better with three systems programmed by separate individuals/teams, but to the same specification.

The number one killer of balloon avionics is temperature. You won't have that problem with Arduino unless you're driving heavy currents with many pins. (Disclaimer: I have never flown an Arduino but have flown many BasicStamps and other stuff.)

Use a styrofoam box, keeps your batteries from freezing. All your low power electronics go inside, including Arduino, they will be fine.

Anything with a heat sink is trouble. The air is very cold but very low density, so a heat sink hanging outside may overheat at 55 below because convection does not work well. Even worse, the sun is brighter and a black heatsink gets quite hot. For your big transmitters, regulators, etc. use a big heatsink outside and keep it in the shade no matter what.

Whatever is inside will balance out depending on insulation versus dissipation. It's so hard to model and the electronics are so forgiving that it's mostly done empirically (suck it and see). Log your inside and outside temperature from telemetry to analyze after flight. If the thermal balance is so horribly bad that the insides start dying then switching to another (Arduino, battery, etc.) is not too helpful.

Groove makes a good point, google "heterogeneous redundancy". To this point we usually include a hardwired (no software!) extremely simple "recovery beacon" with independent battery sized to last a few days on ground "in case all else fails". (and a note with your email/phone )

But if that's not what you've got your heart set on, in professional avionics we use a combination of methods similar to retrolefty's post. You REALLY DO gotta be careful about voting circuits, exception handlers, etc. that can make things worse, not better.

At minimum we have WDT (watchdog) in hardware and in "timeline", see ARINC 653 - Wikipedia. The idea is that if any "partition" takes too long or uses too many resources, it gets yanked before it harms the rest of the system. ARINC 653 won't fit on an ATmega, and it will be challenging to implement even a few of its concepts on Arduino.

We also use watchdogs between processors, aka "heartbeat". Each processor exchanges heartbeat messages with the others and stands ready to take over for one that stops responding or babbles incoherently. Often we include a "heterogeneous redundancy" processor that is much simpler and only minimally capable, to take over as a last resort in case of loss of all primary processors. When it comes down to that, it's irrevocable; no matter how sane a primary processor may appear to be later on, it can no longer regain control.
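The "irrevocable" part is easy to get wrong, so here is a rough sketch of a latching selector applied to the A, then B, then C scheme from the original question. `FailoverSelector` and its members are hypothetical names, and how a heartbeat loss gets detected is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical failover selector for three units (0 = A, 1 = B, 2 = C).
class FailoverSelector {
public:
    static constexpr int kUnits = 3;
    static constexpr int kNone = -1;

    // Index of the unit currently in control, or kNone if all have failed.
    int active() const { return active_; }

    // Report a heartbeat loss. A failed unit is latched out for good; if it
    // was in control, control moves forward to the next unit not yet failed.
    void reportFailure(int unit) {
        assert(unit >= 0 && unit < kUnits);
        failed_[unit] = true;
        if (unit != active_) return;
        active_ = kNone;
        for (int i = unit + 1; i < kUnits; ++i) {
            if (!failed_[i]) { active_ = i; break; }
        }
    }

private:
    bool failed_[kUnits] = {false, false, false};
    int active_ = 0;
};
```

Because the search only ever moves forward, a unit that recovers and starts sending heartbeats again can never take control back, which is the behavior described above.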

If you're just lookin' to fly balloons, stick to the basics. But if you are itching to really get into the redundancy management biz, set aside plenty of long nights for an arduous voyage of discovery, and good luck!

Coding Badly nailed it

My suspicion is the battery is much more vulnerable than the AVR processor.

But assuming you wanna press on with this redundancy management challenge here are a few ideas:

(1) My first Arduino project needs good analog voltages for +5V and the reference (USB power may not be good enough), so I measure 3.3V with an analogRead(). The onboard regulator for the Uno is 1%, so failing this test is a pretty good indication that something is wrong with at least one supply. This test could be included in void loop().

(2) "Wrap Back" of outputs to inputs. A cardinal rule of some systems is that every output needs to be self-monitored and confirmed by an input. BEWARE that you must distinguish between an internal fault and an external fault. If using conventional programming languages it's almost a self-eating watermelon during a mission, only good for detecting wiring faults during Startup Built In Test (SBIT).
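For test (1), the arithmetic might look like the sketch below. It assumes a 10-bit ADC whose reference is the 5 V supply (so the raw value from analogRead() runs 0 to 1023) and a 5% acceptance window, which is deliberately looser than the regulator tolerance. The function names and constants are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t kRefMv = 5000;       // assumed ADC reference: the 5 V rail
const uint32_t kNominalMv = 3300;   // expected 3.3 V rail
const uint32_t kToleranceMv = 165;  // assumed 5% acceptance window

// Convert a raw 10-bit ADC count to millivolts.
uint32_t countsToMillivolts(uint16_t counts) {
    assert(counts < 1024);
    return (static_cast<uint32_t>(counts) * kRefMv) / 1024;
}

// True if the measured 3.3 V rail is within the window. A failure means
// either the 3.3 V rail or the 5 V reference is off, not which one.
bool railHealthy(uint16_t counts) {
    const uint32_t mv = countsToMillivolts(counts);
    const uint32_t diff = (mv > kNominalMv) ? (mv - kNominalMv) : (kNominalMv - mv);
    return diff <= kToleranceMv;
}
```

In a sketch this would be called as `railHealthy(analogRead(pin))` with the 3.3 V pin wired to an analog input, which is an assumption about the wiring.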

That's pretty much what I've applied to my current high integrity project, but here are some more:

(3) A timeline counter. This is ideal for void loop(): take a time hack "hack = millis();" at the start of void loop(), and at critical points check if it has taken much longer than expected to get here. BEWARE: use of interrupts can cause false fails (or true ones, it can be hard to tell).

(4) Checksums and the like. Some parts of memory space should not change during error-free operation; take a checksum and declare yourself unhealthy if it fails. This might be difficult; I don't see where Arduino supports peek() or poke(), nor where one can gosub() to assembly language. See especially the god member's comments in
http://www.arduino.cc/cgi-bin/yabb2/YaBB.pl?num=1287169320

(5) Bounds checking. Some I/O and calculation results exceed a certain range only in error conditions. Declare yourself unhealthy. BEWARE this is especially vulnerable in unanticipated situations to "common mode failure" where each processor repeatedly "falls on the same sword".
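Checks (3) to (5) each boil down to a few lines. The sketch below is portable C++ with invented function names; budgets and limits are whatever the mission dictates. Since Arduino sketches are C++, ordinary pointers can walk RAM, and on a real AVR constant data in flash would normally be read with pgm_read_byte() from avr-libc's <avr/pgmspace.h> rather than through a plain pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// (3) Timeline check: has this pass taken longer than its budget?
// startMs is the "time hack" taken at the top of the loop.
bool overBudget(uint32_t startMs, uint32_t nowMs, uint32_t budgetMs) {
    return static_cast<uint32_t>(nowMs - startMs) > budgetMs;
}

// (4) Simple additive checksum over a block that should never change.
// Compare the result against a value recorded at startup.
uint16_t checksum16(const uint8_t* data, size_t len) {
    uint16_t sum = 0;
    for (size_t i = 0; i < len; ++i) {
        sum = static_cast<uint16_t>(sum + data[i]);
    }
    return sum;
}

// (5) Bounds check: is a reading or result inside its plausible range?
bool inRange(int32_t value, int32_t lo, int32_t hi) {
    assert(lo <= hi);
    return value >= lo && value <= hi;
}
```

An additive checksum is only the simplest choice; a CRC would catch more error patterns at the cost of more code.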

Some general comments: how does one processor "take control"? With physical IO to servos, discretes and such, the unhealthy one would set its outputs to inputs (high impedance) so the next processor could be free to set the state to HIGH or LOW at will. But if you are too unhealthy to be in control, how can we trust you to relinquish all your I/O?

If only 1 of three processors is powered, you need some independent hardware to select the powered one. That extra hardware is a "voter" which also could fail. For high integrity voters google the nuclear power industry. It's not trivial.

Most of the ideas above are at least somewhat vulnerable to "common mode failure" where an unanticipated situation leads to an endless round-robin of faults.

Not saying this is impossible, but it's probably much bigger than you think, and probably not worthwhile for the balloon alone, but perhaps as a voyage of discovery.

I have found the most reliable fail-safe is gravity. Never once has a payload of mine become stuck in the sky ;D

Some general comments: how does one processor "take control"? With physical IO to servos, discretes and such, the unhealthy one would set its outputs to inputs (high impedance) so the next processor could be free to set the state to HIGH or LOW at will. But if you are too unhealthy to be in control, how can we trust you to relinquish all your I/O?

Hum, you just teased out an idea for an ultra simple processor redundancy switchover concept using two (or three?) AVR mega 328 chips. Envision piggybacking two (or three) 328 chips with all their corresponding legs soldered together, except the reset pins (might have to think about the crystal clock pins, or use internal clocks only). When an AVR is held in reset, all its pins are in tristate mode. This would seem to me to be a pretty simple way for a small amount of external logic to keep just one AVR active but have the ability to quickly fail over to another based on a failed external watchdog timer, by manipulating the individual reset pins on the processors.
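The selection rule the external logic has to implement is tiny. Sketched as code (the function name and bit layout are made up for illustration), it only has to guarantee that at most one reset line is released at a time:

```cpp
#include <cassert>
#include <cstdint>

// Bit i of the result is the level to drive on chip i's reset pin.
// AVR reset is active low: 0 holds the chip in reset (pins tristated),
// 1 lets it run. Pass activeChip = -1 to hold every chip in reset.
uint8_t resetLevels(int activeChip, int chipCount) {
    assert(chipCount > 0 && chipCount <= 8);
    assert(activeChip >= -1 && activeChip < chipCount);
    return (activeChip < 0) ? 0 : static_cast<uint8_t>(1u << activeChip);
}
```

In hardware this would more likely be a small counter plus decoder than anything running code, which keeps the switchover logic simpler than the processors it supervises.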

However, for a complete system like a balloon flight package I suspect that the AVR is probably the least likely to fail of all the electronic subsystems, the battery package included.

Just white-boarding ideas.... :sunglasses:

Lefty

Sorry... a bit of a digression from the redundancy/ health check focus, but I couldn't resist. Is there a forum for high altitude telemetry where such things as the following have already been discussed?

After years of seeing discussions about fighting heat buildup in electronics, I was amused to read of people who are fighting a situation where there isn't enough heat! (I presume they are further frustrated by having heat buildup problems while the balloon is still near the ground?)

It's so hard to model and the electronics are so forgiving that it's mostly done empirically (suck it and see). Log your inside and outside temperature from telemetry to analyze after flight. If the thermal balance is so horribly bad that the insides start dying then switching to another (Arduino, battery, etc.) is not too helpful...

and

Anything with a heat sink is trouble....

I presume minimizing the payload weight is an issue, but would the following "fly"...

Give the batteries, etc, a well insulated (Styrofoam?) home. Let the "waste" heat which is usually such a problem help keep the batteries warm. But provide a way to vent the heat/ "let in" the cold when (if!) you get a "too much heat" problem?

In the later stages, would evaporative cooling be an answer, given the fact that heatsinks have trouble dissipating heat into a near vacuum? I suppose the weight price of lifting the liquid to be evaporated rules that out?

In the later stages, would evaporative cooling be an answer, given the fact that heatsinks have trouble dissipating heat into a near vacuum? I suppose the weight price of lifting the liquid to be evaporated rules that out?

The standard adiabatic lapse rate is 4 degrees F per thousand feet. Doesn't take much elevation gain to get below freezing even on a warm day.

Once the liquid freezes, it's not much good for evaporative cooling. Any anti-freeze substances interfere with the ability of the liquid to evaporate, so that wouldn't work.

Above all, thank you all so much for replying with such helpful and informative posts. I'm just an undergraduate student in ECE and my knowledge and experience are very limited compared to any of you, but with your help there is nothing that can't be solved!

As for the redundancy system itself, it is specified by the scope of this project. I'm already aware that not many people have integrated a redundancy system into their successful near space balloons, which may mean a redundancy system is not really necessary in an amateur balloon system (people have even launched and recovered iPhones or Android phones successfully lol). However, since the redundancy is part of our project scope, I just treat it as one of the challenges we need to solve.

So far, I have two basic design concepts:

  1. Partially mimicking the redundancy system of an aircraft autopilot, which runs 3 controllers simultaneously. All the outputs from the 3 controllers are compared and the most different output will be ignored. For example, when the ambient temperature is -60 outside the balloon, Arduino A reads -60, B reads -58, C reads -10, the result from C will be ignored. But there are two problems. One is that I'm not sure if this can be done without microcontrollers (it might be at least as complicated as the Arduino itself). The other is, how can I make sure that the program flow on the 3 Arduinos are the same? For example, what if Arduino A wants to save temperature data to the SD card, while B wants to save GPS coordinates?

  2. All three Arduinos are connected to the avionics identically, but we run one Arduino at a time, and that Arduino is monitored externally by a dedicated watchdog circuit. When Arduino A can't respond to the WDT's pull fast enough, WDT will reset this Arduino. If reset still can't make it go faster, this Arduino will be turned off by turning off the MOSFET connected to Vin pin of it, and then turn on Arduino B, etc. There are two problems in this scenario. First, even if an Arduino can respond to WDT fast enough, does this necessarily mean that this Arduino is all good? Second, how to do the "turn off & turn on" switching using WDT circuit?
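To pin down what "the most different output will be ignored" means in concept 1, here is a rough sketch: find the two readings that agree most closely, average them, and drop the third. The names are invented, the readings are assumed to be small integers (so the sum cannot overflow), and this says nothing yet about how the three readings reach one place.

```cpp
#include <cassert>
#include <cstdint>

// Absolute difference of two readings.
int32_t absDiff(int32_t x, int32_t y) {
    return (x > y) ? (x - y) : (y - x);
}

// Average the two closest readings and ignore the outlier.
// Integer division truncates toward zero.
int32_t voteThree(int32_t a, int32_t b, int32_t c) {
    const int32_t dab = absDiff(a, b);
    const int32_t dac = absDiff(a, c);
    const int32_t dbc = absDiff(b, c);
    if (dab <= dac && dab <= dbc) return (a + b) / 2;  // c is the outlier
    if (dac <= dbc) return (a + c) / 2;                // b is the outlier
    return (b + c) / 2;                                // a is the outlier
}
```

With the example readings -60, -58 and -10, the value from C is dropped and the result is -59. A common alternative is to simply take the middle of the three values, which needs no averaging.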

The following are some questions and ideas from reading the posts. My knowledge is merely undergraduate level and I'm learning from you guys every day, so my questions may be stupid, please bear with me :slight_smile:

To retrolefty:

The AVR chips do have such an internal WDT and may or may not be useful for what you are looking for. I kind of like external hardware WDT, where all the processors have to continuously send a sanity pulse out and if external logic circuitry detects a processor not alive, just switches to the redundant processor.

I fully agree with you. Using internal WDT is pointless if the controller itself is not responding. But how can I "switch" once a controller is determined dead?

To mrmeval:

Standard microcontroller watchdog integrated circuits will read a pulse from the microcontroller and this resets the watchdog counter to zero. You can do that with a dedicated circuit or if you need some repeatability and precision I'd do it with a microcontroller

The point of having redundancy is that, theoretically, more complicated microcontrollers have a higher failure rate than simple components such as logic gates (I think?). The bottom line is, the redundancy system cannot fail, otherwise the entire system will be messed up and the equipment will be gone forever.

You'd want an external watchdog on it to reboot it if needed. I'd be tempted to use an ATTINY as a watchdog rather than the available dedicated watchdog integrated circuits.

Isn't ATTINY for onboard debugging?

I've worked with the Dallas external watchdog integrated circuit. DS1232. I can't spit on it hard enough to kill the engineer that built this piece of *t. It has no consistency across temperature, it has no consistency between parts and if you get one batch to work the next batch you buy might not.

OMG man you almost saved my butt. I was looking right at this chip! Its description is so attractive. But based on what you've said, I'm not going with it. Now I'm thinking about another chip from Maxim (Mixed-signal and digital signal processing ICs | Analog Devices), what do you think?

To Richard Crowley:

My gut feel is that attempting to deploy redundant systems, plus implementing some kind of cross-check/voting mechanism will quite possibly make your system LESS reliable unless you are doing graduate-level research on those topics. In which case coming here for advice seems odd.

That's why the redundancy system needs to be as simple as possible. The redundancy system is required by the scope of the project, which I can't change, although I personally agree that it doesn't seem necessary at all. I'm still an undergraduate student with a monkey's knowledge of circuit design and interfacing, but with your help, I'm growing fast :slight_smile:

To Coding Badly:

My suspicion is the battery is much more vulnerable than the AVR processor.

That's right. To counter this, I have put thick layers of insulation around the batteries. The batteries of choice are Energizer L91; those Li primary cells are supposed to retain an acceptable state of charge at low temperature. Also, phase change materials may be used to hold the temperature longer.

To Groove:

Why three identically-programmed devices?

The reason for using three identically programmed controllers is that when we need to switch from one controller to another (as in scenario 2), the system behavior stays the same.

To tkbyd:

After years of seeing discussions about fighting heat buildup in electronics, I was amused to read of people who are fighting a situation where there isn't enough heat! (I presume they are further frustrated by having heat buildup problems while the balloon is still near the ground?)

Hummm, that's a great point! We are in Winnipeg, Canada, a city that can get to -40 C in the winter, and we are pretty much going to launch our balloon during February, the coldest month of the year. Is getting hot still an issue then? I'm trying my best to wrap the avionics so they lose heat more slowly in near space. But according to what you said, I need to "vent" the heat?

To AltairLabs:
Your reply is extremely informative to me. It is an honor to hear from somebody who has actually launched balloons before! By "take control", I meant turning the malfunctioning controller off and the backup controller on. Because the code in each controller is the same and it is purely conditional, it shouldn't matter if we switch from controller A to B; the system behavior should remain the same.

I think I need to read more to absorb your words better. I will reply to you more thoroughly once I have done more research on your points.

Once again, thanks to all of you who have shared ideas with me. You guys are truly amazing!

What exactly does the Arduino(s) control?

Arduino A reads -60, B reads -58, C reads -10, the result from C will be ignored.

This is difficult because there's error thresholds and timing to consider, but also how does the supervisor get the information? There would have to be a protocol and three serial links, all of which is error prone.

What I'm thinking is it doesn't matter if the numbers are 60, 58 and 10 it's the action taken based on that info that matters. If said action is to control a single pin then this can be handled with some simple 2 of 3 voting system. Of course if you get to the point of shutting down one uC you're back to square one or worse, you can't arbitrate between 2 devices (a man with two watches never knows the right time :))
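For a single control pin, 2 of 3 voting is just the majority function, which is also what three AND gates feeding an OR gate would compute. A minimal sketch (the function name is made up):

```cpp
#include <cassert>

// Output follows whatever at least two of the three inputs agree on.
constexpr bool majority(bool a, bool b, bool c) {
    return (a && b) || (a && c) || (b && c);
}
```

One stuck or dead input is outvoted by the other two. With only two live inputs left that disagree, the result depends on the level the dead input is stuck at, which is the "two watches" problem mentioned above.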

Following from that, if there are only a couple of control OPs then it should be easy, if there's 50 that's a different story.

Anything not doing real time control, such as data logging, can just be done three times independently and the results compared later on a PC, at which point you can implement filtering of dodgy data.

Of course if the error is caused by a coding bug then they will all have the same problem. This (I believe) is why they have three teams write different code to the same spec on serious applications.

how can I make sure that the program flow on the 3 Arduinos are the same?

If they have the same code and the same clock source they should be within a gnat's fart. However if you are making decisions based on fuzzy inputs (i.e. analogue) I don't think you can ensure they will be locked.

if Arduino A wants to save temperature data to the SD card, while B wants to save GPS coordinates?

Then they are running different code, no way to have arbitration as far as I can see.

When Arduino A can't respond to the WDT's pull fast enough, WDT will reset this Arduino. If reset still can't make it go faster, this Arduino will be turned off by turning off the MOSFET connected to Vin pin of it, and then turn on Arduino B, etc.

Once again, if A isn't fast enough B and C won't be either. You have to set the WDT so the chip is fast enough.

Second, how to do the "turn off & turn on" switching using WDT circuit?

You have to "kick" the watch dog by performing writes to the correct registers within 4 clock cycles of each other. Here's an ASM example that seems to work for me.

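wdr                              ; reset ("kick") the watchdog counter
ldi      r_temp1,(1<<WDIE)       ; load a value with only the WDIE (watchdog interrupt enable) bit set
out      WDTCR,r_temp1           ; write it to the watchdog control register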

The code should not be in an ISR as interrupts can often still work when nothing else is.

Isn't ATTINY for onboard debugging?

Eh? It's just another of the AVR processors, only a very small one (8 pins).


Rob

You can go minimalist with it and use the 2313. Six of its I/O lines are enough to monitor 3 units and activate 3 resets. Of course it can do more, but I would use this or any other part in lieu of any Maxim watchdog product. I'm done with Maxim for those. I'd suggest the automotive rated version of this part, but earlier you mentioned finding out if this had all been done before, and that would be something to check on first.