Watchdog ISR question

I've used the watchdog before to reset my board if something went wrong and hung up my code. Works great.

I want to talk about using the Watchdog ISR to save some critical state variables to EEPROM right before the reset so I can come back up in the same state I was in before reset. That is a fairly simple thing to code, but I wanted to ask about the reliability of that ISR. I mean, how much can I count on it to run, and run properly? Let's say for instance that I had foolishly used the String library in a part of my code and corrupted the stack or the heap and locked up the program. When the watchdog goes off, will the ISR be able to run in that state? And how would I ever be able to know that the variables weren't corrupted in the crash? If the code is truly locked up and the ISR can't complete, will the reset still happen or is it just going to hang there waiting for the ISR to finish?

Is there a piece of the approach that I am missing here?

One last question... Is plopping a while(true){} somewhere in my code an acceptable way to simulate a hang in order to test the watchdog code? Or should I intentionally do something wrong (like with the String library) to simulate a real fault.

Delta_G:
Let's say for instance that I had foolishly used the String library in a part of my code and corrupted the stack or the heap and locked up the program. When the watchdog goes off, will the ISR be able to run in that state?

Yes. As long as the processor has not melted into slag, and interrupts are enabled, the watchdog ISR will run.

And how would I ever be able to know that the variables weren't corrupted in the crash?

The variables are now shared with an interrupt service routine so multi-byte variables must be protected with a critical section (disable interrupts during access).
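A minimal sketch of such a critical section, assuming avr-libc's <util/atomic.h> is available. The variable `dose_interval_s` and the two accessor functions are made-up names for illustration, and the `#else` branch is only a host-side stand-in so the snippet also compiles off-target.

```cpp
#include <stdint.h>

#ifdef __AVR__
#include <util/atomic.h>   // ATOMIC_BLOCK, ATOMIC_RESTORESTATE (avr-libc)
#else
// Host-side stand-ins (assumption: no interrupts here, so a plain block suffices).
#define ATOMIC_RESTORESTATE
#define ATOMIC_BLOCK(mode)
#endif

// Hypothetical 16-bit state variable that the watchdog ISR would also read.
static volatile uint16_t dose_interval_s = 0;

// Write both bytes with interrupts disabled so the ISR never sees
// a half-updated value.
static void set_dose_interval(uint16_t seconds)
{
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
  {
    dose_interval_s = seconds;
  }
}

// Read both bytes as one unit for the same reason.
static uint16_t get_dose_interval(void)
{
  uint16_t copy;
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
  {
    copy = dose_interval_s;
  }
  return copy;
}
```

Single-byte variables are read and written in one instruction on an 8-bit AVR, so they normally do not need this wrapper.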

A CRC works fairly well for detecting unwanted changes.

An inverted copy of the data is faster than a CRC and works a bit better for detecting unwanted changes.

If the code is truly locked up and the ISR can't complete, will the reset still happen

That is my understanding from reading the datasheet. If you do not pet-the-dog, the second firing resets the processor. Even if the ISR has not yet been called.

Is there a piece of the approach that I am missing here?

Make certain the watchdog configuration / management leaves enough time to write the data to the EEPROM.

One last question... Is plopping a while(true){} somewhere in my code an acceptable way to simulate a hang in order to test the watchdog code?

Absolutely.

Or should I intentionally do something wrong (like with the String library) to simulate a real fault.

Use that technique to test your "is my data corrupt" code.

Thanks for that. Awesome response.

Fortunately everything is in 8-bit variables so I think I can go without critical section. I do need to learn how to do that anyway though.

I do want to ask about the inverted copy vs CRC. I understand the CRC although I've never implemented it. Why does the inverted copy work faster? Would I just do a XOR on it and the original data and consider any answer other than 255 as corrupted? If I saw a bit that was zero, would there be any way to tell if it was the original data or the inverted copy that was corrupted?

The last question has to do with the timing of the interrupt. If I understand your fourth answer above you are saying that the ISR needs to be faster than the watchdog timer or it may not be able to finish before the reset. I think I'm going to be safe anyway, I only need to write three or four bytes to EEPROM. But I've gone back to the datasheet and I see that executing the ISR clears both WDIF and WDIE if in the interrupt / reset mode but would leave WDE set. So does that mean theoretically I could pat the dog inside the ISR to get more time and then hang myself at the end of the ISR to wait for the reset?

Is it safe to intentionally hang at the end of the ISR with a while(true) {} to make sure that nothing strange will happen from the rest of the code if it were to return and instead wait for the watchdog to give me the reset?

Delta_G:
Why does the inverted copy work faster?

It's just a bitwise-not (versus table lookups, shifts, and loops)...

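Check = (uint8_t) ~Variable;

// The cast matters: ~ promotes a byte to int, so an uncast comparison would never match.
if ( Check != (uint8_t) ~Variable ) { /* data is corrupt */ }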

The advantage with a CRC is that you can gain a fairly high level of safety with a single 16 bit CRC for all of the data to be validated (less memory). In your case (four single-byte values), the savings is just two bytes.
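For comparison, a sketch of the single-CRC approach over a small block of state bytes. The bit-by-bit update follows the C equivalent that avr-libc documents for `_crc16_update()` (polynomial 0xA001); on an AVR you could probably call that function from <util/crc16.h> instead. The struct layout, the function names, and the 0xFFFF starting value are assumptions for illustration.

```cpp
#include <stdint.h>
#include <stddef.h>

// One CRC-16 step over a single byte (reflected polynomial 0xA001).
static uint16_t crc16_step(uint16_t crc, uint8_t data)
{
  crc ^= data;
  for (uint8_t i = 0; i < 8; ++i)
  {
    if (crc & 1)
      crc = (uint16_t)((crc >> 1) ^ 0xA001);
    else
      crc = (uint16_t)(crc >> 1);
  }
  return crc;
}

// CRC over a whole buffer, starting from an assumed initial value of 0xFFFF.
static uint16_t crc16_block(const uint8_t *data, size_t len)
{
  uint16_t crc = 0xFFFF;
  for (size_t i = 0; i < len; ++i)
    crc = crc16_step(crc, data[i]);
  return crc;
}

// Hypothetical container: four state bytes guarded by one 16-bit CRC.
struct CrcGuardedState
{
  uint8_t  values[4];
  uint16_t crc;
};

// Call after every legitimate change to values[].
static void crc_seal(CrcGuardedState &s)
{
  s.crc = crc16_block(s.values, sizeof s.values);
}

// Returns false if values[] or crc changed without a matching crc_seal().
static bool crc_is_intact(const CrcGuardedState &s)
{
  return s.crc == crc16_block(s.values, sizeof s.values);
}
```

The cost is that every legitimate update has to recompute the CRC over all four bytes, which is the loop-and-shift work the inverted copy avoids.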

Would I just do a XOR on it and the original data and consider any answer other than 255 as corrupted?

Yes. (Or what I did above)
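Both forms of the check can be sketched like this for a single byte. `GuardedByte`, `guarded_set`, and `guarded_ok` are hypothetical names. The casts are there because C and C++ promote 8-bit operands to `int` before applying `~` or `^`.

```cpp
#include <stdint.h>

// A value stored together with its bitwise inverse.
struct GuardedByte
{
  uint8_t value;
  uint8_t check;   // always written as the inverse of value
};

// Update the value and its inverted copy together.
static void guarded_set(GuardedByte &g, uint8_t v)
{
  g.value = v;
  g.check = (uint8_t)~v;   // cast keeps the result in 8 bits
}

// XOR of a byte with its inverse has all eight bits set (255).
// Anything else means one of the two copies changed.
static bool guarded_ok(const GuardedByte &g)
{
  return (uint8_t)(g.value ^ g.check) == 0xFF;
}
```

As noted in the thread, a failed check only says that the pair no longer matches, not which of the two bytes was hit.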

If I saw a bit that was zero, would there be any way to tell if it was the original data or the inverted copy that was corrupted?

No. The technique is not capable of correction.

The only 100% reliable technique would be to use something other than SRAM to store either the check or the value.

If I understand your fourth answer above you are saying that the ISR needs to be faster than the watchdog timer or it may not be able to finish before the reset.

Exactly.

I think I'm going to be safe anyway, I only need to write three or four bytes to EEPROM.

4 bytes * ~3.3 ms per EEPROM byte write = ~13.2 ms

But I've gone back to the datasheet and I see that executing the ISR clears both WDIF and WDIE if in the interrupt / reset mode but would leave WDE set. So does that mean theoretically I could pat the dog inside the ISR to get more time and then hang myself at the end of the ISR to wait for the reset?

I believe the answer is yes. I can't find anything in the datasheet to contradict that.

Unless you need a time-critical reset (processor must reset within T milliseconds of a fault), I suggest you add a WDR before and after writing each byte to the EEPROM.
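A rough sketch of that suggestion, assuming avr-libc's `wdt_reset()`, `eeprom_update_byte()`, and `eeprom_busy_wait()`. The function name `save_state` and its parameters are invented for illustration, and the `#else` branch only provides host-side stand-ins so the logic can be compiled and exercised off-target. Whether each WDR really buys a fresh timeout inside the ISR is the same "I believe so" as above, not something verified here.

```cpp
#include <stdint.h>

#ifdef __AVR__
#include <avr/eeprom.h>   // eeprom_update_byte, eeprom_busy_wait
#include <avr/wdt.h>      // wdt_reset
#else
// Host-side stand-ins (assumption: only for trying the logic on a PC).
static uint8_t  fake_eeprom[16];
static unsigned wdr_count = 0;
static void wdt_reset(void) { ++wdr_count; }
static void eeprom_update_byte(uint8_t *addr, uint8_t value)
{
  fake_eeprom[(uintptr_t)addr] = value;
}
static void eeprom_busy_wait(void) { }
#endif

// Copy `count` state bytes to EEPROM starting at address `base`,
// resetting the watchdog before and after each byte.
static void save_state(const volatile uint8_t *state, uint8_t count, uint16_t base)
{
  for (uint8_t i = 0; i < count; ++i)
  {
    wdt_reset();
    eeprom_update_byte((uint8_t *)(uintptr_t)(base + i), state[i]);
    wdt_reset();
  }

  // Let the last write finish before the caller hangs and waits for the reset.
  eeprom_busy_wait();
  wdt_reset();
}
```

It would be called from the watchdog ISR, just before the final `while ( true ) { }`.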

Is it safe to intentionally hang at the end of the ISR with a while(true) {} to make sure that nothing strange will happen from the rest of the code if it were to return and instead wait for the watchdog to give me the reset?

Absolutely.

Awesome! Thanks a bunch!

I've never written anything safety critical before. The project I have in mind is for my aquarium and while nobody would be hurt in a failure scenario, I could potentially incur significant expense from a failure like dead fish or water on the floor. So I'm trying to learn as much as I can before I start. You've got me far enough to start experimenting with the watchdog ISR. Many many thanks.

I would also make sure brownout detection is enabled and set at a reasonable value (eg. 4.3V not 2.7V which I think is the default).

Otherwise this might occur: A power failure occurs, followed by a slow restoration of power where the processor is in an unstable state. Thus it cannot set up the watchdog timer. The brownout enable fuse should ensure that it resets under those conditions.

How long would a condition like that last? Or could something like that possibly hang me up permanently?

Only if the power stayed too low indefinitely, I would have thought.

Let's say for instance that I had foolishly used the String library in a part of my code and corrupted the stack or the heap and locked up the program. When the watchdog goes off, will the ISR be able to run in that state?

That's an interesting question. Let's assume you don't use the heap in the ISR, but only the stack. If the stack pointed to some nonexistent piece of memory, then the things the ISR could do would be limited. For example, calling other functions simply might not work.

Personally I wouldn't be saving variables if the reason the watchdog timer was called was that memory might be corrupted. You might be saving bad variables.

I would rather use it for situations where, for example, you used the I2C library which might hang if an interrupt went unnoticed. No variables were corrupted, so you simply recover.

And as for EEPROM, I think I would rather write to it from time to time (eg. every couple of minutes) and only then if the data had changed. And do that in "normal" operation, not in the watchdog ISR.
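One way that periodic approach could look, as a sketch only. The time is passed in as a parameter (on an Arduino it would likely come from `millis()`), and `store_state` is a stand-in for the real storage call, for which avr-libc's `eeprom_update_block()` might be a fit. All names here are hypothetical.

```cpp
#include <stdint.h>
#include <string.h>

struct SavedState { uint8_t bytes[4]; };

// Stand-in for the real storage (assumption: replace with an EEPROM write).
static SavedState backing_store;
static unsigned   store_writes = 0;
static void store_state(const SavedState &s)
{
  backing_store = s;
  ++store_writes;
}

struct PeriodicSaver
{
  uint32_t   interval_ms;     // e.g. 120000 for "every couple of minutes"
  uint32_t   last_check_ms;   // time of the last due check
  SavedState last_saved;      // what storage currently holds
};

// Call from the main loop. Writes only when the interval has elapsed
// AND the state differs from what was last saved. Returns true on a write.
static bool saver_poll(PeriodicSaver &p, const SavedState &current, uint32_t now_ms)
{
  // Unsigned subtraction stays correct when the millisecond counter wraps.
  if ((uint32_t)(now_ms - p.last_check_ms) < p.interval_ms)
    return false;
  p.last_check_ms = now_ms;

  if (memcmp(p.last_saved.bytes, current.bytes, sizeof current.bytes) == 0)
    return false;

  store_state(current);
  p.last_saved = current;
  return true;
}
```

Skipping unchanged data keeps the number of EEPROM write cycles down, which matters because the cells have a limited write endurance.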

If the stack pointed to some nonexistent piece of memory...

The stack pointer is exactly sized to SRAM. Assuming the stack pointer wraps when it reaches zero it will always point to a valid SRAM address.

Depends a bit how you define "valid". Say it pointed to 0x5D (the stack pointer itself) then I am guessing that calling a function would not work correctly. The attempt to push the return address onto the stack would in itself corrupt the stack pointer, so that returning would not work.

Ooh, good catch. A bit of assembly solves that problem...

ISR( WDT_vect, ISR_NAKED )
{
  // Ensure the stack pointer isn't going to give us trouble
  asm volatile
  (
    "ldi  r16, %[rel]"     "\n\t"
    "out  __SP_L__, r16"   "\n\t"
    "ldi  r16, %[reh]"     "\n\t"
    "out  __SP_H__, r16"   "\n\t"
    "eor  r1, r1"          "\n\t"
    :
    :
      [rel] "M" ( (RAMEND >> 0) & 0xFF ),
      [reh] "M" ( (RAMEND >> 8) & 0xFF )
    :
      "r16"
  );

  // User code goes here.

  // Never ever return!
  while ( true ) { }

  // This path leads to chaos and insanity!
}

Of course, the best solution is to develop bug-free code. :)

I agree. I would regard the watchdog as a "get out of jail" for some external event, like a brownout, or an interrupt that went missing. Not as something to compensate for you writing all over memory because of a bug.