I'd like to suggest an approach which doesn't address the problem, but makes your system resilient to it: a watchdog.
There's a lot to be said about watchdogs, but in summary: a watchdog is a timer that resets or restarts the system unless the running code periodically signals that it is still healthy. That makes watchdogs an essential part of any system which must keep working over very long periods in what might be a hostile environment.
They should not be used as an excuse to leave bugs extant, of course. But they are a realistic response to the imperfection of the environment (and our coding skills).
So, I'm not proposing it as a solution to your problem, but as something for you to consider implementing anyway.
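To make the idea concrete, here is a minimal software sketch of the watchdog pattern in Python. It is only an illustration under assumptions: the class name `Watchdog`, its methods, and the `on_timeout` callback are all hypothetical, and on a microcontroller you would normally use the chip's hardware watchdog rather than a thread. The main loop calls `kick()` whenever it makes progress, and an independent checker fires the recovery action if the kicks stop.

```python
import threading
import time


class Watchdog:
    """Hypothetical software watchdog: runs on_timeout if kick() stops arriving."""

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout  # recovery action, e.g. restart a worker
        self._clock = clock            # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._last_kick = clock()
        self._stop = threading.Event()

    def kick(self):
        """Called by the monitored code to signal that it is still alive."""
        with self._lock:
            self._last_kick = self._clock()

    def check(self):
        """Fire on_timeout if the deadline has passed; return True if it fired."""
        with self._lock:
            expired = self._clock() - self._last_kick > self._timeout_s
            if expired:
                # Restart the countdown so recovery gets a full timeout period.
                self._last_kick = self._clock()
        if expired:
            self._on_timeout()
        return expired

    def start(self, poll_s=0.1):
        """Poll check() from a daemon thread, independent of the main loop."""
        def run():
            # Event.wait returns False on timeout and True once stop() is called.
            while not self._stop.wait(poll_s):
                self.check()

        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread

    def stop(self):
        """Ask the polling thread to exit."""
        self._stop.set()
```

The important design point is that the check runs independently of the code being watched: if the main loop hangs, it cannot also be responsible for noticing the hang. Typical use would be to call `start()` once at startup and `kick()` once per pass through the main loop, with a timeout comfortably longer than the slowest legitimate pass.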