So this is a heck of a problem, and it will anger anyone who wants me to post the code, but please bear with me while I explain. I have a system that sleeps, wakes, checks a sensor, and, if the sensor is triggered, snaps a picture and transfers it wirelessly to a gateway device. The system works perfectly for right around 8 hours, then starts crashing whenever the sensor is triggered or it is time for its regular check-in with the gateway device (call that a heartbeat). Otherwise it wakes normally, checks the sensor, and appears to be operational, until a trigger event happens and it crashes again. This is where the problem gets truly bizarre.
I reprogram the device with a simple diagnostic program, find no problems, then re-upload the original code, and the crashing persists. The crashing only stops if the battery is removed and replaced; normal operation then resumes perfectly for another 8 hours, and the problem starts over.
For the wakeups I use a PCF8563 RTC to generate pin change interrupts.
A watchdog timer is used as a fail-safe backup to let the system recover in the event of a hang or freeze.
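Since a watchdog reset can look just like any other restart, one thing that could be logged at boot is the reset cause. Below is a minimal sketch of decoding it, assuming the ATmega1284P MCUSR layout (PORF = bit 0, EXTRF = bit 1, BORF = bit 2, WDRF = bit 3, JTRF = bit 4). The helper name resetCauseNames() is hypothetical, and it is written as plain C++ so it can be checked off-target. On the board the argument would be a copy of MCUSR taken as early as possible, and a bootloader may already have cleared that register before the sketch runs.

```cpp
#include <cstdint>
#include <string>

// Reset-flag bit positions in MCUSR on the ATmega1284P.
constexpr uint8_t kPORF  = 0;  // power-on reset
constexpr uint8_t kEXTRF = 1;  // external reset (RESET pin)
constexpr uint8_t kBORF  = 2;  // brown-out reset
constexpr uint8_t kWDRF  = 3;  // watchdog reset
constexpr uint8_t kJTRF  = 4;  // JTAG reset

// Hypothetical helper: turn a saved copy of MCUSR into readable text.
// More than one flag can be set, so every set flag is listed.
std::string resetCauseNames(uint8_t mcusr) {
    std::string out;
    auto add = [&](uint8_t bit, const char* name) {
        if (mcusr & (1u << bit)) {
            if (!out.empty()) out += '+';
            out += name;
        }
    };
    add(kPORF,  "power-on");
    add(kEXTRF, "external");
    add(kBORF,  "brown-out");
    add(kWDRF,  "watchdog");
    add(kJTRF,  "JTAG");
    return out.empty() ? std::string("none") : out;
}
```

Printing this string once at startup would show whether a given restart came from the watchdog, a brown-out, or something else.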
I used Serial.println to find the spot where it is crashing, which appears to be very soon after the wake-up interrupt but not in the interrupt itself. For instance, after being awakened by the interrupt, it appears to attempt to execute the first line of the relevant subroutine and then crashes.
I truly wish I could post the code, but it is literally thousands of lines and I wouldn't willingly do that to my worst enemy. I would post a simplified version that reproduces the issue, but I don't have the slightest clue where the issue is.
My first and most obvious guess is that I am somehow running out of memory, due to an expanding buffer or perhaps a steady stream of new variables being declared somewhere, eating the heap byte by byte. But when I check the free memory using either the FreeRam code snippet or the MemoryFree library, both confirm that I have over 12 KB still available.
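For context, free-RAM checks of this kind boil down to one subtraction. The sketch below shows that arithmetic with the addresses passed in as plain numbers so it runs off-target; the function name gapBetweenHeapAndStack() is hypothetical. On the AVR the inputs would come from avr-libc's __heap_start and __brkval and from the address of a local variable. The result is only the gap between the top of the heap and the stack at the moment of the call, so it says nothing about fragmentation inside the heap or about how deep the stack gets elsewhere.

```cpp
#include <cstdint>

// Hypothetical off-target model of the usual AVR free-RAM calculation.
//   stackAddr : address of a local variable (approximates the stack pointer)
//   heapStart : address of __heap_start
//   brkval    : value of __brkval, which is 0 until malloc() is first used
// Returns the number of bytes between the top of the heap and the stack.
// A negative result would mean the stack has already run into the heap.
int32_t gapBetweenHeapAndStack(uint16_t stackAddr,
                               uint16_t heapStart,
                               uint16_t brkval) {
    const uint16_t heapTop = (brkval == 0) ? heapStart : brkval;
    return static_cast<int32_t>(stackAddr) - static_cast<int32_t>(heapTop);
}
```

With made-up addresses, gapBetweenHeapAndStack(0x40F0, 0x0500, 0) gives 15344 bytes, and the same call with brkval at 0x0900 gives 14320.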
My build uses the ATmega1284P at 16MHz at 8MHz (which has always been stable before) running a slightly modified DualOptiboot bootloader from lowpowerlabs, if that helps anyone with theories.
I am truly confounded and just looking for someone's thoughts on where I should look to solve the issue.
Just in case the problem is in the interrupts or is being caused by them, I have included the code for the ISR functions.
ISR(PCINT0_vect){            // BUTTON PRESS, FLAG=2
  ArmStation=false;          // DISARM THE STATION SO FUTURE INSTRUCTIONS CAN BE ENTERED
  digitalWrite(LED_Blue,HIGH); // INFORM TECH THAT INSTRUCTION WAS RECEIVED
  PCICR=0;                   // DISABLE PIN CHANGE INTERRUPTS
  PCMSK0=0;                  // DISABLE BUTTON INTERRUPT
  PCMSK3=0;                  // DISABLE RTC INTERRUPT
  InterruptFlag=2;           // SET THE INTERRUPT FLAG
}
ISR(PCINT3_vect){            // TIMER OR ALARM INTERRUPT FROM RTC, FLAG=1
  PCICR=0;                   // DISABLE PIN CHANGE INTERRUPTS (WAS PCICR|=0, WHICH CHANGES NOTHING)
  PCMSK0=0;                  // DISABLE BUTTON INTERRUPT
  PCMSK3=0;                  // DISABLE RTC INTERRUPT
  InterruptFlag=1;           // SET THE INTERRUPT FLAG
}
ISR(WDT_vect){}              // INTERRUPT HANDLER FOR WATCHDOG TIMER (INTENTIONALLY EMPTY)
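One common pattern for consuming a flag like InterruptFlag outside the ISRs is to declare it volatile and to read and clear it inside a short critical section. The following is a sketch of that pattern, not my actual main loop. takeInterruptFlag() is a hypothetical name, and noInterrupts()/interrupts() are stubbed so the snippet compiles off-target; on the board the stubs would be dropped in favour of the real Arduino macros.

```cpp
#include <cstdint>

// Off-target stand-ins for the Arduino noInterrupts()/interrupts() macros.
// Assumption: on the board these three lines are removed.
static int gIrqDisableDepth = 0;
static void noInterrupts() { ++gIrqDisableDepth; }
static void interrupts()   { --gIrqDisableDepth; }

// Written by the ISRs and read by the main loop, so it must be volatile;
// otherwise the compiler may keep a stale copy in a register.
volatile uint8_t InterruptFlag = 0;

// Hypothetical helper: fetch the pending flag and clear it in one step,
// so an interrupt arriving between the read and the clear is not lost.
uint8_t takeInterruptFlag() {
    noInterrupts();
    const uint8_t flag = InterruptFlag;
    InterruptFlag = 0;
    interrupts();
    return flag;
}
```

The main loop would then call takeInterruptFlag() once after waking and branch on the returned value (1 for the RTC, 2 for the button, 0 for nothing pending).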