[SOLVED] - CRASHING AFTER SEVERAL HOURS OF OPERATION

So this is a heck of a problem and will inspire the anger of anyone who wants me to post the code, but please be patient and bear with me while I explain the problem. I have a system that sleeps, wakes, checks a sensor and, if triggered, snaps a picture and then transfers it wirelessly to a gateway device. The system works perfectly for right around 8 hours, then starts crashing if the sensor is triggered or if it is time for its regular check-in with the gateway device (call that a heartbeat). Other than that it wakes normally, checks the sensor, and appears to be operational, until of course a trigger event happens and it crashes again. This is where the problem gets truly bizarre.

I reprogram the device with a simple diagnostic program, find no problems, then re-upload the original code and the crashing persists. The crashing only stops if the battery is removed and replaced; then normal operation resumes perfectly for another 8 hours and the problem starts over.

For the wakeups I use a PCF8563 RTC to generate pin change interrupts. A watchdog timer is used as a fail-safe backup to let the system recover in the event of a hang, or freeze, or well you know...

I used Serial.println to find the spot where it is crashing, which appears to be very soon after the wake-up interrupt but not in the interrupt itself. For instance, it appears to attempt to execute the first line of the relevant subroutine called after being awakened by the interrupt, and then it crashes.

I truly wish I could post the code, but it is literally thousands of lines and I wouldn't willingly do that to my worst enemy. I would post a simplified code of the issue but I don't have the slightest clue where the issue is.

My first and most obvious guess is that somehow I am running out of memory, due to an expanding buffer or perhaps a constant string of new variables being declared somewhere, eating the heap byte by byte, but when I check the free memory using either the FreeRam code snippet or the memoryFree Libraries they both confirm that I have over 12KB still available.

My Build uses the atmega1284p at 16MHz at 8MHz (which has always been stable before) running a slightly modified dualOptiboot bootloader from lowpowerlabs if that helps anyone with theories.

I am truly confounded and just looking for someone's thoughts on where I should look to solve the issue.

Just in case it is in the interrupts or being caused by them I have included the code for the ISR functions.

ISR(PCINT0_vect){                   //  BUTTON PRESS    FLAG 2
  ArmStation=false;                 //  DISARM THE STATION SO FUTURE INSTRUCTIONS CAN BE ENTERED
  digitalWrite(LED_Blue,HIGH);      //  INFORM TECH THAT INSTRUCTION WAS RECEIVED
  PCICR=0;                          //  DISABLE PIN CHANGE INTERRUPTS
  PCMSK0=0;                         //  DISABLE BUTTON INTERRUPT
  PCMSK3=0;                         //  DISABLE RTC INTERRUPT
  InterruptFlag=2;                  //  SET THE INTERRUPT FLAG
}
ISR(PCINT3_vect){                   //  TIMER OR ALARM INTERRUPT RTC FLAG=1
  PCICR=0;                          //  DISABLE PIN CHANGE INTERRUPTS (|=0 WOULD LEAVE THE REGISTER UNCHANGED)
  PCMSK0=0;                         //  DISABLE BUTTON INTERRUPT
  PCMSK3=0;                         //  DISABLE RTC INTERRUPT
  InterruptFlag=1;
}
ISR(WDT_vect){}                     //  INTERRUPT HANDLER FOR WATCHDOG TIMER

Are you using Strings anywhere?

That's a great question

Only inside the serial output that I am using to find the problem. Most of them are used in the context of

Serial.println(F("Beginning Sensor Check"));
//  or
Serial.print(F("Sensor reading: "));
Serial.println(SensorReading);

There may be a few where I accidentally left it as

Serial.println("Testing something");

Control of and communication with the camera is done over serial, but no commands are issued as strings. I just realized that it does, however, restart the Serial1 line every time the camera is turned on, i.e.:

Serial1.begin(115200);

Do you think that could have something to do with it?

In addition, I have probably 500 of those Serial.print debug lines throughout the code for monitoring and testing.

Are you using arrays in the program ?

Yes I am using a few arrays.

One is defined as WorkingBuf with a length of 50 and is continuously used, cleared, and reused

The battery control index has a length of 5 and more or less never really changes

The Serial number is an array of 4 bytes - after initial programming this is assigned and never changes

The wireless network ID is an array of 4 bytes - basically never edited

Then there are 5 arrays to hold battery voltage levels and 5 arrays that hold the state of charge relevant to those voltage levels. These sets of arrays are used only to cross-reference the battery voltage reading against the state of charge in order to report the battery's charge level. They are, however, declared inside the function that tests the battery. These are large arrays for sure.

I don't know if that helps or not

Check very carefully that nothing ever writes outside of the array boundaries

Are you running completely on batteries? If you have power with battery backup, can you replace the batteries after 4 hours to see if time to error changes?

If you are only running on batteries, can you add another battery bank in parallel to the original pair and see if that changes the time to error?

Thank you both for your help, I really truly appreciate it.

With regard to the battery question, the device does run completely on a single battery. The total consumption in 8 hours is a fraction of 1%. I like the idea of adding a second battery and then having it swap its source every so often, but unfortunately when deployed I will only be able to have a single battery. Design constraints that were put on me. I’m sure you understand. Out of curiosity, what is your theory?

I took your advice and have begun to review those arrays in particular the battery index cross reference functions to make sure that I am not trying to find a value beyond the indices of the arrays.

I did notice something that I had not considered, and am curious whether it is possible. Those battery search arrays are defined inside the function itself, and are declared as const uint8_t and const float. The const qualifier is persistent, so I am wondering whether every call to the function allocates an additional section of the heap to hold them. If that is happening, it would certainly fit the theory of an expanding buffer eating memory and causing the crash.

I am including that small section of the code for your consideration

BatteryIndex and BatteryPercentage are global variables defined several thousand lines of code above this and are used for reporting and tracking purposes

void SearchCharge25(float volt){                          //  CROSS REFERENCE TEMPERATURE AND VOLTAGE WITH CHARGE LEVEL OF BATTERY
  const float V25c[]={4.21, 3.99, 3.93, 3.90, 3.88, 3.87, 3.85, 3.83, 3.82, 3.80, 3.78, 3.76,
              3.74, 3.72, 3.71, 3.69, 3.68, 3.66, 3.65, 3.63, 3.62, 3.60, 3.59, 3.57,
              3.55, 3.54, 3.52, 3.50, 3.49, 3.47, 3.46, 3.44, 3.43, 3.42, 3.40, 3.39,
              3.38, 3.37, 3.36, 3.34, 3.33, 3.32, 3.31, 3.30, 3.28, 3.27, 3.26, 3.24,
              3.23, 3.21, 3.20, 3.18, 3.16, 3.14, 3.12, 3.09, 3.07, 3.04, 3.00};
  const uint8_t B25c[]={100,100, 98, 96, 94, 92, 91, 89, 87, 85, 84, 82,
                 80, 78, 77, 75, 73, 71, 70, 68, 66, 65, 63, 61,
                 59, 58, 56, 54, 52, 51, 49, 47, 45, 44, 42, 40,
                 38, 37, 35, 33, 31, 30, 28, 26, 24, 23, 21, 19,
                 17, 16, 14, 12, 10,  9,  7,  5,  3,  2,  0};

  int a=0;
  while(a<2){
    Serial.println(F("25c power range!!!!!!"));
    Serial.print(F("Bat Index: "));Serial.println(BatteryIndex[1]);
    for (int i=BatteryIndex[1];i<59;i++){
      if((volt<V25c[i-1])&&(volt>=V25c[i])){
        BatteryPercentage=B25c[i];
        BatteryIndex[1]=i;
        a=1;
        break;
      }
    }
    if(a==0){
      BatteryIndex[1]=1;
    }
    a++;
  }
}
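On the heap question: arrays declared inside a function are not heap allocations. Nothing calls malloc or new, so they cannot grow the heap from one call to the next; at worst a non-static local table is rebuilt on the stack for the duration of each call and released on return. The more immediate risk in the search above is the V25c[i-1] read, which goes out of bounds if the starting index is ever 0. Below is a minimal sketch of the same search with the start index clamped and the tables made static const. The names kVolts, kPercent, kCount and lookupCharge are hypothetical, and the tables are shortened to the first five entries from the post purely for illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// Shortened copies of the V25c/B25c tables from the post (first five entries only).
// static const gives each table one fixed location instead of a fresh copy per call.
static const float   kVolts[]   = {4.21f, 3.99f, 3.93f, 3.90f, 3.88f};
static const uint8_t kPercent[] = {100,   100,   98,    96,    94};
static const size_t  kCount     = sizeof(kVolts) / sizeof(kVolts[0]);
static_assert(sizeof(kVolts) / sizeof(kVolts[0]) ==
              sizeof(kPercent) / sizeof(kPercent[0]),
              "voltage and percentage tables must have the same length");

// Hypothetical helper: finds i such that kVolts[i] <= volt < kVolts[i-1].
// hint is the last index found (BatteryIndex[1] in the post); it is updated on success.
// Returns the charge percentage, or -1 if the voltage is outside the table.
int lookupCharge(float volt, size_t &hint) {
  // Clamp the hint so that both i-1 and i are always valid indices.
  size_t start = (hint >= 1 && hint < kCount) ? hint : 1;
  for (int pass = 0; pass < 2; ++pass) {
    for (size_t i = start; i < kCount; ++i) {
      if (volt < kVolts[i - 1] && volt >= kVolts[i]) {
        hint = i;
        return kPercent[i];
      }
    }
    start = 1;  // second pass: search the whole table from the top
  }
  return -1;
}
```

The caller decides what to do with -1 (for example, keep the previous percentage), which makes an out-of-range voltage visible rather than leaving stale values in place silently.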

Charlie1985: Thank you both for your help, I really truly appreciate it.

With regard to the battery question, the device does run completely on a single battery. The total consumption in 8 hours is a fraction of 1%. I like the idea of adding a second battery and then having it swap its source every so often, but unfortunately when deployed I will only be able to have a single battery. Design constraints that were put on me. I'm sure you understand. Out of curiosity, what is your theory?

The extra batteries could just be used in testing for this issue.

You state that things run fine for about 8 hours and normal operations return, after a battery replacement. To troubleshoot this issue I'd change the ability of the batteries to supply power. If, after a change of available power and 8 hours pass, the issue does not happen, that is one thing. If, around 8ish hours with a greater supply capability of the batteries, an issue happens then that is another thing.

Charlie1985: That's a great question

Only inside the serial output that I am using to find the problem. Most of them are used in the context of

Serial.println(F("Beginning Sensor Check"));
//  or
Serial.print(F("Sensor reading: "));
Serial.println(SensorReading);

Those are cstrings and they present no risk.

The question in Reply #1 was about the use of the String (capital S) class, which can cause problems of memory corruption in the small memory of an Arduino.

...R

Thank you for the replies and the help guys. I still haven't solved this issue but I have deployed a few tools and tricks and have chased it down to the exact moment when the system crashes. I cannot fathom a source of this crash except memory corruption, but I have no idea what could possibly be causing this. Here is what I have done so far, and here is the exact command in my code that the crash happens at.

I used FreeMemory and getFragmentation() to check my memory and possible heap fragmentation. Heap fragmentation reports as 0.00% and free memory is right around 12,800 bytes. It did run for 10 hours this time, which is nice, but it is still about 364 days and 14 hours short of what I need lol.

Fortunately I wised up this time. Since I know that the crashes continue after reprogramming so long as I don't pull the battery, I just kept adding Serial.println(F("1"));delay(1000); then "2", "3", etc. until I found the exact spot where it dies. Ironically it dies the second time it passes this section. All of the hardware peripherals are tested and verified during setup and then put to sleep or powered down until they are needed later in the code. Apparently the exact moment of the crash is when I turn on the camera, not when the camera is actually being initialized, which would make way more sense. It crashes when I am literally powering it on. All it is doing there is writing the enable pin on the camera's LDO HIGH. How could that possibly be causing a crash?

digitalWrite(Cam_Enable,HIGH);

I am truly baffled as to how, of all the lines in the entire code, that one could possibly cause any issues.

Just so I am on the same page with everyone here I'm including the code for declaring Cam_Enable, and the pinMode for that pin

const int Cam_Enable=14;            //PD6;

and

pinMode(Cam_Enable,OUTPUT);         //  CAMERA LDO CONTROL - ON(HIGH)

Please tell me that someone has an explanation.

I am going to go back through all of the libraries and make sure that there are no Strings ever used, which there may be in my radio manager, as I wrote that code 2 years ago before I knew better but even if there are strings there, I don't see how this could be happening. Is there something that I am missing?

If it helps, here's a little more information. The image processing chip is a VC0703, controlled by a slightly modified VC0706 library put out by Adafruit. It is controlled using 3 pins: Serial1 Tx, Serial1 Rx, and Cam_Enable (which controls the LDO that turns the camera and image processor on and off).

I think that your instincts are likely correct and that the use of Strings is to blame. I found several of them scattered throughout the application-level functions of my radio communications, specifically those meant to add end-to-end acknowledgements to a hop-to-hop acknowledged network. In particular I was casting a lot of things to Strings. I've included an example of this from the library. I just finished adapting all of them to char arrays, but I am still curious whether that could really be the cause of the problem, and if so, could someone point me to some literature that can help me wrap my head around it? I know Strings can cause memory corruption, but can they corrupt the SRAM, and if so by what mechanism?

On another note, does anyone happen to know off the top of their head whether there is a way to clear the RAM without pulling the power? Something like a subroutine I could add in the event of a watchdog reset or crash, in case removing the Strings doesn't do the trick.

example of a bad idea from years ago:

if(String((char*)WorkingBuf)=="OK"){
  Serial.println("PRAY THAT THIS CODE DOESN'T EVER BITE YOU!");
}
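For comparison, the same check can be written without creating a String temporary at all. This is a minimal sketch, assuming WorkingBuf is the 50-byte buffer described earlier in the thread and holds a NUL-terminated reply; bufferEquals is a hypothetical helper name. The length is bounded explicitly so that an unterminated buffer is never read past its end.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: true when the buffer holds exactly `expected`
// followed by a terminating NUL, without reading past bufLen bytes.
bool bufferEquals(const uint8_t *buf, size_t bufLen, const char *expected) {
  const size_t n = strlen(expected);
  if (n >= bufLen) return false;  // no room for the text plus its NUL
  return memcmp(buf, expected, n) == 0 && buf[n] == '\0';
}
```

A call site would then read if(bufferEquals(WorkingBuf, sizeof(WorkingBuf), "OK")){ ... }, with no dynamic allocation involved.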

Keep in mind using Serial.print to find the source of a crash can be VERY misleading. Serial data goes out rather slowly, so the last thing you see printed was probably actually queued up LONG before the crash, and the last message queued up may not have made it out at all. You can follow each print with Serial.flush(), which will create a whole other set of problems.

Regards, Ray L.

That is a fair point. I had considered it and thought that adding a full second delay would help make sure that I found the correct spot. Perhaps that assumption was incorrect.

RayLivingston: Keep in mind using Serial.print to find the source of a crash can be VERY misleading.

When I am looking for a crash I put some Serial.println("Here1"); Serial.println("Here2"); etc at places where I think the problem might lie so I can get an idea of what parts the program reaches (or doesn't).

It has worked well for me.

However I think an important part of debugging is forming a testable hypothesis about the cause of the problem. Shaking confetti all over it rarely helps.

...R

The Serial.flush() can actually be important as it prevents the code from continuing till the print is complete. Easier to find where the crash happens.

It obviously comes at the cost of blocking.

I appreciate the flush advice and wanted to give you guys an update.

The problem is still not solved. I replaced all Strings in my radio communications control code with char arrays. I still feel silly that I didn't think about that old library sooner. In any event, this time the crash occurred just short of hour 6. I definitely don't think replacing Strings with char arrays made it worse; it just highlighted that the real cause isn't actually time bound.

I did use the Serial.flush(); to replace the delays after the serial.println(); to try and zero in on the issue.

The place where the code dies is the exact same spot as before. I am including the Serial output and the code section where it dies, but I can't fathom, and am almost unwilling to accept, that digitalWrite could cause a crash. There must be an underlying condition that is bringing down the system, and the digitalWrite is merely the straw that breaks the camel's back. Please, if anyone has ever had a crash after calling digitalWrite, let me know.

Thank you so much. Charlie

Here is the exact spot in the code where it dies, though I sincerely doubt that it is actually the cause

bool StartCamera(){                                       //  TURN ON CAMERA
  Serial.println(F("Starting Camera"));   //  DEBUGGING ONLY
  Serial.flush();
  //delay(1000);
  Serial.println(F("3.4"));
  Serial.flush();
  //delay(1000);
  digitalWrite(Cam_Enable,HIGH);          //  WRITE ENABLE PIN HIGH
  Serial.println(F("3.5"));
  Serial.flush();
  //delay(1000);

Here is the Serial output

Regular Timer Reading: 299 Difference: 0

Regular Timer Reading: 299 Difference: 0

Regular Timer Reading: 53 Difference: 246 - TRIGGER
Current Mode is: 1
2
2.1
Starting Camera
3.4

-------------- Starting Node ------------------
Manufacturing test: 7
Operating Mode: 1
NodeID: 3
NetID: 23,24,25,26,
Serial Number: 2,3,4,5,
Voltage Adjustment: 198.22
Temperature Adjustment: 166.00
Timer interval: 0:10
Alarm time is: 17:20

The camera being enabled or energized has not been eliminated as the source of the failure. Can you run the code without the camera physically attached?

Charlie1985: Here is the exact spot in the code where it dies,

That does not seem to be the case. There is a lot of stuff printed after the printing in your code snippet.

...R

Unfortunately I cannot physically isolate the camera from the mcu in this design as they are on a custom board that I built and the connections are made on the PCB. Though when I get back to Arizona to my fabrication lab I can isolate the camera with the development boards that I built.

Robin2 - I apologize that was my miscommunication and my fault for being unclear. I understand about there being a lot of prints after that, but the last print before the crash is "3.4".

The line where it says

-------------- Starting Node ------------------

is after it crashes and reboots. That statement is printed by the very first lines in void setup().

void setup(){
  Serial.begin(9600);
  Serial.println(F("--------------  Starting Node  ------------------"));

Does anyone think that it could possibly be due to an unhandled interrupt on Serial1? Serial1 is how the camera communicates with the MCU.