Problems with code getting stuck in a loop somewhere

The problem: every once in a while, my code gets stuck in a loop and needs a hardware reset to get out. It happens once every 20 minutes to 2 hours on one machine, and more like once every 8 hours on another. It happens with no user inputs changing - the unit can be sitting idle, just running the main loop.

The background: I have a large project (about 60K of code) that I migrated from an Arduino prototype. I use the same I/O pins so that the libraries I used in prototyping remain useful. I am using:
Mega1280 processor
SD card following the SD shield design
Wiz811MJ ethernet card
16x2 LCD display
7 NKK smartswitches (SPI interface for the display side)
Encoder generating interrupts

This project started with Arduino 15 or so and I have migrated to 1.0 recently. This is a low-volume (about 50 units) production job. I use MsTimer2 for shorter times (75 ms, variable, for encoder sampling) and TimerOne for a 100 ms timing clock.

The application uses a telnet client with the ethernet library to connect to a camera that is controlled with ASCII strings and answers in ASCII. I use EEPROM to store setup parameters. Aside from standard communications, I do a handshake with the camera about every 10 seconds to be sure the connection is good. Because I have to buffer the smartswitch images (I read them from the SD card), I've moved all my serial print strings to program memory using macros I found here prior to 1.0. At some point I may change to the newer Serial.print in 1.0. The ethernet transmit and receive use circular buffers and are independent of each other (I don't wait for answers).
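Since a buffer overflow is one of the usual suspects for this kind of hang, here is a minimal sketch of a bounds-checked circular buffer of the sort described above. It is not the code from this project: the class name RingBuffer and its methods are made up for illustration. The point is only that push() refuses to write when the buffer is full instead of running over unread data.

```cpp
#include <stdint.h>
#include <stddef.h>

// Fixed-size byte ring buffer (hypothetical helper, not from the project).
// One slot is always left empty so that head == tail can only mean "empty".
// Usable capacity is therefore N - 1 bytes.
template <size_t N>
class RingBuffer {
public:
    RingBuffer() : head_(0), tail_(0) {}

    // Store one byte. Returns false and drops the byte when the buffer
    // is full, rather than overwriting data that has not been read yet.
    bool push(uint8_t b) {
        size_t next = (head_ + 1) % N;
        if (next == tail_) return false;   // full
        data_[head_] = b;
        head_ = next;
        return true;
    }

    // Fetch the oldest byte. Returns false when there is nothing to read.
    bool pop(uint8_t &out) {
        if (head_ == tail_) return false;  // empty
        out = data_[tail_];
        tail_ = (tail_ + 1) % N;
        return true;
    }

    // Number of bytes currently waiting to be read.
    size_t size() const { return (head_ + N - tail_) % N; }

private:
    uint8_t data_[N];
    size_t head_;   // next slot to write
    size_t tail_;   // next slot to read
};
```

Counting how often push() returns false (and printing that count now and then) is a cheap way to find out whether either buffer ever fills up in the field.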

I run a free memory test occasionally and I am above 1K free. I applied the patch to the ethernet library (Issue 605) that is supposed to cure hanging in a loop.

I added a kind of reverse watchdog timer: my interrupt routine increments a counter, and if the counter exceeds 10 seconds' worth of counting, the interrupt routine causes a hard reset. The main loop of the software resets the counter and occasionally saves the time since the last reset. I did it this way so that the bootloader doesn't have problems with the WDT. This does reset the unit when it is stuck, so I know the interrupt routine is still running, and I know how long it has been since the last problem. This is why I suspect the main code is stuck in a loop.
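For readers who want to try the same trick, the scheme described above could be sketched roughly like this. All names here are hypothetical, the 100 ms tick is assumed from the TimerOne clock mentioned earlier, and forceReset() is only a stand-in that sets a flag so the logic can be checked off the board; on the real hardware it would be whatever triggers the hard reset.

```cpp
#include <stdint.h>

// Assumption: the timer interrupt fires every 100 ms, i.e. 10 ticks per second.
const uint16_t TICKS_PER_SECOND = 10;
const uint16_t TIMEOUT_TICKS    = 10 * TICKS_PER_SECOND;  // 10 seconds

volatile uint16_t stallTicks     = 0;      // ticks since the main loop last checked in
volatile bool     resetRequested = false;  // stand-in for the real reset line

// Placeholder: on the real board this would force the hardware reset.
void forceReset() { resetRequested = true; }

// Call from the timer interrupt routine on every tick.
void onTimerTick() {
    stallTicks = stallTicks + 1;
    if (stallTicks > TIMEOUT_TICKS) {
        forceReset();
    }
}

// Call once per pass through the main loop.
// On an 8-bit AVR a 16-bit store is not atomic, so on the board this
// store would be wrapped in noInterrupts() / interrupts().
void kickWatchdog() {
    stallTicks = 0;
}
```

Because the reset fires only when the main loop stops calling kickWatchdog() while the timer interrupt keeps ticking, a reset from this mechanism points at the foreground code being stuck rather than at a full processor lockup.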

I need advice on how to track this down - resetting is a temporary fix. I need it to keep running. Any creative thinking, either on this forum or PM would be greatly appreciated. I don't know how to continue. Thanks, -Steve

Very hard to say without seeing your code, although at 60k's worth it's likely too much to analyze anyway. Sounds like a buffer overflow somewhere. What is the difference between the two machines where it performs differently?

I have posted here several times about a problem I had with my wifi shield - it choked on large packets. My ISP periodically broadcasts them and over time the resulting overflows would crash my sketch. It's a bit of a longshot, but can you try running it on an isolated network?

For the most part, it is isolated - a direct connection between the camera and the controller. In some setups we do go through a switch. On my proto system I go through a switch for testing but have also run direct. The difference is not setup related - the boards are the same, the cameras are the same model. My proto is not in a box but nothing seems to be getting hot.

The times to reset vary. The last test had resets at 22 minutes, 27 minutes and 47 minutes.

When Windoze telnet is used, the cameras seem very well behaved - of course manual telnet sessions don't query the camera every few seconds for hours.

I was hoping someone would have had a problem with one of the libraries - I thought the 605 bug in the ethernet library would have solved it since it described the problem well.

Any thoughts on how to discover where it might be stuck? Thanks, Steve

The only thing I can suggest is a whole bunch of serial prints. Find out the last thing that happened before the crash, then pepper that area with prints until you binary search your way to the problem area. The only disadvantage is that it may change the program's behavior. If you have spare pins, you can achieve the same thing by lighting LEDs, which is less invasive but takes longer to zero in, unless you have lots of LEDs. Oh, I see you have an LCD and an SD card. Send your 'here I am' messages to either or both.

Since I don't want to influence the operation too much (mess with the user displays) and most of the errors seem to occur when the controller is in use, I want the test to be minimally invasive. You did give me an idea though. Just like I log the time to EEPROM, I can have a set of flags that are set to one when I call a library routine and set to 0 when I return, stored in EEPROM. When I reboot, I immediately display all the flags on the serial port. If I get a reset and a flag is set to 1, that is the offending routine.
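A rough sketch of that flag idea follows. The routine names, FLAG_BASE and the helper functions are all invented for illustration. FakeEEPROM is a host-side stand-in so the logic can be compiled and checked off the board; on the Arduino you would drop it, include EEPROM.h, and use the library's EEPROM.read() and EEPROM.write() with the same calls.

```cpp
#include <stdint.h>

// Host-side stand-in for the Arduino EEPROM object (assumption for testing).
struct FakeEEPROM {
    uint8_t cells[64];
    unsigned long writes;   // counts writes, useful for judging wear
    FakeEEPROM() : writes(0) {
        for (int i = 0; i < 64; ++i) cells[i] = 0xFF;  // erased EEPROM reads 0xFF
    }
    uint8_t read(int addr) const { return cells[addr]; }
    void write(int addr, uint8_t v) { cells[addr] = v; ++writes; }
};
FakeEEPROM EEPROM;

// Hypothetical layout: one flag byte per library routine being watched.
const int FLAG_BASE = 32;
enum Routine { R_ETHERNET_SEND = 0, R_ETHERNET_RECV, R_SD_READ, R_LCD_WRITE, R_COUNT };

// Set the flag just before calling into the library...
void enterRoutine(Routine r) { EEPROM.write(FLAG_BASE + r, 1); }
// ...and clear it as soon as the call returns.
void leaveRoutine(Routine r) { EEPROM.write(FLAG_BASE + r, 0); }

// At boot: return the first routine whose flag is still 1, or R_COUNT
// if every call returned before the reset. Testing for exactly 1 means
// a never-written (0xFF) cell is not reported as stuck.
Routine findStuckRoutine() {
    for (int r = 0; r < R_COUNT; ++r) {
        if (EEPROM.read(FLAG_BASE + r) == 1) return static_cast<Routine>(r);
    }
    return R_COUNT;
}

// After reporting, clear all flags so the next reset starts clean.
void clearFlags() {
    for (int r = 0; r < R_COUNT; ++r) EEPROM.write(FLAG_BASE + r, 0);
}
```

In use, a call would be wrapped as enterRoutine(R_SD_READ); then the SD read; then leaveRoutine(R_SD_READ);, with findStuckRoutine() printed to the serial port first thing in setup(). Note that every wrapped call costs two EEPROM writes, which matters for both wear and timing.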

By the way, the EEPROM_readAnything function is great. I struggled for a long time storing and reading structures with different length variables. -Steve

Just remember that the EEPROM has a finite life - 100,000 write cycles. Depending on how your program is structured, your strategy might exceed that within minutes.

Remember the write limit on EEPROM, although it should be fine for a test. Note that a write takes quite a while, maybe even as long as a short print, depending on the baud rate.
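To put rough numbers on both points, here is a small back-of-the-envelope calculation. The 100,000 cycle rating is the figure quoted above. The write rates you feed in are your own guesses about how often a flag gets written, and the serial timing assumes the usual 8N1 framing (10 bits on the wire per character). The function names are made up for this example.

```cpp
// Seconds until one EEPROM cell reaches its rated endurance,
// given how many times per second that same cell is written.
double secondsToWearOut(double ratedCycles, double writesPerSecond) {
    return ratedCycles / writesPerSecond;
}

// Milliseconds needed to send one character over serial,
// assuming 8N1 framing: start bit + 8 data bits + stop bit = 10 bits.
double serialCharMillis(double baud) {
    return 10.0 * 1000.0 / baud;
}
```

For example, a flag that is set and cleared 50 times a second sees 100 writes per second, so secondsToWearOut(100000.0, 100.0) gives 1000 seconds, or under 17 minutes of rated life for that cell. On the timing side, serialCharMillis(9600) is a little over 1 ms per character. AVR datasheets quote an EEPROM byte write in the low milliseconds (around 3.3 ms is the typical figure, but check the datasheet for your part), so one flag write costs about as much time as printing three or so characters at 9600 baud.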