Program bug after >2 days

I have a project that I finished a couple months ago that reads 433MHz RF, decodes the signal, and outputs data to MQTT for a variety of sensors. The code is running on an Arduino Uno + Ethernet shield + Superheterodyne 433MHz receiver module. The code is based on the ookDecode framework (a finite state machine) with different header files for each sensor type. I’m currently using 3 sensors:

  • Acurite 5in1 weather station
  • Acurite 592TX temperature sensor
  • Blueline power meter reader

All functions as expected when initially started as I can watch the MQTT traffic arriving at the broker. The issue is that after approximately 2 days the Blueline data ceases. I know the sensor is transmitting as I have the receiver that came with it and can see data is being updated. If I press the reset button on the Arduino, the Blueline data immediately starts again.

The only thing I can think of is a workaround and not a good one. If I add something in the code to reset the Arduino every 2 days, it might eliminate the issue. BUT, that seems like a really shoddy way to deal with this. It also causes some other potential issues as the weather data may need to be stored to output cumulative values (not currently being done, but a possible future change).

I’m at a loss as far as how to debug this. If the program failed within a couple minutes, I could simply add some debug statements and run my code connected to the PC so I could see the serial output. I suppose I could use MQTT as a means to debug, but >2 days between debug attempts seems like it could take a year to fix. I was hoping that someone could either glance at my code or give some advice on how to go about fixing an issue that doesn’t pop up for such a long delay. Not knowing what is causing the data stream to stop being processed makes it difficult for me to shorten the debug cycle.

ookDecoder-ben8.zip (12.4 KB)

You might get help if you post your code in code tags. That way people who don't download stuff and people with phones and tablets that don't have the IDE handy can see your code.

Is 2 days about the limit on the millisecond or microsecond timers before it wraps around to zero again?

Another possible candidate is memory fragmentation if you're using String objects.

You might want to check the amount of free RAM to make sure that it stays constant and does not decrease over time. If it keeps shrinking, you might have a memory leak.

Just print the value to serial every 15 minutes or so.

If you don't have a serial connection available you can just record these values into EEPROM and write a sketch that reads them. Install this sketch after the crash.

You could also add a "heartbeat" that blinks the LED at the start of loop(). If that stops blinking when the data stops you know that there was a bad crash that is totally messing up the code.
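A minimal sketch of such a heartbeat, assuming you toggle at a fixed period rather than on every pass through loop() so the blink stays visible. The `Heartbeat` struct and its field names are made up for illustration and are not part of the ookDecoder code:

```cpp
#include <stdint.h>

// Hypothetical helper: tracks when the heartbeat LED should change state.
struct Heartbeat {
    uint32_t lastToggleMs;  // time of the last toggle, in milliseconds
    uint32_t periodMs;      // how long the LED stays in each state
    bool ledOn;             // state the LED should currently be in

    // Call once per loop() with the current millis() value.
    // Returns true when the LED state changed on this call.
    bool update(uint32_t nowMs) {
        // Unsigned subtraction keeps working across a millis() rollover.
        if ((uint32_t)(nowMs - lastToggleMs) < periodMs) {
            return false;
        }
        lastToggleMs = nowMs;
        ledOn = !ledOn;
        return true;
    }
};

// In a sketch this might be used as (assumed wiring, LED on pin 13):
//   Heartbeat hb = {0, 500, false};
//   void loop() {
//       if (hb.update(millis())) digitalWrite(13, hb.ledOn ? HIGH : LOW);
//       ...
//   }
```

If the LED freezes in either state while the Blueline data is missing, loop() is no longer running; if it keeps blinking, the sketch is alive and the problem is in the decoding path.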

millis() rolls over every 49+ days. micros() rolls over every 71+ minutes.
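Those figures follow from the 32-bit counters: 2^32 ms is about 49.7 days and 2^32 µs is about 71.6 minutes, so neither lines up with a 2-day failure on its own. A rollover can still cause trouble if the code compares timestamps directly. Here is a small sketch of the usual rollover-safe comparison; the function name `intervalElapsed` is invented for this example:

```cpp
#include <stdint.h>

// Returns true once at least `interval` ticks have passed since `start`.
// Works across a counter rollover because unsigned subtraction wraps
// modulo 2^32, so (now - start) is the true elapsed time as long as the
// real elapsed time is less than one full counter period.
bool intervalElapsed(uint32_t now, uint32_t start, uint32_t interval) {
    return (uint32_t)(now - start) >= interval;
}

// Fragile form to look for in existing code (breaks when start + interval
// overflows):
//   if (now >= start + interval) { ... }
```

Any place in the decoder that does arithmetic on micros() values is worth checking against this pattern, since micros() wraps many times in 2 days.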

KeithRB: You could also add a "heartbeat" that blinks the LED at the start of loop(). If that stops blinking when the data stops you know that there was a bad crash that is totally messing up the code.

I like this idea. Particularly if you have an LED segment display, you can blink one of the decimals every few seconds. It is unnoticeable to a user, but to someone debugging, if it stops you know something happened.

I don't currently have serial available, but I could use MQTT in the same way I suppose. Blinking an LED might work, though I don't currently have any hooked up. I am still using a protoboard, so adding debugging LEDs is an option. The setup is currently hanging from the ceiling off the LAN and power wires, so if I add too much weight, I'll have to add a temporary platform of some kind.

I'll post the code in tags if that helps. I didn't want to make a massive post and annoy people with phones, but I see your point about downloading and the IDE.

If you have the program space, you could use some SPI debugging.
I have used the Bubblelicious board Crossroads offers as debugging hardware and SPI to display debugging output.

http://www.crossroadsfencing.com/BobuinoRev17/

This is more or less a curiosity question. Does the process between turning LED 13 on and off last long enough to actually see LED 13 flashing? And how do you tell which process is doing the ON/OFF business correctly? They all operate LED 13 in the same fashion.

Is there a possibility for "packet" to overrun the buffer if the buffer is in use? It is not apparent from the code. Do any of the functions return "success/failure" values? So it really does not reset, it just stops collecting Blueline data. And if the processes take a known (estimated) time, why not use the WDT? Maybe you could collect some trace data before it stops. Just a thought.

It depends on how long loop() takes. I guess it would have been better to say “Toggle” the LED, rather than blink it. When the program crashes, you have three states:

  • dim or flashing LED (depending on how long loop() takes): loop() is still being called
  • LED off or LED on: the program has stopped or is stuck in an infinite loop

The LED stays on for a fraction of a second, which in itself would not be long enough to do much good. However, because it flashes on/off for each packet and the different sensors have different length packets, different numbers of them, and they repeat at different intervals, it is possible to see a somewhat distinct flash pattern to tell if the Blueline has been received (for instance). If there is a lot of noise on 433MHz, it does become more difficult to discern the difference between noise and packets though. Either way, the LED is not a great tool for debugging (in its current implementation), just a way to see that RF is coming in.

I went to add the code to the original message and found that there is a limit of 9000 characters per post. My code has 35k characters total and a couple files are over 9k each, so I'm not sure if there's a good way to post the code directly.

bkenobi: I went to add the code to the original message and found that there is a limit of 9000 characters per post. My code has 35k characters total and a couple files are over 9k each, so I'm not sure if there's a good way to post the code directly.

Use attachments.

bkenobi:
The issue is that after approximately 2 days the Blueline data ceases. I know the sensor is transmitting as I have the receiver that came with it and can see data is being updated. If I press the reset button on the Arduino, the Blueline data immediately starts again.

You posted only an excerpt of the total code (e.g. “PubSubClient.h” seems to be missing), but your code contains typical beginner mistakes that can cause buffer overflows, such as:

void reportSerial (const char* s, class DecodeOOK& decoder) {
    byte pos;  // this variable is not initialized to any value
...
    for (byte i = 0; i < pos; ++i) {  // then this variable is used as the stop condition in a for-loop
...

I’d check all of the code for possible accesses to uninitialized variables and for buffer overflows.

Besides that, I’d replace all usage of “String” objects like in file Blueline.h:

     debug = String(String(++i,DEC) + "/" + String(width,DEC));

with normal null-terminated strings (char arrays).
Using “String” objects of different sizes, dynamically created at runtime, tends to eat up RAM over time. At least check the available RAM while the program is running for debugging purposes.
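As a rough illustration of that replacement, here is one way the "i/width" debug text could be built in a fixed char array with snprintf(). The function name `formatCounter` and the buffer sizes are assumptions for this sketch, not code from Blueline.h:

```cpp
#include <stdio.h>

// Hypothetical helper: writes "<index>/<width>" into a caller-supplied
// buffer. No heap allocation takes place, so repeated calls cannot
// fragment RAM the way temporary String objects can.
// Returns the number of characters the full text needs (excluding the
// terminating '\0'); a value >= size means the output was truncated.
int formatCounter(char* buf, size_t size, unsigned int index, unsigned int width) {
    return snprintf(buf, size, "%u/%u", index, width);
}

// Possible use in place of the String expression:
//   char debug[16];
//   formatCounter(debug, sizeof debug, ++i, width);
```

snprintf() never writes more than `size` bytes, so an undersized buffer produces truncated text rather than a buffer overflow.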

Sorry for the delay, I didn’t get the notification email so I thought this thread was dead.

odometer:
Use attachments.

The first post contains the full code.

jurs:
You posted only an excerpt of the total code (e.g. “PubSubClient.h” seems to be missing), …

PubSubClient.h is an MQTT library available on GitHub that I assumed was standard. I have it installed with the other libraries so I didn’t think to include it with my code. Here’s a link:

jurs:
but your code contains typical beginner mistakes that can cause buffer overflows, such as:

void reportSerial (const char* s, class DecodeOOK& decoder) {
    byte pos;  // this variable is not initialized to any value
...
    for (byte i = 0; i < pos; ++i) {  // then this variable is used as the stop condition in a for-loop
...

I’d check all of the code for possible accesses to uninitialized variables and for buffer overflows.

Good catch. I didn’t write that code, it was part of the framework that came with ookDecoder. I believe the original author is jeelabs, but I could be wrong.
http://jeelabs.net/projects/cafe/wiki/Decoding_the_Oregon_Scientific_V2_protocol

I was using a combination of their code and this repository:

There is one line that you didn’t show in your code that’s important:

void reportSerial (const char* s, class DecodeOOK& decoder) {
    byte pos;
    const byte* data = decoder.getData(pos);
...

and the code for getData is:

   const byte* getData (byte& count) const {
        count = pos;
        return data;
    }

I’m not great with pointers, but I believe that in the getData function count is a reference to (an alias for) pos in reportSerial, because the parameter is declared as byte&. So, when count is set to the decoder's own pos member inside getData, it is actually assigning the local pos in reportSerial before the for-loop uses it. I was initially confused by that as well, so it took me a few minutes to understand what it was doing.
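To convince myself, the same pattern can be reduced to a small standalone example. The `Decoder` class and `countBytes` function below are simplified stand-ins written for this post, not the real DecodeOOK class; only the getData signature is copied from the framework:

```cpp
#include <stdint.h>

typedef uint8_t byte;  // same meaning as the Arduino byte type

// Simplified stand-in for the decoder: holds a few bytes and a count.
class Decoder {
    byte data[4];
    byte pos;  // number of valid bytes in data
public:
    Decoder() : data(), pos(0) {}

    void push(byte b) {
        if (pos < sizeof data) data[pos++] = b;
    }

    // Same shape as the framework's getData: `count` is a reference,
    // so assigning to it writes into the caller's variable.
    const byte* getData(byte& count) const {
        count = pos;
        return data;
    }
};

// Mirrors the start of reportSerial: the local is declared without a
// value, but getData assigns it before it is read.
byte countBytes(const Decoder& decoder) {
    byte pos;
    decoder.getData(pos);
    return pos;
}
```

So the local pos is uninitialized only until the getData call on the next line, and the loop bound is valid by the time the for-loop runs.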

jurs:
Besides that, I’d replace all usage of “String” objects like in file Blueline.h:

     debug = String(String(++i,DEC) + "/" + String(width,DEC));

with normal null-terminated strings (char arrays).
Using “String” objects of different sizes, dynamically created at runtime, tends to eat up RAM over time. At least check the available RAM while the program is running for debugging purposes.

My first attempt at debugging an early version of this code was to write data to serial. I quickly learned that writing to serial takes too long and causes issues when trying to receive RF. I changed the code to build a debug string using String() calls and things worked a little better, but I was still having some issues. I posted a question here and was guided to use char arrays with the sprintf() approach, and that worked MUCH better. I stripped all of the debug statements out except for 3 missed locations (the one you found and the definition char debug[100]; in Acurite592TX.h).

The peculiar thing is that after a power failure last weekend, I reset the Arduino a couple times to get it up and running (sometimes it doesn’t start up correctly for a currently unknown reason) and it’s been running fine for 5 days. The picture shows output for the last couple weeks. The straight line between 9/21 and 9/24 is the last time I had no data. I reset on 9/24 and it worked until the power failure on 9/25. I reset again in the morning of 9/26 and it’s been working since. I see no reason why it started working so I assume it will stop working at some point soon.

blueline sensor.JPG