Testing firmware for long-term installations: what's the best strategy?

Friends,

I'm working on replacing some code for a weather station and awning controller that I built in my back yard. The original version of this was written directly for the Raspberry Pi Pico SDK. However, it had some sort of instability and required a manual reset to keep working after a few days or a week.

I have rewritten the project using the Arduino API, partly because many of the peripherals I'm using for this project already have driver libraries written for Arduino. I figured if I could reduce the amount of my own code in the loop, I'd reduce the risk of bugs introduced by me. :slight_smile:

Anyway, my question here is about testing methodologies. What would you do to test something like a weather station reporting data over MQTT? What I have done so far is I programmed a second raspberry pi pico (again, using the Arduino API) to simulate the input signals to the weatherstation board. I have one pin ticking at various rates that I expect the anemometer to tick at, another ticking like a rain gauge, and a final pin set up to emit PWM signals similar to the windvane I have. I've been running this setup for a few days, periodically kicking the server over so that the setup has to reconnect. So far things have proven reliable.

My question then is: what else would you do before flashing the firmware onto the actual weatherstation? What are the best testing methodologies for an installation that is meant to be installed for many weeks or months without intervention?

Full code: GitHub - jaguilar/weatherstation_arduino

#include <Arduino.h>
#include <FreeRTOS.h>
#include <HADevice.h>
#include <HAMqtt.h>
#include <WiFi.h>
#include <device-types/HASensor.h>
#include <semphr.h>
#include <task.h>
#include <timers.h>

#include <array>
#include <atomic>
#include <cmath>
#include <functional>
#include <string_view>
#include <utility>

WiFiClient client;
HADevice ha_device("wsta");
HAMqtt mqtt(client, ha_device);

constexpr auto kWindDirectionPin = A0;

std::string_view windvane_level_to_direction(int32_t adc_reading) {
  // ADC target levels assume a divider impedance of 3377 ohms, which can be
  // achieved by putting a 5100 ohm and 10000 ohm resistor in parallel.
  constexpr std::array<std::pair<std::string_view, int32_t>, 16> adc_targets{
      // New: I have the divider, then the ADC, then the windvane, then ground.
      {std::make_pair("NE", 2901), std::make_pair("E", 936),
       std::make_pair("SE", 1616), std::make_pair("S", 2195),
       std::make_pair("SW", 3382), std::make_pair("W", 3984),
       std::make_pair("NW", 3893), std::make_pair("N", 3716),
       std::make_pair("NNE", 2705), std::make_pair("ENE", 855),
       std::make_pair("ESE", 693), std::make_pair("SSE", 1204),
       std::make_pair("SSW", 1972), std::make_pair("WSW", 3305),
       std::make_pair("WNW", 3792), std::make_pair("NNW", 3548)}};
  int min_diff = std::numeric_limits<int>::max();
  int min_index = -1;
  for (int i = 0; i < adc_targets.size(); ++i) {
    const int diff = std::abs(adc_targets[i].second - adc_reading);
    if (min_diff > diff) {
      min_diff = diff;
      min_index = i;
    }
  }
  return adc_targets[min_index].first;
}

HASensor ha_windvane("wv");

void sample_windvane(TimerHandle_t) {
  const uint16_t wind_voltage = analogRead(kWindDirectionPin);
  const std::string_view direction = windvane_level_to_direction(wind_voltage);
  ha_windvane.setValue(direction.data());
}

void SetupWindvane() {
  ha_windvane.setName("Windvane");
  ha_windvane.setDeviceClass("enum");
  ha_windvane.setForceUpdate(true);
  ha_windvane.setJsonAttributes(
      "options: \"{{['NE', 'E', 'SE', 'S', 'SW', 'W', 'NW', 'N', 'NNE', "
      "'ENE', 'ESE', 'SSE', 'SSW', 'WSW', 'WNW', 'NNW']}}\" ");

  analogReadResolution(12);
  xTimerStart(xTimerCreate("windvane_reporter", pdMS_TO_TICKS(4997), true,
                           nullptr, sample_windvane),
              portMAX_DELAY);
}

// A rate limited counter. It can be incremented at most once per update_period.
// If an increment is received in less time than the update period from the
// previous, it is a noop. This counter is not thread safe, so anything that
// is flushing the counter needs to ensure that the counter is not being
// incremented at the same time.
struct RateLimitedCounter {
  const int32_t update_period;
  uint32_t next_update = 0;
  int32_t count = 0;

  void Inc() {
    const uint32_t now = millis();
    if (static_cast<int32_t>(next_update - now) > 0) return;
    ++count;
    next_update = now + update_period;
  }

  // Sets the counter value to zero and returns the current value.
  int32_t Flush() { return std::exchange(count, 0); }
};

// Counts each time a pin goes low, subject to a maximum rate. Multiplies
// the count by a scale and reports it to an HASensor periodically.
class ScaledSamplePinCounter {
 public:
  static constexpr int32_t CalcCounterUpdatePeriodMs(float scale,
                                                     float max_value,
                                                     float period_length_s) {
    const float max_ticks_per_period = max_value / scale;
    const float max_ticks_per_s = max_ticks_per_period / period_length_s;
    return 1000 / max_ticks_per_s;
  }

  ScaledSamplePinCounter(int pin, float scale, float max_value, float period_s,
                         HASensor& destination)
      : pin_(pin),
        scale_(scale),
        period_s_(period_s),
        counter_{.update_period =
                     CalcCounterUpdatePeriodMs(scale, max_value, period_s)},
        dest_(destination) {}

  ~ScaledSamplePinCounter() {
    xTimerDelete(flush_timer_, portMAX_DELAY);
    detachInterrupt(pin_);
  }

  // begin() sets up interrupts and a timer to flush the accumulated counter
  // increments to MQTT.
  void begin() {
    // Each time the pin goes low, increment the counter. The counter
    // naturally debounces due to the maximum update frequency.
    pinMode(pin_, INPUT_PULLUP);
    attachInterruptParam(
        pin_,
        +[](void* counter) {
          reinterpret_cast<RateLimitedCounter*>(counter)->Inc();
        },
        FALLING, &counter_);

    // Schedule a timer to update the sensor's value with the counter's value.
    flush_timer_ = xTimerCreate(
        "counter_flusher", pdMS_TO_TICKS(period_s_ * 1000), true, this,
        +[](TimerHandle_t timer) {
          reinterpret_cast<ScaledSamplePinCounter*>(pvTimerGetTimerID(timer))
              ->FlushCounterValueToSensor();
        });
    xTimerStart(flush_timer_, portMAX_DELAY);
  }

  void FlushCounterValueToSensor() {
    int32_t ticks;
    taskENTER_CRITICAL();
    ticks = counter_.Flush();
    taskEXIT_CRITICAL();
    const float value = ticks * scale_;
    char buffer[32];
    configASSERT(snprintf(buffer, sizeof(buffer), "%.3f", value) < 32);
    dest_.setValue(buffer);
  }

  const int pin_;
  const float scale_;
  const float period_s_;
  RateLimitedCounter counter_;
  TimerHandle_t flush_timer_;
  HASensor& dest_;
};

HASensor anemometer("an");

constexpr int kWindPin = 14;
// I.e. if the anemometer ticks once per second then the windspeed is 1.73
// mph. We (falsely) assume a linear relationship between ticks and speed,
// and therefore compute the average simply by scaling the speed down by the
// number of seconds over which we're reporting.
constexpr float kWindSpeedPerTickPerSecond = 1.73;  // mph
constexpr float kWindReportPeriodS = 5;
constexpr float kWindScale = kWindSpeedPerTickPerSecond / kWindReportPeriodS;
constexpr float kWindMaxSpeed = 100;

ScaledSamplePinCounter wind_counter(kWindPin, kWindScale, kWindMaxSpeed,
                                    kWindReportPeriodS, anemometer);

void SetupAnemometer() {
  anemometer.setName("Anemometer");
  anemometer.setUnitOfMeasurement("mph");
  anemometer.setDeviceClass("wind_speed");
  anemometer.setStateClass("measurement");
  anemometer.setForceUpdate(true);
  wind_counter.begin();
}

HASensor rain_sensor("rg");

// In the case of the rain gauge, the total amount of rain is simply the
// number of times the gauge has ticked.
constexpr int kRainPin = 15;
constexpr float kRainSpeedPerTickPerSecond = 0.011;  // inches of rain
constexpr float kRainReportPeriodS = 15 * 60;
constexpr float kRainScale = 1.;
// We will limit our report rate to 12 inches per hour.
// If we get more rain than that the weatherstation will certainly not be
// around any more to report it.
constexpr float kRainMaxValue = 12. / 4;

ScaledSamplePinCounter rain_counter(kRainPin, kRainScale, kRainMaxValue,
                                    kRainReportPeriodS, rain_sensor);

void SetupRainGauge() {
  rain_sensor.setName("Rainfall");
  rain_sensor.setUnitOfMeasurement("in");
  rain_sensor.setDeviceClass("precipitation");
  rain_sensor.setStateClass("total_increasing");
  rain_sensor.setForceUpdate(true);
  rain_counter.begin();
}

#define SSTR(s) (#s)
#define STR(s) SSTR(s)

void setup() {
  ha_device.setName("Weatherstation");
  ha_device.enableSharedAvailability();
  ha_device.enableLastWill();
  ha_device.setAvailability(true);
  ha_device.enableExtendedUniqueIds();

  Serial1.begin();
  WiFi.setHostname("weatherstation");
  Serial1.println("WiFi.begin");
  WiFi.begin(STR(WIFI_SSID), STR(WIFI_PASSWORD));

  while (WiFi.status() != WL_CONNECTED) {
    Serial1.printf("Waiting for connection to %s\n", STR(WIFI_SSID));
    delay(1000);
  }

  Serial1.println("WiFi ready");

  SetupWindvane();
  SetupRainGauge();
  SetupAnemometer();

  Serial1.printf("mqtt.begin %s\n", STR(MQTT_HOSTNAME));
  mqtt.begin(STR(MQTT_HOSTNAME), STR(MQTT_USER), STR(MQTT_PASSWORD));

  static auto mqtt_lock = xSemaphoreCreateMutex();
  xTaskCreate(
      [](void*) {
        vTaskCoreAffinitySet(xTaskGetCurrentTaskHandle(), 0b1);
        while (true) {
          xSemaphoreTake(mqtt_lock, portMAX_DELAY);
          mqtt.loop();
          xSemaphoreGive(mqtt_lock);
          delay(5);
        }
      },
      "mqtt_loop", 1024, nullptr, 1, nullptr);

  xSemaphoreTake(mqtt_lock, portMAX_DELAY);
  while (!mqtt.isConnected()) {
    xSemaphoreGive(mqtt_lock);
    delay(1000);
    xSemaphoreTake(mqtt_lock, portMAX_DELAY);
  }
  xSemaphoreGive(mqtt_lock);
  Serial1.printf("MQTT up\n");
};

void loop() { delay(10000); }

Do you think this may be caused by a memory error or a misbehaved pointer? An integer overflow, maybe?

I can only guess that without looking at your code.

It could be any type of bug on my part. I found it extremely difficult to debug because whenever I would try to attach to the board, it would often restart, leading to a loss of the relevant debugging state.

I'm more looking for advice about what general sorts of strategies people engage in to test long-term installations, rather than help debugging the old and busted version of my code, which is not even written against the Arduino API. :slight_smile:

Best to have other people install your project and get it to run. Rely on their feedback!

A great idea, but a big ask in this case since my project has limited applicability. I will certainly consider it if I design something that has a wider use case! :slight_smile:

I have applied the following strategy to β€œ accelerate time” with some success in the past. Something like this might help you find a bug that occurs after long running times.

const unsigned long warpFactor = 1000;// time acceleration factor 
#define myMillis (millis*warpFactor)
#define millis myMillis

You have to look into your code looking for literal constants such as β€œ1000” and others that may interfere with warping code execution this way.

My bet is on mixing floats and integers, and in comparing time-related large integer values with calculations within your code

Ooo, that's a smart trick. I'll give it a crack.

Check that the network and MQTT error handlers have actually been created and do something sensible and never give up trying to reconnect, don't leak memory , and properly rate limit themselves.

Recently I discovered by accident that network stack in my router gets confused if too many WiFi connection attempts are made in a short period of time. The router eventually stops accepting new connections with the only fix being a reboot. There's no long term fix other than to accept the limitiation and work around it.

1 Like

Yeah I should do something like turn off the server for ten minutes and see that I can still reconnect after that! Good idea.