Help identifying program hang ESP32

Hi,

I have built a simple circuit to upgrade my standard doorbell so that it sends an MQTT message when it is rung. It takes the 18 V AC that is present when the doorbell is pressed, drops the voltage, and passes it through a diode to an optocoupler and then on to an ESP32 board. I am fairly sure (but I could be wrong) that the hardware side is fine, but I have an issue with the software I have written.

The doorbell works wonderfully, but about every 24 hours (it could be more frequent) it locks up and requires me to restart the ESP board, after which it continues to work fine.

I have worked around this by using a watchdog interrupt which reboots the board if loop() has not run for more than 6 seconds. This is not ideal, as it does not get to the heart of the problem. I was wondering if anyone can help me understand why this may be locking up and what I can do to resolve it.

Thank you.

Schematic

Here is the schematic for the hardware (the Wemos D1 mini is an ESP32 dev board now, but the schematic is effectively the same).

Environment

It sits above a consumer unit (switch board); I don’t know if that would have any impact.

Programming Environment

I am using PlatformIO inside VS Code on a Mac with the latest official ESP32 library.

Code

Here is the main.cpp, showing the code that handles the interrupts and the door press. I also pull in some standard files I use to set up WiFi, OTA and MQTT. I have not included these but can do so if needed (they are almost identical to those in my other projects, which are working fine).

#include <Arduino.h>
#include <config.h>
#include <connections.h>
#include "esp_system.h"

config cfg;
const int wdtTimeout = 6000;  //time in ms to trigger the watchdog
hw_timer_t *timer = NULL;

void IRAM_ATTR resetModule() {
  ets_printf("watchdog reboot\n");
  esp_restart();
}

const unsigned long debounceInterval = 1000;
unsigned long currentTime;

struct Doorbell{
  const uint8_t PIN;
  bool pressed;
  unsigned long lastRung;
};

Doorbell frontDoor = {34, false, 0};

void IRAM_ATTR rintEvent(){
  if(currentTime - frontDoor.lastRung >= debounceInterval){
    frontDoor.pressed = true;
    frontDoor.lastRung = currentTime;
  }
}

void setup() {
  pinMode(frontDoor.PIN, INPUT);
  attachInterrupt(frontDoor.PIN, rintEvent, FALLING);

  Serial.begin(115200);
  cfg.load();

  setupWiFi(cfg);
  setupOTA(cfg);
  setupMQTT(cfg);

  timer = timerBegin(0, 80, true);                  //timer 0, div 80
  timerAttachInterrupt(timer, &resetModule, true);  //attach callback
  timerAlarmWrite(timer, wdtTimeout * 1000, false); //set time in us
  timerAlarmEnable(timer);                          //enable interrupt
}

void loop() {
  timerWrite(timer, 0); //reset timer (feed watchdog)
  currentTime = millis();
  ArduinoOTA.handle();
  if (!MQTTclient.connected())
  {
    connectMQTT(cfg);
  }

  if(frontDoor.pressed){
    Serial.println("RING");
    publishDoorbellRing(cfg);
    frontDoor.pressed = false;
  }

  MQTTclient.loop();
}
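
For anyone checking the watchdog numbers above, here is a small sketch of the arithmetic behind the divider of 80 and the alarm value of wdtTimeout * 1000. It assumes the timer is clocked from the ESP32's 80 MHz APB clock; the constant and function names are made up for illustration and are not part of any library.

```cpp
#include <cstdint>

// Assumption: the hardware timer is fed from the 80 MHz APB clock.
constexpr uint32_t kApbClockHz = 80000000;

// Tick rate of the timer for a given prescaler (divider).
constexpr uint32_t timerTickHz(uint32_t divider) {
  return kApbClockHz / divider;
}

// Number of timer ticks that make up a timeout given in milliseconds.
constexpr uint64_t alarmTicks(uint32_t timeoutMs, uint32_t divider) {
  return static_cast<uint64_t>(timeoutMs) * timerTickHz(divider) / 1000;
}

// With a divider of 80 the timer counts in microseconds, so a 6000 ms
// timeout corresponds to an alarm value of 6,000,000 ticks.
static_assert(timerTickHz(80) == 1000000, "divider 80 gives a 1 MHz tick");
static_assert(alarmTicks(6000, 80) == 6000000, "6 s equals 6,000,000 ticks");
```

Under that assumption the alarm value matches wdtTimeout * 1000, so the timer fires roughly 6 seconds after the last timerWrite(timer, 0).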

Code - Without Watchdog

Here is a code version (as requested) without the watchdog code.

#include <Arduino.h>
#include <config.h>
#include <connections.h>

config cfg; 

const unsigned long debounceInterval = 1000;
unsigned long currentTime;

struct Doorbell{
  const uint8_t PIN;
  bool pressed;
  unsigned long lastRung;
};

Doorbell frontDoor = {34, false, 0};

void IRAM_ATTR rintEvent(){
  if(currentTime - frontDoor.lastRung >= debounceInterval){
    frontDoor.pressed = true;
    frontDoor.lastRung = currentTime;
  }
}

void setup() {
  pinMode(frontDoor.PIN, INPUT);
  attachInterrupt(frontDoor.PIN, rintEvent, FALLING);

  Serial.begin(115200);
  cfg.load(); //Load config (from SPIFFS) and create config class

  setupWiFi(cfg); //WiFi Setup 
  setupOTA(cfg);  //OTA Setup
  setupMQTT(cfg); //MQTT Setup
}

void loop() {
  currentTime = millis();
  ArduinoOTA.handle();
  if (!MQTTclient.connected())
  {
    connectMQTT(cfg);
  }

  if(frontDoor.pressed){
    Serial.println("RING");
    publishDoorbellRing(cfg);
    frontDoor.pressed = false;
  }

  MQTTclient.loop();
}

To make it easier to concentrate on the lock up, can you post the code that does not have the watchdog code? The implementation is already quite complex.

Certainly, I have updated my original post to include a version with the watchdog code removed. My apologies if it made it overly complex.

Thanks, that is easier to read. Am I right that the call below is essentially unrelated to the sketch's operation? Can you remove it for testing?

  ArduinoOTA.handle();

I would start with a divide-and-conquer strategy: partition out the functionality piece by piece. Can you remove the MQTT stuff? If it doesn't crash without it, then you know where to focus for the next steps. That's just an example.

The tricky thing is, sometimes it is a hardware problem not a software problem. It's best to keep both forks of the road in mind, until you have some evidence.

I ran a WDT till I sorted out my issues.

The keep-alive loop() is supposed to do two things: (1) determine the disconnect status and (2) cause the callback to trigger.

void MQTTkeepalive( void *pvParameters )
{
  // setting must be set before a mqtt connection is made
  MQTTclient.setKeepAlive( 90 ); // setting keep alive to 90 seconds
  for (;;)
  {
    if ( (wifiClient.connected()) && (WiFi.status() == WL_CONNECTED) )
    {
      xSemaphoreTake( sema_MQTT_KeepAlive, portMAX_DELAY ); //
      MQTTclient.loop();
      xSemaphoreGive( sema_MQTT_KeepAlive );
    }
    else {
      // Print the status values as integers; a String object cannot be passed to a printf-style %s
      log_i( "MQTT keep alive found MQTT status %d WiFi status %d", (int) wifiClient.connected(), (int) WiFi.status() );
      if ( !(WiFi.status() == WL_CONNECTED) )
      {
        connectToWiFi();
      }
      connectToMQTT();
    }
    vTaskDelay( 250 );
  }
  vTaskDelete ( NULL );
}

The above is how I am running the keep-alive loop(). The errors I experienced the most that cause a ‘disconnect’ while “everything” looks OK are 208 and 104. OK, I think it's 208; it's been a while since I had one. The biggest factors in maintaining a continuous connection are the setting MQTTclient.setKeepAlive( 90 ); // setting keep alive to 90 seconds and how often MQTTclient.loop() runs. The keep-alive loop does not do as good a job of maintaining a connection as it could.

This WiFi callback is very useful for troubleshooting:

void WiFiEvent(WiFiEvent_t event)
{
  // log_i( "[WiFi-event] event: %d\n", event );
  switch (event) {
    //    case SYSTEM_EVENT_WIFI_READY:
    //      log_i("WiFi interface ready");
    //      break;
    //    case SYSTEM_EVENT_SCAN_DONE:
    //      log_i("Completed scan for access points");
    //      break;
    //    case SYSTEM_EVENT_STA_START:
    //      log_i("WiFi client started");
    //      break;
    //    case SYSTEM_EVENT_STA_STOP:
    //      log_i("WiFi clients stopped");
    //      break;
    case SYSTEM_EVENT_STA_CONNECTED:
      log_i("Connected to access point");
      break;
    case SYSTEM_EVENT_STA_DISCONNECTED:
      log_i("Disconnected from WiFi access point");
      break;
    //    case SYSTEM_EVENT_STA_AUTHMODE_CHANGE:
    //      log_i("Authentication mode of access point has changed");
    //      break;
    //    case SYSTEM_EVENT_STA_GOT_IP:
    //      log_i ("Obtained IP address: %s",  WiFi.localIP().toString().c_str() );
    //      break;
    //    case SYSTEM_EVENT_STA_LOST_IP:
    //      log_i("Lost IP address and IP address is reset to 0");
    //      //      vTaskDelay( 5000 );
    //      //      ESP.restart();
    //      break;
    //    case SYSTEM_EVENT_STA_WPS_ER_SUCCESS:
    //      log_i("WiFi Protected Setup (WPS): succeeded in enrollee mode");
    //      break;
    //    case SYSTEM_EVENT_STA_WPS_ER_FAILED:
    //      log_i("WiFi Protected Setup (WPS): failed in enrollee mode");
    //      //      ESP.restart();
    //      break;
    //    case SYSTEM_EVENT_STA_WPS_ER_TIMEOUT:
    //      log_i("WiFi Protected Setup (WPS): timeout in enrollee mode");
    //      break;
    //    case SYSTEM_EVENT_STA_WPS_ER_PIN:
    //      log_i("WiFi Protected Setup (WPS): pin code in enrollee mode");
    //      break;
    //    case SYSTEM_EVENT_AP_START:
    //      log_i("WiFi access point started");
    //      break;
    //    case SYSTEM_EVENT_AP_STOP:
    //      log_i("WiFi access point  stopped");
    //      //      WiFi.mode(WIFI_OFF);
    //      //      esp_sleep_enable_timer_wakeup( 1000000 * 2 ); // 1 second times how many seconds wanted
    //      //      esp_deep_sleep_start();
    //      break;
    //    case SYSTEM_EVENT_AP_STACONNECTED:
    //      log_i("Client connected");
    //      break;
    case SYSTEM_EVENT_AP_STADISCONNECTED:
      log_i("WiFi client disconnected");
      break;
    //    case SYSTEM_EVENT_AP_STAIPASSIGNED:
    //      log_i("Assigned IP address to client");
    //      break;
    //    case SYSTEM_EVENT_AP_PROBEREQRECVED:
    //      log_i("Received probe request");
    //      break;
    //    case SYSTEM_EVENT_GOT_IP6:
    //      log_i("IPv6 is preferred");
    //      break;
    //    case SYSTEM_EVENT_ETH_GOT_IP:
    //      log_i("Obtained IP address");
    //      break;
    default: break;
  }
}

Uncomment all the commented lines for troubleshooting.

Another aspect that I found was how the MQTT payloads destined for an ESP32 are sent to the MQTT Broker.

The retain=true flag is an important thingy. I set QoS=1.

When the MQTT client subscribes to the MQTT Broker the Broker will send out an initial payload. The Broker has a persist setting to allow each payload to be retained. Most likely the Broker is set to persist data, the default. The retain payload setting causes the Broker to send out the last payload received until the Broker is updated with a new payload. If the retain setting is false then a NULL is sent. The MQTT keep alive loop does not handle a NULL payload very well.
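
Whatever the Broker ends up sending, the receiving side can be written so that an empty payload is harmless. Below is a minimal sketch of that idea, written as plain C++ so the logic can be checked off the board. The function names are hypothetical; handleMessage() only mirrors the (topic, payload, length) parameter list of a PubSubClient callback and is not code from this thread.

```cpp
#include <cstdint>
#include <string>

// Copies an MQTT payload into a std::string. A null pointer or a zero
// length is treated as "no data". MQTT payloads are not NUL-terminated,
// so the length is always used instead of strlen().
std::string payloadToString(const uint8_t* payload, unsigned int length) {
  if (payload == nullptr || length == 0) {
    return std::string();
  }
  return std::string(reinterpret_cast<const char*>(payload), length);
}

// Hypothetical message handler. Returns true if the message was acted on,
// false if it was ignored because the topic or payload was missing.
bool handleMessage(const char* topic, const uint8_t* payload, unsigned int length) {
  if (topic == nullptr) {
    return false;
  }
  const std::string body = payloadToString(payload, length);
  if (body.empty()) {
    return false;  // nothing to parse, so skip instead of dereferencing
  }
  // ... act on `body` here ...
  return true;
}
```

On the ESP32 the body of the real callback would call something like handleMessage() first and return early when it reports false.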

Instead of using millis(), I use vTaskDelayUntil() or esp_timer_get_time().

vTaskDelayUntil() runs on the FreeRTOS tick (1 ms by default in the Arduino ESP32 core), and rollover is handled by the ESP32’s built-in OS, FreeRTOS.

esp_timer_get_time() returns a 64-bit microsecond count, so it would take far more than 200 years to roll over.
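
To put numbers on the rollover point, here is a small sketch in plain C++. It shows that unsigned subtraction of two 32-bit millisecond readings still gives the right elapsed time across a rollover, and it estimates how long a signed 64-bit microsecond counter lasts. The function names are made up for illustration.

```cpp
#include <cstdint>

// Elapsed time between two readings of a 32-bit millisecond counter.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct
// even if the counter rolled over between `earlier` and `now`.
constexpr uint32_t elapsedMs(uint32_t now, uint32_t earlier) {
  return now - earlier;
}

// Whole days until a 32-bit millisecond counter rolls over.
constexpr uint32_t daysUntilRollover32() {
  return UINT32_MAX / 1000 / 60 / 60 / 24;
}

// Whole years (365-day years) until a signed 64-bit microsecond
// counter reaches its maximum value.
constexpr int64_t yearsUntilRollover64() {
  return INT64_MAX / 1000000 / 60 / 60 / 24 / 365;
}
```

For example, elapsedMs(5, 0xFFFFFFFB) is 10 even though the counter wrapped in between. A 32-bit millisecond counter wraps after about 49 days, while the 64-bit microsecond count lasts on the order of hundreds of thousands of years.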

Example use:

void fDo_AudioReadFreq( void *pvParameters )
{
  int FreqVal[7];
  const int NOISE = 10; // noise that you want to chop off
  const int A_D_ConversionBits = 4096; // arduino use 1024, ESP32 use 4096
  Analyzer Audio = Analyzer( 5, 15, 36 );//Strobe pin ->15  RST pin ->2 Analog Pin ->36
  Audio.Init(); // start the audio analyzer
  int64_t EndTime = esp_timer_get_time();
  int64_t StartTime = esp_timer_get_time(); //gets time in uSeconds like Arduino Micros
  for (;;)
  {
    xEventGroupWaitBits (eg, evtDo_AudioReadFreq, pdTRUE, pdTRUE, portMAX_DELAY);
    EndTime = esp_timer_get_time() - StartTime;
    // log_i( "TimeSpentOnTasks: %d", EndTime );
    Audio.ReadFreq(FreqVal);
    for (int i = 0; i < 7; i++)
    {
      FreqVal[i] = constrain( FreqVal[i], NOISE, A_D_ConversionBits );
      FreqVal[i] = map( FreqVal[i], NOISE, A_D_ConversionBits, 0, 255 );
      // log_i( "Freq %d Value: %d", i, FreqVal[i]);//used for debugging and Freq choosing
    }
    xQueueSend( xQ_LED_Info, ( void * ) &FreqVal, 0 );
    StartTime = esp_timer_get_time();
  }
  vTaskDelete( NULL );
} // fDo_ AudioReadFreq( void *pvParameters )

Thank you aarg and Idahowalker. There are lots of things I can start working on to help resolve this.

I am going to build an identical test version on the bench and leave it running. I will confirm that it breaks the same way as the one that is currently in place and then start working through things.

I will keep stripping it back until it stops failing; I can then work up from there.

Thanks, Idahowalker, for the code examples. I was not aware of vTaskDelayUntil or esp_timer_get_time(); either of these looks better suited to the debounce check than what I am currently using.
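
As a rough sketch of what the debounce check could look like with a 64-bit microsecond timestamp: the Debouncer struct below is hypothetical and takes the time as a parameter, so the logic can be tested off the board. On the ESP32 the value passed in would come from esp_timer_get_time() at the moment of the event, rather than from a variable that is only refreshed in loop().

```cpp
#include <cstdint>

// Debounce state for one input. Timestamps are microseconds held in a
// signed 64-bit value, the type esp_timer_get_time() returns.
struct Debouncer {
  int64_t intervalUs;      // minimum spacing between accepted events
  int64_t lastAcceptedUs;  // timestamp of the last accepted event
  bool hasAccepted;        // false until the first event is accepted

  // Returns true if an event at `nowUs` should be accepted, and records
  // it. Events closer than `intervalUs` to the last accepted one are
  // rejected.
  bool accept(int64_t nowUs) {
    if (hasAccepted && (nowUs - lastAcceptedUs) < intervalUs) {
      return false;
    }
    lastAcceptedUs = nowUs;
    hasAccepted = true;
    return true;
  }
};
```

With Debouncer d{1000000, 0, false}; the first press is always accepted and later presses within one second are ignored. The hasAccepted flag is a design choice of this sketch: without it, a press during the first second after boot would be rejected because the last-rung time starts at zero.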

I was also not aware of the issues with NULL payloads when messages are not retained. I am sending (or should be) non-retained messages, but that is another part for me to check.