ESP07S based device started to reset at irregular intervals

I have two locations 100 km apart with smart electricity meters where I have deployed a reader based on an ESP07S module. I chose the ESP07S because it has an external antenna connector so I can run the WiFi antenna through a hole in the metal enclosure.

Both run the same f/w and both are located in a metal meter box outside of the respective houses and connect to the Internet via an ASUS WiFi router in each place. Signal strength is between -70 and -85 dBm, which is enough for stable connections.

My problem is that the newer of the two installations has started to reboot at seemingly random times, and I would like to know what to look for in the f/w to pinpoint the cause.
The older installation has never rebooted except via the OTA f/w update interface on command from me.
The newer one ran just as well until a week or so ago, when it started to reboot now and then for no apparent reason...

In operation the device reads and decodes serial data from the meter and packages it into an MQTT telegram which is then sent to an MQTT broker.
My home automation system subscribes to the data via the MQTT broker.

The f/w is based on a GitHub project:
ESP8266 P1 Meter

Any ideas what to look for?

It is powered from the meter.
The enclosure for the meter and main fuses is the box the electricity company set up and I made a 3 mm hole in the wall to get the antenna out.
5V power is fed from the RJ12 connector and dropped to about 3.6V by two diodes, and there is a 1000 uF capacitor across the (nominal) 3.3V line in order to buffer WiFi transmit surges.
The f/w measures the WiFi level and posts that to MQTT as well so I can see what it is dealing with. Turns out to vary between -80 and -83 dBm on this unit.

Yes, it might, but it has never happened on the other device, which has been running for almost a year now and uses the same sneak solution.
I will check if it is possible to fit in a 3.3V regulator there instead. But that would mean some circuit reconfiguration with downtime.
The thing is that the resets only disturb the data flow in one way: the first value transmitted after a restart has no previous value, so a delta will be erroneously calculated...
I could first see if I can introduce some f/w change to handle that, but I am really curious about what the mechanism for the reset is. Is there a watchdog timer that triggers? If so, how can one control its timeout?
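One way the missing previous value could be handled is sketched below in plain C++. It is only an illustration: `DeltaTracker` and its members are hypothetical names, not part of the P1 Meter project, and the same idea could live either in the f/w or on the home automation side.

```cpp
#include <cstdint>

// Hypothetical helper: remembers the last meter reading and only reports
// a delta once a previous reading exists.
class DeltaTracker {
public:
    // Returns true and fills 'delta' when a valid delta is available.
    // Returns false for the first reading after a (re)start, or when the
    // counter appears to have gone backwards.
    bool update(uint32_t reading, uint32_t &delta) {
        bool valid = hasPrevious_ && reading >= previous_;
        if (valid) {
            delta = reading - previous_;
        }
        previous_ = reading;      // always remember the newest reading
        hasPrevious_ = true;
        return valid;
    }

private:
    uint32_t previous_ = 0;
    bool hasPrevious_ = false;    // false again after every reboot
};
```

With this, the first telegram after a reset is simply published without a delta instead of with a wrong one.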

It's NEVER happened until now other than for OTA updates.

What changed?

Is it possible that an update has destabilised one of the nodes in a different way than the other?

The only real differences I can see are:

  1. The MQTT broker is on the same LAN as the "stable" device while the other is on a LAN which connects to the MQTT broker LAN via OpenVPN on the router. Should be the same but is farther away.
  2. The two sites have different electricity suppliers and they have installed different brands of smart meters.
  3. WiFi strength is about 10 dB weaker on the site where the problems occur.

If the watchdog is involved in this, then maybe there are some issues within the MQTT handling that could wind up in a longish loop when responses over the network are slow, and thus trigger the watchdog?
But I don't know enough to dig into that part of the sketch code...

It should have had that from the get-go. A 3.3 V linear regulator in a TO-220 package should do the trick.

Actually I would say undervoltage is more of an issue. With overvoltage the ESP tends to just 'break beyond repair', whereas undervoltage may cause a reset. Those diodes may each drop about 0.7 V under a small load, but the drop rises with load current, so the supply can sag noticeably when the ESP draws more during WiFi transmission. Of course, without knowing the particular type of diode that is just a suspicion (and even knowing it, it may still be just a suspicion). The advantage of a regulator is that it keeps the voltage stable regardless of varying circumstances (until it overheats, which will make it shut down altogether).
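To put rough numbers on the two-diode drop, here is a small worked calculation in plain C++ using integer millivolts. The per-diode drops are illustrative values, not measurements of this board, and the 3.0 to 3.6 V window is an assumed operating range that should be checked against the module's datasheet.

```cpp
// Supply voltage (in mV) left after a number of series diodes,
// each dropping vfMv millivolts.
int supplyAfterDiodesMv(int vinMv, int diodes, int vfMv) {
    return vinMv - diodes * vfMv;
}

// Assumed operating window for the module: 3000..3600 mV.
// These limits are an assumption; check the datasheet for the real ones.
bool withinSupplyRangeMv(int supplyMv, int minMv = 3000, int maxMv = 3600) {
    return supplyMv >= minMv && supplyMv <= maxMv;
}

// Illustrative cases for a 5000 mV input and two diodes:
//   600 mV per diode  -> 3800 mV (above the assumed window)
//   700 mV per diode  -> 3600 mV (right at the upper edge)
//   1000 mV per diode -> 3000 mV (right at the lower edge)
```

The point is how little margin there is: the result moves by twice whatever the single-diode drop changes, in either direction.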

The WDT on an ESP8266 has a timeout of 1.5s, which can be modified, but it is better to leave it that way and make sure it gets reset often enough.
Basically it is there to make sure background tasks are performed at regular intervals.

It is not hugely complex to find if the background tasks are being executed. Calling

yield();

executes scheduled tasks. yield() is automatically called at the end of loop() and every time

delay();

is called (which actually calls yield() until the waiting time has expired)
Usually the libraries that deal with MQTT make a call to delay() for stability purposes, so I suspect this is not the issue.

The ESP prints a message over Serial at 74880 baud after a reset stating the cause of the reset, but I guess it will not be easy to record that message. Until you do, however, it is all guesswork. Overvoltage, undervoltage, power dropout or code error, who knows?!

Are you certain that this is caused by a reset of the ESP? If you can include data in the transmission that indicates how long the ESP has been running, that at least could be confirmed.

Since the device reads serial data, is there a possibility that it somehow gets flooded with data and doesn't call yield() in time before the WDT is triggered, or that the meter just doesn't send correct data for some other reason?

I have included code in the f/w that sends MQTT messages on startup with some basic data like the f/w version etc.
So I know when it starts and also what was the last action it took before the reset.
All MQTT messages get logged on the server.
So based on that I don't see anything happening close before the restart, so it seems to be just idling waiting for a message to send.
It sends telegrams by MQTT every minute, but they arrive from the meter at a 5 s interval, so most are discarded.
The meters are different though: the one with no restarts sends telegrams every 10 seconds, and this one sends every 5 seconds.

The code that reads the serial data looks like this:

void read_p1_hardwareserial()
{
    if (Serial.available())
    {
        memset(telegram, 0, sizeof(telegram));

        while (Serial.available())
        {
            ESP.wdtDisable();
            int len = Serial.readBytesUntil('\n', telegram, P1_MAXLINELENGTH);
            ESP.wdtEnable(1);
            processLine(len);
        }
    }
}

And inside loop() it gets called like this:

    // Look for HAN data only once per UPDATE_INTERVAL (1 minute)
    if (now - LAST_UPDATE_SENT > UPDATE_INTERVAL)
    {
        read_p1_hardwareserial();
    }

Personally I think you'd be better off like this:

void read_p1_hardwareserial()
{
    if (Serial.available())
    {
        memset(telegram, 0, sizeof(telegram));

        while (Serial.available())
        {
            //ESP.wdtDisable(); // with the buffersize limited it can really run out, though i never use readBytesUntil()
            int len = Serial.readBytesUntil('\n', telegram, P1_MAXLINELENGTH);
            //ESP.wdtEnable(1);
            processLine(len);  // don't know what happens here?!
            if (!Serial.available()) delay(1); // calls yield() and gives the next byte time to arrive, so the
                           // buffer is emptied out. The length of the delay depends on the baud rate: it should be
                           // at least long enough to receive 1 more byte. 1 ms will do for 19200, 2 ms is safer for 9600
        }
    }
}

If only we could know what the cause of the reset was...

I have now found a possible culprit...
On the home location (where the resets happen) I have a WiFi network with the main router some 40+ m away from the electricity meter, but there is an access point router (a slave) inside the room closest to the meter, only some 10-12 meters away. This typically gives about -78 dBm signal strength, whereas the main router comes in at -94 dBm or weaker (barely usable, if at all).
If the ESP somehow switches its connection to the faraway WiFi it is probably going to lose connection during a transmission.
I am sending the WiFi strength by MQTT regularly so I can monitor it, and now I found that a few times over the last days it has dropped to a very low level, like -93 dBm!

Question:
Is there a way to connect to WiFi using the access point's MAC address rather than the SSID name?
That would ensure connection to the closest AP at all times.

This is how the connection is started at the moment:

    WiFi.mode(WIFI_STA);
    WiFi.setPhyMode(WIFI_PHY_MODE_11G);
    WiFi.hostname(MyHostname);
    WiFi.begin(MySSID, MyPasswd);

So if there is an alternate begin() where a MAC can be used, it should be possible to fortify this...
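As far as I know, the ESP8266 Arduino core does have such an overload: `WiFi.begin(ssid, passphrase, channel, bssid)`, where `bssid` is the access point's MAC as 6 bytes. The plain C++ sketch below only covers turning a MAC string into those bytes; `parseBssid` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstdio>

// Parses "AA:BB:CC:DD:EE:FF" into 6 bytes.
// Returns false if the text is not exactly six colon-separated hex bytes.
bool parseBssid(const char *text, uint8_t out[6]) {
    unsigned int b[6];
    char extra;  // catches trailing characters after the sixth byte
    int n = std::sscanf(text, "%2x:%2x:%2x:%2x:%2x:%2x%c",
                        &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra);
    if (n != 6) {
        return false;
    }
    for (int i = 0; i < 6; ++i) {
        out[i] = static_cast<uint8_t>(b[i]);
    }
    return true;
}
```

The parsed array could then be passed as the fourth argument, for example `WiFi.begin(MySSID, MyPasswd, channel, bssid)`, with `channel` set to the channel the nearby AP uses (this assumes the AP stays on a fixed channel). Publishing `WiFi.BSSIDstr()` over MQTT next to the signal level would also show which AP the device is actually associated with when the level drops.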

This I actually don't know. There was a thread about it before, and I think the conclusion was that the ESP connects to the strongest available network, so it is strange that it switches to begin with. Personally I have no experience with any of this.

Today it really went bad and restarted a lot, so I took it offline and checked out the board.
I made a slight change to the power system (the 2 diodes + 1000 uF capacitor) so I get better filtering directly on the ESP power pins.
I also ordered a 3.3V regulator to put into the circuit instead of the 2 diodes, but it will not get here in time before I go on a week-long trip.
So I will have to reinstall the updated board tomorrow and see what happens.
Too dark now to be able to do it.

UPDATE
In a discussion on ESP8266 usage with electricity meters there was a post that seems a bit interesting for my purpose:

ESP 8266 webserver save power with a small delay

I don't have time now to check if this helps, but when I return home I will test it.