Soft WDT reset on second call to WiFiClientSecure:connect(). HELP!

My project is on a NodeMCU 1.0 (ESP-12E). It is all about real-time data collection. To keep the data collection running at regular intervals, I created several "low delay state machine" modules for NTP, DDNS, and Sending Email.
The idea is that each module keeps track of whether it is idle, or waiting for a response, or needs to do something. The NTP and DDNS modules work perfectly. The emailSender, not so much.

The sendEmailStateMachine is based on EmailSender, with the State Machine idea coming from the discussion sending e-mails using esp8266 and arduino IDE.
If the WiFiClientSecure::connect() succeeds, everything is fine. But if it fails, the state machine logic tries to call connect() again, and that causes a soft WDT reset.

Normally everything works great. However, I have a cabin in the North Country, with a DSL link that works about as fast as dialup (and worse when there are multiple phones, tablets and computers all competing for bandwidth - no joke!) So quite often, it crashes repeatedly.

Note: full code are below.

Here is a sample console log to demonstrate the problem, with the stack dump replaced by the traceback from the ESP Exception Decoder.
WiFiClientSecure::connect() is first called just after 18:10:27.505, and gracefully times out / fails, at 18:10:50.871, about 30.3 seconds later. Then the state machine has it retry the connection at 18:10:51.380, and it gets a soft WDT reset at 18:10:59.707, only 8.3 seconds later. Why didn't it gracefully fail, as before? and Why is the Watchdog timer not being maintained? [ From many other trials, the time from connect() to WDT reset is usually between 2 and 3 seconds, not 8, as in this example, in case that helps anyone... ]

I've tried this code with the 8266 core 2.6.3 (as shown), and with 2.7.4, with identical results. I'm really hesitant to say its a bug in the core code, but I've tried dozens of modifications to my code with the same result, and I'm at a loss to see the bug. That's why I'm asking the community for help!

18:10:19.060 -> SDK:2.2.2-dev(38a443e)/Core:2.6.3=20603000/lwIP:STABLE-2_1_2_RELEASE/glue:1.2-16-ge23a07e/BearSSL:89454af
 (shortened to meet length rule)
18:10:27.471 -> loop now=9
18:10:27.471 -> M.Mail Subject: Testing Email StateMachine
18:10:27.471 -> Mail Message: This is the test email body 1
--ME
18:10:27.505 -> SendEmail to: john.burgess.74@alumni.brown.edu
18:10:27.505 ->    subject: Email Test ServerTesting Email StateMachine
18:10:27.505 ->    body: This is the test email body 1
--ME - EMAIL_SM
18:10:27.505 -> Connecting to: smtp.comcast.net:465
18:10:32.473 -> pm open,type:2 0
18:10:35.860 -> BSSL:_connectSSL: start connection
18:10:38.555 -> BSSL:CERT: 30 82 07 5d 30 82 06 45 a0 03 02 01 02 02 11 00 9a 8f 52 68 32 4d fb 9d f9 94 89 fe 24 96 84 7f 30 0d 06 09 2a 86 48 86 f7 0d 01 01 0b 05 00 30 81 96 31 0b 30 09 06 03 55 04 06 13 02 47 42 31 1b 30 19 06 03 55 04 08 13 12 47 72 65 61 74 65 72 20 4d 61 6e 63 68 65 73 74 65 72 31 10 30 0e 06 03 55 04 07 13 07 53 61 6c 66 6f 72 64 31 1a 30 18 06 03 55 04 0a 13 11 43 4f 4d 4f 44 4f 20 43 41 20 4c 69 6d 69 74 65 64 31 3c 30 3a 06 03 55 04 03 13 33 43 4f 4d 4f 44 4f 20 52 53 41 20 4f 72 67 61 6e 69 7a 61 74 69 6f 6e 20 56 61 6c 69 64 61 74 69 6f 6e 20 53 65 63 75 72 65 20 53 65 72 76 65 72 20 43 41 30 1e 17 0d 32 30 30 38 31 32 30 30 30 30 30 30 5a 17 0d 32 32 30 38 31 32 32 33 35 39 35 39 5a 30 81 b9 31 0b 30 09 06 03 55 04 06 13 02 55 53 31 0e 30 0c 06 03 55 04 
18:10:50.871 -> BSSL:_run_until: Timeout
18:10:50.871 -> BSSL:_wait_for_handshake: failed
18:10:50.871 -> BSSL:Couldn't connect. Error = 'Unknown error code.'
18:10:50.871 -> email client connection failed
18:10:50.871 -> Could not connect to mail server smtp.comcast.net on port 465
18:10:50.904 -> Mail respons: 5 = ES_RETRY
18:10:50.904 -> FREE HEAP! loopEmail(): 19696, was 41608, diff 21912
18:10:51.380 -> 
18:10:51.380 -> loop now=33
18:10:51.380 -> M
18:10:51.380 -> pollEmail step: ST_CONNECT
18:10:51.380 ->  expect resp: 220
18:10:51.380 -> Connecting to: smtp.comcast.net:465
18:10:57.624 -> BSSL:_connectSSL: start connection
18:10:59.707 -> 
18:10:59.707 -> Soft WDT reset
18:10:59.707 -> 

Decoding stack results
0x40204edb: BearSSL::WiFiClientSecure::_run_until(unsigned int, bool) at ...esp8266/PolledTimeout.h line 218
0x40220172: br_ssl_engine_hs_reset at src/ssl/ssl_engine.c line 1305
0x402051e4: BearSSL::WiFiClientSecure::_wait_for_handshake() at .../esp8266/2.6.3/libraries/ESP8266WiFi/src/WiFiClientSecureBearSSL.cpp line 564
0x402053ef: BearSSL::WiFiClientSecure::_connectSSL(char const*) at .../esp8266/2.6.3/libraries/ESP8266WiFi/src/WiFiClientSecureBearSSL.cpp line 1122
0x40100971: check_poison_neighbors(unsigned short) at ...esp8266/umm_malloc/umm_local.c line 71
0x40100ca4: umm_malloc(size_t) at ...esp8266/umm_malloc/umm_malloc.cpp line 511
0x4010085c: get_poisoned(void*, size_t) at ...esp8266/umm_malloc/umm_poison.c line 112
0x4021b038: ip4_output_if_opt at core/ipv4/ip4.c line 1577
0x401002d4: malloc(size_t) at ...esp8266/heap.cpp line 232
0x4010085c: get_poisoned(void*, size_t) at ...esp8266/umm_malloc/umm_poison.c line 112
0x4021b05e: ip4_output_if at core/ipv4/ip4.c line 1550
0x4021bbf7: ip_chksum_pseudo at core/inet_chksum.c line 395
0x40217142: tcp_output at core/tcp_out.c line 1361
0x4021642d: tcp_create_segment at core/tcp_out.c line 190
0x402078e3: __esp_yield() at ...esp8266/core_esp8266_main.cpp line 107
0x40207ef9: __delay(unsigned long) at ...esp8266/core_esp8266_wiring.cpp line 57
0x40207eee: __delay(unsigned long) at ...esp8266/core_esp8266_wiring.cpp line 54
0x402039f8: WiFiClient::connect(IPAddress, unsigned short) at .../esp8266/2.6.3/libraries/ESP8266WiFi/src/include/ClientContext.h line 133
0x40205553: BearSSL::WiFiClientSecure::connect(char const*, unsigned short) at .../esp8266/2.6.3/libraries/ESP8266WiFi/src/WiFiClientSecureBearSSL.cpp line 231
0x40201df7: sendEmailStateMachine::connect(char const*, unsigned int) at mypath/libraries/sendEmailStateMachine/sendEmailStateMachine.cpp line 494
0x40206a94: Print::println() at ...esp8266/Print.cpp line 186
0x402021a4: sendEmailStateMachine::pollEmail(long) at mypath/libraries/sendEmailStateMachine/sendEmailStateMachine.cpp line 316
     (a bunch of calls deleted because I don't see the relevance, and to reduce post size)
0x402011cd: loopEmail(long) at mypath/Temp_Sensor_Project/testEmailSend_SM/SendEmail.ino line 51
0x40206c64: Print::println(long, int) at ...esp8266/Print.cpp line 228
0x40201557: loop() at mypath/Temp_Sensor_Project/testEmailSend_SM/testEmailSend_SM.ino line 65
0x40206a94: Print::println() at ...esp8266/Print.cpp line 186
0x4010017c: ets_post(uint8, ETSSignal, ETSParam) at ...esp8266/core_esp8266_main.cpp line 160
0x402079f8: loop_wrapper() at ...esp8266/core_esp8266_main.cpp line 180
0x401000c0: app_entry() at ...esp8266/core_esp8266_main.cpp line 303
0x40220b22: br_ssl_hs_client_run at src/ssl/ssl_hs_client.c line 1142

18:11:04.906 ->  ets Jan  8 2013,rst cause:4, boot mode:(3,7)
(and reboot, repeat...)

The code including everything from setup() and loop() to sendEmail() and the sendEmailStateMachine module (.cpp and .h) are attached.

sendEmailStateMachine.cpp (20.3 KB)

sendEmailStateMachine.h (3.4 KB)

SendEmail.ino (2.11 KB)

testEmailSend_SM.ino (2.66 KB)

You no doubt get a stack trace with your error.

Install the ESP exception decoder and see where exactly the error happens, that should give you a good idea on where to start debugging.

Avoid using the String class where you can; and if you must use it use it carefully to not waste memory. https is also a memory hog, and can lead to crashes.

A common cause for a WDT reset is getting stuck in a loop for too long, so no time for the background processes. A yield() or delay(0) call can help here.

wvmarle:
You no doubt get a stack trace with your error.
. . .

Indeed he did and published the decoded results. But suggesting yield() was a good idea!

Ah, missed the stack trace there.

Lots of malloc() calls related to BearSSL means to me OP is probably running out of memory.

SSL takes a lots of memory, ~30k of RAM kind of lots, and demands (most of) it to be a single block. I've had a really hard time with it, crash after crash. It was a combination of https and encrypted wifi or https and running WiFi in both station and AP mode., something like that, which caused the memory issues. I had to drop the encryption in the end to get it to work reliably.

It would work for a while, then suddenly crash, unpredictably. Sometimes it'd just get stuck, serving its web pages very very slowly (seconds of load time on an imageless page). Switching to plain http solved the crashes and made the web pages load fast.

An ESP32 can probably handle this much better.

I have tried adding yield() in sendEmailStateMachine::connect() right before the call to mailClient.connect() (aka WiFiSecureClient::connect) -- no help.

Yes, I've considered that it is running out of memory. You'll notice that some print strings are enclosed in the F() macro -- also no help. I even tried doing a wdt_disable/wdt_enable around the mailClient.connect() call -- also no help.

At one point, I changed the sendEmailStateMachine module to use char arrays and malloc/free instead of String, also no help. I reverted that change because at one point last summer I thought I had a working version -- and it used String.

There is a free heap check in loop().

Do you think it would help to reallocate FLASH to devote less to the filesystem (still using SPIFFS. Will modernize once this bug is resolved). It is currently set to FS:3MB OTA -512k. I don't use OTA updates for security. Can I reclaim that 1/2M?

18:10:50.904 -> FREE HEAP! loopEmail(): 19696, was 41608, diff 21912

Is 20k not enough for BearSSL, or maybe BearSSL is what consumed the 21.9k that went away?

None of these answers the question: Why does the BearSSL connect() gracefully time out after 30 seconds the first time, but trips the WDT the second time? It is obviously taking care of it that first time through!
These are all great suggestions. Keep em coming!

As said, it needs a huge continuous block of memory. Some Googling will give you more details on this, it's been discussed more on other sites. And as said, sometimes it manages to recover, sometimes not.

<20k free heap is disturbingly low. I've had problems with 30k free heap.

Thanks, @wvmarle.
I've read some of those threads. Free space is one of my theories. That's why I wrote in the Free Heap check.
I just don't know how to drastically increase the amount of Heap to start with, so there's 30+k available for BearSSL.

Any thoughts on that? Will changing the FLASH config to have 2M for filesystem helps (currently I have it set to 3M)? Can I get rid of the 1/2M reserved for OTA, which I don't use, and add that to the heap instead?

Heap is part of your RAM, that's limited to some 80 kB. The only way to make more available is to remove other memory-heavy functionality.

This topic was automatically closed 120 days after the last reply. New replies are no longer allowed.