Hi All,
I have more news and am now more confused than ever. When I first reported the stalling, I assumed that the code had corrupted the stack and the CPU had run off into space. But I hadn't waited long enough. After 62.635 minutes, the system resumes operation and continues normally. I have now captured three stalls, each a little over an hour long.
Following the suggestions from xfpd, I rewrote all the delay statements throughout my code paying careful attention to variable and literal types (all uint32_t and xxxUL now). The code originally delayed using the form:
// somevariable is initialized to millis() sometime before this.
if ( millis() >= somevariable + delayconstant )
{
    somevariable = millis();
    // do something
}
to perform timed waits. I rewrote the code to call bool waitms( int ID, uint32_t var, uint32_t t ) to perform delays. That function returns true when the time delay has expired (i.e., millis() >= var + t) and false otherwise.
In that routine I test for the delay being at or above a limit value, var being greater than millis(), and millis() - var being at or above the limit value. If any of those conditions is true, the function prints an error message with the values of millis(), var and the delay, then returns immediately. waitms also maintains a static variable set to the last millis() value read. If millis() < that value, that too is flagged as an error (I realize this will happen once the time range of millis() is exceeded, but the system has yet to run that long without issues). Here's the whole routine:
// Wait for millis() >= var + t
// But trap for the case where millis() rolls over its roughly 49.7 day limit (2^32 ms).
// True if millis() >= var + t, *or* if a trap condition fires:
// t >= MAXMSDELAY, millis() - var >= MAXMSDELAY, or millis() < var.
bool waitms( int ID, uint32_t var, uint32_t t )
{
    static uint32_t lastt = 0;      // last millis() value seen by any call
    uint32_t ti = millis();

    // millis() should never run backwards except at rollover.
    if ( ti < lastt )
        Serial.println( F( "Timer rollover" ) );
    else
        lastt = ti;

    // Trap implausible arguments and report them.
    if ( t >= MAXMSDELAY || ti - var >= MAXMSDELAY || ti < var )
    {
        Serial.print( "TRAP: waitms rollover ID " );
        Serial.print( ID );
        Serial.print( ", millis " );
        Serial.print( ti );
        Serial.print( ", var " );
        Serial.print( var );
        Serial.print( ", t " );
        Serial.println( t );
        return( true );
    }

    return( ti >= var + t );
}
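For comparison, here is a minimal sketch of the subtraction-based test that is commonly recommended for millis() timing. It is not what waitms() above does. The names expired() and expiredAdditive() are hypothetical, made up for illustration, and both take the current time as a parameter so they can be exercised off-target. Because unsigned subtraction wraps modulo 2^32, the first form needs no separate rollover trap.

```cpp
#include <cstdint>

// Hypothetical helper (not part of my sketch): true once `interval` ms
// have passed since `start`, given the current tick count `now`.
// Unsigned subtraction wraps modulo 2^32, so the elapsed time is still
// correct when `now` has rolled over past zero and `start` has not.
constexpr bool expired( uint32_t now, uint32_t start, uint32_t interval )
{
    return static_cast<uint32_t>( now - start ) >= interval;
}

// The additive form used in my original code, shown for comparison.
// start + interval can wrap to a small number, making this fire early.
constexpr bool expiredAdditive( uint32_t now, uint32_t start, uint32_t interval )
{
    return now >= static_cast<uint32_t>( start + interval );
}
```

With start = 0xFFFFFF00 and interval = 0x200, the additive form reports expiry at now = 0xFFFFFF80 (only 0x80 ms in), while the subtraction form holds off until now has wrapped around to 0x100.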
I have now successfully captured three stalls. The first two stalls (from a system power-up) happened at nearly identical times (939524x ms) and lasted 62.635 minutes. The third stall was longer, at 71.583 minutes, and ended 134.215 minutes after the first stall ended.
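As a sanity check on those durations, here is a small sketch that converts a stall length in minutes to microseconds and expresses it as a fraction of 2^32 us, which is the period of a free-running 32-bit microsecond counter. This is arithmetic only: I have not confirmed that any such counter is involved in the stalls, and kWrapUs, minutesToUs() and wrapFraction() are hypothetical names for illustration.

```cpp
// ASSUMPTION: comparing against a 32-bit microsecond counter period is
// only a guess at a relevant time scale, not a confirmed cause.
constexpr double kWrapUs = 4294967296.0;   // 2^32 microseconds

// Convert a duration in minutes to microseconds.
constexpr double minutesToUs( double minutes )
{
    return minutes * 60.0 * 1.0e6;
}

// Express a duration in minutes as a fraction of one 2^32 us period.
constexpr double wrapFraction( double minutes )
{
    return minutesToUs( minutes ) / kWrapUs;
}
```

Fed the numbers above, wrapFraction( 71.583 ) comes out very close to 1 and wrapFraction( 62.635 ) very close to 7/8, which may or may not be a coincidence.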
Interestingly, no error messages are printed before the system stalls. But when the system resumes operation, messages are printed because about an hour has elapsed. Also, the absolute value of millis() when the system resumes operation is nearly identical in the two stall cases (within 10 ms). I can't imagine what the software might be doing during that 62 minute stall period. The software doesn't use dynamic variables, e.g., no malloc()/free()/delete calls, so there shouldn't be any garbage collection going on. Any thoughts?
Is this behavior ringing a bell with anyone? I can't explain the hour-long pauses in code operation.
Briefly, the project is a 10x10x10 LED cube. The cube uses 1000 WS2812 serial addressable LEDs that are driven with a little bit of 74LVCxx logic and the Nano RP2040 Connect's SPI port. I also have an Adafruit 2.2" TFT LCD attached to the Nano's SPI, a resistor voltage divider and ADC channel to monitor the LED supply voltage, and a differential current monitor and ADC channel to monitor the LED supply current. The sketch uses SPI.h, api/HardwareSPI.h, Arduino_LSM6DSOX.h, Adafruit_GFX.h, Adafruit_ILI9341.h and LittleFS_Mbed_RP2040.h, as well as code I've written.
Thanks All,
Scott