How to troubleshoot a hanging script

Hi All,

I have a relatively large Arduino script (~200K code, 110K variables) running on a Nano RP2040 Connect that randomly hangs after literally hours of operation, in several of the script's operating modes. When it hangs, the serial port doesn't disconnect from the PC, so I don't see the device appear in the PC's file manager device list as it sometimes has after code crashes (e.g., invalid pointer operations).

I've used Serial.print statements to narrow down the area where the code appears to hang, but the results are confusing: I've added code that ought to interact with Serial.print/read where I think the program hangs, but I get no interaction with the serial port.

I know there are no formal debug tools, but any thoughts on methods to debug this are much appreciated.

I am developing using Arduino IDE 2.3.3 18.242Z on a fully-updated Windows 10 Pro PC.

Best,
Scott

Check for insufficient type definitions (for example, an int where an unsigned int belongs) and overflow of millis() or micros().

I do use millis() extensively; however, the range is 50 days according to the docs. I've never run more than a couple of hours without the issue. This is a good clue, thanks!

Scott

... as an unsigned long 49 days.

Also, check literal values of 1000 and over, and make them read 1000UL (UL = unsigned long), especially if you do any math with them, for example:

unsigned long hours = 60 * 60 * 1000UL; // ms per hour
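
To illustrate why the UL suffix matters, here is a minimal desktop C++ sketch. It assumes a board whose int is 16 bits wide (as on AVR-based Arduinos); on the 32-bit RP2040 an int is 32 bits, so this particular product would not wrap there. The helper wrapTo16 is hypothetical and only mimics the typical two's-complement truncation; on real hardware, signed overflow is undefined behavior.

```cpp
#include <cstdint>

// Hypothetical helper: mimics what a 16-bit int would typically hold after
// the multiplication wraps (two's-complement truncation to 16 bits).
constexpr int32_t wrapTo16(int64_t v) {
    const int64_t low = v & 0xFFFF;  // keep only the low 16 bits
    return static_cast<int32_t>(low >= 0x8000 ? low - 0x10000 : low);
}

// The intended value: milliseconds per hour.
constexpr int64_t kMsPerHour = 60LL * 60 * 1000;  // 3,600,000

// What a 16-bit int would be left with if no operand were unsigned long.
constexpr int32_t kAs16Bit = wrapTo16(kMsPerHour);  // -4480

// With 1000UL the final multiplication is done in unsigned long, so it fits.
constexpr uint32_t kAs32Bit = 60 * 60 * 1000UL;  // 3,600,000
```

Note that the suffix only helps from the point where it appears: 60 * 60 is still evaluated as int first, which is fine here because 3600 fits in 16 bits.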

Also great advice, thanks!

Scott

Just a SWAG (Scientific Wild-Assumed Guess), but I suggest checking your power supply and cooling system. These are common sources of issues and worth investigating.

Not experimented with it myself yet, but Tools -> Debug Level and select the debug level you require, probably Core to start with. Then Tools -> Debug port and either select to dump it to the default port or probably better, set up a serial to USB on another port. It might dump something useful.

Caveat: I am using the Earle Philhower library. Not sure whether the Mbed one has these options or not.

Hi gilshiltz,

I've had an oscilloscope on the supply rail and see no excessive noise. The project is a 10x10x10 LED CUBE that is driven by a 400W 5V supply with plentiful tantalum bulk and ceramic supply bypass capacitors.

thanks,
Scott

Hi Bitseeker,

I don't see these options for the nano RP2040 Connect using the newest Arduino IDE 2.3.3.

Thanks,
Scott

Maybe you are using the mbed library then? Perhaps it doesn't have those options.

Hi All,

I have more news and am now more confused than ever. When I first reported the stalling, I had assumed that the code had corrupted the stack and the CPU ran off into space. But I hadn't waited long enough. After 62.635 minutes, the system resumes operation and continues normally. I have now captured three stalls, each a little over an hour long.

Following the suggestions from xfpd, I rewrote all the delay statements throughout my code paying careful attention to variable and literal types (all uint32_t and xxxUL now). The code originally delayed using the form:

//with somevariable being initialized to millis() sometime before this.
if ( millis() >= somevariable + delayconstant)
{
 somevariable = millis();
 do something
}

to perform timed waits. I rewrote the code to call bool waitms( uint32_t var, uint32_t delay ) to perform delays. That function returns true when the time delay has expired (e.g., millis() >= var + delay) and false otherwise.

In that routine I test for delay being greater than a limit value, var being greater than millis(), and millis() - var being greater than the limit value. If any of those conditions is true, the function prints an error message with the values of millis(), var and delay, then returns immediately. waitms also maintains a static variable set to the last millis() value read. If millis() is less than that value, that too is flagged as an error (I realize this will happen once the time range of millis() is exceeded, but the system has yet to run this long without issues). Here's the whole routine:

//	Wait for millis() >= var + t
//	But trap for the case where millis() rolls over its 49 day limit.
//	True if millis() >= var + t, *or* if a trap fires:
//	t >= MAXMSDELAY, millis() - var >= MAXMSDELAY, or millis() < var

bool waitms( int ID, uint32_t var, uint32_t t )
{
	static uint32_t lastt = 0;
	uint32_t		ti = millis();
	
	if ( ti < lastt )
		Serial.println( F( "Timer rollover" ) );
	else
		lastt = ti;
	
	if ( t >= MAXMSDELAY || ti - var >= MAXMSDELAY || ti < var )
	{
		Serial.print( "TRAP: waitms rollover ID " );
		Serial.print( ID );
		Serial.print( ", millis " );
		Serial.print( ti );
		Serial.print( ", var " );
		Serial.print( var );
		Serial.print( ", t " );
		Serial.println( t );
		return( true );
	}
	return( ti >= var + t );
}

I have now successfully captured three stalls. The first two stalls (from a system power-up) happen at nearly identical times (939524x ms) and last 62.635 minutes. The third stall was longer at 71.583 minutes long and ended 134.215 minutes after the first stall (ended).

Interestingly, when the system stalls there are no error messages printed before the stall. But when the system resumes operation, messages are printed because about an hour has elapsed. Also, the absolute value of millis() when the system resumes operation is nearly identical in the two stall cases (within 10 ms). I can't imagine what the software might be doing during that 62 minute stall period. The software doesn't use dynamic variables, e.g., no malloc()/free()/delete calls, so there shouldn't be any garbage collection going on. Any thoughts?

Is this behavior ringing a bell with anyone? I can't explain the hour-long pauses in code operation.

Briefly, the project is a 10x10x10 LED cube. The cube uses 1000 WS2812 serial addressable LEDs that are driven with a little bit of 74LVCxx logic and the Nano RP2040 Connect's SPI port. I also have an Adafruit 2.2" TFT LCD attached to the Nano's SPI, a resistor voltage divider and ADC channel to monitor the LED supply voltage, and a differential current monitor and ADC channel to monitor the LED supply current. The sketch uses SPI.h, api/HardwareSPI.h, Arduino_LSM6DSOX.h, Adafruit_GFX.h, Adafruit_ILI9341.h and LittleFS_Mbed_RP2040.h, as well as code I've written.

Thanks All,
Scott

That sounds like a class of error that is due to the arrangement of variables. Try this:

if ((millis() - Last) > Delay) {
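
As a minimal illustration of the difference between the two forms, the desktop C++ sketch below models the millis() counter as a plain uint32_t. The function names elapsedSafe and elapsedNaive are hypothetical, and the comparison uses >= (as in the waitms routine) rather than the > shown above.

```cpp
#include <cstdint>

// Subtraction form: unsigned subtraction wraps modulo 2^32, so (now - last)
// is the true elapsed time even after the counter has passed 0xFFFFFFFF.
constexpr bool elapsedSafe(uint32_t now, uint32_t last, uint32_t delayMs) {
    return static_cast<uint32_t>(now - last) >= delayMs;
}

// Addition form: last + delayMs can itself wrap to a small number, which
// makes the comparison true long before the delay has really expired.
constexpr bool elapsedNaive(uint32_t now, uint32_t last, uint32_t delayMs) {
    return now >= static_cast<uint32_t>(last + delayMs);
}
```

For example, with last = 0xFFFFFF00 and a delay of 0x200 ms, at now = 0xFFFFFF80 only 0x80 ms have passed: elapsedSafe returns false, while elapsedNaive already returns true because last + delayMs has wrapped to 0x100.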

Hi sonofcy,

Thanks for your reply but this isn't the issue. Aside from one form overflowing slightly before the other, they are identical. The delay implementation isn't the issue. As I said in my last post, the system freezes not while waiting for a delay to expire but for some other reason.

Best,
Scott

Ok, you are more familiar with that code than I but I just wanted you to be aware of that common bug. Good luck.

Hi All,
I have much more information and believe I have tracked down the source of the problem, although it's not exactly clear why.
Over the past week I have performed many tests. First, I was convinced that one of the other libraries I am using was the source of the problem. I'm using SPI, api/HardwareSPI, LSM6DSOX, Adafruit_GFX, Adafruit_ILI9341, LittleFS and LittleFS_Mbed_RP2040. My first step was to rewrite the code so I could eliminate all these libraries (except the SPI libraries, which are essential to the device). With all but SPI and HardwareSPI removed, the problem remains.
Much statistical analysis later, the script generally stops after running for 62 minutes and remains stopped for usually another 62 minutes, although these times are somewhat variable.
Since my code is written as a state machine, it performs its operation and then exits loop() without holding control any longer than required, so it is periodically entering and exiting loop(). So I suspected that whatever calls the loop() function was keeping control, causing the hang.
So I next set up a GPIO pin as an oscilloscope trigger. I set the pin on entry to loop() and clear it on exit from loop(). This showed me that the stall was clearly inside my code, not outside of loop() as I thought.
I proceeded to move the gpio set and clear statements to key points in my code to track down the section that is causing the stall. I eventually found this section:

					//	draw the launch path

					j = 0;
					if ( rocket[ i ].loc.z > 0 )
					{
						//	Add some variability to the launch path
						do
						{
/*							k = rand() / 2730;
if ( j > 6 )
{
	Serial.print( "j " );
	Serial.print( j );
	Serial.print( ", rnd " );
	Serial.println( random( 12 ) );
}
*/
#ifdef PULSESD
	digitalWrite( SD_NCS, HIGH );
#endif
							rocket[ i ].launch[ rocket[ i ].loc.z ].x = rocket[ i ].iloc.x + sign( rocket[ i ].loc.x - 4 ) * random( 12 ) / 10;	
#ifdef PULSESD
	digitalWrite( SD_NCS, LOW );
#endif

#ifdef PULSESD
	digitalWrite( SD_NCS, HIGH );
#endif
							rocket[ i ].launch[ rocket[ i ].loc.z ].y = rocket[ i ].iloc.y + sign( rocket[ i ].loc.y - 4 ) * random( 12 ) / 10;
#ifdef PULSESD
	digitalWrite( SD_NCS, LOW );
#endif

							rocket[ i ].launch[ rocket[ i ].loc.z ].z = rocket[ i ].loc.z;
						}
						while ( ++ j < 10 && !isoncube( rocket[ i ].launch[ rocket[ i ].loc.z ] ) );

						if ( j >= 10 )
						{
							Serial.println( F( "TRAP: not on cube" ) );
							rocket[ i ].launch[ rocket[ i ].loc.z ].x = rocket[ i ].iloc.x;	
							rocket[ i ].launch[ rocket[ i ].loc.z ].y = rocket[ i ].iloc.y;
						}							

When the program stalled SD_NCS remains high and the message "TRAP: not on cube" was printed. The sketch ran for about 80 minutes before stalling and remained stalled for about an hour.
The source of the stall is the random() function, which is blocking instead of immediately returning a random value. Note that j is simply acting as an iteration count to force the loop to quit. If the loop did indeed quit on its own there would be no hang, but it does hang. Also, one could argue that the program goes off into space due to variable/stack corruption, but that doesn't explain how the code can resume after 60 minutes or so. If the system is reset, this particular mode doesn't automatically restart.
Investigation shows the random function calls the trng() function since the nano RP2040 Connect is equipped with a "true random number generator". I believe it is firmware communicating with this chip that is causing the program to hang.

I was unable to actually find any Arduino library source code to investigate my theory further. Anyone familiar with this interface have any comments?

Thanks,
Scott

Hi All,

I have confirmed my original finding that the random() function on an Arduino Nano RP2040 Connect does indeed block after some amount of time or some number of samples read. Here's a short sketch that proves it:


//  TFT LCD pins connected on the test platform (these are "D" & "A" numbers, not gpio #s!)

#define TFT_DC		( 3 )
#define TFT_CS		( 2 )
#define CUBE_NCS	( 7 )
#define CUBE_CS		( 6 )
#define	LCD_NRST	( 10 )
#define	SD_NCS		( 16 )
#define	FAN			( 9 )
#define	TACH1		( 5 )														//	open-collector
#define	TACH2		( 4 )														//	open-collector

void setup( void )
{
	Serial.begin( 9600 );
	pinMode( TFT_CS, OUTPUT );
	digitalWrite( TFT_CS, LOW );
  while ( !Serial ) ;
}


void loop( void )
{
	digitalWrite( TFT_CS, HIGH );
	long z = random( 1000 );
	Serial.print( z );
	Serial.print( ", " );
	Serial.println( millis() );
	digitalWrite( TFT_CS, LOW );
  delay( 20 );
}

I ran this and logged the results over the course of 3 hours and five minutes taking samples at a nominal 20 milliseconds each. After 152.891 minutes of run time, the sketch stopped producing samples for 62.636 minutes. After the 62 minutes it resumed operation producing samples at the nominal 20 ms period. Notice the pause time is virtually identical to the pauses I measured in my other project.
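
A log like this can be scanned for stalls automatically. The following is a minimal desktop C++ sketch, assuming the millis() column has already been parsed into an array; the names Gap, largestGap and kDemoStamps are hypothetical, and the demo timestamps are made-up values, not data from the actual run.

```cpp
#include <cstddef>
#include <cstdint>

// Result of the scan: index of the sample that follows the longest pause,
// and the length of that pause in milliseconds.
struct Gap {
    std::size_t index;
    uint32_t ms;
};

// Walk consecutive timestamps and remember the largest difference.
// Unsigned subtraction keeps the result correct across a millis() wrap.
constexpr Gap largestGap(const uint32_t* stamps, std::size_t count) {
    Gap worst{0, 0};
    for (std::size_t i = 1; i < count; ++i) {
        const uint32_t delta = stamps[i] - stamps[i - 1];
        if (delta > worst.ms) {
            worst.index = i;
            worst.ms = delta;
        }
    }
    return worst;
}

// Made-up demo data: nominal 20 ms samples with one long pause.
constexpr uint32_t kDemoStamps[] = {0, 20, 40, 60, 5000, 5020};
```

With the demo data, largestGap(kDemoStamps, 6) reports a 4940 ms pause ending at sample index 4.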

I hope this saves some of you some time debugging.

Best,
Scott


This is way above my head, just making sure you are using a type identifier/extension for literals. That is to say, a value as small as 1000 in an equation with an unsigned long should be written 1000UL... or have you covered that already (I read the topic, but got lost).

Hi xfpd,

Thanks for your reply, but you kind of missed the point. If there were an issue with the compiler generating the wrong data type as an argument for random(), I would see random number statistics reflecting that. (It doesn't, because the function prototypes in the header files tell the compiler what type of data is accepted and returned by functions.) What I see are random values in the 0-999 range with great statistics (an average of 500.13 and std. deviation of 288.6, with a very flat histogram).

The problem is random number generation stops for over an hour during the 3 hours I ran the sketch.

Best,
Scott

If a timer says "do something at some number of microseconds" but the type has overflowed, creating "negative time" for a signed long of -2,147,483,648 microseconds, that is about 0.6 hours of negative time, and a full unsigned wrap is twice that, about 1.2 hours.

The 9600 baud setting seems rather low for the rate at which you are sending characters to the console. Does increasing it change its behaviour?

Also, as has been pointed out, the consistent period between failures and its closeness to the micros() overflow of around 70 minutes do look odd.
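
The arithmetic behind that comparison can be checked with a few constants. This is a minimal sketch assuming micros() is a 32-bit unsigned counter of microseconds; the names are hypothetical. It is only an arithmetic observation about the durations reported in this thread, not a diagnosis.

```cpp
#include <cstdint>

// A 32-bit microsecond counter wraps after 2^32 us.
constexpr uint64_t kMicrosWrap = 1ULL << 32;  // 4,294,967,296 us

// Convert microseconds to minutes.
constexpr double toMinutes(uint64_t us) {
    return static_cast<double>(us) / 60.0e6;
}

// Full wrap period: about 71.583 minutes, the length of the third stall.
constexpr double kWrapMinutes = toMinutes(kMicrosWrap);

// 2^32 - 2^29 us: about 62.635 minutes, the length of the shorter stalls.
constexpr double kShortStallMinutes = toMinutes(kMicrosWrap - (1ULL << 29));
```

Both reported stall lengths land on round binary numbers of microseconds, which suggests looking at anything in the call path that compares or waits on a 32-bit microsecond value.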

I suppose I'd swap the random() function for something which simply gave, say, the next sequential number 0-999 in a cycle, to prove the problem was not elsewhere.
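
A minimal sketch of such a stand-in, with the hypothetical name nextSequential, could look like this:

```cpp
// Deterministic stand-in for random( 1000 ): returns the next value in the
// repeating cycle 0, 1, ..., 999, 0, 1, ... given the previous value.
constexpr long nextSequential(long previous) {
    return (previous + 1) % 1000;
}
```

In the test sketch, this would mean keeping z as a static long in loop() and replacing the random( 1000 ) call with z = nextSequential( z ). If the stall disappears with everything else unchanged, that points at random(); if it remains, the cause is elsewhere.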

Anyway, it looks like you have done some praiseworthy debugging to have reduced the problem to that small sketch.