Arduino Mega watchdog and bootloader

Hello,

I've been trying to track down a sporadic error for a few days now. I've set up the watchdog timer successfully, but every now and then it fails to restart the MCU. Sometimes it restarts 20 times in a row without problems, but other times it fails to restart at all. This seems to depend a bit on the size of the sketch (the biggest version of the sketch is 160 kB), but that may be pure coincidence.

Let me mention all the things that DON'T influence this sporadic error:

  • it doesn't matter if I set the timeout period to something small like 1 second, or to 8 seconds.
  • it doesn't matter if a watchdog interrupt is set up or not.
  • it doesn't matter if the USB cable is connected to the PC or not.

If the watchdog interrupt is enabled, the interrupt always executes flawlessly, but the reset remains flaky. A power off/on reset always brings the application code back to life.

The WDE bit inside the WDTCSR byte is always correct and not accidentally modified by other parts of the code - I checked this by Serial.printing this bit inside the interrupt function.

This issue is consistent across the original Arduino Mega Rev3, the new Genuino Mega, and a Sainsmart Mega. On the Sainsmart, I re-flashed the bootloader, but no difference.

The Arduino is currently used bare, with no external hardware attached at all, so external hardware can't be the cause either.

Now I'm aware that nobody will be able to magically give me the solution to this issue. Google tells me that the watchdog on the mega was not functioning, until the stk500boot_v2_mega2560.hex bootloader came out in 2012. But since then, there's no real hot lead anymore.

But maybe someone can point me in the right direction on how to proceed. So far my thinking goes this way: it can't really be caused by a bug in my application code - and even if it were, the whole point of the watchdog is to reset the system in case of a bug in the code. I also assume ATMEL produces an MCU with an actually functioning watchdog.

So, the only place left to look is the bootloader, as far as I understand it. This is unexplored territory for me. I found the C code for the bootloader in C:\Program Files (x86)\Arduino\hardware\arduino\avr\bootloaders\stk500v2\stk500boot.c

How do I compile this into a hex file? (I have a Windows 10 PC and Arduino 1.6.3)

From there on, I would use the built in IDE functions to upload the hex file. I've done this already with another mega as ISP, with the "Arduino as ISP" sketch from the IDE.

Second question: how can I Serial.print debugging information, while the bootloader code is executing?

Or are there other bootloaders for the mega around? Optiboot seems to be popular, but doesn't work with the mega.

Thank you Thomas

A watchdog reset should absolutely reset the board, 100% every time.

The issue with old bootloaders and WDT has been solved, as far as I know. The bootloader (or at least the version for which there's source code - I have no idea what's in the .hex files) checks the reset cause, and if the reset was caused by the WDT, it jumps right to the app.
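As a rough illustration of that check, here is a small C sketch that decodes a saved MCUSR value and can be compiled on a PC. The bit positions follow the ATmega2560 datasheet; the RESET_* macros and the function name reset_was_watchdog are made up for this example and are not taken from stk500boot.c.

```c
#include <stdint.h>

/* Reset-flag bit positions in MCUSR, per the ATmega2560 datasheet.
   Defined locally (hypothetical names) so this compiles without <avr/io.h>. */
#define RESET_PORF  0  /* power-on reset  */
#define RESET_EXTRF 1  /* external reset  */
#define RESET_BORF  2  /* brown-out reset */
#define RESET_WDRF  3  /* watchdog reset  */
#define RESET_JTRF  4  /* JTAG reset      */

/* Hypothetical helper: returns 1 when the saved MCUSR value has the
   watchdog reset flag set, 0 otherwise. */
int reset_was_watchdog(uint8_t mcusr)
{
    return (mcusr & (1u << RESET_WDRF)) != 0;
}
```

For example, reset_was_watchdog(0x08) is 1 (only WDRF set), while reset_was_watchdog(0x01) is 0 (power-on reset only). On the real chip the value would come from reading MCUSR early in startup, before anything clears it.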

What does it do when it fails to reset the board? Does it go into a hung state where it doesn't do anything, or does the app just keep running, or what?

How are you using it? Are you also using the WDT interrupt in the same sketch? I think there's some weird interaction there.

DrAzzy: What does it do when it fails to reset the board? Does it go into a hung state where it doesn't do anything, or does the app just keep running, or what?

Yes, it goes into a hung state. But I don't know whether it gets stuck before it reaches the bootloader, or if it gets stuck inside the bootloader

How are you using it? Are you also using the WDT interrupt in the same sketch? I think there's some weird interaction there.

Yes, I'm using the WDT interrupt in the same sketch. This is actually okay, according to the datasheet. Once the Interrupt function has completed, the mcu disables the WDT interrupt bit, leaving only the WDT reset bit enabled. This process is hardwired, and needs no code interaction at all. The application sketch then keeps running for another timeout period, and then the mcu starts to hang.

From the data sheet:

WDTON: 1 (high byte fuse, 1 = unprogrammed)
WDE: 1 (watchdog enable)
WDIE: 1 (watchdog interrupt enable)

Mode: Interrupt and System Reset Mode
Action on Time-out: Interrupt, then go to System Reset Mode

Executing the corresponding interrupt vector will clear WDIE and WDIF automatically by hardware (the Watchdog goes to System Reset Mode).
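The mode selection quoted above can be written out as a small lookup, shown here as a C sketch that runs on a PC. It follows the watchdog configuration table in the datasheet as I understand it; the enum and function names are invented for this example.

```c
/* Watchdog operating modes (names are hypothetical, for illustration only). */
typedef enum {
    WDT_STOPPED,
    WDT_INTERRUPT,
    WDT_SYSTEM_RESET,
    WDT_INTERRUPT_AND_RESET
} wdt_mode_t;

/* wdton_fuse: 1 = unprogrammed, 0 = programmed.
   wde, wdie:  the WDE and WDIE bits of WDTCSR (0 or 1). */
wdt_mode_t wdt_mode(int wdton_fuse, int wde, int wdie)
{
    if (wdton_fuse == 0)
        return WDT_SYSTEM_RESET;        /* programmed fuse forces reset mode */
    if (wde && wdie)
        return WDT_INTERRUPT_AND_RESET;
    if (wde)
        return WDT_SYSTEM_RESET;
    if (wdie)
        return WDT_INTERRUPT;
    return WDT_STOPPED;
}

/* After the interrupt vector has executed in Interrupt and System Reset
   Mode, hardware clears WDIE, leaving System Reset Mode. Other modes
   are returned unchanged. */
wdt_mode_t wdt_mode_after_isr(wdt_mode_t before)
{
    return before == WDT_INTERRUPT_AND_RESET ? WDT_SYSTEM_RESET : before;
}
```

With the settings listed above, wdt_mode(1, 1, 1) gives WDT_INTERRUPT_AND_RESET, and wdt_mode_after_isr() of that gives WDT_SYSTEM_RESET, which matches the described behaviour: one interrupt, then a reset on the next timeout.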

However, the sporadic failure even happens if I don't enable the interrupt at all.

I've now tried to permanently enable the watchdog by programming the WDTON bit in the high fuse byte. For the WDTON fuse, "0" means programmed and "1" means unprogrammed.

Now the uploading of a sketch fails. I assume this is expected behaviour, because the default prescaler value is 2048 cycles (16ms), and the bootloader does not change the prescaler value in time. I would like to try and modify the prescaler value in the bootloader, but so far I haven't figured out how to compile a bootloader.
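A quick sanity check of the 16 ms figure, as a C sketch that runs on a PC rather than on the board. It assumes the nominal 128 kHz watchdog oscillator given in the datasheet (the real oscillator varies with voltage and temperature); WDT_OSC_HZ and wdt_timeout_ms are names invented for this example.

```c
#include <assert.h>

/* Nominal watchdog oscillator frequency (assumption: datasheet typical value). */
#define WDT_OSC_HZ 128000UL

/* Timeout in milliseconds for a given number of watchdog oscillator cycles. */
unsigned long wdt_timeout_ms(unsigned long cycles)
{
    assert(cycles > 0);
    return (cycles * 1000UL) / WDT_OSC_HZ;
}
```

wdt_timeout_ms(2048) gives 16, the default mentioned above. The largest prescaler setting, 1048576 cycles, gives 8192, which is the nominal "8 second" timeout.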

I wasn't able to get STK500v2 to build under Windows. I built it on Linux, which went pretty easily once I gave up on Windows.

I still don't see why this should be necessary, though - the bootloader shouldn't be confused by the WDT - it should detect that it was a WDT reset and immediately start the application.

Are you sure the root cause isn't an external device? If a device stops responding as expected, the board hangs while trying to communicate with it and the WDT times out. If the external fault hasn't been rectified when the board restarts, it just hangs again waiting on the device, and the WDT resets it again (or something like that).

DrAzzy: Are you sure the root cause isn't an external device? If a device stops responding as expected, the board hangs while trying to communicate with it and the WDT times out. If the external fault hasn't been rectified when the board restarts, it just hangs again waiting on the device, and the WDT resets it again (or something like that).

There is nothing external connected to the Mega (except the USB cable to the PC). The WDT timeout is intentional for debugging purposes - I don't call the wdt_reset() routine and wait. I have even tried without the USB cable, using an external 5V power supply, but no difference.

After some sleep and some work, I've done some more debugging. I've completely disabled any watchdog functionality. Instead, after some seconds of executing the code, I restart the mcu with either

asm volatile("jmp 0"); // goes to the beginning of the application section

or

asm volatile ("jmp 0x1F000"); // goes to the beginning of the bootloader section

I am aware that this doesn't reset the registers, this is only for investigating the issue.

Jumping to the application always works flawlessly. Jumping to the bootloader works most of the time - the application starts after a few ms. But sometimes the bootloader sits there for 10 seconds or more. However, it manages to recover and start the application again.

So I think this is evidence that the bootloader has at least something to do with the issue. I think it's time to debug the bootloader - daunting! Since this issue only seems to occur with really big sketches, e.g. 100k or more (every time I try to isolate the issue in a small sketch, it disappears), I suspect it's got something to do with register or memory addressing: the bootloader simply assumes some addresses, which maybe are not reliably set to 0 by the ATmega reset. These addresses would normally be 0 with smaller sketches. At this point, however, my theory is only speculation, good only as a starting point for debugging.

A little side-step: how to compile a bootloader with arduino and windows 10.

You need a command line tool called "make". This comes by default with linux systems, but not with windows, and also not with the arduino IDE.

There are several sources on the web. I tried http://www.mingw.org/, which is a collection of command line utilities. It worked for me, so I didn't try anything else.

make turns a .c file into a .hex file according to the instructions given in the makefile. It's not a compiler itself; it calls the avr-gcc compiler, which is delivered with the Arduino IDE.

How to use:

  • open the windows command prompt
  • navigate (change directory, cd) to the folder which contains both the .c file and the makefile
  • type "make" without the quotation marks, hit enter.

AAANND - it's not going to work. Windows doesn't know the command "make". However, it looks for an executable called make.exe in a specific list of directories. This list is called the PATH environment variable. How to edit it: http://superuser.com/questions/949560/how-do-i-set-system-environment-variables-in-windows-10

Add 2 paths to this list:

  • C:\MinGW\msys\1.0\bin (or wherever you installed it; this folder should contain make.exe)
  • C:\Program Files (x86)\Arduino\hardware\tools\avr\bin (or wherever you installed it; this folder should contain avr-gcc.exe)

avr-gcc is the compiler specified in the makefile.

Actually, if you just type "make", it will create a whole lot of .hex files, because the makefile contains instructions for more than one MCU. You can select the one you need, in my case "make mega2560". More instructions for the use of make can be found in the makefile.

The .c file and the makefile for the Mega are at C:\Program Files (x86)\Arduino\hardware\arduino\avr\bootloaders\stk500v2

If you get file read/write errors, or no errors at all but the .hex file never appears, you might be running into permission problems. For starters, Windows doesn't like it if you mess around in the Program Files directory. Try running the command prompt as administrator, or copy your bootloader folder to another location.

I'm sure there are people reading this with more knowledge than me. Please correct any inaccuracies. For example, I'm still not sure if installing mingw is actually necessary. But I couldn't find the make utility anywhere in the arduino IDE.

asm volatile ("jmp 0x1F000");

?

I think that's supposed to be byte addressed, not word addressed?

DrAzzy: asm volatile ("jmp 0x1F000"); ? I think that's supposed to be byte addressed, not word addressed?

Yes, you're right. The ATmega2560 natively uses word addressing. However, the avr-gcc toolchain expects all addresses in byte steps and internally converts them to word steps (divides them by 2) if the target is an AVR MCU - even for an assembly instruction, which the compiler shouldn't actually touch.

The byte address 0x3E000 jumps to the bootloader. BTW, the exact address depends on the MCU and the bootloader size fuses. Check the datasheet, chapter 29.6.16, ATmega2560/2561 Boot Loader Parameters.
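The relation between the two address forms can be checked with a few lines of C on a PC. This is only a sketch: it assumes the ATmega2560's 256 KB flash and a boot section located at the top of flash, and the macro and function names are invented for the example.

```c
#include <stdint.h>

/* Assumption: ATmega2560 flash size, 256 KB. */
#define FLASH_BYTES 0x40000UL

/* One flash word is two bytes. */
uint32_t word_to_byte_addr(uint32_t word_addr) { return word_addr << 1; }
uint32_t byte_to_word_addr(uint32_t byte_addr) { return byte_addr >> 1; }

/* Byte address where a boot section of boot_words words starts,
   assuming the boot section sits at the very top of flash. */
uint32_t boot_start_byte_addr(uint32_t boot_words)
{
    return (uint32_t)(FLASH_BYTES - word_to_byte_addr(boot_words));
}
```

For the 4096-word (8 KB) boot section, boot_start_byte_addr(4096) gives 0x3E000, and byte_to_word_addr(0x3E000) gives 0x1F000, the word address listed in the datasheet table.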

The interesting question is - why did the byte address 0x1F000 manage to restart the sketch? I assume the compiler realized I provided an incorrect jump target and decided to insert target 0 instead?

I think I found the problem.
If the restart was caused by the watchdog, the bootloader exits with

void (*app_start)(void) = 0x0000;
...
app_start();

In any other case, the bootloader exits with

		// exit the bootloader in an orderly fashion
		
		asm volatile ("nop");			// No operation. Do nothing for one clock cycle. Probably not necessary, but doesn't harm either. Copy and pasted from code snippet.
		UART_STATUS_REG	&=	0xfd;	//for the mega2560, this is the register UCSR0A. We clear bit 1 to zero here. 0xfd = 0b1111 1101 . Bit 1 – U2Xn: Double the USART Transmission Speed This bit only has effect for the asynchronous operation. Write this bit to zero when using synchronous operation.
		boot_rww_enable();				// from <avr/boot.h>: Bootloader Support Utilities. Enable the Read-While-Write memory section. enable application section

		// the next instruction is more complicated than necessary for the mega2560. asm jmp would do the job as well, but the jmp command is not available on all processors
		
		/* Indirect jump to the address pointed to by the Z (16 bits) pointer register in the register file.
			The Z pointer register is 16 bits wide and allows jump within the lowest 64K words (128K bytes) section of program memory.
			This instruction is not available in all devices. Refer to the device specific instruction set summary.
		*/
		
		// Z pointer is at registers 30 and 31
		
		asm volatile //the volatile keyword tells the compiler to disable certain optimizations. These optimizations would attempt to make the code smaller, but break it in the process.
		(
			"clr	r30		\n\t" // clear register 30. The new line and tab characters are only there so the assembler file will look nice and human readable.
			"clr	r31		\n\t" 
			"ijmp			\n\t" // indirect jump
		);


		 /*
		 * Never return (stay in an endless loop) to stop GCC to generate exit return code
		 * Actually we will never reach this point, because we jumped away earlier, but the compiler doesn't
		 * understand this.
		 */
		for(;;); // endless loop

With the orderly exit, the sporadic problem has disappeared.
Most probably

boot_rww_enable();

was the crucial thing that was missing, but the other instructions can’t hurt.
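For anyone wondering about the 0xfd mask in the UART_STATUS_REG line: it is simply "all bits except bit 1". A small C sketch that can be run on a PC to confirm this; U2X_BIT and clear_u2x are names made up for the example, and the function works on a plain value, not on the real register.

```c
#include <stdint.h>

/* Bit position of U2Xn in UCSRnA (bit 1, as stated in the comment above). */
#define U2X_BIT 1

/* Returns the given register value with the U2X bit cleared and all
   other bits unchanged. Equivalent to value & 0xfd. */
uint8_t clear_u2x(uint8_t ucsra_value)
{
    return (uint8_t)(ucsra_value & (uint8_t)~(1u << U2X_BIT));
}
```

clear_u2x(0xFF) gives 0xFD, and clear_u2x(0x02) gives 0x00, so only the double-speed bit is affected.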

I’ll look for a place to submit this bug.

Hello hydrocontrol,

would you mind testing another bootloader? I added support for the ATmega2560 to Optiboot, along with MCUSR preservation. I own a MEGA 2560 board and it works flawlessly for me, but I would appreciate any feedback about this modification.

Pros:
  • more flash for the application (it occupies only 1KB instead of 8KB)
  • allows the application to write to flash (I think you don't need it, but it works :-) )
  • it preserves the contents of MCUSR where possible (unlike the original Optiboot, which sets it to zero)

Cons:
  • requires Avrdude 6.1 or newer (the versions shipped with the Arduino IDE are older, so you'll need to replace it with a newer version)
  • a little experimental :-)

You can find it here (with a ready precompiled hex): https://github.com/majekw/optiboot/tree/supermaster (branch supermaster). Thank you for any feedback. (even 'I don't mind doing that' :) )