RAM canaries

I don’t think that this is a frequently answered question, but my apologies if it turns out to be a canard.

I’m testing some code here on an unbadged “Uno compatible”. I fully accept that it might not be manufactured to the same standard as a genuine board, and that the problem I’ll describe might not occur on better hardware.

The program I’m running outputs, among other things, the maximum time that the main loop takes to run and the SRAM usage. The latter uses code from Measuring Memory Usage | Memories of an Arduino | Adafruit Learning System. In general both values are stable:

                SERIAL.print(F(" < "));
                SERIAL.print(avgLoop);
                SERIAL.print(F(" < "));
                SERIAL.print(busyLoop);
                SERIAL.print(F(" 0x"));
                SERIAL.print(busyLoop, HEX);
                SERIAL.print(F(" uSec"));
                SERIAL.print(F(", SRAM: "));
                int zz = freeMemory();
                SERIAL.print(zz /* freeMemory() */ );
                SERIAL.print(F(" 0x"));
                SERIAL.print(zz, HEX);

Output normally looks like this:

Loop: 216 < 237 < 18728 0x4928 uSec, SRAM: 2285 0x8ED

In addition, the test version of the program outputs cryptic messages showing the transitions of a state machine, where those messages do not use F() so they come from RAM rather than being pulled directly from Flash.

What I’m seeing is that after roughly a week of operation (10 million lines of logged output), some of the state machine messages are overwritten with garbage; busyLoop, which is rarely recomputed, is corrupted; the result of freeMemory() is corrupted; but avgLoop, which is updated regularly, is not corrupted. Specimen output looks like this, with ... placeholders for stuff I believe to be irrelevant:

D0b
R->L
R0
...
Loop: 216 < 237 < 18728 0x4928 uSec, SRAM: 2285 0x8ED
...
...
D0b
�a>L�bYU
bYU
...
Loop: 216 < 244 < 621043328 0x25045E80 uSec, SRAM: 8960 0x2300

Once I get the suspect numbers etc. they remain unchanged. The program starts over with sensible output if the reset button is pressed, which suggests that Flash is not corrupted. Since both decimal and hex values are affected, it’s probably not some weird output library problem.

Has anybody looked at protecting areas of memory with canaries or checksums that could warn that something’s wrong and trigger a restart?

If not, what symbolic information is available defining the extent of rarely-changing messages etc. in RAM, so that I know what should be checksummed?
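As a minimal sketch of the canary idea, assuming the rarely-changing values can be gathered into one struct: guard words are placed before and after the payload and checked periodically. The names GuardedState, CANARY, armCanaries and canariesIntact are hypothetical, and the payload fields are only examples.

```cpp
#include <stdint.h>

// Hypothetical guard pattern; any value unlikely to be written by accident will do.
const uint16_t CANARY = 0xC5A3;

// A rarely-changing block bracketed by guard words (illustrative layout).
struct GuardedState {
  uint16_t head;        // canary before the payload
  uint32_t busyLoop;    // example payload: the rarely recomputed maximum
  char     message[8];  // example payload: a RAM-resident message
  uint16_t tail;        // canary after the payload
};

// Write the guard pattern into both canaries.
void armCanaries(GuardedState &s) {
  s.head = CANARY;
  s.tail = CANARY;
}

// True while both guard words are intact. A false result means something
// wrote across an edge of the block; a stray write that lands only inside
// the payload is not detected.
bool canariesIntact(const GuardedState &s) {
  return s.head == CANARY && s.tail == CANARY;
}
```

A check like this only says that an overrun reached the block, not where it came from, so it is a tripwire rather than a diagnosis.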

MarkMLl

on an unbadged “Uno compatible”.

SRAM: 2285 0x8ED :o

Does the mysterious sketch that you have not posted (HINT) use Strings by any chance ?

You have run out of memory, and that is the cause of the ‘corruption’.
SRAM 2285 doesn’t look good: the ATmega328P has only 2 kB of SRAM.

UKHeliBob:
Does the mysterious sketch that you have not posted (HINT) use Strings by any chance ?

Definitely not.

MarkMLl

…another five minutes passes

Juraj:
You have run out of memory, and that is the cause of the ‘corruption’.
SRAM 2285 doesn’t look good: the ATmega328P has only 2 kB of SRAM.

Ah. I plead guilty to not looking closely enough at the actual numbers.

Which presumably means that the first thing I should be looking at is why the freeMemory() function (which I cut-and-pasted) isn’t doing anything useful.
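The sanity check on the numbers can be written down as a small sketch. It assumes the ATmega328P memory map (2048 bytes of SRAM at 0x0100 to 0x08FF); the function names are hypothetical. Note that 2285 is 0x8ED, which lies inside that address range just below the top of RAM, so the "free" figure may actually be a stack address rather than a byte count.

```cpp
#include <stdint.h>

// ATmega328P figures: 2048 bytes of SRAM mapped at 0x0100..0x08FF.
const uint16_t SRAM_START = 0x0100;
const uint16_t SRAM_END   = 0x08FF;                      // RAMEND
const uint16_t SRAM_SIZE  = SRAM_END - SRAM_START + 1;   // 2048

// A free-memory figure can never exceed SRAM minus the statically
// allocated globals (pass 0 if the global size is not known).
bool freeIsPlausible(long reportedFree, uint16_t globalBytes) {
  return reportedFree >= 0 &&
         reportedFree <= (long)(SRAM_SIZE - globalBytes);
}

// True if a value falls inside the SRAM address range, which hints that
// an address rather than a byte count is being printed.
bool looksLikeSramAddress(long value) {
  return value >= SRAM_START && value <= SRAM_END;
}
```

Both the healthy-looking 2285 and the later 8960 fail the plausibility test, which is the point being made above: the figure was wrong from the start.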

MarkMLl

MarkMLl:
Has anybody looked at protecting areas of memory with canaries or checksums that could warn that something's wrong and trigger a restart?

I'd look at finding what's writing to memory that it shouldn't instead of applying a sticking plaster.

I don't believe there's any structure to the data that the compiler places in the DATA section (e.g. constants grouped together) so there's no way to easily identify "rarely-changing messages etc."

Place messages in FLASH if you don't want them being corrupted. Of course that won't fix your underlying problem, something else will just get corrupted elsewhere.

TheMemberFormerlyKnownAsAWOL:
...another five minutes passes

My apologies, I didn't see a useful question or suggestion, I was engaged elsewhere, and as an occasional poster my rate is still limited.

OK, so the point you're making if I'm not mistaken is that I should have been looking rather more closely at the numbers, and not blithely trusting the freeMemory() routine.

If I can't trust that, what should I be using to verify the health of the (stack and) heap?

MarkMLl

pcbbc:
I’d look at finding what’s writing to memory that it shouldn’t instead of applying a sticking plaster.

I don’t believe there’s any structure to the data that the compiler places in the DATA section (e.g. constants grouped together) so there’s no way to easily identify “rarely-changing messages etc.”

OK, so GCC isn’t using a single section for data-copied-from-Flash :-/

It’s a pity, since if there were something recognisable and /if/ this problem had been caused by a power glitch etc., being able to checksum a defined block now and again would have been a useful indication that something was wrong.
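For what it's worth, checksumming a defined block now and again could look roughly like this. It is a sketch only: BlockCheck, watchBlock and blockUnchanged are hypothetical names, and the CRC is a plain bitwise CRC-16 with the reflected polynomial 0xA001, which I believe matches what avr-libc's _crc16_update computes.

```cpp
#include <stddef.h>
#include <stdint.h>

// Bitwise CRC-16 (reflected polynomial 0xA001, initial value 0).
uint16_t crc16(const uint8_t *data, size_t len) {
  uint16_t crc = 0;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (uint8_t bit = 0; bit < 8; ++bit) {
      crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
                      : (uint16_t)(crc >> 1);
    }
  }
  return crc;
}

// Hypothetical helper: remembers a reference CRC for a block of RAM.
struct BlockCheck {
  const uint8_t *start;
  size_t length;
  uint16_t reference;
};

// Take the reference CRC once, while the block is known to be good.
BlockCheck watchBlock(const void *start, size_t length) {
  BlockCheck c = { (const uint8_t *)start, length, 0 };
  c.reference = crc16(c.start, c.length);
  return c;
}

// Recompute the CRC and compare it with the stored reference.
bool blockUnchanged(const BlockCheck &c) {
  return crc16(c.start, c.length) == c.reference;
}
```

The block has to be one that the program never legitimately writes to after watchBlock() is called, otherwise every normal update looks like corruption.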

Place messages in FLASH if you don’t want them being corrupted. Of course that won’t fix your underlying problem, something else will just get corrupted elsewhere.

As it is, most of the messages are debugging-only via macros.

I fully accept what people have said which appears to point to heap fragmentation/exhaustion. Since I’m doing no string manipulation (there’s one place where I’m treating a message as a char array to chop part of it off) I presume that the problem is in the output routines.

MarkMLl

MarkMLl:
OK, so GCC isn't using a single section for data-copied-from-Flash :-/

No, it is using a single DATA section for pre-initialised data, it's just you can't guarantee which bits of that are const (read only) and which bits are read/write. It will be all jumbled up with no distinction between the two.

All non-initialized (zeroed) data goes in the BSS section.

This code exists in a NRF24 bootloader to do the initialisation of the DATA and BSS sections...

  // Prepare .data: copy the initialised values from flash into RAM
  asm volatile (
	"	ldi	r17, hi8(__data_end)\n"
	"	ldi	r26, lo8(__data_start)\n"
	"	ldi	r27, hi8(__data_start)\n"
	"	ldi	r30, lo8(__data_load_start)\n"
	"	ldi	r31, hi8(__data_load_start)\n"
	"	rjmp	cpchk\n"
	"copy:	lpm	__tmp_reg__, Z+\n"
	"	st	X+, __tmp_reg__\n"
	"cpchk:	cpi	r26, lo8(__data_end)\n"
	"	cpc	r27, r17\n"
	"	brne	copy\n");
  // Prepare .bss: zero the uninitialised data
  asm volatile (
	"	ldi	r17, hi8(__bss_end)\n"
	"	ldi	r26, lo8(__bss_start)\n"
	"	ldi	r27, hi8(__bss_start)\n"
	"	rjmp	clchk\n"
	"clear:	st	X+, __zero_reg__\n"
	"clchk:	cpi	r26, lo8(__bss_end)\n"
	"	cpc	r27, r17\n"
	"	brne	clear\n");

I suppose if you are sure that your entire preinitialised data is const, you can compare against that in FLASH.

Still I say it's a sticking plaster and you shouldn't use it.
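If someone did want to try the comparison against FLASH, a rough sketch might look like the following. The comparison loop is ordinary C++; the AVR-only part assumes the linker symbols __data_start, __data_end and __data_load_start used by the startup code above can be declared as arrays and read with pgm_read_byte, which is an assumption on my part and has not been tried on hardware. The names readImage, countDifferences and dataSectionDifferences are hypothetical.

```cpp
#include <stddef.h>
#include <stdint.h>

#if defined(__AVR__)
#include <avr/pgmspace.h>
// Linker-provided symbols (assumed declarations, see the startup code).
extern "C" {
  extern uint8_t __data_start[];
  extern uint8_t __data_end[];
  extern const uint8_t __data_load_start[];
}
static inline uint8_t readImage(const uint8_t *p) { return pgm_read_byte(p); }
#else
// Off-target stand-in so the comparison logic can be exercised on a PC.
static inline uint8_t readImage(const uint8_t *p) { return *p; }
#endif

// Counts the bytes in RAM that differ from the initialisation image.
size_t countDifferences(const uint8_t *ram, const uint8_t *image, size_t len) {
  size_t diffs = 0;
  for (size_t i = 0; i < len; ++i) {
    if (ram[i] != readImage(image + i)) ++diffs;
  }
  return diffs;
}

#if defined(__AVR__)
// Only meaningful if nothing in .data is ever legitimately modified.
size_t dataSectionDifferences() {
  return countDifferences(__data_start, __data_load_start,
                          (size_t)(__data_end - __data_start));
}
#endif
```

Any initialised global that the program updates at runtime will show up as a difference, so this only works under the "entire preinitialised data is const" condition stated above.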

pcbbc:
No, it is using a single DATA section for pre-initialised data, it's just you can't guarantee which bits of that are const (read only) and which bits are read/write. It will be all jumbled up with no distinction between the two.

All non-initialized (zeroed) data goes in the BSS section.

...

I suppose if you are sure that your entire preinitialised data is const, you can compare against that in FLASH.

Still I say it's a sticking plaster and you shouldn't use it.

Thanks for that, detail noted. I'll be adding a watchdog to the program, and from there it would be no big deal to reboot every day or so (the uptime counter shows it had been running for about 4.5 days when the problem struck, and the debugging messages might have been making things worse).

But a periodic reboot is still "sticking plaster", and there's nothing in the script which is explicitly interacting with the heap or going recursive.
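For reference, the watchdog plus daily reboot could be sketched like this. It assumes avr-libc's <avr/wdt.h> (wdt_enable and wdt_reset) on the target; REBOOT_AFTER_MS, rebootDue and serviceWatchdog are hypothetical names, and the one-day limit is just an example.

```cpp
#include <stdint.h>

#if defined(__AVR__)
#include <avr/wdt.h>
#endif

// Hypothetical limit: force a restart after one day of uptime.
const uint32_t REBOOT_AFTER_MS = 24UL * 60UL * 60UL * 1000UL;

// True once at least REBOOT_AFTER_MS has elapsed since startMs.
// Unsigned subtraction stays correct across a millis() wrap.
bool rebootDue(uint32_t nowMs, uint32_t startMs) {
  return (uint32_t)(nowMs - startMs) >= REBOOT_AFTER_MS;
}

#if defined(__AVR__)
// Assumes setup() has called wdt_enable(WDTO_8S).
// Call from loop(): the watchdog is patted only while the program
// considers itself healthy and the reboot is not yet due. Once patting
// stops, the watchdog resets the MCU after its timeout.
void serviceWatchdog(uint32_t nowMs, uint32_t startMs, bool healthy) {
  if (healthy && !rebootDue(nowMs, startMs)) {
    wdt_reset();
  }
}
#endif
```

The healthy flag is where any canary or checksum result would feed in, so a detected corruption and the scheduled reboot share one restart path.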

MarkMLl

My guess is that memory is full from the start, and because of that the freeMemory() function shows a wrong value. How much dynamic memory use does the IDE report after the build?

Juraj:
My guess is that memory is full from the start, and because of that the freeMemory() function shows a wrong value. How much dynamic memory use does the IDE report after the build?

Sketch uses 9036 bytes (28%) of program storage space. Maximum is 32256 bytes.
Global variables use 699 bytes (34%) of dynamic memory, leaving 1349 bytes for local variables. Maximum is 2048 bytes.

MarkMLl

Just post your code, otherwise we're doing nothing more than guessing.

What, if any, external hardware are you using, such as OLED displays, SD cards, addressable RGB LEDs (neopixels), etc, that might be allocating a large amount of ram at runtime? Note that the dynamic memory usage reported by the compiler does not include local variables, so you may be using considerably more than is shown. Would help considerably if you could post the actual code.
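Because the compiler's figure excludes local variables, one common way to find out how deep the stack really gets is to paint the unused RAM with a known pattern and later count how much of it survives. The sketch below shows only the counting logic on an arbitrary buffer; PAINT, paintRegion and untouchedBytes are hypothetical names, and choosing the real region between the heap and the stack on the target is left out.

```cpp
#include <stddef.h>
#include <stdint.h>

const uint8_t PAINT = 0xC5;  // hypothetical fill pattern

// Fill a region that the stack may later grow into.
void paintRegion(uint8_t *low, size_t len) {
  for (size_t i = 0; i < len; ++i) low[i] = PAINT;
}

// The AVR stack grows downward, so untouched paint survives at the low
// end of the region. Returns how many bytes, counted upward from `low`,
// still hold the pattern.
size_t untouchedBytes(const uint8_t *low, size_t len) {
  size_t n = 0;
  while (n < len && low[n] == PAINT) ++n;
  return n;
}
```

The smallest value ever returned is the real headroom; if it reaches zero, the stack has met whatever lies below it.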

I'm testing some code here on an unbadged "Uno compatible". I fully accept that it might not be manufactured to the same standard as a genuine board, and that the problem I'll describe might not occur on better hardware.

Sry, lazy - did you ever just try this on a Genuino? I have used every kinda “cheap bordering on free” all the way to “paid way too much” Arduino-ish SBC. Aside from some USB excitement I have never traced a problem to the hardware… that is to say I expect you would find the same behaviour no matter.

a7

david_2018:
What, if any, external hardware are you using, such as OLED displays, SD cards, addressable RGB LEDs (neopixels), etc, that might be allocating a large amount of ram at runtime? Note that the dynamic memory usage reported by the compiler does not include local variables, so you may be using considerably more than is shown. Would help considerably if you could post the actual code.

This might be a duplicate reply, in which case I apologise.

No external hardware. avr/pgmspace.h to support freeMemory(), string.h for strrchr() which is applied to __FILE__, and board_info.h which is used once at startup. I could chop those out since they're basically eyecandy, but would be far happier if freeMemory() were working properly.

I think I need to concentrate on freeMemory(), and produce a dummy demo that either does or doesn't crash.

MarkMLl

The freeMemory() implementation that I used was from Measuring Memory Usage | Memories of an Arduino | Adafruit Learning System and has a check for three-digit version numbers:

#elif defined(CORE_TEENSY) || (ARDUINO > 103 && ARDUINO != 151)

However Arduino 1.8.12 now uses 10812 i.e. a five-digit number, and memory layout has changed so that that function is returning rubbish.

I've tweaked it so that for a program which compiles to

Sketch uses 8616 bytes (26%) of program storage space. Maximum is 32256 bytes.
Global variables use 609 bytes (29%) of dynamic memory, leaving 1439 bytes for local variables. Maximum is 2048 bytes.

free SRAM is reported at runtime to be 1420 bytes using

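int freeMemory() {
  char top;  // a local variable, so its address marks the current top of the stack
  // Gap between the start of the heap and the stack; ignores any heap already allocated
  return (int)(&top - __malloc_heap_start);
}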

I don't know whether there's a definitive version of this function where this has been fixed.
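A variant that also allows for heap already in use might look like the sketch below. It assumes avr-libc's malloc, where __brkval is 0 until the heap has been used and otherwise points at the current heap top; gapBelowStack and freeMemoryWithHeap are hypothetical names. As a cross-check of the arithmetic, and assuming globals start at 0x0100 on the ATmega328P, 609 bytes of globals put the heap start at 0x0361 (865), and 1420 + 865 = 2285 = 0x8ED, the same figure the earlier run printed as free SRAM.

```cpp
#include <stdint.h>

// Free bytes between the top of the heap and a given stack address.
// brkval is 0 while the heap is unused, in which case the heap top is
// simply heapStart.
int16_t gapBelowStack(uint16_t stackAddr, uint16_t heapStart, uint16_t brkval) {
  uint16_t heapTop = (brkval != 0) ? brkval : heapStart;
  return (int16_t)(stackAddr - heapTop);
}

#if defined(__AVR__)
#include <stdlib.h>      // declares __malloc_heap_start in avr-libc
extern char *__brkval;   // avr-libc malloc's current heap top (0 if unused)

int freeMemoryWithHeap() {
  char top;  // its address approximates the current stack pointer
  return gapBelowStack((uint16_t)&top,
                       (uint16_t)__malloc_heap_start,
                       (uint16_t)__brkval);
}
#endif
```

Neither version sees free blocks inside a fragmented heap, so the result is the contiguous gap between heap and stack rather than total free memory.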

Work continues investigating the overall problem, but on reflection I'm not convinced that it's a classic heap/stack collision since I'd expect the output strings which were being corrupted to be stored in the global/static area of SRAM rather than anywhere more dynamic.

MarkMLl