Memory Management Bugs in v1.6.5?

I've been battling random crashes in my rather huge application. It now appears to me there may be a problem in the memory allocation. I'm wondering if anyone knows of any problems in malloc/free/etc in v1.6.5? I've added statistics to track every time I create or destroy an object, do a malloc or free, etc., and my totals are ALWAYS balanced - a delete for every new, and a free for every malloc. But, nonetheless, if I track available memory (using mallinfo), I see memory getting chewed up seemingly at random. Perhaps I'm mis-interpreting what mallinfo is telling me, but for sure something is going wrong, because the system eventually crashes, with all the symptoms of a memory allocation fault.

Regards,
Ray L.

Sooooo.... Now I'm finding the mallinfo structure is not telling me what I thought it was....

I wrote this little test program:

#include <malloc.h>

extern "C" char *sbrk(int i);

uint32_t Initialarena = 0;
uint32_t Initialfordblks = 0;
uint32_t Initialkeepcost = 0;
uint32_t Initialordblks = 0;
uint32_t Initialuordblks = 0;
uint32_t TotalSize = 0;
uint16_t TotalBufs = 0;

const int maxBufs = 1000;
const int maxSize = 75;
struct mBlk
{
	void *p;
	uint16_t size;
};
mBlk Bufs[maxBufs];
int cnt = 0;


void setup()
{
	Serial.begin(115200);
	for (int i = 0; i < 100; i++)
	{
		Bufs[i].p = NULL;
		Bufs[i].size = 0;
	}
	showStats();

	// Partially fill the array
	int max = random(1, 2 * maxBufs / 3);
	int icnt = 0;
	while (icnt < max)
	{
		int i;
		while (1)
		{
			i = random(0, maxBufs);
			if (Bufs[i].p)
				continue;
			uint16_t size = random(1, maxSize);
			Bufs[i].p = (void *)malloc(size);
			Bufs[i].size = size;
			TotalSize += size;
			TotalBufs++;
			break;
		}
		icnt++;
	}
}


void showStats(void)
{
	struct mallinfo mi = mallinfo();
	
	if (Initialarena == 0)
	{
		Initialarena = (uint32_t)mi.arena;
		Initialfordblks = (uint32_t)mi.fordblks;
		Initialkeepcost = (uint32_t)mi.keepcost;
		Initialordblks = (uint32_t)mi.ordblks;
		Initialuordblks = (uint32_t)mi.uordblks;
	}
	char *heapend = sbrk(0);
	register char * stackptr asm("sp");
	Serial.printf("\n");
	Serial.printf("Total Size: %5lu in %d Buffers\n", TotalSize, TotalBufs);
	Serial.printf("     arena: %5lu/%5lu\tTotal space allocated from system\n", (uint32_t)mi.arena, Initialarena);
	Serial.printf("  fordblks: %5lu/%5lu\tTotal non-inuse space\n", (uint32_t)mi.fordblks, Initialfordblks);
	Serial.printf("  keepcost: %5lu/%5lu\tTop-most releasable space\n", (uint32_t)mi.keepcost, Initialkeepcost);
	Serial.printf("   ordblks: %5lu/%5lu\tNumber of non-inuse chunks\n", (uint32_t)mi.ordblks, Initialordblks);
	Serial.printf("  uordblks: %5lu/%5lu\tTotal allocated space\n", (uint32_t)mi.uordblks, Initialuordblks);
	Serial.printf("   heapend: 0x%08lx\n", (uint32_t)heapend);
	Serial.printf("  stackptr: 0x%08lx\n", (uint32_t)stackptr);
	Serial.printf("\n");
}


void loop()
{
	int nallocs = 0;

	while (nallocs < 100000)
	{
		int i = random(1, maxBufs * 13 / 10);
		int size = random(1, maxSize);
		if (i >= maxBufs)
		{
			// Randomly de-allocate a buffer
			i = random(1, maxBufs);
			if (Bufs[i].p)
			{
				free(Bufs[i].p);
				Bufs[i].p = NULL;
				TotalSize -= Bufs[i].size;
				Bufs[i].size = 0;
				TotalBufs--;
			}
		}
		else
		{
			// Randomly allocate or re-allocate a buffer
			if (Bufs[i].p)
			{
				free(Bufs[i].p);
				Bufs[i].p = NULL;
				TotalSize -= Bufs[i].size;
				Bufs[i].size = 0;
				TotalBufs--;
			}
			Bufs[i].p = (void *)malloc(size);
			Bufs[i].size = size;
			TotalSize += size;
			TotalBufs++;
			nallocs++;
		}
	}
	showStats();
	Serial.iprintf("\n\n");
}

It creates a whole lotta random-sized buffers using malloc, then randomly adds, deletes and replaces buffers with new, freshly malloc'd buffers, trying to thrash memory as much as possible. After it's created 100,000 buffers, it prints the mallinfo statistics, which don't make a lot of sense to me.

Here's an example of the output:

Total Size: 29732 in 786 Buffers
Arena: 41912/41912 Total space allocated from system
fordblks: 5136/ 6216 Total non-inuse space
keepcost: 2952/ 3008 Top-most releasable space
ordblks: 43/ 42 Number of non-inuse chunks
uordblks: 36776/35696 Total allocated space
heapend: 0x20085000
stackptr: 0x20087f78

For the fields with two numbers, the first is the current value, the second is the value at the start of the program, before any buffers are allocated. Total Size is the total size of all currently allocated buffers (by simply summing the malloc arguments).

None of these numbers make any sense to me. There is 96K of RAM, yet this seems to show a total of only ~77K allocated. keepcost, ordblks, and uordbks are nearly unchanged, despite almost 30K or RAM having been allocated thorugh malloc!

Can anyone make sense of this?

Regards,
Ray L.

mallinfo is clearly modeled after the unix/Linux mallinfo, but with several fields not used. Here is some explanation of the Linux mallinfo: mallinfo(3) - Linux manual page

Regards,
Ray L.

I think that due uses newlib for its malloc. There should be lots of net noise about newlib.

westfw:
I think that due uses newlib for its malloc. There should be lots of net noise about newlib.

Not surprisngly, malloc appears to be fine, which leaves me baffled as to what is really going on. I seem to be seeing large amounts of memory disappear when my application is not only not allocating memory, but actually freeing it.

The information in mallinfo is really hard to make use of. The arena value does NOT tell you anything useful about how much free RAM remains, but only how much is currently held by malloc. malloc gets memory in "chunks" from sbrk. It starts with a single chunk, and once that is largely allocated, it requests another chunk from sbrk. So, it appears the only way to know how much is really left, is to keep allocating more and more until a request is eventually rejected. So, even my understanding that my application is disappearing RAM is now in question. I could easily be getting fooled by the way malloc works.

Now truly baffled as to how to proceed to resolve my problem...

Regards,
Ray L.

There is certainly no way I am actually running out of memory. Here are the current statistics:

arena: 32044 - RAM Managed By malloc, in bytes
ordblks: 1 - Total # of Blocks (Free+Allocated) Held by malloc
uordblks: 28540 - RAM Allocated To Application, in bytes
fordblks: 3504 - Free RAM Held By malloc
keepcost: 3504 - Size of Top-Most Releaseable Block

Heap RAM Used: 28540 - Current Total Heap RAM In Use
Program RAM Used: 41684 - Current Total RAM In Use
Stack RAM Used: 112 - Current Stack RAM Use
Est. Free RAM: 27968 - Est. Space Between Heap and Stack

RAM Start: 0x20070000
Data/Bss end: 0x2007a2d4
Heap End: 0x20082000
Stack Ptr: 0x20087f90
RAM End: 0x20088000

With nearly 30K of RAM still free, there is zero chance I am using it all.

Regards,
Ray L.

I think, at least in part, I am being misled by the way malloc works, and the status reported through mallinfo. What I expected to see was if I do a series of mallocs, then frees, the free memory should return to what it was before I started. This appears only partially true. If you free ALL the blocks that were malloc'd, the it does appear to work as expected. If even a single block is left allocated, the remaining "freed" memory may, or may not, show up as free, until that one block is eventually freed.

I can look at free memory in my app, do an operation, then look at memory again. If I do this over and over, I find memory does return to its initial state most of the time, but on occasion will not. The amount of memory that does not "come back" varies from a small number (a few tens of bytes) to a rather large number (several Kbytes), and there is no discernable pattern. It appears totally random as to when the memory will disappear, and how much will "go missing". The only explanation I can see for this is that it is the specific sequence of mallocs (or news) and frees (or deletes) that is causing some quantity of RAM to be withheld from the free pool until a specific block is freed. The amount of memory that "goes missing" is often at least an order of magnitude larger than the largest possible object being created, so it is not as simple as an incorrect destructor not freeing all of its malloc'd memory or something similar.

This makes understanding what's happening extremely difficult, as I have no reliable way of auditing memory use. My system is doing real-time control of a CNC machine, and communicates via a messaging network, so I have no way of timing the creation of messages (which must allocate memory). Things have to happen when they have to happen. It would be easy to replace all the malloc and free calls with macros that would track the amount of memory being used from the applications perspective (in fact, they are already using macors, so I can track the number of mallocs and frees), but doing the same for new and delete would require a huge amount of re-factoring of the code. I do track the number of mallocs, frees, news and deletes, and ensure that the numbers balance. I have also re-checked all the classes, especially the destructors, and I am satisfied they are also correct, and always free any malloc'd memory. I have verified this by testing as well.

So, I am truly stumped here. How the heck do I debug this, and figure out if/when/how I am actually losing memory?

Regards,
Ray L.

This is turning into quite the puzzle! Here is a debug trace from my app:

27460> Operator Home All
28036> T: System: G 27612: SYSTEM_IDLE: System Rec'd from Console: SYSCMD_HOME_ATC, 0, 0
T: System: G 27732: SYSTEM_IDLE->SYSTEM_HOME_ALL: Trace Begin
T: System: G 27812: SYSTEM_HOME_ALL->SYSTEM_HOME_ALL_SUB: StateChange StateStack=SYSTEM_DONE/2
T: System: G 27988: SYSTEM_HOME_ALL_SUB->SYSTEM_HOME_ALL_SUB_1: StateChange StateStack=SYSTEM_DONE/2
T: System: G 27780: SYSTEM_HOME_ALL_SUB_1->SYSTEM_HOME_ALL_SUB_2: StateChange StateStack=SYSTEM_DONE/2

System State = SYSTEM_HOME_ALL_SUB_2
Carousel State = CAROUSEL_HOME_SUB_3
Arm State = ARM_HOME_SUB_1
Lift State = LIFT_IDLE
Lock State = LOCK_IDLE
PDB State = PDB_IDLE
Console State = CONSOLE_OPERATOR_TOP
Front Panel State = FP_STATUS
KFlop Host State = KFLOP_IDLE

OpRunTime=1379

27716> T: System: G 27868: SYSTEM_HOME_ALL_SUB_2: ASync Abort: Source=Console
T: System: G 27876: SYSTEM_HOME_ALL_SUB_2: Aborting
T: System: G 27964: SYSTEM_HOME_ALL_SUB_2: Aborting...
System ERROR: Operation Aborted/ATC Reset
E: System: 28044: Operation Aborted/ATC Reset: Failed State: SYSTEM_HOME_ALL_SUB_2 Top of Stack: SYSTEM_DONE
SystemState: SYSTEM_HOME_ALL_SUB_2
CarouselState: CAROUSEL_HOME_SUB_3
ArmState: ARM_WAIT_SEEK_DONE
LiftState: LIFT_IDLE
LockState: LOCK_IDLE
PDBState: PDB_IDLE
ConsoleState: CONSOLE_OPERATOR_TOP
FrontPanelState: FP_STATUS
ENetHostState: ENETHOST_IDLE
KFlopState: KFLOP_IDLE

T: System: G 31796: SYSTEM_HOME_ALL_SUB_2->SYSTEM_ABORTING: StateChange StateStack=SYSTEM_DONE/2
T: System: G 31908: SYSTEM_ABORTING: Waiting for Device Aborts: C A E L P
T: System: G 35804: SYSTEM_ABORTING: Waiting for Device Aborts: C A E L P
T: System: G 35916: SYSTEM_ABORTING: Waiting for Device Aborts: C A E L P
T: System: G 36028: SYSTEM_ABORTING->SYSTEM_ABORTED: StateChange
T: System: G 35940: SYSTEM_ABORTED: Aborted
T: System: G 36020: SYSTEM_ABORTED->SYSTEM_ABORTED_1: StateChange
T: System: G 35164: SYSTEM_ABORTED_1->SYSTEM_ABORTED_2: StateChange
T: System: G 35164: SYSTEM_ABORTED_2->SYSTEM_IDLE: Trace End
Command has Aborted, System Idle
OpRunTime=1662

35012>

What's important is the 5-digit numbers near the start of each line. Those show the free memory, as reported in mallinfo, at the time each message is created. The numbers look reasonable, up to the error message (the line starting with "E:", at which point it suddenly jumps up by ~3800 bytes to 31796. A few messages later, it jumps again, up to 35804, another ~3800 byte jump. Odd thing is, there is NOTHING in the code that makes allocations of anywhere near that size, and during this period, memory is actually being freed, as the command messages in process are being discarded and the memory freed. This at least greatly narrows the search area...

Regards,
Ray L.

Success! A stupid frickin' typo:

I had this:

char *Msg = malloc(strlen(perr->getMsg() + 1));
strcpy(Msg, perr->getMsg());

instead of this:

char *Msg = malloc(strlen(perr->getMsg()) + 1);
strcpy(Msg, perr->getMsg());

The result being, obviously, I allocated a buffer that was two characters too short for the string, then copied the string to the buffer anyway, clobbering the two bytes that followed.....

Seems to work perfectly now.

Regards,
Ray L.

I don't quite understand why malloc() ends up using sbrk() on deeply embedded systems with no OS. Even the AVR malloc() uses it. I guess if you really want to report on "free memory", you have to take into account the memory that malloc() isn't using yet, as well as free memory within malloc...

I have seen many threads reporting issues with malloc() usage. I wonder if it shouldn't be a safe practice to associate malloc() with an assert() function wrapped in some sort of malloc_assert() function, e.g. :

void setup() {                
   
    Serial.begin(250000);  
 
}


void loop() {


 

  // Each time a malloc is performed, e.g.  in your code :

  char *Msg = malloc(strlen(perr->getMsg()) + 1);
  // Did memory allocation failed ?
  if (Msg == 0)
  { // yes, failed
  
  // make assertion failed here !.
  //assert(false);
  __assert(__FUNCTION__, __FILE__, __LINE__, false);
   
}

// handle diagnostic informations given by assertion and abort program execution:
void __assert(const char *__func, const char *__file, int __lineno, boolean Flag) {

if (Flag == false) {    
    Serial.println(__func);
    Serial.println(__file);
    Serial.println(__lineno, DEC);
   
    Serial.flush();
    // abort program execution.
    abort();
}
}

Near as I can tell, there really is no good way to determine the total available memory with sbrk or malloc, short of testing with increasingly large requests until you get a failure. That surprises me, but seems to be the way it is. I am surprised malloc does not simply get all the space between the heap and stack (minus some reasonable stack margin), but that ain't how it works. I don't see the advantage to the way it is.

But, my problem is solved, so I don't care nearly as much as I did yesterday! :slight_smile:

Regards,
Ray L.

ard_newbie:
I have seen many threads reporting issues with malloc() usage. I wonder if it shouldn't be a safe practice to associate malloc() with an assert() function wrapped in some sort of malloc_assert() function, e.g. :

By whatever means, it is good practice to always check the return value from malloc. But, that would not have helped me one bit, since it gave me exactly what I asked for....

Regards,
Ray L.