Trying to better understand memory usage

I am trying to better understand memory allocation in my MEGA 2560 project.

In my complex smart car project I am trying to change some member variables into global variables. I think this should require the same amount of memory, either way, but what I am seeing is that each global variable requires 57 bytes of SRAM. There are 15 of these variables and I don’t want to lose 855 bytes of SRAM.

I created this small project to try to isolate the problem. Unfortunately, this does not show the problem I am really trying to solve, but hopefully I can get clues by analyzing the difference between the two results.

The code below can be run two ways, with #define nested true or false. I have run this in the standard Arduino IDE and also in the Eclipse/Sloeber environment, which is what I use most of the time. The output is the same either way. The Arduino IDE reports 168 or 166 bytes more flash usage.

There are lots of differences between the two primary conditions that I am trying to explain. Why do the heap usage and heap size change the way they do? The code I use to obtain the heap and stack size is included. It is based on code that was posted here. I have fixed a few bugs in the heap code, but there may be better ways to get this information (more accurately).

Thanks for helping me understand the details of what is happening here.
-Christopher

RESULTS

EXAMPLE 1

Arduino IDE:
avrdude: 5232 bytes of flash verified
Sketch uses 5232 bytes (2%) of program storage space. Maximum is 253952 bytes.
Global variables use 540 bytes (6%) of dynamic memory, leaving 7652 bytes for local variables. Maximum is 8192 bytes.

Eclipse/Sloeber: avrdude: 5064 bytes of flash verified

Memory study
Nested: 1
Heap before allocation
Heap Usage: 7649 / 15276 Stack Size: 19 Free RAM: 7627 [Stack 8700:8681 Heap 8703:1054]
Heap after allocation
Heap Usage: 7641 / 15260 Stack Size: 19 Free RAM: 7619 [Stack 8700:8681 Heap 8703:1062]
Inner value: 1
Sizeof Inner: 2 == 2
Sizeof Wrapper: 2 == 2
Heap in loop
Heap Usage: 7641 / 15260 Stack Size: 19 Free RAM: 7619 [Stack 8700:8681 Heap 8703:1062]

EXAMPLE 2

Arduino IDE:
avrdude: 5212 bytes of flash verified
Sketch uses 5212 bytes (2%) of program storage space. Maximum is 253952 bytes.

Global variables use 540 bytes (6%) of dynamic memory, leaving 7652 bytes for local variables. Maximum is 8192 bytes.

Eclipse/Sloeber: avrdude: 5046 bytes of flash verified

Memory study
Nested: 0
Heap before allocation
Heap Usage: 7649 / 15276 Stack Size: 19 Free RAM: 7627 [Stack 8700:8681 Heap 8703:1054]
Heap after allocation
Heap Usage: 7645 / 15268 Stack Size: 19 Free RAM: 7623 [Stack 8700:8681 Heap 8703:1058]
Inner value: 2
Sizeof Inner: 2 == 2
Sizeof Wrapper: 1 == 1
Heap in loop
Heap Usage: 7645 / 15268 Stack Size: 19 Free RAM: 7623 [Stack 8700:8681 Heap 8703:1058]

/*
 * mem.ino
 * Memory allocation study.
 * Christopher Eliot
 * cre@empiremaster.com
 */

#include <Arduino.h>
#include "Heap.hpp" // file name is Heap.hpp; case matters on case-sensitive file systems

#define BUF_SIZE (256)
char buffer[BUF_SIZE];

#define nested false

class Inner
{
    public:
        Inner (int value) :
                value(value)
        {
        }
        int value;
};

#if ! nested
Inner *inner;
#endif

class Wrapper
{
    public:
#if nested
        Wrapper () :
            inner(new Inner(1))
        {
        }

        Inner *inner;
#endif
        int get_inner_value ()
        {
            return inner->value;
        }

        size_t get_inner_size ()
        {
            return sizeof(*inner);
        }
};

Wrapper *wrapper;

void setup ()
{
    Serial.begin(115200);
    Serial.println(F("Memory study"));
    Serial.print(F("Nested: "));
    Serial.println(nested);

    Serial.println(F("Heap before allocation"));
    get_heap_state(buffer, BUF_SIZE);
    Serial.println(buffer);

#if ! nested
    inner = new Inner(2);
#endif
    wrapper = new Wrapper();

    Serial.println(F("Heap after allocation"));
    get_heap_state(buffer, BUF_SIZE);
    Serial.println(buffer);
    Serial.print(F("Inner value: "));
    Serial.println(wrapper->get_inner_value());
    Serial.print(F("Sizeof Inner: "));
    Serial.print(wrapper->get_inner_size());
    Serial.print(F(" == ")); // Should match
    Serial.println(sizeof(Inner));
    Serial.print(F("Sizeof Wrapper: "));
    Serial.print(sizeof(*wrapper));
    Serial.print(F(" == ")); // Should match
    Serial.println(sizeof(Wrapper));
}

void loop ()
{
    Serial.println(F("Heap in loop"));
    get_heap_state(buffer, BUF_SIZE);
    Serial.println(buffer);
    delay(50000);
}

/*
 * Heap.hpp
 *
 *  Created on: Jan 10, 2023
 *      Author: cre
 *
 * Based on: https://forum.arduino.cc/t/getting-heap-size-stack-size-and-free-ram-from-due/678195
 */

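#pragma once

#include <stddef.h> // size_t

// Writes a one-line summary of heap and stack usage into buffer.
void get_heap_state (char *buffer, size_t size);
// Distance from a freshly malloc'ed byte up to the stack pointer.
unsigned long get_free_ram ();
// Address returned by a one-byte probe allocation.
unsigned long get_heap_address ();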

/*
 * Heap.cpp
 *
 *  Created on: Jan 10, 2023
 *      Author: cre
 */

#include <Arduino.h>
#include "Heap.hpp"

static const unsigned long initial_stack_address = SP;

void get_heap_state (char *buffer, size_t size)
{
    const unsigned long heap_address = get_heap_address();
    const unsigned long heap_size = RAMEND - heap_address;
    const unsigned long stack_size = initial_stack_address - SP; // unsigned to match %lu below
    const unsigned long free_ram = get_free_ram();
    const unsigned long sp = SP;
    const unsigned long ramend = RAMEND;
    snprintf(buffer, size, "Heap Usage: %lu / %lu Stack Size: %lu Free RAM: %lu [Stack %lu:%lu Heap %lu:%lu]",
            heap_size, heap_size + free_ram, stack_size, free_ram, initial_stack_address, sp, ramend, heap_address);
}

unsigned long get_free_ram ()
{
    char *const heap_var = (char*) malloc(sizeof(char));
    const unsigned long heap_address = (unsigned long) heap_var;
    free(heap_var);

    return SP - heap_address;
}

unsigned long get_heap_address ()
{
    char *const heap_var = (char*) malloc(sizeof(char));
    const unsigned long heap_address = (unsigned long) heap_var;
    free(heap_var);
    return heap_address;
}
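
One commonly used alternative to probing with malloc(1) reads the allocator's own bookkeeping symbols, __brkval and __heap_start, which avr-libc exports (both appear in the avr-nm listings later in this thread). The sketch below keeps the arithmetic in a plain function so it can be checked off-target. The names free_ram_from and free_ram are illustrative, and the heap start of 1052 used in the note afterwards is an assumption derived from 540 bytes of globals above the ATmega2560's SRAM base of 0x200.

```cpp
#include <assert.h>
#include <stdint.h>

// Free RAM between the top of the heap and the stack pointer.
// brkval is 0 until the allocator has grown the heap; in that case the heap
// is still empty and its top is heap_start.
static uint16_t free_ram_from (uint16_t sp, uint16_t brkval, uint16_t heap_start)
{
    const uint16_t heap_top = (brkval != 0) ? brkval : heap_start;
    assert(sp >= heap_top);
    return static_cast<uint16_t>(sp - heap_top);
}

#if defined(__AVR__)
#include <avr/io.h>       // SP
extern char *__brkval;    // avr-libc: current top of the heap, or 0
extern char __heap_start; // linker symbol: first byte after .bss

// On-target wrapper; only compiled for AVR.
static uint16_t free_ram ()
{
    return free_ram_from(SP, (uint16_t) __brkval, (uint16_t) &__heap_start);
}
#endif
```

With the figures printed above, free_ram_from(8681, 0, 1052) gives 7629, two bytes more than the 7627 reported, because nothing has to be allocated to take the measurement. Note that this only measures the gap between heap top and stack; freed blocks inside the heap are not counted.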


Without looking at the details of your test, I'm not sure why you're surprised. You are not creating a simple variable, you are creating an instance of a class, which HAS some memory overhead, both FLASH and RAM. FLASH is used to store the code and initialization information for the member data and functions of the class. Then, when the class is instantiated, a block of RAM needs to be allocated, and that requires a memory allocation structure to be created, so malloc can keep track of what memory blocks are in use, where they are, and how big they are. The class itself also occupies memory, for its member variables, its vtable (if required), etc. and other book-keeping information needed to make the class function.

If any of your classes are unused, the corresponding code and data will be discarded by the linker, since it is not needed. So, the first time you DO use the class, there will be a FLASH memory hit to include the class code and data.

Thanks, I must have been unclear. I am not surprised that there are differences, I am trying to figure out how to calculate and predict exactly what they are. For instance in EXAMPLE 1 the RAM goes from 7627 to 7619 in the setup method, so I am using 8 bytes. There are two objects allocated, each 2 bytes in size. Presumably, the other 4 bytes are overhead from malloc.

Initially it is leaving 7652 bytes for local variables. When I first check, Free RAM is 7627 and the stack is 19. 7627 + 19 = 7646, so where did the other 6 bytes go?
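
One way to account for the six bytes is to redo the arithmetic from the addresses printed in the report line. The sketch below does that with compile-time constants. It assumes that SRAM on the ATmega2560 starts at 0x200 and that avr-libc's malloc puts a 2-byte size field in front of each block, so the probe pointer sits two bytes above the real heap start. All constant names are illustrative.

```cpp
#include <assert.h>
#include <stdint.h>

// Values printed by get_heap_state() before any allocation.
constexpr uint16_t kRamEnd = 8703;    // RAMEND
constexpr uint16_t kInitialSp = 8700; // SP captured during static initialization
constexpr uint16_t kSp = 8681;        // SP inside get_heap_state()
constexpr uint16_t kProbe = 1054;     // address returned by malloc(1)

// Assumptions that the sketch does not print.
constexpr uint16_t kRamStart = 0x200; // first SRAM address on the ATmega2560
constexpr uint16_t kGlobals = 540;    // "Global variables use 540 bytes"
constexpr uint16_t kMallocHeader = 2; // avr-libc block size field

constexpr uint16_t kHeapStart = kRamStart + kGlobals;               // 1052
constexpr uint16_t kIdeFree = (kRamEnd + 1) - kHeapStart;           // what the IDE reports
constexpr uint16_t kMeasured = (kSp - kProbe) + (kInitialSp - kSp); // Free RAM + Stack Size

// Bytes above the captured stack address, counting the byte at RAMEND itself.
constexpr uint16_t kEarlyStack = (kRamEnd + 1) - kInitialSp;

static_assert(kHeapStart + kMallocHeader == kProbe, "probe sits just past the header");
static_assert(kIdeFree == 7652, "matches the IDE message");
static_assert(kMeasured == 7646, "7627 + 19");
static_assert(kIdeFree - kMeasured == kEarlyStack + kMallocHeader, "4 + 2 = 6");

// Bytes still unaccounted for after both corrections.
static uint16_t unexplained_bytes ()
{
    return static_cast<uint16_t>(kIdeFree - kMeasured - kEarlyStack - kMallocHeader);
}
```

Under those assumptions nothing is lost: four bytes lie above the address captured in initial_stack_address (consistent with one 3-byte return address already on the stack, plus the byte at RAMEND that the IDE counts), and two bytes are the header of the probe allocation.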

Then you have to study the compiler, linker and RTL code, which affect the allocation and use of memory. And don't forget the impact of the optimization level. Also find out how well or badly dynamic memory works on a Mega, which does not have much RAM.

As long as you are surprised by the varying results you are searching in the wrong places or use inappropriate research code. Start with simpler code to verify your metering functions.

Thanks, I was hoping someone who has studied all that could help me get started.

I studied all that 50 years ago. The principles have not changed since then, I think.

Yes, I learned on PDP-8, PDP-10, OS-360, ITS, Lisp Machine etc about that time. The Arduino is much more powerful than the PDP-8, I think. In any case, I've only been looking at the Arduino for a couple months so I don't know this system well at all, yet.

Yes, of course, but the principles of memory management and use are widely independent from processing power. The separation into stack (SS) and heap (BS) memory is common consent. A constant memory is allocated either in program space (flash, ROM...) on a microcontroller and/or loaded into initialized data memory (DS) by the boot or program loader on a GPU.

When I wrote my first decompiler it was really funny to see how each compiler moved or copied the constant data into RAM instead of using a dedicated DS loaded from the binary disk image.

On an AVR machine I wonder what the F() macro really does. Does it copy the data from PROGMEM into SRAM for use with any other standard function?

Wow. That's flawed. It is possible the code that initializes initial_stack_address runs before the stack pointer (SP) is initialized.

How are you getting the 57?

F() doesn't actually DO anything (beyond putting the string in flash, same as PSTR.) It's a C++ typing hack to allow code to call different (overloaded) functions for a string in flash vs a string in RAM. C++ itself doesn't have a concept of some pointers pointing to different memory than other pointers, so F() fakes it.

Wow. That's flawed. It is possible the code that initializes initial_stack_address runs before the stack pointer (SP) is initialized.

Nah, it's fine. I'm pretty sure C/C++ requires that the stack pointer be set up before any actual C code is executed - even initialization. In any case the ATmega328p and all ARM processors initialize the SP in hardware, and even older AVRs initialize the stack pointer first thing in the startup code (order of startup stuff is defined here: Memory Sections).

#define nested false

OTOH, this strikes me as particularly dangerous, since you're counting on "false" evaluating to 0 at preprocessor time, when everyone knows that good C++ programmers probably did something like const int false = 0;
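
For what it is worth, a C++ preprocessor evaluates true and false inside #if as 1 and 0, so the sketch as posted selects the intended branch. The same line compiled as C without <stdbool.h> would treat both words as unknown identifiers, i.e. 0. A numeric flag sidesteps the question. A minimal sketch, with NESTED and inner_storage_kind as illustrative names:

```cpp
#include <assert.h>
#include <string.h>

#define NESTED 0 // 1: Inner* is a member of Wrapper, 0: Inner* is a global

// Reports which variant was selected at preprocessing time.
#if NESTED
static const char *inner_storage_kind () { return "member"; }
#else
static const char *inner_storage_kind () { return "global"; }
#endif
```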

I get "inner" as two bytes, BTW. I'm also not seeing where you got 57 from, based on your quoted output.

 avr-nm -SC /tmp/Arduino1.8.13Build/*.elf | grep inner
00800175 00000002 b inner
#define nested 1
> avr-nm -SCn /tmp/Arduino1.8.13Build/*.elf | grep -v " [tTaAwW] "
00800100 D __data_start
00800100 00000002 D __malloc_heap_end
00800102 00000002 D __malloc_heap_start
00800104 00000002 D __malloc_margin
00800106 00000012 d vtable for HardwareSerial
0080016c B __bss_start
0080016c D __data_end
0080016c D _edata
0080016c 00000001 b timer0_fract
0080016d 00000004 b timer0_millis
00800171 00000004 b timer0_overflow_count
00800175 00000002 b wrapper
00800177 00000100 b buffer
00800277 0000009d b Serial
00800314 00000004 b initial_stack_address
00800318 00000002 B __brkval
0080031a 00000002 B __flp
0080031c B __bss_end
0080031c N __heap_start
0080031c N _end
00810000 N __eeprom_end
> 
#define nested 0
> avr-nm -SCn /tmp/Arduino1.8.13Build/*.elf | grep -v " [tTaAwW] "
00800100 D __data_start
00800100 00000002 D __malloc_heap_end
00800102 00000002 D __malloc_heap_start
00800104 00000002 D __malloc_margin
00800106 00000012 d vtable for HardwareSerial
0080016c B __bss_start
0080016c D __data_end
0080016c D _edata
0080016c 00000001 b timer0_fract
0080016d 00000004 b timer0_millis
00800171 00000004 b timer0_overflow_count
00800175 00000002 b inner
00800177 00000100 b buffer
00800277 0000009d b Serial
00800314 00000004 b initial_stack_address
00800318 00000002 B __brkval
0080031a 00000002 B __flp
0080031c B __bss_end
0080031c N __heap_start
0080031c N _end
00810000 N __eeprom_end
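
The two listings are identical apart from the name of the 2-byte pointer at 00800175, so both variants use the same amount of statically allocated RAM. That total can be read off the symbol addresses. The helper below is an illustrative sketch, not part of any toolchain:

```cpp
#include <assert.h>
#include <stdint.h>

// Bytes of .data plus .bss, computed from the __data_start and __bss_end
// addresses shown in the avr-nm output.
constexpr uint32_t static_ram_bytes (uint32_t data_start, uint32_t bss_end)
{
    return bss_end - data_start;
}

static_assert(static_ram_bytes(0x00800100, 0x0080031c) == 540,
              "0x21c bytes, the same 540 the IDE reported");
```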

I learned on ... PDP-10 ... ITS

Ah, sweet memories. (billw@mit-dm/etc (not an MIT alum, though!))

What if no such overloaded function exists?
I'd expect that F() copies the string to RAM for processing with standard C/C++ functions.

Considering that the 328P MCU has only 2048 bytes of RAM, one would naturally want to use part of the flash memory to store fixed data such as welcome/error messages. In that case, the message would be saved directly into flash memory using the SPM instruction and retrieved directly from flash memory using the LPM instruction. Would RAM be involved in this process at all (apart from 1-byte buffering)? For example:

Serial.println(F("Welcome"));

Well, it doesn't.

#define F(string_literal) (reinterpret_cast<const __FlashStringHelper *>(PSTR(string_literal)))

Use it as an argument to a function expecting a char*, and you should get an error:

error: cannot convert 'const __FlashStringHelper*' to 'const char*' for argument '2' to 'char* strcpy(char*, const char*)'
   strcpy(buf, F("test string"));
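
The typing hack can be modeled on a desktop compiler to see why the right overload gets picked. In the sketch below, FlashTag stands in for __FlashStringHelper and F_MODEL for F(); both names are made up for illustration, and PSTR() is left out, so the string stays in ordinary memory here.

```cpp
#include <assert.h>
#include <string.h>

// Stand-in for Arduino's __FlashStringHelper: an incomplete type that is
// never defined and exists only to give the pointer a distinct type.
class FlashTag;

// Stand-in for F(): the same cast as the real macro, minus PSTR().
#define F_MODEL(string_literal) (reinterpret_cast<const FlashTag *>(string_literal))

// Overload resolution chooses between these from the pointer type alone.
static const char *where (const char *) { return "ram"; }
static const char *where (const FlashTag *) { return "flash"; }
```

Calling where("x") selects the first overload and where(F_MODEL("x")) the second. A function that only accepts const char* rejects the tagged pointer at compile time, which is the error shown above.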

These instructions have to be generated by the compiler, e.g. in an overloaded function version.

So I was just lucky to have tested only functions with prepared (overloaded) versions. In addition to overloaded functions, there seem to be a couple of re-implemented functions like strcpy_P().

The 57 bytes is my measured value from the full smart car project. That has about 30 files and depends upon a complex breadboard setup so I am not planning to post it here.

There is an example of how to use F() strings in my logging library, GitHub - ChrisE2018/logging: Logging library for Arduino.

Look at the implementation of operator<< at line 31 of LogBuffer.cpp.

My understanding of the F() string macro is that it puts strings into PROGMEM, i.e., flash. You need to use a special technique to get the bytes out, but you can save a lot of SRAM with it. There is a PGM_P type for strings in flash and a pgm_read_byte function to get a byte from it and return a normal char.
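
As a minimal sketch of that technique (not the code from the logging library), the function below copies a flash string into a RAM buffer one byte at a time. The names copy_from_flash and welcome are illustrative, and the #else branch is a desktop stand-in so the example also compiles off-target, where "flash" is just ordinary memory.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__AVR__)
#include <avr/pgmspace.h>
#else
// Desktop fallback (assumption for testing only): no separate flash space.
#define PROGMEM
#define pgm_read_byte(address) (*reinterpret_cast<const unsigned char *>(address))
typedef const char *PGM_P;
#endif

static const char welcome[] PROGMEM = "Welcome";

// Copies a NUL-terminated flash string into dst, truncating if needed.
// dst is always terminated; returns the number of characters copied.
static size_t copy_from_flash (char *dst, size_t size, PGM_P src)
{
    assert(dst != NULL && size > 0);
    size_t i = 0;
    for (; i + 1 < size; i++)
    {
        const char c = static_cast<char>(pgm_read_byte(src + i));
        if (c == '\0')
            break;
        dst[i] = c;
    }
    dst[i] = '\0';
    return i;
}
```

Only the destination buffer costs SRAM; the literal itself stays in flash. avr-libc also ships ready-made variants such as strcpy_P() for the common cases.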

In general, you are performing heap allocations, which are better avoided on Arduino MCUs. I would not be sure that the relation between your expected memory usage and the actual memory allocation is linear, since operations like new may need additional bookkeeping to execute. That's why runtime memory allocation is less welcome than stack allocation (again, on MCUs).
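
When the number of objects is fixed at compile time, one alternative is to let the linker place them. The sketch below reuses the Inner class from the original post and gives it static storage. The name inner_storage is illustrative, and this is only one possible arrangement.

```cpp
#include <assert.h>

class Inner
{
    public:
        Inner (int value) :
                value(value)
        {
        }
        int value;
};

// Statically allocated: the size is known at link time and shows up in the
// IDE's "Global variables use ..." figure, with no malloc bookkeeping.
static Inner inner_storage(2);

// Existing code that expects a pointer can keep using one.
static Inner *const inner = &inner_storage;
```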

what I am seeing is that each global variable requires 57 bytes of SRAM.

Well, in any case, a global variable should not require 57 bytes of RAM unless it is a 57 byte variable, and would probably require slightly more if allocated dynamically (maybe 59 bytes, as it includes an additional field to indicate the length of the malloc'ed block. Other malloc implementations may have even higher overhead.)
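
As a rough model of that overhead, assuming avr-libc's allocator on AVR (a 2-byte size field in front of each block, and a 2-byte minimum payload so that a freed block can hold a free-list node), the heap cost of a single allocation can be estimated as follows. The function name heap_cost is illustrative.

```cpp
#include <assert.h>
#include <stdint.h>

constexpr uint16_t kHeaderBytes = 2;     // assumed size field per block
constexpr uint16_t kMinPayloadBytes = 2; // assumed minimum block payload

// Estimated bytes taken from the heap by one malloc/new of the given size.
constexpr uint16_t heap_cost (uint16_t requested)
{
    return static_cast<uint16_t>(
        (requested < kMinPayloadBytes ? kMinPayloadBytes : requested) + kHeaderBytes);
}

static_assert(heap_cost(57) == 59, "a 57-byte object costs 59 bytes on the heap");
static_assert(heap_cost(2) + heap_cost(2) == 8, "EXAMPLE 1: two 2-byte objects");
```

The second check reproduces the 8-byte drop seen in EXAMPLE 1, where two 2-byte objects were allocated.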