Portenta H7: using DTCM and ITCM

You may have noticed that the CM7 core in the Portenta H7 also has DTCM and ITCM (besides the caches). Here are my examples of how to use DTCM and ITCM on the Portenta H7, plus some details on what they are.

Dear Arduino Portenta H7 team:
Please, can you add the hooks to use DTCM and ITCM to the default mbed LIB and linker script?
It would be great to have this usable without having to figure out what to add and what to modify (e.g. the linker script), and to have ITCM initialized already when the startup code (from the file startup_stm32h747xx.o in libmbed.a) is executed. Thank you.

Does anybody have a clue how to use my own startup_stm32h747xx.S file in place of the one taken from libmbed.a? What is the code in this file? (I would like to add my copy loop for the ITCM initialization there, which is the correct place for it.)

Why use DTCM and ITCM?
Besides the data cache (DCache) and instruction cache (ICache), the CM7 core (not the CM4 core) also has Tightly Coupled Memory (TCM), for data (DTCM) and instructions (ITCM).

These TCMs run at the MCU core clock: no wait states (as with FLASH memory) and no caches involved (no penalty for a cache miss and the cache line fill behind the scenes).

Using DTCM makes a bit more data memory (SRAM) available. For instance, I place the .stack and also the .heap (for malloc()) in DTCM RAM. This frees some space in regular SRAM (e.g. for global and static data). Why not use the 128K of DTCM?

Using ITCM gains nothing in terms of code size: all the code is still needed in FLASH, even if it is placed in ITCM. The startup code has to copy the code from FLASH into ITCM, so you still need the same space in FLASH just to hold a copy of this code.

But data access to/from DTCM and code fetch from ITCM are much faster. The bus is wide (32-bit or even 64-bit), runs at the MCU core clock and does not involve the caches.
Fetching code from FLASH is way slower: it has wait states (the FLASH_LATENCY config).

The caches can speed up data access and code fetches when data and code are reused from the caches (already loaded there). But the first time, when data or code is not in the cache, or when it was evicted to make room for other data and code, you pay a penalty: a cache miss, followed by a cache line fill.
The caches are line-oriented: even if you need just one byte, an entire line (32 bytes, i.e. 8 words, on the Cortex-M7) is loaded into the cache.
A cache is only effective and increases performance if you access or execute the same data or code several times. On the first access, a cache miss slows down code execution and data access.

ITCM and DTCM are the fastest memories, without any penalty (no prefetch overhead). ITCM is often used for interrupt handlers and for functions which are called very often and should execute as fast as possible (without unpredictable latency due to cache operations).

DTCM for uninitialized data
I use the DTCM for .stack and .heap. For data regions which do not need initialization (as global .data or static .bss would), all you need is an entry in the linker script linker_script.ld:

    .heap (NOLOAD):
    {
        __end__ = .;
        PROVIDE(end = .);
        *(.heap*)
        . = ORIGIN(DTCMRAM) + (LENGTH(DTCMRAM)-__STACK_SIZE) - __HEAP_SIZE;
        __HeapLimit = .;
    } >DTCMRAM
    .stack_dummy (NOLOAD) :
    {
        *(*.stack*)
        *(.dtcmram)
    } >DTCMRAM
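
To see which address the location-counter expression in the .heap section works out to, the arithmetic can be reproduced in a few lines. This is only a sketch: the origin, length, stack size and heap size in the example are illustrative assumptions, not the values from the attached script (which reserves part of DTCM for other code).

```cpp
#include <stdint.h>

// Mirrors the linker expression:
//   ORIGIN(DTCMRAM) + (LENGTH(DTCMRAM) - __STACK_SIZE) - __HEAP_SIZE
// All inputs are supplied by the caller; nothing is read from the real script.
constexpr uint32_t HeapLimitAddress(uint32_t origin, uint32_t length,
                                    uint32_t stackSize, uint32_t heapSize) {
    return origin + (length - stackSize) - heapSize;
}

// Example values (assumptions): 128K region at 0x20000000, 1K stack, 4K heap.
constexpr uint32_t kExampleHeapLimit =
    HeapLimitAddress(0x20000000u, 128u * 1024u, 0x400u, 0x1000u);
```

With these example values, __HeapLimit would land at 0x2001EC00. Comparing such a hand-computed value with the __HeapLimit entry in the *.map file is a quick sanity check after editing the script.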

Remark:
I have no idea how to take the linker script directly into my project (using VSMicro). So I had to modify the original linker script, which is located in:
...\AppData\Local\Arduino15\packages\arduino\hardware\mbed_portenta\3.0.1\variants\PORTENTA_H7_M7
It should presumably be possible to configure the build to use my local script, but I do not know how.

Remark:
The MCU stack (.stack) is only used during startup and while an interrupt is being handled. The RTOS (I use one) sets up a new stack for each RTOS thread, so the threads (tasks) use completely different stacks. The RTOS allocates them via malloc(), so the .heap section holds the RTOS thread stacks. Therefore I moved this one to DTCM as well.

Attached is my linker_script.ld (as an H-file, so that I can attach it here) with the extensions for the DTCM and ITCM sections (change the file name extension back).
linker_script.h (3.6 KB)

So you do not need to add any code to the startup, as long as you make sure to use DTCM only for uninitialized data (e.g. local variables in RTOS threads, and local, auto variables of functions called from threads).

ITCM
Instructions placed in ITCM need some code in addition to the section added to the linker script. This should actually be part of the startup code. But I have no idea how to create my own startup_stm32h747xx.S file (and using a different sample file is risky if I do not know exactly what is done in the original one).

You need code to copy the code from FLASH into ITCM. Only the processor (the MCU core) can do this. Therefore I have an assembly file ITCM_Startup.S which does this copy.
ITCM_Startup.h (401 Bytes)
ITCM_Startup.cpp (534 Bytes)

ATTENTION: the file ITCM_Startup.cpp is a native assembly file:
Rename it to *.S (you cannot compile it as *.cpp) and add it to your project: you have to set the properties so that it is used as part of your project (remove "Exclude From Project").
It is also possible to keep a *.cpp file and use inline assembly code (via asm).
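
For readers who prefer C/C++ over assembly, the copy step can be sketched as a plain word-copy loop. This is not the content of the attached ITCM_Startup.S; it is a minimal illustration of the same idea, and the linker symbol names in the comment are hypothetical (use whatever your linker script actually exports).

```cpp
#include <stddef.h>
#include <stdint.h>

// Copy 32-bit words from the load address (FLASH) to the run address (ITCM).
// dstEnd points one word past the last word to write.
// Returns the number of words copied.
size_t CopyWords(const uint32_t* src, uint32_t* dst, const uint32_t* dstEnd) {
    size_t copied = 0;
    while (dst < dstEnd) {
        *dst++ = *src++;
        ++copied;
    }
    return copied;
}

// On the target the call could look like this (symbol names are assumptions):
//   extern uint32_t __itcm_load_start__, __itcm_start__, __itcm_end__;
//   CopyWords(&__itcm_load_start__, &__itcm_start__, &__itcm_end__);
//   __DSB(); __ISB();   // CMSIS barriers before executing the copied code
```

Keeping the loop free of hardware access makes it easy to try out on a PC first; the function doing the copy must of course itself run from FLASH, not from ITCM.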

C vs. C++ code
The code for the function ITCM_Startup() comes from an assembly file *.S. It is a native C function, not a C++ function. The linker will complain about a missing reference if you do not declare this function as a C function. Therefore, see the extern declaration in the H-file.
(C++ function names are mangled, C names are not, and code in an assembly file behaves like C code: not mangled.)
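
A declaration along the following lines gives the function C linkage when the header is included from a C++ file. The attached ITCM_Startup.h is not reproduced here, so the parameter and return types are assumptions; the post only shows the call ITCM_Startup(0,0,0,0).

```cpp
#include <stdint.h>

// ITCM_Startup.h (sketch). Parameter and return types are assumed.
#ifdef __cplusplus
extern "C" {            // suppress C++ name mangling for the assembly symbol
#endif

void ITCM_Startup(uint32_t a, uint32_t b, uint32_t c, uint32_t d);

#ifdef __cplusplus
}
#endif
```

The #ifdef guard keeps the same header usable from plain C files as well.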

Why does calling ITCM_Startup() not work everywhere?
What is strange, and I have no clue why: I cannot call this function right at the entry of setup().
Doing this fails:

void setup() {
    /* copy code to ITCM RAM */
    ////ITCM_Startup(0,0,0,0);
    /* WHY it does not work here? */

I have to call it much later. I do it after most of my setup() code has run, e.g. after also initializing SDRAM:

    //initialize SDRAM, even we do not use yet
    sdram.begin();
#if 0
    GsdramStart = (uint8_t*)sdram.malloc((8 * 1024 * 1024)/8 - 12);     //first 12 bytes are used for malloc?
#else
    GsdramStart = (uint8_t*)0x60000000;                                 //hard-coded start of SDRAM (bank 1, NOR/PSRAM)
#endif
    if (GsdramStart)
        strcpy((char*)GsdramStart, (const char *)"SDRAM");              //0x60000000, but string starts at 0x6000000C
    //before, on 0x60000008 seems to be the allocated length
//...
    ITCM_Startup(0,0,0,0);              /* it works just here */

    WelcomeMessage();

It works only if I call this function ITCM_Startup() much later. Why? No idea!
If anybody has a clue ...
The same code called "too early" crashes (the red LED blinks its Morse-like pattern, i.e. the HardFault handler was invoked). I assume it might be related to the MPU configuration, e.g. my SDRAM init also sets up the MPU. Maybe the ITCM is "disabled" by the MPU config done during the default startup.

Remark:
Actually, this copy should be done as early as possible; it belongs in the startup.S code. All functions placed in ITCM are usable only after the copy was done, never before.
Even though you will not get any compile or linker error, make sure that such an ITCM function is called only after ITCM_Startup() has run.
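
One possible way to protect against calling an ITCM function too early is a "ready" flag that is set right after the copy. The following is just a sketch of that pattern; all names in it are hypothetical and none of it is part of the attached files.

```cpp
// Set once the copy into ITCM has finished.
static volatile bool gItcmReady = false;

// Call this right after ITCM_Startup() has returned.
void MarkItcmReady() { gItcmReady = true; }

bool IsItcmReady() { return gItcmReady; }

typedef int (*HandlerFn)(int);

// Dispatch to the ITCM version only once it is safe to do so;
// before that, fall back to a version of the function located in FLASH.
int CallWhenItcmReady(HandlerFn itcmVersion, HandlerFn flashVersion, int arg) {
    return gItcmReady ? itcmVersion(arg) : flashVersion(arg);
}
```

The extra check costs a few cycles per call, so for hot paths it may be preferable to simply guarantee the call order in setup() instead.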

Place code and data into ITCM and DTCM
Use the section attribute in your code.
Examples:

ECMD_DEC_Status __attribute__((section(".itcmram"))) CMD_led(TCMD_DEC_Results* res, EResultOut out)
{
    /* ... function body, executed from ITCM ... */
}

unsigned long myDTCMvariable __attribute__((section(".dtcmram")));

Even if you assign an initialization value, the variable is uninitialized (random content on first use), because startup.S has no init code for this section. So make sure to initialize such a DTCM variable yourself before its first use.
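
Since initializers are not applied to this section, one approach is to collect the DTCM variables behind an explicit init function that is called once at the start of setup(). A minimal sketch, assuming the section name .dtcmram from above; the macro DTCM_DATA, the buffer and the function name are made up for illustration.

```cpp
#include <stdint.h>
#include <string.h>

// Apply the section attribute only when building for the ARM target,
// so the same file also compiles on a PC for testing.
#if defined(__arm__)
#define DTCM_DATA __attribute__((section(".dtcmram")))
#else
#define DTCM_DATA
#endif

uint32_t sampleBuffer[64] DTCM_DATA;    // contents are random after reset
unsigned long myDTCMvariable DTCM_DATA; // an "= value" here would not be applied

// Call once, before the first use of any DTCM variable.
void DTCM_InitVariables(void) {
    memset(sampleBuffer, 0, sizeof(sampleBuffer));
    myDTCMvariable = 0x1234UL;
}
```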

Hint
If you want to verify where all this code is located, and whether the linker has properly placed the functions and variables in ITCM and DTCM, you can open the generated *.map file in your build directory. The MAP file gives you the addresses where everything is placed.

Restrictions
The TCM memories (esp. DTCM) are only accessible to the CM7 core.
The CM4 cannot access them. You cannot use them as shared memory between CM7 and CM4.
And: you cannot use them for buffers which should be accessed by the regular DMA controllers (DMA1, DMA2), e.g. for SPI, UART, I2C etc. when the data is transferred via DMA. (As far as I know, only the MDMA has a path to the TCMs.)
Apart from that, only the CM7 core itself, through the instructions it executes, can use DTCM.
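
To catch a buffer in DTCM before it is handed to a DMA transfer, a small range check can help. The sketch below assumes the usual STM32H747 DTCM window of 128K starting at 0x20000000; please verify these numbers against the reference manual and your linker script. The function name is hypothetical.

```cpp
#include <stddef.h>
#include <stdint.h>

// Assumed DTCM window on the STM32H747 (CM7 view).
constexpr uintptr_t kDtcmBase = 0x20000000u;
constexpr uintptr_t kDtcmSize = 128u * 1024u;

// Returns true if any byte of [start, start + length) lies inside DTCM.
bool OverlapsDtcm(uintptr_t start, size_t length) {
    if (length == 0) {
        return false;
    }
    const uintptr_t end = start + length;  // one past the last byte
    return start < (kDtcmBase + kDtcmSize) && end > kDtcmBase;
}
```

A typical use would be an assert before starting a transfer, e.g. checking OverlapsDtcm((uintptr_t)buf, len) and refusing to start the DMA if it returns true.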

When you look at the linker_script.ld file, you see that some space in DTCM is used by other code, e.g. the bootloader. So do not overwrite the entire DTCM, or the bootloader might break.

As for the ITCM: I think, but with NO GUARANTEE, that it is not used for anything else (e.g. vector table, bootloader). A check at runtime (memory dump) showed me that ITCM is free and not used.
But I am not really sure, so be careful when using ITCM not to break the bootloader or the interrupt handling (vector table and ISR routines, in case the default mbed code places them there).

ITCM and DTCM can increase performance remarkably. And placing some data in DTCM frees some space in SRAM.

