Portenta H7 : SDRAM and code execution

tjaekel · May 30, 2022, 10:01pm

A topic (experience, test) to share with you: for ARM processor and system "experts", esp. in terms of ARM instruction set and MPU, relocatable code...

Question first:
Does anyone know how to mix long_calls with short_calls in C/C++ code?
That is: how do I declare, define and call functions that are too far away from the calling code to be reached by a PC-relative branch?

My intention: I want to load code into external SDRAM (visible at address 0x60000000) and execute it from there (e.g. for "Apps" loaded from QSPI or SD card).

OK, here is what I think we have to bear in mind when implementing this:

Obviously, the SDRAM has to be initialized, e.g. via:

SDRAMClass sdram;
    sdram.begin();

I am not sure whether code execution from SDRAM is allowed out of the box: the external memory range 0x60000000 ... 0x7FFFFFFF is by default "normal" memory and executable. But mbed, the RTOS etc. could configure the MPU and disable code fetch and execution there. This is actually recommended: if no external SDRAM is available, it avoids speculative code fetches and cache line fills from unavailable memory, which can otherwise end in the HardFault_Handler (which I had for several days). So consider setting up an MPU entry for the SDRAM region.
Place data and code in SDRAM: you can use __attribute__((section(...))) to do so.
Extend linker_script.ld: you have to define sections there so that the code is linked to SDRAM addresses while the code and data are kept as a "copy" in FLASH ROM (I use this for a simple test: my SDRAM functions sit as a copy in FLASH ROM, just to make it easier).
Have startup code: before the first use, and after the SDRAM has been initialized, the content (data and code = text) has to be copied to SDRAM. Have code (an assembly file) that "moves" it from the other location (FLASH ROM) into the SDRAM region.
Very important: the compiler generates PC-relative branches and calls (compare GCC options such as -fPIC and -mlong-calls / -mno-long-calls). Because the Thumb branch and call instructions address their target "relative to the PC value", the called function cannot be further away than a few MB (4 MB for the original Thumb BL, 16 MB for the Thumb-2 BL on Cortex-M7). Calling a function located in FLASH ROM (0x08000000) from SDRAM (0x60000000) is TOO FAR AWAY for a relative Thumb instruction! The linker will complain with "relocation truncated" (and it did so often). So you need "helper functions", called "trampoline functions" or "veneers", to extend such a call over a larger distance (mainly by loading a 32-bit address into a register and jumping via "next PC is register content").
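
To make the distance argument concrete, here is a small host-side sketch (not part of the attached files) that checks whether a direct PC-relative BL could reach its target. It assumes the 32-bit Thumb-2 BL encoding of the Cortex-M7 (25-bit signed, halfword-aligned offset, relative to the instruction address + 4); the name bl_can_reach is made up for illustration.

```cpp
#include <cstdint>

// Returns true if a Thumb-2 BL located at `from` can branch directly to `to`.
// Assumption: offset range -16777216 .. +16777214 relative to PC = from + 4.
bool bl_can_reach(std::uint32_t from, std::uint32_t to) {
    // Bit 0 of a Thumb function address is only a state marker, so mask it off.
    const std::int64_t target = static_cast<std::int64_t>(to & ~std::uint32_t{1});
    const std::int64_t pc = static_cast<std::int64_t>(from) + 4;
    const std::int64_t offset = target - pc;
    return offset >= -16777216 && offset <= 16777214;
}
```

For example, bl_can_reach(0x60000000, 0x08000000) is false (the distance is 0x58000000 bytes), while a call inside the FLASH ROM such as bl_can_reach(0x08000000, 0x08100000) is true.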

I have it working (sharing the attached files here). The biggest problem is dealing with these PC-relative calls when the distance is far beyond the branch range. The only solution I have: write assembly code which "extends" the call from the SDRAM region to FLASH ROM, in order to call a function available there.

Here are the code files and some remarks:
Have a function which copies code and data from FLASH ROM to SDRAM (after SDRAM init was done):
ITCM_Startup.h (401 Bytes)
ITCM_Startup.cpp (1.8 KB)

The file ITCM_Startup.cpp is actually an assembly file, ITCM_Startup.S!

These files contain a function SDRAM_Startup() and functions to "extend" (forward) calls from SDRAM to FLASH ROM (where the regular FW is).

These files are associated with the linker_script.ld:
linker_script.h (3.9 KB)
The file name is linker_script.ld! Place it under your project or potentially under the LIB reference file path (I am not sure whether it is used from there), e.g.: C:\Users\<username>\AppData\Local\Arduino15\packages\arduino\hardware\mbed_portenta\3.0.1\variants\PORTENTA_H7_M7

And you have to bear in mind: assembly code does not "find" C++ functions, so you have to declare functions used from assembly code as "plain C" functions (extern "C", no "name mangling"). Likewise, if you want to branch from assembly code to a function, the called function should be "plain C".

This linker script defines the sections that are used via attribute; code in C/C++ looks like this:

char SDRAM_String[80] __attribute__((section(".sdramdata"))) = { "Hallo SDRAM\r\n" };

Remark: for a new code section, do NOT use the name .textsdram! I use .sdramtext because the linker script already has a definition for *(.text*), which would still place all of .textsdram in FLASH ROM (the second * matches any suffix, so .textsdram is treated the same as .text: in FLASH ROM!).

Configure the MPU: even though it does not seem to be necessary to configure an MPU entry for the SDRAM region (0x60000000, 128MB), you can (and should) do it:

	/* configure the MPU - allow code execution on SDRAM - it does not seem to be needed */
	print_log(UART_OUT, "old MPU setting\r\n");
	hex_dump((char *)0xE000ED90, 0x30, 4, UART_OUT);
	{
#define MPU_SDRAM_EXEC_REGION_NUMBER  MPU_REGION_SDRAM1
#define MPU_SDRAM_REGION_TEX          (0x4 << MPU_RASR_TEX_Pos)		/* Cached memory */
#define MPU_SDRAM_EXEC_REGION_SIZE    (22 << MPU_RASR_SIZE_Pos)		/* 2^(22+1) = 8MB */
#define MPU_SDRAM_ACCESS_PERMSSION    (0x03UL << MPU_RASR_AP_Pos)
#define MPU_SDRAM_REGION_CACHABLE     (0x01UL << MPU_RASR_C_Pos)
#define MPU_SDRAM_REGION_BUFFERABLE   (0x01UL << MPU_RASR_B_Pos)

		MPU->CTRL &= ~MPU_CTRL_ENABLE_Msk;
		/* Configure SDRAM region as first region */
		MPU->RNR = MPU_SDRAM_EXEC_REGION_NUMBER;
		/* Set MPU SDRAM base address (0x60000000) */
		MPU->RBAR = SDRAM_START_ADDRESS;
		/*
			- Execute region: RASR[size] = 22  -> 2^(22+1) -> size 8MB
			- Access permission:  Full access: RASR[AP] = 0b011
			- Cached memory:  RASR[TEX] = 0b0100
			- Disable the Execute Never option: to allow the code execution on SDRAM: RASR[XN] = 0
			- Enable the region MPU: RASR[EN] = 1
		*/
		MPU->RASR = (MPU_SDRAM_EXEC_REGION_SIZE | MPU_SDRAM_ACCESS_PERMSSION | MPU_SDRAM_REGION_TEX | \
			MPU_RASR_ENABLE_Msk | MPU_SDRAM_REGION_BUFFERABLE) & ~MPU_RASR_XN_Msk;	/* keep XN cleared so that code execution is possible */

		/* Enable MPU and leave the predefined regions to default configuration */
		MPU->CTRL |= MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_ENABLE_Msk;
		__DSB();		/* we should use barriers to make sure all updated */
		__ISB();
	}
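
The RASR value written above can be checked on a PC. The following is a minimal host-side sketch, assuming the ARMv7-M MPU_RASR bit layout as used by the CMSIS headers; the helper names rasr_size_field and make_rasr are made up for illustration. Note that the snippet above ORs in the TEX, AP, SIZE, B and ENABLE fields but not MPU_SDRAM_REGION_CACHABLE, so the sketch is called the same way.

```cpp
#include <cstdint>

// Bit positions of the ARMv7-M MPU_RASR register.
constexpr unsigned RASR_ENABLE_POS = 0;
constexpr unsigned RASR_SIZE_POS   = 1;
constexpr unsigned RASR_B_POS      = 16;
constexpr unsigned RASR_C_POS      = 17;
constexpr unsigned RASR_TEX_POS    = 19;
constexpr unsigned RASR_AP_POS     = 24;
constexpr unsigned RASR_XN_POS     = 28;

// SIZE field: region size = 2^(SIZE+1) bytes, so SIZE = log2(bytes) - 1.
// Precondition: region_bytes is a power of two between 32 and 2^31.
std::uint32_t rasr_size_field(std::uint32_t region_bytes) {
    std::uint32_t exponent = 0;
    while (exponent < 31 && (std::uint32_t{1} << exponent) < region_bytes) {
        ++exponent;
    }
    return exponent - 1u;
}

// Builds an enabled region descriptor from its individual fields.
std::uint32_t make_rasr(std::uint32_t region_bytes, std::uint32_t ap,
                        std::uint32_t tex, bool cacheable, bool bufferable,
                        bool execute_never) {
    return (rasr_size_field(region_bytes) << RASR_SIZE_POS) |
           ((ap & 0x7u) << RASR_AP_POS) |
           ((tex & 0x7u) << RASR_TEX_POS) |
           (static_cast<std::uint32_t>(cacheable) << RASR_C_POS) |
           (static_cast<std::uint32_t>(bufferable) << RASR_B_POS) |
           (static_cast<std::uint32_t>(execute_never) << RASR_XN_POS) |
           (std::uint32_t{1} << RASR_ENABLE_POS);
}
```

With an 8 MB region, AP = 0b011, TEX = 0b100, B = 1, C = 0 and XN = 0, make_rasr() returns 0x0321002D, which can be compared with the RASR word in the "new MPU setting" hex dump.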

So, the user code looks like this:

extern "C" {
	void SDRAM_Startup(void);		/* initialize: copy code and data to SDRAM */
	int SDRAM_FuncL(int i);			/* a LONG_JUMP call, via FLASH ROM */
	int SDRAM_Func(int i);			/* the function as assembly code for SDRAM */
	int SDRAM_Func2(int i);			/* another function defined as C-code for SDRAM */
	void SDRAM_UARTSend(const char* str, int chrs, EResultOut out);	/* forward to UARTSend() */
}

char SDRAM_String[80] __attribute__((section(".sdramdata"))) = { "Hallo SDRAM\r\n" };
#ifdef THIS_FAILS
////#pragma push					/* this generates a very strange error: "instantiation of mbed::Callback<(void()>" */
////#pragma long_calls
/* this causes a "relocation truncated" error because a PC-relative (relocatable) call is not possible
 * how to define a C/C++ function placed on SDRAM??
 */
int __attribute__((section(".sdramtext"), long_call)) SDRAM_Func2(int i) {
	return i * 10;
}
////#pragma short_calls
////#pragma pop
#endif
int SDRAM_Func2(int i) {
	return i * 10;
}
typedef int (*pFunc_t)(int);		//define a type for function pointers

And inside a regular function I do this (before I call a function on SDRAM):

	SDRAM_Startup();     /* copy .data and .text to SDRAM! */

	print_log(UART_OUT, "new MPU setting\r\n");
	hex_dump((char*)0xE000ED90, 0x30, 4, UART_OUT);

	print_log(UART_OUT, "SDRAM INIT result\r\n");
	hex_dump((char*)0x60000000, 64, 4, UART_OUT);
	
	print_log(UART_OUT, "SDRAM data         : %lx\r\n", (unsigned long)SDRAM_String);
	print_log(UART_OUT, "SDRAM func long    : %lx\r\n", (unsigned long)&SDRAM_FuncL);
	print_log(UART_OUT, "SDRAM func address : %lx\r\n", (unsigned long)&SDRAM_Func);
	print_log(UART_OUT, "SDRAM func2 address: %lx\r\n", (unsigned long)&SDRAM_Func2);
	strcpy(SDRAM_String, "NEW STRING!\r\n");
	int i = 10;
	i = SDRAM_FuncL(i);
	print_log(UART_OUT, "##1 result is: %d\r\n", i);

	i = SDRAM_Func2(i);
	print_log(UART_OUT, "##2 result is: %d\r\n", i);

	hex_dump((char*)0x60000000, 64, 4, UART_OUT);

	{
		/* this works! but be careful when moving relocatable code to SDRAM: can fail */
		unsigned long* memP1 = (unsigned long*)0x60000200;					//SDRAM address for code (the SDRAM_Func2)
		unsigned long* memP2 = (unsigned long*)(((unsigned long)&SDRAM_Func2) & 0xFFFFFFFEul);	//ATT: it is thumb code! odd address!
		pFunc_t fPtr = (pFunc_t)0x60000201;									//ATT: it must be odd, for thumb code!
		print_log(UART_OUT, "##3 copy function to SDRAM\r\n");
		memcpy(memP1, memP2, 80);
		//try to call now the function on SDRAM
		i = 20;
		i = fPtr(i);
		print_log(UART_OUT, "##3 result is: %d\r\n", i);
	}
	/* call function forwarded to FLASH ROM */
	SDRAM_UARTSend("Hallo from SDRAM\r\n", 18, UART_OUT);
	SDRAM_UARTSend(SDRAM_String, 13, UART_OUT);

So: the data, this SDRAM_String, is not an issue. Just do NOT use (NOLOAD) in the linker script: then this string is copied from FLASH ROM to SDRAM and becomes available in SDRAM. With (NOLOAD) in the linker script, the space is allocated but the initial value assigned to the global (e.g. the string) is not: it does not sit in FLASH ROM as a copy of the SDRAM content.
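
The copy step itself is simple. My SDRAM_Startup() is assembly code (in the attached file), but the idea can be sketched in C++ as a word-wise copy from the load image to the run address. This is only an illustration: the function name copy_section is made up, and on the target the three pointers would come from symbols defined in the linker script.

```cpp
#include <cstddef>
#include <cstdint>

// Copies the 32-bit words in [src, src_end) to dst and returns how many
// words were copied. On target, src/src_end would mark the load image in
// FLASH ROM and dst the start of the section in SDRAM.
std::size_t copy_section(std::uint32_t* dst, const std::uint32_t* src,
                         const std::uint32_t* src_end) {
    std::size_t words = 0;
    while (src < src_end) {
        *dst++ = *src++;
        ++words;
    }
    return words;
}
```

If the caches are enabled, copied code may also need a D-cache clean and an I-cache invalidate before the first call (CMSIS provides SCB_CleanDCache() and SCB_InvalidateICache() for the Cortex-M7); whether this is required depends on the cache and MPU setup.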

As for the function calls: I have not found a way to declare a C/C++ function as a long_call. Often, the build fails with "relocation truncated" or even a very strange error message such as "callback mbed::...".
So I forward all the long_calls from SDRAM via assembly helper functions to FLASH ROM. Not so nice, but I have no idea yet how else to cope with it.

PIC code instructions
When you study the ARM instruction set, you will see that most of the instructions for branching and calling a function are "PC-relative". It means: the code will branch to or call an address computed as "current PC value + offset". The beauty: it is position-independent. The same code can be moved and located somewhere else and it still works (there are no absolute references to where it or the calling code is located, as long as the callee keeps the same distance from the caller).

This does not work anymore when code is executed from SDRAM (0x60000000) and should call a function in FLASH ROM (0x08000000): it is "too far away" for relative addressing. So you have to "extend" the call: you need a "veneer" function where the full 32-bit address is loaded into a register and then branched to (see the assembly code).

OK, fine, but I need a free register (here I use R3) to do so. This limits the maximum number of register parameters for a function call (max. 4: R0..R3). Since I need R3 free for this "long_jump", I can have only 3 parameters on a function call.

I have not (yet) found a way to declare "long_calls" and to forward to a FAR address without losing a temporary register just to do the "long_jump".
How to mix "long_calls" with "short_calls" (the default)? How to get rid of these "relocation truncated" error messages without writing so many assembly handlers?
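
One possible way to get a "long call" from plain C/C++, sketched here as an idea and not tested on the Portenta: call through a function pointer. On ARM GCC, a call through a pointer is normally compiled to an address load plus an indirect branch (BLX register), which has no range limit, and the compiler picks the scratch register itself, so R0..R3 stay free for arguments. The names rom_times_ten, far_target and call_far are made up; the volatile is only there so the compiler cannot turn the call back into a direct BL.

```cpp
// Stand-in for a function that would live in FLASH ROM on the target.
int rom_times_ten(int i) { return i * 10; }

using IntFn = int (*)(int);

// The pointer holds the absolute 32-bit address of the callee.
// volatile: forces a real load of the pointer at each call.
static volatile IntFn far_target = &rom_times_ten;

// Code placed in SDRAM would call the FLASH function like this.
int call_far(int arg) {
    IntFn f = far_target;   // load the absolute address
    return f(arg);          // indirect call, not PC-relative
}
```

On the target, call_far(10) would behave like a direct call to the FLASH function, just with one extra load for the address.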

THUMB code
Also bear in mind: our CM7 uses Thumb instructions, not ARM instructions. It means: bit 0 of every branch target, e.g. when you call a function, is set to 1. The address looks "strange": it is odd. If you print the address of a function, it is odd (bit 0 is set).
It is mandatory to keep it that way (as an odd address): bit 0 as 0 would mean "change to the ARM instruction set (away from Thumb)", and that will FAIL on CM7 (CM3, CM4).
The compiler properly generates Thumb function addresses (bit 0 set). But when you copy code with assembly code, make sure the entry addresses you branch to are always odd (+1).

So, even though the compiler already generates proper odd addresses for function calls, I make sure that bit 0 of a function address remains set or gets set (not for data, which keeps its "real" address).
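
The two address views can be written down as tiny helpers. This is only a sketch with made-up names (thumb_entry, code_start, is_thumb_entry); it mirrors what the test code above does with 0x60000200 and 0x60000201.

```cpp
#include <cstdint>

// Address to branch to: the code start with bit 0 set (Thumb state).
constexpr std::uint32_t thumb_entry(std::uint32_t code_addr) {
    return code_addr | 1u;
}

// Address where the instruction bytes really start (use this for memcpy).
constexpr std::uint32_t code_start(std::uint32_t function_addr) {
    return function_addr & ~std::uint32_t{1};
}

// True if the address is usable as a branch target on a Thumb-only core.
constexpr bool is_thumb_entry(std::uint32_t function_addr) {
    return (function_addr & 1u) != 0u;
}
```

So the copy destination is code_start(...) = 0x60000200, while the function pointer must hold thumb_entry(0x60000200) = 0x60000201.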

Relocatable Code
I do something really ugly: I have a function SDRAM_Func2() which sits in FLASH ROM (0x08000000). I could not call this function directly (too far away). But I take the code of this function and copy it from FLASH ROM into SDRAM. Then I call this function in SDRAM (after it has been copied).

It works: because the generated code is position-independent (PIC), it runs in the same way. I can call it and it is OK.
BUT: if such a function (whose code I copied) called another sub-function, it would NOT work anymore. It will crash!
The sub-function call is still a relative "PC + offset" call, but the sub-function is NOT in SDRAM and is too far away. It cannot work.
So, even though I have an SDRAM function working: any call it makes to other functions has to be a "long_call".

So, all OK, the approach is clear, the framework (test) works. Only if I do this:

int __attribute__((section(".sdramtext"), long_call)) SDRAM_Func2(int i) {
	return i * 10;
}

It gives me a very strange "Callback mbed::..." error. Why? This "long_call" does not seem to work.

And if I do just:

int __attribute__((section(".sdramtext"))) SDRAM_Func2(int i) {

it still complains with "relocation truncated" (SDRAM_Func2 is still too far away for a PC-relative call). How to mix long and short calls?

And: using the assembly helper functions ("veneer code") is a work-around, but how can I still have 4 parameters on function calls (without losing one register just for the long_call)?
OK: I could push a register to the stack, branch forward, pop it back from the stack... very ugly looking code.

Never mind, the good news is: I can load code into SDRAM and execute it from there. Cool: now I can write "Apps", e.g. loaded from SD card or QSPI flash (which does not let me execute directly from it as a memory-mapped QSPI device - a "wrong" chip soldered on Portenta H7?).

But it sounds to me like I have to create a "BIOS vector table" (to extend calls from SDRAM to FLASH ROM) in assembly code, and create the "App" as an independent, self-contained project (it does not use FW and FLASH ROM functions directly); instead it uses the "BIOS vector table" to call external functions.
Not a big deal, just effort (and a somewhat complicated SW structure).

If you have any idea how to deal with long_calls and how to mix code (with short and long jumps), I would appreciate it (or any other idea how to deal with the "problem").
Thank you.

Debug Hooks
Because the Portenta H7 does not really have a debugger (just this bootloader), it is a bit difficult to debug. I cannot, but it would be nice to connect an ST-Link debugger and trace my code. In particular I want to know when the HardFault_Handler is called, what the processor PSPR register value is, etc. (where the fault trap was generated).

OK. Ideas (I use):

After compiling, check the generated *.MAP file (in the project/Debug directory). It gives me a clear indication whether the linker, via linker_script.ld, has placed the code properly, e.g. in SDRAM. It also shows where to find the copy in FLASH ROM, from where it will be copied to SDRAM by SDRAM_Startup().
I have created a hex_dump() function (used in my example code, implementation not provided): I can print any MCU memory region, also registers, e.g. the MPU registers. This is really helpful for debugging: I can see what is inside the memory and registers of the MCU. (I use a UART command line and "hd" is a command to 'inspect' any memory.)
I could also have a UART command to write any memory or register with a value, e.g. "wr". This would let me try something from the command line, without changing and compiling my code again.
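
Since the hex_dump() implementation is not provided, here is a minimal host-side sketch of what such a function could look like. It is not the version used in the snippets above: the name hex_dump_words, the signature, the output format (offset followed by up to four 32-bit words per line) and the string return value are all assumptions; on the target the text would be sent to the UART instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Formats `bytes` bytes starting at `base` as 32-bit hex words,
// four words per line, each line prefixed with the byte offset.
std::string hex_dump_words(const void* base, std::size_t bytes) {
    const unsigned char* p = static_cast<const unsigned char*>(base);
    std::string out;
    char buf[24];
    for (std::size_t off = 0; off + 4 <= bytes; off += 4) {
        if (off % 16 == 0) {                      // start of a new line
            std::snprintf(buf, sizeof buf, "%08zx:", off);
            out += buf;
        }
        std::uint32_t word;
        std::memcpy(&word, p + off, sizeof word); // avoids unaligned access
        std::snprintf(buf, sizeof buf, " %08lx", static_cast<unsigned long>(word));
        out += buf;
        if (off % 16 == 12 || off + 8 > bytes) {  // line full or last word
            out += "\r\n";
        }
    }
    return out;
}
```

Reading registers such as the MPU block at 0xE000ED90 works the same way on the target, as long as the region is readable with word accesses.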

The message is: if you have (and use) a UART, have a simple "monitor program" that gives access to memory and registers from a command line. It makes it easier to try something interactively, without compiling the code again and seeing it crash.
The compile time for Arduino, Portenta H7... is so "bad" (long) anyway that it is faster to debug via a UART command line. A hex_dump function on the command line has become my "biggest friend".

tjaekel · May 30, 2022, 10:44pm

clarification:
If you copy position-independent code (PIC): the start address is an EVEN address, e.g. 0x60000000, but when you call the function (e.g. via a function pointer), the calling address is odd: 0x60000001.

So, align the code properly (best at 4 bytes, via .balign 4), but call functions and branch to code via the odd address: address + 1!

And: it is also best to align 32-bit words, as I do for the "veneer" address (loaded into a register in order to branch to this address). So, when defining a 32-bit word in assembly code, here .word SDRAM_Func2, align it to a 4-byte address (via .balign 4). A 32-bit load instruction needs 4-byte alignment.

This .balign 4 also helps to deal with "holes": Thumb code is 16-bit, so it can happen that the code ends on a 2-byte aligned address but not on a 4-byte aligned one (2 bytes missing, a hole). But the word which is loaded into the register must be found at a 4-byte aligned address.
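
What .balign 4 does to the location counter can be written as a one-line helper. This is just an illustration of the rounding; the name align_up is made up, and the alignment must be a power of two.

```cpp
#include <cstdint>

// Rounds addr up to the next multiple of `alignment` (a power of two).
constexpr std::uint32_t align_up(std::uint32_t addr, std::uint32_t alignment) {
    return (addr + alignment - 1u) & ~(alignment - 1u);
}
```

If Thumb code ends at 0x60000202, the following .word lands at align_up(0x60000202, 4) = 0x60000204, leaving a 2-byte hole; an address that is already aligned stays unchanged.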

If you wonder about the nop used there: I am not sure, but sometimes it is needed. An ARM processor can do a speculative code fetch: it already fetches the next instruction, even if it is about to branch away. The instruction pipeline does not know yet whether the branch will be taken. It "speculates" that the next instruction is needed and therefore fetches it already.
This "speculative" stuff is very tricky to understand. When you see some nops after a branch, even though it looks as if this nop is never reached because the code branches out before, it might be needed, just to avoid that the "speculative" code fetch and the "branch prediction unit" see invalid code. Yes, it can happen that an instruction after a taken branch is fetched and even started already, e.g. when it is a register load from memory; its result is discarded afterwards, but statistically it makes sense to "be ahead" of the executed code.

ARM processors are really cool, and the engineering behind the scenes, the "tricks" for performance improvements, are "amazing". But it can make coding, esp. in assembly, a "bit tricky".

Example: if you execute an instruction at the very last address of a memory (and there is no memory at the next address), or you read data from the very last address (and the next data address is not available memory), the processor is "so smart" to "look ahead": it will try to fetch the next instruction as well, or to fill the cache with the next memory line. But if no memory is available there (and you get a bus error or a "hang" on the bus fabric), your code will crash or hang there forever (the bus fabric does not complete the transaction on missing memory).

Even if all your code is correct and looks reasonable, this "prefetch" ("speculative" code fetch and execution, cache pre-fill) can cause trouble. Therefore you may often find a NOP where it does not seem to be executed, e.g. I have also seen one after FOR loops: it might be "mandatory" to have a NOP there.

tjaekel · May 30, 2022, 11:07pm

Don't worry: the compiler will generate the correct code in "most of the cases". It only becomes an issue when you try to access data or execute code at the very last address of a memory.

BTW: this can be any memory, e.g. also the end of the FLASH ROM or the end of an internal SRAM. The compiler does not know it is the end: the next location is not available anymore, but the compiler does not care (the processor might).

But if it is not available anymore, the MCU (the CPU, not the compiler-generated code) could try to read "ahead" (a "speculative" prefetch happens behind the scenes).
The only way to avoid "speculative" prefetch is to configure the MPU (and define unavailable memory regions as not accessible). But mbed does not seem to do this (as recommended by the ARM guidelines). The MPU in the CM7 is often neither used nor needed, but making use of it can make your SW more "predictable" in terms of behavior (esp. when data and code are at the end of memories).

All fine, as long as you do not hit this issue (not using the very last available memory addresses). If the HardFault_Handler gets called or the FW seems to hang, consider whether it could be caused by accessing the very last available memory location.

tjaekel · June 1, 2022, 2:07am

FYI, I have it working with a BIOS vector table.

This is the memory layout, showing how function calls from the SDRAM "application" are forwarded to the ROM FW via a BIOS vector table:

I have realized that the EABI lets me use R12 (IP, the intra-procedure-call scratch register), so the registers R0..R3 remain free for function call arguments. Great.

Here are the files:
ITCM_Startup.cpp (2.6 KB)
linker_script.h (4.0 KB)

The test code does not yet use a real SDRAM main function. Instead, I forward to a simple function located in SDRAM. It proves that the SDRAM_Entry() function works properly.

Here the sketch code - declarations:

extern "C" {
	void  SDRAM_Startup(void);		/* initialize: copy code and data to SDRAM */
	/* the entry point of SDRAM code, like main() */
	int   SDRAM_Main(int argc, char *argv[]);
	/* BIOS vector table, on SDRAM, used by SDRAM code */
	void  SDRAM_UARTSend(const char* str, int chrs, EResultOut out);
	void  SDRAM_UARTSendString(const char* str, EResultOut out);
	char* SDRAM_UARTGetString(void);
}

static char SDRAM_string[80] __attribute__((section(".sdramdata"))) = "Hallo from SDRAM\r\n";

typedef void (*pFunc)(const char* str, EResultOut out);
typedef char* (*pFunc2)(void);

And the code part:

	SDRAM_Startup();

	/* check if the addresses are correct, for BIOS forward: */
	print_log(UART_OUT, "XX2: %lx\r\n", (unsigned long)&SDRAM_Main);
	print_log(UART_OUT, "XX3: %lx\r\n", (unsigned long)&SDRAM_UARTSendString);
	print_log(UART_OUT, "XX4: %lx\r\n", (unsigned long)&SDRAM_Entry);
	print_log(UART_OUT, "XX5: %lx\r\n", (unsigned long)SDRAM_string);

	/* check the ROM FW functions, used in BIOS vector table: */
	print_log(UART_OUT, "XX2: %lx\r\n", (unsigned long)&UART_Send);
	print_log(UART_OUT, "XX2: %lx\r\n", (unsigned long)&UART_SendStr);
	print_log(UART_OUT, "XX2: %lx\r\n", (unsigned long)&UART_GetString);

	{
		int i;
		i = SDRAM_Entry(10, NULL);
		print_log(UART_OUT, "XX5: %d\r\n", i);
	}

	{
		pFunc fPtr;
		pFunc2 fPtr2;
		char* c;
		fPtr = SDRAM_UARTSendString;
		fPtr(SDRAM_string, UART_OUT);

		fPtr2 = SDRAM_UARTGetString;
		c = fPtr2();
		fPtr(c, UART_OUT);
	}

Remark:
The next step is to set up and create a project to build the SDRAM "application".
This SDRAM code needs the stdlib functions, e.g. strcat, sprintf etc., so they will be duplicated. It is not possible to reuse the same functions from the FW ROM code, so we will link the SDRAM "App" against the standard library (stdlib).

Via this BIOS vector table we reuse the existing functions in the FW. And via the SDRAM_Entry function we can launch the SDRAM code without knowing where the real entry address is (it can be anywhere in the SDRAM code).
The BIOS vector table makes it possible to change the FW or the SDRAM code without having to care where all the functions are. Only the BIOS vector table itself must always be placed at the same address (we use absolute addresses for the vectors). The KEEP keyword in linker_script.ld makes this possible.
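
The same idea can also be expressed in C++ as a table of function pointers instead of assembly stubs. The following host-side sketch is a variation and not the code from the attached files: BiosTable, the fw_* functions and app_main are made-up names, and the UART is replaced by a string buffer so that it runs on a PC. On the target, the table object would be pinned to a fixed address by the linker script.

```cpp
#include <string>

// Host-side stand-ins for firmware services in FLASH ROM (hypothetical).
static std::string g_uart_log;
static void fw_uart_send_string(const char* str) { g_uart_log += str; }
static const char* fw_uart_get_string() { return "typed line\r\n"; }

// Firmware and app only have to agree on the order of the entries,
// not on the addresses of the functions behind them.
struct BiosTable {
    void (*uart_send_string)(const char* str);
    const char* (*uart_get_string)();
};

// The firmware fills in the table with absolute function addresses.
static const BiosTable g_bios = { &fw_uart_send_string, &fw_uart_get_string };

// "App" code: it knows the table layout, never the firmware functions.
int app_main(const BiosTable* bios) {
    bios->uart_send_string("Hallo from SDRAM\r\n");
    bios->uart_send_string(bios->uart_get_string());
    return 0;
}
```

Because every call goes through a pointer, the distance between the app in SDRAM and the firmware in FLASH ROM does not matter, and the firmware can be rebuilt without relinking the app as long as the table layout and its address stay the same.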

Cool: next step: write a real SDRAM application and load it from QSPI or SDCARD.

