Data caching for multicore shared data

henksb · October 26, 2022, 7:34pm

I run data acquisition on CM4 (writes the data) and data processing on CM7 (reads the data). CM7 uses data caching, CM4 does not. For efficiency reasons I just pass a pointer from CM4 to CM7 using RPC (rather than the entire array contents). I am willing to put up with manual CM7 data cache updating where needed.

I noticed (as expected) that I have to update the cache for the RAM data on CM7 in order to properly read the data that CM4 has updated in memory. It looks like I don't have to do the same for SDRAM-allocated data. Is that correct? In other words, is the SDRAM non-cached?

What is the best strategy for this? Declaring RAM variables volatile may work for basic data types, but I get errors when applying that to classes and structs. Alternatively, I can declare RAM variables with __attribute__((aligned(32))) in order to call SCB_InvalidateDCache_by_Addr() on 32-byte cache lines. This is my plan, and I would like to know if that should work.
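A minimal sketch of the aligned declaration described above. It uses the standard C++ alignas specifier rather than the GCC attribute, and SampleBlock with its fields is a hypothetical example type, not something from the Portenta libraries.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 32;  // Cortex-M7 D-cache line size in bytes

// Hypothetical shared record; the field names are illustrative only.
struct alignas(kCacheLine) SampleBlock {
    std::uint32_t sequence;     // e.g. incremented by the writer (CM4)
    std::uint16_t samples[64];  // acquisition data
};

// sizeof is always a multiple of alignof, so aligning the type also pads it
// to whole cache lines: 4 + 128 = 132 bytes of payload becomes 160 bytes.
static_assert(alignof(SampleBlock) == kCacheLine, "must start on a cache line");
static_assert(sizeof(SampleBlock) % kCacheLine == 0, "must fill whole cache lines");

SampleBlock g_block;  // placed on a 32-byte boundary by the toolchain
```

Putting the alignment on the type rather than on each variable means every instance gets both the aligned start and the padded size.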

Any suggestions or references/links are welcome. Right now, I am finding things out by trial and error.

tjaekel · October 27, 2022, 12:31am

You are on the right track: "Cache Coherency".
CM7 has caches, CM4 does not. CM4 behaves like any other bus master (e.g. a DMA engine) and reads/writes memory directly. But CM7 sees the memory content through its caches (and is not informed when the real memory has been changed by the other core).

So, yes, you should use cache maintenance operations on the CM7 side: invalidate the cache before reading, but also clean the cache after writing, so that CM4 can see the updated content.

Best practice is to align the data buffers to the cache line size (32 bytes). But it also works without alignment: just set the address for the cache invalidate to the start of the cache line and make the length long enough to cover the entire buffer (I think the function itself aligns to the cache line size and covers the entire length).
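The rounding described here can be sketched as follows. This is only an illustration of the arithmetic: cover_with_lines, line_floor, line_ceil and LineRange are hypothetical names, and whether the CMSIS function already does this rounding internally may depend on the CMSIS version, so doing it explicitly is the cautious option.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kCacheLine = 32;  // Cortex-M7 D-cache line size

struct LineRange {
    std::uintptr_t start;   // address of the first cache line touched
    std::size_t    length;  // bytes; a multiple of kCacheLine
};

// Round an address down / up to a cache-line boundary.
constexpr std::uintptr_t line_floor(std::uintptr_t a) {
    return a & ~(kCacheLine - 1);
}
constexpr std::uintptr_t line_ceil(std::uintptr_t a) {
    return (a + kCacheLine - 1) & ~(kCacheLine - 1);
}

// Expand [addr, addr + size) outward so it covers only whole cache lines.
constexpr LineRange cover_with_lines(std::uintptr_t addr, std::size_t size) {
    return LineRange{
        line_floor(addr),
        static_cast<std::size_t>(line_ceil(addr + size) - line_floor(addr))};
}
```

For example, a 40-byte buffer at 0x24000010 expands to 64 bytes starting at 0x24000000. The extra bytes in those two lines belong to whatever else the linker placed there, which is why aligning the buffer is still the safer choice.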

Whether SDRAM is cached or not depends on how the FW (Arduino code) sets it up. It can be cached by default (this depends on whether the external memory address region is handled as "strongly ordered", "device" or "cached"). I think I saw that the default is cached (maybe "write-through" for the SDRAM region).

The most important things to know: is the MPU configured? (Potentially it is!) Is the external SDRAM address region configured as "cache-able"?
Also: is it cache-able but "write-through"? If so, CM4 could see what CM7 has written, even without cleaning the cache. But CM7 still cannot see what CM4 has written.

So, the MPU config in effect (and how the regions are defined) is the most important thing to know.
If you do not know it: assume you always have to do cache maintenance (on the CM7 side, invalidate and clean the cache).

BTW: the same applies to other memories (internal SRAM). Their cache behavior can also be controlled by the MPU.

It is hard to say without info on how Arduino/mbed initializes the MPU. I think the memory region used for the RPC itself is un-cached (for sure, but all other memories...?).
You can also define another memory region as un-cached. How to do the MPU config in Arduino/mbed: no idea.

When not sure, just do cache maintenance (on CM7) all the time. It will add a bit of overhead ("unpredictable" latency) at runtime.
Thinking about "cache coherency" is always mandatory in "multi-core" programming (CM7 and CM4, CM7 with DMAs ...).

henksb · October 27, 2022, 4:36pm

Thanks T, great that you're still here helping out.

I found some comments in SDRAMClass::begin() suggesting that the SDRAM is indeed cached. No surprise, because it appears to be pretty fast. If I did not notice the caching, it must have been pure luck. So, I will treat it as cached memory like the RAM.

I had some trouble with it: putting the begin() call after the RCC init makes it not work. I now put it as the first statement in setup(), and no longer keep the SDRAMClass as a member of some class that gets initialized later on. Other than that, I am very happy with the SDRAM. I map my objects to it.

Reading up on the cache terminology I see that if CM4 writes to cached memory I need to invalidate it on CM7 in order for it to read it, and when CM7 writes to cached memory I must clean it in order for CM4 to read it. This is helpful.
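The two directions can be captured in a small wrapper, sketched below. SharedBufferView and CacheOp are hypothetical names. The cache operations are passed in as callbacks so the sketch also builds off-target; on the CM7 they would be thin wrappers around SCB_InvalidateDCache_by_Addr() and SCB_CleanDCache_by_Addr(), whose exact parameter types may differ between CMSIS versions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical callback type standing in for the CMSIS cache calls.
using CacheOp = void (*)(void* addr, std::int32_t size);

struct SharedBufferView {
    void*        data;        // assumed 32-byte aligned
    std::int32_t size;        // bytes, assumed to be a multiple of 32
    CacheOp      invalidate;  // on target: wraps SCB_InvalidateDCache_by_Addr
    CacheOp      clean;       // on target: wraps SCB_CleanDCache_by_Addr

    // Call on CM7 before reading data that CM4 has written:
    // drops CM7's cached copy so the next read comes from memory.
    void before_read() const {
        assert(data != nullptr);
        invalidate(data, size);
    }

    // Call on CM7 after writing data that CM4 will read:
    // pushes CM7's cached writes out to memory.
    void after_write() const {
        assert(data != nullptr);
        clean(data, size);
    }
};
```

Naming the calls after the access they guard (before_read, after_write) makes it harder to mix up invalidate and clean at the call site.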

I looked at the cache invalidation code in core_cm7.h and understand that the data does not have to be 32-byte aligned in order to use the cache functions. Just a bit more efficient perhaps if it is.

Making a region non-cached will not work well for me because I need to look at the data periodically, not all the time. That strategy would lead to more overhead.

I have some experience with cache invalidating from using the DMA-based dual triggered ADC feature, where I picked up cache invalidation from an STM32 example. For the case of DMA I can see why it must be invalidated: the cache still holds stale data for that memory from the last time around when the data was read.

henksb · October 27, 2022, 7:04pm

Actually, if the data is not exactly 32-byte aligned and/or does not occupy a multiple of 32 bytes, it is unsafe to look at it through pointers. Cache invalidation/cleaning works on whole cache lines, so it has side effects depending on where the compiler decides to place other variables: anything nearby might inadvertently undergo the wrong type of cache operation. So, I will use RPC for individual variables and pointers for arrays, while making sure the array data is 32-byte aligned and occupies a multiple of 32 bytes.
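A rough sketch of a check for that rule, for use on a pointer and size before doing cache maintenance on them. The names owns_whole_lines and padded_size are hypothetical helpers, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kCacheLine = 32;  // Cortex-M7 D-cache line size

// True when [addr, addr + size) starts and ends on cache-line boundaries,
// so cache maintenance on it cannot touch a neighbouring variable.
constexpr bool owns_whole_lines(std::uintptr_t addr, std::size_t size) {
    return size != 0 && addr % kCacheLine == 0 && size % kCacheLine == 0;
}

// Round a payload size up to the next multiple of the cache line,
// e.g. for sizing a shared array.
constexpr std::size_t padded_size(std::size_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
}
```

The hazard it guards against: if a neighbouring variable shares a line with the buffer and CM7 has written to it without cleaning, invalidating that line discards the write.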

henksb · October 27, 2022, 11:59pm

I implemented what I stated earlier, and it seems to work fine.

This is where I would say that working in the Portenta environment pays off. It gives me a bootRom, SDRAM, multicore with RPC, a COM port, and integrates with the STM32 APIs (with some bumps here and there). I also use a second COM port (UART1) for the CM4 that I stream through a Teensy to my PC. I debug in STM32CubeIDE through a makefile project that uses the Arduino build commands. Pretty happy now.
