Advice on distributing tasks between M7 and M4

I have a remote vehicle project I am working on and would like some advice on how to distribute the tasks between the two cores. Mainly, there are a few breakout boards I will be using that communicate over UART. My thought was that this communication would slow down the higher-level tasks, so I might want to run it on the M4 and keep most of my logical decisions and output functions on the M7 core. Here is a rough list of all the tasks that will take place and my first crack at which core I think each should run on. Let me know what you think!

M7 CORE (480MHZ)

  1. LOGICAL CODE FOR MODE SWITCHES
    a. PASSING INSTRUCTIONS TO M4 FOR POWERING ON/OFF DEVICES
  2. COMMANDING MOTOR
  3. COMMANDING RUDDER
  4. READ SENSORS (I2C)
    a. POWER
    b. TIME
    c. LUMENS
  5. READ HEADING FROM M4
  6. SEND ANY MAJOR ERRORS TO M4 FOR SATCOMM

M4 CORE (240MHZ)

  1. READ ORIENTATION SENSOR (UART)
  2. READ GPS LOCATION (UART)
  3. CALCULATE HEADING
  4. SAT COMMS (UART)
  5. PASS ANY ERRORS BACK TO M7 FOR LOGIC CHECKS
  1. Am I thinking correctly that it makes sense to run the UART communication with these other devices on the lower-speed core and then pass the info to the M7 when it's available? This could allow the M7 logic calculations and hardware outputs to run at a higher speed.

  2. Are there limitations on which UARTs the M4 core has access to? It appears I'll need to use 3, so I'll probably need the Portenta breakout to access all the pinouts.

  3. I have not done anything on the M4 core yet, and I have read that the M4 doesn't communicate directly with the USB serial port. Is there another way to access it, or do I need to figure out RPC calls to pass the info back and forth? I need to learn RPC regardless to get info from the M4 to the M7 and vice versa, but I am thinking about how to debug M4 code once I push it there.

  4. I see a lot of people using #ifdef checks on the core (CORE_CM7 / CORE_CM4) and uploading the exact same sketch to both cores. Is this necessary? Can I upload individual sketches to each core? I won't really have any common code segments that need to be shared where I might want to program this way. Perhaps I am thinking about this wrong, but what is the benefit of programming this way?
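For readers unfamiliar with the pattern in question 4, here is a minimal sketch of what a single source file guarded per core tends to look like. It assumes the CORE_CM7 and CORE_CM4 macros that the Portenta build defines for the selected target core, and the bootM4() call from the Arduino Portenta core. The task comments are placeholders taken from the lists above, and the #else branch is a hypothetical host fallback added only so the file compiles off-target.

```cpp
#include <cassert>
#include <string>

#if defined(CORE_CM7)
// Compiled only when the sketch is built for the M7.
const char* const kCoreName = "CM7";
void setup() {
    bootM4();  // Arduino Portenta core call that starts the M4
    // M7-only initialisation: mode logic, motor, rudder, I2C sensors
}
void loop() {
    // M7-only work
}
#elif defined(CORE_CM4)
// Compiled only when the sketch is built for the M4.
const char* const kCoreName = "CM4";
void setup() {
    // M4-only initialisation: GPS and satcom UARTs
}
void loop() {
    // M4-only work
}
#else
// Hypothetical host fallback, not part of a real sketch.
const char* const kCoreName = "host";
void setup() {}
void loop() {}
#endif
```

Each build still produces a separate binary for one core; the guards only let both programs live in one file, which mostly helps when the two sides share message definitions.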

Thanks for any help everyone! Portenta seems like a really cool product and I'm having a blast so far.

*Clarification: I have the H7 Lite.

The baud rate is shared between the two boards, so data comes in only as fast as it leaves. You'll need to get the data to the faster MCU before it can do its work. But then you are right, it will do that work faster (unless there are some other constraints).
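To put rough numbers on that, here is a small sketch of the wire-time arithmetic. It assumes 8N1 framing (10 bits on the wire per byte); the function name, the 82-byte message size and the baud rates in the usage note are illustrative assumptions, not values from this thread.

```cpp
#include <cassert>
#include <cstdint>

// Microseconds needed to move `bytes` over a UART running at `baud`.
// Assumes 8N1 framing: 1 start bit + 8 data bits + 1 stop bit per byte.
inline std::uint64_t uartTransferMicros(std::uint32_t bytes, std::uint32_t baud) {
    assert(baud > 0);  // a zero baud rate would divide by zero
    const std::uint64_t bitsOnWire = static_cast<std::uint64_t>(bytes) * 10u;
    return (bitsOnWire * 1000000u) / baud;  // truncates to whole microseconds
}
```

For example, uartTransferMicros(82, 9600) gives 85416 us (about 85 ms), while the same 82 bytes at 115200 baud take 7118 us. Neither core can shorten that time; a faster core only shortens what happens after the bytes have arrived.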

Thank you very much for the input! I am working on creating some ways to test how the latency might look with RPC involved. The base code I currently have communicating with the orientation device loops in around 4ms. I still need to add the GPS, and I'm not too worried about added latency on the SatComm, as that will only communicate at very spaced-out time intervals.
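One possible way to collect such latency numbers is a small min/mean/max tracker fed with microsecond samples. This is only a sketch: LatencyStats and elapsedUs are hypothetical names, and it assumes the timestamps come from a wrapping 32-bit microsecond counter such as Arduino's micros().

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Running min / max / mean over latency samples given in microseconds.
struct LatencyStats {
    std::uint32_t count = 0;
    std::uint32_t minUs = std::numeric_limits<std::uint32_t>::max();
    std::uint32_t maxUs = 0;
    std::uint64_t sumUs = 0;

    void add(std::uint32_t sampleUs) {
        ++count;
        if (sampleUs < minUs) minUs = sampleUs;
        if (sampleUs > maxUs) maxUs = sampleUs;
        sumUs += sampleUs;
    }

    // Integer mean; requires at least one sample.
    std::uint32_t meanUs() const {
        assert(count > 0);
        return static_cast<std::uint32_t>(sumUs / count);
    }
};

// Elapsed time between two readings of a wrapping 32-bit microsecond counter.
// Unsigned subtraction stays correct across a single counter wrap.
inline std::uint32_t elapsedUs(std::uint32_t startUs, std::uint32_t endUs) {
    return endUs - startUs;
}
```

Wrapping one RPC round trip in two timestamp reads and passing elapsedUs(start, end) to add() would show whether the worst case, not just the average, fits inside the loop budget.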

I have read some concerns and ambiguity around how RPC calls and reads are currently handled by the Portenta, so I will need to take great care in measuring this (some users stated that a 50ms delay is required before reads, or system crashes occur). Otherwise, using the second core might be pointless and I may need to just use a single core.

If anyone has good resources on RPC, I would love to see whatever anyone has used. I have read what Google has to offer and can barely find an example where people aren't having some interesting issues implementing it without red-lighting the device.

If you have more than two boards, using UART can become complicated. Take a look at CAN; it would be much faster and works over 1000 feet. No additional power supplies are needed. There is an MCP2515 board available at minimal cost, and they work great. CAN is a 2-wire interface; together with power and ground, that keeps you at a relatively common 4-conductor cable.

The Portenta has 4 UARTs, doesn't it? My notes above are also mistaken: my orientation sensor is I2C, so I will be using exactly 2 UARTs. This is an ROV project, so wiring is maybe 18".

Hi @hanslanda, I am really interested in your aim. I was searching for the same solution. Did you figure it out? I would like to do some calculations on the M4, then see the results and change the switches on the M7. I would like to test whether I can make the computation faster than using only 1 core.

My very general thinking about how to do a "load sharing" between CM4 and CM7 in Portenta H7:

  1. CM4 is a Co-Processor:
    If you check the datasheet, the CM4 does not have all the capabilities of the CM7. It can happen that a specific device is only accessible by the CM7 (e.g. SDMMC1).
    Also, the CM4 is much slower:
    it has no caches, no DTCM or ITCM, and runs at a slower core clock.
    So very fast code should run on the CM7 side.

  2. Inter-Processor-Communication:
    CM7 and CM4 can "see" each other. There is a (HW based) Semaphore mechanism in place to "trigger and sync" both CPUs.
    It should not be so dramatic in terms of speed (the CM4 is slower anyway). The overhead is small if you just send a very short message with a pointer into a "shared memory" and pass all the data via that memory, so it should be pretty fast.

BUT: "cache coherency" could add "penalties":
If you share data via "shared memory", the CM4 goes directly to memory, while the CM7 has caches. The CM7 has to "clean" its cache so that the CM4 sees what the CM7 has written, or "invalidate" its cache to see what the CM4 has written. These cache maintenance calls cost time, so the shared memory needed between CM4 and CM7 can slow down the CM7.
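One practical detail behind those cache calls: the Cortex-M7 data cache works in 32-byte lines, and the CMSIS routines SCB_CleanDCache_by_Addr and SCB_InvalidateDCache_by_Addr act on whole lines. A minimal host-side sketch of the alignment arithmetic is below; cacheLineRange and SharedBlock are hypothetical names, and the addresses in the usage note are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kCacheLine = 32;  // Cortex-M7 D-cache line size in bytes

struct CacheRange {
    std::uintptr_t start;  // rounded down to a cache-line boundary
    std::size_t size;      // rounded up to a whole number of lines
};

// Smallest line-aligned range that covers [addr, addr + len).
// On target, this range is what a clean (before the CM4 reads) or an
// invalidate (before the CM7 reads) would actually touch.
inline CacheRange cacheLineRange(std::uintptr_t addr, std::size_t len) {
    assert(len > 0);
    const std::uintptr_t mask = ~(kCacheLine - 1);
    const std::uintptr_t start = addr & mask;
    const std::uintptr_t end = (addr + len + kCacheLine - 1) & mask;
    return CacheRange{start, static_cast<std::size_t>(end - start)};
}

// A shared buffer should own its cache lines outright; otherwise an
// invalidate can throw away unrelated CM7 data that shares a line with it.
struct alignas(32) SharedBlock {
    std::uint8_t bytes[64];
};
static_assert(sizeof(SharedBlock) % 32 == 0, "block must fill whole cache lines");
```

For example, a 40-byte buffer starting at 0x30000010 spans the range starting at 0x30000000 with size 64, i.e. two full lines. An alternative that avoids the calls entirely is to mark the shared region non-cacheable via the MPU, at the cost of slower CM7 access to that region.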

  3. Concurrent processing
    There is one "delicate" issue: both CPUs use the same flash ROM and the same RAM. If they access the same memory at the same time, the bus fabric has to "manage" the concurrent access: only one CPU can read/write while the other has to wait.
    So it would be wise to assign dedicated memories (Flash ROM locations, RAM locations) so that they do not block each other all the time.
    The Flash ROM is divided into two "segments" and the SRAM is split into different "segments": give every CPU its own code (Flash) and data (RAM) location.

It is a tough decision how to "share the load":

  1. Should CM7 run the RTOS?
    OK, it could do a faster context switch than the CM4 and give more "real-time" behavior.
  2. Should CM4 do all the INTs?
    That would give a slower INT response ("real time"), esp. when it has to "inform" the CM7 running the RTOS that an INT occurred.

My personal assumption would be:

  • place all not so timing-sensitive stuff on CM4
  • let the CM7 do the very high speed stuff, potentially also the RTOS
  • CM4 can do "slow stuff" in background, e.g. UART, command decoding, INTs which do not come so fast and not so often
  • CM4 could also use the FPU (e.g. for math, ARM DSP): the CM4 could offer an API for math operations (backed by the FPU)
    --> actually, I am not sure if the FPU is bound to a core, e.g. HW FPU just on CM7

But in general:

  • CM7 is for very fast speed, very short response time, esp. due to caches, DTCM, ITCM
  • CM4 for a bit more relaxed stuff, not so "real-time" demanding, not so often activated, no need to be really accurately fast
  • where to place the RTOS depends on "hard real-time" requirements (context switching/thread activation)
  • and when you run an RTOS on both sides, the RPC mechanism (via HW Semaphores) has to be extended to also send SW semaphores back and forth between the RTOSes (I would just run one RTOS, otherwise additional "wrappers" are needed).

And: if you add power management, e.g. to bring some devices into STOP mode, and the CM4 powers down a device that the CM7 wants to use, you have to "tell" the CM7 about the device state.
The result is a "Not-Symmetrical Multi-Processor" architecture (often, multi-core systems are SMP = Symmetrical Multi-Processor). Therefore, I would not consider the cores as "identical" (symmetrical), but more like "main and co-processor".