Why is multi-tasking on ESP32 dual-cores not faster than single-tasking on one core?

Hi there,

I'm building a thermography camera that uses an MLX90640, a Seeed Studio XIAO ESP32S3, and a 2.4" TFT LCD with a touch sensor and an SD card interface.

Here is a block diagram:

1st progress

Two functions are running on the ESP32.

ProcessInput() acquires the thermal image from the MLX90640 via I2C.

The other, ProcessOutput(), interpolates the thermal image to a higher resolution and displays it in color.

These functions run sequentially in loop() on ESP32 core 1, and everything works fine, as follows:

2nd progress

As a next step, to speed up the frame rate, I rewrote the program so that ProcessInput() runs on core 1 as Task1 and ProcessOutput() runs on core 0 as Task2. The two tasks are connected by double buffers and a synchronizing handshake.

The handshake between the two tasks uses a message queue and a counting semaphore. The expected effect is shown in the following timing diagram:


(Edit: Added "Core 0" and "Core 1" to each time line.)
(Edit: If my diagram is hard to follow, please refer to Multiple buffering - Wikipedia)

I think it will be easier to understand if you see my actual program, which I show at the end.

Result

When running on multiple cores, I found that the processing time of ProcessInput() increased, resulting in no improvement in frame rate at all :face_with_raised_eyebrow:

| | ProcessInput() [ms] | ProcessOutput() [ms] | Frame rate [Hz] |
|---|---|---|---|
| Single-core sequential processing | 74 | 52 | 7.9 |
| Multi-core parallel processing | 127 | 53 | 7.9 |

Note that 7.9 ≒ 1000 / (74 + 52) for single-core processing, and 7.9 ≒ 1000 / 127 for multi-core processing.

Questions

I suspect that I2C and SPI share one APB (Advanced Peripheral Bus) and there is a conflict, which makes the I2C slower. This is based on ESP32 Technical Reference Manual, or the following thread:

So my 1st question is the one in the title: is my suspicion correct, or are there other reasons?

My 2nd question is whether there is a way to improve the frame rate in my camera system.

And my last question is whether my multi-tasking programming is good or bad.

I'm hoping for advice from someone more knowledgeable.
Thanks in advance :bowing_man:


A simple sketch for observing handshaking and processing time in multi-tasking.

SyncTasks.ino

The ENA_MULTITASKING setting switches between single tasking and multitasking.

#include <Arduino.h>

/*=============================================================
 * Step 1: Select whether to multitask or not
 *=============================================================*/
#define ENA_MULTITASKING  false

/*=============================================================
 * Step 2: Configure expected processing time
 *=============================================================*/
#define RANDOMIZE false

#if RANDOMIZE
#define PROCESS(x) delay(random(x))
#else
#define PROCESS(x) delay(x)
#endif

#define PROCESSING_TIME_INPUT   1000
#define PROCESSING_TIME_OUTPUT  2000

// Function prototype defined in multitasking.cpp
void task_setup(void (*task1)(uint8_t), void (*task2)(uint8_t, uint32_t, uint32_t));

void ProcessInput(uint8_t bank) {
  PROCESS(PROCESSING_TIME_INPUT);
}

void ProcessOutput(uint8_t bank, uint32_t inputStart, uint32_t inputFinish) {
  static uint32_t prevFinish;
  uint32_t outputStart = millis();

  PROCESS(PROCESSING_TIME_OUTPUT);

  uint32_t outputFinish = millis();

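  // uint32_t may be unsigned long on this toolchain, so cast and use %lu
  Serial.printf("Input:  %lu\nOutput: %lu\nCycle:  %lu\n",
    (unsigned long)(inputFinish  - inputStart ),
    (unsigned long)(outputFinish - outputStart),
    (unsigned long)(outputFinish - prevFinish )
  );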

  prevFinish = outputFinish;
}

void setup() {
  Serial.begin(115200);

  // Start tasks
#if ENA_MULTITASKING
  task_setup(ProcessInput, ProcessOutput);  // prototype is already declared above
#endif
}

void loop() {
#if ENA_MULTITASKING
  delay(1000);
#else
  uint32_t inputStart = millis();
  ProcessInput(0);
  ProcessOutput(0, inputStart, millis());
#endif
}

multitasking.cpp

In multitasking.cpp, not only the bank number processed by ProcessInput() but also the processing time is stored in the message queue and passed to ProcessOutput().

Then, ProcessOutput() monitors each processing time and frame rate.
Also, if you set each task to run on the same core, it will behave the same as single tasking.

#include <Arduino.h>

#define TASK1_CORE      1
#define TASK2_CORE      0

#define TASK1_PRIORITY  2
#define TASK2_PRIORITY  1

// Message queue sent from task 1 to task 2
typedef struct {
  uint8_t   bank;   // Exclusive bank numbers for Task 1 and Task 2
  uint32_t  start;  // Task 1 start time
  uint32_t  finish; // Task 1 Finish Time
} MessageQueue_t;

// Define two tasks on the core
void Task1(void *pvParameters);
void Task2(void *pvParameters);

// Define pointers to the tasks
static void (*Process1)(uint8_t bank);
static void (*Process2)(uint8_t bank, uint32_t start, uint32_t finish);

// Message queues and semaphores for handshaking
static TaskHandle_t taskHandle[2];
static QueueHandle_t queHandle;
static SemaphoreHandle_t semHandle;

#define HALT()  { for(;;) delay(1000); }

// The setup function runs once when press reset or power on the board
void task_setup(void (*task1)(uint8_t), void (*task2)(uint8_t, uint32_t, uint32_t)) {
  // Pointers to the tasks to be executed.
  Process1 = task1;
  Process2 = task2;

  // To process tasks in parallel, the semaphore must have an initial count of 1
  semHandle = xSemaphoreCreateCounting(1, TASK1_CORE != TASK2_CORE ? 1 : 0);
  queHandle = xQueueCreate(1, sizeof(MessageQueue_t));

  // Check if the queue or the semaphore was successfully created
  if (queHandle == NULL || semHandle == NULL) {
    Serial.println("Can't create queue or semaphore.");
    HALT();
  }

  // Set up sender task in core 1 and start immediately
  xTaskCreatePinnedToCore(
    Task1, "Task1",
    8192,           // The stack size
    NULL,           // Pass reference to a variable describing the task number
    TASK1_PRIORITY, // priority
    &taskHandle[0], // Pass reference to task handle
    TASK1_CORE
  );

  // Set up receiver task on core 0 and start immediately
  xTaskCreatePinnedToCore(
    Task2, "Task2",
    8192,           // The stack size
    NULL,           // Pass reference to a variable describing the task number
    TASK2_PRIORITY, // priority
    &taskHandle[1], // Pass reference to task handle
    TASK2_CORE
  );
}

/*--------------------------------------------------*/
/*------------------- Handshake --------------------*/
/*--------------------------------------------------*/
uint8_t SendQueue(uint8_t bank, uint32_t start, uint32_t finish) {
  MessageQueue_t queue = {
    bank, start, finish
  };

  if (xQueueSend(queHandle, &queue, portMAX_DELAY) == pdTRUE) {
//  Serial.println("Give queue: " + String(queue.bank));
  } else {
    Serial.println("unable to send queue");
  }

  return !bank;
}

MessageQueue_t ReceiveQueue() {
  MessageQueue_t queue;

  if (xQueueReceive(queHandle, &queue, portMAX_DELAY) == pdTRUE) {
//  Serial.println("Take queue: " + String(queue.bank));
  } else {
    Serial.println("Unable to receive queue.");
  }

  return queue;
}

void TakeSemaphore(void) {
  if (xSemaphoreTake(semHandle, portMAX_DELAY) == pdTRUE) {
//  Serial.println("Take semaphore.");
  } else {
    Serial.println("Unable to take semaphore.");
  }
}

void GiveSemaphore(void) {
  if (xSemaphoreGive(semHandle) == pdTRUE) {
//  Serial.println("Give semaphore.");
  } else {
    Serial.println("Unable to give semaphore.");
  }
}

/*--------------------------------------------------*/
/*--------------------- Tasks ----------------------*/
/*--------------------------------------------------*/
void Task1(void *pvParameters) {
  uint8_t bank = 0;

  while (true) {
    uint32_t start = millis();

    // some process
    Process1(bank);

//  Serial.println(millis() - start);

    bank = SendQueue(bank, start, millis());

    TakeSemaphore();
  }
}

void Task2(void *pvParameters) {
  while (true) {
    MessageQueue_t queue = ReceiveQueue();

    GiveSemaphore();

    // some process
    Process2(queue.bank, queue.start, queue.finish);
  }
}

Thanks for reading this long post.

I forgot to write some detailed information.

My testbed environment

Related resources

What is the timing of ProcessInput() in the multicore setup when the output task is not running?

If the input process is blocked by the output process the time should be near 75 ms again.
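
A measurement like this could be collected with a small accumulator such as the one below. It is a sketch, not part of the original program: TimingStats and elapsed are hypothetical names, and on the ESP32 the samples would come from micros() or millis() readings taken around the ProcessInput() call while Task2 is not created.

```cpp
#include <cstdint>
#include <limits>

// Accumulates min / max / mean of elapsed-time samples in microseconds.
struct TimingStats {
  uint32_t count = 0;
  uint32_t minUs = std::numeric_limits<uint32_t>::max();
  uint32_t maxUs = 0;
  uint64_t sumUs = 0;

  void add(uint32_t elapsedUs) {
    ++count;
    if (elapsedUs < minUs) minUs = elapsedUs;
    if (elapsedUs > maxUs) maxUs = elapsedUs;
    sumUs += elapsedUs;
  }

  // Mean of all samples, or 0 when nothing has been recorded yet.
  double meanUs() const {
    return count ? static_cast<double>(sumUs) / count : 0.0;
  }
};

// Elapsed ticks between two readings of a free-running 32-bit counter.
// Unsigned subtraction stays correct across a single counter wrap.
uint32_t elapsed(uint32_t startTick, uint32_t endTick) {
  return endTick - startTick;
}
```

Comparing the mean with and without the output task running would show whether ProcessInput() itself is slower in the task context, or only slower while ProcessOutput() is active.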

If you have two cores both trying to access the same output pins then it will take just as long as one core accessing those pins sequentially.

You will only see a speed up if each core accesses different hardware resources.

Hi @robtillaart ,
Thanks for replying.

In task_setup(), I created Task1 for ProcessInput() and Task2 for ProcessOutput() like this:

// The setup function runs once when press reset or power on the board
void task_setup(void (*task1)(uint8_t), void (*task2)(uint8_t, uint32_t, uint32_t)) {
  // Pointers to the tasks to be executed.
  Process1 = task1;
  Process2 = task2;

  // To process tasks in parallel, the semaphore must have an initial count of 1
  semHandle = xSemaphoreCreateCounting(1, TASK1_CORE != TASK2_CORE ? 1 : 0);
  queHandle = xQueueCreate(1, sizeof(MessageQueue_t));

  // Check if the queue or the semaphore was successfully created
  if (queHandle == NULL || semHandle == NULL) {
    Serial.println("Can't create queue or semaphore.");
    HALT();
  }

  // Set up sender task in core 1 and start immediately
  xTaskCreatePinnedToCore(
    Task1, "Task1",
    8192,           // The stack size
    NULL,           // Pass reference to a variable describing the task number
    TASK1_PRIORITY, // priority
    &taskHandle[0], // Pass reference to task handle
    TASK1_CORE
  );

  // Set up receiver task on core 0 and start immediately
  xTaskCreatePinnedToCore(
    Task2, "Task2",
    8192,           // The stack size
    NULL,           // Pass reference to a variable describing the task number
    TASK2_PRIORITY, // priority
    &taskHandle[1], // Pass reference to task handle
    TASK2_CORE
  );
}

So I think Task1 is activated immediately after the first xTaskCreatePinnedToCore() call. Task1 then executes ProcessInput() as follows:

void Task1(void *pvParameters) {
  uint8_t bank = 0;

  while (true) {
    uint32_t start = millis();

    // some process
    Process1(bank); // <-- **ProcessInput()**

//  Serial.println(millis() - start);

    bank = SendQueue(bank, start, millis());

    TakeSemaphore();
  }
}

So ProcessInput() must run first, and then ProcessOutput() runs. I observed this with my simple sketch before combining it with other stuff such as the MLX90640.

Hi @Grumpy_Mike ,
Thanks for your advice.

Yeah, I see! The two cores do not access the same output pins, but they do access the same internal memory (in PSRAM?).

ProcessInput() on core 1 always accesses I2C to get the thermal image and writes it to the internal memory (which consists of two banks), and then ProcessOutput() reads the internal memory to interpolate the thermal image and outputs it to the LCD through SPI.

Therefore, one possible cause is the competition on the internal memory, which will not be improved by giving task 1 a higher priority than task 2.

I wonder: if core 0 and core 1 compete on the memory bus, which one has priority?

I tried swapping the cores for ProcessInput() and ProcessOutput(), but the results were the same.

Why is there the number 4 near the hatch mark of the I2C Bus -- when the I2C is a two-wire Bus?

The ESP32S3 has a shared data bus through which both cores access the same RAM. While both cores have independent instruction caches and their own local registers, they share access to the system's memory, including both internal and external RAM. When the two cores try to read from or write to RAM at the same time, the accesses must be serialized. This means both cores can experience contention when accessing memory, and the caching mechanism can also become a bottleneck.

there is info in the graphic

image

OP counted the power wires.

Hi @GolamMostafa ,
Thanks for checking the diagram.

I intended SDA, SCL, Vin and GND.

Sorry if this is confusing.

You need to put semaphores around all shared resources (HW and in memory etc).
And yes, multicore does not always result in faster execution.
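
The idea of guarding a shared resource can be sketched in portable C++ as follows. This is an illustration only: it uses std::mutex and std::thread so it can run on a desktop, and SharedCounter and runTwoWorkers are hypothetical names. On the ESP32 under FreeRTOS, the corresponding calls would be xSemaphoreCreateMutex(), xSemaphoreTake(), and xSemaphoreGive() around each access.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared resource: a counter standing in for a bus or a buffer.
struct SharedCounter {
  std::mutex lock;
  long value = 0;

  void increment() {
    // The guard holds the mutex until the end of this scope, so only one
    // thread at a time can modify `value`.
    std::lock_guard<std::mutex> guard(lock);
    ++value;
  }
};

// Runs two threads that each increment the same counter `perThread` times
// and returns the final value.
long runTwoWorkers(long perThread) {
  SharedCounter counter;
  auto work = [&counter, perThread] {
    for (long i = 0; i < perThread; ++i) counter.increment();
  };
  std::thread a(work);
  std::thread b(work);
  a.join();
  b.join();
  return counter.value;
}
```

Because every increment happens under the lock, the two workers never lose an update. The same lock also serializes them, which is why guarded access to one shared resource does not get faster with a second core.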

Fortunately, the ESP32 can do I2C on other pins, so one can create an I2C bus per device (at least up to a certain point).

Do we consider the power lines (Vcc and GND) as the signal wires of the I2C/Two-Wire Bus?

That's right. Thank you!

that's not the topic of this conversation... Really I'm not a hardware guy, so I don't care much about any conventions on such high level drawings... I take that for what they mean, 4 wires going from the ESP32 to the module...

@J-M-L and @GolamMostafa ,

I'm not good at hardware, so next time I'll count and write only the signal lines.

The intent was good, and you documented that you power your external modules directly from the ESP (there is a limit on how much current you can draw from the 3.3V pin, maybe around 500 mA, so don't draw too much).

Taking all your advice together, that seems to be true.

I2C slave <---> I2C master on ESP32 <---> Memory bus on ESP32 <---> Core1

The I2C slave and master pins are probably independent of other pins on the ESP32.

No matter which peripheral a core accesses, the access goes through the memory bus, so I think conflicts always have to be resolved inside the ESP32. Am I correct?

This is something that beginners, including myself, are always advised to do on this forum.

The XIAO has a battery management IC on board, so the prototype camera looks like the one in the photo.

I'm not really sure how to read the spec of a Li-Po battery, so I'll check it again, including the current consumption of the MLX90640 and the LCD, along with the current supply capability of the XIAO.

Thanks!

It looks like the compiler or something internal does that anyway. Otherwise the signals would be a mess and nothing would work.

I struggled a lot with how to correctly understand the startup timing and operating status of each task.

It was difficult to check correctly with buffered IO such as serial output.

It would have been nice to observe it with an oscilloscope, but the XIAO is my only ESP32 board and all of its pins are in use, so this was not possible.

If I have the opportunity, I would like to try measuring it on another ESP32 board.

Thanks.