Is Raspberry Pi Pico micros() broken?

I have some code that measures the time it takes a microcontroller to perform 1M integer adds. I record micros() before and after the loop. It works on all the microcontrollers I have, including ESP32, ESP32-S2 and S3, STM32F1, STM32F4 and Milk-V Duo.

On the Raspberry Pi Pico, the difference between the last micros() value and the first one is always 1 microsecond, which is wrong. Is micros() broken in the Arduino core for the Raspberry Pi Pico? I have tried both the official Arduino core and Earle's version.

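void Add()
{
  uint64_t AInt = 0;
  uint64_t Num0 = 314;
  Serial.print("Add Time : ");
  unsigned long clock0 = micros();      // start time in microseconds
  for (uint32_t i = 0; i < Loops; i++)  // unsigned counter avoids a signed/unsigned compare
  {
    AInt = AInt + Num0;
  }

  float pclock = (micros() - clock0);   // elapsed time in microseconds
  Iops = mtime / pclock;                // mtime (us in 1 s) divided by elapsed us
}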

Or the compiler just calculates the result at compile time, or removes the useless code completely, and there is nothing left to do at runtime.

The results are used elsewhere, so it can't be that the compiler ignores it. For example, the results are printed via serial. AInt is also modified after the loop, just after the second micros() call.

This is what I have currently:


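void Add()
{
  uint64_t AInt = 0;
  uint64_t Num0 = 314;
  Serial.println("Add Time : ");
  unsigned long clock0 = micros();      // start time in microseconds
  for (uint32_t i = 0; i < Loops; i++)  // unsigned counter avoids a signed/unsigned compare
  {
    AInt = AInt + Num0;
  }

  float pclock = (micros() - clock0);   // elapsed time in microseconds
  Iops = mtime / pclock;                // mtime (us in 1 s) divided by elapsed us

  AInt -= 10;                           // touch the result after the timed section

  //Serial.print(pclock);
  Serial.print("Final Value: ");
  Serial.println(AInt);
  Serial.print(Iops);
  Serial.println(" MIops");
}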

I could use random() instead of Num0, but then it would measure the performance of both random() and the integer add. When I replaced Num0 with random(0,999999); it gave 1.14 MIops with the Raspberry Pi Pico at 200 MHz.

Alright.

uint32_t Loops = 1000000;        // 1 million operations
unsigned long mtime = 1000000;   // 1 second, in microseconds
float Iops = 0;
float flops = 0;

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  pinMode(LED_BUILTIN, OUTPUT);
  digitalWrite(LED_BUILTIN, HIGH);
}

void Add()
{
  uint64_t AInt = 0;
  uint64_t Num0 = 314;
  Serial.println(" *** Add Time ***");
  unsigned long clock0 = micros();      // start time in microseconds
  for (uint32_t i = 0; i < Loops; i++)  // same unsigned type as Loops
  {
    AInt = AInt + Num0;
  }

  float pclock = (micros() - clock0);   // elapsed time in microseconds
  Iops = mtime / pclock;                // 1e6 us / elapsed us = MIops, since Loops is 1e6

  AInt -= 10;                           // touch the result after the timed section

  //Serial.print(pclock);
  Serial.print("Final Value: ");
  Serial.println(AInt);
  Serial.print(Iops);
  Serial.println(" MIops");
}

void loop() {
  // put your main code here, to run repeatedly:

  Add();
  delay(1000);
}

Thanks for the help btw.

Well, I found the problem. It was the compiler optimization in the Raspberry Pi Pico toolchain, which is on by default: it was precalculating the result. To avoid that, you have to put asm volatile("" : : : "memory"); inside the loop to tell the compiler not to reorder the code or optimize it away.

For anyone interested, here is the performance of these MCUs:
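For reference, a minimal sketch of the barrier described above, assuming a GCC or Clang toolchain. The function name addWithBarrier is made up for illustration and is not the code from this thread. It is plain C++ with no Arduino calls, so in a sketch you would call it between two micros() readings.

```cpp
#include <stdint.h>

// Hypothetical helper: adds `step` to an accumulator `loops` times.
// The empty asm statement emits no instructions, but because it is
// volatile the compiler has to keep it once per iteration, so the
// loop cannot be deleted outright.
uint64_t addWithBarrier(uint32_t loops, uint64_t step) {
  uint64_t acc = 0;
  for (uint32_t i = 0; i < loops; i++) {
    acc = acc + step;
    asm volatile("" : : : "memory");  // compiler barrier (GCC/Clang extension)
  }
  return acc;
}
```

One caveat: the "memory" clobber only covers memory, and acc is a local that can live in a register, so in principle the compiler may still work out the final value on its own and leave just the empty loop. It is worth checking the generated assembly before trusting the numbers.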

RP2040 @ 240 MHz (1 core)

  1. Int32 (Add, Sub) - 29.82 MIops
  2. Int32 (Mul, Div) - 26.52 MIops
  3. Single Precision Flops - 29.82 MFlops

ESP32-S3 @ 240 MHz (1 core)

  1. Int32 (Add, Sub, Div) - 23.88 MIops
  2. Int32 (Mul) - 21.71 MIops
  3. Single Precision Flops - 29.85 MFlops

ESP32-S2 @ 240 MHz

  1. Int32 (Add, Sub, Div) - 18.88 MIops
  2. Int32 (Mul) - 17.09 MIops
  3. Single Precision Flops - 19.93 MFlops

STM32F401 @ 84 MHz

  1. Int32 (Sub, Div) - 9.33 MIops
  2. Int32 (Mul) - 7.6 MIops
  3. Int32 (Add) - 8.39 MIops
  4. Single Precision Flops - 10.49 MFlops

Overclocking by close to 100%?

micros() on the Arduino core for the RP2040 takes close to 4 µs to execute.

I think your numbers seem suspect. Doesn't the ESP32-S3 have hardware floating point? I'd think that should be more than 2x faster than the S2 or RP2040 (which don't). Are you sure that the calculations got done in single precision?

The RP2040 doesn't have a division instruction; I'm surprised that it apparently matched the ESP32. (OTOH, the RP2040 has a single-cycle multiply, while the ESP doesn't.)

Which RP2040 "core" are you using? IIRC, the Arduino core fails to use the built-in RP2040 optimizations for floats, and might not use its "division accelerator" either.

Benchmarking is complicated!

The ESP32-S3 does have one, but it won't make a major difference in this case. For example, the ESP32-S2 doesn't have an FPU, but it can still do the calculation on the ALU, whereas on the S3 the ALU is left free because the FPU does the floating-point work. So yes, the S3 is much faster than the S2, but the benchmark tests the raw flops. The compiler for the ESP32-S3 just uses the FPU, while the S2 runs it on the ALU. Flash on all the ESP32-S3 and ESP32-S2 boards is clocked at 80 MHz. The ESP32-S3 flash can go up to 120 MHz, but that wouldn't really be fair.

Don't forget that ESP32 and Arm are different architectures. Arm has much better memory optimizations and features related to memory and cache. I've read that ST's STM32 parts have something called the ART Accelerator, which allows better flash access, etc.

And for the RP2040 I'm using Earle's board library, because only his version has overclocking. And yes, it's 240 MHz (overclocked). It can go up to 250 MHz stable, but for comparison I left it at 240.

Things like memory access, cache, and the speed of the SPI flash can affect performance. This is the best-case scenario for them. If, for example, I had the RP2040 operate directly on values in memory, the performance would be horrible, around 2 MIops, because memory access is costly.

The benchmark code I wrote prevents the compiler from tossing out the loops or precalculating the results, but it still allows the compiler to keep values in registers rather than reloading them every time, which is costly.
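To illustrate the register-versus-memory difference, here is a minimal sketch of two ways to stop the optimizer, assuming GCC or Clang. It is not the actual benchmark code, and the names addInRegister and addInMemory are made up for the example.

```cpp
#include <stdint.h>

// Register variant: the "+r" constraint tells the compiler that the empty
// asm reads and may change acc, so it cannot predict the value across
// iterations and has to perform every add. acc can still stay in a register.
uint32_t addInRegister(uint32_t loops, uint32_t step) {
  uint32_t acc = 0;
  for (uint32_t i = 0; i < loops; i++) {
    acc += step;
    asm volatile("" : "+r"(acc));  // GCC/Clang extension, emits no instructions
  }
  return acc;
}

// Memory variant: volatile forces a load and a store of acc on every
// iteration, so this measures memory traffic as well as the add.
uint32_t addInMemory(uint32_t loops, uint32_t step) {
  volatile uint32_t acc = 0;
  for (uint32_t i = 0; i < loops; i++) {
    acc = acc + step;
  }
  return acc;
}
```

Both return the same sum. Timing each one between two micros() readings should show how much the forced memory access costs on a given board.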

Operations like Mul and Div are demanding, so it makes sense that they are slightly slower than Add or Sub.

And tbh, if micros() really were taking longer on the RP2040, the measured time would include that overhead, so its real MFlops and MIops values would be even higher, making it beat the other MCUs.

The benchmark code measures how much time each MCU takes to do 1M float operations and 1M integer operations (individually, of course). Then we find how many of these 1M-operation batches it can do in 1 second by dividing 1 second by the time it takes per 1M ops, which gives the result in millions of operations per second.
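The calculation can be written out as a small helper. This is only a sketch of the arithmetic, and the name millionOpsPerSecond is made up for illustration. Operations per second is loops / (elapsed µs / 1,000,000), and dividing by 1,000,000 again to get millions leaves loops / elapsed µs.

```cpp
#include <stdint.h>

// Hypothetical helper: millions of operations per second for `loops`
// operations that took `elapsedUs` microseconds in total.
float millionOpsPerSecond(uint32_t loops, float elapsedUs) {
  if (elapsedUs <= 0.0f) {
    return 0.0f;  // no valid measurement, avoid dividing by zero
  }
  return (float)loops / elapsedUs;
}
```

For example, 1,000,000 adds in 50,000 µs gives 20 MIops. An elapsed time of 1 µs for the same loop would give 1,000,000 MIops, which is the kind of impossible figure that shows the loop was optimized away.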