AI improved my tight loop that just counts by over 300%. Is it cheating?

Hello there. I am several days into ESP32 and C/C++, so my knowledge of advanced techniques is limited.
But in a different thread, we were talking about tight loops that count and display the result every second. Performance was not really the focus there, but something else.
I'm making a new thread because the original problem was solved, and this is a new problem.

My code:

#include "esp_log.h"
#include <esp_task_wdt.h>
#include <Arduino.h>

static const char *TAG = "Main";

void setup() {
  Serial.begin(115200);
}

void loop() {
    setCpuFrequencyMhz(240);

    uint32_t lastResetTime = xTaskGetTickCount() * portTICK_PERIOD_MS;
    uint32_t performanceCounter = 0;

    while (true) {
        performanceCounter++;
        if ((xTaskGetTickCount() * portTICK_PERIOD_MS - lastResetTime) >= 1000) {
            printf("Performance score2: %lu\n", performanceCounter);
            performanceCounter = 0;
            lastResetTime = xTaskGetTickCount() * portTICK_PERIOD_MS;
        }
    }
}

Just normal stuff; basically I replaced millis() with an RTOS function.



But now the AI gave me this... I was just curious:

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"
#include "esp_log.h"
#include "soc/soc.h"

#define TAG "PERF"

volatile uint32_t performanceCounter = 0;
volatile uint32_t lastPerformanceCount = 0;

void IRAM_ATTR onTimer(void* arg) {
    uint32_t count = performanceCounter;
    lastPerformanceCount = count;
    performanceCounter = 0;
    printf("Performance score: %lu\n", count);
}

void setup() {
    esp_timer_create_args_t timerArgs;
    timerArgs.callback = &onTimer;
    timerArgs.name = "PerformanceTimer";
    timerArgs.arg = NULL;
    timerArgs.dispatch_method = ESP_TIMER_TASK;
    timerArgs.skip_unhandled_events = false;

    esp_timer_handle_t periodicTimer;
    ESP_ERROR_CHECK(esp_timer_create(&timerArgs, &periodicTimer));
    ESP_ERROR_CHECK(esp_timer_start_periodic(periodicTimer, 1000000)); // 1 second interval
}

void loop() {
    uint32_t localCounter = 0;
    while (true) {
        localCounter++;
        if ((localCounter & 0xFFF) == 0) { // Every 4096 iterations
            uint32_t current = performanceCounter;
            uint32_t newValue;
            do {
                newValue = current + 0xFFF;
            } while (!__sync_bool_compare_and_swap(&performanceCounter, current, newValue));
            localCounter = 0;
        }
    }
}

This was one-shot by the AI, and it said it could be improved further.
Then one user said:



This is maybe a bit over my head, and I am not sure how to debug this, but I tried:

        if (localCounter > 4000) { // Every 4096 iterations
          printf("%lu / %lu\n", localCounter, performanceCounter);
        }

I checked first with a simple print(localCounter) that it counts correctly to 4096.
I think the following is the critical spot.
So I removed the first print and put this in:

        if (localCounter > 4000) { // Every 4096 iterations
          printf("%lu / %lu\n", localCounter, performanceCounter);
        }

Result:

...
4002 / 4095
4003 / 4095
...
4093 / 4095
4094 / 4095
4095 / 4095
4096 / 4095
...
4001 / 8190
4002 / 8190
4003 / 8190
4004 / 8190
...
4094 / 8190
4095 / 8190
4096 / 8190
...
4001 / 12285
4002 / 12285
4003 / 12285
...
4094 / 12285
4095 / 12285
4096 / 12285
...
4001 / 16380
4002 / 16380
4003 / 16380
...
4095 / 16380
4096 / 16380
...
4001 / 20475
4002 / 20475
4003 / 20475

So from my basic understanding, the AI is correct. Is it? :see_no_evil: :face_with_peeking_eye:
Where else could it cheat?
And one second seems to be one second.
The AI used:

  • a hardware timer and interrupt for precise timing
  • local counter to reduce atomic operations
  • atomic operations for thread-safe counting

If correct, what could be learned from that?
Can our superloops with a lot of time checks be optimized?
I mean, isn't that "THE" most basic tool/principle of Arduino?




looks like the AI code simply operates with a different timing. I wouldn't describe this as more efficient.

if you want to measure how long it takes to do something a million times (without optimization), you typically capture a timestamp before/after the loop and print the difference.

I asked the AI, and I think I understand at least parts of it; it's pretty clever.


  • We use a localCounter to minimize access to the shared performanceCounter.
  • The loop increments localCounter as fast as possible.
  • Every 4096 iterations (when localCounter & 0xFFF is 0):
  • We attempt to update performanceCounter atomically.
  • __sync_bool_compare_and_swap is a GCC built-in function for atomic operations.
  • It compares performanceCounter with current, and if they're equal, it sets performanceCounter to newValue.
  • If the swap fails (due to an interrupt changing performanceCounter), we retry with the new current value.
  • This approach reduces contention on performanceCounter and ensures we don't lose any counts due to race conditions with the interrupt.

I also showed that it does not skip any numbers.
I can't check all of it, but I checked the first hundred thousand.
I can't find any error.

So why did you say that it is 80% garbage, when it is working?

Here is a more detailed explanation if you really wanna know:

  • if ((localCounter & 0xFFF) == 0)
  • 0xFFF is hexadecimal for 4095 (12 bits set to 1).
  • The & operator performs a bitwise AND.
  • This condition is true every 4096 iterations of the loop.
  • Why 4096? It's a power of 2, which makes this check very fast on binary computers.
  • uint32_t current = performanceCounter;
  • We take a local copy of the current performanceCounter value.
  • uint32_t newValue;
  • We declare a variable to hold the new value we want to set.
  • do { newValue = current + 0xFFF; } while (...);
  • This is a do-while loop that will keep trying to update the counter until it succeeds.
  • We're adding 4095 (0xFFF) to the current value, which represents the 4096 iterations we've just completed.
  • __sync_bool_compare_and_swap(&performanceCounter, current, newValue)
  • This is a GCC built-in function for atomic operations.
  • It does three things atomically (i.e., without interruption): a. Compares the value at &performanceCounter with current b. If they're equal, it sets performanceCounter to newValue c. Returns true if the swap occurred, false otherwise
  • This operation is atomic, meaning it's guaranteed to happen all at once without being interrupted.
  • The while (!...) means the loop continues as long as the compare-and-swap returns false.
  • If the swap fails (because performanceCounter was changed by the interrupt between steps 2 and 5), we loop again with the new current value.
  • localCounter = 0;
  • After successfully updating performanceCounter, we reset localCounter to start counting another 4096 iterations.
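
Two details of the quoted loop are worth flagging, neither of which the AI's explanation mentions. First, it adds 0xFFF (4095) per batch of 4096 iterations, which is why the printed totals step by 4095 (4095, 8190, 12285, ...). Second, `current` is computed once before the do-while, so if the CAS ever failed, the loop would retry with a stale value instead of "the new current value" the explanation claims. A minimal standalone sketch of the usual retry pattern (names are mine, not from the thread) re-reads on every attempt:

```cpp
#include <cstdint>

volatile uint32_t counter = 0;

// CAS retry loop that re-reads the target on each attempt, so it can
// make progress even if an interrupt changed `counter` in between.
// Adding 0x1000 (4096) per batch matches the 4096 iterations, where
// the quoted code adds 0xFFF (4095), hence the 4095-spaced output.
void atomic_add(volatile uint32_t *target, uint32_t delta) {
    uint32_t current, desired;
    do {
        current = *target;         // fresh read on every attempt
        desired = current + delta;
    } while (!__sync_bool_compare_and_swap(target, current, desired));
}
```

This is only a sketch of the pattern using the same GCC builtin as the quoted code; on a single-writer setup like this one, the CAS never actually fails, which is why the bug goes unnoticed.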

what????
Which different timing? There are no "times" involved... lol.

Yeah, doing something over 300% faster is not more efficient. OK.

OK, I learned that on day one.
I use it to measure bottlenecks WITHOUT the need to write anything (since it's always running); I've tested it a hundred times. But this is not the point right now.

looks like your code is resetting a counter after some period of time, and presumably you're looking at how long it takes that counter to reset.

the AI code is simply resetting the counter every 4096 iterations.

i don't know what you specified to the AI, but it doesn't look like they're doing the same thing, so i wouldn't say one is more efficient than the other

LOL.

No, it's pretty crappy. The problem is, you first have to learn a lot about programming and logic to realize that it's just a mirror cabinet.


No, this comparison makes no sense again. And the AI's explanation is not the reason for the difference.

The first one is slower simply because, besides the increment, it is executing this at each loop cycle:

xTaskGetTickCount()* portTICK_PERIOD_MS - lastResetTime

And the second one is just doing this, which is much faster:

localCounter & 0xFFF

The rest of the code is not executed on every loop round, so it is not relevant.
Of course the timer is running in parallel, to avoid having to count the second manually. Use it in both cases, but it will not make much sense anyway.

The AI is talking about atomic operations to avoid conflicts with interrupts and concurrency, which is very nice, sounds clever, and is indeed relevant in some situations, but not here.

It's not relevant because in the other code the counter is local and not shared with anyone. And interrupts can hit both codes in the same way in the main loop. The atomic operation is only executed once out of 4096 times; the increment operation is the same in both cases.

And anyway the whole concept of the test makes no sense, like the previous ones. I don't see the point.

I also use the AI; it saves a lot of time. There is nothing wrong with that, as long as you understand what you are doing. Otherwise you are blind, even if it looks like it works. Free tickets to disaster station. But it's up to you.

What do you mean by cheating?

It cheats because it only checks the termination condition every 4096 loops, while your original code checks it (relatively expensively: at least a function call) every time through the loop. (we already talked about that function call being the expensive part of the loop, right?)

Also:

We use a localCounter to minimize access to the shared performanceCounter.

But performanceCounter wasn't "shared" in the first example, and was simply in a register, so this is an "optimization" only in the sense that it undoes a complication that the new code introduced.

It needlessly complicates the code with interrupts and thus further needs to complicate it (so it thinks, anyway) with atomic access issues. (I'm pretty sure that 32bit accesses on ESP32 are already atomic.)
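
(For reference: aligned 32-bit loads and stores are indeed atomic on the ESP32's Xtensa cores. A portable way to say so explicitly, instead of `volatile` plus GCC builtins, is `std::atomic` — a sketch with illustrative names, not code from the thread:)

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> performanceCounter{0};

// Reader side (the timer callback): grab the value and clear it in
// one atomic step, so no increments are lost between read and reset.
uint32_t snapshot_and_reset() {
    return performanceCounter.exchange(0, std::memory_order_relaxed);
}

// Writer side (the tight loop): fold a whole batch into the shared
// counter with a single atomic add instead of a CAS retry loop.
void add_batch(uint32_t n) {
    performanceCounter.fetch_add(n, std::memory_order_relaxed);
}
```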

Now, the whole "do expensive code less often" idea is a valid and important optimization strategy. We frequently see people who call neopixels.show() or servo.run() much more often than they should. But it's apples vs. oranges to compare code that skips the check with code that doesn't.
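
(That strategy can be applied to the original millis()-style loop directly, without any interrupts. A host-runnable sketch of the idea, using std::chrono as a stand-in for millis() — the function name is mine:)

```cpp
#include <chrono>
#include <cstdint>

// Count as fast as possible for roughly `ms` milliseconds, but read
// the clock only once every 4096 increments. The expensive operation
// (the time check) is amortized over the batch, which is the same
// trick the AI code uses, minus the timer interrupt.
uint64_t count_for_ms(unsigned ms) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + std::chrono::milliseconds(ms);
    uint64_t counter = 0;
    for (;;) {
        ++counter;
        if ((counter & 0xFFF) == 0 &&   // batched: 1 clock read per 4096
            clock::now() >= deadline)
            break;
    }
    return counter;
}
```

The loop can overshoot the deadline by up to 4095 iterations, which is the accuracy you trade for the speed.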

1st: millis() is NOT used in any code... and the AI's code uses an interrupt.


There are NO time checks in the AI's tight loop.
You compared these two lines, but they have nothing in common!!
The second line is just splitting the counter into 4096-piece blocks, to count faster (very clever).
This is getting silly now...

I gave it my code and just asked it to make it faster... not for practical reasons, but for learning, and there is a lot to learn.

??? Both are trying to count as fast as possible. (That was the task.)
The AI's is doing the same thing 300% faster.
It is correctly counting and not skipping anything... so yeah, 300% faster is more efficient.

It is NOT a mirror cabinet!! It's NOT an illusion.
It's correctly counting for one second... please point out where it is crappy, and where that mirror cabinet is.

Hello westfw.

I think you are actually pretty capable, but you made a small mistake here.
There is NO termination condition in the tight loop.
This one I realized by myself, but for the rest,
I'll let the AI respond to your critique, if you don't mind.

1:
Termination condition: The AI-generated code does not actually check the termination condition every 4096 loops. The termination is handled by the timer interrupt, which is completely separate from the counting loop.

2:
Local vs. shared counter: The use of a local counter isn't just about undoing a complication. Even if the global counter isn't shared between tasks, using a local counter can still be beneficial:
• It reduces memory access, keeping the most frequently accessed variable in a CPU register.
• It allows for batched updates to the global counter, which can be more efficient, especially on some architectures.

3:
Interrupts and atomic access: You make a good point about 32-bit accesses being atomic on ESP32. The use of atomic operations here might indeed be overly cautious for this specific platform. However, it demonstrates a good practice for code that might be ported to other platforms or for situations where compiler optimizations might affect the intended behavior.

4:
Complexity trade-off: While the interrupt-based approach does add some complexity, it allows for more precise timing without affecting the main counting loop. This separation of concerns can be beneficial in more complex real-world applications.

Your point about comparing "apples to oranges" is fair. Perhaps a more equitable comparison would involve applying similar optimization techniques to the original code without changing the fundamental approach.

I appreciate your insights, as they highlight important considerations in optimizing embedded code. This discussion underscores the importance of understanding the trade-offs involved in various optimization techniques and applying them judiciously based on the specific requirements and constraints of each project.

Hello stonemk.
Same as with the previous poster: these lines have nothing in common.
The first line was replaced by interrupt timing; the second line is basically just a method to count faster. Or to be precise: it keeps track more efficiently and faster than the default approach.
But it still counts one by one... just a different way of keeping track of the number.

And NO, I don't use this code in my program, since I don't fully understand it, it's too complicated, and it does not run sequentially (sorry, I lack the term, but you understand).
And I don't need it, since my original code is simpler and does basically the same.

This "make the code faster" was just a test, to look into more advanced stuff, and boy did it deliver. But this is separate from my actual use.

And yes, the performance counter is a valid approach to check for bottlenecks WHILE you are programming, NOT in the finished product... lol.
Without the need to put in time checks all the time.

I tested it a few dozen times.
I can perfectly see the difference between a normal sensor reading and a non-blocking sensor reading, without the need to put check lines in.

While I am implementing new stuff, I can instantly see its performance impact, since the counter is always on screen.
It only works if the blocking thing is not longer than your performance counter update interval (e.g. one second), but for anything that big you don't need the counter to notice it anyway.
I tested it against direct time checks, and I could see no drawback vs. direct timing.
Maybe there is one; I have not found it. (Just found one: more energy consumption if power management is on... lol)

If you map that huge number into the range 0 to 100, or 0 to 1000 (more accuracy),
you can spot things even more easily.
Do you time-check EVERY time with all the lines you put in?
You would have to suspect the right part in the first place.
This approach also easily lets me see when something else in the system is wrong,
like a different configuration, debug mode on, the CPU frequency, etc.
Also I think it has advantages when your program is not running sequentially, but actually using a real RTOS. (Not confirmed.)

Why do you keep saying it makes no sense (not ONE argument presented), when practical tests show that it does??

But again, this is not the topic right now. It was not even mentioned here in the thread.
There are a lot of things to learn here.

I tested the interrupt thing with a normal local counter, and it gave about
~20 million, like our result in the previous thread.
The biggest boost comes from the way it split the counting into smaller, much faster parts: ~100% faster,
40 million.

I have very little interest in continuing the conversation like this.
The purpose was not what this code is used for, but:
why code C achieves the SAME RESULTS as code A or B, but 200% or 300% faster.
Like interrupts, the hardware timer, why 4096 (a power of 2, which makes the check very fast on binary computers), etc.
But I see this is probably the wrong place for that.

(To be honest, when thinking about it before asking the AI, I had something similar in mind: split the counting up into smaller parts and then work with a swap. But I did not try it.
Then, after the AI's solution, I realized that I had been on the right track.)

No, the boost comes just from doing fewer things in the loop.

Only these two rows of the loop are executed on every round:

while (true) {
        performanceCounter++;
        if ((xTaskGetTickCount()* portTICK_PERIOD_MS - lastResetTime)

And in the other code, only these:

while (true) {
        localCounter++;
        if ((localCounter & 0xFFF) == 0) {

Those are executed millions of times; the rest of the instructions are not.
That's the single and only reason why the second code is faster. And that's possible because one uses the timer to track the time and the other doesn't.

Conclusions:

  • If you compare codes with different instructions, you get different results. Amazing.
  • If you use the timer instead of counting the time manually, it is faster. Congratulations, you invented the noodle soup.

But OK, if you draw some conclusions and this is useful or interesting for you, that's fine. What do you want to demonstrate here?

It's delusion.

Seems it's because the test for the number of counts is faster than the way you checked whether the time was reached.

If it were me, I'd refactor the above code to remove the multiplications:

void loop() {
    setCpuFrequencyMhz(240);

    TickType_t lastResetTime = xTaskGetTickCount();
    const TickType_t period = 1000 / portTICK_PERIOD_MS;
    uint32_t performanceCounter = 0;

    while (true) {
        performanceCounter++;
        if ((xTaskGetTickCount() - lastResetTime) >= period) {
            printf("Performance score2: %lu\n", performanceCounter);
            performanceCounter = 0;
            lastResetTime += period;
        }
    }
}

Now times are counted in ticks using the FreeRTOS native type TickType_t, and there's no multiplication in sight. (Using TickType_t and unscaled ticks is important for proper wraparound handling.) That single division (1000 / portTICK_PERIOD_MS) is done at compile time because portTICK_PERIOD_MS is a constant.

But since this is using FreeRTOS, why not have a separate task that does only one thing, and use vTaskDelay() to delay for the right amount of time? I'm sure FreeRTOS has an optimized vTaskDelay() (and it lets other tasks run, which is more efficient than even the AI version). I'm not familiar with all the timer functions available in FreeRTOS, but there's likely a way to make a task run periodically without using vTaskDelay(). One of the advantages of using an RTOS is that you can have multiple independent tasks, so you don't need millis()-based multitasking.
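
(A minimal, untested sketch of that separate-task approach, using the documented FreeRTOS API — names and stack/priority numbers here are illustrative. vTaskDelayUntil() reschedules relative to the previous wake time, so the one-second period does not drift the way repeated vTaskDelay() calls can:)

```cpp
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include <stdio.h>

extern volatile uint32_t performanceCounter;  // incremented by the counting loop

// Reporting task: wakes exactly once per second, snapshots and resets
// the counter, and prints. The counting loop never checks the time.
void reportTask(void *arg) {
    TickType_t lastWake = xTaskGetTickCount();
    const TickType_t period = pdMS_TO_TICKS(1000);
    for (;;) {
        vTaskDelayUntil(&lastWake, period);   // sleep until one period after the last wake
        uint32_t count = performanceCounter;  // aligned 32-bit read is atomic on ESP32
        performanceCounter = 0;
        printf("Performance score: %lu\n", (unsigned long)count);
    }
}

// In setup():
//   xTaskCreate(reportTask, "report", 2048, NULL, 2, NULL);
```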