Tips to increase Serial throughput (or alike)?

Robin2:
I don't think you are doing yourself a favour by thinking of this as 4 cranes rather than 4 printers.

I didn't do myself any favours when I went for a single-X-axis printer without checking software availability first. I even thought about cutting the aluminum frame to get 4 printers. But I really need multiple tools and, excluding expensive industrial machines, there's nothing in the DIY landscape. Existing tool changers are complex and ... more expensive than a whole crane (frame+motors+pcb+...).

Imagine that you really have 4 separate 3D printers. In that case there is no problem having them all "live" at the same time.

Now, suppose that you could disconnect the "crane" parts of the 4 printers from their X axes. You could continue to have all 4 live at the same time.

But the problem is how to share the X axis so that, when the flex-plastic is needed, the X axis is controlled by the flex-plastic-Mega, when the PEEK is needed the X axis is controlled by the PEEK-Mega etc.

Disconnecting isn't enough. If you only disconnect, the build time moves on but the disconnected Megas don't keep up. Even if you flip the pins to INPUT so that the disconnected Megas get updates from the master Mega currently at work, they might miss some details: what are you going to do with temperature reports? And how do you move the other arms to the next starting position (or warm up the other tools) if they aren't in charge of the current build? It's a cool nightmare...

I don't know how that could be done with GRBL but I know it could be done with a small modification of the sort of system I am using and, most importantly, it would NOT require individual step pulses to be sent by Serial or indeed any complicated synchronisation between the Megas.

Have you published the firmware somewhere I can take a look at it?

anichang:
Disconnecting isn't enough. If you only disconnect, the build time moves on but the disconnected Megas don't keep up. Even if you flip the pins to INPUT so that the disconnected Megas get updates from the master Mega currently at work, they might miss some details: what are you going to do with temperature reports? And how do you move the other arms to the next starting position (or warm up the other tools) if they aren't in charge of the current build? It's a cool nightmare...

I can't help feeling you are making this more complicated than it needs to be.

I said nothing about disconnecting Megas - that would be silly. And how could they miss details?

It seems to me you have a particular working model in your mind that you have not explained clearly to us. Maybe you can write a short description of how you want your machine to work - describe the things that need to happen as time passes and the print develops.

Have you published the firmware somewhere I can take a look at it?

No, and I don't plan to. It would be meaningless without the Python code and it would be too much trouble to provide support for all of that.

The Arduino code is very simple really. The PC sends the number of steps for the motor that moves the most steps and the numbers of µsecs between steps for all of the motors. The Arduino then just uses that data to produce steps for each motor. When all the steps have been done the next message from the PC will be waiting in the Serial Input Buffer.
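As a rough illustration of that message format (a sketch only, not the actual firmware; the `MoveMsg` struct, the `run_move` function and the 4-motor limit are invented for the example), the Arduino side can be modelled on a PC as a loop that fires each motor whenever its own interval has elapsed and stops once the busiest motor has done all its steps:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_MOTORS 4

/* Hypothetical message layout for one move, as described in the post. */
typedef struct {
    uint32_t master_steps;             /* steps of the motor that moves the most */
    uint32_t interval_us[NUM_MOTORS];  /* microseconds between steps, 0 = idle   */
} MoveMsg;

/* Simulates one move; fills steps_out[] and returns the move duration in us. */
static uint32_t run_move(const MoveMsg *m, uint32_t steps_out[NUM_MOTORS]) {
    uint32_t next_due[NUM_MOTORS];
    int master = -1;

    assert(m && steps_out);
    for (int i = 0; i < NUM_MOTORS; i++) {
        steps_out[i] = 0;
        next_due[i] = m->interval_us[i];
        /* the master is the motor with the shortest non-zero interval */
        if (m->interval_us[i] != 0 &&
            (master < 0 || m->interval_us[i] < m->interval_us[master])) {
            master = i;
        }
    }
    if (master < 0) return 0;  /* nothing moves in this message */

    uint32_t now = 0;
    while (steps_out[master] < m->master_steps) {
        /* jump to the earliest pending step time */
        uint32_t t = UINT32_MAX;
        for (int i = 0; i < NUM_MOTORS; i++) {
            if (m->interval_us[i] != 0 && next_due[i] < t) t = next_due[i];
        }
        now = t;
        /* step every motor that is due at this instant */
        for (int i = 0; i < NUM_MOTORS; i++) {
            if (m->interval_us[i] != 0 && next_due[i] == now) {
                steps_out[i]++;
                next_due[i] += m->interval_us[i];
            }
        }
    }
    return now;
}
```

With 100 master steps at 100 µs and two slower motors at 250 µs and 400 µs, the whole move fits in one short message, whatever the number of steps.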

...R

Robin2:
The Arduino code is very simple really. The PC sends the number of steps for the motor that moves the most steps and the numbers of µsecs between steps for all of the motors. The Arduino then just uses that data to produce steps for each motor. When all the steps have been done the next message from the PC will be waiting in the Serial Input Buffer.

!!!!
Dude, that's exactly the missing bit here. You reduce the amount of data to be sent by sending a carrier and modulating 1/frequency (the period). If I put an antenna on my roof I can hear the music going out of your printer on 106.6MHz FM...
I've been searching for differential signaling and other encodings that can serve the purpose but couldn't find anything that can be crammed into an avr.

I would expect the mega to be able to handle eight steppers with ease, especially if you use PORT commands to step eight at a time.

So you "just" need to write your own pair of programs that understand your multi-axis three-D printer. One to slice to G-code and one for the Mega to interpret it and drive. Much easier than syncing four micro controllers.

Also, I'd be inclined to worry more about building the thing and getting it working than starting from "I must have maximum speed". You can tune it later if it's too slow and if you have to, bite the bullet and upgrade your hardware.

Do these multiple gantries extrude simultaneously? A picture would be interesting.

anichang:
!!!!
Dude, that's exactly the missing bit here. You reduce the amount of data to be sent by sending a carrier and modulating 1/frequency (the period). If I put an antenna on my roof I can hear the music going out of your printer on 106.6MHz FM...
I've been searching for differential signaling and other encodings that can serve the purpose but couldn't find anything that can be crammed into an avr.

I'll dig out my tinfoil hat and leave you to your own devices.

All I can say in my favour is that my projects work.

...R

wildbill:
I would expect the mega to be able to handle eight steppers with ease, especially if you use PORT commands to step eight at a time.

So you "just" need to write your own pair of programs that understand your multi-axis three-D printer. One to slice to G-code and one for the Mega to interpret it and drive. Much easier than syncing four micro controllers.

Also, I'd be inclined to worry more about building the thing and getting it working than starting from "I must have maximum speed". You can tune it later if it's too slow and if you have to, bite the bullet and upgrade your hardware.

Do these multiple gantries extrude simultaneously? A picture would be interesting.

I can't use port commands as the stepper pins aren't all on the same port. And there are 3-7 steppers per crane: one has a laser instead of the extruder (ie: 3 steppers), another one has a quad-extruder setup (ie: 7 steppers). Spare steppers on one board need to help the other board... I need to control about 25 steppers. 1 Mega can't do it.
It's already built and tested crane-by-crane using a single Mega with Klipper loaded. I need to make it work concurrently.
About 'tune it later' and 'bite the bullet': I'm already there. Last year I bought the motors, drivers, pulleys and belts so that I could tune both hardware and software once everything was in place; read the previous comments for details. Right now I need the software before I can move on to fine-tuning the whole thing.
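On the port-commands point, here is a minimal sketch of how step pins spread over several ports can still be written with one store per port, by precomputing a mask for each port. It runs on a PC against a simulated register array; the `PinLoc` type, the three-port layout and the function names are assumptions made up for the example, and on a real AVR the last loop would become writes to the actual PORTx registers. It does nothing about the total of ~25 steppers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PORTS 3   /* hypothetical: e.g. PORTA, PORTB, PORTC */

/* Where one step pin lives: port index and bit number within that port. */
typedef struct {
    uint8_t port;
    uint8_t bit;
} PinLoc;

/* Bit i of step_bits says whether stepper i must step now.
   The result is one bit mask per port. */
static void build_port_masks(const PinLoc *pins, size_t n,
                             uint32_t step_bits, uint8_t masks[NUM_PORTS]) {
    assert(n <= 32);
    for (size_t p = 0; p < NUM_PORTS; p++) masks[p] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(pins[i].port < NUM_PORTS && pins[i].bit < 8);
        if (step_bits & (1UL << i)) {
            masks[pins[i].port] |= (uint8_t)(1u << pins[i].bit);
        }
    }
}

/* One write per port raises all the step pins that belong to it. */
static void apply_masks(uint8_t port_regs[NUM_PORTS],
                        const uint8_t masks[NUM_PORTS]) {
    for (size_t p = 0; p < NUM_PORTS; p++) port_regs[p] |= masks[p];
}
```

The masks can be built outside the ISR, so the time-critical part shrinks to `NUM_PORTS` register writes regardless of how many steppers move.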

Robin2:
I'll dig out my tinfoil hat and leave you to your own devices.

All I can say in my favour is that my projects work.

omg, don't go autistic and read again. I didn't imply/suppose anything offensive. I'm sure it works, that's why I asked you to publish the firmware part. There's no need of the python part as it's trivial to reproduce. The hard part is to handle bits on the arduino and after 4 pages of comments you don't want to publish that!

I just found it interesting that you used an approach similar to frequency modulation! Where did you get the tinfoil hat thing from?

anichang:
The hard part is to handle bits on the arduino and after 4 pages of comments you don't want to publish that!

I just use the digitalWriteFast library to deal with my I/O pins. I'm a simple soul.

This is the code that does that. This ISR is called every 40 µsecs.

ISR (TIMER2_COMPA_vect) {

            // the movement works like this
            // the master axis moves at each ISR call
            // it adds a value to multicount which represents its contribution to the LCM
            //          for all axes combined
            // when multicount reaches the counter for any other axis, that axis steps
            //      and increments its counter by its contribution to the LCM
            // the PC will have sent    the step interval
            //                          the number of master steps
            //                          the lcm contribution for each axis

            // HAVE I PROPERLY dealt with axis that should not move?
            //      YES - that is handled by the direction being 0
            //          which is figured out in updateMoveData from the lcmINC being 0

    if (moveInProgress == false) {
        return;
    }
    else if (stepsToGo > 0) {

            // increment multicount for each tick
        multiCount += curCountIncrement;

            // turn the step pulse HIGH when appropriate
        if (moveData.xAxisInc > 0) {
            if (multiCount >= xAxisCount) {
                digitalWriteFast(xAxisStepPin, HIGH);
                xAxisCount += moveData.xAxisInc;
                if (moveData.masterAxis == 'X') {
                    stepsToGo --;
                    stepsMoved ++;
                    accelSkip ++;
                }
            }
        }
        if (moveData.yAxisInc > 0) {
            if (multiCount >= yAxisCount) {
                digitalWriteFast(yAxisStepPin, HIGH);
                yAxisCount += moveData.yAxisInc;
                if (moveData.masterAxis == 'Y') {
                    stepsToGo --;
                    stepsMoved ++;
                    accelSkip ++;
                }
            }
        }
        if (moveData.zAxisInc > 0) {
            if (multiCount >= zAxisCount) {
                digitalWriteFast(zAxisStepPin, HIGH);
                zAxisCount += moveData.zAxisInc;
                if (moveData.masterAxis == 'Z') {
                    stepsToGo --;
                    stepsMoved ++;
                    accelSkip ++;
                }
            }
        }
        if (moveData.aAxisInc > 0) {
            if (multiCount >= aAxisCount) {
                digitalWriteFast(aAxisStepPin, HIGH);
                aAxisCount += moveData.aAxisInc;
                if (moveData.masterAxis == 'A') {
                    stepsToGo --;
                    stepsMoved ++;
                    accelSkip ++;
                }
            }
        }

            // start Timer2B to turn the pulses LOW
        OCR2B = TCNT2 + 2;
        TIMSK2 = 0b00000110; // enable Timer2B interrupt


        if (stepsMoved < accelSteps) { // speeding up
            if (curCountIncrement < moveData.fastCountIncrement and
                    accelSkip == moveData.accelStepInterval) {
                curCountIncrement ++;
                accelSkip = 0;
            }
        }
        if (stepsToGo < accelSteps) {  // slowing down
            if (curCountIncrement > moveData.slowCountIncrement and
                    accelSkip == moveData.accelStepInterval) {
                curCountIncrement --;
                accelSkip = 0;
            }
        }

    }
    else {
        moveInProgress = false;
    }

}
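To see why the LCM bookkeeping in the comments gives every axis the right number of steps, here is a small PC-side model. It is a sketch under assumptions, not the code above: it runs at constant speed (the increment is always the master's, with no acceleration ramp), it assumes each axis counter starts at that axis's own increment, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_AXES 4

static uint32_t gcd_u32(uint32_t a, uint32_t b) {
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* PC side: take the LCM of the non-zero step counts; each axis's increment
   is lcm / steps, or 0 for an axis that does not move.
   Note: the LCM can overflow 32 bits for large, unrelated step counts. */
static void lcm_increments(const uint32_t steps[NUM_AXES],
                           uint32_t inc[NUM_AXES]) {
    uint32_t lcm = 1;
    for (int i = 0; i < NUM_AXES; i++) {
        if (steps[i] != 0) lcm = lcm / gcd_u32(lcm, steps[i]) * steps[i];
    }
    for (int i = 0; i < NUM_AXES; i++) {
        inc[i] = steps[i] ? lcm / steps[i] : 0;
    }
}

/* Arduino side: each loop iteration stands for one ISR call. */
static void simulate_move(const uint32_t inc[NUM_AXES], int master,
                          uint32_t master_steps, uint32_t done[NUM_AXES]) {
    uint32_t multi_count = 0;
    uint32_t axis_count[NUM_AXES];

    assert(inc[master] > 0);
    for (int i = 0; i < NUM_AXES; i++) {
        axis_count[i] = inc[i];   /* assumed starting value */
        done[i] = 0;
    }
    for (uint32_t tick = 0; tick < master_steps; tick++) {
        multi_count += inc[master];
        for (int i = 0; i < NUM_AXES; i++) {
            if (inc[i] > 0 && multi_count >= axis_count[i]) {
                done[i]++;                 /* this is where the step pin goes HIGH */
                axis_count[i] += inc[i];
            }
        }
    }
}
```

For step counts 6, 4, 0 and 3 the LCM is 12, the increments are 2, 3, 0 and 4, and after 6 ticks every axis has made exactly its requested number of steps, spread evenly across the move.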

...R

Robin2:
I just use the digitalWriteFast library to deal with my I/O pins. I'm a simple soul.

This is the code that does that. This ISR is called every 40 µsecs.

Me too. I've installed the "Pin" library; no idea yet whether it's fast enough. I'll check it out later and, if need be, switch back to avr-libc macros.
About the ISR: needless to say, it is HUGE for an ISR. I'm sure it works in your use case, but that's a 40 µs ISR!!!
It's a 640-cycle interval (~25kHz); that's long enough to run the long procedure you placed in the ISR. For reference: GRBL goes up to 30kHz; Klipper is the speed champ and goes up to ~150kHz (@16MHz, 1 stepper, but the rate drops quickly as steppers are added).
I'm working with the upper theoretical limit (250kHz) as my reference: that leaves 32+32 cycles per pulse (HIGH+LOW time). Theoretically. I don't really need that; the real-world scenario is Klipper (150kHz). I just need to match that.
Stefan hasn't replied yet; he wrote that a real-world (small) motor can't produce enough torque to go that fast, and worked it out to 18.7 microseconds (~53kHz). That's a 299-cycle interval: much easier computationally, but still half of your timing.

Anyway, thanks for posting your code. I'll take a closer look as soon as the new version of my code reaches the point where I need to trigger an ISR. Currently, for precise triggering, I'm using something similar to the hack published long ago on HAD.
The original author used it to trigger the PWM pins; I'm triggering the compB vector instead to toggle 8 arbitrary pins. I've tested it using Wiring and it works well enough: timer0 untouched (fires ovf for Wiring timekeeping only), timer1@16MHz (a few instructions to advance the trigger index), timer2@16MHz (8-pin toggle), and no delay. I'm manually accounting for the cycles I use in each ISR, which is why I get no delay.
Note: you asked why use Klipper if your solution works. As a rule of thumb you are right: if you don't need anything more, it's good as is (@25kHz). But if you feel the need for more speed, Klipper can give it to you straight away, as is (@150kHz). And joining Kevin on Klipper gives you the chance to share your efforts and receive plenty of goodies from a public project.
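The cycle budgets quoted above follow directly from the 16 MHz clock. A small sketch of the arithmetic (the helper names are made up; periods are given in nanoseconds to keep everything in integers):

```c
#include <assert.h>
#include <stdint.h>

#define F_CPU_HZ 16000000UL   /* 16 MHz clock, as on the Mega */

/* CPU cycles available in one step period (rounded down). */
static uint32_t cycles_per_period(uint32_t period_ns) {
    return (uint32_t)(((uint64_t)period_ns * F_CPU_HZ) / 1000000000ULL);
}

/* Step rate in Hz for a given period (rounded down). */
static uint32_t rate_hz(uint32_t period_ns) {
    assert(period_ns > 0);
    return 1000000000UL / period_ns;
}
```

`cycles_per_period(40000)` gives 640 cycles at 25 kHz, `cycles_per_period(18700)` gives 299, and `cycles_per_period(4000)` gives the 64 cycles (32 HIGH + 32 LOW) of the 250 kHz case.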

anichang:
About the ISR: needless to say, it is HUGE for an ISR. I'm sure it works in your use case, but that's a 40 µs ISR!!!
It's a 640-cycle interval (~25kHz);

I find this conversation very strange. This is the second time you have treated what I have contributed as if it is set in stone when it is only intended as a suggestion to spark off ideas. (And I'm very pleased to be as close as I am to the GRBL speed :slight_smile: )

I'm working with the upper theoretical limit (250kHz) as my reference:

This is where I have a problem. I have looked back over this Thread and I can't see anywhere an explanation of why that repetition rate is needed. What are you doing that needs a speed 8 times as fast as GRBL ?

...R

Robin2:
I find this conversation very strange. This is the second time you have treated what I have contributed as if it is set in stone when it is only intended as a suggestion to spark off ideas. (And I'm very pleased to be as close as I am to the GRBL speed :slight_smile: )

Well, as Stefan pointed out: the result is good if it works for you. There are other constraints given the motor, the weight to be moved, and so on; so, basically, if your printer has other speed limiting factors, 25kHz can be more than what you need. Given the references we can publicly access, it looks like a good result to me.
In fact, it's the second time I don't understand why you read something nasty into my words!!! And every time I need to go back and read again what I wrote: "Did I say something nasty? I usually do, this time it doesn't look like it... where's the clue then... let's read again... and again... and again... "

Edit: and yes! You gave me a good idea: to use 1 motor 'as a carrier'. But I can't use the same implementation because your requirements are more relaxed, so you can push all that code in a single ISR.

This is where I have a problem. I have looked back over this Thread and I can't see anywhere an explanation of why that repetition rate is needed. What are you doing that needs a speed 8 times as fast as GRBL ?

Ok. Let's clear up this misunderstanding too. First: 'strictly needed', absolutely not. My design rules are many, but they don't include speed. When I write 'need' it's more like 'desiderata'. But in this case it is more like 'could be needed in the near future', because I still don't know what microstepping I'll be using. I like a semi-empirical approach (study-while-implementing), Agile if you like... built-to-order... in any case not 2 separate stages, 1 to design and 1 to implement the design. I did the torque math on a napkin last year while a movie was watching me; got the result, bought the stuff, trashed the napkin (because I had finished eating) and forgot about the math. And it took me a while to figure out what Stefan was talking about a few comments ago. (And you didn't read it; he is a bit cryptic but anybody with previous experience should get what he says.)
That said, going from full steps to 256 microsteps multiplies by 256 the number of steps needed to complete the same number of rotations. Once everything is up and running I could be happy with 1kHz, or in need of 256kHz. Microsteps aren't really needed; they're just a way to give the motors a better signal (with some derived benefits). In my case it's unlikely I'll use 256 microsteps: first, because not all of my drivers support that (most of them are cheap); second, because I'm not sure my motors can do it; third, because I have old AVRs and they just can't do it in any case. But I want to use the good drivers to test 256 microsteps and measure noise, temperatures and power consumption.
So, at this point in time, the simple 'the more the better' is the approach.
There's also the second part: 'the more' can't be infinite, and both the Arduino and the driver compete to set the ceiling. Drivers have a minimum pulse width; the ones I checked are all under 2 µs HIGH (plus 2 µs LOW), so a 4 µs period is a safe ceiling: 250kHz. Is it possible to get 250kHz on my Megas? I don't think so, but not because of the AVR itself. It's more a combination of different factors. If I could use hardware PWM pins, I might be able to do it; but I can't, so I need to flood the CPU with a ton of IRQs. What is the best speed for my AVRs? Evidence says "150kHz can be done" (Klipper).
So, to conclude: 'the more the better, in between 125kHz and 250kHz' is the current target. I'd really like to be able to answer Stefan's questions about required performance, but I simply don't know more than that.
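To put numbers on the microstepping multiplier and the driver ceiling, here is a short sketch. The 200 full steps per revolution and the 5 rev/s target in the usage note are assumptions picked for illustration, not measurements from my machine, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Step pulses per second needed for a given motor speed. */
static uint32_t required_step_hz(uint32_t full_steps_per_rev,
                                 uint32_t microsteps,
                                 uint32_t rev_per_sec) {
    return full_steps_per_rev * microsteps * rev_per_sec;
}

/* Highest pulse rate a driver accepts, from its minimum HIGH and LOW times. */
static uint32_t driver_ceiling_hz(uint32_t min_high_ns, uint32_t min_low_ns) {
    assert(min_high_ns + min_low_ns > 0);
    return 1000000000UL / (min_high_ns + min_low_ns);
}
```

With a 200-step motor at 5 rev/s, full steps need 1 kHz, 16 microsteps need 16 kHz and 256 microsteps need 256 kHz, which is already above the 250 kHz that `driver_ceiling_hz(2000, 2000)` returns for a 2 µs + 2 µs pulse.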

anichang:
So, at this point in time, the simple 'the more the better' is the approach.

This is all much too vague for me.

I wish you well with your project.

...R

Just for the record.

Reaching a higher pulse frequency can't be done on an 8-bit AVR. Below is an example simulating the code needed to receive, prepare and run a pulsing sequence on 16 arbitrary pins (8 pins for step, 8 pins for dir); it runs between the "start tick counter" and "stop tick counter" ASM functions, which count the number of CPU cycles spent. The first part initialises uart0 and sets up libc printf in order to print the result; inside the main loop there's a reference measurement (1000-cycle delay), then the real simulation code. The code doesn't use Wiring; it's plain avr-libc and ASM.

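#include <stdint.h>
#include <stdio.h>
#include <avr/io.h>
#include <util/delay.h>   // needs F_CPU (16000000UL here), usually set with -DF_CPU
// u0_init(), u0_stdout and the *TickCounter()/GetTicks() helpers are my own (not shown)

uint8_t pins[16];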
uint32_t tmicros = 123123;
uint32_t tahead[4];
uint8_t buffer[4][64][16];
uint8_t buffer_len = 0;
uint8_t buffer_read = 0;
uint8_t buffer_pulse_len = 0;
uint8_t buffer_pulse_read = 0;
uint8_t pulsing = 0;
uint8_t semiperiod = 0;

int main(void) {
    u0_init();
    stdout = &u0_stdout;
    while (1) {
        // start
        StartTickCounter();
        // code to be tested
        __builtin_avr_delay_cycles(500);
        __builtin_avr_delay_cycles(500);
        // report
        StopTickCounter();
        printf("1000 cycles delay: %u\n", GetTicks());
        ResetTickCounter();

        //
        _delay_ms(1);

        // start
        StartTickCounter();
        // store incoming data, ie: 2 bytes, each bit 1 pin value
        buffer[0][0][0] = 1&1;
        buffer[0][0][1] = 2&2;
        buffer[0][0][2] = 3&3;
        buffer[0][0][3] = 4&4;
        buffer[0][0][4] = 5&5;
        buffer[0][0][5] = 6&6;
        buffer[0][0][6] = 7&7;
        buffer[0][0][7] = 8&8;
        buffer[0][0][8] = 9&9;
        buffer[0][0][9] = 10&10;
        buffer[0][0][10] = 11&11;
        buffer[0][0][11] = 12&12;
        buffer[0][0][12] = 13&13;
        buffer[0][0][13] = 14&14;
        buffer[0][0][14] = 15&15;
        buffer[0][0][15] = 16&16;
        // prepare
        if (tahead[0]<=tmicros) {
            if (pulsing==63) {  // wrap at the last valid index of buffer[0][...]
                pulsing = 0;
            } else {
                pulsing++;
                buffer_len--;
                buffer_read = 123;
                if (buffer_read == 123) buffer_read = 0;
            }
        }
        // one pulse (called 2 times: 1 HIGH, 1 LOW)
        if (semiperiod) {
            pins[0] = buffer[0][pulsing][0];
            pins[1] = buffer[0][pulsing][1];
            pins[2] = buffer[0][pulsing][2];
            pins[3] = buffer[0][pulsing][3];
            pins[4] = buffer[0][pulsing][4];
            pins[5] = buffer[0][pulsing][5];
            pins[6] = buffer[0][pulsing][6];
            pins[7] = buffer[0][pulsing][7];
            pins[8] = buffer[0][pulsing][8];
            pins[9] = buffer[0][pulsing][9];
            pins[10] = buffer[0][pulsing][10];
            pins[11] = buffer[0][pulsing][11];
            pins[12] = buffer[0][pulsing][12];
            pins[13] = buffer[0][pulsing][13];
            pins[14] = buffer[0][pulsing][14];
            pins[15] = buffer[0][pulsing][15];
            semiperiod = 0;
        } else {
            pins[0] = 0;
            pins[1] = 0;
            pins[2] = 0;
            pins[3] = 0;
            pins[4] = 0;
            pins[5] = 0;
            pins[6] = 0;
            pins[7] = 0;
            pins[8] = 0;
            pins[9] = 0;
            pins[10] = 0;
            pins[11] = 0;
            pins[12] = 0;
            pins[13] = 0;
            pins[14] = 0;
            pins[15] = 0;
            semiperiod = 1;
        }
        // report
        StopTickCounter();
        printf("result: %u\n", GetTicks());
        ResetTickCounter();

        // slow down to read on console
        _delay_ms(500);
    }
}

Results in:

11:37:30.763 -> 1000 cycles delay: 1002 
11:37:30.763 -> result: 166 
11:37:31.260 -> 1000 cycles delay: 1002 
11:37:31.260 -> result: 128

294 cycles per pulse (166 in HIGH time, 128 in LOW time) is 294/16 = 18.375 µs minimum pulse period (with no spare CPU time). If ~20 µs of CPU time goes just into producing the pulse sequence, another 2, 4, 8... microseconds may be needed to run the glue code; even more for complex time manipulations (ex: for run-later). Refactoring didn't help much. Going hard on macros and C++ templates would probably give better results, but not by enough.

In other attempts I was able to produce simpler pulsing code, bringing the bare-minimum pulsing CPU time down to 8 µs and the total CPU time to 12 µs (ie: 83kHz best speed), at the cost of increasing the amount of data to be moved from host to MCU. Then I'd need to implement more code for accessories (ex: reporting sensors and setting some flags on other pins), error detection (ex: crc) and correction (ex: retransmit)... and the speed would go down, probably to the same level as other readily available firmwares (50kHz for grbl, 150kHz for Klipper, and custom solutions like the one posted by Robin2 in this thread).

Using SPI makes things slightly better on 1 MCU: the tight timing and high frequencies allow the input, prepare and run steps to be compacted, saving cycles and thereby reducing the minimum pulse period. But adding multiple MCUs to the equation brings me back to square one.

The whole point of diving into this topic was to remove specialized logic from the MCU, in order to have a general-purpose firmware to be used for multiple projects. In short: I didn't like files like 'stepper.c' in the existing printers' firmwares, and I didn't like seeing highly specialized serial protocols in them either. I'd prefer a more generalized approach using something like 'toggle-pin-one-and-trigger-pulse-on-pin-two' rather than 'stepper'. Increasing speed wasn't a design rule; I had assumed I'd get more speed by removing specialized logic. It turned out there's no extra logic to be removed, so no extra speed.

Touching the limits of this platform made it pretty clear that there's no point in spawning a whole new firmware. The options are three:

  • GRBL: awesome code, fairly simple serial protocol, but it isn't suited for multi-mcu (ie: no run-later timer code).
  • Klipper: messy code, cumbersome serial protocol, made for 'run-later' (ie: multi-mcu).
  • Marlin: messy code, no idea of the serial protocol, no multi-mcu. It is the most feature-rich of the three but I don't need all that stuff to be on the MCU.

Basically I have to choose:

  • to clean up Klipper's DECL_* set of macros, complex serial protocol, and weird build system; without losing speed.
  • to comfortably add run-later (multi-mcu) to GRBL; without losing speed.

Klipper supports multiple MCUs/boards. On 32-bit platforms it speeds up pretty well; it allows for higher microstepping rates. The only unacceptable detail is its code complexity; lurking in there made me feel like an elephant in a Swarovski shop. And that feeling, when investigating some code, usually isn't a good sign.

In case you are still wondering what's the fastest you can achieve on the UART port, there was a previous thread discussing this...

Old thread

In summary, you can technically go all the way up to 2 Mbps, but you'll be limited by the "polling" nature of USB (used by the USB CDC driver on your PC or by the USB-UART chip).

I was able to send and receive 1 Mbyte worth of data without any parity/frame/overrun errors, but it is nowhere near a raw 2 Mbps rate (USB "polling" slowed it down significantly).

I'll be testing it without USB later to see if a direct COM port on an old Linux server PC can achieve the raw 2 Mbps the hardware is capable of.
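For a rough sense of where these figures come from, here is a sketch of the arithmetic, assuming a 16 MHz ATmega with the UART in double-speed mode (U2X set), where the baud rate is F_CPU / (8 × (UBRR + 1)). The function names are invented; each byte costs 10 bits on the wire with 8N1 framing, or 11 with two stop bits.

```c
#include <assert.h>
#include <stdint.h>

/* Baud rate of an AVR UART in asynchronous double-speed mode (U2X = 1). */
static uint32_t avr_baud_u2x(uint32_t f_cpu_hz, uint16_t ubrr) {
    return f_cpu_hz / (8UL * ((uint32_t)ubrr + 1UL));
}

/* Best-case payload bytes per second, given the number of bits each byte
   occupies on the wire (start + data + parity + stop bits). */
static uint32_t payload_bytes_per_sec(uint32_t baud, uint32_t bits_per_frame) {
    assert(bits_per_frame > 0);
    return baud / bits_per_frame;
}
```

UBRR values 0, 1 and 3 give 2 Mbps, 1 Mbps and 500 kbps; at 2 Mbps with 8N1 the best case is 200,000 bytes per second before any USB latency or protocol overhead is taken into account.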

hzrnbgy:
In case you are still wondering what's the fastest you can achieve on the UART port, there was a previous thread discussing this...

Old thread

In summary, you can technically go all the way up to 2 Mbps, but you'll be limited by the "polling" nature of USB (used by the USB CDC driver on your PC or by the USB-UART chip).

I was able to send and receive 1 Mbyte worth of data without any parity/frame/overrun errors, but it is nowhere near a raw 2 Mbps rate (USB "polling" slowed it down significantly).

I'll be testing it without USB later to see if a direct COM port on an old Linux server PC can achieve the raw 2 Mbps the hardware is capable of.

Thanks for reporting your efforts. You confirmed the findings I posted at the beginning of this thread. On current 8-bit AVR boards we are stuck at 500 kbit/s (or 4-8 Mbit/s over SPI) and 20+ µs pulsing periods. Better performance is possible by exploiting the other hardware goodies (ex: hw pwm), whenever it's possible to do so.
You can get 1 Mbit/s on uart0 but I found sporadic errors during my tests. It isn't reliable without enhanced framing and error detection/correction; in that case it becomes mandatory to burn some CPU time on COBS (byte framing, which relies on good bit framing... so you need to use 2 stop bits) and parity/crc, then retransmit lost data. Is it worth it?
And similarly with 2 Mbit/s (once you bypass the USB middle-chip): is it worth it? We can get a steady 500 kbit/s using the tty-USB converter because everything is on PCB traces; once you use your own cabling, sporadic errors become more likely.
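To give an idea of the CPU work COBS implies, here is a minimal sketch of the standard encode/decode pair (Consistent Overhead Byte Stuffing). It is not taken from any of the firmwares discussed; the function names are my own, and the CRC and retransmit parts are left out. The encoded frame never contains 0x00, so a zero byte can be used as the frame delimiter on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes len bytes from in[] into out[]; returns the encoded length.
   out[] must have room for len + len / 254 + 1 bytes. */
static size_t cobs_encode(const uint8_t *in, size_t len, uint8_t *out) {
    size_t code_idx = 0;   /* where the current block's length code goes */
    size_t write_idx = 1;
    uint8_t code = 1;      /* distance to the next zero (or block end)   */

    assert(in && out);
    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0) {
            out[code_idx] = code;       /* close the block at this zero */
            code_idx = write_idx++;
            code = 1;
        } else {
            out[write_idx++] = in[i];
            code++;
            if (code == 0xFF) {         /* block full: 254 non-zero bytes */
                out[code_idx] = code;
                code_idx = write_idx++;
                code = 1;
            }
        }
    }
    out[code_idx] = code;
    return write_idx;
}

/* Decodes one frame (without the 0x00 delimiter).
   Returns 0 on success, -1 if the frame is malformed. */
static int cobs_decode(const uint8_t *in, size_t len,
                       uint8_t *out, size_t *out_len) {
    size_t r = 0, w = 0;

    assert(in && out && out_len);
    while (r < len) {
        uint8_t code = in[r++];
        if (code == 0 || r + (size_t)(code - 1) > len) return -1;
        for (uint8_t i = 1; i < code; i++) out[w++] = in[r++];
        /* a code below 0xFF stands for a zero, except at the frame end */
        if (code != 0xFF && r < len) out[w++] = 0;
    }
    *out_len = w;
    return 0;
}
```

For the payload 11 22 00 33 the encoder produces 03 11 22 02 33: one extra byte for up to 254 bytes of data, plus the delimiter. The per-byte loop is cheap on a PC but not free on a 16 MHz AVR that is also counting cycles for step pulses.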

The middle-chip (FTDI / CH340) on some of these boards is a bottleneck in several ways: it introduces the same 'buffer bloat' problem found in modern networking stacks, it rules out the USB HID protocol, and all of this sits on top of the USB protocol, which adds latencies of its own. Ex: with USB 2.0 you can get a 125 µs delay per packet, worst case; whenever your periods are shorter than 125 µs, this forces you to increase the buffer (ie: more bufferbloat) to cover those worst cases. It's one of the reasons why real-time USB audio has been pretty bad since USB took over from other standards...
USB 3.0 solved this problem and the upcoming USB 4.0 will definitively remove this issue. In the very near future.
But in the near future there won't be 8-bit AVR any more. GCC already EOL'ed AVR code; v10 (or v11, can't remember) will be the last AVR compiler. The main distros (debian, ubuntu, gentoo, arch) just switched from v5.4 to v9. Distros could default to v10 soon, and we'll probably be able to build/download and use v10 for the next 10 years. So it isn't urgent to replace the AVR hardware, but it's highly advisable not to buy it any more for new projects. In the Arduino world: Arduino Zero > Arduino Uno. No idea why they didn't introduce something to replace the Mega as well (or maybe they did and I didn't see it on their products page).

Given the EOL'ing of the compiler, is it wise to spend more time on those AVRs in a desperate attempt to squeeze out more performance than has already been found in the past 15 years? I don't think so: 1) it's unlikely to happen (because thousands of people have been looking for it for the past 15 years!!!), 2) what for? All those boards will be in the trash can within 10 years, together with the Z80s, PICs and so on. Even the ATtiny... an ESP32 in full power saving has a similar power footprint.

One side note: I re-evaluated the Arduino framework. It's much more efficient than it used to be 10+ years ago. Not much code efficiency is sacrificed in the name of comfort for newbies. I mean, it's comfy and efficient; at least for the basic stuff I've used (GPIOs, UART, Timers). The only things to avoid are micros() and millis() !!!