Just for the record.
Reaching a higher pulse frequency can't be done on an 8-bit AVR. Below is an example simulating the code needed to receive, prepare and run a pulsing sequence on 16 arbitrary pins (8 pins for step, 8 pins for dir); it runs between the "start tick counter" and "stop tick counter" ASM functions, which count the number of CPU cycles spent. The first part initializes uart0 and sets up libc printf in order to print the result; inside the main loop there's a reference measurement (a 1000-cycle delay), then the real simulation code. The code doesn't use Wiring; it's plain avr-libc and ASM.
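#include <stdint.h>
#include <stdio.h>
#include <util/delay.h>  // _delay_ms(); expects F_CPU to be defined
// u0_init(), u0_stdout and the *TickCounter()/GetTicks() helpers (uart0 setup and ASM cycle counter) are not shown here.
uint8_t pins[16];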
uint32_t tmicros = 123123;
uint32_t tahead[4];
uint8_t buffer[4][64][16];
uint8_t buffer_len = 0;
uint8_t buffer_read = 0;
uint8_t buffer_pulse_len = 0;
uint8_t buffer_pulse_read = 0;
uint8_t pulsing = 0;
uint8_t semiperiod = 0;
int main(void) {
  u0_init();
  stdout = &u0_stdout;
  while (1) {
    // start
    StartTickCounter();
    // code to be tested
    __builtin_avr_delay_cycles(500);
    __builtin_avr_delay_cycles(500);
    // report
    StopTickCounter();
    printf("1000 cycles delay: %u\n", GetTicks());
    ResetTickCounter();
    //
    _delay_ms(1);
    // start
    StartTickCounter();
    // store incoming data, ie: 2 bytes, each bit 1 pin value
    buffer[0][0][0] = 1&1;
    buffer[0][0][1] = 2&2;
    buffer[0][0][2] = 3&3;
    buffer[0][0][3] = 4&4;
    buffer[0][0][4] = 5&5;
    buffer[0][0][5] = 6&6;
    buffer[0][0][6] = 7&7;
    buffer[0][0][7] = 8&8;
    buffer[0][0][8] = 9&9;
    buffer[0][0][9] = 10&10;
    buffer[0][0][10] = 11&11;
    buffer[0][0][11] = 12&12;
    buffer[0][0][12] = 13&13;
    buffer[0][0][13] = 14&14;
    buffer[0][0][14] = 15&15;
    buffer[0][0][15] = 16&16;
    // prepare
    if (tahead[0]<=tmicros) {
      if (pulsing>=63) {  // wrap here: buffer[0][64] would be out of bounds
        pulsing = 0;
      } else {
        pulsing++;
        buffer_len--;
        buffer_read = 123;
        if (buffer_read == 123) buffer_read = 0;
      }
    }
    // one pulse (called 2 times: 1 HIGH, 1 LOW)
    if (semiperiod) {
      pins[0] = buffer[0][pulsing][0];
      pins[1] = buffer[0][pulsing][1];
      pins[2] = buffer[0][pulsing][2];
      pins[3] = buffer[0][pulsing][3];
      pins[4] = buffer[0][pulsing][4];
      pins[5] = buffer[0][pulsing][5];
      pins[6] = buffer[0][pulsing][6];
      pins[7] = buffer[0][pulsing][7];
      pins[8] = buffer[0][pulsing][8];
      pins[9] = buffer[0][pulsing][9];
      pins[10] = buffer[0][pulsing][10];
      pins[11] = buffer[0][pulsing][11];
      pins[12] = buffer[0][pulsing][12];
      pins[13] = buffer[0][pulsing][13];
      pins[14] = buffer[0][pulsing][14];
      pins[15] = buffer[0][pulsing][15];
      semiperiod = 0;
    } else {
      pins[0] = 0;
      pins[1] = 0;
      pins[2] = 0;
      pins[3] = 0;
      pins[4] = 0;
      pins[5] = 0;
      pins[6] = 0;
      pins[7] = 0;
      pins[8] = 0;
      pins[9] = 0;
      pins[10] = 0;
      pins[11] = 0;
      pins[12] = 0;
      pins[13] = 0;
      pins[14] = 0;
      pins[15] = 0;
      semiperiod = 1;
    }
    // report
    StopTickCounter();
    printf("result: %u\n", GetTicks());
    ResetTickCounter();
    // slow down to read on console
    _delay_ms(500);
  }
}
Results in:
11:37:30.763 -> 1000 cycles delay: 1002
11:37:30.763 -> result: 166
11:37:31.260 -> 1000 cycles delay: 1002
11:37:31.260 -> result: 128
294 cycles per pulse (166 in the HIGH half, 128 in the LOW half) is 294/16 = 18.375 µs minimum pulse period (16 cycles per µs, i.e. a 16 MHz clock, and no spare CPU time). If ~20 µs of CPU time are used just to produce the pulse sequence, another 2, 4, 8 or more microseconds may be needed to run the glue code; even more in the case of complex time manipulations (e.g. for run-later). Refactoring didn't help much. Going hard on macros and C++ templates could probably give better results, but not enough.
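To make the arithmetic above easy to redo for other cycle counts, here is a small sketch in plain C. The function names and the `F_CPU_HZ` constant are mine, and the 16 MHz clock is an assumption taken from the 294/16 division; adjust it for other boards.

```c
#include <stdint.h>

/* Assumption: 16 MHz CPU clock, as implied by the 294/16 division. */
#define F_CPU_HZ 16000000UL

/* Convert a CPU cycle count to nanoseconds, using integer math only. */
uint32_t cycles_to_ns(uint32_t cycles)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / F_CPU_HZ);
}

/* Highest pulse frequency in Hz (rounded down) if one full pulse costs
   `cycles` CPU cycles and the CPU does nothing else. */
uint32_t max_pulse_hz(uint32_t cycles)
{
    return (uint32_t)(F_CPU_HZ / cycles);
}
```

With the measured values, `cycles_to_ns(166 + 128)` gives 18375 ns and `max_pulse_hz(294)` gives 54421 Hz, which is the ceiling before any glue code is added.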
In other attempts I've been able to produce simpler pulsing code, taking the bare minimum pulsing CPU time down to 8 µs, with a total CPU time of 12 µs (i.e. 83 kHz best speed), at the cost of increasing the amount of data to be moved from host to MCU. Then I'd still need to implement more code for accessories (e.g. reporting sensors and setting some flags on other pins), error detection (e.g. CRC) and correction (e.g. retransmit)... and the speed would go down, probably to the same results as other readily available firmware (50 kHz for grbl, 150 kHz for Klipper, and custom solutions like the one posted by Robin2 in this thread).
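As an illustration of the kind of per-frame work that error detection adds, here is a minimal bitwise CRC-8 sketch. I never settled on a specific CRC, so the polynomial (0x07, initial value 0, no final XOR) and the frame layout (payload followed by one CRC byte) are assumptions for the example only. The inner loop runs 8 times per byte, which shows how quickly cycles add up on top of the pulsing code.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-8 with polynomial 0x07 (x^8 + x^2 + x + 1), initial value 0x00,
   MSB first, no final XOR. Chosen for illustration only. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (uint8_t bit = 0; bit < 8; bit++) {
            if (crc & 0x80)
                crc = (uint8_t)((crc << 1) ^ 0x07);
            else
                crc = (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical frame layout: payload bytes followed by one CRC byte.
   With this CRC variant, running the CRC over payload + CRC yields 0
   when the frame is intact. */
int frame_is_valid(const uint8_t *frame, size_t len)
{
    return len >= 2 && crc8(frame, len) == 0;
}
```

A table-driven version would trade 256 bytes of flash for far fewer cycles per byte, but either way the check has to fit in the time left between pulses.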
Using SPI makes things slightly better on 1 MCU, because its tight timing and high clock frequencies allow the input, prepare and run steps to be compacted; that saves cycles, so the minimum pulse period can be reduced. But adding multiple MCUs to the equation brings me back to square one.
The whole point for me to dive into this topic was to remove specialized logic from the MCU, in order to have a general purpose firmware to be used for multiple projects. In short: I didn't like files like 'stepper.c' in the existing printers' firmware, and I didn't like to see highly specialized serial protocols there either. I'd prefer a more generalized approach using something like 'toggle-pin-one-and-trigger-pulse-on-pin-two' rather than 'stepper'. Increasing speed wasn't a design rule; I had assumed I would get more speed by removing specialized logic. It turned out there's no extra logic to be removed, so no extra speed.
Having touched the limits of this platform, it became pretty clear that there's no point in spawning a whole new firmware. There are three options:
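To show what I mean by 'toggle-pin-one-and-trigger-pulse-on-pin-two', here is a rough sketch run against a simulated pin array, like the `pins[]` array in the benchmark. The struct, the function names and the edge counter are all hypothetical; none of this comes from an existing firmware.

```c
#include <stdint.h>

#define PIN_COUNT 16

/* Hypothetical generic command: nothing in it knows about steppers. */
typedef struct {
    uint8_t toggle_pin;  /* pin whose level is inverted (a driver's "dir") */
    uint8_t pulse_pin;   /* pin that gets one HIGH/LOW pulse (a driver's "step") */
} pin_cmd_t;

uint8_t pins[PIN_COUNT];           /* simulated outputs */
uint16_t rising_edges[PIN_COUNT];  /* instrumentation for the simulation only */

/* Write a level to a simulated pin and count LOW-to-HIGH transitions. */
void pin_write(uint8_t pin, uint8_t level)
{
    if (pin >= PIN_COUNT) return;
    if (level && !pins[pin]) rising_edges[pin]++;
    pins[pin] = level ? 1 : 0;
}

/* Execute one generic command; out-of-range pins are ignored. */
void pin_cmd_run(const pin_cmd_t *cmd)
{
    if (cmd->toggle_pin >= PIN_COUNT || cmd->pulse_pin >= PIN_COUNT) return;
    pin_write(cmd->toggle_pin, !pins[cmd->toggle_pin]);
    pin_write(cmd->pulse_pin, 1);
    pin_write(cmd->pulse_pin, 0);
}
```

The host decides which pins mean "dir" and "step"; the MCU only sees pin numbers. On real hardware `pin_write()` would touch a PORT register instead of an array.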
- GRBL: awesome code, fairly simple serial protocol, but it isn't suited for multi-mcu (i.e. no run-later timer code).
- Klipper: messy code, cumbersome serial protocol, made for 'run-later' (i.e. multi-mcu).
- Marlin: messy code, no idea of the serial protocol, no multi-mcu. It is the most feature rich of the three, but I don't need all that stuff to be on the MCU.
Basically I have to choose:
- to clean up Klipper's DECL_* set of macros, complex serial protocol, and weird build system, without losing speed.
- to comfortably add run-later (multi-mcu) to GRBL, without losing speed.
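For anyone unfamiliar with 'run-later': the host sends a command together with the time at which it must run, and the MCU executes it once its own clock reaches that time. Below is a minimal sketch against simulated pins. The names are hypothetical and this is not how Klipper or GRBL implement it. Unlike the plain `tahead[0]<=tmicros` comparison in my benchmark, the check here keeps working when the 32-bit counter wraps around.

```c
#include <stdint.h>

#define SIM_PIN_COUNT 16

uint8_t sim_pins[SIM_PIN_COUNT];  /* simulated outputs */

/* Hypothetical timed command: set `pin` to `level` at tick `when`. */
typedef struct {
    uint32_t when;
    uint8_t pin;
    uint8_t level;
    uint8_t pending;  /* 1 until the command has been executed */
} timed_cmd_t;

/* Wrap-safe "now >= when": valid while the two values are less than
   2^31 ticks apart. Relies on two's complement conversion (true for avr-gcc). */
int is_due(uint32_t now, uint32_t when)
{
    return (int32_t)(now - when) >= 0;
}

/* Execute the command if it is pending and due; return 1 if it ran. */
int run_if_due(timed_cmd_t *cmd, uint32_t now)
{
    if (!cmd->pending || cmd->pin >= SIM_PIN_COUNT) return 0;
    if (!is_due(now, cmd->when)) return 0;
    sim_pins[cmd->pin] = cmd->level ? 1 : 0;
    cmd->pending = 0;
    return 1;
}
```

With several MCUs, each one runs this against its own clock, so the host also has to keep those clocks in sync; that is the part that costs the extra code.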
Klipper supports multiple MCUs/boards. On 32-bit platforms it speeds up pretty well and allows for higher microstepping rates. The only unacceptable detail is its code complexity; poking around in there made me feel like an elephant in a Swarovski shop. And that feeling, when investigating some code, usually isn't a good sign.