Tips to increase Serial throughput (or alike)?

Hi everybody,

I’d like to use my 16MHz Arduino Mega as a device to precisely control 5-8 steppers from my host (an RPi-like sbc), as fast as possible, using some form of software pulsing (ie: interrupt triggered, as I can’t use the steppers on the hardware driven PWM pins). I’ve been struggling with this topic for 6 months and don’t have a clue yet; it could easily be solved using a more capable board, but at this point it has become a challenge.

The definition of ‘as fast as possible’ is given by the stepper drivers: 2µs minimum pulse width (+2µs low). That means the theoretical maximum is 250 pulses per millisecond (250kHz) and 64 cpu cycles (16 cycles/µs * 4µs) per pulse, while 8k of ram gives a few milliseconds of buffer capacity. If we place cpu, ram and throughput on the sides of a triangle, we get the ‘constraints’ picture.

I’ve tried more than a dozen different approaches (starting from Firmata, scrapping everything and rewriting again; using variable-length integers, or attempting differential signaling) and couldn’t find one that gets even close to 100kHz. Cpu and ram are ok; I can produce a very precise trigger using a timer interrupt and 20-30 cycles (ISR in/out included) to toggle 8 pins, using a few different approaches depending on the data format I’m receiving from the serial port. The problem is I can’t keep the buffer full because the uart is too slow.

First of all, I can’t get an error-free byte stream over 500kbps. Even bypassing both Wiring/Arduino and Python I get underachieving performance at 1Mbps, and pretty bad results at 2Mbps. Is that the expected result, or am I doing something wrong?

This is the mcu code I’m using to test the speed:

#include <inttypes.h>

#include <avr/io.h>
#include <avr/interrupt.h>
#include <util/delay.h>

#undef BAUD
#define BAUD 500000
#undef BAUD_TOL
#define BAUD_TOL 0 // 0% error tolerance
#define USE_2X   0
#include <util/setbaud.h>

ISR(USART0_RX_vect) {
}

ISR(USART0_UDRE_vect) {
}

uint8_t bindex = 0;
int main(void) {
    // baudrate
    UBRR0H = UBRRH_VALUE;
    UBRR0L = UBRRL_VALUE;
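    // clear the status register, then apply the double-speed choice made by <util/setbaud.h>
    // (at 16MHz, 2Mbps is only reachable with U2X0 set; normal mode tops out at 1Mbps)
    UCSR0A = 0x00;
#if USE_2X
    UCSR0A |= (1<<U2X0);
#endif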
    // enable TX
    UCSR0B |= (1<<TXEN0);
    // enable data-register-empty (tx) IRQ
    UCSR0B |= (1<<UDRIE0);
    //
    //UCSR0C |= (1<<UPM01)|(1<<UCSZ01)|(1<<UCSZ00); // set 8-bit character size with even parity
    UCSR0C = (1<<UCSZ01)|(1<<UCSZ00); // set 8-bit character size

    while(1) {
        if (bindex<250) {
            loop_until_bit_is_set(UCSR0A, UDRE0);
            UDR0 = 123;
            bindex++;
        } else {
            loop_until_bit_is_set(UCSR0A, UDRE0);
            UDR0 = 0;
            bindex = 0;
        }
    }
}

and host side main.c (Linux + a small C++ uart library without delays in code):

#include <iostream> 
#include <chrono> 
#include "include/CppLinuxSerial/SerialPort.hpp" 
 
using namespace std::chrono; 
using namespace mn::CppLinuxSerial; 
 
int main() { 
    SerialPort serialPort("/dev/ttyUSB0", 500000); 
    serialPort.SetTimeout(-1); // Block when reading until any data is received 
    serialPort.Open(); 
 
    long totalBytes = 0; // running count of received bytes (renamed to avoid shadowing strlen())
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    while(totalBytes<1000000) {
        // Read some data back (will block until at least 1 byte is received due to the SetTimeout(-1) call above)
        std::string readData;
        serialPort.Read(readData);
        totalBytes = totalBytes + readData.length();
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    serialPort.Close();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << totalBytes << " bytes in " << time_span.count() << ", " << (totalBytes/time_span.count()) << "bytes per second";
    std::cout << std::endl; 
}

At 500kbps (50 bytes per millisecond):

1000000 bytes in 19.9975, 50006.1bytes per second

Max, steady, error free.

At 1Mbps (100 bytes per millisecond):

1000000 bytes in 10.4159, 96007.3bytes per second

Fair, but not perfect.

At 2Mbps (it should be 200 bytes per millisecond):

1000000 bytes in 8.29066, 120618bytes per second

Pretty broken.

I don’t see how I can fix 2Mbps; 1Mbps could possibly work by introducing some error correction and retransmits, but I’d like to avoid adding cpu overhead at this point. And I really need a minimum of 5-8 bits per pulse in order to control 5-8 steppers (ie: 250 bytes per millisecond to achieve 250kHz pulsing, theoretically); I don’t know any way to use fewer bits.

So, at the end of the day, I can’t increase the uart speed and I don’t know of any compact representation of pin states that would let me squeeze 8 steppers * 250 pulses into the less than 50 bytes per millisecond I can get at 500kbps.

Any prior art? A good idea? Tips?

How about I2C or SPI?

Are you trying to receive data on the Arduino serial and translate that bit stream to control stepper motors? Or the other way around? Your Arduino code shows the Arduino serial SENDING data out

Or is this more like it?

SBC >> serial data >> Arduino >> stepper control on 8 pins

wildbill:
How about I2C or SPI?

I've considered those as a Plan B. The host has both I2C and SPI but wiring (4x Megas) would be a pain. I'm searching for some coding black magic rather than wiring complexity ... some kind of compression ... even using PROGMEM to memorize tables or common sequences and then address long sequences using a single address byte. But I'm not an algorithm expert. The attempts I wrote failed miserably on cpu, on ram, or both. Complexity rises fast, and I've already used 50% of cpu power to produce the pulse trigger. I'm a bit exhausted.

hzrnbgy:
Are you trying to receive data on the Arduino serial and translate that bit stream to control stepper motors? Or the other way around? Your Arduino code shows the Arduino serial SENDING data out

Or is this more like it?

SBC >> serial data >> Arduino >> stepper control on 8 pins

You guessed right. I need to stream the data from the SBC (fast write) to Arduino (fast read); the example code does the opposite because it is just a quick and dirty test I wrote to be sure of my results before posting on the forum. It is the simplest thing I could think of to measure serial performance; the data direction is inverted because that reduces the arduino code to the very minimum, and the powerful host can do the measuring without introducing latency.

anichang:
You guessed right. I need to stream the data from the SBC (fast write) to Arduino (fast read);

That’s altogether the wrong approach IMHO.

You should send data to the Arduino that tells it how many steps to move and, if necessary, the speed or the total time for the move. Then let the Arduino figure out the individual step pulses and step timing. Something like <-300, 50> meaning move 300 steps anticlockwise at 50 steps per second.

…R

anichang:
I’m searching for some coding black magic rather than wiring complexity … some kind of compression … even using PROGMEM to memorize tables or common sequences and then address long sequences using a single address byte.

That sounds like a promising approach, but it depends on how much you can predetermine sequences. Another thing to consider is that it sounds like you’re trying to send a command for every step. Could you just instruct the mega to do fifty steps on stepper3?

Robin2:
That’s altogether the wrong approach IMHO.

You should send data to the Arduino that tells it how many steps to move and, if necessary, the speed or the total time for the move. Then let the Arduino figure out the individual step pulses and step timing. Something like <-300, 50> meaning move 300 steps anticlockwise at 50 steps per second.

…R

Well, the example you made is a 3-4 byte command covering less than a millisecond (250 pulses, ideally) for a single stepper. Eight steppers need 24-32 bytes to instruct a fraction of the needed pulses. It works if the number of steps per command is high (ie: <-300, 198>); it fails if I need to change direction or need just 2 pulses… I tried that :)

What are the shortest "runs" of your stepper-motors they are doing?

Is it as short as three steps, then stop, reverse direction, do five steps, reverse direction again and move two steps?

Or is the minimum they run into one direction a few hundreds of steps?

What is the max frequency your steppers are running?

Do all steppers have to run on their own speed?

These values determine whether a set-and-forget interrupt-driven approach can be used.

The basic principle is to set up timers to cause interrupts at regular intervals; each time an interrupt fires, a step is created or not.

You set up a counter variable to count down to zero. As long as the variable is above zero, the ISR creates a step pulse and decrements it; once the counter reaches zero it just stops creating pulses.

So you just set up the counter variable and a "run" flag, and creating the pulses is done by the ISR, which automatically stops when the counter reaches zero.

I don't know how many different timer interrupts you can set up on a Mega 2560.
Depending on the maximum frequencies you want to use, it might be possible to make the timer interrupt very fast and use dividers for slower stepping.

If each motor has to run several hundred steps as a minimum this gives you time for setting up the next "run" which means the serial receiving doesn't have to be so fast.

Sending a lot of bytes over the serial interface creates interrupts too. So you might have timing conflicts if you want to receive a lot of bytes very fast while creating 8 different step-pulse frequencies that change rapidly.

Do you have a real thing you want to drive, or is this just an explore-and-push-the-limits test?

Most stepper motors can run at up to 2000 or 3000 rpm, not more.
If you use a driver that is able to produce 1/256 microstepping, this would mean a step frequency of 3000 / 60 * 256 = 12.8 kHz, which is a period of 78 microseconds.
3000 rpm is too fast to start from 0 rpm. This means acceleration and deceleration must be included.
(Not sure how to do that.)

So the ISR would have to run every 35 microseconds or so. Not sure if this is realistic.

Again, if this is a test or a challenge (how many and how fast can I drive 5-8 stepper motors), go explore the limits.

If you really have to drive 8 stepper motors that fast, I would use a much faster microcontroller like the Teensy 4.1
(600 MHz, $30) or a Teensy 4.0 ($20).

Simply driving at the shortest step pulses the stepper driver can cope with is way above what a stepper motor can do mechanically, because of the inertia of the rotor.

best regards Stefan

Again:

Is this just a challenge to explore the limits of an Arduino Mega?

or

Do you want to realise a real-world application?

If the second, you will have to define the maximum requirements you need.

best regards Stefan

wildbill:
That sounds like a promising approach, but it depends on how much you can predetermine sequences. Another thing to consider is that it sounds like you're trying to send a command for every step. Could you just instruct the mega to do fifty steps on stepper3?

I can predetermine hours of sequence, and then stream the instructions every millisecond ... no problem. The test host currently is an AMD Threadripper 32 cores @ 4GHz. The final SBC might be an RPi or ODROID or alike; in any case powerful enough to do any kind of math.
The problem with a table-based approach is ... the size of the table. I have 8 steppers multiplied by 250 pulses per millisecond. That's a 2000-bit sequence (2^2000 possible sequences). I have ~45 bytes (at 500kbps) to address the right sequence; it means that I can split the 2kbit sequence into 45 chunks; this greatly reduces the number of possible combinations but ... still ... huge.

Tried this on mine and 2M seems to work on Putty
2M-8-even-1

#include <avr/io.h>
#include <avr/interrupt.h>

volatile uint8_t rcvd = 0;

ISR(USART_RX_vect)
{
	// receive data
	rcvd = UDR0;
}

int main(void)
{

	// disable global interrupts for now
	asm("CLI");

	// change system clock pre-scaler to 1, run at 16 MHz from the external crystal
	CLKPR = 1<<CLKPCE;
	CLKPR = 0<<CLKPCE | 0<<CLKPS3 | 0<<CLKPS2 | 0<<CLKPS1 | 0<<CLKPS0;


	// set baud rate (2M)
	UBRR0H = 0;
	UBRR0L = 0;

	// USART initialization with 16MHz system clock, 8-even-1
	UCSR0A = 1<<U2X0 | 0<<MPCM0;
	UCSR0B = 1<<RXCIE0 | 0<<TXCIE0 | 0<<UDRIE0 | 1<<RXEN0 | 1<<TXEN0 | 0<<UCSZ02;
	UCSR0C = 0<<UMSEL01 | 0<<UMSEL00 | 1<<UPM01 | 0<<UPM00 | 0<<USBS0 | 1<<UCSZ01 | 1<<UCSZ00 | 0<<UCPOL0;

	uint32_t count = 0xFFFFF;

	do
	{
		UDR0 = 'Z';
		while((UCSR0A & 1<<TXC0)==0);
		UCSR0A |= 1<<TXC0;
		count--;
	}
	while(count);

	while(1)
	{
		asm("NOP");
	}

	return 0;
}

What are these steppers actually doing?

hzrnbgy:
Tried this on mine and 2M seems to work on Putty
2M-8-even-1

You have enabled the U2X mode; there's a caveat somewhere in the datasheet. I can't remember where, but it says the U2X mode affects rx or tx (one of the two) and that it requires a precise clock source to avoid errors. I can't remember the exact text; I quickly dropped the idea of using U2X because the uart is symmetric, so I can't accept an 'error storm' in one direction in order to speed up the other.
And you are doing the opposite of my test setup (ie: I was sending from arduino to host, you are sending from host to arduino); this could explain why you see a good transmission @2Mbps and I see a bad one. But, again, if it works in one direction only, it doesn't help me much: what happens if the returning ACKs are faulty? I'd need to implement retransmissions on both sides, increasing complexity and cpu overhead on the arduino as well.
Am I right?

Depending on how accurate the crystal is, 2M on 16MHz should give you 0.0% error with U2X (on TX or RX). Not sure what you are saying about error when U2X is enabled. You should be okay with 2M if you have a decent crystal on your Arduino board

My test is sending from Arduino to host. I don't have a way to generate million bytes on my PC and send it over to Arduino via UART.

Maybe you can run the same test on your end, since you have a program on your host to detect errors in the serial data received from the Arduino

hzrnbgy:
Depending on how accurate the crystal is, 2M on 16MHz should give you 0.0% error with U2X (on TX or RX). Not sure what you are saying about error when U2X is enabled. You should be okay with 2M if you have a decent crystal on your Arduino board

My test is sending from Arduino to host. I don't have a way to generate million bytes on my PC and send it over to Arduino via UART.

Maybe you can run the same test on your end, since you have a program on your host to detect errors in the serial data received from the Arduino

Yep. At first glance I saw the recv ISR and guessed you were receiving; then I realized you were transmitting instead, the same direction as my test code. Are you sure there aren't errors and it's a real 2Mbps? In my test I get a lower bitrate on the receiving end even though the arduino is sending at 2Mbps...
I've found the paragraph (22.3.2 Double Speed Operation) in the 2560 datasheet:
"Setting this bit will reduce the divisor of the baud rate divider from 16 to 8, effectively doubling the transfer rate for asynchronous communication. Note however that the Receiver will in this case only use half the number of samples (reduced from 16 to 8) for data sampling and clock recovery, and therefore a more accurate baud rate setting and system clock are required when this mode is used. For the Transmitter, there are no downsides."
We are both using the transmitter on the arduino, so the clock source shouldn't be a problem for the test, but it could be once the direction is reversed, as in the real application: that needs the receiver on the arduino, and the receiver needs a good clock source.

Why not set up your ODROID or whatever with a realtime OS and do everything there?

So if you'd like to discuss maximised serial transfer rates - I'm out

anichang:
Well, the example you made is a 3-4 byte command covering less than a millisecond (250 pulses, ideally) for a single stepper. Eight steppers need 24-32 bytes to instruct a fraction of the needed pulses. It works if the number of steps per command is high (ie: <-300, 198>); it fails if I need to change direction or need just 2 pulses… I tried that :)

You need to describe the whole project you are trying to create.

It seems to me there are two choices - get the SBC to drive the stepper motors directly. Or move more of the “intelligence” into the Arduino.

The idea of sending single step pulses from microprocessorA to microprocessorB to a stepper driver seems pointless to me.

And I can’t imagine the circumstances in which it would be necessary to change your mind after 2 steps.

…R

If you have a simple PC program that can send 1 megabyte of data (at 2M-8-even-1) from PC to serial, I can gin up a quick test and detect/count parity/overrun/frame errors on the Arduino receive side using the error flags of the serial hardware

StefanL38:
Again:

Is this just a challenge to explore the limits of an Arduino Mega?

or

Do you want to realise a real-world application?

If the second, you will have to define the maximum requirements you need.

best regards Stefan

Hi Stefan, thanks for the overview about the real steppers math. I know nothing about steppers.
Real application(s): I need the software both to move an arm and to power my custom 3d printer. Basically I'm trying to make a general purpose firmware so that the host does all the math and then sends 'the story' to the arduino for timely writing on pins. Math on linux, realtime on arduino.
As I wrote in the first message, I could solve the issue using faster boards but I already have those 4 Megas (bought a long time ago and never used for real stuff) and I'd like to put them to good use. It's good stuff, and it would be a pity to trash it after it sat lifeless in a drawer for so long. Moreover, I try to avoid a proliferation of different platforms on my premises, to make my life easier; a few years ago I bought some avr, esp32 and rpi... and made it a rule to stick with them as long as possible.
Currently I'm going to migrate the AVRs to some modern 32bit alternatives (STM32, LPC, SAMD, ...); but I'd like to use up my AVRs before re-stocking.
Back to real issues about steppers mechanics: I know there are inertia issues and the need to account for acceleration, but I was hoping to be able to solve those issues at a later stage, by sending the right sequences from the host. Is it a bad idea?