Random errors with ATmega328p and the new bootloader at 115200 baud

The Arduino bootloader uses a specific baud rate for communication between the bootloader and the PC. The current baud rate used is 115200 baud for uploading programs. Previous bootloaders used a baud rate of 57600 baud. This baud rate is set as part of the bootloader code. When you reset the Arduino, the bootloader sets the UART to the known upload baud rate, and waits briefly for the PC to begin uploading. If an upload does not begin, the bootloader jumps to the currently loaded application program.


THE PROBLEM

The new bootloader, present on all the latest Arduino boards, writes the firmware faster. This is great and I really like it, but sometimes I got errors during the upload phase.

Sometimes I even got two consecutive errors and other times I ran hundreds of compilations and writes without errors. So in the past, like everyone else, I blamed low-quality boards, USB cables too long, too much speed, defects in the USB hubs, defects in the FTDI and CH340 drivers, etc.

Lately I have been working on the Theremino_Tester, a very demanding sketch:
" Applicazioni varie | theremino "

I had chosen 115200 as the communication speed, to conform to the new Arduino BootLoader practice already in use. But soon I started to notice errors in the serial communication. On some PCs, they never occurred, but on others they were so frequent as to be annoying.

And since the Theremino_Tester is a very useful and interesting project, I decided to seriously investigate the problem.

And I discovered that errors only occur with 115200 bps speed, and counterintuitively, the errors disappear when the speed is increased to 250k, 500k or 1M bps.

The explanation, once found, is simple. On the ATmega328p with a 16 MHz clock, a nominal speed of 115200 has a deviation of 3.5%, while 250k, 500k and 1M have no deviation at all (0%). You might think that 3.5% is not much, but the timing error accumulates bit after bit, so by the time you get to the stop bit (the tenth bit of the frame) the drift has already reached about 35% of a bit period. That is a figure that all the texts on serial communication consider unacceptable.
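These percentages can be checked with the standard AVR USART formula, baud = f_osc / (N × (UBRR + 1)), where N is 16 in normal mode and 8 in double-speed mode (U2X0 set). The following is a minimal sketch in plain C++, meant to be compiled on a PC rather than on the board. The function names are illustrative, and it does not try to determine which mode a particular bootloader or core build actually selects.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Divisor register value (UBRR) closest to the requested baud rate.
// The USART takes 16 samples per bit in normal mode, 8 in double-speed mode.
uint32_t ubrrFor(uint32_t fCpu, uint32_t baud, bool doubleSpeed) {
    assert(baud > 0);
    const double samplesPerBit = doubleSpeed ? 8.0 : 16.0;
    const double exact = static_cast<double>(fCpu) / (samplesPerBit * baud) - 1.0;
    return static_cast<uint32_t>(std::lround(exact));
}

// Baud rate the hardware really produces for a given integer divisor.
double actualBaud(uint32_t fCpu, uint32_t ubrr, bool doubleSpeed) {
    const double samplesPerBit = doubleSpeed ? 8.0 : 16.0;
    return static_cast<double>(fCpu) / (samplesPerBit * (ubrr + 1.0));
}

// Deviation of the real baud rate from the nominal one, in percent.
// Negative means the AVR runs slower than nominal.
double baudErrorPercent(uint32_t fCpu, uint32_t baud, bool doubleSpeed) {
    const uint32_t ubrr = ubrrFor(fCpu, baud, doubleSpeed);
    return (actualBaud(fCpu, ubrr, doubleSpeed) / baud - 1.0) * 100.0;
}
```

Under these assumptions, baudErrorPercent(16000000, 115200, false) comes out at about -3.5 and the double-speed variant at about +2.1, while 250000, 500000 and 1000000 give exactly 0 because 16 MHz divides evenly into them.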

In the following messages, you will find detailed tables and explanations that demonstrate my thesis.


SOME WEB MESSAGES

...most of the arduino communications is between the arduino and the PC via that serial/usb chip on the board, so even if the bitrates are FAR off, you'll still get reliable communications as long as the serial/usb chip (on uno, ALSO an AVR running at 16MHz) both err in the same direction. (some of the problems that have occurred have been because the 8u2 and the 328 used DIFFERENT algorithms for picking the divisors, so that one ran +3% and the other ran -2.8%, which is not so good. (and while 115200bps is a common "fastest" speed for PCs, it's a particularly bad speed for a 16MHz AVR. Sigh.)
" How does arduino handle usart with 16MHz? - #8 by westfw "

..16MHz (default on most arduino) gives 3.7% error, which usually is too much for the other side to work without error correction.
" 115200 baud problems - #5 by mellis "
" https://wormfood.net/avrbaudcalc.php "

.... well, like dgtl said, there's 3.5% error. If the thing you are connecting to is exactly correct, or slightly incorrect in the same direction, it will probably work. You can get a -2.1% error by using the "double speed mode" (U2X0 in UCSR0A). Note that this error is in the opposite direction of the normal calculation; I think you can be almost certain of errors if you have one AVR set one way (-2.1%), and the other set the other way (+3.5%) (5.6% total error.)
....
FTDI states in appnote, that the baud rate error should be below 3%. So running AVR at 16MHz connected to FTDI is outside allowed tolerances on both sides (assuming FTDI is spot on; they have fractional baud rate divisor so the error should be much lower but I did not care to calculate the actual numbers). Even if it seems to work at first, you can not be sure that in the whole temperature range and by using randomly selected chips at production it would still work.
....
Better select another baud rate!
Why not 250k or 500k with a 0% error?
....
" UART 115200 on ATmega328p@16Mz - Page 1 "
" https://www.eevblog.com/forum/microcontrollers/uart-115200-on-atmega328p16mz/?action=dlattach;attach=127519;image "


CONCLUSIONS

Choosing a 115200 baud rate for the new 328p bootloader was a really unfortunate choice. To make matters worse, it doesn't create enough errors to be noticeable. If the errors were more frequent, the problem would be obvious and it would have been solved quickly. Instead, the errors are so rare that many people believe everything is ok.

But it's not "ok": the new 328p bootloader is working at the limit of what is acceptable, and in some extreme cases the errors can become really annoying. I have a perfectly working Nano board that on some tablets causes really frequent errors, every 100-200 characters or sometimes even more often.

The extreme cases happen when the speed deviation of the Arduino board and the terminal (PC) act in opposite directions and we reach intolerable differential speed deviation, even of 5.6% and more.
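The arithmetic behind the 35% and 5.6% figures can be sketched as follows. This is a first-order approximation that simply subtracts the two percentage errors and multiplies by the number of bits in the frame; the 10-bit frame (1 start, 8 data, 1 stop) and the half-bit limit are simplifying assumptions, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Speed mismatch between transmitter and receiver, in percent.
// Errors in the same direction cancel; errors in opposite directions add up.
double differentialErrorPercent(double txErrorPercent, double rxErrorPercent) {
    return std::fabs(txErrorPercent - rxErrorPercent);
}

// Timing drift accumulated by the end of one frame, in percent of a bit period.
double driftAtFrameEndPercent(double differentialPercent, int bitsPerFrame) {
    assert(bitsPerFrame > 0);
    return differentialPercent * bitsPerFrame;
}

// With ideal signal edges the receiver samples in the middle of each bit,
// so a drift of half a bit period or more lands on the wrong bit.
bool exceedsHalfBit(double driftPercent) {
    return driftPercent >= 50.0;
}
```

With a differential error of 3.5% the drift over 10 bits is 35% of a bit, already uncomfortably close to the limit. With 5.6% it is 56%, past the half-bit mark. Two devices that are both off by the same amount in the same direction have a differential error of zero.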

But with a baud rate of 250k, 500k or 1M, the speed error is zero. I have written a dedicated sketch and a PC control application and left them running for hours. Not a single error occurred on any of the Arduino modules I tested, or on any of the tablets and PCs we use!

As a cross-check, I ran the same tests at 115200 and observed frequent errors in almost all combinations of modules and PCs.


What to do now?

Naturally, I could modify the bootloader and reprogram my modules, but this would not solve the annoyance that these errors are causing to many other people.

The best solution is to convince Arduino programmers that there is a problem. We must emphasize how these upload phase errors are causing trouble to a large number of users. Even if they try to blame the issue on Chinese clones, long cables, wrong settings, user inexperience or even San Gennaro, we must stay united to make them face the problem. If many of us report these errors, they might be convinced to address them.

Those who are not affected by the issue might underestimate it, but if you experience occasional errors during the upload phase on a 328p with the new bootloader, let us know your experience and how frustrating it can be!


A SIMPLE ERROR TESTER

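void setup()
{
  Serial.begin(115200);  // the problematic speed under test

  // 27-character payload, sent twice per line after the line counter
  String s = " ABCDEFGHIJKLMNOPQRSTUVWXYZ";
  for (int i = 0; i < 1000; i++)
  {
    Serial.println(String(i) + s + s);
  }
}

void loop()
{
  // nothing to do: the test runs once in setup()
}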

EXAMPLE WITH ERRORS

...
(lines omitted)
...
177 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
178 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHI⸮)⸮⸮⸮
%5EUeu⸮⸮⸮⸮
179 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
180 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
181 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
182 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
183 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
184 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
185 ABCDEFGHIJKLMNOPQRSTU⸮⸮⸮  
!%)-159=AEIMQUY]aei5
186 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSU⸮⸮⸮⸮⸮
187 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
...
(lines omitted)
...
517 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
518 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
⸮S⸮
!%)-159=AEIMQUY]aei⸮ABCDEFGHIJKLMNOPQRSTUVWXYZ
520 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
...
(lines omitted)
...
883 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
884 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
885 ABCDEFGHIJKLMNOPQRSTU⸮⸮⸮	
!%)-159=AEIMQUY]aei5
886 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
887 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ
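
Spotting the damaged lines by eye in a capture like the one above is tedious. A PC-side check could be sketched as below in plain C++. This is only an illustration with hypothetical function names, not the control application mentioned earlier; it assumes the capture has already been split into lines and that every good line is a decimal counter followed by the payload printed by the tester sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Payload that follows the counter on every line printed by the tester sketch.
const std::string kPayload =
    " ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// True when the line is a decimal counter followed by the exact payload.
bool isWellFormed(std::string line) {
    if (!line.empty() && line.back() == '\r') line.pop_back();  // println() sends "\r\n"
    if (line.size() <= kPayload.size()) return false;           // no room for a counter
    const std::size_t digits = line.size() - kPayload.size();
    for (std::size_t k = 0; k < digits; ++k) {
        if (line[k] < '0' || line[k] > '9') return false;
    }
    return line.compare(digits, kPayload.size(), kPayload) == 0;
}

// Counts captured lines that do not match the expected pattern.
std::size_t countCorrupted(const std::vector<std::string>& lines) {
    std::size_t bad = 0;
    for (const std::string& line : lines) {
        if (!isWellFormed(line)) ++bad;
    }
    return bad;
}
```

Because a corrupted byte can turn into a line break, one damaged transmission may show up as two or more bad lines, so the count is an upper bound on the number of damaged lines sent rather than an exact error count.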

UPD: I get no such errors with your code.
Maybe you have something connected to pins 0/1.
UPD2: if you want an ideal 250k baud rate by setting Serial.begin(250000); you should get an Arduino board with a quartz crystal [image] instead of a microscopic resonator [image].

OR connect a logic analyzer, look at how far the frequency deviates, and compensate with an appropriate value, e.g. Serial.begin(250394); or Serial.begin(249726);


Of course, it doesn't happen in all cases, as I wrote in my post.

It only happens if the actual bps rate between the module and the PC deviates too much. In your case you are probably lucky and therefore at 115200 you will not have errors either with my test or even when using the new bootloader.

It happens to me a lot on some tablets and less on the large PC. But usually, if I run it several times, sooner or later it gives some error.


You're probably right about quartz, but it's always better to adjust the software to make it work in all cases, rather than requiring more precise hardware.

And regarding the Serial.begin(nnnnn) this can only be done in the sketches, while the upload and bootloader are fixed at 115200. And it is precisely during the upload that these errors are annoying.

As you said, if the bootloader does not receive firmware it starts the program, which means that from that moment on it makes absolutely no difference what baud rate the bootloader contains or used.

You probably didn't read my post properly. The test I showed is just to demonstrate that 115200 is a problematic speed.

But the real problem lies in the new bootloader which is fixed at 115200.


which errors?

BL is open source, you can make your own modified one. or you can erase BL with HEX uploaded thru ISP

No problem, sorry. Or do you think I have something newer than the new one?

The avrdude errors that randomly happen during the upload phase when you use a 328p with the new bootloader on some PCs.

Maybe you can compile and upload 10 or 20 times without errors and then get two consecutive errors. And this is not caused by long cables, lack of quartz, Chinese clones or other similar things. It is caused by the speed error of 3.5% when using the ATmega328p at 16 MHz and the new bootloader at 115200 bps.

Yes, I could modify the bootloader to use 250k bps and reprogram my boards with it, but, as written in my post, "this would not solve the annoyance that these errors are causing to many other people."

Thanks for this input. 115200 Baud is indeed a bad choice!

Unfortunately, @kolaha does not understand the problem.


I get no errors, even with your ultimate error tester.

@jremington Correct. I cannot understand what I have never once seen. And I say: "thank god for the 115200 BL, the old one was so annoyingly slow".

@theremino You have presented a coherent analysis of the problem you perceive. It may well be true (I see no flaw), but resistance will be high. I haven't perceived the problem, and I'm running a small network of RS485 Nanos, all clones, as well as three Unos and a Mega, all genuine; if it existed here, I'd have seen it, but I do understand the random nature of the problem you're describing - it takes a pair of devices at opposite ends of the error% to make the problem appear.
Thanks for raising the issue, hopefully more will come to light.


But that's what he's saying, @kolaha - you probably don't have devices with the required clock deviation. It's entirely arbitrary. Unless you're 'lucky' enough to have the right devices, you'll never see it. To recreate it, one would have to override the onboard oscillator of an Arduino, to move the clock frequency to one end of the 'acceptable' range or the other.


@camsysca If all the nodes in your network are 328p running at 16 MHz and 115200, the problem cannot arise, because they all run at the same speed, which is -3.5% off what it should be. They are in sync with each other, so everything is fine.

But have you ever received avrdude errors when programming modules with the Arduino IDE? It has happened to many and normally you don't notice it, you reprogram it a second time and everything is fine.

For years I thought they were random errors but now that I have analyzed the problem carefully I am sure that it is due to the 3.5% of the 115200 of the new bootloader.

No. But from first starting out with my Nano network, that 3.5% has been stuck in my mind, because I'm well aware that it's not optimal. Hasn't bitten me, yet. You're right, it's not likely to, unless I start to vary the nodes beyond the family I work with, and that's not likely unless I catch the ESP bug. Frankly, I'd be likely to move to 250000 anyway, to improve throughput, in which case the problem "goes away".

I think: if the error cannot be reproduced, then it is not in the BL, because the BL cannot change. But what can change? Voltage, the silicon in the chip, radiation, electromagnetic pollution, etc. None of that was checked by the topic starter.

I don't see a problem source in your test code, either - it's likely you'll end up with buffering backlog throwing that much output, even at 115200, because the output buffer is small, but that should just slow down the throughput; the flush() would just synchronize your transmissions with the output speed, really.
Good luck!

Except that, I use a commercial USB-485 dongle, el-cheapo, which is not likely 328P at 16 MHz. So there is that possible issue. Since the PC is talking to all of my nodes, I would see transmission errors in that sense.

I also use a different manufacturer's USB-485 adapter as a passive listener, along with a freebie RS232 analysis package from the web, so I could see the problem there.

But I don't, so this is just noise.

My advice: use STM MCUs, they have USARTs with automatic baud rate detection.