Go Down

Topic: wiring_serial.c - some suggestions for the Serial (Read 8467 times) previous topic - next topic


Hi guys,

I'm new to Arduino (and any microcontrollers for that matter) and maybe the following will sound arrogant, but I think that the core libraries are far from being optimized for 8-bit RISC processor.  

Here are a few suggestions for the class Serial, as defined in wiring_serial.c:

  • use unsigned char instead of int for rx_buffer_head and rx_buffer_tail.

  • use & RX_BUFFER_MASK instead of % RX_BUFFER_SIZE in the whole file, where RX_BUFFER_MASK is defined as
    Code: [Select]

  • define a function Serial.beginPrescaler(int prescaler) that does the same thing as Serial.begin, but with the prescaler for USART as a parameter

My reasons are simple:

  • Arduino works with an 8-bit RISC, one should use 8-bit variables unless more bits are needed. Since the hardcoded buffer size is 128 bytes anyway, unsigned char is enough to store the head and tail. Immediate gain is half the size of the code and twice the speed.

  • The compiler compiles % RX_BUFFER_SIZE literally as a module operation (essentially division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed number). 16-bit division on a 8-bit RISC without a division operation is a disaster. The code is hundreds of bytes longer and takes hundreds of cycles. Because this code is in an interrupt routine that gets executed every time a byte is received it is critical to make it optimal. Note: it seems that the compiler does the optimization automatically when you use unsigned char instead of int.

  • This would be a convenience for advanced users. One can avoid the 32-bit division (another disaster) in the Serial.begin routine and it would be more apparent that the baud rates like 57600 and 115200 are not optimal for 16MHz clock.

I did some testing:

  • An empty sketch (empty setup() and loop()) with the standard library takes 976 bytes. When you do the first 2 modifications, it's only 852 bytes. That's a difference 124 bytes for everybody, no matter if they're using Serial. (The reason for that is the interrupt routine that is always linked). Also the other functions in Serial are somewhat shorter.

  • An empty sketch with one call Serial.begin(57600L) takes 1244 bytes. It's only 1052 with the 3rd modification. That's another cca 200 bytes saved.

  • But the main advance is the performance gain. The interrupt routine is heavily utilized when receiving data and with the current library it takes around 250 cycles only to read one byte. That's 15 microsecs. When reading 10 KB/s, 15% of the processor power is spend on that. With 1Mbaud, the routine is not fast enough to read the incoming bytes and they are dropped. Also the standard practice is to have if (Serial.available() >0) in the loop(), which takes as much as the interrupt. The modified routine happily works with 1Mbaud speeds.



use & RX_BUFFER_MASK instead of % RX_BUFFER_SIZE in the whole file
The compiler compiles % RX_BUFFER_SIZE literally as a module operation (essentially division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed number).

The current scheme allows a user to redefine the buffer size if they're low on ram, without having to understand or worry about having to pick a power of two.  I'm not sure it would be a good idea to lose that capability.

I'm very unimpressed that the compiler did not manage to optimize this, even with rx_buffer_head and rx_buffer_tail redfined to UNSIGNED char as per your first suggestion :-(

I did manage to get it to generate andi instructions with some pretty heavy handed casting (that I don't like.)  I want to do some more experiments with avoiding expressions that generate intermediate results that the compiler is unable to optimized as unsigned bytes.
Code: [Select]
      unsigned char i = ((unsigned char)(rx_buffer_head + 1)) % RX_BUFFER_SIZE;
6f6:   e0 91 08 01     lds     r30, 0x0108
6fa:   ef 5f           subi    r30, 0xFF       ; 255
6fc:   9e 2f           mov     r25, r30
6fe:   9f 77           andi    r25, 0x7F       ; 127
700:   e1 50           subi    r30, 0x01       ; 1

Hmm.  How about:
Code: [Select]
6de:   1f 92           push    r1
6e0:   0f 92           push    r0
6e2:   0f b6           in      r0, 0x3f        ; 63
6e4:   0f 92           push    r0
6e6:   11 24           eor     r1, r1
#if defined(__AVR_ATmega168__)
       unsigned char c = UDR0;
6f2:   20 91 c6 00     lds     r18, 0x00C6
       unsigned char c = UDR;

       register unsigned char newhead;
       newhead = rx_buffer_head + 1;
6f6:   e0 91 08 01     lds     r30, 0x0108
6fa:   ef 5f           subi    r30, 0xFF       ; 255
       newhead %= RX_BUFFER_SIZE;
6fc:   9e 2f           mov     r25, r30
6fe:   9f 77           andi    r25, 0x7F       ; 127
700:   e1 50           subi    r30, 0x01       ; 1

       // if we should be storing the received character into the location
       // just before the tail (meaning that the head would advance to the
       // current location of the tail), we're about to overflow the buffer
       // and so we don't write the character or advance the head.
       if (newhead != rx_buffer_tail) {
702:   80 91 07 01     lds     r24, 0x0107
706:   98 17           cp      r25, r24
708:   31 f0           breq    .+12            ; 0x716 <__vector_18+0x38>
               rx_buffer[rx_buffer_head] = c;
70a:   ff 27           eor     r31, r31
70c:   e3 5f           subi    r30, 0xF3       ; 243
70e:   fe 4f           sbci    r31, 0xFE       ; 254
710:   20 83           st      Z, r18
               rx_buffer_head = newhead;
712:   90 93 08 01     sts     0x0108, r25
720:   0f 90           pop     r0
722:   0f be           out     0x3f, r0        ; 63
724:   0f 90           pop     r0
726:   1f 90           pop     r1
728:   18 95           reti

Wish I understood WTF it's doing saving/restoring R0/R1; looks pretty pointless (makes one of them a known zero, that isn't used.  Uses the other to save SREG, but it could have used one of the other registers it had to save anyway. (is this part of the calling sequence definition, that subroutines (except for ISR) can trash r0/r1?  That'd make a little more sense, but they still oughta be optimized away!))


Wow, I'm impressed. However I'm a hardware guy so I hope the software wizards around here look and consider your suggestion.

Thanks for the contribution.



Huh.  That's neat.  The compiler/linker included with Arduino0013 eliminates C functions that aren't called, even if they're part of the same source code as C functions that ARE used, and apparently even with some level of indirection.

In the current case, my little serial test case didn't use serialRead, and  neither HardwareSerial::read or serialRead show up in the object file.  This held true for other functions too.  For example, my sketch got 200 bytes shorter without the call to Serial.begin (which I THINK will leave the serial port at the 19200bps from the bootloader?)

Does anyone have a pointer to the specifics of how this dead function elimination works?


PS: Even with code that's fast enough, running async serial comm at 1Mbps does not strike me as a good idea...


I guess it wouldn't be a big deal to note next to
#define RX_BUFFER_SIZE 128 something like
// use only 16, 32, 64, 128 or 256 !!!
But I agree that % RX_BUFFER_SIZE is fool-proof.

To me, it seems to me that the compiler optimizes % to & when I use unsigned char instead of int. But writing it explicitly would force it.

Pushing and popping R0 and R1 from the stack is weird, especially when it doesn't do the same with the registers that it's actually using in the interrupt.

Why is 1Mbaud a bad idea? It works great for me. Obviously I'm not sending data to Arduino, but from Arduino to PC. I need a real time info about what it is doing, and that produces 50kB/s of data. That's easy to handle by my PC =) But I needed the Arduino's side to work to be able to decode short commands sent from PC. The little hack from my first post just did the trick. I think that everybody would benefit from a smaller faster core library.


Mar 01, 2009, 08:39 pm Last Edit: Mar 01, 2009, 08:40 pm by halley Reason: 1
To me, it seems to me that the compiler optimizes % to & when I use unsigned char instead of int. But writing it explicitly would force it.

It's rarely a good idea to play games like this:  to pretend to know what the compiler will produce in machine-code for a given high-level code, and to then try to write high-level code that will produce a specific machine-code result.  Sometimes, the compiler will be better than you.

Write what you intend from a functional point of view, and if you don't like the compiler's optimizations, then write in machine-code or improve the compiler.  Source code is for the humans.


Mar 01, 2009, 09:51 pm Last Edit: Mar 01, 2009, 10:13 pm by mekon83 Reason: 1
@halley: I cannot agree more. But I'm just proposing the simplest fix I can think of to make the core library routine around 5 times faster. For me, % gets compiled as & when I change int to unsigned char. So I would just keep it. But westfw had a problem with that and had to do some typecasting. If someone prefers to write it in assembly instead, sure, go ahead, but that'll be less accessible for even more people.

EDIT: So as westfw proposed, the code that appears to do exactly what is expected is:
Code: [Select]

134:            unsigned char i = ((unsigned char)(rx_buffer_head + 1)) % RX_BUFFER_SIZE;
+00000166:   91E00110    LDS       R30,0x0110     Load direct from data space
+00000168:   5FEF        SUBI      R30,0xFF       Subtract immediate
+00000169:   2F9E        MOV       R25,R30        Copy register
+0000016A:   779F        ANDI      R25,0x7F       Logical AND with immediate
+0000016B:   50E1        SUBI      R30,0x01       Subtract immediate


I like my second patch better (the one that avoids having the c compiler have to deal with intermediate results.  Casts are bad, especially for this sort of purpose, as Halley says.  Whereas C's "promote to integer" "issue" is reasonably well known.)
Code: [Select]
 unsigned char i = rx_buffer_head +1;

It's interesting that Mekon didn't need anything more than the change to unsigned char; perhaps that's a 11 vs 13 difference?  My initial experiments were with a v11 library.


Although, if we're going to change the serial library, I want an optional tx buffer too.  It's obnoxious that Serial.print() is as slow as it is (and this isn't the code being slow, it's the wait-for-the-UART algorithm.)


Mar 02, 2009, 02:26 am Last Edit: Mar 02, 2009, 02:29 am by mekon83 Reason: 1
@westfw: Sorry, I wasn't careful enough before. Even for me, % gets optimized to & even without typecasting, but the compiler then works with 16-bit words instead of 8-bits. Your suggestion works the best.

For the TX buffer, the following code works for me (for ATmega168). You can simply replace the serialWrite function in wiring_serial.c:
Code: [Select]

#define TX_BUFFER_SIZE 32

unsigned char tx_buffer[TX_BUFFER_SIZE];

unsigned char tx_buffer_head = 0;
volatile unsigned char tx_buffer_tail = 0;


 // temporary tx_buffer_tail
 // (to optimize for volatile, there are no interrupts inside an interrupt routine)
 unsigned char tail = tx_buffer_tail;

 // get a byte from the buffer      
 unsigned char c = tx_buffer[tail];
 // send the byte
 UDR0 = c;

 // update tail position
 tail ++;
 tail %= TX_BUFFER_SIZE;

 // if the buffer is empty,  disable the interrupt
 if (tail == tx_buffer_head) {
   UCSR0B &=  ~(1 << UDRIE0);

 tx_buffer_tail = tail;


void myserialWrite(unsigned char c) {

 if ((!(UCSR0A & (1 << UDRE0))) || (tx_buffer_head != tx_buffer_tail)) {
   // maybe checking if buffer is empty is not necessary,
   // not sure if there can be a state when the data register empty flag is set
   // and read here without the interrupt being executed
   // well, it shouldn't happen, right?

   // data register is not empty, use the buffer
   unsigned char i = tx_buffer_head + 1;

   // wait until there's a space in the buffer
   while (i == tx_buffer_tail) ;

   tx_buffer[tx_buffer_head] = c;
   tx_buffer_head = i;

   // enable the Data Register Empty Interrupt
   UCSR0B |=  (1 << UDRIE0);

 else {
   // no need to wait
   UDR0 = c;

It is useful for lower baud rates (lower than 1Mbaud ;), because the interrupt itself takes 4us to execute, while with 1Mbaud it takes only 10us to send or receive a byte. With 1Mbaud, there might be an interference of the RX and TX interrupts causing the receiver to miss incoming bytes.


Thanks for the code; that'll be helpful.
I had in mind modifying wiring_serial.c so that it was "smart" and didn't take up any more space if you had "#define TX_BUFFER_SIZE 0", although I have mixed feelings about conditional compilation.  (The compiler will nicely optimize away C code that can never execute:
Code: [Select]
if (TX_BUFFER_SIZE > 0) {
 // buffered output code that is not included if
 //    TX_BUFFER_SIZE is a constant zero
} else {
 //  Unbuffered output code
 But I don't think that works for the ISR (Hmm.  Need to check, especially with the newer compiler/linker.  Bet it won't eliminate even empty, never-called, ISR routines...)


For that, the preprocessor can be used. It is used heavily in wiring_serial.c anyway. Something like:

Code: [Select]


#if defined(TX_BUFFERED)

// buffered code here

#define TX_BUFFER_SIZE 32

unsigned char tx_buffer[TX_BUFFER_SIZE];

unsigned char tx_buffer_head = 0;
volatile unsigned char tx_buffer_tail = 0;

// buffered code, with all variables etc.


// the unbuffered code
// the original serialWrite()


and the user can turn on or off the buffering just by commenting out the line #define TX_BUFFERED.

Or even better
Code: [Select]

#define TX_BUFFER_SIZE 32

#if (TX_BUFFER_SIZE > 0)

// buffered code here


// the unbuffered code
// the original serialWrite()


The advantage of the preprocessor is that also the unused interrupt routine doesn't get compiled if the buffer is not used.


Ok, here's the final optimized wiring_serial.c with output buffer that can be completely disabled by setting #define TX_BUFFER_SIZE 0:


It allows for 1Mbaud transfers and saves a couple hundred bytes of space.

I tested it with ATmega168. I *should* work with ATmega8 too, it compiles, but I can't test it.



I would know how to set 1Mbaud serial comunication in the arduino with this library... what should I have to do?


Go Up

Please enter a valid email to subscribe

Confirm your email address

We need to confirm your email address.
To complete the subscription, please click the link in the email we just sent you.

Thank you for subscribing!

via Egeo 16
Torino, 10131