Pages: [1] 2   Go Down
Author Topic: wiring_serial.c - some suggestions for the Serial  (Read 5028 times)
0 Members and 1 Guest are viewing this topic.
0
Offline Offline
Newbie
*
Karma: 0
Posts: 37
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hi guys,

I'm new to Arduino (and any microcontrollers for that matter) and maybe the following will sound arrogant, but I think that the core libraries are far from being optimized for 8-bit RISC processor.  

Here are a few suggestions for the class Serial, as defined in wiring_serial.c:

  • use unsigned char instead of int for rx_buffer_head and rx_buffer_tail.

  • use & RX_BUFFER_MASK instead of % RX_BUFFER_SIZE in the whole file, where RX_BUFFER_MASK is defined as
    Code:
    #define RX_BUFFER_MASK (RX_BUFFER_SIZE - 1)

  • define a function Serial.beginPrescaler(int prescaler) that does the same thing as Serial.begin, but with the prescaler for USART as a parameter
My reasons are simple:

  • Arduino works with an 8-bit RISC, one should use 8-bit variables unless more bits are needed. Since the hardcoded buffer size is 128 bytes anyway, unsigned char is enough to store the head and tail. Immediate gain is half the size of the code and twice the speed.
  • The compiler compiles % RX_BUFFER_SIZE literally as a module operation (essentially division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed number). 16-bit division on a 8-bit RISC without a division operation is a disaster. The code is hundreds of bytes longer and takes hundreds of cycles. Because this code is in an interrupt routine that gets executed every time a byte is received it is critical to make it optimal. Note: it seems that the compiler does the optimization automatically when you use unsigned char instead of int.
  • This would be a convenience for advanced users. One can avoid the 32-bit division (another disaster) in the Serial.begin routine and it would be more apparent that the baud rates like 57600 and 115200 are not optimal for 16MHz clock.

I did some testing:

An empty sketch (empty setup() and loop()) with the standard library takes 976 bytes. When you do the first 2 modifications, it's only 852 bytes. That's a difference 124 bytes for everybody, no matter if they're using Serial. (The reason for that is the interrupt routine that is always linked). Also the other functions in Serial are somewhat shorter.

An empty sketch with one call Serial.begin(57600L) takes 1244 bytes. It's only 1052 with the 3rd modification. That's another cca 200 bytes saved.

But the main advance is the performance gain. The interrupt routine is heavily utilized when receiving data and with the current library it takes around 250 cycles only to read one byte. That's 15 microsecs. When reading 10 KB/s, 15% of the processor power is spend on that. With 1Mbaud, the routine is not fast enough to read the incoming bytes and they are dropped. Also the standard practice is to have if (Serial.available() >0) in the loop(), which takes as much as the interrupt. The modified routine happily works with 1Mbaud speeds.
[/list]

Best,
Norbert
Logged

SF Bay Area (USA)
Online Online
Tesla Member
***
Karma: 135
Posts: 6787
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
use & RX_BUFFER_MASK instead of % RX_BUFFER_SIZE in the whole file
   :
The compiler compiles % RX_BUFFER_SIZE literally as a module operation (essentially division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed number).
The current scheme allows a user to redefine the buffer size if they're low on ram, without having to understand or worry about having to pick a power of two.  I'm not sure it would be a good idea to lose that capability.

I'm very unimpressed that the compiler did not manage to optimize this, even with rx_buffer_head and rx_buffer_tail redfined to UNSIGNED char as per your first suggestion :-(

I did manage to get it to generate andi instructions with some pretty heavy handed casting (that I don't like.)  I want to do some more experiments with avoiding expressions that generate intermediate results that the compiler is unable to optimized as unsigned bytes.
Code:
      unsigned char i = ((unsigned char)(rx_buffer_head + 1)) % RX_BUFFER_SIZE;
 6f6:   e0 91 08 01     lds     r30, 0x0108
 6fa:   ef 5f           subi    r30, 0xFF       ; 255
 6fc:   9e 2f           mov     r25, r30
 6fe:   9f 77           andi    r25, 0x7F       ; 127
 700:   e1 50           subi    r30, 0x01       ; 1

Hmm.  How about:
Code:
SIGNAL(SIG_UART_RECV)
#endif
{
 6de:   1f 92           push    r1
 6e0:   0f 92           push    r0
 6e2:   0f b6           in      r0, 0x3f        ; 63
 6e4:   0f 92           push    r0
 6e6:   11 24           eor     r1, r1
     :
#if defined(__AVR_ATmega168__)
        unsigned char c = UDR0;
 6f2:   20 91 c6 00     lds     r18, 0x00C6
#else
        unsigned char c = UDR;
#endif

        register unsigned char newhead;
        newhead = rx_buffer_head + 1;
 6f6:   e0 91 08 01     lds     r30, 0x0108
 6fa:   ef 5f           subi    r30, 0xFF       ; 255
        newhead %= RX_BUFFER_SIZE;
 6fc:   9e 2f           mov     r25, r30
 6fe:   9f 77           andi    r25, 0x7F       ; 127
 700:   e1 50           subi    r30, 0x01       ; 1

        // if we should be storing the received character into the location
        // just before the tail (meaning that the head would advance to the
        // current location of the tail), we're about to overflow the buffer
        // and so we don't write the character or advance the head.
        if (newhead != rx_buffer_tail) {
 702:   80 91 07 01     lds     r24, 0x0107
 706:   98 17           cp      r25, r24
 708:   31 f0           breq    .+12            ; 0x716 <__vector_18+0x38>
                rx_buffer[rx_buffer_head] = c;
 70a:   ff 27           eor     r31, r31
 70c:   e3 5f           subi    r30, 0xF3       ; 243
 70e:   fe 4f           sbci    r31, 0xFE       ; 254
 710:   20 83           st      Z, r18
                rx_buffer_head = newhead;
 712:   90 93 08 01     sts     0x0108, r25
    :
 720:   0f 90           pop     r0
 722:   0f be           out     0x3f, r0        ; 63
 724:   0f 90           pop     r0
 726:   1f 90           pop     r1
 728:   18 95           reti
Wish I understood WTF it's doing saving/restoring R0/R1; looks pretty pointless (makes one of them a known zero, that isn't used.  Uses the other to save SREG, but it could have used one of the other registers it had to save anyway. (is this part of the calling sequence definition, that subroutines (except for ISR) can trash r0/r1?  That'd make a little more sense, but they still oughta be optimized away!))
Logged

Left Coast, CA (USA)
Offline Offline
Brattain Member
*****
Karma: 361
Posts: 17301
Measurement changes behavior
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Wow, I'm impressed. However I'm a hardware guy so I hope the software wizards around here look and consider your suggestion.

Thanks for the contribution.

Lefty
Logged

SF Bay Area (USA)
Online Online
Tesla Member
***
Karma: 135
Posts: 6787
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Huh.  That's neat.  The compiler/linker included with Arduino0013 eliminates C functions that aren't called, even if they're part of the same source code as C functions that ARE used, and apparently even with some level of indirection.

In the current case, my little serial test case didn't use serialRead, and  neither HardwareSerial::read or serialRead show up in the object file.  This held true for other functions too.  For example, my sketch got 200 bytes shorter without the call to Serial.begin (which I THINK will leave the serial port at the 19200bps from the bootloader?)

Does anyone have a pointer to the specifics of how this dead function elimination works?
Logged

SF Bay Area (USA)
Online Online
Tesla Member
***
Karma: 135
Posts: 6787
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

PS: Even with code that's fast enough, running async serial comm at 1Mbps does not strike me as a good idea...
Logged

0
Offline Offline
Newbie
*
Karma: 0
Posts: 37
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I guess it wouldn't be a big deal to note next to
#define RX_BUFFER_SIZE 128 something like
// use only 16, 32, 64, 128 or 256 !!!
But I agree that % RX_BUFFER_SIZE is fool-proof.

To me, it seems to me that the compiler optimizes % to & when I use unsigned char instead of int. But writing it explicitly would force it.

Pushing and popping R0 and R1 from the stack is weird, especially when it doesn't do the same with the registers that it's actually using in the interrupt.

Why is 1Mbaud a bad idea? It works great for me. Obviously I'm not sending data to Arduino, but from Arduino to PC. I need a real time info about what it is doing, and that produces 50kB/s of data. That's easy to handle by my PC =) But I needed the Arduino's side to work to be able to decode short commands sent from PC. The little hack from my first post just did the trick. I think that everybody would benefit from a smaller faster core library.
Logged

Connecticut, US
Offline Offline
Edison Member
*
Karma: 2
Posts: 1036
Whatduino
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Quote
To me, it seems to me that the compiler optimizes % to & when I use unsigned char instead of int. But writing it explicitly would force it.

It's rarely a good idea to play games like this:  to pretend to know what the compiler will produce in machine-code for a given high-level code, and to then try to write high-level code that will produce a specific machine-code result.  Sometimes, the compiler will be better than you.

Write what you intend from a functional point of view, and if you don't like the compiler's optimizations, then write in machine-code or improve the compiler.  Source code is for the humans.
« Last Edit: March 01, 2009, 02:40:12 pm by halley » Logged

0
Offline Offline
Newbie
*
Karma: 0
Posts: 37
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

@halley: I cannot agree more. But I'm just proposing the simplest fix I can think of to make the core library routine around 5 times faster. For me, % gets compiled as & when I change int to unsigned char. So I would just keep it. But westfw had a problem with that and had to do some typecasting. If someone prefers to write it in assembly instead, sure, go ahead, but that'll be less accessible for even more people.

EDIT: So as westfw proposed, the code that appears to do exactly what is expected is:
Code:
134:            unsigned char i = ((unsigned char)(rx_buffer_head + 1)) % RX_BUFFER_SIZE;
+00000166:   91E00110    LDS       R30,0x0110     Load direct from data space
+00000168:   5FEF        SUBI      R30,0xFF       Subtract immediate
+00000169:   2F9E        MOV       R25,R30        Copy register
+0000016A:   779F        ANDI      R25,0x7F       Logical AND with immediate
+0000016B:   50E1        SUBI      R30,0x01       Subtract immediate
« Last Edit: March 01, 2009, 04:13:59 pm by mekon83 » Logged

SF Bay Area (USA)
Online Online
Tesla Member
***
Karma: 135
Posts: 6787
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I like my second patch better (the one that avoids having the c compiler have to deal with intermediate results.  Casts are bad, especially for this sort of purpose, as Halley says.  Whereas C's "promote to integer" "issue" is reasonably well known.)
Code:
 unsigned char i = rx_buffer_head +1;
  i %= RX_BUFFER_SIZE;
It's interesting that Mekon didn't need anything more than the change to unsigned char; perhaps that's a 11 vs 13 difference?  My initial experiments were with a v11 library.
Logged

SF Bay Area (USA)
Online Online
Tesla Member
***
Karma: 135
Posts: 6787
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Although, if we're going to change the serial library, I want an optional tx buffer too.  It's obnoxious that Serial.print() is as slow as it is (and this isn't the code being slow, it's the wait-for-the-UART algorithm.)
Logged

0
Offline Offline
Newbie
*
Karma: 0
Posts: 37
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

@westfw: Sorry, I wasn't careful enough before. Even for me, % gets optimized to & even without typecasting, but the compiler then works with 16-bit words instead of 8-bits. Your suggestion works the best.

For the TX buffer, the following code works for me (for ATmega168). You can simply replace the serialWrite function in wiring_serial.c:
Code:
#define TX_BUFFER_SIZE 32

unsigned char tx_buffer[TX_BUFFER_SIZE];

unsigned char tx_buffer_head = 0;
volatile unsigned char tx_buffer_tail = 0;


SIGNAL(USART_UDRE_vect) {

  // temporary tx_buffer_tail
  // (to optimize for volatile, there are no interrupts inside an interrupt routine)
  unsigned char tail = tx_buffer_tail;

  // get a byte from the buffer      
  unsigned char c = tx_buffer[tail];
  // send the byte
  UDR0 = c;

  // update tail position
  tail ++;
  tail %= TX_BUFFER_SIZE;

  // if the buffer is empty,  disable the interrupt
  if (tail == tx_buffer_head) {
    UCSR0B &=  ~(1 << UDRIE0);
  }

  tx_buffer_tail = tail;

}


void myserialWrite(unsigned char c) {

  if ((!(UCSR0A & (1 << UDRE0))) || (tx_buffer_head != tx_buffer_tail)) {
    // maybe checking if buffer is empty is not necessary,
    // not sure if there can be a state when the data register empty flag is set
    // and read here without the interrupt being executed
    // well, it shouldn't happen, right?

    // data register is not empty, use the buffer
    unsigned char i = tx_buffer_head + 1;
    i %= TX_BUFFER_SIZE;

    // wait until there's a space in the buffer
    while (i == tx_buffer_tail) ;

    tx_buffer[tx_buffer_head] = c;
    tx_buffer_head = i;

    // enable the Data Register Empty Interrupt
    UCSR0B |=  (1 << UDRIE0);

  }
  else {
    // no need to wait
    UDR0 = c;
  }
}

It is useful for lower baud rates (lower than 1Mbaud smiley-wink, because the interrupt itself takes 4us to execute, while with 1Mbaud it takes only 10us to send or receive a byte. With 1Mbaud, there might be an interference of the RX and TX interrupts causing the receiver to miss incoming bytes.
« Last Edit: March 01, 2009, 08:29:57 pm by mekon83 » Logged

SF Bay Area (USA)
Online Online
Tesla Member
***
Karma: 135
Posts: 6787
Strongly opinionated, but not official!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Thanks for the code; that'll be helpful.
I had in mind modifying wiring_serial.c so that it was "smart" and didn't take up any more space if you had "#define TX_BUFFER_SIZE 0", although I have mixed feelings about conditional compilation.  (The compiler will nicely optimize away C code that can never execute:
Code:
if (TX_BUFFER_SIZE > 0) {
  // buffered output code that is not included if
  //    TX_BUFFER_SIZE is a constant zero
} else {
  //  Unbuffered output code
}
 But I don't think that works for the ISR (Hmm.  Need to check, especially with the newer compiler/linker.  Bet it won't eliminate even empty, never-called, ISR routines...)

Logged

0
Offline Offline
Newbie
*
Karma: 0
Posts: 37
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

For that, the preprocessor can be used. It is used heavily in wiring_serial.c anyway. Something like:

Code:
#define TX_BUFFERED

#if defined(TX_BUFFERED)

// buffered code here

#define TX_BUFFER_SIZE 32

unsigned char tx_buffer[TX_BUFFER_SIZE];

unsigned char tx_buffer_head = 0;
volatile unsigned char tx_buffer_tail = 0;


SIGNAL(USART_UDRE_vect) {
//...
// buffered code, with all variables etc.

#else

// the unbuffered code
// the original serialWrite()

#endif
and the user can turn on or off the buffering just by commenting out the line #define TX_BUFFERED.

Or even better
Code:

#define TX_BUFFER_SIZE 32

#if (TX_BUFFER_SIZE > 0)

// buffered code here

#else

// the unbuffered code
// the original serialWrite()

#endif

The advantage of the preprocessor is that also the unused interrupt routine doesn't get compiled if the buffer is not used.
Logged

0
Offline Offline
Newbie
*
Karma: 0
Posts: 37
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Ok, here's the final optimized wiring_serial.c with output buffer that can be completely disabled by setting #define TX_BUFFER_SIZE 0:

http://www.math.ucla.edu/~npozar/arduino/wiring_serial.c

It allows for 1Mbaud transfers and saves a couple hundred bytes of space.

I tested it with ATmega168. I *should* work with ATmega8 too, it compiles, but I can't test it.
Logged

0
Offline Offline
Newbie
*
Karma: 0
Posts: 8
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hi!!!

I would know how to set 1Mbaud serial comunication in the arduino with this library... what should I have to do?

thanks
Logged

Pages: [1] 2   Go Up
Jump to: