I'm new to Arduino (and to microcontrollers in general), and maybe the following will sound arrogant, but I think the core libraries are far from optimized for an 8-bit RISC processor.
Here are a few suggestions for the class Serial, as defined in wiring_serial.c:
1. Use unsigned char instead of int for rx_buffer_head and rx_buffer_tail.
2. Use & RX_BUFFER_MASK instead of % RX_BUFFER_SIZE throughout the file, where RX_BUFFER_MASK is defined as
   #define RX_BUFFER_MASK (RX_BUFFER_SIZE - 1)
3. Define a function Serial.beginPrescaler(int prescaler) that does the same thing as Serial.begin, but takes the USART prescaler as a parameter.
My reasons are simple:
1. Arduino runs on an 8-bit RISC processor, so one should use 8-bit variables unless more bits are needed. Since the hardcoded buffer size is 128 bytes anyway, unsigned char is enough to store the head and tail. The immediate gain is half the code size and twice the speed.
2. The compiler compiles % RX_BUFFER_SIZE literally as a modulo operation (essentially a division) for some reason, instead of the obvious optimized & (RX_BUFFER_SIZE - 1) (probably because int is a signed type). 16-bit division on an 8-bit RISC without a division instruction is a disaster. The code is hundreds of bytes longer and takes hundreds of cycles. Because this code is in an interrupt routine that gets executed every time a byte is received, it is critical to make it optimal. Note: it seems that the compiler does the optimization automatically when you use unsigned char instead of int.
3. A beginPrescaler function would be a convenience for advanced users. One can avoid the 32-bit division (another disaster) in the Serial.begin routine, and it would be more apparent that baud rates like 57600 and 115200 are not optimal for a 16 MHz clock.
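To make the first two points concrete, here is a minimal sketch of a receive ring buffer with 8-bit indices and a mask. It is plain C that compiles on a desktop compiler, not the actual wiring_serial.c code: rx_store is a hypothetical stand-in for the body of the receive interrupt, and rx_available and rx_read are illustrative names as well. The mask trick only works when RX_BUFFER_SIZE is a power of two, which 128 is.

```c
#include <stdint.h>

#define RX_BUFFER_SIZE 128                     /* must be a power of two */
#define RX_BUFFER_MASK (RX_BUFFER_SIZE - 1)

#if (RX_BUFFER_SIZE & RX_BUFFER_MASK) != 0
#error "RX_BUFFER_SIZE must be a power of two for the mask to work"
#endif

static volatile uint8_t rx_buffer[RX_BUFFER_SIZE];
static volatile uint8_t rx_buffer_head = 0;    /* next slot to write */
static volatile uint8_t rx_buffer_tail = 0;    /* next slot to read  */

/* Hypothetical stand-in for the receive interrupt body: store one byte. */
static void rx_store(uint8_t c)
{
    /* Wrap around with a single AND instead of a division. */
    uint8_t next = (uint8_t)((rx_buffer_head + 1) & RX_BUFFER_MASK);

    if (next != rx_buffer_tail) {              /* buffer full: drop the byte */
        rx_buffer[rx_buffer_head] = c;
        rx_buffer_head = next;
    }
}

/* Number of bytes waiting in the buffer. */
static uint8_t rx_available(void)
{
    return (uint8_t)((RX_BUFFER_SIZE + rx_buffer_head - rx_buffer_tail)
                     & RX_BUFFER_MASK);
}

/* Return the oldest byte, or -1 if the buffer is empty. */
static int rx_read(void)
{
    if (rx_buffer_head == rx_buffer_tail) {
        return -1;
    }
    uint8_t c = rx_buffer[rx_buffer_tail];
    rx_buffer_tail = (uint8_t)((rx_buffer_tail + 1) & RX_BUFFER_MASK);
    return c;
}
```

One slot is always left empty so that head == tail can mean "empty", which is why a 128-byte buffer holds at most 127 bytes in this sketch.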
I did some testing:
An empty sketch (empty setup() and loop()) with the standard library takes 976 bytes. With the first two modifications, it's only 852 bytes. That's a difference of 124 bytes for everybody, whether or not they use Serial. (The reason is that the interrupt routine is always linked in.) The other functions in Serial also get somewhat shorter.
An empty sketch with a single call to Serial.begin(57600L) takes 1244 bytes. It's only 1052 with the third modification. That's roughly another 200 bytes saved.
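For anyone wondering what the prescaler means in practice, here is a rough sketch of the arithmetic. It assumes the normal-speed asynchronous mode formula from the ATmega datasheet, baud = F_CPU / (16 * (prescaler + 1)), and the function names are made up for illustration; none of this is taken from the library source.

```c
#include <stdint.h>

/* Actual baud rate produced by a given prescaler (UBRR value),
 * assuming normal-speed asynchronous mode (U2X bit cleared). */
static uint32_t baud_from_prescaler(uint32_t f_cpu, uint16_t prescaler)
{
    return f_cpu / (16u * ((uint32_t)prescaler + 1u));
}

/* Nearest prescaler for a requested baud rate. This is the kind of
 * 32-bit division a beginPrescaler-style call would let you skip at
 * run time by passing a precomputed constant. */
static uint16_t prescaler_for_baud(uint32_t f_cpu, uint32_t baud)
{
    return (uint16_t)((f_cpu / 16u + baud / 2u) / baud - 1u);
}

/* Deviation of the actual rate from the requested one, in tenths of a
 * percent (positive means too fast, result truncated toward zero). */
static int32_t baud_error_permille(uint32_t actual, uint32_t requested)
{
    return (int32_t)(((int64_t)actual - (int64_t)requested) * 1000
                     / (int64_t)requested);
}
```

Under these assumptions, at 16 MHz a request for 57600 lands on prescaler 16, which really gives about 58823 baud (roughly 2.1% fast), and 115200 lands on prescaler 8, which gives about 111111 baud (roughly 3.5% slow). Prescaler 0 gives exactly 1000000 baud.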
But the main advantage is the performance gain. The interrupt routine is heavily used when receiving data, and with the current library it takes around 250 cycles just to read one byte. That's about 15 microseconds. When reading 10 KB/s, about 15% of the processor time is spent on that. At 1 Mbaud, the routine is not fast enough to keep up with the incoming bytes and they are dropped. Also, the standard practice is to have if (Serial.available() > 0) in loop(), which takes about as much time as the interrupt routine. The modified routine happily works at 1 Mbaud.
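The numbers above can be checked with a few lines of integer arithmetic. This is only a back-of-the-envelope sketch: it takes the 250-cycle figure and the 16 MHz clock from this post, treats 10 KB/s as 10000 bytes per second, and assumes 10 bits on the wire per byte (start bit, 8 data bits, stop bit). The function names are made up for illustration.

```c
#include <stdint.h>

#define F_CPU_HZ 16000000ULL   /* 16 MHz clock */

/* Duration of a routine of the given length, in nanoseconds. */
static uint32_t cycles_to_ns(uint32_t cycles)
{
    return (uint32_t)((uint64_t)cycles * 1000000000ULL / F_CPU_HZ);
}

/* Share of CPU time spent in the routine, in tenths of a percent. */
static uint32_t cpu_load_permille(uint32_t cycles_per_byte,
                                  uint32_t bytes_per_second)
{
    return (uint32_t)((uint64_t)cycles_per_byte * bytes_per_second
                      * 1000ULL / F_CPU_HZ);
}

/* Clock cycles that pass between two back-to-back bytes on the wire. */
static uint32_t cycle_budget_per_byte(uint32_t baud, uint32_t bits_per_frame)
{
    return (uint32_t)(F_CPU_HZ * bits_per_frame / baud);
}
```

With these assumptions, cycles_to_ns(250) is 15625 ns, cpu_load_permille(250, 10000) is 156 (that is, 15.6%), and cycle_budget_per_byte(1000000, 10) is 160 cycles. A 250-cycle routine does not fit into a 160-cycle budget, which matches the dropped bytes at 1 Mbaud.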