Tell the truth - did you actually optimize it, or did you just avoid bloating it? :-)
Teensyduino's "Serial" isn't HardwareSerial at all. It's completely different code for USB virtual serial. There is a highly optimized Serial.write(buf, size) function which does a block copy directly into the USB packet buffers using 2 instructions per byte. It's optimized for speed, not minimal code size.
Teensyduino's Print has many optimizations that try to maximize use of write(buf, size), rather than writing 1 byte at a time. Recently Arduino's Print class has started implementing some of these, but in many places it still writes 1 byte at a time. With HardwareSerial, it doesn't matter, since write(buf, size) is just a loop which repeatedly calls the single byte write. But with Teensyduino's Serial, and with Ethernet and the SD card library, using block writes is much faster. These Print optimizations are separate from optimizations in the code which actually implements available/read/write I/O. For streams that use block copy, they make a huge improvement in performance.
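To make the difference concrete, here is a rough sketch of the idea in plain C++. It is not the actual Teensyduino or Arduino source: PrintSketch, BlockStream, printSlow and printFast are made-up names, and the 64 byte packet buffer is just an assumption for illustration. It only shows why handing the whole buffer to write(buf, size) pays off when the stream can do a block copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a Print-style base class (not the real source).
class PrintSketch {
public:
    virtual ~PrintSketch() {}
    virtual size_t write(uint8_t b) = 0;

    // Default block write: just a loop over the single-byte write.
    virtual size_t write(const uint8_t *buf, size_t size) {
        size_t n = 0;
        while (size--) n += write(*buf++);
        return n;
    }

    // Unoptimized printing: one virtual call per character.
    size_t printSlow(const char *s) {
        size_t n = 0;
        while (*s) n += write(static_cast<uint8_t>(*s++));
        return n;
    }

    // Optimized printing: pass the whole string to the block write.
    size_t printFast(const char *s) {
        return write(reinterpret_cast<const uint8_t *>(s), std::strlen(s));
    }
};

// Hypothetical stream with a real block write (think USB packet buffer).
class BlockStream : public PrintSketch {
public:
    uint8_t packet[64];   // assumed packet buffer size, for illustration only
    size_t used = 0;      // bytes currently stored in packet
    unsigned calls = 0;   // number of write() calls of either kind

    size_t write(uint8_t b) override {
        ++calls;
        if (used >= sizeof(packet)) return 0;
        packet[used++] = b;
        return 1;
    }

    size_t write(const uint8_t *buf, size_t size) override {
        assert(buf != nullptr);
        ++calls;
        size_t room = sizeof(packet) - used;
        if (size > room) size = room;          // truncate if the packet is full
        std::memcpy(packet + used, buf, size); // one block copy, no per-byte call
        used += size;
        return size;
    }
};
```

With this sketch, printing an 11 character string through printSlow costs 11 write calls, while printFast costs a single call and one memcpy. For a stream that only has the default looping write(buf, size), both paths end up doing the same per-byte work, which is the HardwareSerial case described above.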
End-to-end speed depends on many software factors, including the software on the PC side, but many people have reported easily achieving 300 kbytes/sec (yes bytes, not bits), and speeds in the 800 kbyte/sec range are possible.
(It's been a bit depressing to watch Serial grow and grow with nearly every release... Despite contributions that would improve things.)
Yes, Arduino's HardwareSerial is horribly inefficient. Using indirect addressing for all the I/O registers and constants is very costly on AVR hardware. Somebody obviously felt that 1 copy of the code, no matter how complex and inefficient, would be better than a separate copy for each port. From a maintenance perspective, maybe it is, but the trade-off is slower performance and unnecessarily large compiled code.
At least 1.0.1 changes the index variables to unsigned, so the interrupt won't use the math library to implement the modulus operator! That's actually a huge improvement in interrupt latency.
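A small sketch of why the signedness matters, assuming the buffer size is a power of two (kBufferSize = 64 here is my assumption, as are the function names). With an unsigned index, the modulus is exactly a bit mask, so the compiler is free to emit a single AND. With a signed index, the result has to carry the sign of the dividend, so a plain mask is not equivalent, and as I understand it avr-gcc may fall back to a library division routine instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical buffer size; the mask trick needs a power of two.
constexpr unsigned kBufferSize = 64;
static_assert((kBufferSize & (kBufferSize - 1)) == 0,
              "buffer size must be a power of two");

// Unsigned index advanced with the modulus operator.
inline unsigned nextIndexUnsigned(unsigned i) {
    assert(i < kBufferSize);
    return (i + 1) % kBufferSize;
}

// The same wrap-around written as an explicit bit mask.
inline unsigned nextIndexMasked(unsigned i) {
    assert(i < kBufferSize);
    return (i + 1) & (kBufferSize - 1);
}

// Signed remainder: negative inputs give negative results,
// so this cannot be replaced by a simple mask.
inline int remainderSigned(int i) {
    return i % static_cast<int>(kBufferSize);
}
```

For every valid index the first two functions return the same value, which is what lets the unsigned version compile down to a mask. remainderSigned(-1) is -1, while masking the same bit pattern would give 63, which is why the signed version needs the extra work.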
Teensyduino also has a HardwareSerial which is heavily optimized, but it only needs to support a single hardware serial port. If there were 2 or more, I'd make copies. It's similar to the pre-0015 version Arduino had, but it has a number of small optimizations which have never appeared in any version of Arduino.
All this code I've published is open source. If anyone really cared, it could be ported back to Arduino, or at least mined for ideas to separately optimize the Arduino version.