I recently did a project where a separate, non-USB AVR chip sent debug data to a Teensy, which received it and resent it to a PC. The Teensy was also connected to that chip's reset line and could do ISP reprogramming of it, as well as a few checks of various analog voltages.
I wanted the serial to run as fast as possible, but I didn't want to do a lot of work. I did go to the trouble of reading into a 64-byte buffer, rather than 1 byte at a time, but otherwise I just went with things as they are. Here's the code.
void loop() {
  char buf[64];
  if (cpu_is_running) {
    // Forward whatever has arrived, up to one buffer's worth per pass.
    int n = Uart.available();
    if (n > 0) {
      if (n > (int)sizeof(buf)) n = sizeof(buf);
      Uart.readBytes(buf, n);
      Serial.write((uint8_t *)buf, n);
    }
  }
}
With the code above, it turned out the fastest reliable baud rate was 666667 bits/sec.
#define BAUDRATE 666667
//#define BAUDRATE 1000000 // Teensyduino can't keep up...
//#define BAUDRATE 2000000
//#define BAUDRATE 115200
HardwareSerial Uart = HardwareSerial();
It did work pretty well for short bursts of data at 1 Mbit/sec. But when the other AVR chip sent sustained data at the maximum rate, the Teensy ultimately couldn't keep up.
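In case the odd-looking 666667 raises questions: it is most likely just one of the few high baud rates an AVR UART can generate exactly. The sketch below is a rough illustration, not code from the project. It assumes a 16 MHz clock and the UART running in double-speed (U2X) mode, neither of which is stated above, and the names u2x_baud and kAssumedClockHz are made up for this example. It compiles on a PC and does not need the Teensy.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 16 MHz system clock (not stated in the post).
constexpr uint32_t kAssumedClockHz = 16000000UL;

// Baud rate an AVR USART produces in double-speed (U2X) mode for a given
// UBRR divisor: baud = f_cpu / (8 * (UBRR + 1)), rounded to the nearest integer.
inline uint32_t u2x_baud(uint32_t f_cpu, uint16_t ubrr) {
    assert(ubrr <= 4095);  // UBRR is a 12-bit register
    const uint32_t div = 8UL * (static_cast<uint32_t>(ubrr) + 1);
    return (f_cpu + div / 2) / div;
}
```

Under those assumptions, divisors 0, 1 and 2 give 2000000, 1000000 and 666667 bits/sec, which matches the three fast rates in the #define list. There is nothing in between, so the next step up from 666667 is a full 1 Mbit/sec.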
If your program were structured in the common 1-byte-at-a-time approach, you'd incur more overhead, especially since each write to the PC would have the overhead of manipulating the USB buffers. It very likely would not keep up at 666667 bps with so much more overhead.
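To put a rough number on that overhead budget, here is a back-of-the-envelope calculation of how many CPU cycles are available per incoming byte. It assumes 8N1 framing (10 bit times per byte) and a 16 MHz clock, neither of which is stated above, and cycles_per_byte is a hypothetical helper written for this example. It runs on a PC, not on the Teensy.

```cpp
#include <cassert>
#include <cstdint>

// CPU cycles that elapse while one byte arrives, rounded to the nearest integer.
// Assumption: 10 bit times per byte (1 start + 8 data + 1 stop, i.e. 8N1).
inline uint32_t cycles_per_byte(uint32_t f_cpu, uint32_t baud) {
    assert(baud > 0);
    const uint64_t cycles_x_bits = static_cast<uint64_t>(f_cpu) * 10;
    return static_cast<uint32_t>((cycles_x_bits + baud / 2) / baud);
}
```

With those assumptions, cycles_per_byte(16000000, 666667) is 240 and cycles_per_byte(16000000, 1000000) is 160. Everything done per byte (the receive interrupt, the read call, and the USB write) has to fit in that budget on average, which is why spreading the USB buffer handling across a whole block matters so much.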
Internally, readBytes() calls the read function for 1 byte at a time. I may someday optimize readBytes and the underlying Stream class for block reads. When/if that happens, the code above would very likely become able to reliably handle 1 Mbit/sec.
Of course, if your program is doing something different with the data, the actual reliable maximum data rate will depend heavily on what you're doing, and especially on whether you manage data in blocks or 1 byte at a time. But hopefully this one use case gives you at least some idea of the underlying capability to move at least 667 kbits/sec from the hardware serial port to the USB virtual serial port.
ps: another caveat is this does NOT apply to regular Arduino boards. The HardwareSerial code in Teensyduino is heavily optimized. Arduino's HardwareSerial code is much slower.