Dependence of Due native USB port transfer speed on data length

I'm working on writing a reusable Arduino <-> Python communications library for scientific instruments, and as part of that I would like to transfer data at high speed to Python running on a host PC. I tested the transfer rate from both the programming and native USB ports and don't fully understand the results. I tried transferring strings with 1 to 10000 characters (call the number of characters N). Each string was written with one call to Serial.print() or SerialUSB.print(), and read by a Python script on the host computer. In each case, the port was opened with baudrate = 115200 (which doesn't seem to be relevant on the native USB port). I included my full code below.

With Serial (programming port), the length of the string didn't matter:
N = 1 --> ~ 11700 bytes/s
N = 10000 --> ~ 11700 bytes/s
With 8N1 framing there are 10 bits on the wire per byte (1 start bit, 8 data bits, 1 stop bit), so 115200 baud corresponds to a maximum of 11520 bytes/s, which is close to what I measured, so this makes sense.

With SerialUSB (native port), there was a huge dependence on N:
N = 1 --> ~ 4800 bytes/s (0.2 ms per print statement)
N = 10 --> ~ 48000 bytes/s (0.2 ms per print statement)
N = 100 --> ~ 480000 bytes/s (0.2 ms per print statement)
N = 1000 --> ~ 4500000 bytes/s (0.2 ms per print statement)
N = 10000 --> ~ 5200000 bytes/s (1.9 ms per print statement)

So, the native port is fast (~5 megabytes per second) with long strings. However, there seems to be some per-transfer overhead on the native USB port, since transfers of 1 to 100 characters all take about the same time. My question is: where does this overhead come from, and what is the best way to deal with it? Is it time associated with starting a transfer (~0.2 ms), is the native port using something like a fixed large (~1024 byte) packet size per print statement, or am I just misunderstanding? I don't know much about the underlying USB transfer mechanism, and I have not seriously tried to look through the Arduino library code.

For high-speed transfers, does it really make sense to collect a bunch of smaller writes into an array and send them all at once, or is there some parameter in the Arduino libraries that could be adjusted? I don't want to waste a >1000 byte buffer and spend extra time moving data around if it isn't necessary.

Here is the Arduino Due code I used (Arduino 1.5.2). Set BUF_N to the number of characters in the string plus 1 for the null terminator.

#define BUF_N 10001

char buf[BUF_N];

void setup() {
  //Serial.begin(115200);
  SerialUSB.begin(115200); // baud rate ignored for usb virtual serial
  memset(buf, 'A', BUF_N-1);
  buf[BUF_N-1] = '\0';
}

void loop() {
  //Serial.print(buf);
  SerialUSB.print(buf);
}

And here is the Python code running on the host (Python 2.7.3 on Windows XP):

import time
import serial

ser = serial.Serial(
   port='COM10',
   baudrate=115200,
   parity=serial.PARITY_NONE,
   stopbits=serial.STOPBITS_ONE,
   bytesize=serial.EIGHTBITS
)

last_time = time.time()
n = 10000000 # this doesn't change the reported rate noticeably
while True:
    ser.read(n)
    new_time = time.time()
    print str(float(n)/(new_time-last_time)) + ' bytes/s'
    last_time = new_time

Earlier discussions of the available baud rates (which are related to my question) include:

http://arduino.cc/forum/index.php/topic,131872.0.html

My understanding of the USB overhead problem is this: a USB device can't just send data straight away to the host when it has data to send, the way a normal serial port would. Instead, the device has to wait until the host asks it whether it has any data to send. I think in USB 1.1 that only happened every 1 ms or more. I guess it must be more frequent for USB 2.0 High Speed devices, but that is probably why you can only send data every ~0.2 ms.

As far as I know, SerialUSB doesn't do any buffering of transmitted data; it only buffers received data.

In other code I've seen, the way around the problem is to keep a transmit buffer (say 128 bytes) and start a transfer either when the buffer is full or when a certain amount of time (~1 ms) has elapsed, as a compromise between latency and throughput. Again, with a High Speed device the buffer could perhaps be bigger and the time shorter.
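
A minimal sketch of that idea for the Due's native port could look like the following. The 128-byte buffer and ~1 ms flush interval are the values mentioned above; the names (queueByte, flushTx, TX_BUF_SIZE, FLUSH_INTERVAL_US) are just for illustration, not anything from the core libraries:

#define TX_BUF_SIZE 128          // buffer size: compromise between RAM use and throughput
#define FLUSH_INTERVAL_US 1000   // flush at least once per ~1 ms to bound latency

char txBuf[TX_BUF_SIZE];
int txCount = 0;
unsigned long lastFlush = 0;

void flushTx() {
  if (txCount > 0) {
    SerialUSB.write((uint8_t *)txBuf, txCount); // one larger USB transfer
    txCount = 0;
  }
  lastFlush = micros();
}

void queueByte(char c) {
  txBuf[txCount++] = c;
  if (txCount == TX_BUF_SIZE) flushTx(); // buffer full: send it now
}

void setup() {
  SerialUSB.begin(115200); // baud rate ignored on the native port
}

void loop() {
  queueByte('A'); // stand-in for real instrument data
  if (micros() - lastFlush >= FLUSH_INTERVAL_US) flushTx(); // time limit reached
}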

The baud rate does not affect the speed of the USB transfer if you are just doing PC-to-MCU transfers and are not driving an actual UART.

The USB transfer rate is affected by two things: the polling interval and the packet size. The polling interval in USB High Speed is a multiple of 125 µs, and the maximum packet size is 1024 bytes for the CDC class (interrupt transfers).

That seems to be consistent with your results: at roughly 0.2 ms per transfer, a ~1000-byte payload already works out to about 5 MB/s, which matches your peak rate. To maximise throughput, you need to buffer at least 1024 bytes so that each transfer carries as much data as possible. In the worst case, if you send only 1 byte per transfer, you have to wait at least 125 µs before being polled again, and that latency kills the transfer rate. Protocols that do short request/response transactions should be avoided; you want a protocol with a streaming mode.
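
For example (an illustrative sketch only, not the poster's protocol): instead of having the PC request each reading individually, the PC sends a single 's' and the Due then streams fixed-size blocks until it receives 'x'. The block size is chosen near the 1024-byte packet size discussed above:

#define BLOCK_SIZE 1024

uint8_t block[BLOCK_SIZE];
bool streaming = false;

void setup() {
  SerialUSB.begin(115200);
  for (int i = 0; i < BLOCK_SIZE; i++) block[i] = i & 0xFF; // dummy payload
}

void loop() {
  if (SerialUSB.available()) {
    int c = SerialUSB.read();
    if (c == 's') streaming = true;   // host asked to start streaming
    if (c == 'x') streaming = false;  // host asked to stop
  }
  if (streaming) {
    SerialUSB.write(block, BLOCK_SIZE); // one large transfer per pass
  }
}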

To reduce the impact of that latency even further, use larger buffers, and I mean really large, like 32 KB; or, for more efficient use of RAM, use a double-buffer scheme, where you fill one buffer while the other is transferring, then switch over.
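
Roughly like this (hypothetical names; in a real instrument the filling would normally happen from an interrupt or DMA callback while the other half is being sent, but here the fill is done inline just to show the swap logic):

#define HALF_SIZE 16384 // two 16 KB halves (the Due has 96 KB of SRAM)

uint8_t bufA[HALF_SIZE];
uint8_t bufB[HALF_SIZE];
uint8_t *fillBuf = bufA; // half currently being filled
uint8_t *sendBuf = bufB; // half currently being sent

void acquire(uint8_t *dst) {
  // stand-in for real data acquisition (ADC reads, sensor data, ...)
  for (int i = 0; i < HALF_SIZE; i++) dst[i] = i & 0xFF;
}

void setup() {
  SerialUSB.begin(115200);
}

void loop() {
  acquire(fillBuf); // fill one half

  uint8_t *tmp = fillBuf; // swap the roles of the two halves
  fillBuf = sendBuf;
  sendBuf = tmp;

  SerialUSB.write(sendBuf, HALF_SIZE); // send the half that was just filled
}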

Of course, maximising throughput with large buffers is fine for bulk transfers such as data logging, but it creates a latency problem if you do small transfers with a request/response protocol; you need to avoid those anyway.