The lion's share of the total latency (with FTDI or Uno) is due to the USB-serial chip waiting for more bytes to arrive before it finally times out and sends a less-than-full packet.
It appears the FTDI chip and Uno's 8u2 work similarly, but the default interval is 16 ms on FTDI vs 4 ms on the 8u2. It also looks like both simply send a packet every time interval instead of implementing an actual timeout, though I didn't thoroughly explore that on either. On Uno, where the source code is available, it's pretty clear that's how it works.
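To put rough numbers on the fixed-interval behavior, here is a minimal sketch of the timing. It is a simplified model of my own, not code from either chip: it assumes a free-running timer that fires every intervalUs microseconds and sends whatever is buffered at each tick. The function name is made up for illustration.

```cpp
#include <stdint.h>

// Simplified model (an assumption, not the actual firmware): a free-running
// timer fires every intervalUs microseconds, and whatever is buffered at a
// tick goes out as one packet. Returns how long the last byte written at
// lastWriteUs waits for the first tick strictly after it.
uint32_t fixedIntervalWaitUs(uint32_t lastWriteUs, uint32_t intervalUs) {
    uint32_t nextTickUs = (lastWriteUs / intervalUs + 1) * intervalUs;
    return nextTickUs - lastWriteUs;
}
```

Under this model, a reply that finishes 1 ms after a tick sits in the buffer for 15 ms with a 16 ms interval, and for 3 ms with a 4 ms interval.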
Teensy also implements a timeout (3 ms), but it is a true timeout: it resets to zero whenever you write more data, rather than flushing on a fixed interval. And since the USB is on chip, you can manually flush the buffer the instant you know all of your response has been written. Sending the partial packet on command is the best you can do to achieve minimal latency (well, writing directly into the USB packet buffer as fast as your code can run, instead of at 115200 baud, also helps...)
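As a rough sketch of the manual flush, something like the helper below would do it. The helper name sendReply is hypothetical. It is written as a template so it compiles against any port object that has write() and send_now(); on Teensy you would pass Serial.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: write the whole reply, then ask the port to send the
// partial packet right away instead of waiting for the timeout to expire.
// Port is any type with write(const uint8_t *, size_t) and send_now(),
// such as the USB Serial object on Teensy.
template <typename Port>
void sendReply(Port &port, const uint8_t *reply, size_t len) {
    port.write(reply, len);  // queue the response bytes
    port.send_now();         // flush the partial packet immediately
}
```

In a sketch this would be called as sendReply(Serial, reply, len) once the last byte of the response is known.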
In theory, you could send the packet on command on Uno's 8u2 by connecting an extra wire between the '328 and 8u2 chip. You'd modify the 8u2 code to sense a change on that pin, and of course immediately send any partial packet, just like Serial.send_now() does on Teensy.
Or perhaps instead of connecting an extra wire, the 8u2 code could be modified to detect a specific "end of message" pattern in the data, or some other protocol-specific message framing, and send the packet as soon as possible.
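A minimal sketch of that framing idea might look like the code below. This is not the actual 8u2 firmware: the TxBuffer struct, the pushByte function and the choice of newline as the "end of message" byte are all assumptions for illustration. The only hard number is the 64 byte limit of a full-speed bulk packet.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical framing byte; a real protocol would define its own marker.
const uint8_t kEndOfMessage = '\n';

// Hypothetical staging buffer for bytes arriving from the '328.
struct TxBuffer {
    uint8_t data[64];  // a full-speed bulk packet holds at most 64 bytes
    size_t count = 0;
};

// Append one byte and return true if the packet should be sent right now,
// either because the message just ended or because the buffer is full.
// The caller is expected to send the packet and reset count to zero
// whenever this returns true; a byte pushed into a full buffer is dropped.
bool pushByte(TxBuffer &buf, uint8_t b) {
    if (buf.count < sizeof(buf.data)) {
        buf.data[buf.count++] = b;
    }
    return b == kEndOfMessage || buf.count == sizeof(buf.data);
}
```

The periodic timer would stay in place as a fallback for data that never contains the marker.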
I believe that is what was meant by the announcement on Uno, that because the 8u2 chip is programmable and the code is open source (it was published several days after release), you can modify Uno to work in ways that are simply impossible with FTDI (but already implemented on Teensy and easy to use from your sketch).
The tests with Teensy show what is probably the "best case" you could hope to achieve by completely eliminating this timeout. Maybe I'll investigate why it's still 1 ms (Teensy is definitely a lot faster than 1 ms), though I'm pretty sure it's a Mac USB host controller driver issue, where a new transfer can't happen until the next USB frame, even if the last one took far less than 1 ms.
I know this is a shameless plug... but if you need low latency, why not just try using Teensy? Not only is the latency 4 times lower, but the cost is about half as much. You'd probably need to tweak pin numbers or other minor details, but I already ran your code in ping mode! (ok, end shameless plug)
Then again, it might be interesting to play with the 8u2 code.......
Or maybe someone knows a way to change the FTDI timeout on Mac OS X?