Teensy 3.0

I'm glad Teensy 3 is working out well, especially the fast upload process.

I actually put months of work into optimizing the upload speed, and spent quite a bit of that time analyzing the Teensy 2.0 upload process. The 2.0 bootloader is quite simple (it needs to be, since it fits entirely within 512 bytes of code). It receives a chunk of data, programs it to the flash, then sends an ACK. That works well, but it doesn't achieve the best speed. First, the entire chip needs to be erased when the first write request shows up. Erasing earlier isn't an option, because users expect to be able to enter bootloader mode and, if nothing is transmitted, find the chip unmodified. So there's a lag while erasing, and then another while writing the first chunk. Only when the ACK arrives does the software on the PC send the next chunk. Because it's a userspace program, it's subject to ordinary scheduling delays, which can average several milliseconds per chunk. Over the span of writing many chunks, those USB latencies and operating system scheduling delays add up. More time is spent waiting than writing.
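To put rough numbers on that, here is a minimal sketch of the stop-and-wait timing described above. The chunk count and every millisecond figure are made-up illustration values, not measurements from Teensy 2.0, and the function names are hypothetical.

```python
def stop_and_wait_upload_ms(num_chunks, erase_ms, write_ms, turnaround_ms):
    """Total upload time when the host sends a chunk only after the previous ACK."""
    # The erase is triggered by the first write request, so it is paid up front.
    # Every chunk then costs one flash write plus one host turnaround
    # (USB latency plus userspace scheduling delay).
    return erase_ms + num_chunks * (write_ms + turnaround_ms)


def waiting_ms(num_chunks, turnaround_ms):
    """Portion of the upload spent waiting on the host rather than writing flash."""
    return num_chunks * turnaround_ms


# Illustrative values only (assumptions, not measured):
# 256 chunks, 50 ms erase, 2 ms per flash write, 3 ms host turnaround.
total = stop_and_wait_upload_ms(256, 50, 2, 3)
waiting = waiting_ms(256, 3)
writing = 256 * 2
print(total, waiting, writing)
```

With these assumed figures the waiting time exceeds the writing time, which is the situation the 2.0 analysis turned up.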

In 3.0, I implemented quite a lot of buffering in RAM. While the chip is erasing, several chunks of data are buffered into RAM. Once programming begins, there's plenty of data buffered, so the speed is constrained only by the flash writing. The operating system's userspace scheduling delays just result in bursts of incoming data that refill the buffers. With enough on-chip buffering, the flash writing never stalls waiting for more data, so the upload runs very close to the maximum speed the flash itself allows. I also used the chip's fast DMA engine to copy from the buffers to the flash controller, to minimize the non-writing time. But overall, buffering in the chip's RAM is the key to avoiding delay.
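The effect of the buffering can be sketched with a small simulation. This is a toy model, not the actual bootloader: it assumes the host wakes up at a fixed interval and tops up the RAM buffer in one burst, that the erase starts at time zero, and that time advances in 1 ms ticks. All names and numbers are hypothetical.

```python
def buffered_upload_ms(num_chunks, erase_ms, write_ms, burst_gap_ms, buffer_chunks):
    """Return (total_ms, stalled_ms) for a RAM-buffered upload, in 1 ms ticks."""
    sent = 0                # chunks the host has delivered so far
    buffered = 0            # chunks waiting in RAM
    written = 0             # chunks fully programmed into flash
    stalled_ms = 0          # ticks the flash sat idle with nothing to write
    in_flight = False       # True while a chunk is being programmed
    busy_until = erase_ms   # flash is busy erasing until this time
    t = 0
    while written < num_chunks:
        # Host side: every burst_gap_ms, refill the buffer to capacity.
        if t % burst_gap_ms == 0:
            refill = min(buffer_chunks - buffered, num_chunks - sent)
            buffered += refill
            sent += refill
        # Flash side: finish the current write, then start the next one.
        if t >= busy_until:
            if in_flight:
                written += 1
                in_flight = False
            if written < num_chunks:
                if buffered > 0:
                    buffered -= 1
                    in_flight = True
                    busy_until = t + write_ms
                else:
                    stalled_ms += 1
        t += 1
    # busy_until is the moment the final write completed.
    return busy_until, stalled_ms


# Illustrative values only: 256 chunks, 50 ms erase, 2 ms per write,
# host bursts every 3 ms. Compare an 8-chunk buffer with a 1-chunk buffer.
print(buffered_upload_ms(256, 50, 2, 3, 8))
print(buffered_upload_ms(256, 50, 2, 3, 1))
```

In this model the 8-chunk buffer finishes in erase time plus pure write time with no stalls, while the 1-chunk buffer stalls whenever the host is late, which is the difference the RAM buffering is meant to remove.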

There's another speedup I've designed, which isn't currently in use on Teensy 3.0. While the loading speed is nearly as fast as the flash memory allows, there is about a 1 second delay (longer on some Windows machines) from the instant the compiler produces the .hex file to the upload actually beginning. When Teensy reboots into bootloader mode, the USB disconnects, and then it reappears as a new device which accepts the download. Leonardo does the same. Each operating system does USB enumeration slightly differently, but there are delays associated with USB reset signals and the other enumeration steps, which add up to about 1 second or more.

My hope is to eliminate all that delay by "rebooting" into the bootloader code without disconnecting the USB. The support for it is all in place in Teensy 3.0, so I hope to enable this with a software update sometime next year (2013). But this is a lot of difficult software work to implement successfully on the computer side. The loader needs to request the "reboot" and then check whether it was successful. There are a number of complications with basically switching a USB device at runtime without going through the proper enumeration process. If it doesn't work, of course the fallback is to disconnect. Success or failure needs to be determined in a matter of milliseconds to be of any value. And all of that needs to be done three times, because each operating system has a dramatically different low-level USB API. But if it is successful, the opportunity for speedup is pretty incredible. That 1+ second USB enumeration is now the single largest delay (assuming your machine is fast and the build-dependency patch is avoiding a full recompile). I want to turn that last 1 second delay into only single-digit milliseconds!
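The host-side logic described above (try the in-place switch, decide within milliseconds, otherwise fall back to a disconnect) could be structured roughly like this. It is only a sketch of the control flow: the three callables and the 5 ms deadline are hypothetical placeholders, and the real per-OS USB calls are not shown.

```python
import time


def enter_bootloader(request_soft_reboot, confirm_bootloader, reenumerate,
                     deadline_s=0.005):
    """Try the fast in-place switch; fall back to full USB re-enumeration.

    All three callables are hypothetical stand-ins for OS-specific USB code.
    Returns "soft" if the in-place switch was confirmed in time,
    otherwise "reenumerated".
    """
    start = time.monotonic()
    try:
        request_soft_reboot()
        # Poll for confirmation at least once, then until the deadline passes.
        while True:
            if confirm_bootloader():
                return "soft"
            if time.monotonic() - start >= deadline_s:
                break
    except OSError:
        # Treat any low-level USB error as a failed fast path.
        pass
    # Slow path: disconnect and wait for the bootloader to enumerate.
    reenumerate()
    return "reenumerated"


# Usage with stand-in callables: the device never confirms, so we fall back.
log = []
result = enter_bootloader(lambda: log.append("request"),
                          lambda: False,
                          lambda: log.append("reenumerate"),
                          deadline_s=0.002)
print(result, log)
```

The short deadline is the important design choice here: if the check takes too long, the fast path stops being any better than simply re-enumerating.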

Yes, I'm obsessed with upload speed optimization...