Wow! Thats something I only dream of ;-). There is a lot of design in Cosa to use PROGMEM when ever possible and cut down on SRAM usage. With 16K I could put in a real RTOS with processes ;-)
You can really speed up pin I/O using a template class. I just posted a library that does writes in two cycles or 0.125 usec on an Uno. That library is at the above location.
I looked at your library recently and found that it uses much the same techniques as in Cosa and Jeelab. The current Arduino/Wiring style can be improved a lot with regards to speed and abstraction. Using C++ is a way forward.
The more important issues are abstracting and structuring the code base so that drivers, libraries, etc, can be added in a methodological fashion. Also I see a need to abstract IO functionality so that devices can be extended, replaced, etc. The IOStream/Canvas and IOBuffer classes in Cosa help a lot with this.
It would be nice to have a device driver framework for Arduino that allows a component view of both hardware modules/shields and software. Cosa is a small step in this direction. 1-Wire, I2C, SPI, etc, needs an abstraction layer and protocol generation so that integration becomes as easy as build in Arduino today. The research done in the d-tools project is an excellent starting point but we need to look at platform support for communicating systems; e.g. application build onto multiple boards, processors, etc, communicating over wire or wireless.
Hardware wise Arduino is a great contribution to getting cheap hardware for schools, university, prototyping and hobby projects. Software wise we have a lot more to do. Much has been done but additional structuring and refactoring is needed for the next quantum leap. Arduino/Wiring is great for beginners but it has some obstacles when it comes to scaling to larger projects. And 32-bits processors are not the answer even if they remove some of the resource limitations (memory, etc).
I think that there is a fair number of professionals with many years of experience of software design and implementation out there/here using Arduino for fun which could and should contribute in this process.
Cheers!