I have had no issue sending step signals at 5,000 steps/second with easing (quadratic or linear), and at 10,000 steps/second without easing. Of course, I have no motors on hand that can come close to that speed at the 12V I tend to work with. The primary difference was that instead of using digitalWrite(), I fixed my pins in hardware and wrote to the pin register directly, and I used TimerOne to drive the stepping asynchronously so that the sketch could do other things in the meantime. The motion is a little different, because the minimum off-period between steps is fixed at one timer cycle, but getting asynchronous stepping was more valuable to me than achieving off-periods shorter than the interrupt period.
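Here's a minimal host-side C++ sketch of that idea, assuming a timer interrupt that calls on_tick() at a fixed period. sim_port stands in for a real port register such as PORTB, and STEP_BIT, start_move(), on_tick() and run_move() are hypothetical names; this is not the OMMotor code or the TimerOne API. The easing here is a linear speed ramp (so position varies quadratically with time) driven by a fixed-point phase accumulator, which is just one way to do it and not necessarily what OMMotor means by linear or quadratic easing. Each pulse is held high for one tick and followed by at least one low tick, so with a hypothetical 50 µs tick, MAX_SPEED works out to one step every 100 µs, or 10,000 steps/second.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for an AVR port register such as PORTB. On hardware you would
// write the (volatile) register itself instead of calling digitalWrite().
uint8_t sim_port = 0;
constexpr uint8_t STEP_BIT = 1u << 5;  // hypothetical: step pin on bit 5

// Speeds are 16.16 fixed point, measured in steps per timer tick.
constexpr uint32_t ONE_STEP = 1u << 16;
// Fastest allowed rate: one tick high, then at least one tick low.
constexpr uint32_t MAX_SPEED = ONE_STEP / 2;

struct StepGen {
  uint32_t steps_left = 0;  // steps still to emit
  uint32_t steps_done = 0;  // steps emitted so far
  uint32_t ramp_steps = 0;  // steps taken while accelerating, reused for braking
  uint32_t phase = 0;       // fractional-step accumulator
  uint32_t speed = 0;       // current speed
  uint32_t target = 0;      // cruise speed, capped at MAX_SPEED
  uint32_t accel = 0;       // speed change per tick (linear ramp)
  uint32_t min_speed = 0;   // floor so the move always completes
};

StepGen gen;

void start_move(uint32_t steps, uint32_t target, uint32_t accel) {
  assert(target > 0 && accel > 0);
  gen = StepGen{};
  gen.steps_left = steps;
  gen.target = target < MAX_SPEED ? target : MAX_SPEED;
  gen.accel = accel;
  gen.min_speed = accel < gen.target ? accel : gen.target;
  gen.speed = gen.min_speed;
}

// Body of the timer interrupt; each call represents one tick.
void on_tick() {
  const bool was_high = (sim_port & STEP_BIT) != 0;
  if (was_high) sim_port &= static_cast<uint8_t>(~STEP_BIT);  // end the pulse
  if (gen.steps_left == 0) return;

  const bool braking = gen.steps_left <= gen.ramp_steps;
  if (braking) {
    // Mirror the acceleration ramp, never dropping below min_speed.
    gen.speed = gen.speed > gen.min_speed + gen.accel ? gen.speed - gen.accel
                                                      : gen.min_speed;
  } else if (gen.speed < gen.target) {
    gen.speed = gen.target - gen.speed > gen.accel ? gen.speed + gen.accel
                                                   : gen.target;
  }

  gen.phase += gen.speed;
  // With speed <= MAX_SPEED a step can never fall due on a was_high tick,
  // so this guard just makes the one-tick minimum off-period explicit.
  if (!was_high && gen.phase >= ONE_STEP) {
    gen.phase -= ONE_STEP;
    sim_port |= STEP_BIT;  // start a pulse; the next tick clears it
    if (!braking && gen.speed < gen.target) ++gen.ramp_steps;
    --gen.steps_left;
    ++gen.steps_done;
  }
}

struct MoveStats {
  uint32_t steps = 0;         // pulses emitted
  uint32_t ticks = 0;         // timer ticks simulated
  bool back_to_back = false;  // was the pin ever high on two consecutive ticks?
};

// Drive the simulated timer until the move is finished and the pin is low.
MoveStats run_move(uint32_t steps, uint32_t target, uint32_t accel) {
  start_move(steps, target, accel);
  MoveStats s;
  bool prev_high = false;
  while (gen.steps_left > 0 || (sim_port & STEP_BIT) != 0) {
    on_tick();
    const bool high = (sim_port & STEP_BIT) != 0;
    if (high && prev_high) s.back_to_back = true;
    prev_high = high;
    ++s.ticks;
  }
  s.steps = gen.steps_done;
  return s;
}
```

For example, run_move(200, MAX_SPEED, ONE_STEP / 500) ramps up, cruises and brakes over 200 steps. On an actual board you'd attach on_tick() with something like Timer1.initialize() and Timer1.attachInterrupt() from TimerOne, make the shared state volatile, and wrap reads of multi-byte fields from loop() in noInterrupts()/interrupts().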
It only supports a single motor, so you'd lose some speed supporting more motors, but for an example of how to do it at up to 10,000 steps/second, see: http://openmoco.org/docs/OMLibraries/class_o_m_motor.html
The source code can be downloaded from http://openmoco.svn.sourceforge.net/viewvc/openmoco/OpenMocoComponents/nanoMoCo/trunk/Libraries/OMMotor/
You can get quite fast code if you profile your solution and measure where the slow parts are.
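On the board itself, the simplest way to do that is to read micros() before and after the code in question, averaging over many iterations since micros() only has 4 µs resolution on 16 MHz boards. Below is the same bracket-and-average idea as a small host-side C++ helper; average_ns() is a hypothetical name, and timings from a PC only tell you which code path is relatively slower, since absolute numbers on an Arduino will be very different.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Average cost of one call to fn, in nanoseconds, over `iterations` calls.
// Timing a whole loop and dividing keeps clock overhead from swamping tiny functions.
template <typename Fn>
double average_ns(Fn&& fn, uint32_t iterations) {
  assert(iterations > 0);
  const auto start = std::chrono::steady_clock::now();
  for (uint32_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration<double, std::nano>(elapsed).count() / iterations;
}
```

To compare two candidate implementations, call average_ns() on each with the same iteration count and look at the ratio rather than the raw numbers.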
If you can't get past 1,000 steps/second, it's likely not a code problem, and you should be running a proper chopper driver with the correct voltage and current for your application.