RE direct register access:
http://www.arduino.cc/en/Reference/PortManipulation
WRT the cascading, imagine a loop to be a mechanical typewriter. Every time your loop ends and goes back to the beginning, it's like hitting the carriage return. It takes the typewriter a lot more time to go back to the beginning than it does to type a single letter.
Therefore, if you want to type 42 letters (i.e. the state of each LED), it is faster to type all 42 letters on one line than to type one letter per line. This is a time optimization called "loop unrolling" (google it).
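To make the typewriter picture concrete, here is a minimal sketch of the same work done with a loop and then unrolled. It is plain C++ so it can be tried on a PC; ledState, setLed, showPatternLooped and showPatternUnrolled are made-up names for illustration, and the array just stands in for real pin writes.

```cpp
#include <cstdint>

// Hypothetical stand-in for 8 LED outputs; on real hardware these would be
// pin or port writes rather than an array.
bool ledState[8];

void setLed(uint8_t index, bool on) { ledState[index] = on; }

// Looped version: every pass pays for the compare, increment and branch
// (the "carriage return") on top of the actual write.
void showPatternLooped(uint8_t pattern) {
  for (uint8_t i = 0; i < 8; ++i) {
    setLed(i, (pattern >> i) & 1);
  }
}

// Unrolled version: the same eight writes typed out "on one line",
// with no loop bookkeeping in between.
void showPatternUnrolled(uint8_t pattern) {
  setLed(0, pattern & 0x01);
  setLed(1, pattern & 0x02);
  setLed(2, pattern & 0x04);
  setLed(3, pattern & 0x08);
  setLed(4, pattern & 0x10);
  setLed(5, pattern & 0x20);
  setLed(6, pattern & 0x40);
  setLed(7, pattern & 0x80);
}
```

Both functions leave ledState in the same condition; only the amount of bookkeeping differs. A compiler may or may not unroll a loop like this on its own depending on its optimization settings, so it is worth measuring rather than assuming.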
Add to that the direct register access idea. Using it, ONE key press (one instruction) can type 8 letters (by directly writing the port) instead of it taking more like 10-20 key presses to type just ONE letter (digitalWrite function call).
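A rough sketch of the difference, again in plain C++ so it runs on a PC. fakePort is a stand-in I made up for a real 8-bit port register (on an ATmega168/328 based board that would be something like PORTD, which covers digital pins 0-7), and writeOnePin only approximates the kind of work a digitalWrite-style call does; it is not the actual library code.

```cpp
#include <cstdint>

// Assumption: a plain variable standing in for an AVR port register
// such as PORTD, so the example can run off the board.
volatile uint8_t fakePort = 0;

// Roughly what a per-pin write has to do for ONE pin:
// build the bit mask, then read-modify-write the port.
void writeOnePin(uint8_t bit, bool high) {
  uint8_t mask = static_cast<uint8_t>(1u << bit);
  if (high) {
    fakePort = fakePort | mask;
  } else {
    fakePort = fakePort & static_cast<uint8_t>(~mask);
  }
}

// Slow path: eight separate calls, one per pin.
void writeByteSlow(uint8_t value) {
  for (uint8_t bit = 0; bit < 8; ++bit) {
    writeOnePin(bit, (value >> bit) & 1);
  }
}

// Fast path: one store sets all eight pins at once.
// On the board this would be a line like: PORTD = value;
void writeByteFast(uint8_t value) { fakePort = value; }
```

Both paths end with the same value on the port. Keep in mind that a direct port write touches all eight pins of that port, including any you are using for something else (pins 0 and 1 are the serial lines on many boards).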
Finally, when you cascade the 595s you have to manipulate the clock line for each letter. To continue the analogy, this would be like typing a ^ (up) before and a v (down) after each letter. So instead of typing:
1010101
you need to type ^1v^0v^1v, etc.
But if you give each of 8 chips its own data line but use the same clock line, it's like typing:
^10101010v^10101010v, etc
A LOT fewer ^v are needed. Note that there is a limit to how many chips the Arduino can drive from a single line... it has to do with how much current the Arduino can source per line and how much the 595 draws. I'm not sure what those numbers are, just be aware that weird behavior may be a hardware issue, not a software one.
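Here is a small simulation of the shared-clock idea, assuming 8 chips whose data pins all sit on one 8-bit port. Everything in it (kChips, dataPort, chipReg, pulseClock, shiftOutParallel) is a name I made up for the sketch, the chips are modeled as simple 8-bit shift registers, and the latch line is left out.

```cpp
#include <cstdint>

const int kChips = 8;      // assumption: 8 chips, one data pin each

uint8_t dataPort = 0;      // stand-in for the port carrying the 8 data lines
uint8_t chipReg[kChips];   // simulated shift register inside each chip
int clockPulses = 0;       // how many "^v" pairs were typed

// Simulated clock pulse: every chip shifts in the bit on its own data line.
void pulseClock() {
  for (int c = 0; c < kChips; ++c) {
    uint8_t bit = (dataPort >> c) & 1;
    chipReg[c] = static_cast<uint8_t>((chipReg[c] << 1) | bit);
  }
  ++clockPulses;
}

// Send one byte to each chip, MSB first: one port write and one clock
// pulse per bit position, instead of one clock pulse per bit per chip.
void shiftOutParallel(const uint8_t bytes[kChips]) {
  for (int b = 7; b >= 0; --b) {
    uint8_t portValue = 0;
    for (int c = 0; c < kChips; ++c) {
      if ((bytes[c] >> b) & 1) {
        portValue |= static_cast<uint8_t>(1u << c);
      }
    }
    dataPort = portValue;  // on hardware: a single port register write
    pulseClock();          // on hardware: clock pin high, then low
  }
}
```

Loading all 8 chips this way takes 8 clock pulses, where a single cascaded chain of the same 8 chips would need 64. The inner loop that rearranges the bits costs time too, so if the patterns are known ahead of time it may pay to precompute the 8 port bytes.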
With all of these optimizations you'll probably get it to go hundreds if not 1000 times faster. I have driven 6 M5451 chips (they are like a 35-output shift register LED driver) using code like this and am able to blink the LEDs at 16 kHz (I didn't try more). 200 Hz or so looks solidly "on" to a human eye...
My code (for the M5451 chip) is available at http://code.google.com/p/arduino-m5451-current-driver/
But you know, you'll probably have more fun and learn a lot more if you DIY! :-)