But your example replacement code for the 8-bit AVRs also waits doing nothing during the transfer...
No, it's usually loading up the next byte and incrementing the pointer/counter that's used for it. It's a small gain per byte, but it adds up to a significant improvement when moving hundreds of bytes - especially if you are reordering bytes on the fly. Also - in the next version of the LED library, that 'gap' time is going to be used for things like on-the-fly HSV to RGB conversion and global brightness leveling. It's the difference between doing:
write -> wait -> check if end of loop -> jump back to beginning of loop -> load byte -> increment pointer -> write
write -> check if end of loop -> jump back to beginning of loop -> load byte -> increment pointer -> wait -> write
basically overlapping the time spent waiting and the loading/memory juggling.
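To make the two orderings above concrete, here is a minimal sketch in C. It is not the library's actual code: the function names and the `spi_write`/`spi_wait` helpers are made up for illustration, and the non-AVR branch is only a stand-in so the logic can be compiled and checked on a desktop. On an AVR it assumes the SPI peripheral is already configured as master.

```c
#include <stdint.h>
#include <stddef.h>

#ifdef __AVR__
#include <avr/io.h>
/* Real hardware: start a transfer, then poll the SPIF flag. */
static inline void spi_write(uint8_t b) { SPDR = b; }
static inline void spi_wait(void) { while (!(SPSR & (1 << SPIF))) { } }
#else
/* Hypothetical host stand-in: log the bytes, never block. */
static uint8_t spi_log[64];
static size_t spi_log_len = 0;
static inline void spi_write(uint8_t b) {
    if (spi_log_len < sizeof spi_log) spi_log[spi_log_len++] = b;
}
static inline void spi_wait(void) { }
#endif

/* write -> wait -> loop check -> load byte -> increment pointer -> write */
static void spi_send_naive(const uint8_t *data, size_t len) {
    while (len--) {
        spi_write(*data++);
        spi_wait();              /* CPU idles for the whole transfer */
    }
}

/* write -> loop check -> load byte -> increment pointer -> wait -> write */
static void spi_send_overlapped(const uint8_t *data, size_t len) {
    if (len == 0) return;
    spi_write(*data++);          /* first byte: nothing to wait for yet */
    while (--len) {
        uint8_t next = *data++;  /* meant to run while the hardware shifts bits */
        spi_wait();              /* only the remainder of the transfer is idle */
        spi_write(next);
    }
    spi_wait();                  /* let the last byte finish */
}
```

One caveat: the compiler is allowed to move the plain load of `next` past the volatile polling loop, so it's worth checking the generated assembly before assuming the overlap actually happens.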
This generally ends up capping the maximum transfer rate you can get out of the chip because a lot of time is spent stupidly: instead of doing CPU-bound work (e.g. moving bytes in and out of memory) while the SPI hardware is busy writing bits to SPI, it just waits.
If this is responsible for the degradation from 4MHz to 1MHz, your data preparation is too calculation intensive and you might have to change other things. But you have the code now to play with if you think you can gain a lot of performance by setting the registers directly.
Not all of it, but it hurts. I don't have the numbers handy, but I have seen it impact timing by 5-10% at higher SPI clock rates (osc/2).
Also - on the Teensy 3 - the SPI hardware introduces 1-3 clocks of delay/waiting between each byte, so each byte ends up taking 9-11 clock cycles, not 8. There's another chunk of time lost. I haven't gone back to see if the AVR's hardware SPI introduces a similar inter-byte delay. However, between the end of one byte being written and the next byte beginning you have the flag check (read of a register, bit-set check, jump, possible loop - at least 3-4 clocks in the best case, where you only go through that loop once). So at osc/2 you've got a few more clock cycles where you aren't sending anything, and you aren't going to see a steady stream of clock and data unless you're hand-timing code to the point where you write SPDR on the very clock where SPIF is set.
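As a rough back-of-the-envelope check on what those gaps cost, the sketch below works out how much of the bus time is actually spent shifting data when each 8-bit byte is followed by a few idle clocks. The function name is made up for illustration, and the only inputs are the 1-3 clock gap figures mentioned above.

```c
#include <stdint.h>

/* Share of bus time spent moving data, in parts per thousand,
 * when every 8-bit byte is followed by gap_clocks idle clocks. */
static uint32_t spi_bus_utilization_permille(uint32_t gap_clocks) {
    const uint32_t data_clocks = 8;  /* one SPI clock per bit */
    return (data_clocks * 1000u) / (data_clocks + gap_clocks);
}

/* spi_bus_utilization_permille(0) -> 1000  (ideal, back-to-back bytes)
 * spi_bus_utilization_permille(1) -> 888   (9 clocks per byte)
 * spi_bus_utilization_permille(3) -> 727   (11 clocks per byte) */
```

So with a 1-3 clock gap per byte, somewhere around 11-27% of the nominal SPI bandwidth goes unused before the software overhead is even counted.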