I got it working with SPI today, and it is now twice as fast.
I will do some more testing and code cleanup before I upload it here.
I took steene's advice and used rotate through carry (ror) to calculate the output byte in 32 cycles (4 per pin).
Adding a shift register now costs 43 interrupt clock cycles, which is about 5.4 clock cycles per pin 8)
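
For anyone wondering how the ror trick builds the byte, here is a rough C sketch of the idea. It is not my actual code (that is AVR assembly and still needs cleanup). The names `ror_through_carry`, `build_output_byte`, `counter` and `duty` are made up for this example, and I am assuming each pin's on/off decision comes from a compare that leaves its result in the carry flag.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the AVR "ror" instruction: shift right by one, the old carry
 * enters bit 7, and the old bit 0 becomes the new carry.
 * *carry must be 0 or 1. */
static uint8_t ror_through_carry(uint8_t value, uint8_t *carry)
{
    assert(*carry <= 1u);
    uint8_t new_carry = value & 1u;
    value = (uint8_t)((value >> 1) | (*carry << 7));
    *carry = new_carry;
    return value;
}

/* Hypothetical per-pin rule (an assumption for this sketch):
 * pin i is on while counter < duty[i]. On the AVR, a compare would set
 * the carry flag and a following "ror" would rotate it into the byte. */
static uint8_t build_output_byte(uint8_t counter, const uint8_t duty[8])
{
    uint8_t out = 0;
    for (int pin = 0; pin < 8; pin++) {
        /* Stand-in for the carry flag after the compare. */
        uint8_t carry = (counter < duty[pin]) ? 1u : 0u;
        out = ror_through_carry(out, &carry);
    }
    return out;
}
```

Each result enters at bit 7 and moves one place right on every later rotation, so after eight rotations pin 0 sits in bit 0 and pin 7 in bit 7. No masking or branching is needed per pin, which is why the per-pin cost stays so low.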