The 74596 has open-collector outputs, but I suspect that chip is hard to find these days.
Separating the chips would have made little difference to the time it takes to shift out your data. If anything, it would be slightly slower. It's the same amount of data either way, but you would have a latch line per chip to control instead of a single shared one.