"Fan-out" is a term from bipolar TTL technology. It does not apply in the same way to unipolar MOS logic, whose inputs draw practically no static current.
Long lines and many inputs increase the capacitance that an output has to drive. The higher the capacitance, the more current is required to switch from one logic level to the other, or the longer the switch takes at a given source/sink current.
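To put rough numbers on that relationship, here is a minimal Python sketch based on I = C * dV/dt. All values (5 pF per input, 50 pF of wiring, a 5 V swing, 4 mA of drive current) are assumptions chosen for illustration, not figures for any particular chip, and a real output is not an ideal current source.

```python
def switching_time_s(c_load_f, delta_v, i_drive_a):
    """Approximate time to move a capacitive load by delta_v at constant current.

    From I = C * dV/dt it follows that t = C * dV / I. Real outputs are not
    ideal current sources, so treat the result as a rough estimate only.
    """
    return c_load_f * delta_v / i_drive_a


# Assumed example: 10 inputs at 5 pF each, plus 50 pF of wiring capacitance.
c_load = 10 * 5e-12 + 50e-12

# Assumed 5 V logic swing and 4 mA of available drive current.
t = switching_time_s(c_load, delta_v=5.0, i_drive_a=4e-3)
print(f"{t * 1e9:.0f} ns")
```

With these assumed values the estimate comes out at about 125 ns. Doubling the load doubles the time, and doubling the drive current halves it.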
You can simply reduce the clock frequency to make many shift registers work without any additional circuitry. This also compensates for the delay of the data signal that ripples through the daisy-chained shift registers. Depending on the refresh rate that the driven circuits need (multiplexed LED array?), a low clock rate will not cause any unwanted effects.
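Whether a lower clock is still fast enough can be checked with a quick calculation. The Python sketch below assumes a multiplexed display in which the whole chain is reloaded once per row. The numbers (100 kHz clock, 192 bits in the chain, 8 rows) are hypothetical and only show the arithmetic.

```python
def refresh_rate_hz(clock_hz, bits_in_chain, rows=1):
    """Full refreshes per second if the chain is reloaded once per row.

    Ignores latch pulses and any software overhead between shifts, so the
    real rate will be somewhat lower.
    """
    return clock_hz / (bits_in_chain * rows)


# Assumed example: 100 kHz clock, 192 bits in the chain, 8 multiplexed rows.
rate = refresh_rate_hz(100_000, bits_in_chain=192, rows=8)
print(f"{rate:.1f} Hz")
```

With these assumed values the display would still refresh about 65 times per second, which is normally flicker-free.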
Or you can insert a driver or a simple non-inverting gate into the clock line after every one or more registers, to delay the clock signal to match the ripple delay. For the details you should have a scope at hand, to measure the slopes, the ripple, and the clock delays required as a result.
As already mentioned, line drivers built from discrete components become much bigger and more expensive than integrated circuits. That's due to the complementary push/pull outputs, requiring one transistor each, and possibly more transistors to invert the signal twice and to drive the output stage.
Another solution is to use multiple register chains, up to a star topology with no daisy-chaining at all. If you can spend some more pins to drive the data inputs of multiple register chains at the same time, the clock rate can be decreased as well. Instead of 1 chain of 24 registers at 1 MHz, you can feed 2 chains of 12 registers at 500 kHz, or 4 chains of 6 registers at 250 kHz, to transfer the same number of bits in the same time.
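The chain arithmetic can be verified with a short Python sketch. It assumes 8-bit shift registers (the register width is not stated above, so `BITS_PER_REGISTER` is an assumption) and that all chains are clocked in parallel.

```python
BITS_PER_REGISTER = 8  # assumption: 8-bit shift registers


def transfer_time_s(registers_per_chain, clock_hz,
                    bits_per_register=BITS_PER_REGISTER):
    """Time to shift one full update into a single chain.

    Parallel chains are loaded simultaneously, so this is also the time
    needed to update all chains.
    """
    return registers_per_chain * bits_per_register / clock_hz


# The three configurations from the text: (number of chains, clock in Hz).
for chains, clock_hz in [(1, 1_000_000), (2, 500_000), (4, 250_000)]:
    registers_per_chain = 24 // chains
    t = transfer_time_s(registers_per_chain, clock_hz)
    print(f"{chains} chain(s) of {registers_per_chain} registers "
          f"at {clock_hz / 1000:.0f} kHz: {t * 1e6:.0f} us")
```

Under the 8-bit assumption all three configurations need the same 192 microseconds per update, while each clock line runs at a lower frequency as the number of chains grows.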