In your drawing, having separate Serial Data In lines and separate Shift Register clock lines does nothing for you.
Clock the same data into all the parts at the same time - only the chip getting the Storage Register clock will have its output updated.
SPI.transfer is fast enough that you can daisy chain them all, shift out 8 bytes, and update all the outputs at once. That will turn out to be faster than doing multiple writes to send the data out and then selecting one of the chips via the intermediate shift register.
Keep the data in an array:
// time to update the registers?
digitalWrite(ssPin, LOW); // goes to RCLK, pin 12
for (byte x = 0; x < 8; x = x + 1){
  SPI.transfer(dataArray[x]); // dataArray[0] ends up in the chip farthest down the chain
}
digitalWrite(ssPin, HIGH); // outputs change on this rising edge
// done with transfer
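To see why one latch pulse updates every chip, here is a rough host-side model of the daisy chain. It is only a sketch, not Arduino code: Chain595, clockBit, transfer, latch and updateAll are made-up names for illustration, and it assumes 8 chained parts with MSB-first shifting (the SPI default).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of 8 daisy-chained 74HC595 parts (names are illustrative).
constexpr std::size_t kChips = 8;

struct Chain595 {
    std::array<std::uint8_t, kChips> shiftReg{};  // internal shift registers
    std::array<std::uint8_t, kChips> outputs{};   // storage registers (the output pins)

    // One shift clock: every chip shifts by one bit, and the bit leaving
    // QH' of chip i becomes the serial input of chip i + 1.
    void clockBit(bool bit) {
        bool carry = bit;
        for (std::size_t i = 0; i < kChips; ++i) {
            const bool out = (shiftReg[i] & 0x80) != 0;
            shiftReg[i] = static_cast<std::uint8_t>((shiftReg[i] << 1) | (carry ? 1 : 0));
            carry = out;
        }
    }

    // Stand-in for SPI.transfer(): 8 clocks, most significant bit first.
    void transfer(std::uint8_t value) {
        for (int b = 7; b >= 0; --b) {
            clockBit(((value >> b) & 1) != 0);
        }
    }

    // Rising edge on the shared storage register clock: all outputs change together.
    void latch() { outputs = shiftReg; }
};

// Shift out the whole array, then latch once.
std::array<std::uint8_t, kChips> updateAll(Chain595& chain,
                                           const std::array<std::uint8_t, kChips>& data) {
    for (std::uint8_t value : data) {
        chain.transfer(value);
    }
    chain.latch();
    return chain.outputs;
}
```

In this model the outputs stay put while bytes are being shifted and only change in latch(). The first byte sent lands in the chip farthest from the processor, so order the array accordingly.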
For even faster results, replace digitalWrite with direct port manipulation:
PORTD = PORTD & B11111011; // clear D2 on an Uno for example
:
:
PORTD = PORTD | B00000100; // set D2 on an Uno
Pay attention when porting this code to other chip types; the port and bit that a given pin maps to differ between processors.
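The two masks in the port manipulation lines are just "clear bit 2" and "set bit 2" written out by hand. A small host-side sketch of the same arithmetic, with hypothetical helper names (bitMask, clearBit, setBit) that are not part of the Arduino core, shows that only the one pin changes:

```cpp
#include <cstdint>

// Illustrative helpers; the port value is passed in as a plain byte so this
// runs on a PC rather than touching a real PORTD register.
constexpr std::uint8_t bitMask(unsigned bit) {
    return static_cast<std::uint8_t>(1u << bit);
}

// Same effect as PORTD & B11111011 when bit == 2.
constexpr std::uint8_t clearBit(std::uint8_t port, unsigned bit) {
    return static_cast<std::uint8_t>(port & ~bitMask(bit));
}

// Same effect as PORTD | B00000100 when bit == 2.
constexpr std::uint8_t setBit(std::uint8_t port, unsigned bit) {
    return static_cast<std::uint8_t>(port | bitMask(bit));
}
```

Building the mask from the bit number makes it easier to change pins later, but the bit number itself still has to be looked up for each board.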
You can use the default SPI clock rate of 4 MHz, or change the divisor to get 8 MHz. Be sure to have a power supply decoupling cap (0.1uF/100nF) on the VCC pin of each device.
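For a feel of the numbers, here is a back-of-the-envelope calculation, assuming a 16 MHz system clock as on an Uno. The function names are made up for illustration, and the times count only the clock bits, not the loop or function call overhead between bytes.

```cpp
#include <cstdint>

// SPI clock = system clock / divisor (assumed 16 MHz system clock on an Uno).
constexpr std::uint32_t spiClockHz(std::uint32_t cpuHz, std::uint32_t divisor) {
    return cpuHz / divisor;
}

// Time spent clocking out 'bytes' bytes, in microseconds, ignoring software overhead.
constexpr double shiftTimeMicros(std::uint32_t bytes, std::uint32_t clockHz) {
    return bytes * 8.0 * 1e6 / clockHz;
}
```

With these assumptions, 8 bytes take about 16 us of clocking at 4 MHz and about 8 us at 8 MHz. On the hardware, the divisor would typically be changed with SPI.setClockDivider(SPI_CLOCK_DIV2) after SPI.begin().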