Two methods:
Wiring in parallel - SCK to all devices, MOSI to all devices, and a separate latch line to each device. That is 6 wires for 4 devices; I've sent data to 4 shift registers (actually four MAX7219) this way. More code to run, but you can access an individual shift register if you really want to.
digitalWrite (ssPin1, LOW);
SPI.transfer(data1); // however the data is created - discrete variables, highByte/lowByte of an int, array[0]
digitalWrite (ssPin1, HIGH); // outputs update on this rising edge
digitalWrite (ssPin2, LOW);
SPI.transfer(data2); // however the data is created - discrete variables, highByte/lowByte of an int, array[1]
digitalWrite (ssPin2, HIGH); // outputs update on this rising edge
digitalWrite (ssPin3, LOW);
SPI.transfer(data3); // however the data is created - discrete variables, highByte/lowByte of an int, array[2]
digitalWrite (ssPin3, HIGH); // outputs update on this rising edge
digitalWrite (ssPin4, LOW);
SPI.transfer(data4); // however the data is created - discrete variables, highByte/lowByte of an int, array[3]
digitalWrite (ssPin4, HIGH); // outputs update on this rising edge
// can arrange this using a couple of arrays too: one for the data, one for the latch pin assignments, with a for loop to go through them; the for loop will be slower to execute.
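The array version mentioned above could look roughly like this. It is only a sketch: the pin numbers and data bytes are placeholders I made up, and updateAll() is a hypothetical helper name. The block under #else is not part of the technique; it is a set of stand-ins so the same code can be compiled and checked on a PC, and it is skipped when building for an Arduino.

```cpp
#ifdef ARDUINO
#include <SPI.h>
#else
// Host-side stand-ins (assumption: only for checking the logic off-board).
#include <cstdint>
#include <vector>
typedef uint8_t byte;
enum { LOW = 0, HIGH = 1, OUTPUT = 1 };
struct Event { int pin; int value; };          // pin == -1 marks an SPI byte
static std::vector<Event> busLog;              // records everything "sent"
void pinMode(int, int) {}
void digitalWrite(int pin, int level) { busLog.push_back({pin, level}); }
struct FakeSPI {
  void begin() {}
  byte transfer(byte b) { busLog.push_back({-1, b}); return 0; }
};
static FakeSPI SPI;
#endif

const byte NUM_DEVICES = 4;
const byte ssPins[NUM_DEVICES] = {10, 9, 8, 7};     // hypothetical latch pins
byte data[NUM_DEVICES] = {0x01, 0x02, 0x04, 0x08};  // example output bytes

// Send one byte to each device, each with its own latch pulse.
void updateAll() {
  for (byte i = 0; i < NUM_DEVICES; i++) {
    digitalWrite(ssPins[i], LOW);
    SPI.transfer(data[i]);
    digitalWrite(ssPins[i], HIGH);  // this device's outputs update on this rising edge
  }
}

void setup() {
  for (byte i = 0; i < NUM_DEVICES; i++) {
    pinMode(ssPins[i], OUTPUT);
    digitalWrite(ssPins[i], HIGH);  // idle with every latch high
  }
  SPI.begin();
}

void loop() {
  updateAll();
}
```

To update a single device, the same three lines can be run for just one index instead of looping over all of them.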
Daisy chaining - SCK to all devices, latch to all devices, MOSI to device one, then data out of each device to data in of the next device. I've sent data to 45 shift registers like this at an 8 MHz clock rate (SPI clock divisor at 2, vs the default of 4), refreshing all outputs at a 20 kHz rate.
Faster method, and just 3 IO pins needed.
digitalWrite (ssPin, LOW);
SPI.transfer(data0); // however the data is created - discrete variables, highByte/lowByte of an int, array[0]
SPI.transfer(data1); // however the data is created - discrete variables, highByte/lowByte of an int, array[1]
SPI.transfer(data2); // however the data is created - discrete variables, highByte/lowByte of an int, array[2]
SPI.transfer(data3); // however the data is created - discrete variables, highByte/lowByte of an int, array[3]
// can do the SPI.transfers in a for loop also, but it is slower
digitalWrite (ssPin, HIGH); // all shift register (4 or 45) outputs update on this rising edge
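For reference, here is a rough sketch of the daisy-chain version with the transfers in a for loop and the SPI clock divisor set to 2. The pin number, the data bytes, and the helper names updateChain() and shiftTimeMicros() are my own placeholders. SPI.setClockDivider() is the older call from the Arduino SPI library, and the 8 MHz figure assumes a 16 MHz board. The block under #else is only a set of stand-ins so the code can be compiled and checked on a PC.

```cpp
#ifdef ARDUINO
#include <SPI.h>
#else
// Host-side stand-ins (assumption: only for checking the logic off-board).
#include <cstdint>
#include <vector>
typedef uint8_t byte;
enum { LOW = 0, HIGH = 1, OUTPUT = 1, SPI_CLOCK_DIV2 = 4 };
struct Event { int pin; int value; };          // pin == -1 marks an SPI byte
static std::vector<Event> busLog;              // records everything "sent"
void pinMode(int, int) {}
void digitalWrite(int pin, int level) { busLog.push_back({pin, level}); }
struct FakeSPI {
  void begin() {}
  void setClockDivider(int) {}
  byte transfer(byte b) { busLog.push_back({-1, b}); return 0; }
};
static FakeSPI SPI;
#endif

const byte ssPin = 10;                                // hypothetical shared latch pin
const byte NUM_REGISTERS = 4;                         // 4 here, could be 45
byte data[NUM_REGISTERS] = {0x11, 0x22, 0x33, 0x44};  // example output bytes

// Lower bound on the time to clock out the whole chain, in microseconds.
// Ignores any per-byte software overhead around SPI.transfer().
unsigned long shiftTimeMicros(unsigned long registers, unsigned long clockHz) {
  return (registers * 8UL * 1000000UL) / clockHz;
}

// Shift every byte through the chain, then latch all registers at once.
void updateChain() {
  digitalWrite(ssPin, LOW);
  for (byte i = 0; i < NUM_REGISTERS; i++) {
    SPI.transfer(data[i]);     // the first byte sent ends up in the last register
  }
  digitalWrite(ssPin, HIGH);   // all outputs update on this rising edge
}

void setup() {
  pinMode(ssPin, OUTPUT);
  digitalWrite(ssPin, HIGH);
  SPI.begin();
  SPI.setClockDivider(SPI_CLOCK_DIV2);  // 8 MHz on a 16 MHz board
}

void loop() {
  updateChain();
}
```

As a sanity check on the numbers, shiftTimeMicros(45, 8000000UL) works out to 45 microseconds of pure clocking for 45 registers, which fits inside the 50 microsecond period of a 20 kHz refresh with little to spare. That small margin is why unrolled transfers matter at that rate.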