SPI Shift Registers don't work right

I am using a breadboard ATmega328P with a set of shift registers controlling three 3x3 RGB LED matrices (formed into a cube). I have one shift register hooked up to the first two columns and two common-cathode rows on one matrix, so it controls the bottom four LEDs in a square. As suggested here, I put a 0.1 µF capacitor in series with the Latch line. The setup can be powered from a 9 V battery or from the Arduino.

Here's my problem: with the Latch line connected, sometimes the display doesn't start at all! Sometimes it works, but out of tempo. Sometimes it skips frames. Sometimes it makes up sequences I didn't code! When I switch from the 9 V battery to the Arduino's +5 V, all of this happens much less often.

So I disconnected Latch, and now I get horrible but extremely brief flashes on LEDs that shouldn't turn on. This still happens even on Arduino power. It works for about 5 seconds, then gets out of sync and starts making up frames. If I reconnect Latch while the sketch is running, the flashing goes away, but then the sketch loses sync. On Arduino power it keeps sync without Latch, but with Latch connected it starts to lag again.

Reset returns the sketch to sync at the beginning.

Here is my code:

// RGB LED Cube Sketch
// Uses an ATMEGA328 to control a 3x3x3 cube of RGB LEDs with
// five 595 shift registers (8 bit).

//CURRENTLY USING ONLY ONE REGISTER!

// SPI code examples from http://www.gammon.com.au/forum/?id=11518
// and http://www.gammon.com.au/spi

// Shift register wiring and other example code can be found
// at http://www.arduino.cc/en/tutorial/ShiftOut

#include <SPI.h>

const byte LATCH = 9; // Edit to change Latch pin on Arduino
const byte maxSize = 16; // Edit this to change size of Array (Counting 0 --> (2^x)-1). Try to make divisible by numOfReg.
byte dataArray[maxSize]; // If you are using an absurdly large array (i.e. >256 bytes) then change to int

byte data; // If you are using an absurdly large array (i.e. >256 bytes) then change to int

const byte numOfReg = 1; //Edit to change # of registers

byte j = 0; // Start at byte 0

void setup() {
  SPI.setClockDivider(SPI_CLOCK_DIV128);
  pinMode(13, OUTPUT);
  digitalWrite(13, HIGH);
  delay(20);
  digitalWrite(14, LOW);
  SPI.begin ();
  digitalWrite(LATCH, HIGH);
  delay(100);
  digitalWrite(LATCH, LOW);
  Serial.begin(1200);

  // DEFINE EACH BYTE OF THE ARRAY IN HEX HERE!
  // For example, if maxSize = 8, then define dataArray[0] through [7]
  // Each item here adds about 6 bytes to the program size. (Out of 32K)
  // Each frame is (# of SR's) backwards.

  // This example turns each LED blue, red, green, then white!
  dataArray[0]=0x09;
  dataArray[1]=0x0A;
  dataArray[2]=0x0C;
  dataArray[3]=0x0F; // End of LED #6
  dataArray[4]=0x18;
  dataArray[5]=0x28;
  dataArray[6]=0x48; 
  dataArray[7]=0x78; // End of LED #5
  dataArray[8]=0x81;
  dataArray[9]=0x82;
  dataArray[10]=0x84;
  dataArray[11]=0x87; // End of LED #9
  dataArray[12]=0x90;
  dataArray[13]=0xA0;
  dataArray[14]=0xC0;
  dataArray[15]=0xF0; // End of LED #8
  // Continue to define each unit of the array
}

void loop() {
  digitalWrite(LATCH, LOW);
  // HERE WE TRANSFER EACH BYTE OF DATA!
  // For example, if sending bytes 0x13, 0xde, 0x4a:
  // 0x13 xx xx
  // 0xde 0x13 xx
  // 0x4a 0xde 0x13

  for (int i = 0; i < numOfReg; i++) {
      data = dataArray[j];
      digitalWrite(LATCH, LOW);
      SPI.transfer(data);
      delay(10);
      j++; // This is j, not i!
      // In a nutshell: Write each byte j sequentially (so it ends up backwards) in groups of numOfReg, then LATCH, then REPEAT.
  }
  
  digitalWrite(LATCH, HIGH);
  delay(500);
  
  if (j == maxSize) {
    j = 0; // Loops back to the beginning. You can change to create an 'intro' then a main 'loop'.
  }
}

What could be wrong?

As suggested here, I put a .1 uF capacitor in series with the Latch line.

No, that is putting it in parallel, not in series.

This tutorial is wrong: you should not connect the capacitor to the latch pin. Connect a capacitor between the 5V power pin and the ground pin of each shift register. The latch pin connects directly to the Arduino, without the capacitor.
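With the latch wired straight to the Arduino, the pin also has to be driven as an output. The posted sketch never calls pinMode(LATCH, OUTPUT), so pin 9 may only be switching its internal pull-up, which could explain the erratic latching. Below is a minimal sketch of one way to shift a frame and then latch it, assuming the same pin 9 and frame data as the original. nextFrame is a hypothetical helper introduced here, not part of any library, and the Arduino-only parts are guarded so the helper can also be compiled on a PC.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <SPI.h>
#endif

const uint8_t LATCH = 9;      // latch pin, as in the original sketch
const uint8_t numOfReg = 1;   // number of shift registers
const uint8_t maxSize = 16;   // number of bytes in the frame table

// Hypothetical helper: index of the frame that follows the one starting
// at j, wrapping back to 0 at the end of the table.
uint8_t nextFrame(uint8_t j) {
  uint8_t next = (uint8_t)(j + numOfReg);
  return (next >= maxSize) ? 0 : next;
}

#ifdef ARDUINO
uint8_t dataArray[maxSize] = {
  0x09, 0x0A, 0x0C, 0x0F, 0x18, 0x28, 0x48, 0x78,
  0x81, 0x82, 0x84, 0x87, 0x90, 0xA0, 0xC0, 0xF0
};
uint8_t j = 0;

void setup() {
  pinMode(LATCH, OUTPUT);     // missing from the original: drive the pin
  digitalWrite(LATCH, LOW);
  SPI.begin();
}

void loop() {
  // Shift out one frame while the latch is low...
  for (uint8_t i = 0; i < numOfReg; i++) {
    SPI.transfer(dataArray[j + i]);
  }
  // ...then a single rising edge copies it to the outputs.
  digitalWrite(LATCH, HIGH);
  digitalWrite(LATCH, LOW);
  j = nextFrame(j);
  delay(500);                 // frame time kept from the original
}
#endif
```

The outputs of a 595 change only on the rising edge of the latch, so nothing should flicker while the bytes are being shifted in.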

I bridged across the cap to "remove" it, but now 90% of the time, the LEDs don't turn on! Every once in a while, I see the right colors, but they stall. Using the 9v or the Arduino makes no difference.

So did you put the capacitors in the right place?

This
SPI.setClockDivider( SPI_CLOCK_DIV128 );
and this
Serial.begin( 1200 );
and this
delay( 500 );
are all going to hurt your viewing performance.
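To put rough numbers on these three settings, here is a small arithmetic sketch. It assumes a 16 MHz clock (a breadboard ATmega328P may run at 8 MHz instead) and 10 bits per serial character (8N1). spiByteMicros and serialCharMicros are hypothetical helper names, not Arduino functions.

```cpp
#include <stdint.h>

// Time to shift one byte out over SPI, in microseconds.
// cpuHz is the CPU clock and divider is the SPI clock divider
// (128 for SPI_CLOCK_DIV128, 4 for SPI_CLOCK_DIV4).
uint32_t spiByteMicros(uint32_t cpuHz, uint32_t divider) {
  return (8UL * divider * 1000000UL) / cpuHz;
}

// Time to send one serial character, in microseconds, assuming
// 1 start bit + 8 data bits + 1 stop bit.
uint32_t serialCharMicros(uint32_t baud) {
  return (10UL * 1000000UL) / baud;
}
```

At 16 MHz, SPI_CLOCK_DIV128 works out to 64 µs per byte against 2 µs at SPI_CLOCK_DIV4, and one character at 1200 baud takes about 8.3 ms (this only matters if the sketch actually prints). Both are small next to delay(500), which holds every frame for half a second and would be far too slow once several layers are multiplexed.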

Test each layer, see if columns and layer control are working, then jump into the multiplexing.
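One way to do that kind of check is a slow walking-bit test that drives one shift register output at a time. This is only a sketch: it assumes a single register with the latch on pin 9, and walkingBit is a hypothetical helper. If rows and columns share the register, a single high bit may not light anything by itself, so probe the outputs with a meter or OR in a fixed row bit.

```cpp
#include <stdint.h>
#ifdef ARDUINO
#include <SPI.h>
#endif

const uint8_t LATCH = 9;  // assumed latch pin

// Hypothetical helper: a byte with only output (step % 8) set.
uint8_t walkingBit(uint8_t step) {
  return (uint8_t)(1u << (step % 8));
}

#ifdef ARDUINO
uint8_t step = 0;

void setup() {
  pinMode(LATCH, OUTPUT);
  digitalWrite(LATCH, LOW);
  SPI.begin();
}

void loop() {
  SPI.transfer(walkingBit(step++));
  digitalWrite(LATCH, HIGH);  // rising edge updates the outputs
  digitalWrite(LATCH, LOW);
  delay(1000);                // slow enough to follow by eye
}
#endif
```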

CrossRoads:
This
SPI.setClockDivider( SPI_CLOCK_DIV128 );
and this
Serial.begin( 1200 );
and this
delay( 500 );
are all going to hurt your viewing performance.

How should I rewrite those parts of my code?

Should I increase the baud rate, or decrease it? Is the timer I have selected precise enough for SPI? Is my code mistyped?