16x16 RGB LEDs with 96 registers, good idea?

Architecturally, I would run 12 groups of 8 shift registers in series, with each group having its own SlaveSelect line, and then update a group using
eight SPI.transfer() calls. That cuts way down on the number of parts; your data could be stored as 96 bytes in an array, and you just update the section that changed.
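
Here is a rough sketch of that bookkeeping, not tested on hardware. It keeps the 96 bytes in one array, marks a group dirty when one of its bits changes, and only resends dirty groups. The names (Bus, Display, setBit, flush) are made up for illustration. The three Bus hooks are placeholders: on an Arduino they would presumably wrap digitalWrite() on that group's SS pin and SPI.transfer(). I am also assuming the registers latch when SS is released (74HC595 style) and that the last register in each chain should get its byte first; flip the loop if your wiring is the other way around.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout from the post: 96 shift registers = 12 groups x 8 registers.
constexpr std::size_t kNumGroups = 12;
constexpr std::size_t kRegsPerGroup = 8;
constexpr std::size_t kNumRegs = kNumGroups * kRegsPerGroup;  // 96 bytes

// Hypothetical hardware hooks (assumption, not a real library API).
// On an Arduino: select/release would drive the group's SS pin low/high,
// and transfer would call SPI.transfer(value).
struct Bus {
    void (*select)(std::size_t group);
    void (*transfer)(std::uint8_t value);
    void (*release)(std::size_t group);
};

struct Display {
    std::uint8_t frame[kNumRegs] = {};  // one byte per shift register
    bool dirty[kNumGroups] = {};        // true if the group needs resending
};

// Set or clear one output bit (0..767) and mark its group dirty if it changed.
void setBit(Display& d, std::size_t bit, bool on) {
    assert(bit < kNumRegs * 8);
    const std::size_t reg = bit / 8;
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << (bit % 8));
    const std::uint8_t old = d.frame[reg];
    const std::uint8_t next =
        static_cast<std::uint8_t>(on ? (old | mask) : (old & ~mask));
    if (next != old) {
        d.frame[reg] = next;
        d.dirty[reg / kRegsPerGroup] = true;
    }
}

// Resend every dirty group with eight transfers; returns how many groups were sent.
std::size_t flush(Display& d, const Bus& bus) {
    std::size_t sent = 0;
    for (std::size_t g = 0; g < kNumGroups; ++g) {
        if (!d.dirty[g]) continue;
        bus.select(g);
        // Assumed wiring: the first byte shifted out ends up in the last register.
        for (std::size_t i = kRegsPerGroup; i-- > 0;) {
            bus.transfer(d.frame[g * kRegsPerGroup + i]);
        }
        bus.release(g);  // assumed to latch the outputs
        d.dirty[g] = false;
        ++sent;
    }
    return sent;
}
```

Call setBit() wherever your pattern changes and flush() once per loop(); a frame where only one group changed costs eight transfers instead of 96. Changing kNumGroups and kRegsPerGroup gives you the other groupings.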

Or arrange it other ways: 4 groups of 4x4 with 4 SS lines, updating each group when something in it changes.
Or 16 groups of 2x2. Whatever.