LED cube efficiency

You are missing "multiplexing".
In many cubes, the LEDs are arranged in columns and then layers.
In each of the 64 columns, all the cathodes are connected together.
In each of the 8 layers, all the anodes are connected together.
Then you have a PNP or P-channel MOSFET to supply 64 x 20mA to any one layer of anodes at a time, and 8 shift registers to sink the current from the (up to) 64 LEDs that you want on.
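To make the column/layer wiring concrete, here is a minimal sketch of how the cube state could be held in memory: one byte per shift register, 8 bytes per layer, 8 layers. The names CubeFrame, setVoxel and getVoxel are made up for illustration, and the bit order (bit x of byte y is the column at x, y) is an assumption that has to match how your registers are actually wired.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: layer[z][y] is the byte sent to shift register y while
// layer z is powered; bit x of that byte drives the column at (x, y).
// A 1 bit means "sink current", i.e. LED on.
struct CubeFrame {
    uint8_t layer[8][8] = {};
};

// Turn a single LED on or off in the frame buffer (no hardware access here).
inline void setVoxel(CubeFrame &f, unsigned x, unsigned y, unsigned z, bool on) {
    assert(x < 8 && y < 8 && z < 8);
    const uint8_t mask = static_cast<uint8_t>(1u << x);
    if (on) {
        f.layer[z][y] |= mask;
    } else {
        f.layer[z][y] &= static_cast<uint8_t>(~mask);
    }
}

// Read back the state of a single LED.
inline bool getVoxel(const CubeFrame &f, unsigned x, unsigned y, unsigned z) {
    assert(x < 8 && y < 8 && z < 8);
    return ((f.layer[z][y] >> x) & 1u) != 0;
}
```

The whole cube fits in 64 bytes this way, and the 8 bytes of one layer are already in the form the shift registers want.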
Cycle through all 8 layers rapidly, leaving each layer on for 2.5ms, so the entire display is refreshed 50 times a second and looks flicker-free (1/50 s = 20ms; 20ms / 8 = 2.5ms). Persistence of vision will trick the eye into not seeing the rapid changes. Use SPI.transfer to send the 8 bytes out to the shift registers.
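The layer scan could be sketched roughly like this. It is not a finished driver: CubeHardware, Frame, Scanner, showLayer and scanStep are hypothetical names, and the three hooks stand in for whatever your board does. On an Arduino, sendByte would typically wrap SPI.transfer, latch would pulse the register latch pin, and selectLayer would switch the PNP/P-channel layer drivers. Blanking the layers while shifting is my own assumption to avoid ghosting, and the byte order depends on how the registers are chained.

```cpp
#include <cassert>
#include <cstdint>

// Timing from the post: 50 refreshes per second, 8 layers.
constexpr uint32_t kRefreshHz = 50;
constexpr uint32_t kLayers = 8;
constexpr uint32_t kLayerPeriodUs = 1000000UL / kRefreshHz / kLayers;
static_assert(kLayerPeriodUs == 2500, "2.5ms per layer");

// rows[layer][i] is the byte for shift register i (assumed layout).
struct Frame { uint8_t rows[8][8]; };

// Hypothetical hardware hooks, supplied by the sketch.
struct CubeHardware {
    void (*sendByte)(uint8_t value);   // e.g. wraps SPI.transfer(value)
    void (*latch)();                   // pulse the shift-register latch
    void (*selectLayer)(int layer);    // 0..7 = that layer on, -1 = all off
};

struct Scanner { uint32_t lastUs = 0; int layer = 0; };

// Blank, shift out the 8 column bytes for one layer, latch, then power it.
inline void showLayer(const CubeHardware &hw, const Frame &frame, int layer) {
    assert(layer >= 0 && layer < 8);
    hw.selectLayer(-1);
    for (int i = 0; i < 8; ++i) {
        hw.sendByte(frame.rows[layer][i]);
    }
    hw.latch();
    hw.selectLayer(layer);
}

// Call often with the current time in microseconds (micros() on Arduino).
// Returns true when it advanced to the next layer.
inline bool scanStep(Scanner &s, uint32_t nowUs,
                     const CubeHardware &hw, const Frame &frame) {
    if (nowUs - s.lastUs < kLayerPeriodUs) {
        return false;                  // current layer still has time left
    }
    s.lastUs += kLayerPeriodUs;        // fixed step keeps the rate steady
    showLayer(hw, frame, s.layer);
    s.layer = (s.layer + 1) % 8;
    return true;
}
```

Calling scanStep from loop() with micros() keeps the refresh non-blocking, so the rest of the sketch can run between layer changes.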
Use a part like the TPIC6B595 so that all 8 LEDs on one shift register can be on together and the resulting 160mA does not overwhelm the part (a 74HC595 is only good for 70mA total for the whole device).
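The current figures above are just arithmetic, which can be written out as compile-time checks. This is only a sketch using the numbers quoted in this post (20mA per LED, 70mA total for a 74HC595); the constant and function names are made up, and you should check the datasheets for the parts you actually use.

```cpp
// Figures taken from the post, all in milliamps.
constexpr int kLedCurrentMa = 20;          // per LED while its layer is on
constexpr int kOutputsPerRegister = 8;     // LEDs sunk by one shift register
constexpr int kColumns = 64;               // LEDs in one layer
constexpr int k74HC595TotalLimitMa = 70;   // whole-device figure quoted above

// Worst case for one shift register: all 8 of its LEDs on at once.
constexpr int registerSinkMa() { return kOutputsPerRegister * kLedCurrentMa; }

// Worst case for the layer transistor: all 64 LEDs of the layer on.
constexpr int layerSourceMa() { return kColumns * kLedCurrentMa; }

// True when a load stays within a device limit.
constexpr bool fitsBudget(int loadMa, int limitMa) { return loadMa <= limitMa; }

static_assert(registerSinkMa() == 160, "8 x 20mA per register");
static_assert(layerSourceMa() == 1280, "64 x 20mA per layer");
static_assert(!fitsBudget(registerSinkMa(), k74HC595TotalLimitMa),
              "a fully lit register exceeds the 74HC595 figure");
```

So the layer switch has to handle about 1.28A in the worst case, and each register has to sink 160mA, which is why a plain 74HC595 is not enough at 20mA per LED.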
During the ~2.495ms of each layer period when no data is being shifted (it takes only 4-5us to shift out the 8 bytes), you can be doing other stuff.