It would be possible, but it is definitely not more efficient. Data is shifted out via the hardware SPI module, which runs at 8MHz (1 bit every two clock cycles). At first glance it seems like it would be faster to set up three separate data lines and clock them at the same time, but just setting up each line takes an AND and a CP (2 clock cycles) for each bit, then the clock pulse takes another 2 cycles, etc. The dedicated hardware will be much faster than a software implementation.
The library's default display cycle is 976.5625Hz. For the entire display this would be a frame rate of 61.035Hz.
Your limiting factor will definitely be the serial data transfer rate. One refresh takes 12*3*16*16 = 9216 bits. If you use the fastest serial speed, 115200 baud, and account for the start and stop bits (10 bits on the wire for every 8 data bits), you could deliver 92160 data bits/sec, or 10 frames per second. There will be some processing overhead even if the serial data went straight out to the TLCs, so realistically it would be ~8Hz.
See this thread for a similar project with an 8x8 grid.
As for instantiating three separate instances: it won't work, because the library has only one data buffer.