Arduino Original vs Arduino Clone

The flash memory is just the '328P's flash memory. Processing speed is the same as any other '328P: as fast as your code allows.
If you have the shift register data in an array, the outputs can be updated blazingly fast:

PORTB = PORTB & B11111011;  // clear B2 (SS pin low)
for (byte x = 0; x < 12; x = x + 1) {
  SPI.transfer(dataArray[x]);  // shift out one byte per pass
}
PORTB = PORTB | B00000100;  // set B2 (SS pin high to update the output register)

I think this updates the outputs in something like 150 µs (0.15 ms): about 1 µs per byte, which assumes the SPI clock is running at 8 MHz, plus roughly 12 µs for each pass through the for loop.
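To show where that figure comes from, here is the arithmetic written out as a small sketch. The function name estimateUpdateMicros is made up for illustration, and the 1 µs and 12 µs inputs are just the ballpark numbers above, not measurements.

```cpp
#include <cstdint>

// Rough estimate only: every byte costs its shift time plus whatever
// overhead the loop adds per pass. Hypothetical helper, not part of any
// Arduino library.
constexpr uint32_t estimateUpdateMicros(uint32_t byteCount,
                                        uint32_t microsPerByte,
                                        uint32_t loopOverheadMicros) {
  return byteCount * (microsPerByte + loopOverheadMicros);
}

// for-loop version: 12 bytes * (1 us shift + 12 us loop overhead) = 156 us
constexpr uint32_t loopedMicros = estimateUpdateMicros(12, 1, 12);

// no loop overhead at all: 12 bytes * 1 us = 12 us
constexpr uint32_t unrolledMicros = estimateUpdateMicros(12, 1, 0);
```

So 12 bytes come out at 156 µs with the loop overhead, close to the 150 µs guess. With the overhead set to zero the same 12 bytes take only 12 µs of pure shifting time. Timing the real thing with micros() before and after the transfer would tell you what your board actually does.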
Skipping the for loop and just performing 12 SPI.transfers back to back would be really fast, something like 15 µs:

PORTB = PORTB & B11111011;  // clear B2 (SS pin low)

SPI.transfer(dataArray[0]);
SPI.transfer(dataArray[1]);
SPI.transfer(dataArray[2]);
SPI.transfer(dataArray[3]);
SPI.transfer(dataArray[4]);
SPI.transfer(dataArray[5]);
SPI.transfer(dataArray[6]);
SPI.transfer(dataArray[7]);
SPI.transfer(dataArray[8]);
SPI.transfer(dataArray[9]);
SPI.transfer(dataArray[10]);
SPI.transfer(dataArray[11]);

PORTB = PORTB | B00000100; // set B2 (SS pin high to update the output register)
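If the two PORTB masks look like magic numbers: B11111011 is every bit except bit 2, and B00000100 is bit 2 alone. The sketch below shows the same operations on a plain byte so it can be tried off the board. The helper names clearBit and setBit and the constant SS_BIT are made up for illustration.

```cpp
#include <cstdint>

// PB2 is the SS pin (D10) on an Uno. SS_BIT is an illustrative name.
constexpr uint8_t SS_BIT = 2;

// AND with the inverted single-bit mask: forces that bit low and leaves
// the other seven bits untouched (same effect as & B11111011 for bit 2).
constexpr uint8_t clearBit(uint8_t reg, uint8_t bit) {
  return static_cast<uint8_t>(reg & ~(1u << bit));
}

// OR with the single-bit mask: forces that bit high and leaves the other
// seven bits untouched (same effect as | B00000100 for bit 2).
constexpr uint8_t setBit(uint8_t reg, uint8_t bit) {
  return static_cast<uint8_t>(reg | (1u << bit));
}
```

On the board the equivalent would be PORTB = clearBit(PORTB, SS_BIT); before the transfers and PORTB = setBit(PORTB, SS_BIT); after them. Writing the port directly like this is what keeps the SS toggling down to a couple of clock cycles instead of the much slower digitalWrite().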

If one needs more flash and SRAM, then jumping up to a '1284P chip will provide 128K flash, 16K SRAM, and 10 more IO pins.
That can also be arranged in an Uno-compatible layout, with USB/Serial on the board or off (shown off here).