You cannot bit-bang it any faster; you have to follow the precise timing that the WS2812 requires. A faster CPU just means faster busy-waiting.
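To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes the usual WS2812 figures of 1.25 µs per bit (800 kHz), 24 bits per LED, and a latch gap of at least 50 µs (some newer variants want a longer gap, so check your datasheet). The 16x16 matrix size is just an example I picked, not something from your setup.

```python
BIT_TIME_US = 1.25   # one bit at the WS2812's 800 kHz data rate
BITS_PER_LED = 24    # 8 bits each for green, red and blue
RESET_US = 50        # assumed minimum latch gap; newer variants may need more


def frame_time_us(num_leds):
    """Time to clock out one full update of a chain, in microseconds."""
    return num_leds * BITS_PER_LED * BIT_TIME_US + RESET_US


# Example: a 16x16 matrix driven as a single chain of 256 LEDs.
print(frame_time_us(256))  # 7730.0
```

That time is set by the protocol, not by the CPU clock, so the only way to shorten an update is to send fewer LEDs per chain.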
You can split the matrix into separate segments and connect each one to a different pin. Then you don't have to update everything when you change only one LED; you only resend the segment that contains it, which saves time.
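A minimal sketch of the bookkeeping for that, written as a plain Python model rather than real driver code. The class name `SegmentedMatrix` and the `send(segment_index, pixels)` callback are hypothetical; on real hardware `send` would be whatever routine bit-bangs one segment out on its own pin. It also assumes the LED count divides evenly into the segments.

```python
class SegmentedMatrix:
    """Tracks which segments changed so only those get resent."""

    def __init__(self, num_leds, num_segments, send):
        if num_leds % num_segments != 0:
            raise ValueError("num_leds must divide evenly into segments")
        self.seg_len = num_leds // num_segments
        # One pixel buffer per segment, all LEDs initially off.
        self.segments = [[(0, 0, 0)] * self.seg_len for _ in range(num_segments)]
        self.dirty = [False] * num_segments
        self.send = send  # hypothetical: writes one segment to its pin

    def set_pixel(self, index, color):
        seg, offset = divmod(index, self.seg_len)
        if self.segments[seg][offset] != color:
            self.segments[seg][offset] = color
            self.dirty[seg] = True  # only this segment needs a resend

    def refresh(self):
        for seg, is_dirty in enumerate(self.dirty):
            if is_dirty:
                self.send(seg, self.segments[seg])
                self.dirty[seg] = False


# Usage: 256 LEDs in 4 segments; record which segments get sent.
sent = []
matrix = SegmentedMatrix(256, 4, lambda seg, pixels: sent.append(seg))
matrix.set_pixel(70, (255, 0, 0))  # LED 70 lives in segment 1
matrix.refresh()
print(sent)  # [1]
```

With four segments, changing one LED costs one quarter of the full update time instead of all of it.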