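Two 'layers' of 16-input multiplexers (16 in the first layer feeding 1 in the second, 17 multiplexers in total) will give you 256 input pins, addressed by 8 Arduino pins: 4 address lines shared by the whole first layer and 4 for the second-layer multiplexer. Another set of 17 multiplexers will give you 256 output pins. Combined in a matrix, you can check any one of 65,536 (256 × 256) crosspoints with 16 Arduino address pins, plus a pin or two for the signal itself. Is that enough?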
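To make the addressing concrete, here is a rough sketch of how a crosspoint number could be split into the 16 address bits. The names (`MuxAddress`, `splitLine`, `splitCrosspoint`) and the bit layout are my own assumptions for illustration: the low 4 bits of each line number go to the first layer and the high 4 bits to the second layer. Your wiring may assign them differently.

```cpp
#include <cstdint>

// Assumed layout (hypothetical): for a line number 0..255, the low nibble
// selects the channel on every first-layer mux (shared address lines) and
// the high nibble selects which first-layer mux the second-layer mux passes.
struct MuxAddress {
    uint8_t layer1;  // 4 bits, shared by the 16 first-layer muxes
    uint8_t layer2;  // 4 bits, for the single second-layer mux
};

constexpr MuxAddress splitLine(uint8_t line) {
    return MuxAddress{static_cast<uint8_t>(line & 0x0F),
                      static_cast<uint8_t>(line >> 4)};
}

// A crosspoint number 0..65535 is treated as (outputLine * 256 + inputLine).
struct Crosspoint {
    uint8_t outputLine;  // which of the 256 driven lines
    uint8_t inputLine;   // which of the 256 sensed lines
};

constexpr Crosspoint splitCrosspoint(uint16_t index) {
    return Crosspoint{static_cast<uint8_t>(index >> 8),
                      static_cast<uint8_t>(index & 0xFF)};
}

// Sanity checks on the arithmetic from the text.
static_assert(16 + 1 == 17, "16 first-layer muxes plus 1 second-layer mux");
static_assert(16 * 16 == 256, "lines per side");
static_assert(256L * 256L == 65536L, "crosspoints in the matrix");
```

On the Arduino itself you would then put each of those bits on its address pin with `digitalWrite()` before driving the selected output line and reading the selected input line. Give the multiplexers a moment to settle after changing the address.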