Very nice. The code's looking very good. I'll try to come up with a subclass example on send().
Here's a list of minor things I noticed looking at the latest code. None of them is critical.
_busyPin seems to always be _data_pins. Eliminating it might reduce code size, and save an extra byte of per-instance RAM.
rwSave save in init() appears to be unused now.
Inside init(), should en2 be checked for 255 instead of 0 to see if it's unused?
Inside init2(), it would be advisable to call delayMicroseconds with only 15000 and do the loop 9 times instead of 3. Even though delayMicroseconds takes a 16-bit input, it doesn't actually work properly beyond 16383. In fact, limiting each call to 8191 us might be a good idea, in anticipation of a 32 MHz AVR (e.g., the xmega chips).
Can _displayfunction become only a local variable inside begin2(), possibly saving code size and one more byte of per-instance allocated RAM?
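
On the delayMicroseconds point, here's a rough sketch of what I mean by keeping each call small. This is not the library's code: delayLongMicroseconds and kMaxChunkUs are names I made up, and the delayMicroseconds below is only a recording stand-in so the snippet compiles off-target (on a real board you'd delete the stub and its counters and use the core's function). The 8191 cap assumes a 32 MHz core would scale the count twice as much as the 16 MHz core does, which is a guess on my part.

```cpp
#include <stdint.h>

// Stand-in for the Arduino core's delayMicroseconds(), so this compiles on a PC.
// It only records what it was asked to do. Remove it (and the two counters)
// when building for a real board.
static uint32_t g_totalUs = 0;    // sum of all requested delays
static uint16_t g_largestUs = 0;  // largest single request seen
static void delayMicroseconds(unsigned int us) {
  g_totalUs += us;
  if (us > g_largestUs) g_largestUs = (uint16_t)us;
}

// Hypothetical cap per call: below the 16383 us limit on today's 16 MHz boards,
// with headroom in case a faster core halves that limit (an assumption).
static const uint16_t kMaxChunkUs = 8191;

// Hypothetical helper: wait totalUs microseconds using calls of at most
// kMaxChunkUs each, so no single call exceeds the safe range.
static void delayLongMicroseconds(uint32_t totalUs) {
  while (totalUs > kMaxChunkUs) {
    delayMicroseconds(kMaxChunkUs);
    totalUs -= kMaxChunkUs;
  }
  if (totalUs > 0) {
    delayMicroseconds((unsigned int)totalUs);
  }
}
```

A helper like this would let init2() ask for whatever total wait it needs in one line, at the cost of a few bytes of code. A plain loop around a fixed small delay does the same job if you'd rather not add a function.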