Few bytes, long distance, fast communication... wanted!

Can you tell us more about what it is used for?
What timing is needed? Is 100 ms acceptable as a delay, or do you need 1 ms?
Are the Arduino boards in one long line? If the communication is cascaded, the 1 ms per board could turn into 40 * 1 ms, for a total of 40 milliseconds.
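
To make the cascade arithmetic concrete, here is a minimal sketch. It assumes every board adds the same fixed forwarding delay and that the message passes through each board in turn. The function name `cascadedDelayMs` and its parameters are made up for illustration and are not from any library.

```cpp
#include <cstdint>

// Worst-case delay when a message is relayed board to board along a chain.
// hops:     number of boards the message passes through (assumed)
// perHopMs: forwarding delay of a single board in milliseconds (assumed)
constexpr std::uint32_t cascadedDelayMs(std::uint32_t hops, std::uint32_t perHopMs) {
    return hops * perHopMs;
}
```

With 40 boards and 1 ms per board, `cascadedDelayMs(40, 1)` gives 40, so the last board reacts about 40 ms after the first. A real chain would probably vary per hop, so treat this as a rough upper-bound estimate only.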

Perhaps it is possible to synchronize the Arduino boards, so that the boards themselves can compensate for the signal delay. But that depends on what it is used for.