Exactly how slow is digitalRead()?

A single I/O instruction takes 0.125us. The ATmega chip is optimized for single-bit I/O instructions.
A digitalRead() call takes about 3.6us, not counting the loop around it. The 4.78us figure includes the loop overhead.
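
To put those numbers in terms of clock cycles, here is a small sketch of the arithmetic. The 16 MHz clock is an assumption (it is the usual clock on an Uno-class board, but it is not stated above), and the names CLOCK_MHZ and usToCycles are made up for this illustration.

```cpp
// Assumption: 16 MHz system clock; change CLOCK_MHZ for other boards.
constexpr double CLOCK_MHZ = 16.0;

// Convert a duration in microseconds to CPU clock cycles.
constexpr double usToCycles(double us, double mhz = CLOCK_MHZ) {
    return us * mhz;
}

// Timings quoted in the post.
constexpr double SINGLE_IO_US    = 0.125;  // one I/O instruction
constexpr double DIGITAL_READ_US = 3.6;    // digitalRead() alone
constexpr double WITH_LOOP_US    = 4.78;   // digitalRead() plus the loop

// Time per iteration spent on the loop itself.
constexpr double LOOP_OVERHEAD_US = WITH_LOOP_US - DIGITAL_READ_US;
```

Under that assumption, usToCycles(SINGLE_IO_US) is 2 cycles and usToCycles(DIGITAL_READ_US) is about 58 cycles, so digitalRead() costs roughly 29 times as much as a single I/O instruction.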

Using digitalPinToPort() and the related macros speeds things up a lot. It is still not as fast as 0.125us, because a few variables have to be read from memory.
The digital...Fast functions are more or less portable. For full portability you are stuck with that 3.6us.
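
A minimal sketch of the digitalPinToPort() approach, assuming an AVR-based board where portInputRegister() returns a pointer to an 8-bit register. The pin number and the helper name fastRead are hypothetical, chosen only for this example. The slow lookups are done once in setup(); each read in loop() then only has to fetch the pointer and the mask from memory, which is why it cannot quite reach 0.125us.

```cpp
#include <stdint.h>

// Read one bit from an input register: true if the masked bit is set.
static inline bool fastRead(const volatile uint8_t *inputReg, uint8_t bitMask) {
    return (*inputReg & bitMask) != 0;
}

#ifdef ARDUINO
#include <Arduino.h>

const uint8_t INPUT_PIN = 2;   // hypothetical pin, pick your own

volatile uint8_t *inputReg;    // address of the pin's PINx register (AVR assumption)
uint8_t bitMask;               // bit of that register belonging to the pin

void setup() {
    pinMode(INPUT_PIN, INPUT);
    // Do the table lookups once instead of on every read.
    inputReg = portInputRegister(digitalPinToPort(INPUT_PIN));
    bitMask  = digitalPinToBitMask(INPUT_PIN);
}

void loop() {
    bool level = fastRead(inputReg, bitMask);
    (void)level;               // use the value here
}
#endif
```

The code stays portable across AVR boards because the pin-to-port mapping still comes from the core's own tables. On cores with wider port registers the pointer type would have to change.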