The biggest reason for the slowness is the way Arduino does its i/o combined with the poor way shiftOut() was written. digitalWrite() uses runtime table lookups. Some of that information is known at compile time based on which variant is being used, but to really fix this requires changing the digitalWrite() API.
However, there are ways to rewrite the code in shiftOut() that are not quite as fast as direct port i/o (they use indirect port i/o) but are portable across all architectures and still much faster than the original Arduino shiftOut().
For example, they are checking the bit order (LSB vs MSB first) for each bit vs just once up front.
Then the big speedup is to look up the port address and bitmask only once vs on each bit.
You can use:
portOutputRegister(digitalPinToPort(pin)) to get the port address for the arduino pin
digitalPinToBitMask(pin) to get the bitmask within the port for the arduino pin
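
As a rough sketch of what caching those two lookups can look like: FastPin, fastPinLookup(), fastPinHigh() and fastPinLow() are names I made up for illustration, not part of the Arduino core. It assumes an 8-bit AVR core where the port register is a volatile uint8_t (on 32-bit cores the register type is wider), and the #else branch is only a stand-in so the snippet also compiles off-target.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#else
// Host-side stand-ins (hypothetical, illustration only) so this compiles off-target.
#include <stdint.h>
static volatile uint8_t fakePort;
#define digitalPinToPort(pin)     (1)
#define portOutputRegister(port)  (&fakePort)
#define digitalPinToBitMask(pin)  ((uint8_t)(1u << ((pin) & 7)))
#endif

// Looked up once, then reused for every bit instead of calling digitalWrite().
struct FastPin {
  volatile uint8_t *reg;  // address of the port output register
  uint8_t mask;           // bit for this pin within that register
};

static FastPin fastPinLookup(uint8_t pin) {
  FastPin p;
  p.reg  = portOutputRegister(digitalPinToPort(pin));
  p.mask = digitalPinToBitMask(pin);
  return p;
}

// Read-modify-write on the port: NOT atomic by itself.
static inline void fastPinHigh(const FastPin &p) { *p.reg |= p.mask; }
static inline void fastPinLow(const FastPin &p)  { *p.reg &= (uint8_t)~p.mask; }
```

The pin is assumed to already be set to OUTPUT with pinMode(), and unlike digitalWrite() nothing here turns off PWM on the pin.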
A key thing to keep in mind whenever stomping on the port directly vs letting digitalWrite() do it is that if you want the code to be able to co-exist with other code that runs at interrupt level, you must ensure the atomicity of the port register update.
If you don't, then you run the risk of corrupting the port register.
The code above will corrupt the port register if other code writes to PORTB at ISR level.
The only way to ensure atomicity when using direct port i/o when not using compile time bit masks is to mask interrupts during the register update.
You have to decide if you do that for each bit or for the entire byte (all 8 bits in the loop).
In this case for speed, I'd do it outside the two loops.
So to do it in a portable way you would:
- get the address of the port register
- get the bit mask within the port register
- check for LSB vs MSB first and split off into two separate sections of code that each do a loop
- save the interrupt state and mask interrupts just outside the bit shifting loop
- do the bit checking and shifting
- restore the interrupt state
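
Putting the steps above together, something along these lines should work. Treat it as a sketch rather than a drop-in replacement: fastShiftOut() is a name I made up, saving and restoring the interrupt state through SREG and cli() is the AVR way of doing it (other architectures need their own equivalent), the 8-bit register type is also an AVR assumption, and the #else branch is just a host-side stand-in so it compiles off-target.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#else
// Host-side stand-ins (hypothetical, illustration only) so this compiles off-target.
#include <stdint.h>
static volatile uint8_t fakePort;
static uint8_t SREG;
#define cli() ((void)0)
#define LSBFIRST 0
#define MSBFIRST 1
#define digitalPinToPort(pin)     (1)
#define portOutputRegister(port)  (&fakePort)
#define digitalPinToBitMask(pin)  ((uint8_t)(1u << ((pin) & 7)))
#endif

// Assumes both pins are already OUTPUT and the clock idles low.
void fastShiftOut(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, uint8_t val) {
  // Steps 1 and 2: look up port addresses and bit masks once, not per bit.
  volatile uint8_t *dataReg = portOutputRegister(digitalPinToPort(dataPin));
  volatile uint8_t *clkReg  = portOutputRegister(digitalPinToPort(clockPin));
  const uint8_t dataMask = digitalPinToBitMask(dataPin);
  const uint8_t clkMask  = digitalPinToBitMask(clockPin);

  // Save the interrupt state and mask interrupts for the whole byte (AVR-specific).
  const uint8_t oldSREG = SREG;
  cli();

  // Step 3: decide the bit order once, then run one of two tight loops.
  if (bitOrder == LSBFIRST) {
    for (uint8_t i = 8; i; i--) {
      if (val & 0x01) *dataReg |= dataMask;
      else            *dataReg &= (uint8_t)~dataMask;
      *clkReg |= clkMask;             // clock high
      *clkReg &= (uint8_t)~clkMask;   // clock low
      val >>= 1;
    }
  } else {
    for (uint8_t i = 8; i; i--) {
      if (val & 0x80) *dataReg |= dataMask;
      else            *dataReg &= (uint8_t)~dataMask;
      *clkReg |= clkMask;
      *clkReg &= (uint8_t)~clkMask;
      val <<= 1;
    }
  }

  // Restore whatever interrupt state the caller had.
  SREG = oldSREG;
}
```

Restoring the saved state instead of blindly re-enabling interrupts means it still behaves if it is called with interrupts already masked. Masking around the whole byte is the faster choice; if that keeps interrupts off for too long in your application, move the save/restore inside the loop around each register update.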
You will also want to rewrite the way the individual bits are checked, as the way they are doing it is very inefficient. i.e. don't do it the dumb way they are doing it, by using a loop index and then doing math on that index to determine how much to shift the value each time.
Once you have two loops (one for each direction) you simply shift the original value 1 bit each time and check a constant bit position. This is WAY faster than what they are doing, and this simple change alone, without messing with the indirect port i/o, will offer a performance increase.
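
To make the difference concrete, here is a small sketch of the two ways of walking the bits. The function names are made up for illustration, and the emit callback just stands in for whatever writes the data pin. The index-based version is in the style described above; the other two test a constant bit position and shift by one. AVR only has single-bit shift instructions, so a shift by a variable amount typically turns into a small loop of its own on every pass.

```cpp
#include <stdint.h>

// Index-based style: the shift count is recomputed from i for every bit.
template <typename Emit>
void bitsByIndexMsbFirst(uint8_t val, Emit emit) {
  for (uint8_t i = 0; i < 8; i++) {
    emit((val & (1 << (7 - i))) != 0);
  }
}

// Constant-position style, MSB first: always test bit 7, then shift left by one.
template <typename Emit>
void bitsMsbFirst(uint8_t val, Emit emit) {
  for (uint8_t i = 8; i; i--) {
    emit((val & 0x80) != 0);
    val <<= 1;
  }
}

// Constant-position style, LSB first: always test bit 0, then shift right by one.
template <typename Emit>
void bitsLsbFirst(uint8_t val, Emit emit) {
  for (uint8_t i = 8; i; i--) {
    emit((val & 0x01) != 0);
    val >>= 1;
  }
}
```

Both MSB-first versions produce the same bit sequence; only the amount of work per bit differs.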
All these options are fully portable and much faster than the existing shiftOut() code.
The real question is why the official code isn't doing any of these things to bump the performance.
--- bill