10 times faster LedControl library - Fast replacement for slow shiftOut

Attached is a 10x faster LedControl library.

I just made the LedControl library 10 times faster with a small modification. The problem has always been that shiftOut() is a tired dog which runs on 3 legs. In short, it is SLOW!

This modified library uses a routine internal to LedControl.cpp named shiftOutFast. It gets its speed from driving the software SPI transfer directly with PORTB bit manipulation. The restriction is that the data, clock and chip-select pins must all be on PORTB (digital pins 8 through 13 on an Uno), so you can only have 2 fast SPI units on a single Uno. Usually one is enough. Also, as written, it is MSBFIRST. It could be changed to LSBFIRST with a slight modification. QED.

This same routine works well for LED matrices, and I will upload that to my web site or here in a little while, with examples for driving eight 8x8 LED matrices far faster than you can read them. It is just a total blur at full speed. I expect it could drive more, but I only have 8 of them in series for now, and the library as currently written only supports 8.

And, unlike some other code I have seen (MD_Parola), this is understandable as well as being blindingly fast. It is easy to modify to make it do what you want. Of course, MD_Parola does a lot more really neat stuff. But it displays backwards (times 2) on my LED Matrices and will take some time to fix.

Here is the centerpiece of the speedup. The clockLow, dataLow, clockHigh and dataHigh masks are computed once in the setup routine "LedControl" in the .cpp file, where the pin numbers are captured.

void LedControl::shiftOutFast(uint8_t val) // Write a byte by software SPI, MSB first.
{
  for (uint8_t i = 8; i; i--) {          // Run through the 8 data bits outputting them while clocking.

    PORTB &= clockLow;                   // Set Clock pin low.
    PORTB &= dataLow;                    // Set Data pin low (might change quite quickly!)
    if (val & 0x80) PORTB |= dataHigh;   // Data proved to be high, set Data pin High now.
    PORTB |= clockHigh;                  // Set Clock pin high. Data now captured in unit.
    val <<= 1;                           // Shift user data byte one left for next time.
  }
}  // end of shiftOutFast routine.

It is unclear to me why this code, having been known for at least 5 years, is not already in the standard library under shiftOut or, since it only works on PORTB, call it shiftOutB or whatever.

Enjoy your new speed,
Mike

LedControl.zip (16.3 KB)

MikeyMoMo:
It is unclear to me why this code, having been known for at least 5 years, is not already in the standard library under shiftOut or, since it only works on PORTB, call it shiftOutB or whatever.

Mike, I suspect the reason is portability. Your suggested code will only run on some atmega chips. The standard shiftOut() will run on any chip for which the basic digitalWrite() function has been translated.

If speed is very important, the usual advice is to use the hardware SPI by using SPI.transfer().

Paul

PS. Please use code tags in future.

Great to see that you have made LedControl faster, as it has always been a somewhat slow library.

As Paul mentioned, your change locks the code to the AVR architecture, which is fine for custom development but not really desirable for general libraries. The SPI library uses the hardware SPI and has been ported to the different architectures. MD_MAX72xx has just been updated to use the standard library for this reason: many users are now using the Due and other non-AVR boards.

I am also intrigued as to what you mean by Parola displaying backwards x2. If you have a LED matrix module I have not yet discovered, I would be interested in understanding the characteristics. Rather than hijack this thread, please raise the question at the Parola thread: "LED matrix display - MD_Parola, MD_MAX72xx and MD_MAXPanel" (Exhibition / Gallery, Arduino Forum).

The biggest reason for the slowness is the way Arduino does its i/o, combined with the poor way shiftOut() was written. digitalWrite() uses runtime table lookups, even though some of that information is known at compile time based on which variant is being used. But to really fix this requires changing the digitalWrite() API.
However, there are ways to re-write the code in shiftOut() that are not quite as fast as direct port i/o (they use indirect port i/o) but are portable across all architectures and still much faster than the original Arduino shiftOut().

For example, they are checking the direction for each bit instead of just once up front.
The big speedup, though, is to look up the port address and bitmask only once instead of on each bit.
You can use:
portOutputRegister(digitalPinToPort(pin)) to get the port address for the arduino pin
digitalPinToBitMask(pin) to get the bitmask within the port for the arduino pin

A key thing to keep in mind whenever stomping on the port directly vs letting digitalWrite() do it is that if you want the code to be able to co-exist with other code that runs at interrupt level, you must ensure the atomicity of the port register update.
If you don't, then you run the risk of corrupting the port register.
The code above will corrupt the port register if other code writes to PORTB at ISR level.
The only way to ensure atomicity when using direct port i/o when not using compile time bit masks is to mask interrupts during the register update.
You have to decide whether you do that for each bit or for the entire byte (all 8 bits in the loop).
In this case for speed, I'd do it outside the two loops.

So to do it in a portable way you would

  • get the address of the port register
  • get the bit mask within the port register
  • check for LSB vs MSB first once, and branch to one of two separate sections of code that each do a loop.
  • save the interrupt state and mask interrupts just outside the bit shifting loop
  • do the bit checking and shifting
  • restore the interrupt state
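
The steps above can be sketched roughly as follows. This is only an illustration, not code from LedControl or the Arduino core: shiftOutQuick and clockOutBit are hypothetical names, the 8-bit port types and the SREG/cli() interrupt handling are AVR-flavoured assumptions (other cores would need their own equivalents), and the block under #else is a made-up host-side stand-in so the logic can be compiled and checked off-target.

```cpp
#include <stdint.h>

#if defined(ARDUINO)
#include <Arduino.h>
#else
// Host-side stand-ins (hypothetical) so the logic can be checked off-target.
// Pins 8..13 share one fake 8-bit port, as they share PORTB on an Uno.
static volatile uint8_t fakePortB = 0;
static uint8_t SREG = 0;
static inline void cli() {}
#define MSBFIRST 1
#define LSBFIRST 0
#define digitalPinToPort(pin) (0)
#define portOutputRegister(port) (&fakePortB)
#define digitalPinToBitMask(pin) (static_cast<uint8_t>(1u << ((pin) - 8)))
#endif

// Present one data bit, then take the clock low -> high so the device samples it.
static inline void clockOutBit(volatile uint8_t *dataPort, uint8_t dataMask,
                               volatile uint8_t *clockPort, uint8_t clockMask,
                               bool high) {
  *clockPort = static_cast<uint8_t>(*clockPort & ~clockMask);
  if (high) *dataPort = static_cast<uint8_t>(*dataPort | dataMask);
  else      *dataPort = static_cast<uint8_t>(*dataPort & ~dataMask);
  *clockPort = static_cast<uint8_t>(*clockPort | clockMask);
}

// Hypothetical name; not part of LedControl or the Arduino core.
void shiftOutQuick(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, uint8_t val) {
  // Look up the port addresses and bit masks once, not once per bit.
  volatile uint8_t *dataPort  = portOutputRegister(digitalPinToPort(dataPin));
  volatile uint8_t *clockPort = portOutputRegister(digitalPinToPort(clockPin));
  const uint8_t dataMask  = digitalPinToBitMask(dataPin);
  const uint8_t clockMask = digitalPinToBitMask(clockPin);

  const uint8_t oldSREG = SREG;  // Save the interrupt state (AVR-specific).
  cli();                         // Mask interrupts for the whole byte.

  if (bitOrder == MSBFIRST) {    // Direction is decided once, then one loop runs.
    for (uint8_t i = 8; i; i--) {
      clockOutBit(dataPort, dataMask, clockPort, clockMask, val & 0x80);
      val = static_cast<uint8_t>(val << 1);
    }
  } else {
    for (uint8_t i = 8; i; i--) {
      clockOutBit(dataPort, dataMask, clockPort, clockMask, val & 0x01);
      val = static_cast<uint8_t>(val >> 1);
    }
  }

  SREG = oldSREG;                // Restore the interrupt state.
}
```

Masking interrupts around the whole byte keeps the per-bit cost down, at the price of a slightly longer window in which interrupts are held off.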

You will also want to re-write the way the individual bits are checked, as the way they are doing it is very inefficient. i.e. don't use a loop index and then do math on that index to work out how far to shift the value on every pass.
Once you have two loops (one for each direction) you simply shift the original value 1 bit each time and check a constant bit position. This is WAY faster than what they are doing and this simple change without messing with the indirect port i/o will offer a performance increase.
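
To make the difference concrete, here is a small sketch of the two styles of bit selection. The function names are made up for illustration, and "received" just stands in for the shift register on the receiving end; the index-based version is written in the style described above, not copied from the Arduino core.

```cpp
#include <stdint.h>

// Index-based selection: every pass computes a new shift amount from the loop
// index. AVR has no barrel shifter, so a variable shift is itself a small loop.
uint8_t sendByIndex(uint8_t val, bool msbFirst) {
  uint8_t received = 0;  // Stand-in for the receiver's shift register.
  for (uint8_t i = 0; i < 8; i++) {
    const uint8_t shift = msbFirst ? static_cast<uint8_t>(7 - i) : i;
    const uint8_t bit = static_cast<uint8_t>((val >> shift) & 1);
    received = static_cast<uint8_t>((received << 1) | bit);
  }
  return received;
}

// Shift-and-test, MSB first: constant mask, one-bit shift per pass.
uint8_t sendMsbFirst(uint8_t val) {
  uint8_t received = 0;
  for (uint8_t i = 8; i; i--) {
    const uint8_t bit = (val & 0x80) ? 1 : 0;
    received = static_cast<uint8_t>((received << 1) | bit);
    val = static_cast<uint8_t>(val << 1);
  }
  return received;
}

// Shift-and-test, LSB first: same idea, testing bit 0 and shifting right.
uint8_t sendLsbFirst(uint8_t val) {
  uint8_t received = 0;
  for (uint8_t i = 8; i; i--) {
    const uint8_t bit = (val & 0x01) ? 1 : 0;
    received = static_cast<uint8_t>((received << 1) | bit);
    val = static_cast<uint8_t>(val >> 1);
  }
  return received;
}
```

Both styles put the same bits on the wire in the same order; only the per-bit work differs.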

All these options are fully portable and much faster than the existing shiftOut() code.
The good question is why isn't the official code doing any of these things to bump the performance?

--- bill

Good points, from all. Thanks!!

I had not expanded my brain to think of other platforms. Sometimes compatibility can cause slowness for all the standard reasons.

I found the right setting for MD_Parola and it looks good now. Quite complex. Seems hard to use though. Will take lots of reading. But lots of function there. Someone has put lots of time into that code!

This is pretty quick. I get 280 full 8-matrix updates per second using this code. 400 microseconds per matrix including driver code as I figure it. But, as mentioned, limited to certain hardware.
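
As a rough cross-check of those numbers, taking the 280 updates per second at face value and assuming the time splits evenly across the 8 matrices (the constant and function names here are just for illustration):

```cpp
#include <stdint.h>

constexpr uint32_t kUpdatesPerSecond = 280;  // Figure quoted above.
constexpr uint32_t kMatrices = 8;            // Matrices in the chain.

// Whole microseconds for one full update of all matrices (integer division).
constexpr uint32_t microsPerUpdate() { return 1000000UL / kUpdatesPerSecond; }

// Whole microseconds per matrix, assuming the time is shared evenly.
constexpr uint32_t microsPerMatrix() { return microsPerUpdate() / kMatrices; }
```

That comes to about 3571 microseconds per full update and about 446 per matrix, so a little above the 400 microsecond estimate.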

I have already calculated the PORTB offset when the pin number is passed in.

  SPI_MOSI = dataPin;
  dataHigh = (1 << (dataPin-8)); // Used for Or'ing on data bits
  dataLow  = ~dataHigh;          // Used for And'ing off data bits

  SPI_CLK   = clkPin;
  clockHigh = (1 << (clkPin-8)); // Used for Or'ing the clock pin High
  clockLow  = ~clockHigh;        // Used for And'ing the clock pin Low

// The following seems backwards, but that's the way: enabled = low on CS, disabled = high on CS.
  SPI_CS = csPin;
  ChipSelectLatch = (1 << (csPin-8)); // High - Used to Or CS High, latching the data into the display
  ChipSelectLoad = ~ChipSelectLatch;  // Low  - Used to And CS Low, allowing data to be loaded
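
One thing the setup code above takes on trust is that every pin really is on PORTB. A minimal sketch of a guarded version of the same mask calculation, assuming an Uno-style mapping where digital pins 8 to 13 are PB0 to PB5 (portBMask is a made-up helper name, not something in LedControl):

```cpp
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: bit mask within PORTB for an Uno digital pin.
// Pins below 8 would make (pin - 8) negative, and shifting by a negative
// amount is undefined, so the range is checked first.
uint8_t portBMask(uint8_t pin) {
  assert(pin >= 8 && pin <= 13);  // Uno: digital 8..13 are PB0..PB5.
  return static_cast<uint8_t>(1u << (pin - 8));
}
```

With that, dataHigh would be portBMask(dataPin) and dataLow its complement, exactly as in the code above.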

And since I don't check, this does MSBFIRST only. Trivial to change it to do either.

void LedControl::shiftOutFast(uint8_t val) // Write a byte by software SPI, MSB first.
{
  for (uint8_t i = 8; i; i--) {            // Run through the 8 data bits outputting them while clocking.
    PORTB &= clockLow;                     // Set Clock pin low.
    PORTB &= dataLow;                      // Set Data pin low (might change quite quickly!)
    if (val & 0x80) PORTB |= dataHigh;     // Data proved to be high, set Data pin High now.
    PORTB |= clockHigh;                    // Set Clock pin high.
    val <<= 1;                             // Shift user data byte one left for next time.
  }
}  // end of shiftOutFast routine.

I guess this will end here. Seems like I had been scooped before I even started. But still fun to play with.

Mike