How fast is ShiftOut

I cannot find anywhere how fast the shiftOut instruction works, e.g. how long it takes to shift out one byte. I'm using an Arduino Uno. Thanks, Arnoud

Snippet of the shiftOut code:

void shiftOut(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, uint8_t val)
{
	uint8_t i;

	for (i = 0; i < 8; i++)  {
		if (bitOrder == LSBFIRST)
			digitalWrite(dataPin, !!(val & (1 << i)));
		else	
			digitalWrite(dataPin, !!(val & (1 << (7 - i))));
			
		digitalWrite(clockPin, HIGH);
		digitalWrite(clockPin, LOW);		
	}
}

So as fast as possible. But it’s all in software so not that fast…

But really, it’s an XY problem. What’s your real problem?

Hi, I think my answer would be that if you care how fast it is, it's probably not fast enough for you. If speed is important, use the Arduino's built-in SPI hardware. When speed is not important, use shiftOut().

Paul

Thanks Paul.
I’m multiplexing an LED display (12 columns, 5 rows for 4 5x3 digits) for a clock.
I think I need 4ms change rate, so every row is on 4ms every 20ms.
It seems not fast enough.
I’ll start playing with SPI

If I somehow did not follow the forum rules, I’m sorry.

In that case shiftOut() will be fast enough. If you had 10 times as many leds, then SPI would be definitely worth looking at. But go ahead and try SPI anyway, it's good learning.

You didn't break any forum rules as such, but it would have helped us give you the right answer first time if you had explained your application in your first post.

Thanks Paul. I'm a newbie to Arduino but an experienced electronics engineer, which is why I phrased my question in such general terms. I'll try SPI anyway, it's good to learn. Also, after some experimenting, I think 3x 595 (2 for columns, 1 for rows) is at the "edge" for flickering. Thanks again.

SPI is way faster. You can increase the clock speed from the default 4 MHz up to 8 MHz, which takes 3-4 µs to send out 3 bytes, and use direct port manipulation to replace the much slower digitalWrite as well.

PORTD = PORTD & 0b11111011; // clear D2 for example as '595 RCLK signal
SPI.transfer (byte0); // or pull the data from an array.  D13 to SRCLK, D11 to Serial data in
SPI.transfer (byte1); // or pull the data from an array
SPI.transfer (byte2); // or pull the data from an array

PORTD = PORTD | 0b00000100; // RCLK high to load output stage

vs shiftOut, where each bit is placed on the output pin in code and the clock is generated in code. SPI uses dedicated hardware to blast the data out fast.
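For a rough feel of the numbers quoted above, the raw clocking time of a hardware SPI transfer is just bits divided by clock rate. What follows is a minimal host-side sketch, not Arduino library code: `effectiveSpiClockHz` and `spiTransferNanos` are hypothetical helper names, and it assumes the AVR SPI master can only run at the CPU clock divided by 2, 4, 8, 16, 32, 64 or 128. On a 16 MHz Uno that makes 8 MHz the ceiling. Per-byte software overhead is ignored.

```cpp
#include <cstdint>

// Hypothetical helper: fastest clock the AVR SPI master can produce without
// exceeding the requested rate. Assumes the hardware dividers 2, 4, ... 128.
uint32_t effectiveSpiClockHz(uint32_t cpuHz, uint32_t requestedHz) {
    uint32_t divider = 2;
    while (divider < 128 && cpuHz / divider > requestedHz) {
        divider *= 2;
    }
    return cpuHz / divider;
}

// Hypothetical helper: time spent purely clocking bits out, in nanoseconds.
// Ignores call overhead and the gap between consecutive SPI.transfer() calls.
uint64_t spiTransferNanos(uint32_t byteCount, uint32_t clockHz) {
    return (uint64_t)byteCount * 8u * 1000000000ull / clockHz;
}
```

Under these assumptions, 3 bytes at 8 MHz come to 3000 ns of pure clocking, which is in line with the 3-4 µs figure once some overhead is added.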

"but I'm experienced electronics engineer" So you would also know how to use an oscilloscope to compare the timing of the two if you wanted. 5 rows on for 4 ms each = 20 ms, 1/0.02 = 50 Hz; I wouldn't think that would seem flickery. Are you turning one row off before turning the next on? That would eliminate any ghosting type effects too.

Thanks. Yes, I was also thinking about hooking up a scope to observe the clock. Will do after I figured out SPI and then compare. May be useful for many. Yes, I turn off a row before turning the next one on. The ShiftOut examples do that anyway.

Done some experiments. Arduino Uno.

  • ShiftOut does 1 byte in ca. 150us

  • SPI.beginTransaction(SPISettings(16000000,MSBFIRST,SPI_MODE0)) then SPI does 1 byte in ca. 1us.

Thanks everybody.
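To put those two measurements in the context of the clock display, one can ask what share of each 4 ms row slot is spent shifting data. The sketch below is a back-of-the-envelope estimate only; `shiftLoadPpm` is a hypothetical name, and it assumes 3 bytes per row (one per 595, as mentioned earlier in the thread) and a constant per-byte cost.

```cpp
#include <cstdint>

// Share of one row's on-time spent shifting data, in parts per million.
// All time inputs are in microseconds; the name is illustrative only.
uint32_t shiftLoadPpm(uint32_t bytesPerRow, uint32_t microsPerByte, uint32_t rowSlotMicros) {
    uint64_t busyMicros = (uint64_t)bytesPerRow * microsPerByte;
    return (uint32_t)(busyMicros * 1000000ull / rowSlotMicros);
}
```

With the measured figures, `shiftLoadPpm(3, 150, 4000)` gives 112500 ppm (about 11% of the slot) for shiftOut, and `shiftLoadPpm(3, 1, 4000)` gives 750 ppm (under 0.1%) for SPI. Both leave most of the row slot free, but the shiftOut version spends a noticeable part of it with the display being updated.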

septillion:
So as fast as possible. But it's all in software so not that fast...

But really, it's an XY problem. What's your real problem?

Nowhere near as fast as possible - look at all that loop-invariant code.
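As an illustration of the loop-invariant point, here is a minimal sketch that tests the bit order once and then runs a dedicated loop with a walking mask. It is written to run on a PC, so `fakeDigitalWrite`, `pinLog`, `clockBit` and `shiftOutHoisted` are hypothetical stand-ins, and a plain `bool` replaces the LSBFIRST/MSBFIRST constants. It does not address the other per-call cost: the AVR core's digitalWrite() repeats its pin-to-port lookup on every call.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for digitalWrite(): records (pin, level) so the sketch runs on a PC.
struct PinWrite { uint8_t pin; uint8_t level; };
std::vector<PinWrite> pinLog;
void fakeDigitalWrite(uint8_t pin, uint8_t level) { pinLog.push_back({pin, level}); }

// Put one bit on the data pin, then pulse the clock pin high and low.
static void clockBit(uint8_t dataPin, uint8_t clockPin, bool bit) {
    fakeDigitalWrite(dataPin, bit ? 1 : 0);
    fakeDigitalWrite(clockPin, 1);
    fakeDigitalWrite(clockPin, 0);
}

// shiftOut() variant with the bit-order test hoisted out of the loop.
void shiftOutHoisted(uint8_t dataPin, uint8_t clockPin, bool lsbFirst, uint8_t val) {
    if (lsbFirst) {
        // Mask walks up from bit 0; it wraps to 0 after bit 7 and ends the loop.
        for (uint8_t mask = 0x01; mask != 0; mask <<= 1) {
            clockBit(dataPin, clockPin, (val & mask) != 0);
        }
    } else {
        // Mask walks down from bit 7; it reaches 0 after bit 0 and ends the loop.
        for (uint8_t mask = 0x80; mask != 0; mask >>= 1) {
            clockBit(dataPin, clockPin, (val & mask) != 0);
        }
    }
}
```

Each call still produces 24 pin writes per byte (one data write and two clock writes per bit), so the gain is only in the bookkeeping around them.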

Arnoud: - ShiftOut does 1 byte in ca. 150us then SPI does 1 byte in ca. 1us.

Wow, quite a difference. But in practice, just how much difference does it actually make?

What if the Arduino needs to transfer 1KB? If that data is held in an array in RAM, and a for() loop is performed to go through each byte, what's the timing for shiftOut() vs SPI.transfer() for the 1024 bytes of data?
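Nobody in the thread measured the 1 KB case, but a first-order estimate follows from the per-byte figures quoted above. This is an extrapolation, not a measurement: `estimateTransferMicros` is a hypothetical helper that assumes a constant cost per byte and ignores the for() loop and array access overhead.

```cpp
#include <cstdint>

// Extrapolated transfer time in microseconds, assuming a constant per-byte cost.
uint32_t estimateTransferMicros(uint32_t byteCount, uint32_t microsPerByte) {
    return byteCount * microsPerByte;
}
```

With ca. 150 µs per byte, 1024 bytes would take roughly 153600 µs (about 0.15 s) through shiftOut(), against roughly 1024 µs (about 1 ms) at ca. 1 µs per byte over SPI.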

If anybody is still reading this, shiftOut is slow. I’m playing with some 595s just now and ran some timing tests on an Uno, using some code from Fiz-ix to replace shiftOut with a separate function, myShiftOut. I’ve augmented this with myShiftOut16, which will handle an unsigned integer.


const byte COL_COUNT=8;
unsigned int sequence[COL_COUNT]={0B0000000100000001, 0B0000001000000011, 0B0000010000000111,
                                 0B0000100000001111, 0B0001000000011111, 0B0010000000111111,
                                 0B0100000001111111, 0B1000000011111111};
// unsigned int sequence[COL_COUNT]={257,515,1031,2063,4127,8255,16511,33023};

unsigned long oldMillis;  // millis() returns unsigned long

static inline void latchPinH() {bitSet(PORTB,0);}
static inline void latchPinL() {bitClear(PORTB,0);}
static inline void clockPinH() {bitSet(PORTB,4);}
static inline void clockPinL() {bitClear(PORTB,4);}
static inline void dataPinH() {bitSet(PORTB,3);}
static inline void dataPinL() {bitClear(PORTB,3);}

int latchPin = 8;   // B0
int clockPin = 12;  // B4
int dataPin = 11;   // B3

void setup(){
 pinMode(latchPin, OUTPUT);
 pinMode(clockPin, OUTPUT);
 pinMode(dataPin, OUTPUT);
 Serial.begin(9600);
}

void loop(){
 oldMillis=millis();
 for (int j=1; j<=1000; j++){
   for (int col = 0; col < COL_COUNT; col++){
     //digitalWrite(latchPin,LOW);
     latchPinL();
     //shiftOut(dataPin,clockPin,MSBFIRST,(sequence[col]>>8));
     //myShiftOut((sequence[col]>>8));
     //shiftOut(dataPin,clockPin,MSBFIRST,sequence[col]);
     //myShiftOut(sequence[col]);
     myShiftOut16(sequence[col]);
     //digitalWrite(latchPin,HIGH);
     latchPinH();
     //delay(30);
   }
 }

 Serial.println(millis()-oldMillis);
}

void myShiftOut(byte data){
 boolean pinState;
 clockPinL();
 dataPinL();

 for (int i=0; i<=7; i++){
   if (data & (B10000000>>i)) dataPinH();
   else dataPinL();
   clockPinH();
   clockPinL();
 }
}

void myShiftOut16(uint16_t data){
 boolean pinState;
 clockPinL();
 dataPinL();

 for (int i=0; i<=15; i++){
   if (data & (0B1000000000000000>>i)) dataPinH();
   else dataPinL();
   clockPinH();
   clockPinL();
 }
}

Now the results from the 1000 loop using shiftOut is 2480ms while myShiftOut clocked in at 354ms. Significant.
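These totals can be turned into a per-byte cost for comparison with the figures earlier in the thread. The sketch below is only a rough conversion: `nanosPerByte` is a hypothetical helper, and it assumes each of the 8 columns was sent as two byte-wide calls per pass, so that 1000 passes shift 16000 bytes. The latch toggling and loop overhead are folded into the result.

```cpp
#include <cstdint>

// Average cost per byte in nanoseconds for the benchmark loop:
// passes x columns x bytesPerColumn bytes are shifted in totalMillis.
uint32_t nanosPerByte(uint32_t totalMillis, uint32_t passes,
                      uint32_t columns, uint32_t bytesPerColumn) {
    uint32_t byteCount = passes * columns * bytesPerColumn;
    return (uint32_t)((uint64_t)totalMillis * 1000000ull / byteCount);
}
```

On that assumption, 2480 ms works out to 155000 ns (155 µs) per byte for shiftOut, close to the ca. 150 µs reported earlier, and 354 ms works out to 22125 ns (about 22 µs) per byte for myShiftOut.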

The more interesting thing however (for me), is that when I added myShiftOut16, it was quite a bit slower than using myShiftOut twice (byte by byte). Okay, never mind. When I went back to the byte-wide routine, the timing had now jumped to 503ms. Not really understanding why this had changed all of a sudden, I tried resetting, reloading … all the same. Finally, I commented out the myShiftOut16 function and the timing went back to 354ms.

So, all you hot young engineers, I’m an old man - explain this one to me.

Hello, please edit your post above and put code tags around your sketch. It’s the </> icon.

I can’t explain some of your strange timings, but I suppose the 16-bit version might have been more than twice as slow as the 8-bit version because you are running the code on an 8 bit CPU.
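One way to keep all the work byte-wide is to send a 16-bit value as two 8-bit passes, each using a walking mask that moves one position per iteration. The AVR has only single-bit shift instructions, so a shift by a variable count such as `>>i` is typically compiled into a loop. Whether that accounts for any of the timings in this thread is untested. The sketch below runs on a PC; `shiftedBits`, `clockOutBit`, `shiftOut8` and `shiftOut16` are hypothetical names standing in for the real pin toggling.

```cpp
#include <cstdint>
#include <string>

// Stand-in for the data/clock pins: each clocked bit is appended as '0' or '1'.
std::string shiftedBits;
void clockOutBit(bool bit) { shiftedBits += bit ? '1' : '0'; }

// 8-bit shift, MSB first, using a walking mask: one single-bit shift per
// iteration instead of a shift by the variable amount i.
void shiftOut8(uint8_t data) {
    for (uint8_t mask = 0x80; mask != 0; mask >>= 1) {
        clockOutBit((data & mask) != 0);
    }
}

// 16-bit value sent MSB first as two byte-wide passes.
void shiftOut16(uint16_t data) {
    shiftOut8((uint8_t)(data >> 8));    // high byte first
    shiftOut8((uint8_t)(data & 0xFF));  // then low byte
}
```

Calling `shiftOut16(0x0203)` clocks out the same 16 bits, in the same order, as a single loop with a 16-bit mask would.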

It’s not difficult to come up with some code to replace ShiftOut() that’s faster, but there are disadvantages to doing so. Firstly, the code you come up with probably won’t be platform independent. That’s true of your code. It may work on AVR-based Arduino, but not something based on ARM or ESP. Secondly, it’s missing my earlier point that ShiftOut() isn’t intended to be fast; if speed is important you should be using SPI, which most non-AVR chips have.

You would be surprised how many of us here are no longer spring chickens!

Thanks for your comments PaulRB. Code tags done.

Sounds about right with the 8-bit CPU trying to perform 16-bit ops, and platform independence is an issue as well, I agree.

My more confusing dilemma is this: if you look at the code as posted, it runs in 632ms. If you comment out the ‘myShiftOut16(sequence[col]);’ in loop() and un-comment the 2 myShiftOut calls, the time drops to 503ms. This is explained by your 8/16-bit theory.

Now comment out the entire ‘void myShiftOut16(uint16_t data)’ function, which should do nothing more than take up space (but would an optimizing compiler even deal with that bit of code as it’s not called any more???) and the time drops to 354ms. If the program were large enough that the additional, albeit unused, code caused it to exceed a memory page limit, one might argue that pointers to the function may have gone from 16 to 32-bit. Such is not the case. So I remain baffled, not stalled, as to why an unused function would cause a loop to slow down.

What I try to do with embedded systems PaulRB, is optimize to the max in terms of speed and memory (ie. minimize the overhead). Invariably this process becomes platform/MPU dependent, but by then we’re beyond portability. With this example I don’t have an answer just yet.

And I apologize for the reference to the season of your chicken, no offense was intended.

I understood your point about a function that is not even called somehow affecting performance elsewhere in the code. It makes no sense at all. As I understand it, code that is not called or referenced elsewhere is not included in the compiled image that gets uploaded to the board.