Higher Performance IO

I am fairly new to arduino, so forgive me if some of these things are obvious... I am in the middle of a project which will involve performing TDM across a bunch of shift register lines to control some large 8 segment LED displays. To do this I have been thinking of using an ISR with FrequencyTimer2 to output data using SPI to my latches. What I have come to realize though is that SPI using shiftOut() is not very fast.

I performed some rudimentary measurements and it seems like shiftOut() takes about 110us to output a byte. Perhaps this explains the 9600 baud limitation to the SoftwareSerial library? After delving into the shiftOut() and digitalWrite() code, (shiftOut() simply calls a bunch of digitalWrite()s in succession), I was able to perform some simple optimizations since digitalWrite() performs much of the same preparatory stuff over and over. I was able to create a dropin replacement for shiftOut() that works in 31ms. This could probably be further optimized but I thought that it was a good start.

This leas me to wonder whether others need higher performance IO in general with the Arduino and whether the core libraries should potentially be better optimized. Perhaps it would be even good to create a separate high performance IO library which may have slightly more complicated but more accurate IO functions. These functions might be slightly harder to use than the standard functions, but if properly used could give a substantial performance improvement still? For example: I split the shiftOut() function into a shiftOutPrep() and a shiftOutRaw() function. If using repetitive data, it can be prepared with shiftOutPrep() and then transmitted with shiftOutRaw(). If the same data needs to be transmitted again later, there is no need to call shiftOutPrep() again. I can get even better performance out of shiftOutRaw() than the optimized shiftOut() above, ~18us to transmit a byte!

These split functions would probably not be useful for random SerialIO such as the SoftwareSerial library, but for my use such as a TDM application where the data to be shifted out cycles through a repeated set of bytes, it pays to prepare the data with shiftOutPrep() function once in advance and then simply send the prepared bytes out with shiftOutRaw() cycling through a few prepared set of bytes. With the ability to send bytes serially in less than 20us one could perhaps perform repetitive serial IO at 50K baud! This is much better than the current 9600baud limitation.

Is anyone else interested in creating a High Performance IO library? I imagine that others could suggest even better optimizations for these functions greatly increasing the arduino's abilities.

If interested, I will post some of my optimizations here along with time measurements. for each.

shiftOut() is not the interface to the ATmega's SPI hardware. IIRC, it was written to deal with using serial-in/parallel-out shift registers.

digitalWrite() is part of the nice, easy, safe Arduino environment. If you need more speed, you can write to the output pins directly using PORTx.

The ATmega has SPI hardware. If you are using SPI peripherals you should probably use it for best performance.

As an aside, I glanced at the shiftOut() reference, and it claims to be SPI. This should be corrected; not only does it not use the ATmega SPI hardware, it doesn't even have MISO (peripheral data input).

-j

Ah, yes, you are correct. I was mislead by the use of SPI in the docs. I do not have SPI hardware, I have simple three wire (data, clock, latch/strobe) serial shift register drivers, thus the use of shiftOut().

I will look into PORTx to speed things up.

It still seems like there should be a safe (nicely integrated in the Arduino environment) faster version of shiftOut() available for general Arduino use, would you agree? I can't help but wonder how many Arduino users simply gave up performing serial IO with software because they thought they were limited to a slow shiftOut() implementation and did not have the expertise to code up a faster version.

I have simple three wire (data, clock, latch/strobe) serial shift register drivers

If they can clock at 16MHz, hook 'em to the SPI hardware and blast away. come to think of it, seems like I publicly assumed that SPI clocked at the system clock rate, but some kind soul corrected me. Guess I should have paid more attention...

-j

Nice idea, but they are only 5MHZ! :slight_smile:

I have further improved my shiftOutHP() (High Performance). It now clocks out a byte in 20us, almost a 6 fold improvement over the original!! That means that 50K serial output should be possible. This function should be almost exactly functionally equivalent (sanity checks, setup, etc) to the original shiftOut(), but it has many redundant operations removed. It seems to work with my chips. :slight_smile:

void shiftOutHP(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
{
  uint8_t dport = digitalPinToPort(dataPin);
  uint8_t cport = digitalPinToPort(clockPin);
  if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
  uint8_t dmask = digitalPinToBitMask(dataPin);
  uint8_t cmask = digitalPinToBitMask(clockPin);
  volatile uint8_t *dreg = portOutputRegister(dport);
  volatile uint8_t *creg = portOutputRegister(cport);
  uint8_t dtimer = digitalPinToTimer(dataPin);
  uint8_t ctimer = digitalPinToTimer(clockPin);
  if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
  if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

  if (bitOrder == LSBFIRST) {
    for (int i = 0; i < 8; i++) {
      if (!!(val & (1 << i)) == LOW) *dreg &= ~dmask;
      else                           *dreg |= dmask;
      *creg |= cmask;
      *creg &= ~cmask;
    }
  } else {
    for (int i = 0; i < 8; i++) {
      if (!!(val & (1 << (7 - 0))) == LOW) *dreg &= ~dmask;
      else                                 *dreg |= dmask;
      *creg |= cmask;
      *creg &= ~cmask;
    }
  }
}

To compile it you need the pins_arduino.h and the cbi/turnOffOPWN definitions from wiring_digital.c:

#include "pins_arduino.h"

#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))

static inline void turnOffPWM(uint8_t timer) __attribute__ ((always_inline));
static inline void turnOffPWM(uint8_t timer)
{
      if (timer == TIMER1A) cbi(TCCR1A, COM1A1);
      if (timer == TIMER1B) cbi(TCCR1A, COM1B1);

#if defined(__AVR_ATmega168__)
      if (timer == TIMER0A) cbi(TCCR0A, COM0A1);
      if (timer == TIMER0B) cbi(TCCR0A, COM0B1);
      if (timer == TIMER2A) cbi(TCCR2A, COM2A1);
      if (timer == TIMER2B) cbi(TCCR2A, COM2B1);
#else
      if (timer == TIMER2) cbi(TCCR2, COM21);
#endif
}

I hope this may help someone someday. Or, even better would be if it could replace the real shiftOut()!!

I have a naive question, it seems like digitalWrite() ends up performing either of the following:

*port_reg &= ~port_mask;

or

*port_reg |= port_mask;

correct me if I am wrong, but these operations are not atomic are they? Some quick investigation with avr-objdump leads me to believe that they will get compiled down to three machine operations. Does this mean that there is always a window of opportunity (although rather small) where an ISR (interrupt handler) could change the value of the output register/port after it has been read and AND/ORed, but before the new AND/ORed value is written back to the register? If so, could this be prevented, is there a mechanism to disable interrupts during these three cycles? If so, is/should the digitalWrite() function be doing that? I didn't notice any code where this looked like it was being done.

Again, forgive me if this makes no sense and/or I am missing something completely.

I turns out that I had a bug in my first version (surprise) which made it go faster! Looks like my testing needs to be more thorough. I tried so many variations of this function and this was the only one with the bug (and thus the fastest!) Oh well, back to 33us for now, but I think that I can still improve it, I got some new ideas for that.

Here is the one character update to fix the code above:

if (!!(val & (1 << (7 - 0))) == LOW) *dreg &= ~dmask;
to
if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask;

The compiler will produce code that will do port I/O in a single instruction (65 nanoseconds) if it knows the port, pin and state at compile time.

So if you can eliminate the intermediate variables in that code your IO will be much faster. It doesn't matter if you code this using something like:
PORTC &= ~(1 << (dataP);
Or
cbi PORTC,5

the compiler will produce the latter fast and atomic code.

I found it convenient to define a macro that converts Arduino ports to registers and pins.

Here is your code rewritten to use the macro:

for( int i = 0; i < 8; i++)
if (!!(val & (1 << (7 - i))) == LOW)
fastWriteA(5, LOW );

the machine instructions for this are as follows, only one line of code is doing IO, everything else is the for loop and bit test.

for( int i = 0; i < 8; i++)
   if (!!(val & (1 << (7 - i))) == LOW)
 168:      cb 01             movw      r24, r22
 16a:      02 2e             mov      r0, r18
 16c:      02 c0             rjmp      .+4            ; 0x172 <loop+0x7a>
 16e:      95 95             asr      r25
 170:      87 95             ror      r24
 172:      0a 94             dec      r0
 174:      e2 f7             brpl      .-8            ; 0x16e <loop+0x76>
 176:      80 ff             sbrs      r24, 0
     fastWriteA(5, LOW );    
 178:      45 98             cbi      0x08, 5      ; 8  // set pin 5 of port C low (analogPin 5)
 17a:      21 50             subi      r18, 0x01      ; 1
 17c:      30 40             sbci      r19, 0x00      ; 0
 17e:      8f ef             ldi      r24, 0xFF      ; 255
 180:      2f 3f             cpi      r18, 0xFF      ; 255
 182:      38 07             cpc      r19, r24
 184:      89 f7             brne      .-30           ; 0x168 <loop+0x70>

Here are the macros

// the following macro sets a digital pin high or low, pin must be between 0 and 13 inclusive
// usage: fastWrite(2,HIGH); fastWrite(13,LOW);
#define fastWrite(_pin_, _state_) ( _pin_ < 8 ? (_state_ ?  PORTD |= 1 << _pin_ : PORTD &= ~(1 << _pin_ )) : (_state_ ?  PORTB |= 1 << (_pin_ -8) : PORTB &= ~(1 << (_pin_ -8)  )))
// the macro sets or clears the appropriate bit in port D if the pin is less than 8 or port B if between 8 and 13

// this macro does fast digital write on pins shared with  the analog port
#define fastWriteA(_pin_,_state_) (_state_ ?  PORTC |= 1 << (_pin_ ) : PORTC &= ~(1 << (_pin_ ) ))

Thanks for the info. I didn't realize that the: PORTC &= ~(1 << (dataP); got translated to one CBI atomic instruction. In effect your code suggestion looks to be exactly the same as mine then, does it not? Above you are (in C) ANDing the bit(pin) specific MASK with the register which, I believe is the same as this from my code: *dreg &= ~dmask; I just thought that it took three instructions to do this, I guess it pays to read the manual on what the opcodes actually do. :slight_smile: There should be no more intermediate values in what I wrote than what you are suggesting, are there? Perhaps I misunderstood your suggestion, is there anything different about your code besides coding style?

Because your code uses variables to reference the port and mask, the compiler puts these values into registers, which it does not do if the port and bit to set are known at compile time.

You can compare the assembly code below to the code above. bear in mind that the fastWrite macro in the above code is the same as: PORTC &= ~(1 << 5);

here is the assembly code for your version that uses dreg and dmask variables:

    for (int i = 0; i < 8; i++) {
      if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask; 
 12e:      cb 01             movw      r24, r22
 130:      04 2e             mov      r0, r20
 132:      02 c0             rjmp      .+4            ; 0x138 <_Z10shiftOutHPhhhh+0x46>
 134:      95 95             asr      r25
 136:      87 95             ror      r24
 138:      0a 94             dec      r0
 13a:      e2 f7             brpl      .-8            ; 0x134 <_Z10shiftOutHPhhhh+0x42>
 13c:      80 fd             sbrc      r24, 0
 13e:      03 c0             rjmp      .+6            ; 0x146 <_Z10shiftOutHPhhhh+0x54>
 140:      80 81             ld      r24, Z
 142:      83 23             and      r24, r19
 144:      80 83             st      Z, r24
 146:      41 50             subi      r20, 0x01      ; 1
 148:      50 40             sbci      r21, 0x00      ; 0
 14a:      8f ef             ldi      r24, 0xFF      ; 255
 14c:      4f 3f             cpi      r20, 0xFF      ; 255
 14e:      58 07             cpc      r21, r24
 150:      71 f7             brne      .-36           ; 0x12e <_Z10shiftOutHPhhhh+0x3c>

So, let me understand what I am seeing. These three lines are the "equivalent" of the CBI from your snippet, the actual pin update, correct?

140: 80 81 ld r24, Z
142: 83 23 and r24, r19
144: 80 83 st Z, r24

Of course, they are not actually equivalent as you are pointing out. The cbi is an atomic instruction (I am assuming) and the three lines above are not! Is this correct? So, does this mean that my original concern about the real digitalWrite() having a non atomic implementation of the pin update a valid concern? If so, I would say that speed issues aside, this should be fixed in the Arduino environment so that ISRs which may interrupt the digitalWrite() function will never result in incorrect values written to the pins. Perhaps one way to do this would be the equivalent of:

switch(pin) {
...
case 5: PORTB &= ~(1 << (5);
break;
...
}

This would get rather long (one case for each pin), but perhaps the compiler would then produce atomic code. I imagine that this could also be a performance hit, I will profile it tonight to see, but if it increases code reliability, it may be worth it. I guess profiling it against turning off interrupts would be interesting.

Thanks again for your clarification.

If so, I would say that speed issues aside, this should be fixed in the Arduino environment so that ISRs which may interrupt the digitalWrite() function will never result in incorrect values written to the pins.

I'm not sure I follow. How would an ISR in the middle of the above three instructions lead to an incorrect digital output value? The only problem I see would arise from having an ISR that changes both pin mode and pin state interrupting a program that changes that same pin's mode and state. For example:

set pin 1 as an output
<- interrupt happens here
ISR sets pin 1 as an input with high Z (i.e. its port bit is zero)
return from ISR
attempt to drive pin 1 high (which actually now enables the pin's internal pull-up)

So the only two states we want are high-Z input or driven-high output, but instead we end up with a state we never envisioned: pulled-high input. Such a sequence could never be atomic, however, so it would be up to you to notice the potential for this kind of problem in your program and to actively prevent it.

  • Ben

I agree that the scenario you describe cannot be avoided (ie. changing the mode of the pin), but this can be easily anticipated at least since both routines were designed to manipulate the same pin. But I am more concerned with a simpler problem, the pin values being wrong even when two routines use different pins!

Imagine a main line program that uses pin 5 as output and an ISR that uses pin4 as output. Both pins are outputs, but both pins share the same port!

  1. pin4 & pin5 are LOW
  2. Main program wants to raise pin5, it reads the port value and ORs it with the appropriate bit mask to raise pin 5 HIGH, but it has not yet written the ORed value to the port (since it takes three steps, not one atomic step)
  3. ISR interrupts main line program and raises pin4 HIGH, returns
  4. Main program continues and now writes out the port value with pin5 HIGH, but with pin4 still LOW since it was LOW when it did the initial read in step2!

The ISR's pin4 has now just been toggled back to LOW. Not only may this be too soon, but the ISR if called in the future may be unaware (unexpecting) of this change. It may assume that it is still HIGH and if it wanted to keep it HIGH do nothing.

This situation arrived even though the programmer of both routines may have believed he was using fairly safe techniques. His ISR was designed to affect pin4 and his main program was designed to affect pin5. Why should he expect these two routines to ever interfere with the pin values of the other if he is using "safe" IO routines such as digitalWrite() in both routines?

Yeah, that makes perfect sense. I wasn't thinking about different pins on the same port. I guess this problem would apply to any registers that both your ISR and main program modify using &= or |= operations that don't get converted into SBI or CBI.

  • Ben

I guess this problem would apply to any registers that both your ISR and main program modify using &= or |= operations that don't get converted into SBI or CBI.

I would assume so, unless they are running with interrupts off. I am not yet very familiar with the Arduino environment, but is it possible that the environment already has ISRs that interfere with ports in unsafe ways like this? The PWM code comes immediately to mind. Could digitalWrite() interfere with PWM pins that it should not? If so, a programmer need not even write an ISR for this to be a potential problem. I realize that it would probably go unnoticed since the PWM might just skip a cycle, but if the PWM were being used for something important, it might matter.

I wonder if there are other main Arduino envrionment potential victims where data loss/corruption could actually ensue, the SoftwareSerial library comes to mind, although it probably does not run in an ISR. The ISR seems to be the routine that will lose out, not the main program right? I would still think that it would be nice if digitalWrite() did not interfere with another digitalWrite() even if one of them is in an ISR.

PWM generation should be ok because that is taken care of entirely in hardware on the mega168 (i.e. it's not the case of a timer-based ISR manually toggling an output pin). There is an ISR bug of this nature that has to do with the millis() timer and its unsafe use of globals that can be updated mid-calculation by the timer0 overflow ISR, but this problem should be corrected in the arduino-0012 release (a temporary fix is to surround every millis() call by cli() and sei() to disable interrupts during the calculation).

I think the only ISRs used by the Arduino environment are the timer0 overflow interrupt (which just updates global variables used by millis()) and the serial receive interrupt. I haven't looked at the serial receive ISR to see exactly what it does, but I'm assuming it just reads data from UDR0 and stores it in a global ring buffer. It's possible the environment also uses this buffer in an unsafe way, but I can't say without looking at the code.

  • Ben

It still seems like there should be a safe (nicely integrated in the Arduino environment) faster version of shiftOut() available for general Arduino use, would you agree? I can't help but wonder how many Arduino users simply gave up performing serial IO with software because they thought they were limited to a slow shiftOut() implementation and did not have the expertise to code up a faster version.

I agree and it should have the ability to designate whether the clock pin idles high or low.

I had to write my own shiftOut to get the clock to idle high and I suffered a glacially slow new function - each clock cycle was taking 14uS. Re writing as a Port manipulation improved to 3uS. I didn't know about macros and the compiler interpretation issues brought up here.

But my devices can clock at 20MHz so I am looking at using SPI (didn't know about that until someone here told me). It seems promising. My only problem with that is my device only uses 2 wires and strobes the chips by holding the clock high and pulsing the data line 8 times. I might have to disable the SPI when I need t do this or maybe multiplex the clock and data lines so I can drive them with both the SPI and some other pins to send the strobe signal.

But basically I agree that a faster shiftOut would be great - lots and lots of people are trying to shiftOut to LED arrays and end up finding the Arduino a bit pokey at this.

Martin, cant you divide the SPI clock/4 to get to 4MHz so tha SPI would work with your 5MHz devices?

Hey, guys, this discussion is great, and I'd love to improve the shiftOut() performance for Arduino 0012. Can someone post an updated version of the shiftOut() code based on the conversation? It might be simpler not to use the fastWrite() macros, but just do everything directly in the shiftOut() code for now. I'm also looking at adding more general purpose bitRead() and bitWrite() macros which would something similar to fastWrite(), but until then, it's probably simpler to do without them.

OK, I have more results. First off, I benchmarked a switch() version of shiftOut() using hardcoded port/pins for each case. This switch trick allows the compiler to see the ports and pins as constants and it indeed does use single instruction bit set/clears. Unfortunately, the extra comparison and branches needed for every iteration of the loop leads to a slower performing shiftOut(), although it would use an atomic update and be more "correct" than the original shiftOut(). The other problem with using a switch is that it makes the loop times potentially different depending on which pin is chosen since it will involve either fewer or more comparisons. I benchmarked this shiftOutSwitch() at ~40us, which is not bad, but not great and the code is ugly (2 large 13 step switches)!!!

Displeased with these results, I tried another approach to try and use real cbi/sbi instructions. Instead of using a switch statement, I created separate cbi1(), cbi2()... functions and then created an array of function pointers to each function. With this I built a new shiftOutFP() (function pointer), which before looping picks the appropriate sbi/sbi functions for each of the data and clock pins and simply calls those in the loop. This also led the compiler to perform the constant port/pin optimization and thus to use the atomic cbi/sbi machine instructions. The performance results were however still
disappointing, ~42us. Not only that, but this is even uglier code than shiftOutSwitch() since it involves defining at least 13x2=26 functions and then uses 23*2=46 bytes of SRAM to store the pointers to these in an array. Needless to say I was disappointed, while the attempt to use cbi/sbi did to lead safer more correct code, in both cases it is much uglier code and slower.

So, I decided to give up on using sbi/cbi for now and focused on making the shiftOutHP() from previous posts more "correct", i.e. fixing the non-atomic write from digitalWrite(). To do this I simply disabled interrupts during each pin update and restored the previous state afterwards. Surprisingly, this ends up being faster than the previous two atomic attempts, ~37us. So, while this is slightly slower than the non-atomic version, this is still (exactly) a 3 fold improvement over the current arduino shiftOut() and is more correct to boot. While some speedup may be gained by turning interrupts off once and keeping them off the whole time, this could increase latency for interrupts dramatically and at best cause a 4us speedup, I would not consider this a good tradeoff. This would be my current suggestion of a function to replace the arduino core shiftOut().

void shiftOutHPA(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
{
  uint8_t dport = digitalPinToPort(dataPin);
  uint8_t cport = digitalPinToPort(clockPin);
  if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
  uint8_t dmask = digitalPinToBitMask(dataPin);
  uint8_t cmask = digitalPinToBitMask(clockPin);
  volatile uint8_t *dreg = portOutputRegister(dport);
  volatile uint8_t *creg = portOutputRegister(cport);
  uint8_t dtimer = digitalPinToTimer(dataPin);
  uint8_t ctimer = digitalPinToTimer(clockPin);
  if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
  if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

  byte sreg = SREG;
  if (bitOrder == LSBFIRST)
    for (int i = 0; i < 8; i++) {
      cli();
      if (!!(val & (1 << i)) == LOW) *dreg &= ~dmask;
      else *dreg |= dmask;
      SREG=sreg; cli();
      *creg |= cmask;
      SREG=sreg; cli();
      *creg &= ~cmask;
      SREG=sreg;
    }
  else
    for (int i = 0; i < 8; i++) {
      cli();
      if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask;
      else *dreg |= dmask;
      SREG=sreg; cli();
      *creg |= cmask;
      SREG=sreg; cli();
      *creg &= ~cmask;
      SREG=sreg;
    }
}

I have not heavily tested this function yet, so it would be best if others could test it too before including it in the core. I would also suggest a similar SREG/cli fix to go into digitalWrite() so that it is less likely to interfere with other pins used by interrupts, the performance impact will be negligible.

Does it actually matter in practice if these writes are atomic or not? Have people had problems with this? Most of the other core functions don't disable interrupts while writing to the ports, so I'm not sure it makes sense to do it here.