Go Down

Topic: Higher Performance IO (Read 16559 times) previous topic - next topic


I am fairly new to arduino, so forgive me if some of these things are obvious...  I am in the middle of a project which will involve performing TDM across a bunch of shift register lines to control some large 8 segment LED displays.  To do this I have been thinking of using an ISR with FrequencyTimer2 to output data using SPI to my latches.  What I have come to realize though is that SPI using shiftOut() is not very fast.

I performed some rudimentary measurements and it seems like shiftOut() takes about 110us to output a byte.  Perhaps this explains the 9600 baud limitation to the SoftwareSerial library?  After delving into the shiftOut() and digitalWrite() code, (shiftOut() simply calls a bunch of digitalWrite()s in succession), I was able to perform some simple optimizations since digitalWrite() performs much of the same preparatory stuff over and over.  I was able to create a dropin replacement for shiftOut() that works in 31ms.  This could probably be further optimized but I thought that it was a good start.  

This leas me to wonder whether others need higher performance IO in general with the Arduino and whether the core libraries should potentially be better optimized.  Perhaps it would be even good to create a separate high performance IO library which may have slightly more complicated but more accurate IO functions.  These functions might be slightly harder to use than the standard functions, but if properly used could give a substantial performance improvement still?  For example: I split the shiftOut() function into a shiftOutPrep() and a shiftOutRaw() function.  If using repetitive data, it can be prepared with shiftOutPrep() and then transmitted with shiftOutRaw().  If the same data needs to be transmitted again later, there is no need to call shiftOutPrep() again.  I can get even better performance out of shiftOutRaw() than the optimized shiftOut() above, ~18us to transmit a byte!  

These split functions would probably not be useful for random SerialIO such as the SoftwareSerial library, but for my use such as a TDM application where the data to be shifted out cycles through a repeated set of bytes, it pays to prepare the data with shiftOutPrep() function once in advance and then simply send the prepared bytes out with shiftOutRaw() cycling through a few prepared set of bytes.  With the ability to send bytes serially in less than 20us one could perhaps perform repetitive serial IO at 50K baud!  This is much better than the current 9600baud limitation.

Is anyone else interested in creating a High Performance IO library?  I imagine that others could suggest even better optimizations for these functions greatly increasing the arduino's abilities.

If interested, I will post some of my optimizations here along with time measurements. for each.


shiftOut() is not the interface to the ATmega's SPI hardware.  IIRC, it was written to deal with using serial-in/parallel-out shift registers.

digitalWrite() is part of the nice, easy, safe Arduino environment.  If you need more speed, you can write to the output pins directly using PORTx.

The ATmega has SPI hardware.  If you are using SPI peripherals you should probably use it for best performance.

As an aside, I glanced at the shiftOut() reference, and it claims to be SPI.  This should be corrected; not only does it not use the ATmega SPI hardware, it doesn't even have MISO (peripheral data input).



Ah, yes, you are correct.  I was mislead by the use of SPI in the docs.  I do not have SPI hardware, I have simple three wire (data, clock, latch/strobe) serial shift register drivers, thus the use of shiftOut().

I will look into PORTx to speed things up.

It still seems like there should be a safe (nicely integrated in the Arduino environment) faster version of shiftOut() available for general Arduino use, would you agree?  I can't help but wonder how many Arduino users simply gave up performing serial IO with software because they thought they were limited to a slow shiftOut() implementation and did not have the expertise to code up a faster version.


I have simple three wire (data, clock, latch/strobe) serial shift register drivers

If they can clock at 16MHz, hook 'em to the SPI hardware and blast away.  come to think of it, seems like I publicly assumed that SPI clocked at the system clock rate, but some kind soul corrected me.  Guess I should have paid more attention...



Nice idea, but they are only 5MHZ! :)

I have further improved my shiftOutHP() (High Performance).  It now clocks out a byte in 20us, almost a 6 fold improvement over the original!!  That means that 50K serial output should be possible.  This function should be almost exactly functionally equivalent (sanity checks, setup, etc) to the original shiftOut(), but it has many redundant operations removed.  It seems to work with my chips. :)

Code: [Select]

void shiftOutHP(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
 uint8_t dport = digitalPinToPort(dataPin);
 uint8_t cport = digitalPinToPort(clockPin);
 if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
 uint8_t dmask = digitalPinToBitMask(dataPin);
 uint8_t cmask = digitalPinToBitMask(clockPin);
 volatile uint8_t *dreg = portOutputRegister(dport);
 volatile uint8_t *creg = portOutputRegister(cport);
 uint8_t dtimer = digitalPinToTimer(dataPin);
 uint8_t ctimer = digitalPinToTimer(clockPin);
 if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
 if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

 if (bitOrder == LSBFIRST) {
   for (int i = 0; i < 8; i++) {
     if (!!(val & (1 << i)) == LOW) *dreg &= ~dmask;
     else                           *dreg |= dmask;
     *creg |= cmask;
     *creg &= ~cmask;
 } else {
   for (int i = 0; i < 8; i++) {
     if (!!(val & (1 << (7 - 0))) == LOW) *dreg &= ~dmask;
     else                                 *dreg |= dmask;
     *creg |= cmask;
     *creg &= ~cmask;

To compile it you need the pins_arduino.h and the cbi/turnOffOPWN definitions from wiring_digital.c:

Code: [Select]

#include "pins_arduino.h"

#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))

static inline void turnOffPWM(uint8_t timer) __attribute__ ((always_inline));
static inline void turnOffPWM(uint8_t timer)
     if (timer == TIMER1A) cbi(TCCR1A, COM1A1);
     if (timer == TIMER1B) cbi(TCCR1A, COM1B1);

#if defined(__AVR_ATmega168__)
     if (timer == TIMER0A) cbi(TCCR0A, COM0A1);
     if (timer == TIMER0B) cbi(TCCR0A, COM0B1);
     if (timer == TIMER2A) cbi(TCCR2A, COM2A1);
     if (timer == TIMER2B) cbi(TCCR2A, COM2B1);
     if (timer == TIMER2) cbi(TCCR2, COM21);

I hope this may help someone someday.  Or, even better would be if it could replace the real shiftOut()!!


I have a naive question, it seems like digitalWrite() ends up performing either of the following:

 *port_reg &= ~port_mask;


 *port_reg |= port_mask;

correct me if I am wrong, but these operations are not atomic are they?  Some quick investigation with avr-objdump leads me to believe that they will get compiled down to three machine operations.  Does this mean that there is always a window of opportunity (although rather small) where an ISR (interrupt handler) could change the value of the output register/port after it has been read and AND/ORed, but before the new AND/ORed value is written back to the register?  If so, could this be prevented, is there a mechanism to disable interrupts during these three cycles?  If so, is/should the digitalWrite() function be doing that?  I didn't notice any code where this looked like it was being done.

Again, forgive me if this makes no sense and/or I am missing something completely.


I turns out that I had a bug in my first version (surprise) which made it go faster!  Looks like my testing needs to be more thorough.  I tried so many variations of this function and this was the only one with the bug (and thus the fastest!)  Oh well, back to 33us for now, but I think that I can still improve it, I got some new ideas for that.

Here is the one character update to fix the code above:

     if (!!(val & (1 << (7 - 0))) == LOW) *dreg &= ~dmask;
     if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask;


The compiler will produce code that will do port I/O in a single instruction (65 nanoseconds) if it knows the port, pin and state at compile time.

So if you can eliminate the intermediate variables in that code your IO will be much faster. It doesn't matter if you code this using something like:
 PORTC &= ~(1 << (dataP);
 cbi      PORTC,5      

the compiler will produce the latter fast and atomic code.

I found it convenient to define a macro that converts Arduino ports to registers and pins.

Here is your code rewritten to use the macro:

for( int i = 0; i < 8; i++)
  if (!!(val & (1 << (7 - i))) == LOW)
    fastWriteA(5, LOW );  

the machine instructions for this are as follows, only one line of code is doing IO, everything else is the for loop and bit test.
Code: [Select]
for( int i = 0; i < 8; i++)
  if (!!(val & (1 << (7 - i))) == LOW)
168:      cb 01             movw      r24, r22
16a:      02 2e             mov      r0, r18
16c:      02 c0             rjmp      .+4            ; 0x172 <loop+0x7a>
16e:      95 95             asr      r25
170:      87 95             ror      r24
172:      0a 94             dec      r0
174:      e2 f7             brpl      .-8            ; 0x16e <loop+0x76>
176:      80 ff             sbrs      r24, 0
    fastWriteA(5, LOW );    
178:      45 98             cbi      0x08, 5      ; 8  // set pin 5 of port C low (analogPin 5)
17a:      21 50             subi      r18, 0x01      ; 1
17c:      30 40             sbci      r19, 0x00      ; 0
17e:      8f ef             ldi      r24, 0xFF      ; 255
180:      2f 3f             cpi      r18, 0xFF      ; 255
182:      38 07             cpc      r19, r24
184:      89 f7             brne      .-30           ; 0x168 <loop+0x70>

Here are  the macros

Code: [Select]

// the following macro sets a digital pin high or low, pin must be between 0 and 13 inclusive
// usage: fastWrite(2,HIGH); fastWrite(13,LOW);
#define fastWrite(_pin_, _state_) ( _pin_ < 8 ? (_state_ ?  PORTD |= 1 << _pin_ : PORTD &= ~(1 << _pin_ )) : (_state_ ?  PORTB |= 1 << (_pin_ -8) : PORTB &= ~(1 << (_pin_ -8)  )))
// the macro sets or clears the appropriate bit in port D if the pin is less than 8 or port B if between 8 and 13

// this macro does fast digital write on pins shared with  the analog port
#define fastWriteA(_pin_,_state_) (_state_ ?  PORTC |= 1 << (_pin_ ) : PORTC &= ~(1 << (_pin_ ) ))


Thanks for the info.   I didn't realize that the: PORTC &= ~(1 << (dataP);  got translated to one CBI atomic instruction.  In effect your code suggestion looks to be exactly the same as mine then, does it not?  Above you are (in C) ANDing the bit(pin) specific MASK with the register which, I believe is the same as this from my code: *dreg &= ~dmask;  I just thought that it took three instructions to do this, I guess it pays to read the manual on what the opcodes actually do. :)  There should be no more intermediate values in what I wrote than what you are suggesting, are there?  Perhaps I misunderstood your suggestion, is there anything different about your code besides coding style?  


Jul 24, 2008, 07:09 pm Last Edit: Jul 24, 2008, 07:10 pm by mem Reason: 1
Because your code uses variables to reference the port and mask, the compiler puts these values into registers, which it does not do if the port and bit to set are known at compile time.

You can compare the assembly code below to the code above. bear in mind that the fastWrite macro in the above code is the same as:   PORTC &= ~(1 << 5);  

here is the assembly code for your version that uses dreg and dmask variables:
Code: [Select]

   for (int i = 0; i < 8; i++) {
     if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask;
12e:      cb 01             movw      r24, r22
130:      04 2e             mov      r0, r20
132:      02 c0             rjmp      .+4            ; 0x138 <_Z10shiftOutHPhhhh+0x46>
134:      95 95             asr      r25
136:      87 95             ror      r24
138:      0a 94             dec      r0
13a:      e2 f7             brpl      .-8            ; 0x134 <_Z10shiftOutHPhhhh+0x42>
13c:      80 fd             sbrc      r24, 0
13e:      03 c0             rjmp      .+6            ; 0x146 <_Z10shiftOutHPhhhh+0x54>
140:      80 81             ld      r24, Z
142:      83 23             and      r24, r19
144:      80 83             st      Z, r24
146:      41 50             subi      r20, 0x01      ; 1
148:      50 40             sbci      r21, 0x00      ; 0
14a:      8f ef             ldi      r24, 0xFF      ; 255
14c:      4f 3f             cpi      r20, 0xFF      ; 255
14e:      58 07             cpc      r21, r24
150:      71 f7             brne      .-36           ; 0x12e <_Z10shiftOutHPhhhh+0x3c>


So, let me understand what I am seeing.  These three lines are the "equivalent" of the CBI from your snippet, the actual pin update, correct?

140:      80 81             ld      r24, Z
142:      83 23             and      r24, r19
144:      80 83             st      Z, r24

Of course, they are not actually equivalent as you are pointing out.  The cbi is an atomic instruction (I am assuming) and the three lines above are not!  Is this correct?  So, does this mean that my original concern about the real digitalWrite() having a non atomic implementation of the pin update a valid concern?  If so, I would say that speed issues aside, this should be fixed in the Arduino environment so that ISRs which may interrupt the digitalWrite() function will never result in incorrect values written to the pins.  Perhaps one way to do this would be the equivalent of:

switch(pin) {
 case 5:  PORTB &= ~(1 << (5);
 ...<many more cases removed>

This would get rather long (one case for each pin), but perhaps the compiler would then produce atomic code.  I imagine that this could also be a performance hit, I will profile it tonight to see, but if it increases code reliability, it may be worth it.  I guess profiling it against turning off interrupts would be interesting.

Thanks again for your clarification.


If so, I would say that speed issues aside, this should be fixed in the Arduino environment so that ISRs which may interrupt the digitalWrite() function will never result in incorrect values written to the pins.

I'm not sure I follow.  How would an ISR in the middle of the above three instructions lead to an incorrect digital output value?  The only problem I see would arise from having an ISR that changes both pin mode and pin state interrupting a program that changes that same pin's mode and state.  For example:

set pin 1 as an output
<- interrupt happens here
 ISR sets pin 1 as an input with high Z (i.e. its port bit is zero)
return from ISR
attempt to drive pin 1 high (which actually now enables the pin's internal pull-up)

So the only two states we want are high-Z input or driven-high output, but instead we end up with a state we never envisioned: pulled-high input.  Such a sequence could never be atomic, however, so it would be up to you to notice the potential for this kind of problem in your program and to actively prevent it.

- Ben


I agree that the scenario you describe cannot be avoided (ie. changing the mode of the pin), but this can be easily anticipated at least since both routines were designed to manipulate the same pin.  But I am more concerned with a simpler problem, the pin values being wrong even when two routines use different pins!

Imagine a main line program that uses pin 5 as output and an ISR that uses pin4 as output.  Both pins are outputs, but both pins share the same port!

1) pin4 & pin5 are LOW
2) Main program wants to raise pin5, it reads the port value and ORs it with the appropriate bit mask to raise pin 5 HIGH, but it has not yet written the ORed value to the port (since it takes three steps, not one atomic step)
3) ISR interrupts main line program and raises pin4 HIGH, returns
4) Main program continues and now writes out the port value with pin5 HIGH, but with pin4 still LOW since it was LOW when it did the initial read in step2!

The ISR's pin4 has now just been toggled back to LOW.  Not only may this be too soon, but the ISR if called in the future may be unaware (unexpecting) of this change.  It may assume that it is still HIGH and if it wanted to keep it HIGH do nothing.

This situation arrived even though the programmer of both routines may have believed he was using fairly safe techniques.  His ISR was designed to affect pin4 and his main program was designed to affect pin5.  Why should he expect these two routines to ever interfere with the pin values of the other if he is using "safe" IO routines such as digitalWrite() in both routines?


Yeah, that makes perfect sense.  I wasn't thinking about different pins on the same port.  I guess this problem would apply to any registers that both your ISR and main program modify using &= or |= operations that don't get converted into SBI or CBI.

- Ben


I guess this problem would apply to any registers that both your ISR and main program modify using &= or |= operations that don't get converted into SBI or CBI.

I would assume so, unless they are running with interrupts off.  I am not yet very familiar with the Arduino environment, but is it possible that the environment already has ISRs that interfere with ports in unsafe ways like this?  The PWM code comes immediately to mind.  Could digitalWrite() interfere with PWM pins that it should not?  If so, a programmer need not even write an ISR for this to be a potential problem.  I realize that it would probably go unnoticed since the PWM might just skip a cycle, but if the PWM were being used for something important, it might matter.  

I wonder if there are other main Arduino envrionment potential victims where data loss/corruption could actually ensue,  the SoftwareSerial library comes to mind, although it probably does not run in an ISR. The ISR seems to be the routine that will lose out, not the main program right?  I would still think that it would be nice if digitalWrite() did not interfere with another digitalWrite() even if one of them is in an ISR.

Go Up