Higher Performance IO

The compiler will produce code that will do port I/O in a single instruction (65 nanoseconds) if it knows the port, pin and state at compile time.

So if you can eliminate the intermediate variables in that code your IO will be much faster. It doesn't matter if you code this using something like:
PORTC &= ~(1 << (dataP);
Or
cbi PORTC,5

the compiler will produce the latter fast and atomic code.

I found it convenient to define a macro that converts Arduino ports to registers and pins.

Here is your code rewritten to use the macro:

for( int i = 0; i < 8; i++)
if (!!(val & (1 << (7 - i))) == LOW)
fastWriteA(5, LOW );

the machine instructions for this are as follows, only one line of code is doing IO, everything else is the for loop and bit test.

for( int i = 0; i < 8; i++)
   if (!!(val & (1 << (7 - i))) == LOW)
 168:      cb 01             movw      r24, r22
 16a:      02 2e             mov      r0, r18
 16c:      02 c0             rjmp      .+4            ; 0x172 <loop+0x7a>
 16e:      95 95             asr      r25
 170:      87 95             ror      r24
 172:      0a 94             dec      r0
 174:      e2 f7             brpl      .-8            ; 0x16e <loop+0x76>
 176:      80 ff             sbrs      r24, 0
     fastWriteA(5, LOW );    
 178:      45 98             cbi      0x08, 5      ; 8  // set pin 5 of port C low (analogPin 5)
 17a:      21 50             subi      r18, 0x01      ; 1
 17c:      30 40             sbci      r19, 0x00      ; 0
 17e:      8f ef             ldi      r24, 0xFF      ; 255
 180:      2f 3f             cpi      r18, 0xFF      ; 255
 182:      38 07             cpc      r19, r24
 184:      89 f7             brne      .-30           ; 0x168 <loop+0x70>

Here are the macros

// the following macro sets a digital pin high or low, pin must be between 0 and 13 inclusive
// usage: fastWrite(2,HIGH); fastWrite(13,LOW);
#define fastWrite(_pin_, _state_) ( _pin_ < 8 ? (_state_ ?  PORTD |= 1 << _pin_ : PORTD &= ~(1 << _pin_ )) : (_state_ ?  PORTB |= 1 << (_pin_ -8) : PORTB &= ~(1 << (_pin_ -8)  )))
// the macro sets or clears the appropriate bit in port D if the pin is less than 8 or port B if between 8 and 13

// this macro does fast digital write on pins shared with  the analog port
#define fastWriteA(_pin_,_state_) (_state_ ?  PORTC |= 1 << (_pin_ ) : PORTC &= ~(1 << (_pin_ ) ))

Thanks for the info. I didn't realize that the: PORTC &= ~(1 << (dataP); got translated to one CBI atomic instruction. In effect your code suggestion looks to be exactly the same as mine then, does it not? Above you are (in C) ANDing the bit(pin) specific MASK with the register which, I believe is the same as this from my code: *dreg &= ~dmask; I just thought that it took three instructions to do this, I guess it pays to read the manual on what the opcodes actually do. :slight_smile: There should be no more intermediate values in what I wrote than what you are suggesting, are there? Perhaps I misunderstood your suggestion, is there anything different about your code besides coding style?

Because your code uses variables to reference the port and mask, the compiler puts these values into registers, which it does not do if the port and bit to set are known at compile time.

You can compare the assembly code below to the code above. bear in mind that the fastWrite macro in the above code is the same as: PORTC &= ~(1 << 5);

here is the assembly code for your version that uses dreg and dmask variables:

    for (int i = 0; i < 8; i++) {
      if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask; 
 12e:      cb 01             movw      r24, r22
 130:      04 2e             mov      r0, r20
 132:      02 c0             rjmp      .+4            ; 0x138 <_Z10shiftOutHPhhhh+0x46>
 134:      95 95             asr      r25
 136:      87 95             ror      r24
 138:      0a 94             dec      r0
 13a:      e2 f7             brpl      .-8            ; 0x134 <_Z10shiftOutHPhhhh+0x42>
 13c:      80 fd             sbrc      r24, 0
 13e:      03 c0             rjmp      .+6            ; 0x146 <_Z10shiftOutHPhhhh+0x54>
 140:      80 81             ld      r24, Z
 142:      83 23             and      r24, r19
 144:      80 83             st      Z, r24
 146:      41 50             subi      r20, 0x01      ; 1
 148:      50 40             sbci      r21, 0x00      ; 0
 14a:      8f ef             ldi      r24, 0xFF      ; 255
 14c:      4f 3f             cpi      r20, 0xFF      ; 255
 14e:      58 07             cpc      r21, r24
 150:      71 f7             brne      .-36           ; 0x12e <_Z10shiftOutHPhhhh+0x3c>

So, let me understand what I am seeing. These three lines are the "equivalent" of the CBI from your snippet, the actual pin update, correct?

140: 80 81 ld r24, Z
142: 83 23 and r24, r19
144: 80 83 st Z, r24

Of course, they are not actually equivalent as you are pointing out. The cbi is an atomic instruction (I am assuming) and the three lines above are not! Is this correct? So, does this mean that my original concern about the real digitalWrite() having a non atomic implementation of the pin update a valid concern? If so, I would say that speed issues aside, this should be fixed in the Arduino environment so that ISRs which may interrupt the digitalWrite() function will never result in incorrect values written to the pins. Perhaps one way to do this would be the equivalent of:

switch(pin) {
...
case 5: PORTB &= ~(1 << (5);
break;
...
}

This would get rather long (one case for each pin), but perhaps the compiler would then produce atomic code. I imagine that this could also be a performance hit, I will profile it tonight to see, but if it increases code reliability, it may be worth it. I guess profiling it against turning off interrupts would be interesting.

Thanks again for your clarification.

If so, I would say that speed issues aside, this should be fixed in the Arduino environment so that ISRs which may interrupt the digitalWrite() function will never result in incorrect values written to the pins.

I'm not sure I follow. How would an ISR in the middle of the above three instructions lead to an incorrect digital output value? The only problem I see would arise from having an ISR that changes both pin mode and pin state interrupting a program that changes that same pin's mode and state. For example:

set pin 1 as an output
<- interrupt happens here
ISR sets pin 1 as an input with high Z (i.e. its port bit is zero)
return from ISR
attempt to drive pin 1 high (which actually now enables the pin's internal pull-up)

So the only two states we want are high-Z input or driven-high output, but instead we end up with a state we never envisioned: pulled-high input. Such a sequence could never be atomic, however, so it would be up to you to notice the potential for this kind of problem in your program and to actively prevent it.

  • Ben

I agree that the scenario you describe cannot be avoided (ie. changing the mode of the pin), but this can be easily anticipated at least since both routines were designed to manipulate the same pin. But I am more concerned with a simpler problem, the pin values being wrong even when two routines use different pins!

Imagine a main line program that uses pin 5 as output and an ISR that uses pin4 as output. Both pins are outputs, but both pins share the same port!

  1. pin4 & pin5 are LOW
  2. Main program wants to raise pin5, it reads the port value and ORs it with the appropriate bit mask to raise pin 5 HIGH, but it has not yet written the ORed value to the port (since it takes three steps, not one atomic step)
  3. ISR interrupts main line program and raises pin4 HIGH, returns
  4. Main program continues and now writes out the port value with pin5 HIGH, but with pin4 still LOW since it was LOW when it did the initial read in step2!

The ISR's pin4 has now just been toggled back to LOW. Not only may this be too soon, but the ISR if called in the future may be unaware (unexpecting) of this change. It may assume that it is still HIGH and if it wanted to keep it HIGH do nothing.

This situation arrived even though the programmer of both routines may have believed he was using fairly safe techniques. His ISR was designed to affect pin4 and his main program was designed to affect pin5. Why should he expect these two routines to ever interfere with the pin values of the other if he is using "safe" IO routines such as digitalWrite() in both routines?

Yeah, that makes perfect sense. I wasn't thinking about different pins on the same port. I guess this problem would apply to any registers that both your ISR and main program modify using &= or |= operations that don't get converted into SBI or CBI.

  • Ben

I guess this problem would apply to any registers that both your ISR and main program modify using &= or |= operations that don't get converted into SBI or CBI.

I would assume so, unless they are running with interrupts off. I am not yet very familiar with the Arduino environment, but is it possible that the environment already has ISRs that interfere with ports in unsafe ways like this? The PWM code comes immediately to mind. Could digitalWrite() interfere with PWM pins that it should not? If so, a programmer need not even write an ISR for this to be a potential problem. I realize that it would probably go unnoticed since the PWM might just skip a cycle, but if the PWM were being used for something important, it might matter.

I wonder if there are other main Arduino envrionment potential victims where data loss/corruption could actually ensue, the SoftwareSerial library comes to mind, although it probably does not run in an ISR. The ISR seems to be the routine that will lose out, not the main program right? I would still think that it would be nice if digitalWrite() did not interfere with another digitalWrite() even if one of them is in an ISR.

PWM generation should be ok because that is taken care of entirely in hardware on the mega168 (i.e. it's not the case of a timer-based ISR manually toggling an output pin). There is an ISR bug of this nature that has to do with the millis() timer and its unsafe use of globals that can be updated mid-calculation by the timer0 overflow ISR, but this problem should be corrected in the arduino-0012 release (a temporary fix is to surround every millis() call by cli() and sei() to disable interrupts during the calculation).

I think the only ISRs used by the Arduino environment are the timer0 overflow interrupt (which just updates global variables used by millis()) and the serial receive interrupt. I haven't looked at the serial receive ISR to see exactly what it does, but I'm assuming it just reads data from UDR0 and stores it in a global ring buffer. It's possible the environment also uses this buffer in an unsafe way, but I can't say without looking at the code.

  • Ben

It still seems like there should be a safe (nicely integrated in the Arduino environment) faster version of shiftOut() available for general Arduino use, would you agree? I can't help but wonder how many Arduino users simply gave up performing serial IO with software because they thought they were limited to a slow shiftOut() implementation and did not have the expertise to code up a faster version.

I agree and it should have the ability to designate whether the clock pin idles high or low.

I had to write my own shiftOut to get the clock to idle high and I suffered a glacially slow new function - each clock cycle was taking 14uS. Re writing as a Port manipulation improved to 3uS. I didn't know about macros and the compiler interpretation issues brought up here.

But my devices can clock at 20MHz so I am looking at using SPI (didn't know about that until someone here told me). It seems promising. My only problem with that is my device only uses 2 wires and strobes the chips by holding the clock high and pulsing the data line 8 times. I might have to disable the SPI when I need t do this or maybe multiplex the clock and data lines so I can drive them with both the SPI and some other pins to send the strobe signal.

But basically I agree that a faster shiftOut would be great - lots and lots of people are trying to shiftOut to LED arrays and end up finding the Arduino a bit pokey at this.

Martin, cant you divide the SPI clock/4 to get to 4MHz so tha SPI would work with your 5MHz devices?

Hey, guys, this discussion is great, and I'd love to improve the shiftOut() performance for Arduino 0012. Can someone post an updated version of the shiftOut() code based on the conversation? It might be simpler not to use the fastWrite() macros, but just do everything directly in the shiftOut() code for now. I'm also looking at adding more general purpose bitRead() and bitWrite() macros which would something similar to fastWrite(), but until then, it's probably simpler to do without them.

OK, I have more results. First off, I benchmarked a switch() version of shiftOut() using hardcoded port/pins for each case. This switch trick allows the compiler to see the ports and pins as constants and it indeed does use single instruction bit set/clears. Unfortunately, the extra comparison and branches needed for every iteration of the loop leads to a slower performing shiftOut(), although it would use an atomic update and be more "correct" than the original shiftOut(). The other problem with using a switch is that it makes the loop times potentially different depending on which pin is chosen since it will involve either fewer or more comparisons. I benchmarked this shiftOutSwitch() at ~40us, which is not bad, but not great and the code is ugly (2 large 13 step switches)!!!

Displeased with these results, I tried another approach to try and use real cbi/sbi instructions. Instead of using a switch statement, I created separate cbi1(), cbi2()... functions and then created an array of function pointers to each function. With this I built a new shiftOutFP() (function pointer), which before looping picks the appropriate sbi/sbi functions for each of the data and clock pins and simply calls those in the loop. This also led the compiler to perform the constant port/pin optimization and thus to use the atomic cbi/sbi machine instructions. The performance results were however still
disappointing, ~42us. Not only that, but this is even uglier code than shiftOutSwitch() since it involves defining at least 13x2=26 functions and then uses 23*2=46 bytes of SRAM to store the pointers to these in an array. Needless to say I was disappointed, while the attempt to use cbi/sbi did to lead safer more correct code, in both cases it is much uglier code and slower.

So, I decided to give up on using sbi/cbi for now and focused on making the shiftOutHP() from previous posts more "correct", i.e. fixing the non-atomic write from digitalWrite(). To do this I simply disabled interrupts during each pin update and restored the previous state afterwards. Surprisingly, this ends up being faster than the previous two atomic attempts, ~37us. So, while this is slightly slower than the non-atomic version, this is still (exactly) a 3 fold improvement over the current arduino shiftOut() and is more correct to boot. While some speedup may be gained by turning interrupts off once and keeping them off the whole time, this could increase latency for interrupts dramatically and at best cause a 4us speedup, I would not consider this a good tradeoff. This would be my current suggestion of a function to replace the arduino core shiftOut().

void shiftOutHPA(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
{
  uint8_t dport = digitalPinToPort(dataPin);
  uint8_t cport = digitalPinToPort(clockPin);
  if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
  uint8_t dmask = digitalPinToBitMask(dataPin);
  uint8_t cmask = digitalPinToBitMask(clockPin);
  volatile uint8_t *dreg = portOutputRegister(dport);
  volatile uint8_t *creg = portOutputRegister(cport);
  uint8_t dtimer = digitalPinToTimer(dataPin);
  uint8_t ctimer = digitalPinToTimer(clockPin);
  if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
  if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

  byte sreg = SREG;
  if (bitOrder == LSBFIRST)
    for (int i = 0; i < 8; i++) {
      cli();
      if (!!(val & (1 << i)) == LOW) *dreg &= ~dmask;
      else *dreg |= dmask;
      SREG=sreg; cli();
      *creg |= cmask;
      SREG=sreg; cli();
      *creg &= ~cmask;
      SREG=sreg;
    }
  else
    for (int i = 0; i < 8; i++) {
      cli();
      if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask;
      else *dreg |= dmask;
      SREG=sreg; cli();
      *creg |= cmask;
      SREG=sreg; cli();
      *creg &= ~cmask;
      SREG=sreg;
    }
}

I have not heavily tested this function yet, so it would be best if others could test it too before including it in the core. I would also suggest a similar SREG/cli fix to go into digitalWrite() so that it is less likely to interfere with other pins used by interrupts, the performance impact will be negligible.

Does it actually matter in practice if these writes are atomic or not? Have people had problems with this? Most of the other core functions don't disable interrupts while writing to the ports, so I'm not sure it makes sense to do it here.

Since I am fairly new to this, I don't know if anyone has had any problems with the non atomic writes. if they did, I would think that it might be hard for them to even figure out why they are having a problem. The symptom itself might be hard to detect and simply result in people thinking the arduino is a POS and giving up, but I am surely just guessing.

When it comes down to it, feel free to treat this as two separate issues. The one that started this thread is performance. You could make the performance modification alone. If you want to do a code audit for things like atomic updates at some point, you do that separately on the entire code base. My two cents. I just happened to come about the conclusion of "incorrect design" by investigating the performance issue.

If you want the performance fix without the atomic fix, you can use the original shiftOutHP() with the fix mentioned in the following post. If you further more want to include a shiftOutRaw() at some point, that might be nice too, but it may not fit well into the goals of the core arduino stuff since it would primarily eliminate all the safety checks done at the beginning. I will post the best shiftOutRaw() that I can come up with soon, a hardcoded one to specific compile time chosen pins should be quite a bit faster!

OK, I'm sorry I missed some really obvious optimizations still, I have now doubled the best previous performance I had with a new shiftOutHPPM() (High Performance PreMasked). By using predefined masks for bit comparisons I saved quite a bit of time. Once this optimization was in place it seemed worth uglying up the code a little and unrolling the loop and gaining the extra 6us which is a fairly high percentage at this point. This leads me to a shiftOutHPPM() which can execute in 18us!!! Adding an atomic fix to this seems to only take another 2us. While this is not the prettiest code, it is the fastest dropin replacement for shiftOut() that I could come up with to date, more than a 6 fold improvement over the current arduino code.

void shiftOutHPPM(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
{
  uint8_t dport = digitalPinToPort(dataPin);
  uint8_t cport = digitalPinToPort(clockPin);
  if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
  uint8_t dmask = digitalPinToBitMask(dataPin);
  uint8_t cmask = digitalPinToBitMask(clockPin);
  volatile uint8_t *dreg = portOutputRegister(dport);
  volatile uint8_t *creg = portOutputRegister(cport);
  uint8_t dtimer = digitalPinToTimer(dataPin);
  uint8_t ctimer = digitalPinToTimer(clockPin);
  if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
  if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

  if (bitOrder == LSBFIRST) {
    if (val & B00000001) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000010) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000100) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00001000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00010000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00100000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B01000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B10000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
  } else {
    if (val & B10000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B01000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00100000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00010000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00001000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000100) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000010) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000001) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
  }
}

If you want atomic behavior, change it to look like this:

    byte sreg=SREG;
    if (val & B00000001) { cli(); *dreg |=  dmask; SREG=sreg; }
    else                 { cli(); *dreg &= ~dmask; SREG=sreg; }
    cli(); *creg |= cmask; SREG=sreg; cli(); *creg &= ~cmask; sreg= SREG;
    if (...

Note: this suggested atmoic implementation is slightly better than what I posted earlier since it does not disable the interrupts during the comparison.

I believe that I have also come up with the most efficient possible shiftOutRaw(). This requires hardcoding (predetermined at compile time) pins and the bitorder and thus is NOT a valid dropin for shiftOut(). I have looked at the generated object code and feel that it is essentially ideal (could not be beat by hand coding it). Of course, I would love to be proven wrong and go faster still. :slight_smile: This shiftOutRaw() takes only 6us. I believe that should be fast enough to transmit at 1Mbps!!!! This is barely tested (I am focusing on performance), so let me know if you see any flaws. This relies on some new macros that I defined that are similar in spirit to the fastWrite() macro, but it uses and is more stylistically like some of the already defined cbi() and sbi() macros in the wiring code.

#define sbi(sfr, bit) (_SFR_BYTE(sfr) |= _BV(bit))
#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))

#define sbipin(_pin_) sbi(_pin_<8 ? PORTD:PORTB, _pin_ - (_pin_<8 ? 0:8))
#define cbipin(_pin_) cbi(_pin_<8 ? PORTD:PORTB, _pin_ - (_pin_<8 ? 0:8))

#define DATA_PIN   11
#define CLOCK_PIN  12
#define BIT_ORDER  LSBFIRST

void shiftOutRaw(byte val)
{
  byte  mask = BIT_ORDER == LSBFIRST ? 1 : 128;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
}

Inventor:

No, I have not tried the SPI hardware and I currently know little about it. I would like to see how far these optimizations will take me first. For my needs, it's probably enough, but I have a friend who could potentially need more, I will keep it in mind, thanks for the idea.

Looks like good progress.

I do have to say that I think the fastWrite style macro is more in keeping with the spirit of the digitalWrite type expressions that arduino users on this board would be familiar with. I don't see the advantage of macro syntax that moves away from the friendly arduino abstractions if the code is intended for arduino users.

mem:

Fair enough, I was thinking along the lines of core developers, but you are right, the function is probably aimed at arduino users. I could see that as being useful also. I'll leave the implementation of that up to ... someone else. :slight_smile:

However, I will take the opportunity to point out a tiny nit that I also noticed while doing these optimizations. The current arduino shiftOut() function (and thus also some of my previous optimizations), make the assumption that LOW is equal to 0 and HIGH is equal to 1 in lines like this:

digitalWrite(dataPin, !!(val & (1 << i)));

While this works, if someone ever changed the definition of LOW/HIGH to something else, it would break. The latest shiftOutHPPM() does not suffer from this since it writes directly to the ports. :slight_smile:

In the arduino world, HIGH is defined as 1 and LOW as 0 (see wiring.h).

sometimes it's hard to remember that the arduino environment is not raw AVR GCC :wink:

I decided to try and make the shiftOutRaw() a macro so that it can be used in place of the regular shiftOut() without having to setup defines. The macro I came up with shares the same interface as the regular arduino shiftOut(), but it does not do any special pin setups/checks, you must initialize the system/pins/timers to the proper state before using it. It pretty much executes at the same speed as the shiftOutRaw() except if you use a constant value it will actually go faster, ~3us, (useful in some rare cases). This macro builds upon a few other macros and should hopefully be fairly legible. One of the other macros used is a bitWrite() which mellis was interested in having. Hopefully this macro is mostly inline with the arduino coding style and perhaps something similar could even be included in the next release along with the updated shiftOutHPPM()?

#define sbi(sfr, bit) (_SFR_BYTE(sfr) |= _BV(bit))
#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))

#define sbipin(pin) sbi((pin)<8 ? PORTD:PORTB, (pin) - ((pin)<8 ? 0:8))
#define cbipin(pin) cbi((pin)<8 ? PORTD:PORTB, (pin) - ((pin)<8 ? 0:8))

#define bitWrite(pin, val) { \
  if ((val) == LOW) cbipin(pin); \
  else              sbipin(pin); \
}

#define shiftOutBit(dataPin, clockPin, val, bit) { \
  bitWrite(dataPin, ((val) & (1 << (bit))) ? HIGH:LOW); \
  bitWrite(clockPin, HIGH); \
  bitWrite(clockPin, LOW); \
}

#define shiftOutByte(dataPin, clockPin, bitOrder, val) { \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?0:7); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?1:6); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?2:5); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?3:4); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?4:3); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?5:2); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?6:1); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?7:0); \
}

shiftOutByte() should mimick the interface of shiftOut(), but it benefits mostly from passing compile time
resolvable constants as arguments, the more constants, the better!

Since most of these macros are not a functions, be careful not to pass in arguments which have side effects when being referenced since all of the arguments will be referenced more than once.