Pages: 1 [2] 3 4   Go Down
Author Topic: Higher Performance IO  (Read 7197 times)
0 Members and 1 Guest are viewing this topic.
Las Vegas, NV
Offline Offline
God Member
*****
Karma: 0
Posts: 507
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

PWM generation should be ok because that is taken care of entirely in hardware on the mega168 (i.e. it's not the case of a timer-based ISR manually toggling an output pin).  There is an ISR bug of this nature that has to do with the millis() timer and its unsafe use of globals that can be updated mid-calculation by the timer0 overflow ISR, but this problem should be corrected in the arduino-0012 release (a temporary fix is to surround every millis() call by cli() and sei() to disable interrupts during the calculation).

I think the only ISRs used by the Arduino environment are the timer0 overflow interrupt (which just updates global variables used by millis()) and the serial receive interrupt.  I haven't looked at the serial receive ISR to see exactly what it does, but I'm assuming it just reads data from UDR0 and stores it in a global ring buffer.  It's possible the environment also uses this buffer in an unsafe way, but I can't say without looking at the code.

- Ben
Logged


0
Offline Offline
Newbie
*
Karma: 0
Posts: 9
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

 
Quote
It still seems like there should be a safe (nicely integrated in the Arduino environment) faster version of shiftOut() available for general Arduino use, would you agree?  I can't help but wonder how many Arduino users simply gave up performing serial IO with software because they thought they were limited to a slow shiftOut() implementation and did not have the expertise to code up a faster version.

I agree and it should have the ability to designate whether the clock pin idles high or low.

I had to write my own shiftOut to get the clock to idle high and I suffered a glacially slow new function - each clock cycle was taking 14uS. Re writing as a Port manipulation improved to 3uS. I didn't know about macros and the compiler interpretation issues brought up here.

But my devices can clock at 20MHz so I am looking at using SPI (didn't know about that until someone here told me). It seems promising. My only problem with that is my device only uses 2 wires and strobes the chips by holding the clock high and pulsing the data line 8 times. I might have to disable the SPI when I need t do this or maybe multiplex the clock and data lines so I can drive them with both the SPI and some other pins to send the strobe signal.

But basically I agree that a faster shiftOut would be great - lots and lots of people are trying to shiftOut to LED arrays and end up finding the Arduino a bit pokey at this.

Martin, cant you divide the SPI clock/4 to get to 4MHz so tha SPI would work with your 5MHz devices?
Logged

Forum Administrator
Cambridge, MA
Offline Offline
Faraday Member
*****
Karma: 11
Posts: 3538
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Hey, guys, this discussion is great, and I'd love to improve the shiftOut() performance for Arduino 0012.  Can someone post an updated version of the shiftOut() code based on the conversation?  It might be simpler not to use the fastWrite() macros, but just do everything directly in the shiftOut() code for now.  I'm also looking at adding more general purpose bitRead() and bitWrite() macros which would something similar to fastWrite(), but until then, it's probably simpler to do without them.
Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

OK, I have more results.  First off, I benchmarked a switch() version of shiftOut() using hardcoded port/pins for each case.  This switch trick allows the compiler to see the ports and pins as constants and it indeed does use single instruction bit set/clears.  Unfortunately, the extra comparison and branches needed for every iteration of the loop leads to a slower performing shiftOut(), although it would use an atomic update and be more "correct" than the original shiftOut().  The other problem with using a switch is that it makes the loop times potentially different depending on which pin is chosen since it will involve either fewer or more comparisons. I benchmarked this shiftOutSwitch() at ~40us, which is not bad, but not great and the code is ugly (2 large 13 step switches)!!!

Displeased with these results, I tried another approach to try and use real cbi/sbi instructions.  Instead of using a switch statement, I created separate cbi1(), cbi2()... functions and then created an array of function pointers to each function.  With this I built a new shiftOutFP() (function pointer), which before looping picks the appropriate sbi/sbi functions for each of the data and clock pins and simply calls those in the loop.  This also led the compiler to perform the constant port/pin optimization and thus to use the atomic cbi/sbi machine instructions.  The performance results were however still
disappointing, ~42us.  Not only that, but this is even uglier code than shiftOutSwitch() since it involves defining at least 13x2=26 functions and then uses 23*2=46 bytes of SRAM to store the pointers to these in an array.  Needless to say I was disappointed, while the attempt to use cbi/sbi did to lead safer more correct code, in both cases it is much uglier code and slower.

So, I decided to give up on using sbi/cbi for now and focused on making the shiftOutHP() from previous posts more "correct", i.e. fixing the non-atomic write from digitalWrite().  To do this I simply disabled interrupts during each pin update and restored the previous state afterwards.  Surprisingly, this ends up being faster than the previous two atomic attempts, ~37us.  So, while this is slightly slower than the non-atomic version, this is still (exactly) a 3 fold improvement over the current arduino shiftOut() and is more correct to boot.  While some speedup may be gained by turning interrupts off once and keeping them off the whole time, this could increase latency for interrupts dramatically and at best cause a 4us speedup, I would not consider this a good tradeoff.  This would be my current suggestion of a function to replace the arduino core shiftOut().

Code:
void shiftOutHPA(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
{
  uint8_t dport = digitalPinToPort(dataPin);
  uint8_t cport = digitalPinToPort(clockPin);
  if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
  uint8_t dmask = digitalPinToBitMask(dataPin);
  uint8_t cmask = digitalPinToBitMask(clockPin);
  volatile uint8_t *dreg = portOutputRegister(dport);
  volatile uint8_t *creg = portOutputRegister(cport);
  uint8_t dtimer = digitalPinToTimer(dataPin);
  uint8_t ctimer = digitalPinToTimer(clockPin);
  if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
  if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

  byte sreg = SREG;
  if (bitOrder == LSBFIRST)
    for (int i = 0; i < 8; i++) {
      cli();
      if (!!(val & (1 << i)) == LOW) *dreg &= ~dmask;
      else *dreg |= dmask;
      SREG=sreg; cli();
      *creg |= cmask;
      SREG=sreg; cli();
      *creg &= ~cmask;
      SREG=sreg;
    }
  else
    for (int i = 0; i < 8; i++) {
      cli();
      if (!!(val & (1 << (7 - i))) == LOW) *dreg &= ~dmask;
      else *dreg |= dmask;
      SREG=sreg; cli();
      *creg |= cmask;
      SREG=sreg; cli();
      *creg &= ~cmask;
      SREG=sreg;
    }
}

I have not heavily tested this function yet, so it would be best if others could test it too before including it in the core.  I would also suggest a similar SREG/cli fix to go into digitalWrite() so that it is less likely to interfere with other pins used by interrupts, the performance impact will be negligible.
Logged

Forum Administrator
Cambridge, MA
Offline Offline
Faraday Member
*****
Karma: 11
Posts: 3538
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Does it actually matter in practice if these writes are atomic or not?  Have people had problems with this?  Most of the other core functions don't disable interrupts while writing to the ports, so I'm not sure it makes sense to do it here.
Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Since I am fairly new to this, I don't know if anyone has had any problems with the non atomic writes.  if they did, I would think that it might be hard for them to even figure out why they are having a problem.  The symptom itself might be hard to detect and simply result in people thinking the arduino is a POS and giving up, but I am surely just guessing.

When it comes down to it, feel free to treat this as two separate issues.  The one that started this thread is performance.  You could make the performance modification alone.  If you want to do a code audit for things like atomic updates at some point, you do that separately on the entire code base.  My two cents.  I just happened to come about the conclusion of "incorrect design" by investigating the performance issue.

If you want the performance fix without the atomic fix, you can use the original shiftOutHP() with the fix mentioned in the following post.  If you further more want to include a shiftOutRaw() at some point, that might be nice too, but it may not fit well into the goals of the core arduino stuff since it would primarily eliminate all the safety checks done at the beginning.  I will post the best shiftOutRaw() that I can come up with soon, a hardcoded one to specific compile time chosen pins should be quite a bit faster!
Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

OK, I'm sorry I missed some really obvious optimizations still, I have now doubled the best previous performance I had with a new shiftOutHPPM() (High Performance PreMasked).  By using predefined masks for bit comparisons I saved quite a bit of time.  Once this optimization was in place it seemed worth uglying up the code a little and unrolling the loop and gaining the extra 6us which is a fairly high percentage at this point.  This leads me to a shiftOutHPPM() which can execute in 18us!!!  Adding an atomic fix to this seems to only take another 2us.  While this is not the prettiest code, it is the fastest dropin replacement for shiftOut() that I could come up with to date, more than a 6 fold improvement over the current arduino code.

Code:
void shiftOutHPPM(uint8_t dataPin, uint8_t clockPin, uint8_t bitOrder, byte val)
{
  uint8_t dport = digitalPinToPort(dataPin);
  uint8_t cport = digitalPinToPort(clockPin);
  if (dport == NOT_A_PIN || cport == NOT_A_PIN ) return;
  uint8_t dmask = digitalPinToBitMask(dataPin);
  uint8_t cmask = digitalPinToBitMask(clockPin);
  volatile uint8_t *dreg = portOutputRegister(dport);
  volatile uint8_t *creg = portOutputRegister(cport);
  uint8_t dtimer = digitalPinToTimer(dataPin);
  uint8_t ctimer = digitalPinToTimer(clockPin);
  if (dtimer != NOT_ON_TIMER) turnOffPWM(dtimer);
  if (ctimer != NOT_ON_TIMER) turnOffPWM(ctimer);

  if (bitOrder == LSBFIRST) {
    if (val & B00000001) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000010) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000100) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00001000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00010000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00100000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B01000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B10000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
  } else {
    if (val & B10000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B01000000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00100000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00010000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00001000) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000100) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000010) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
    if (val & B00000001) *dreg |=  dmask;
    else                 *dreg &= ~dmask;
    *creg |= cmask; *creg &= ~cmask;
  }
}


If you want atomic behavior, change it to look like this:
Code:
   byte sreg=SREG;
    if (val & B00000001) { cli(); *dreg |=  dmask; SREG=sreg; }
    else                 { cli(); *dreg &= ~dmask; SREG=sreg; }
    cli(); *creg |= cmask; SREG=sreg; cli(); *creg &= ~cmask; sreg= SREG;
    if (...

Note: this suggested atmoic implementation is slightly better than what I posted earlier since it does not disable the interrupts during the comparison.
Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I believe that I have also come up with the most efficient possible shiftOutRaw().  This requires hardcoding (predetermined at compile time) pins and the bitorder and thus is NOT a valid dropin for shiftOut().  I have looked at the generated object code and feel that it is essentially ideal (could not be beat by hand coding it).  Of course, I would love to be proven wrong and go faster still. smiley  This shiftOutRaw() takes only 6us.  I believe that should be fast enough to transmit at 1Mbps!!!!  This is barely tested (I am focusing on performance), so let me know if you see any flaws.  This relies on some new macros that I defined that are similar in spirit to the fastWrite() macro, but it uses and is more stylistically like some of the already defined cbi() and sbi() macros in the wiring code.

Code:
#define sbi(sfr, bit) (_SFR_BYTE(sfr) |= _BV(bit))
#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))

#define sbipin(_pin_) sbi(_pin_<8 ? PORTD:PORTB, _pin_ - (_pin_<8 ? 0:8))
#define cbipin(_pin_) cbi(_pin_<8 ? PORTD:PORTB, _pin_ - (_pin_<8 ? 0:8))

#define DATA_PIN   11
#define CLOCK_PIN  12
#define BIT_ORDER  LSBFIRST

void shiftOutRaw(byte val)
{
  byte  mask = BIT_ORDER == LSBFIRST ? 1 : 128;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
  mask = BIT_ORDER == LSBFIRST ? mask << 1 : mask >> 1;
  if (val & mask) sbipin(DATA_PIN);
  else            cbipin(DATA_PIN);
  sbipin(CLOCK_PIN); cbipin(CLOCK_PIN);
}

Inventor:

No, I have not tried the SPI hardware and I currently know little about it.  I would like to see how far these optimizations will take me first.  For my needs, it's probably enough, but I have a friend who could potentially need more, I will keep it in mind, thanks for the idea.
Logged

London
Offline Offline
Tesla Member
***
Karma: 10
Posts: 6250
Have fun!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Looks like good progress.

I do have to say that I think the fastWrite style macro is more in keeping with the spirit of the digitalWrite type expressions that arduino users on this board would be familiar with. I don't see the advantage of macro syntax that moves away from the friendly arduino abstractions if the code is intended for arduino users.
« Last Edit: July 31, 2008, 02:29:51 pm by mem » Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

mem:

Fair enough, I was thinking along the lines of core developers, but you are right, the function is probably aimed at arduino users.  I could see that as being useful also.  I'll leave the implementation of that up to ... someone else. smiley

However, I will take the opportunity to point out a tiny nit that I also noticed while doing these optimizations.  The current arduino shiftOut() function (and thus also some of my previous optimizations), make the assumption that LOW is equal to 0 and HIGH is equal to 1 in lines like this:

   digitalWrite(dataPin, !!(val & (1 << i)));

While this works, if someone ever changed the definition of LOW/HIGH to something else, it would break.  The latest shiftOutHPPM() does not suffer from this since it writes directly to the ports. smiley
Logged

London
Offline Offline
Tesla Member
***
Karma: 10
Posts: 6250
Have fun!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

In the arduino world, HIGH is defined as 1 and LOW as 0 (see wiring.h).

sometimes it's hard to remember that the arduino environment is not raw AVR GCC  smiley-wink



« Last Edit: July 31, 2008, 02:51:35 pm by mem » Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

I decided to try and make the shiftOutRaw() a macro so that it can be used in place of the regular shiftOut() without having to setup defines.  The macro I came up with shares the same interface as the regular arduino shiftOut(), but it does not do any special pin setups/checks, you must initialize the system/pins/timers to the proper state before using it.  It pretty much executes at the same speed as the shiftOutRaw() except if you use a constant value it will actually go faster, ~3us, (useful in some rare cases).  This macro builds upon a few other macros and should hopefully be fairly legible.  One of the other macros used is a bitWrite() which mellis was interested in having.  Hopefully this macro is mostly inline with the arduino coding style and perhaps something similar could even be included in the next release along with the updated shiftOutHPPM()?

Code:
#define sbi(sfr, bit) (_SFR_BYTE(sfr) |= _BV(bit))
#define cbi(sfr, bit) (_SFR_BYTE(sfr) &= ~_BV(bit))

#define sbipin(pin) sbi((pin)<8 ? PORTD:PORTB, (pin) - ((pin)<8 ? 0:8))
#define cbipin(pin) cbi((pin)<8 ? PORTD:PORTB, (pin) - ((pin)<8 ? 0:8))

#define bitWrite(pin, val) { \
  if ((val) == LOW) cbipin(pin); \
  else              sbipin(pin); \
}

#define shiftOutBit(dataPin, clockPin, val, bit) { \
  bitWrite(dataPin, ((val) & (1 << (bit))) ? HIGH:LOW); \
  bitWrite(clockPin, HIGH); \
  bitWrite(clockPin, LOW); \
}

#define shiftOutByte(dataPin, clockPin, bitOrder, val) { \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?0:7); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?1:6); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?2:5); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?3:4); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?4:3); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?5:2); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?6:1); \
  shiftOutBit(dataPin, clockPin, val, (bitOrder) == LSBFIRST ?7:0); \
}

shiftOutByte() should mimick the interface of shiftOut(), but it benefits mostly from passing compile time
resolvable constants as arguments, the more constants, the better!

Since most of these macros are not a functions, be careful not to pass in arguments which have side effects when being referenced since all of the arguments will be referenced more than once.
« Last Edit: August 07, 2008, 12:47:50 am by MartinFick » Logged

London
Offline Offline
Tesla Member
***
Karma: 10
Posts: 6250
Have fun!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Martin, I suggest you extend the fast write macros so the code will work on pins 14-19.
As you probably know, check if the pin is not less than 14 then subtract 14 from the given pin argument and set/clear on PORTC
« Last Edit: August 07, 2008, 02:00:33 am by mem » Logged

0
Offline Offline
Jr. Member
**
Karma: 0
Posts: 59
Arduino rocks
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hey mem and Martin,

May I include some of your code (attributing it to you) in the next version of the ShiftBrite library I'm putting together?
Logged

Boulder, CO
Offline Offline
Newbie
*
Karma: 0
Posts: 43
I want to be a really useful engine!
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Of course, you may use the code, especially since they are mods of arduino core stuff which is under the GPL.  No attribution required, but thanks for asking. smiley
Logged

Pages: 1 [2] 3 4   Go Up
Jump to: