Fast alternative to digitalRead/digitalWrite

I have developed a new C++ library for fast digital I/O and would appreciate any comments and suggestions. The library is posted at code.google.com/p/beta-lib/downloads/list as the file DigitalPinBeta20120113.zip.

A number of people have developed fast versions of digitalRead/digitalWrite using C macros to generate fast inline code.

I decided to design a new API and use a template class. Here is an example program that generates two 125 ns wide pulses for a scope timing test, assuming a 16 MHz CPU.

// scope test for write timing
#include <DigitalPin.h>

// class with compile time pin number
DigitalPin<13> pin13;

void setup() {
  // set mode to OUTPUT
  pin13.outputMode();
}
void loop() {
  pin13.high();
  pin13.low();
  pin13.high();
  pin13.low();
  delay(1);
}

Each of the statements, pin13.outputMode(), pin13.low(), and pin13.high(), compiles into a two byte cbi/sbi instruction.

For high address Mega pins the functions are larger and slower but provide atomic access to the pins.
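To see why a compile-time pin number helps, here is a minimal sketch that compiles on a PC. It is not the DigitalPin source: `SimPin`, `simPort`, and `simDdr` are made-up names, and plain variables stand in for an AVR port's PORT and DDR registers. Because the bit number is a template parameter, the mask is a compile-time constant, which is the property that lets a single-bit update on a low AVR port reduce to one sbi or cbi.

```cpp
#include <stdint.h>

// Hypothetical stand-ins for one AVR port's PORT and DDR registers.
static uint8_t simPort = 0;
static uint8_t simDdr = 0;

// Illustrative only: not the DigitalPin implementation.
template <uint8_t Bit>
class SimPin {
  static const uint8_t kMask = 1 << Bit;  // fixed at compile time
 public:
  void outputMode() { simDdr |= kMask; }
  void high() { simPort |= kMask; }
  void low() { simPort &= static_cast<uint8_t>(~kMask); }
  void write(bool value) { if (value) high(); else low(); }
  bool read() const { return (simPort & kMask) != 0; }
};
```

Usage mirrors the scope test above: declare `SimPin<5> led;`, then call `led.outputMode()`, `led.high()`, and `led.low()`.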

Hi, nice work.

I should have waited a few weeks... Your previous FastDigitalIO class allowed me to scrap my re-invention of a wheel. But I recently added this same functionality and moved the files into one... Anyway your new code is nice n tidy so I might put mine in the archive now.

As a C++ programmer, macros as long as some of the port map versions are just a headache. As I'm writing a template HAL library, this code almost seems custom written for me :) I have a system I'm deriving off parts of this to allow my HAL to write pins on like ports in one operation. I will post it when done if you would like a look.

Looks like nice code to my (mostly hardware) eyes, and I far prefer the OO syntax.

Now I have a request, how about a "pingroup" class where you define a random selection of pins and can then apply a value to them. Maybe limited to 8 bits eg

PinGroup myPG (1,3,5,6,7,9,23,24);  // -1 for unused pins or 8 constructors ?

myPG.set(0xF5);

for (int i = 0; i < 100; i++) myPG.set(i);

This may or may not look very clean under the covers but would be a lot better than the

digitalWrite (1, HIGH);
digitalWrite (2, HIGH);
digitalWrite (3, LOW);
digitalWrite (4, HIGH);
digitalWrite (7, LOW);
digitalWrite (9, LOW);
digitalWrite (23, HIGH);
digitalWrite (24, HIGH);

That we currently have.


Rob

Now I have a request, how about a "pingroup" class where you define a random selection of pins and can then apply a value to them. Maybe limited to 8 bits eg

I too see the value in something implementing those ideas.

That is essentially what I have started making.

I have a class 'WriteMany' specialised for up to 8 template parameters. Using compile-time logic it determines which pins are on like ports and writes them together; any unique pins generate FastDigitalIO/digitalPin write methods. I have hit a small block as I rethink the port grouping logic: after specialising a four pin write I noticed that the next 4 specialisations will have quite a bit of code to them and would increase compile time dramatically. Not good as I plan to have it handling 69 pins.

PinGroup myPG (1,3,5,6,7,9,23,24);  // -1 for unused pins or 8 constructors ?
myPG.set(0xF5);
for (int i = 0; i < 100; i++) myPG.set(i);

On a Mega a PinGroup could become quite large so a pingroup should have its max size as param: PinGroup myPG;

Furthermore must it set pins of the same register simultaneously? If pins are in different registers this is not possible ...

Internally I would keep it simple, something like :

void PinGroup::set(int val)
{
  for (uint8_t i = 0; i < size; i++)
  {
    // -1 in a group just means skip this pin
    if (pin[i] >= 0) digitalWrite(pin[i], bitRead(val, i));
  }
}

Collecting registers and setting them at once would cause extra code so I doubt if it is faster
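A PC-runnable sketch of the simple loop approach described above, under stated assumptions: `SimplePinGroup`, `simWrite`, and `simPinState` are hypothetical names, and `simWrite` stands in for `digitalWrite()` so the code can be tried without a board. Default arguments of -1 give the "up to 8 pins, -1 for unused" constructor without writing eight overloads.

```cpp
#include <stdint.h>

// Stand-in for digitalWrite() so the sketch runs on a PC (assumption:
// on an Arduino you would call digitalWrite(pin, value) here instead).
static bool simPinState[32];
static void simWrite(uint8_t pin, bool value) { simPinState[pin] = value; }

// Hypothetical run-time pin group, limited to 8 pins; -1 marks an unused slot.
class SimplePinGroup {
 public:
  SimplePinGroup(int8_t p0, int8_t p1 = -1, int8_t p2 = -1, int8_t p3 = -1,
                 int8_t p4 = -1, int8_t p5 = -1, int8_t p6 = -1,
                 int8_t p7 = -1) {
    const int8_t pins[8] = {p0, p1, p2, p3, p4, p5, p6, p7};
    for (uint8_t i = 0; i < 8; i++) pin_[i] = pins[i];
  }

  // Bit i of val goes to the i-th pin; unused slots are skipped.
  void set(uint8_t val) {
    for (uint8_t i = 0; i < 8; i++) {
      if (pin_[i] >= 0) simWrite(pin_[i], (val >> i) & 1);
    }
  }

 private:
  int8_t pin_[8];
};
```

With `SimplePinGroup pg(1, 3, 5, 6, 7, 9, 23, 24);`, a call to `pg.set(0xF5)` drives the first listed pin from bit 0, the second from bit 1, and so on. The pins are written one after another, not simultaneously.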
PinGroup myPG (1,3,5,6,7,9,23,24);  // -1 for unused pins or 8 constructors ?

Unfortunately the pins have to be defined in template parameters too.
If you use a formal parameter as a non-type template argument you will get an error ( XXX cannot appear in a constant-expression ), meaning the DigitalPin library cannot be used this way.
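A small illustration of that restriction, using a hypothetical `PinTag` type in place of the library's class. A namespace-scope `const` initialised with a literal is a constant expression and works as a template argument; a function parameter is only known at run time and does not.

```cpp
#include <stdint.h>

// Hypothetical tag type: the pin number is part of the type itself.
template <uint8_t Pin>
struct PinTag {
  static const uint8_t number = Pin;
};

// Compile-time constant, so it is usable as a template argument.
const uint8_t kLedPin = 13;
typedef PinTag<kLedPin> LedPin;

// A function parameter is only known at run time, so this would not compile:
//   void bad(uint8_t p) { PinTag<p> pin; }  // p is not a constant expression
```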

I'm overcoming this with a few macros to combine the pin numbers into one data block, so each template parameter contains a number of pins. I still have to test whether large data types ( 64-bit integer ) are compatible; they should be, as they are implemented by the compiler rather than the Arduino.

A pin grouping system is definitely a task I would like to utilise and help create if needed.

EDIT: this task seems to be hindered by the Arduino IDE itself. If it supported C++0x ( or whatever the new standard is ), variadic templates would be perfect for this situation.
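For reference, a minimal sketch of what a variadic-template pin group could look like with a C++11 compiler. This is not the WriteMany class mentioned above: `WriteGroup`, `writePin`, and `simPinState` are hypothetical names, and `writePin` only records the value in an array so the example runs on a PC. It writes each pin individually and does not attempt to merge pins that share a port.

```cpp
#include <stdint.h>

// Stand-in for a fast single-pin write (assumption: on an AVR this would
// forward to a compile-time pin class instead of an array).
static bool simPinState[70];

template <uint8_t Pin>
inline void writePin(bool value) {
  static_assert(Pin < 70, "pin number out of range for this sketch");
  simPinState[Pin] = value;
}

// Primary template: an empty pin list writes nothing.
template <uint8_t... Pins>
struct WriteGroup {
  static void write(uint32_t) {}
};

// Peel off the first pin, give it bit 0, then recurse on the rest.
template <uint8_t First, uint8_t... Rest>
struct WriteGroup<First, Rest...> {
  static void write(uint32_t value) {
    writePin<First>(value & 1);
    WriteGroup<Rest...>::write(value >> 1);
  }
};
```

A call such as `WriteGroup<3, 9, 7, 5, 13>::write(0x15);` expands at compile time into five single-pin writes, with no pin table kept in SRAM.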

I have played with multiple pins for timing tests. Scope tests with the following five pin example show that a single call to writeGroup() takes 2.5 microseconds. That is faster than a call to digitalWrite() for a single pin.

It takes 80 microseconds for the loop to go through all 32 possible values for five pins.

#include <DigitalPin.h>
DigitalPin<3>  pin0; // bit 0X01
DigitalPin<9>  pin1; // bit 0X02
DigitalPin<7>  pin2; // bit 0X04
DigitalPin<5>  pin3; // bit 0X08
DigitalPin<13> pin4; // bit 0X10

void initGroup () {
  pin0.outputMode();
  pin1.outputMode();
  pin2.outputMode();
  pin3.outputMode();
  pin4.outputMode();  
}
void writeGroup(uint8_t val) {
  pin0.write(1  & val);
  pin1.write(2  & val);
  pin2.write(4  & val);
  pin3.write(8  & val);
  pin4.write(16 & val);
}

void setup() {
  initGroup();
}

void loop() {
  for (uint8_t i = 0; i < 32; i++) {
    writeGroup(i);
  }
}

The sketch takes 552 bytes of flash on Arduino 1.0. The "empty" sketch

void setup() {}
void loop() {}

takes 466 bytes of flash so that is only 86 additional bytes.

You could write templates for a given number of pins. Not so neat but works.

TwoPinGroup<Pin0, Pin1>
ThreePinGroup<Pin0, Pin1, Pin2>
...
EightPinGroup<Pin0, Pin1, Pin2, Pin3, Pin4, Pin5, Pin6, Pin7>

I thought of a DigitalPort class for multiple bits on one port. Trying to combine bits that are on the same port in PinGroup has a very high overhead since you can't arrange for the compiler to optimize to efficient I/O instructions.

Here is an idea for a read group:

uint8_t readGroup() {
  uint8_t value = 0;
  if (pin0.read()) value |= 1;
  if (pin1.read()) value |= 2;
  if (pin2.read()) value |= 4;
  if (pin3.read()) value |= 8;
  if (pin4.read()) value |= 16;
  return value;
}

This function is small, less than 40 bytes of flash. I haven't timed it.

imho a pingroup would have an internal collection to which runtime pins can be added and removed (don't know the purpose for remove yet)
The collection is not sorted, so the adding order applies.

pinGroup <4> PG;  // size = 4

PG.add(2);    // or PG.add(pin2);
PG.add(13);
PG.add(6);
PG.add(8);  // internal array { 2,13,6,8 }; ActualSize = 4; maxSize = 4;

int x = PG.readGroup();  // auto inputMode
PG.writeGroup(5);         // and outputMode?   pin2 = 0 pin13 = 1 pin 6 = 0 pin 8 = 1
PG.write(HIGH);            // all pins HIGH

PG.remove(6); // internal array { 2,13,8 }; ActualSize = 3; maxSize = 4;
PG.add(6);      // internal array { 2,13,8,6 }; ActualSize = 4; maxSize = 4;  ==> last 2 lines swapped !
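The add/remove bookkeeping described in this post can be sketched as below. `PinList` is a hypothetical name and only the unsorted, fixed-capacity collection is modelled; no I/O is done, so it runs on a PC as well as on an Arduino.

```cpp
#include <stdint.h>

// Hypothetical fixed-capacity pin list that keeps insertion order.
template <uint8_t MaxSize>
class PinList {
 public:
  PinList() : size_(0) {}

  // Append a pin; returns false when the list is already full.
  bool add(uint8_t pin) {
    if (size_ >= MaxSize) return false;
    pins_[size_++] = pin;
    return true;
  }

  // Remove the first match and close the gap, keeping the order.
  bool remove(uint8_t pin) {
    for (uint8_t i = 0; i < size_; i++) {
      if (pins_[i] == pin) {
        for (uint8_t j = i + 1; j < size_; j++) pins_[j - 1] = pins_[j];
        size_--;
        return true;
      }
    }
    return false;
  }

  uint8_t size() const { return size_; }
  uint8_t at(uint8_t i) const { return pins_[i]; }

 private:
  uint8_t pins_[MaxSize];
  uint8_t size_;
};
```

Adding 2, 13, 6, 8 to a `PinList<4>`, removing 6, and adding 6 again leaves the order 2, 13, 8, 6, which is the reordering noted in the last comment above.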

Have you looked at what's been done in v3 GLCD? There's some clever code in there to optimize port access.

Iain

Yes, I have looked at v3 GLCD and it contains the kind of macros I am trying to avoid.

DigitalPin generates the fastest and smallest possible code for reading and writing I/O ports on 328 Arduinos and ports A-G on the Mega.

For high() this is a single sbi instruction and for low() a single cbi instruction. These instructions execute in two cycles.
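The timing figures in this thread follow directly from the cycle counts. A small helper (hypothetical name `cyclesToNs`) makes the arithmetic explicit: two cycles at 16 MHz is 125 ns, the pulse width quoted for the scope test, and the 2.5 microseconds measured for writeGroup() corresponds to 40 cycles.

```cpp
#include <stdint.h>

// Nanoseconds taken by `cycles` CPU clock cycles at a clock of `cpuHz`.
inline uint32_t cyclesToNs(uint32_t cycles, uint32_t cpuHz) {
  // 64-bit intermediate so cycles * 1e9 cannot overflow.
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(cycles) * 1000000000ULL) / cpuHz);
}
```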

This sketch only uses 14 bytes more than an empty sketch:

// read pin 12 write value to pin 13
#include <DigitalPin.h>

DigitalPin<12> readPin;
DigitalPin<13> writePin;

void setup() {
  readPin.inputMode();
  writePin.outputMode();
}
void loop() {
  writePin.write(readPin.read());
}

fat16lib:
Yes, I have looked at v3 GLCD and it contains the kind of macros I am trying to avoid.

DigitalPin generates the fastest and smallest possible code for reading and writing I/O ports on 328 Arduinos and ports A-G on the Mega.

Sorry, I don't think I made myself clear. I was referring to the discussion on pin groups. v3 GLCD has some clever code for recognising when groups of pins are on the same port and generating faster code than accessing pins individually.

I think the DigitalPin template will be really useful but at present where speed is important I'm accessing the ports directly.

BTW why is access to ports H+ on the Mega slower?

Iain

Even for pin groups, combining pin accesses is often slower and takes more code.

Here are some examples (C++ statement followed by generated code):

To write one bit for ports A-G sbi/cbi is the winner:

  PORTB |= 0X1;
   c:   28 9a           sbi     0x05, 0 ; 5

With two or more pins, combining bits requires more instructions. You also need a cli/sei to make it atomic for general use.

  cli();
   c:   f8 94           cli
  PORTB |= 0X11;
   e:   85 b1           in      r24, 0x05       ; 5
  10:   81 61           ori     r24, 0x11       ; 17
  12:   85 b9           out     0x05, r24       ; 5
  sei();
  14:   78 94           sei

So it is hard to save time or code by combining bits. You do get all bits changing state at the same time.

The best plan for a pinGroup is to dedicate an entire port so you don't need to OR or AND bits and worry about atomic operations. That's why I think a DigitalPort class is best.
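A minimal sketch of the whole-port idea, assuming nothing about how a real DigitalPort class would be written. `WholePort` is a hypothetical name, and the three pointers stand in for the addresses of a port's DDRx, PORTx, and PINx registers; passing pointers to plain variables lets it run on a PC.

```cpp
#include <stdint.h>

// Hypothetical wrapper for a port whose eight pins all belong to one device.
class WholePort {
 public:
  WholePort(volatile uint8_t* ddr, volatile uint8_t* port,
            volatile uint8_t* pin)
      : ddr_(ddr), port_(port), pin_(pin) {}

  void outputMode() { *ddr_ = 0xFF; }  // all eight pins become outputs
  void inputMode() { *ddr_ = 0x00; }   // all eight pins become inputs

  // A single store: no read-modify-write, so no cli/sei is needed.
  void write(uint8_t value) { *port_ = value; }

  uint8_t read() const { return *pin_; }

 private:
  volatile uint8_t* ddr_;
  volatile uint8_t* port_;
  volatile uint8_t* pin_;
};
```

On an AVR the object would be built from register addresses, for example `WholePort lcd(&DDRC, &PORTC, &PINC);`. A template version with the registers fixed at compile time would avoid the pointer indirection.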

For Mega ports H, J, K, and L, cbi/sbi can't be used since the port address is too large. Setting a single bit in these ports is slow:

  cli();
   c:   f8 94           cli
  PORTH |= 0X1;
   e:   e2 e0           ldi     r30, 0x02       ; 2
  10:   f1 e0           ldi     r31, 0x01       ; 1
  12:   80 81           ld      r24, Z
  14:   81 60           ori     r24, 0x01       ; 1
  16:   80 83           st      Z, r24
  sei();
  18:   78 94           sei
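The address limit can be stated as a one-line check. sbi and cbi only reach the lowest 32 I/O registers, which occupy data-space addresses 0x20 to 0x3F. The helper name `reachableBySbiCbi` is hypothetical; the example addresses match the listings above (PORTB at I/O address 0x05, that is data address 0x25, and PORTH at data address 0x0102, the value loaded into Z).

```cpp
#include <stdint.h>

// True when a register at this data-space address is within reach of the
// sbi/cbi instructions (I/O addresses 0x00..0x1F).
inline bool reachableBySbiCbi(uint16_t dataAddress) {
  return dataAddress >= 0x20 && dataAddress <= 0x3F;
}
```

`reachableBySbiCbi(0x25)` is true for PORTB, while `reachableBySbiCbi(0x0102)` is false for PORTH, which is why the compiler falls back to the load, OR, store sequence shown above.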

On a Mega a PinGroup could become quite large so a pingroup should have its max size as param:

I would suggest limiting to 8 pins anyway.

Furthermore must it set pins of the same register simultaneously? If pins are in different registers this is not possible ...

Not necessarily, if the fact that pins are on the same port can be detected great, but even if behind the scenes it degenerates to a stack of single pin writes (as you show) at least the application code will be simpler and more readable.

You could write templates for a given number of pins. Not so neat but works.
TwoPinGroup<Pin0, Pin1>
ThreePinGroup<Pin0, Pin1, Pin2>

I'm not strong on C++ but can't you have 8 constructors with different numbers of parms? That way there is only a single pinGroup object and the syntax is the same up to 8 pins.

As for simultaneous writes, it would be nice if the class auto detected pins on the same port but I don't think that's really important. Maybe a second Port class that boils down to
simple "PORTx =" code with .bitSet() and .bitClear() methods that just do "PORTx |= val" etc. At least that will add to the current HAL and isolate beginners from such "complex" ideas.

OTOH if all this can be rolled into a single class even better.


Rob

As for simultaneous writes, it would be nice if the class auto detected pins on the same port

The only purpose for my WriteMany class is to write like pins. A convenience factor is not really on my list at all.

imho a pingroup would have an internal collection to which runtime pins can be added and removed (don't know the purpose for remove yet)
The collection is not sorted, so the adding order applies.

Also pins probably should not be runtime. Assigning runtime pins more than once doesn't really make sense unless you are physically re-wiring your hardware while the Arduino is on.

Also, runtime values are not usable with the digitalPin library, so the group would have to resort to some slower lookup table version, making it more efficient to just write the pins individually.

Non-type template parameters also have no storage overhead, no SRAM is used to store the parameters past compilation as the compiled code is completely customised to those parameters. The alternative is a generic read/write that must look up the contents with every operation.

My code, as tested for 3 and 4 pins, produces fewer instructions on like pins than doing an individual write on each pin. When I finish the 4 & 5 pin writer I'll post it.

I'm not limiting this code to 8 pins though. The benefits my HAL will theoretically receive from writing any number of pins outweigh this limitation by far.

Writing multiple pins seems like a good idea, at least in the abstract. There are cases where dedicating an entire 8-bit port to a device makes sense, but that is not the same as writing multiple pins.

I have written a lot of bit-bang code for SPI, I2C, and various devices. When I get to real hardware, my abstract write multiple ideas never seem to help.

Does anyone have a situation with real hardware where an existing implementation would be improved by write multiple with three or four pins? The pins must be restricted to a single port.

The best example I have is something like an LCD display. In this case the restriction that all pins are on the same port is too severe. The library LiquidCrystal allows any pins and that doesn't add much complication. Here are the byte and nibble write functions.

void LiquidCrystal::write4bits(uint8_t value) {
  for (int i = 0; i < 4; i++) {
    pinMode(_data_pins[i], OUTPUT);
    digitalWrite(_data_pins[i], (value >> i) & 0x01);
  }
  pulseEnable();
}

void LiquidCrystal::write8bits(uint8_t value) {
  for (int i = 0; i < 8; i++) {
    pinMode(_data_pins[i], OUTPUT);
    digitalWrite(_data_pins[i], (value >> i) & 0x01);
  }
  pulseEnable();
}

Note this code has pinMode in the write function. LCD displays can be written or read so your write multiple should also support read.

It's the details of real complex devices that seem to kill the advantages of a library for accessing multiple pins.

I was very interested in this because I'm writing for dedicated hardware with my LCD data pins contiguous and on the same port. With a simple benchmark just converting the stock LiquidCrystal library to DigitalPin and nothing else I saw a 32% speed up. By changing the write4bits method to shift the nibble directly into the port I only saw an additional 1.1% speed increase from the pure DigitalPin version.

Unless I totally mangled my direct port code, which is a very real possibility

PORTC = (PORTC & (~B00111100)) | ((value << 2) & B00111100);// D0-3 on A2-A5

When I removed the section setting the pins to output in each write the difference between digitalPin and direct port was only .06%

Is that basically what you're getting at or did I miss the point entirely?
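The mask-and-shift in the PORTC expression above can be checked on a PC as a pure function. `mergeNibble` is a hypothetical name; the mask 0x3C is the same value as B00111100, so the low nibble of `value` lands on bits 2 to 5 and the other four bits of the port are preserved.

```cpp
#include <stdint.h>

// Replace bits 2..5 of a port value with the low nibble of `value`,
// leaving the other four bits untouched.
inline uint8_t mergeNibble(uint8_t portValue, uint8_t value) {
  const uint8_t mask = 0x3C;  // same as B00111100
  return static_cast<uint8_t>((portValue & ~mask) | ((value << 2) & mask));
}
```

On the board this corresponds to `PORTC = mergeNibble(PORTC, value);`. It is a read-modify-write, so it needs protecting if an interrupt can also touch PORTC.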

You got the point exactly.

Often what looks great in C/C++ code doesn't optimize well for I/O on AVR chips.

avr-gcc seems to really understand single bit operations. Sometimes it does really stupid things with more complex cases.

I am now doing a very general bit-bang SPI implementation for all SPI modes. I get a factor of four speedup from simple changes that make the compiler do what I expect.

@fat16lib, I have recently converted my older code using FastDigitalIO to your newer DigitalPin library.
The compiled size grew by two bytes and I cannot find the reason why ( 2 bytes is nothing anyway ).
I also noticed the 'mode()' function is gone; it was easier to use in some circumstances.

I will be posting another version of digitalRead/digitalWrite. The current version is not working well in a general implementation of software SPI master. I am also using it for a fast software I2C master.

I tested this library on my Uno against digitalWrite and got a nice improvement in speed. I also thought your library was easy to use once I caught on to the terms needed to activate the pins.

Can this library be used with any Arduino compatible board or do the pins have to be defined in the library first?

To narrow my question down, I would like to use it on a Leonardo board and "here comes a dream", try to use the library on a Maple.

Great work, thank you for your efforts!