Super High Performance Pin I/O Technique

I would like to share what I think is a very slick programming technique that
both the Arduino development team and Arduino users may well be
interested in: achieving very high performance pin I/O. Fortunately,
this issue is not new, and searches in the Forum have turned up quite a few hits
on the subject, so I hope and expect this post will draw some interest.

My take on Arduino I/O is that it is unnecessarily slow for the following
reasons:

  1. The Arduino environment uses some on-chip memory to store look-up
    tables that map an Arduino pin to a port output register, port
    direction register, pin input register, and port bit number.
    These on-chip lookup tables incur run-time overhead because
    pinMode(), digitalRead(), and digitalWrite() access them on
    every call.

  2. Although AVR processors have ideal instructions for high performance
    manipulation of I/O PORT registers (i.e. sbi/cbi instructions) the
    compiler seems to never issue them.

  3. Examination of the actual code for pinMode(), digitalRead(), and
    digitalWrite() shows them (to me) to be bloated with more
    functionality than is necessary. Specifically, interrupts are turned
    off/on and the status register is saved/restored during every call,
    which slows things down even more.

  4. On-chip lookup tables consume chip resources (flash rather than RAM,
    since the core keeps them in PROGMEM) that take away from what a user
    can use. This is not a performance issue but a resource overhead issue.
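To make points 1 and 3 concrete, here is a minimal host-side model of a table-driven pin write. It is a sketch under stated assumptions, not the Arduino core source: the arrays, the `fake_port` registers, and the `interrupts_enabled` flag are all hypothetical stand-ins so the code can run on a PC.

```cpp
#include <cstdint>

// Host-side model, not the Arduino core source. The arrays stand in for the
// core's pin tables and the ATmega328 port registers; every name is hypothetical.
static uint8_t fake_port[3] = {0, 0, 0};  // index 0 = PORTB, 1 = PORTC, 2 = PORTD

// Uno-style mapping: pins 0-7 are port D bits 0-7, pins 8-13 are port B bits 0-5.
static const uint8_t pin_to_port[14] = {2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0};
static const uint8_t pin_to_mask[14] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                                        0x01, 0x02, 0x04, 0x08, 0x10, 0x20};

static bool interrupts_enabled = true;  // stands in for the I flag in SREG

void table_driven_write(uint8_t pin, bool high) {
    uint8_t port = pin_to_port[pin];   // run-time lookup 1
    uint8_t mask = pin_to_mask[pin];   // run-time lookup 2
    bool saved = interrupts_enabled;   // models saving SREG
    interrupts_enabled = false;        // models cli()
    if (high) {
        fake_port[port] |= mask;
    } else {
        fake_port[port] &= static_cast<uint8_t>(~mask);
    }
    interrupts_enabled = saved;        // models restoring SREG
}
```

Calling `table_driven_write(13, true)` sets bit 5 of the modeled port B. Every step before the final bit operation is work that a single sbi/cbi instruction would not need.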

The programming technique that I am experimenting with has the following
characteristics (compare them with the above points):

  1. No on chip lookup tables are required which means there is no
    run time overhead.

  2. The technique issues atomic sbi/cbi instructions that don’t require
    turning off/on interrupts and saving/restoring system status
    register.

  3. The technique issues atomic sbi/cbi instructions that replace
    entire function calls, so this overhead is also eliminated (as far as
    I know, all I/O PORT registers are within the addressing range of the
    cbi/sbi instructions).

  4. No on chip lookup tables are used so more on chip resources are
    available to users.

  5. This technique can co-exist with the current Arduino environment but
    it is also capable of eliminating the need to call pinMode(),
    digitalRead(), and digitalWrite().

So, what is this technique? I’m glad you asked.

The technique is based on a clever use of MACROS to store and access the pin
mapping table. Macros can be defined to hold a lookup table (rows and columns
of information) and other macros are defined to extract and use specific
pieces of that information. Since macros are expanded during the preprocessing
stage of the compiler there is no need to store this lookup table on chip.

The best way of explaining how all this works is to illustrate these macros
and provide an example sketch that uses them. This example is specific to the
ATmega328 but is easily generalized to other AVR architectures. Consider the
following blinking LED sketch (feel free to copy and paste it to the Arduino
IDE and compile/run it):

// Blinking LED sketch demonstration

#define    _LB5         0x05, 0x04, 0x03, PORTB, DDRB, PINB, 5	// table entry
#define    _D13         _LB5	// Arduino Pin 13 is an alias to bit 5 of PORTB

#define __PIN_O( o, d, i, O, D, I, B )	asm( "sbi " #d ", " #B )  // set DDR
#define __PIN_T( o, d, i, O, D, I, B )	asm( "sbi " #i ", " #B )  // toggle PORT

#define _PIN_CONFIG_OUT( P )	__PIN_O( P )	// equiv to pinMode(P,OUTPUT)
#define _PIN_TOGGLE(     P )	__PIN_T( P )	// no equivalent in Arduino

#define	   LED		_D13	// user alias to Arduino pin 13 called 'LED'

void setup() {
   _PIN_CONFIG_OUT( LED );	// equivalent to pinMode(13,OUTPUT)
}

void loop() {
   for (;;) {			// toggle loop
      _PIN_TOGGLE( LED );	// flip state of LED
      delay( 500 );
   }
}

‘#define _LB5’ is one row of the lookup table. This example shows only one
row, but the ‘A328_PINS.h’ file that I am attaching has table entries for all
I/O port pins (again, this is for the ATmega328 only). This macro expands to a
7-element comma-separated argument list consisting of the address of the port B
output register (0x05), the address of the port B data direction register (0x04),
the address of the port B input register (0x03), the assembler names for these
registers (PORTB, DDRB, and PINB), and the bit position (5) within this port.

‘#define _D13’ defines an alias name for table entry ‘_LB5’. Table entries do
not need to be referenced directly. As will be seen, the macros will work just
fine with as many aliased names to table entries as needed.

‘#define __PIN_O( o, d, i, O, D, I, B )’ is the low level macro that extracts
the needed information from a table row and generates the machine instruction
that sets the data direction register appropriately. Note that even though 7
arguments are provided, the macro references only 2 of them (d and B).

‘#define __PIN_T( o, d, i, O, D, I, B )’ is the low level macro that extracts
the needed information from a table row and generates the machine instruction
that toggles/flips the state of the output Port. Note that this functionality
is not accessible from the Arduino environment even though the architecture
supports it.

‘#define _PIN_CONFIG_OUT( P )’ is a user level wrapper macro around the low
level __PIN_O macro. It controls the order of expansion: the argument P is
fully expanded into its 7 table fields before __PIN_O is invoked.

‘#define _PIN_TOGGLE( P )’ is a user level wrapper macro around the low
level __PIN_T macro. It controls the order of expansion in the same way: P is
fully expanded into its 7 table fields before __PIN_T is invoked.

‘#define LED’ is a user level declaration that creates an alias to Arduino pin
13 and calls it ‘LED’.

We will now trace the expansion of the macro call _PIN_CONFIG_OUT(LED) in the
function setup(). The expansion follows the rules of the compiler
preprocessor:

  1. _PIN_CONFIG_OUT(LED) expands to:
  2. __PIN_O(LED) but LED is itself a macro so it is expanded to:
  3. __PIN_O(_D13) but _D13 is itself a macro so it is expanded to:
  4. __PIN_O(_LB5) but _LB5 is itself a macro so it is expanded to:
  5. __PIN_O(0x05, 0x04, 0x03, PORTB, DDRB, PINB, 5) which expands to:
  6. asm( "sbi " #0x04 ", " #5 ) which expands to:
  7. asm( "sbi " "0x04" ", " "5" ) which finally expands to:
  8. asm( "sbi 0x04, 5" ) which generates one atomic instruction to set
    the data direction register appropriately
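The trace above can be checked on a PC by letting the preprocessor build the instruction text as an ordinary string instead of passing it to asm(). This is only a sketch: the `DEMO_` names are hypothetical stand-ins for the macros in the sketch (renamed because identifiers that start with an underscore plus a capital letter, or contain a double underscore, are reserved in C and C++), and the register-name columns are left out.

```cpp
#include <string>

// Hypothetical stand-ins for _LB5, _D13 and LED from the sketch above. The
// register-name columns are omitted because only addresses and bit are used here.
#define DEMO_ROW_B5  0x05, 0x04, 0x03, 5
#define DEMO_D13     DEMO_ROW_B5
#define DEMO_LED     DEMO_D13

// Low level macro: picks the DDR address and bit number and stringifies them.
#define DEMO_DDR_TEXT_(o, d, i, B)  "sbi " #d ", " #B
// Wrapper: P is fully expanded to four arguments before DEMO_DDR_TEXT_ runs.
#define DEMO_DDR_TEXT(P)            DEMO_DDR_TEXT_(P)

// Returns the instruction text that asm() would receive.
std::string ddr_instruction_text() {
    return DEMO_DDR_TEXT(DEMO_LED);
}
```

Calling `DEMO_DDR_TEXT_(DEMO_LED)` directly would fail to compile, because the low level macro would see one argument instead of four. That is why the wrapper layer exists.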

The expansion of the macro _PIN_TOGGLE(LED) in loop() follows the same process:

  1. _PIN_TOGGLE(LED) expands to:
    .
    .
    .
  2. asm( "sbi 0x03, 5" ) which generates one atomic instruction to flip
    the bit of the output port register connected to the LED
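The toggle works because, on the ATmega328, writing a 1 to a bit of the PINx register flips the matching bit of PORTx. A minimal host-side model of that behavior follows; `PortModel` and `write_pin_register` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical model of one I/O port: only the output latch is tracked.
struct PortModel {
    uint8_t port = 0;
};

// Models the hardware rule the macro relies on: each bit written as 1 to PINx
// flips the same bit of PORTx; bits written as 0 are left unchanged.
void write_pin_register(PortModel &p, uint8_t value) {
    p.port ^= value;
}

// Models "sbi PINx, bit": a write that has exactly one bit set.
void sbi_on_pin_register(PortModel &p, uint8_t bit) {
    write_pin_register(p, static_cast<uint8_t>(1u << bit));
}
```

Two calls to `sbi_on_pin_register(b, 5)` in a row leave the modeled port where it started, which is the blink behavior in loop().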

That’s it!! These well crafted macros give you:

  1. Look-up tables that take zero chip resources.
  2. Zero run time overhead.
  3. In most cases, expansion to a single atomic machine instruction.
  4. Maximum I/O speed!
  5. User level macros like _PIN_CONFIG_OUT() that are simple to read and
    look to a user no different than a procedure call.
  6. Macros that can co-exist with the Arduino environment functions pinMode(),
    digitalRead(), and digitalWrite() but could replace them if the Arduino
    development team chooses.
  7. Custom lookup macro tables for specific AVR architectures that are simple
    to generate.
  8. A macro table technique that lends itself to other uses as well: in some
    cases, it can replace a lot of conditional directives like the following
    with macro table lookups:

#ifdef something
.
.
#else
.
.
#endif
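As a sketch of what item 8 could look like, the following selects per-board settings from a macro table with token pasting instead of an #ifdef chain. The board names, fields, and values are hypothetical and chosen only to show the mechanism.

```cpp
// One table row per board: LED pin number, clock in MHz (illustrative values).
#define BOARD_ROW_UNO    13, 16
#define BOARD_ROW_OTHER   2,  8

// The only line a user would change to retarget the build.
#define BOARD UNO

// Two-step paste so that BOARD is expanded to UNO before ## is applied.
#define ROW_FOR_(b)  BOARD_ROW_##b
#define ROW_FOR(b)   ROW_FOR_(b)

// Field extractors; the wrapper expands the row into two arguments first.
#define FIELD_LED_(led, mhz)  led
#define FIELD_MHZ_(led, mhz)  mhz
#define FIELD_LED(row)        FIELD_LED_(row)
#define FIELD_MHZ(row)        FIELD_MHZ_(row)

constexpr int kLedPin   = FIELD_LED(ROW_FOR(BOARD));
constexpr int kClockMhz = FIELD_MHZ(ROW_FOR(BOARD));
```

Adding a board means adding one row rather than another #ifdef branch at every place a setting is used.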

For comparison purposes, if the ‘delay(500)’ call is commented out in the
example sketch, the _PIN_TOGGLE loop compiles to just 2 machine
instructions that take a total of 4 cycles to execute. The loop has to execute
twice for each full symmetrical output square wave cycle (8 cycles), so at a
16 MHz clock rate the LED pin will toggle at a 2 MHz rate, which is about 20X
higher than what can be achieved with digitalWrite() calls!!
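The arithmetic behind the 2 MHz figure, written out as a small helper. The function name is hypothetical; the inputs are the numbers from the paragraph above.

```cpp
#include <cstdint>

// One full square wave period needs two toggles, so the output frequency is
// the clock divided by twice the number of cycles spent per toggle.
constexpr uint32_t square_wave_hz(uint32_t clock_hz, uint32_t cycles_per_toggle) {
    return clock_hz / (2 * cycles_per_toggle);
}

// 16 MHz clock, 4 cycles per loop pass (2 for sbi, 2 for the jump back).
constexpr uint32_t kToggleLoopHz = square_wave_hz(16000000u, 4);
```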

Feel free to copy the attached A328_PINS.h file to your library directory.

It will be interesting to discover other uses for this macro technique.

I welcome your questions and/or comments.

ENJOY!!

A328_PINS.h (4 KB)


The Arduino function bitSet() is actually a macro, and the compiler turns bitSet() and bitClear() into sbi and cbi.
http://arduino.cc/en/Reference/bitSet

The sbi and cbi in a 16MHz Arduino are 125 ns.
And the digitalRead() and digitalWrite() are between 3 and 4 us.

You must use your own pin definitions, like _D13, because you store all the information about the port in them.
That is very clever, but I don’t know how it will behave in the Arduino environment.
How would you set the output HIGH or LOW?
What if the pin number is a variable?

Peter,

First question:

I've dug into the Arduino internals and I have come across the macros sbi() & cbi(). Although they are functionally equivalent to the sbi/cbi machine instructions, they do not actually expand to those instructions. Instead they are defined similar to

#define sbi(r,b) r |= _BV(b)

#define cbi(r,b) r &= ~_BV(b)

The result is that the compiler doesn't actually generate real sbi/cbi instructions to implement this macro!
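For reference, this is what those macro definitions do at the C level, shown on an ordinary variable so it runs on a PC. `DEMO_BV`, `DEMO_SBI`, and `DEMO_CBI` are hypothetical stand-ins for `_BV`, `sbi`, and `cbi`. The sketch only shows the read-modify-write semantics; which instructions the AVR compiler emits for a real I/O register is a separate question that the replies below address.

```cpp
#include <cstdint>

// Hypothetical stand-ins for _BV(), sbi() and cbi().
#define DEMO_BV(bit)        (1u << (bit))
#define DEMO_SBI(reg, bit)  ((reg) |= DEMO_BV(bit))
#define DEMO_CBI(reg, bit)  ((reg) &= ~DEMO_BV(bit))

// Returns reg with the given bit set; other bits are untouched.
uint8_t with_bit_set(uint8_t reg, uint8_t bit) {
    DEMO_SBI(reg, bit);
    return reg;
}

// Returns reg with the given bit cleared; other bits are untouched.
uint8_t with_bit_cleared(uint8_t reg, uint8_t bit) {
    DEMO_CBI(reg, bit);
    return reg;
}
```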

Second question:

Because my macros do text substitution instead of parameter passing as in a true function call, the macros cannot handle 'bare' Arduino pin numbers (e.g. _D13 needs to be used instead of 13). Personally, I don't think this would be a big deal, but there are some real Arduino functions that take pin numbers as formal parameters. This would present a problem, but not an insurmountable one.
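One possible way around the bare-number limitation, for literal pin numbers only, is token pasting: the literal 13 is glued onto a prefix to form the name of a table row. This is a sketch of an assumption on my part, not something in A328_PINS.h, and all `PINNUM_` names are hypothetical. As before, the instruction is built as a string so the result can be inspected on a PC.

```cpp
#include <string>

// Hypothetical table row: PORT address, DDR address, PIN address, bit number.
#define PINNUM_ROW_B5      0x05, 0x04, 0x03, 5
// One alias per Arduino pin number; only pin 13 is defined in this sketch.
#define PINNUM_ARDUINO_13  PINNUM_ROW_B5

// Glue a literal pin number onto the alias prefix.
#define PINNUM_ROW(n)      PINNUM_ARDUINO_##n

// Same low level / wrapper pair as in the original technique.
#define PINNUM_TOGGLE_TEXT_(o, d, i, B)  "sbi " #i ", " #B
#define PINNUM_TOGGLE_TEXT(P)            PINNUM_TOGGLE_TEXT_(P)
#define PINNUM_TOGGLE_TEXT_FOR(n)        PINNUM_TOGGLE_TEXT(PINNUM_ROW(n))

// Returns the instruction text for toggling Arduino pin 13.
std::string toggle_text_pin13() {
    return PINNUM_TOGGLE_TEXT_FOR(13);
}
```

This does not help when the pin number lives in a variable: a call such as `PINNUM_TOGGLE_TEXT_FOR(ledPin)` would paste to `PINNUM_ARDUINO_ledPin`, which names no table row.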

If I want to do direct bit manipulation, I just use assembler.

bobo1234:
I’ve dug into the Arduino internals and I have come across macros sbi() & cbi() and although they are functionally equivalent to sbi/cbi machine instructions they do not actually expand to those instructions.

Bullshit.

#define sbi(r,b)  r |= _BV(b)
#define cbi(r,b) r &= ~_BV(b)

void setup() 
{
  sbi( PORTB, 3 );
}

void loop() 
{
}
...
000000a6 <setup>:
  a6:	2b 9a       	sbi	0x05, 3	; 5
  a8:	08 95       	ret
...

Coding Badly,

I stand corrected.

Thanks for pointing this out. When I've looked at the compiler output in the past I've never seen it actually output the low level sbi/cbi instructions. Perhaps I was looking at Arduino 1.0. Anyway, it's really good to see that the compiler will generate them.

What does direct port manipulation yield?

PIND = 0b00000100; // toggle D2 by writing to the input register.

void setup() 
{
  PIND = 0b00000100; // toggle D2 by writing to the input register.
  PORTD = 0b00000100;
  DDRD = 0b00000100;
}

void loop() {
  // put your main code here, to run repeatedly:

}
000000a6 <setup>:
  a6:	84 e0       	ldi	r24, 0x04	; 4
  a8:	89 b9       	out	0x09, r24	; 9
  aa:	8b b9       	out	0x0b, r24	; 11
  ac:	8a b9       	out	0x0a, r24	; 10
  ae:	08 95       	ret

Load Immediate to get the 0b00000100 value into a register (r24) followed by an Out Port to write the value to an I/O port. Two machine instructions (or one if a register already has the value to be written).

Crossroads

Arduino 1.5.6-r2 compiled 'PIND = 0x04' to:

  a6:	84 e0       	ldi	r24, 0x04	; 4
  a8:	89 b9       	out	0x09, r24	; 9

So it looks like 4 bytes no matter what?

Crossroads,

Toggling an output bit takes just 2 bytes with the sbi instruction.

I think that in many cases a variable is used for a pin. Actually, the bigger my sketch, the more I use variables for pin numbers. Does that mean your code and the Arduino functions should be used together, and the Arduino functions cannot be replaced by your code?

Exactly how portable is your technique? Will the same sketch on an Arduino, a DUE, a Leonardo, and a Mega exhibit the same behavior?

I suspect less than 1% (or 0.1%) of Arduino applications need speed that is faster than digitalRead or digitalWrite.

If those functions are too slow then I think a user will benefit more by learning direct port manipulation than by using, without understanding, some rather esoteric macros. In my opinion the knowledge gained in learning direct port manipulation would be much more widely applicable and will provide a useful grounding in the internals of the Atmega processors and microprocessors in general.

Also, if a user discovers (as someone almost certainly will) that one of the macros does something unexpected what is s/he to do? Spend time trying to understand and fix/modify the macro? Or spend time trying to fix his/her own project?

...R

I think we all agree. They can not replace the Arduino functions just like that. But you can make a page in the Playground section and present it as some kind of "library" for fast pin output.

I have my own set of macros somewhere, but I have not used them for a while.

“I have my own set of macros somewhere”

That’s the thing I run into also - just forgetting stuff!
I use direct port manipulation all the time with SPI.transfer( ) to make that go fast.
I don’t trust that the SS pin is always where it should be, so instead of just toggling the bit with
PIND = 0b00000100; for instance,
I’ll make sure I know what it is with
PORTD = PORTD & 0b11111011; to clear the pin and
PORTD = PORTD | 0b00000100; to set the pin
I don’t code every day, so I haven’t memorized the other direct methods, and this way I know my code is in sync with the hardware I am using because I usually have a schematic of what I am working on.
So, faster than digitalWrite, maybe not the most extreme for speed, but the code is inline and it goes pretty quick.
I will use the PIND method when I toggle an output and don’t care what it was previously.
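The difference between the explicit set/clear masks and the toggle can be sketched on a plain byte. The function names are hypothetical, and the mask 0x04 is the 0b00000100 from the post above (bit 2 of port D).

```cpp
#include <cstdint>

constexpr uint8_t kSsMask = 0x04;  // 0b00000100, bit 2 of port D

// PORTD = PORTD & 0b11111011; the pin ends up low whatever it was before.
constexpr uint8_t clear_ss(uint8_t portd) {
    return static_cast<uint8_t>(portd & static_cast<uint8_t>(~kSsMask));
}

// PORTD = PORTD | 0b00000100; the pin ends up high whatever it was before.
constexpr uint8_t set_ss(uint8_t portd) {
    return static_cast<uint8_t>(portd | kSsMask);
}

// PIND = 0b00000100; the result depends on the previous state of the pin.
constexpr uint8_t toggle_ss(uint8_t portd) {
    return static_cast<uint8_t>(portd ^ kSsMask);
}
```

Set and clear give the same result no matter how often they are applied, which is why they keep the code in sync with the hardware; the toggle is only safe when the previous state does not matter.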

Have you seen how the proposed technique compares to the digital write fast library? That's optimised for the case where pin numbers are compile-time constants, which avoids a lot of the overhead in the digitalWrite/Read functions.

CrossRoads: I'll make sure I know what it is with PORTD = PORTD & 0b11111011; to clear the pin and PORTD = PORTD | 0b00000100; to set the pin

+1 for clarity and simplicity.

...R

@ bobo1234 I've explored this too, and the macros are not a good solution. They only make the I/O faster for pin numbers known at compile time. For pin numbers stored in variables it may be even slower than digitalRead/Write. This has been discussed for a long time here: https://code.google.com/p/arduino/issues/detail?id=140 and was probably worked out best in the Wiring implementation of digital I/O (http://wiring.org.co/)

I believe a better option is "encoding" the port address into the input parameter of the digitalRead/Write functions. This gives the same speed as the macros for compile-time constant pins (cbi/sbi instructions) but is also pretty fast for variable pins. Plus it is easily portable. I worked out this solution for the Arduino Uno and Mega here: http://www.codeproject.com/Articles/732646/Fast-digital-I-O-for-Arduino
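A rough host-side sketch of the encoding idea, under my own assumptions: the layout (bit mask in the high byte, register address in the low byte), the `io_space` array, and the function names are hypothetical and are not taken from the linked article.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the AVR I/O register file so the sketch can run on a PC.
static uint8_t io_space[0x40] = {};

// Hypothetical pin code: bit mask in the high byte, register address in the low byte.
constexpr uint16_t make_pin_code(uint8_t io_addr, uint8_t bit) {
    return static_cast<uint16_t>(((1u << bit) << 8) | io_addr);
}

// The pin code already carries the address and mask, so no table lookup is
// needed here, whether the code is a constant or comes from a variable.
inline void fast_write(uint16_t pin_code, bool high) {
    uint8_t addr = static_cast<uint8_t>(pin_code & 0xFF);
    uint8_t mask = static_cast<uint8_t>(pin_code >> 8);
    assert(addr < 0x40);  // stay inside the modeled register file
    if (high) {
        io_space[addr] |= mask;
    } else {
        io_space[addr] &= static_cast<uint8_t>(~mask);
    }
}
```

With a constant argument, a compiler on the real target has the chance to fold the whole call down to one bit instruction; with a variable argument the function still only needs a few operations.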

Just FYI: on ATmega2560 (Arduino Mega) some ports are outside the cbi/sbi range and the interrupts need to be disabled.
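The range limit can be expressed as a simple check. sbi/cbi can only reach the lowest 32 I/O registers, which appear at data-space addresses 0x20 through 0x3F. The function name is hypothetical, and the two example addresses are quoted from memory of the datasheets, so treat them as assumptions to verify.

```cpp
#include <cstdint>

// True if a register at this data-space address can be reached by sbi/cbi.
constexpr bool sbi_cbi_reachable(uint16_t data_addr) {
    return data_addr >= 0x20 && data_addr <= 0x3F;
}

// Example addresses (assumptions; check the datasheet before relying on them).
constexpr uint16_t kPortB_328  = 0x25;   // PORTB on the ATmega328 (I/O address 0x05)
constexpr uint16_t kPortH_2560 = 0x102;  // PORTH on the ATmega2560 (extended I/O)
```

For a register outside the range, a bit change becomes a multi-instruction read-modify-write, which is why interrupts have to be disabled around it.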