One issue with depending on the avr-gcc sbi/cbi optimization
that some folks may not be aware of is that
*reg |= mask;
*reg &= ~mask;
is not guaranteed to generate an sbi/cbi instruction.
Here is a link that talks about the issue in detail.
http://forum.arduino.cc/index.php?topic=211415.msg1553690#msg1553690
The summary is that on some processors the avr-gcc optimization hack
that converts |= and &= to sbi/cbi instructions fails because the register's
address is too large. In those cases the resulting code does not
update the register atomically and can cause register corruption
if the same port register is used in foreground and ISR routines.
The net result is that optimization silently fails for some of the AVR registers,
so if you are depending on it, you have to be very careful.
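
To make the address limit concrete: sbi/cbi can only reach I/O addresses 0x00 to 0x1F, which on the classic (non-xmega) AVR parts sit at data-space addresses 0x20 to 0x3F. Here is a small sketch that checks whether a register falls inside that window. The helper name sbiCbiReachable() is made up for illustration and is not part of avr-libc or the Arduino core. Being in range is necessary but not sufficient, since the compiler also has to know the address and the single bit at compile time.

```cpp
#include <stdint.h>

// Data-space window that sbi/cbi can reach on classic AVR parts:
// I/O addresses 0x00..0x1F, mapped at data addresses 0x20..0x3F.
static const uint16_t SBI_CBI_FIRST = 0x20;
static const uint16_t SBI_CBI_LAST  = 0x3F;

// Hypothetical helper (illustration only): true when a register at this
// data-space address is within reach of a single sbi/cbi instruction.
static bool sbiCbiReachable(uint16_t dataAddr)
{
    return dataAddr >= SBI_CBI_FIRST && dataAddr <= SBI_CBI_LAST;
}
```

For example, PORTB on the ATmega328P lives at data address 0x25 and is reachable, while PORTH on the ATmega2560 lives at 0x102 and is not, so a |= on PORTH becomes a multi-instruction read-modify-write that an ISR can interrupt.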
As for "arduino-like" "fast" AVR bit i/o APIs,
none of the "fast" options I've seen so far solve the next level problem,
which is multi pin i/o for things like byte operations.
I did an implementation that also provides multi bit i/o that I use in my openGLCD library.
It is licensed as GPL v3 code and can be found here in my mcu-io project:
http://code.google.com/p/mcu-io
See the avrio code and download.
It provides an arduino like interface that will crush down
even multiple bit i/o when possible.
It also allows you to specify pins using the AVR PORT and bit number
rather than arduino pin numbers.
(Arduino raw pin numbers can be used but that requires creating a pin mapping macro)
Right now it works for single pin and 8 pin i/o.
If there is an interest I could put in 4 pin support.
Another option, while not quite as fast, is to avoid using the digitalWrite()/digitalRead() interface
altogether and use indirect port i/o. This is portable across all processors and board types used on Arduino
and allows using raw Arduino pin numbers.
What this does is shift the run time penalty to only once during initialization rather than
on each and every single i/o.
To use this, the code fetches and saves the register pointers and bit masks up front using:
address:
reg = portOutputRegister(digitalPinToPort(pin));
mask:
mask = digitalPinToBitMask(pin);
You save them away and then later can do:
*reg |= mask;
*reg &= ~mask;
The restriction is that you must mask interrupts to ensure atomicity.
While not as fast as direct raw port i/o, it is much faster than the Arduino core code routines.
This method is quite effective for libraries and several out there are doing this.
They can get a substantial bump in performance and yet remain portable across boards & processors.
(Well it is currently broken on DUE, but that is an Arduino team issue in the DUE code,
I entered a bug report for it)
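
Here is a rough sketch of the indirect port i/o pattern described above, written so it can be compiled and run on a PC. The register and the interrupt flag are simulated stand-ins, and the names FastPin, fastPinHigh(), fastPinLow(), irqSaveAndDisable() and irqRestore() are hypothetical. On a real AVR board the pointer and mask would come from portOutputRegister()/digitalPinToBitMask(), and the save/restore would typically be done with SREG and cli().

```cpp
#include <stdint.h>

// --- Host-side stand-ins (assumptions, for illustration only) ---
// On AVR these would be a real PORTx register and the global interrupt
// flag, e.g.  uint8_t oldSREG = SREG; cli(); ... SREG = oldSREG;
static volatile uint8_t fakePort = 0;   // stands in for a PORTx register
static bool fakeIrqEnabled = true;      // stands in for the interrupt enable flag

static bool irqSaveAndDisable() { bool was = fakeIrqEnabled; fakeIrqEnabled = false; return was; }
static void irqRestore(bool was) { fakeIrqEnabled = was; }

// Register pointer and bit mask, looked up once during initialization.
struct FastPin {
    volatile uint8_t *reg;
    uint8_t mask;
};

// Set the pin: read-modify-write with interrupts masked around it.
static void fastPinHigh(const FastPin &p)
{
    bool was = irqSaveAndDisable();
    *p.reg |= p.mask;
    irqRestore(was);   // restore the previous state, do not blindly re-enable
}

// Clear the pin, same protection.
static void fastPinLow(const FastPin &p)
{
    bool was = irqSaveAndDisable();
    *p.reg &= (uint8_t)~p.mask;
    irqRestore(was);
}
```

The point of restoring the saved state rather than unconditionally enabling interrupts is that the same routines can then be called from inside an ISR or other code that already has interrupts off.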
Another simpler alternative for single pin i/o is to just switch to using Paul's Teensy boards.
The teensy core code used when using one of his boards
will optimize automagically to use port i/o when possible
without having to do anything special to your code.
When using a Teensy board, the digitalWrite()/digitalRead() code is just magically much faster
if you use constants as the parameters.
There simply is no good excuse as to why the Arduino team hasn't updated
the standard Arduino AVR core code to provide faster i/o when there are alternatives
that are much faster and yet preserve 100% of the existing API.
What is really needed is to abandon the digitalWrite()/digitalRead() API and
define a new one.
One that uses a SET and CLR semantic rather than a set to a value semantic.
This would allow using the better hardware capabilities available in other non AVR
processors like the pic32.
As-is, the better hardware is eternally limited and dramatically slowed down
by having to maintain the existing Arduino API.
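
To illustrate what a SET/CLR style API could look like, here is a small host-side sketch. It models a port with dedicated set and clear registers, loosely patterned after the LATxSET/LATxCLR registers on the pic32. SimPort, pinSet() and pinClr() are hypothetical names and not an existing Arduino or pic32 API. On real hardware each call would boil down to a single store, with no read-modify-write and no need to mask interrupts.

```cpp
#include <stdint.h>

// Host-side model of a port with separate SET and CLR registers.
// On hardware with such registers, writing a 1 bit to SET or CLR changes
// only that pin; the |= and &= below just simulate the effect.
struct SimPort {
    uint32_t lat;                                     // simulated output latch
    void writeSet(uint32_t bits) { lat |= bits; }     // models a store to a SET register
    void writeClr(uint32_t bits) { lat &= ~bits; }    // models a store to a CLR register
};

// Hypothetical SET/CLR style calls: there is no value parameter,
// so there is no run time test of HIGH vs LOW as in digitalWrite(pin, val).
static inline void pinSet(SimPort &port, uint8_t bit) { port.writeSet((uint32_t)1 << bit); }
static inline void pinClr(SimPort &port, uint8_t bit) { port.writeClr((uint32_t)1 << bit); }
```

With a set to a value API the library has to look at the value and pick one of two paths on every call, whereas here the caller has already made that choice by picking pinSet() or pinClr().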
--- bill