Fast DigitalWrite, made from Wire lib source

Hey Arduino Forum users,
I recently purchased an Arduino and was trying to vary the brightness of multiple LEDs using shiftOut. I soon found its execution time excruciatingly long. After a lot of research I found that the cause was the long execution time of digitalWrite.
I looked at direct port manipulation but could never get it to work. I then came across this link: http://urbanhonking.com/ideasfordozens/2009/05/18/an_tour_of_the_arduino_interna/
After reading through it multiple times, I broke digitalWrite down into setup, error checking, and port addressing.
I then put these into a class I call dWriteF.

#include <Wiring_digital.c>
// included this library mostly to include turnOffPWM function, since it's quite long
class dWriteF
{
public:
  dWriteF(uint8_t pin);       //constructor (no destructor, the compiler didn't like mine and didn't seem to care if one was given)
  void Write(bool val);        //contains all the code that needs to repeat to set a digital output after class is constructed
private:
uint8_t timer;                        // used in disabling PWM, not sure how this part works
uint8_t bitF;                           //declared as bitF instead of bit to avoid conflict, "bit" lights up like a keyword, so...
uint8_t port;                          //byte storing pin's port
volatile uint8_t *out;          // address of output register
byte error1;                             // Variable to store possible error if constructor is supplied an invalid pin number
};

dWriteF::dWriteF(uint8_t pin)
    {     timer = digitalPinToTimer(pin);
          bitF = digitalPinToBitMask(pin); 
          port = digitalPinToPort(pin);
          error1 = (port == NOT_A_PIN) ? 100 : 0; //sets error 100 for an invalid pin, 0 otherwise
	                                     // If the pin supports PWM output, we need to turn it off before doing a digital write.
          if (timer != NOT_ON_TIMER) turnOffPWM(timer);
          out = portOutputRegister(port);                                  //returns address of port register
              }

void dWriteF::Write(bool val)
    {    
      if (error1 == 100) return;                                           //EDIT,RETURNS just like library function
        uint8_t oldSREG = SREG;                                         //  unknown???
	  /*cli() was here, but it was related to interrupts???, so I removed it, looks like a constructor to me*/
        if (val == LOW) {
		*out &= ~bitF;                                                    // sends value using bitmask to register
	} else {                
		*out |= bitF;   
	}
        SREG = oldSREG;
          }


//end dWriteF declarations
//-----------------------------------------------------------------------------------------------------//

Note that the constructor dWriteF(uint8_t pin) cannot be called in loop; I forgot to try making it global, but here is my test code:

const byte LEDpin = 13;


void setup()
{
pinMode(LEDpin, OUTPUT);
Serial.begin(115200);
}

void loop(){
  dWriteF LED13 (LEDpin);
  int time1;
  int time2;

  
  time1 = micros();
  LED13.Write(LOW);
  time2 = micros();
  delay(500);
  LED13.Write(HIGH);
  delay(500);
  LED13.Write(LOW);
  LED13.Write(HIGH);

Serial.println(time1);

Serial.println(time2);   
           }

Am I doing this right?
I get an execution time of 4 microseconds. Is this ridiculous, or is it truly the execution time of dWriteF::Write()?

from serial monitor

68
72
17356
17360
-30568
-30564
-12928
-12924
4708
4712
22208
22212
-25720
-25716
-8080
-8076

:) Seems the int overflowed.
Thanks everyone. Just to make sure I’m not confusing anyone: are these numbers fooling me?
And I’d love feedback of the practicality of implementing this more often.

NOTE: I ran the same test on digitalWrite just a second ago; its delay was 8 micros, as opposed to dWriteF's 4 micros.
Also, if I can optimize this any more please let me know.

4 micros is not much better than the current digitalWrite() which is about 5-6us,
that is not going to get you much performance improvement.
To get the fastest i/o you need to do direct port i/o
That will set a pin state in 125ns vs say 4us which is a factor of 32X.
Also the shiftout code itself can be optimized so that things like the
direction check are only done once vs every time.
Have a look at the FastIO files in this LCD library:
https://bitbucket.org/fmalpartida/new-liquidcrystal/wiki/Home
You can grab just the FastIO files and use them with your code.
See the other code in that library for how to use it.
That should give you what you want/need.

This code does what I call "indirect port i/o" while not as fast as direct port
i/o it moves most of the overhead of the table look ups to be done up front
which removes it from the run time path.
It also allows the pins to be configured at run time vs being hardcoded at compile
time.

If you are ok with hard coded pins and values at compile time
have a look at the digitalwritefast library:
http://code.google.com/p/digitalwritefast/
or use my avrio routines:
http://code.google.com/p/mcu-io/

--- bill

I use fat16libs code for many apps. http://forum.arduino.cc/index.php?topic=150325.0

It is around 25 times faster than digitalWrite.

bperrybap:
To get the fastest i/o you need to do direct port i/o
That will set a pin state in 125ns vs say 4us which is a factor of 32X.

It's actually much better than that because with direct port manipulation you can set 6 (Uno) or 8 (Mega) bits at the same time.

(On the Uno some of the bits on each of the Ports is earmarked for other purposes).
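To illustrate the point about changing several pins in one write, here is a minimal sketch. It assumes a plain byte standing in for a port register such as PORTB so that it compiles anywhere, and `writeMasked` is a hypothetical helper name, not something from the Arduino core.

```cpp
#include <stdint.h>

// Hypothetical helper: update only the bits selected by `mask` with a single
// read-modify-write of the register, leaving all other bits untouched.
// On a real AVR `reg` would be something like &PORTB; here it is any byte.
inline void writeMasked(volatile uint8_t *reg, uint8_t mask, uint8_t value) {
  *reg = static_cast<uint8_t>((*reg & ~mask) | (value & mask));
}
```

On an Uno, digital pins 8 to 13 are bits 0 to 5 of PORTB, so a mask of 0x3F would cover those six pins. Like any read-modify-write, this is not interrupt safe unless interrupts are masked around it.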

...R

I suspect that the

i soon found its execution time excruciatingly long.

was some other problem in your code, as there are a number of errors in the code you did post. For example,
int time1; which should be unsigned long.

Mark
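A small sketch of why the serial output above shows negative numbers: on an 8-bit AVR an `int` is 16 bits wide, so only the low 16 bits of the micros() value survive and they are read as a signed number. The helper names `asAvrInt` and `elapsedMicros` are made up for this illustration, and fixed-width types are used so the behaviour can be reproduced on a PC.

```cpp
#include <stdint.h>

// What an AVR `int` ends up holding when it is assigned a micros() reading:
// the low 16 bits, reinterpreted as a signed value.
inline int16_t asAvrInt(uint32_t micros_value) {
  return static_cast<int16_t>(static_cast<uint16_t>(micros_value));
}

// Elapsed time computed with unsigned 32-bit arithmetic stays correct even
// if the counter wraps around between the two readings.
inline uint32_t elapsedMicros(uint32_t start, uint32_t end) {
  return end - start;
}
```

For example, any reading whose low 16 bits are 34968 prints as -30568 when stored in a 16-bit int, which matches the pattern in the output above. Declaring time1 and time2 as unsigned long and printing time2 - time1 avoids the problem.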

Thanks everyone, i am looking into these libraries.
Can anyone tell me more about this part of DigitalWrite?

if (val == LOW) {
		*out &= ~bit;
	} else {
		*out |= bit;
	}

By itself, or even broken down without the if statement, this segment of code takes 4 micros to run. If direct port manipulation takes place in 125ns, then there is another layer of abstraction here. Is this &= or |= logical statement overloaded? Is it run in a loop, once for each bit?
And I knew about the overflow of the int type; I expected it. I thought that the unsigned long type would take too long to write.
I only needed a few loops' worth of data once I removed the cli() call, because execution time never varied.

Now the code above seems to write a bitmask... but it must do more than that to not zero out other pins.

Tried the DigitalWriteFast library.
digitalWriteFast and pinModeFast2 gave "error: invalid type argument of 'unary *' ",
so I used pinModeFast and digitalWriteFast2. The performance of digitalWriteFast2 was 4 microseconds. Does anyone know how to make the faster version run on the 1.0.5 IDE?
Given this experience I will be looking into direct port manipulation more, though.
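On the question of how the bitmask write avoids zeroing the other pins: `&=` and `|=` here are the ordinary built-in byte operators, with no overloading and no per-bit loop. They read the whole register, change only the masked bit, and write the whole byte back. A minimal sketch, using an ordinary byte in place of the port register and the hypothetical helper names `clearBit` and `setBit`:

```cpp
#include <stdint.h>

// Clear one bit: AND with the inverted mask keeps every other bit as it was.
inline void clearBit(volatile uint8_t *reg, uint8_t mask) {
  *reg = static_cast<uint8_t>(*reg & ~mask);
}

// Set one bit: OR with the mask changes only the masked bit.
inline void setBit(volatile uint8_t *reg, uint8_t mask) {
  *reg = static_cast<uint8_t>(*reg | mask);
}
```

Starting from 0xB5, clearing mask 0x20 gives 0x95 and setting it again gives 0xB5; the other seven bits never change.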

CodeNewton:
Thanks everyone, i am looking into these libraries.
Can anyone tell me more about this part of DigitalWrite?

if (val == LOW) {
	*out &= ~bit;
} else {
	*out |= bit;
}



By itself, or even broken down without the if statement, this segment of code takes 4 micros to run. If direct port manipulation takes place in 125ns, then there is another layer of abstraction here. Is this &= or |= logical statement overloaded? Is it run in a loop, once for each bit?
And I knew about the overflow of the int type; I expected it. I thought that the unsigned long type would take too long to write.
I only needed a few loops' worth of data once I removed the cli() call, because execution time never varied.

Now the code above seems to write a bitmask... but it must do more than that to not zero out other pins.

digitalWrite() is in a C module; it is not doing anything but plain vanilla C.
What are you using to measure this specific code block?
Did you insert special code inside of digitalWrite() around that block of code?
digitalWrite() does a whole lot more than just those few lines and the majority of its
overhead is not that code.
I can guarantee you that specific block of code does not take 4us to run on a 16Mhz AVR.
If you want to measure small blocks of code like that you have to use special
h/w instrumentation along with some specialized code that can wiggle a pin using direct port
i/o to trigger something like a logic analyzer.
(then you have to subtract out the port i/o pin update time)

You can not use something like micros() to measure very short times
(less than 3-5us).

Here is the code along with the assembler that the compiler generates:

        if (val == LOW) {
 422:   66 23           and     r22, r22
 424:   21 f4           brne    .+8             ; 0x42e <digitalWrite+0x9e>
                *out &= ~bit;
 426:   8c 91           ld      r24, X
 428:   90 95           com     r25
 42a:   89 23           and     r24, r25
 42c:   02 c0           rjmp    .+4             ; 0x432 <digitalWrite+0xa2>
        } else {
                *out |= bit;
 42e:   8c 91           ld      r24, X
 430:   89 2b           or      r24, r25
 432:   8c 93           st      X, r24
        }

You can see the instructions and can look up the number of cycles for each
instruction in the AVR documentation.

Some are 1 and some are 2 cycles. But only about half the instructions
ever get executed. So let's say it is 5 instructions and 8 cycles; that
would be 8 * 62.5ns or 500ns, not 4us.
You are off by an order of magnitude.

— bill

Read this carefully: http://arduino.cc/en/Reference/Micros

Pay particular attention to this:

On 16 MHz Arduino boards (e.g. Duemilanove and Nano), this function has a resolution of four microseconds (i.e. the value returned is always a multiple of four).

There was relatively extensive discussion of the Arduino max pin toggle speed (digitalWrite, and alternatives) in this old thread: http://forum.arduino.cc/index.php?topic=4324.0
The exact details may have changed somewhat, but the basics should still be the same.

Note that most of the faster alternatives rely on both the pin number and the value being 'written' being constant. If they're not, you won't get nearly as much improvement (if any.) AFAIK, no one has implemented a variation that handles all the different cases of constant/variable pinnumber/value at the maximum theoretical speed: you either get "fast" if everything is constant, or "slow" (or "doesn't work") if anything is non-constant. The original arduino implementation actually isn't "awful", considering what it does.

I agree that fat16lib's variation is one of the nicest C++ style versions. I learned a lot just by trying to figure out what he had done...

I can guarantee you that specific block of code does not take 4us to run on a 16Mhz AVR.
If you want to measure small blocks of code like that you have to use special
h/w instrumentation along with some specialized code that can wiggle a pin using direct port
i/o to trigger something like a logic analyzer.
(then you have to subtract out the port i/o pin update time)

So you mean do it the hard way or use an oscilloscope... :)

You can not use something like micros() to measure very short times
(less than 3-5us).

Well, that's good to know. Thanks

You are off by an order of magnitude.

That's why I asked :) My library seemed a good bit faster. Even with the assignments to oldSREG and SREG before and after (add two instructions? total 100 ns a cycle), the total time is somewhere near 800ns, unless status register value preservation only runs once, in which case the time should be under 600 ns. Right?
(My class dWriteF calls a function which has only those bits of code plus the status register handling, so it should be about the same speed, right?)

Thanks for your help.

CodeNewton:
so you mean do it the hard way or use an oscilloscope... :)

You can still time it with s/w using micros() and get very close to the actual overhead,
you just can't time an individual block that small.
Write a loop, that loops say 50 or 100 times, time that, then
use the same loop code with your function call in it and time that,
then subtract the loop overhead, then divide by the number of loops.
You can get pretty close to the correct overhead.
Make sure you use the right sized integer to hold the micros()
return value.
I have used this technique quite often for timing things
and it is as accurate as when I used a logic analyzer.
I still use the logic analyzer more often since I use it to profile
running code when I'm doing extreme code optimizations.
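The subtract-and-divide step of this technique can be sketched as follows. `perCallNs` is a hypothetical helper name; its inputs are assumed to be differences of micros() readings held in unsigned long (32-bit) variables, one for the loop containing the call and one for the same loop with an empty body.

```cpp
#include <stdint.h>

// Per-call cost estimated from two timed runs of the same loop:
// one with the call inside it and one with an empty body.
// Inputs are in microseconds, the result is in nanoseconds per call.
inline uint32_t perCallNs(uint32_t loop_with_call_us,
                          uint32_t empty_loop_us,
                          uint32_t iterations) {
  // Guard against division by zero and against a negative difference.
  if (iterations == 0 || loop_with_call_us < empty_loop_us) return 0;
  return (loop_with_call_us - empty_loop_us) * 1000u / iterations;
}
```

For instance, 100 us for a 100-pass loop with the call and 20 us for the empty loop would work out to 800 ns per call. Because micros() only advances in steps of 4 us on a 16 MHz board, more iterations give a finer estimate, and the empty loop may need a volatile counter so the compiler does not remove it.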

My library seemed a good bit faster. Even with the assignments to oldSREG and SREG before and after (add two instructions? total 100 ns a cycle), the total time is somewhere near 800ns, unless status register value preservation only runs once, in which case the time should be under 600 ns. Right?
(My class dWriteF calls a function which has only those bits of code plus the status register handling, so it should be about the same speed, right?)

Thanks for your help.

I'm not sure why you thought that cli() was a constructor. It is just a function call.
(actually it is a function like macro, that masks interrupts).

Since the AVR is a RISC processor it cannot do operations like |= and &= atomically.
That is why interrupts were masked. If you want the code to be interrupt safe you must
mask interrupts. If you don't care about this, just keep in mind that your code will corrupt
port registers when other code that uses interrupts is also used to modify the same port register.
This will be regardless of whether that code uses your code or not because
the issue is related to your code not code used in the interrupt routine.
Several libraries like servo and IR library modify port registers in an ISR.
If you don't intend to ensure atomicity, you can remove the masking of interrupts,
and then there is also no point in saving the status register and then restoring it.
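The corruption described above can be reproduced on a PC with a simulated register. Everything in this sketch is a stand-in: `fakePort` plays the port register, `fakeIsr` plays an interrupt routine touching the same port, and the "interrupt" is just a function call placed between the load and the store.

```cpp
#include <stdint.h>

// Simulated port register (stands in for something like PORTB).
static volatile uint8_t fakePort = 0x00;

// Stand-in for an interrupt handler that sets bit 0 of the same port.
static void fakeIsr() { fakePort = static_cast<uint8_t>(fakePort | 0x01); }

// A non-atomic "set bit 5": load, modify and store as three separate steps.
// If interruptInTheMiddle is true the fake ISR runs between load and store,
// which is what can happen on real hardware when interrupts are not masked.
static void setBit5(bool interruptInTheMiddle) {
  uint8_t tmp = fakePort;                  // load
  if (interruptInTheMiddle) fakeIsr();     // ISR changes the port here
  tmp = static_cast<uint8_t>(tmp | 0x20);  // modify the stale copy
  fakePort = tmp;                          // store: overwrites the ISR's change
}
```

When the ISR runs before or after the whole sequence, both bits end up set (0x21). When it lands in the middle, its bit 0 is silently lost and the port ends up as 0x20.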

The code you have done is what I mentioned previously as "indirect port i/o"
it is similar to the code in the FIO routines I mentioned earlier.
It moves much of the overhead of the digitalWrite() API out of the run time path.
It is about as fast as you can get the code while still allowing the sketch to configure the pins, when
the code is implemented as a library.
The FIO routines will be a little faster as those routines are not actual function calls but macros
which are expanded inline so there is no function call overhead.
You might be able to do something similar using an inline function for Write() in your class.

You probably want to move the initialization stuff to where it is only done once like
in setup() or up into a global object that uses the constructor.

Another tip:
For the AVR, globals can often end up being faster than locals.
The reason being that the addresses of elements within the variable
can be calculated at compile time vs the runtime code having to fetch a memory
location then adding in the element offset.
It depends on the variable type and how it is used, but something to also keep in mind.

Also, if you are going to use this faster pin update library for a shiftout() function,
have a look at the shiftout() code in the FIO code.
That refactors the shiftOut() code to generate faster code.
Don't use the shiftOut() code that comes with the IDE as a model. It SUCKS! from an
implementation standpoint, in that the way it was written generates much
slower code because of the way the loops, shifts, and bit tests are done.
There are much better/faster ways to do it. The faster code will be MUCH faster
but will be slightly larger if you still need to support MSB vs LSB directions.
Even if you don't separate the MSB vs LSB loops, the code will be smaller
and faster than the IDE shiftOut() code; it just won't be as fast as when you
separate out the two loops.
The key is how the bit tests and shifts are done.
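As a rough illustration of separate loops and cheaper bit tests, here is a sketch. It is not the FIO code; `FastPin`, `shiftOutMsbFirst` and `shiftOutLsbFirst` are hypothetical names, and the pins are simulated with an ordinary byte so the sketch can be compiled off-target. The idea shown is that the bit order is chosen once, outside the loop, and the loop walks a one-bit mask instead of shifting by a variable amount on every pass.

```cpp
#include <stdint.h>

// Hypothetical pin handle: cached register pointer plus bit mask,
// the same "indirect port i/o" idea as the class in this thread.
struct FastPin {
  volatile uint8_t *reg;
  uint8_t mask;
  void high() const { *reg = static_cast<uint8_t>(*reg | mask); }
  void low() const { *reg = static_cast<uint8_t>(*reg & ~mask); }
};

// Shift one byte out, most significant bit first.
inline void shiftOutMsbFirst(const FastPin &data, const FastPin &clock, uint8_t value) {
  for (uint8_t m = 0x80; m != 0; m >>= 1) {  // mask walks from bit 7 down to bit 0
    if (value & m) data.high(); else data.low();
    clock.high();                            // pulse the clock for this bit
    clock.low();
  }
}

// Shift one byte out, least significant bit first.
inline void shiftOutLsbFirst(const FastPin &data, const FastPin &clock, uint8_t value) {
  for (uint8_t m = 0x01; m != 0; m <<= 1) {  // mask walks from bit 0 up to bit 7
    if (value & m) data.high(); else data.low();
    clock.high();
    clock.low();
  }
}
```

The caller picks one of the two functions once, so no direction test runs inside the loop. These pin updates are not interrupt safe as written.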

To get an idea if your code is faster, just time something like a new shiftout()
that uses it. (or maybe time 100 of them).
You should clearly see the difference, and will also see a difference if you implement
your shiftout() loop, bit tests and shifts, the way the FIO code did it.

--- bill

Thanks, i tested this code:

#include <Wiring_digital.c>
// included this library mostly to include turnOffPWM function, since it's quite long
class dWriteF
{
public:
  dWriteF(uint8_t pin);       //constructor (no destructor, the compiler didn't like mine and didn't seem to care if one was given)
 inline void Write(bool val);        //contains all the code that needs to repeat to set a digital output after class is constructed
private:
                       // used in disabling PWM, not sure how this part works
uint8_t bitF;                           //declared as bitF instead of bit to avoid conflict, "bit" lights up like a keyword, so...
                          //byte storing pin's port
volatile uint8_t *out;          // address of output register
byte error1;                             // Variable to store possible error if constructor is supplied an invalid pin number, 
};


dWriteF::dWriteF(uint8_t pin)
    {     uint8_t timer = digitalPinToTimer(pin);
          bitF = digitalPinToBitMask(pin); 
          uint8_t port = digitalPinToPort(pin);
          if (port == NOT_A_PIN) error1 = 100; //sets error 100
	                                     // If the pin that support PWM output, we need to turn it off before doing a digital write.
          if (timer != NOT_ON_TIMER) turnOffPWM(timer);
          out = portOutputRegister(port);                                  //returns address of port register
              }

void dWriteF::Write(bool val)
    {    
      if (error1 == 100) return;                                           //EDIT,RETURNS just like library function
                                                 //  unknown???
	  /*cli() was here, but it was related to interrupts???, so I removed it, looks like a constructor to me*/
        if (val == LOW) {
		*out &= ~bitF;                                                    // sends value using bitmask to register
	} else {                
		*out |= bitF;   
	}
    
          }


//end dWriteF declarations
//-----------------------------------------------------------------------------------------------------//
#define LED 13
void setup() {pinMode(LED, OUTPUT);
Serial.begin(115200);
}


dWriteF LED13 (LED);
unsigned long time1, time2;


void loop(){
time1 = micros();


  //1
LED13.Write(HIGH);
LED13.Write(LOW);     
//2
LED13.Write(HIGH);
LED13.Write(LOW);     
//3
LED13.Write(HIGH);
LED13.Write(LOW);     
//4
LED13.Write(HIGH);
LED13.Write(LOW);     
//5
LED13.Write(HIGH);
LED13.Write(LOW);     
////////////////1
  //1
LED13.Write(HIGH);
LED13.Write(LOW);     
//2
LED13.Write(HIGH);
LED13.Write(LOW);     
//3
LED13.Write(HIGH);
LED13.Write(LOW);     
//4
LED13.Write(HIGH);
LED13.Write(LOW);     
//5
LED13.Write(HIGH);
LED13.Write(LOW);     
////////////////1

time2 = micros();

   


Serial.println(time1);
Serial.println(time2);
delay(1000);

}

my serial output was

64
84
1000388
1000412
2001252
2001272
3002116
3002140
4002980
4003004
5003840
5003860
6004708
6004728
7005580
7005600
8006448
8006468
9007308
9007332
10008184
10008204
11009156
11009176
12010128

20 us total, 20 calls to dWriteF::Write (I made it inline void); this gives one us a call.
Note that the method Write showed the same speed (1 us a call) without being declared inline!

I also tested the code

 if(val == LOW){*out &= ~bitF;}else{*out |= bitF;}

with 40 iterations, not in a loop, which had a 16 us total execution time (400 ns each). From this,
1000 ns - 400 ns gives 600 ns for the function call. I guess inlines in Arduino are not what they seem to be? Or does this indicate the time to pass arguments to the function? Perhaps there is something slow about how private variables in classes are passed to members?
In that case I might declare some dedicated global variables with a macro???
Or it might be under my nose, of course…
In any case, a variable at a higher scope for a faster shiftOut would not be hard to do. I could literally inline it into the code and get great performance.

thanks all.

In your code you removed cli().

This is for larger processors ( Mega2560 ) which require two instructions to complete a pin change on registers outside the range of sbi/cbi.

If you have a compile time constant, you can have it omitted on smaller processors:

#ifdef __AVR_ATmega2560__
  const bool SafeMode = true;
#else
  const bool SafeMode = false;
#endif

//...
void dWriteF::Write(bool val){  
  
  if (error1 == 100) return;  //EDIT,RETURNS just like library function
  
  uint8_t oldSREG;
  
  if( SafeMode ){
    oldSREG = SREG;
    cli();
  }


  if (val == LOW) {
    *out &= ~bitF;      // sends value using bitmask to register
  } else {                
    *out |= bitF;   
  }
  
  if( SafeMode ) SREG = oldSREG;
}

When SafeMode is false, the if's and oldSREG do not get compiled in. You could go one step further and ensure the address (out) is above 0x5f before touching SREG, but that is probably only efficient with compile-time pin numbers.

Also, 'inline' is only a hint the compiler can ignore. Use the GCC specific to force it inline if needed:

__attribute__((always_inline))

pYro_65:
In your code you removed cli().

This is for larger processors ( Mega2560 ) which require two instructions to complete a pin change on registers outside the range of sbi/cbi.

That issue does not apply to this code as this code is fetching bit & port addresses
that are filled in at run time so sbi/cbi will never be used.

When using |= and &= that use bits and pointer addresses that are not known at compile time,
masking interrupts is always needed if you want to ensure atomicity.
The reason is the kludge in avr-gcc that converts |= and &= to sbi/cbi instructions
requires two things:

  • the address of pointer being used must be within range of sbi/cbi instructions
    (That is the issue you brought up with Mega2560)
  • The bit mask value is known at compile time

In the case of code like this or what is in digitalWrite() neither the port address or the bit
are known at compile time to the code using them. They are looked up runtime.
digitalWrite() looks them up every single time. This class code looks up the address and bit once
and saves them in private variables. Write() gets the bit & address from the private variables
in the class object when setting the pin state.
This means that sbi/cbi will never be used and since it takes multiple instructions
to perform |= and &=
it is not atomic unless you mask interrupts during the operation.

Other processors like the pic32 have set and clr registers.
This allows you to do atomic operations without having to mask interrupts
as well as set/clr multiple bits at the same time.

set/clr registers is a much better way to do i/o than bit set/bit clr instructions
because of the atomicity issues.
Trying to maintain atomicity on the AVR when the information is not known
at compile time explodes what must be done.
On the AVR you have to:
(once you have all your bits & addresses)

  • save the ISR state,
  • mask interrupts
  • read the port register into a tmp variable/register
  • or/and in the desired bit in to the tmp variable/register
  • write the tmp variable/register to the port register
  • restore the ISR state.

On a processor that has set/clr registers you have to:
(once you have all your bits & register addresses)

  • write to the set/clr register depending if setting or clearing the bit
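The six AVR steps listed above can be sketched like this. On an AVR build the real SREG and cli() from avr-libc are used; on any other platform the sketch falls back to simple stand-ins (an assumption made only so it compiles off-target). `atomicWrite` is a hypothetical name.

```cpp
#include <stdint.h>

#if defined(__AVR__)
#include <avr/io.h>
#include <avr/interrupt.h>
#else
// Host stand-ins, not real hardware: bit 7 of SREG is the global interrupt flag.
static volatile uint8_t SREG = 0x80;
static inline void cli() { SREG = static_cast<uint8_t>(SREG & 0x7F); }
#endif

// Interrupt-safe pin update using a cached register pointer and bit mask.
inline void atomicWrite(volatile uint8_t *out, uint8_t mask, bool val) {
  uint8_t oldSREG = SREG;  // 1. save the interrupt state
  cli();                   // 2. mask interrupts
  uint8_t tmp = *out;      // 3. read the port register into a temporary
  tmp = static_cast<uint8_t>(val ? (tmp | mask) : (tmp & ~mask));  // 4. or/and in the bit
  *out = tmp;              // 5. write the temporary back to the port register
  SREG = oldSREG;          // 6. restore the previous interrupt state
}
```

Restoring SREG rather than unconditionally re-enabling interrupts means the function leaves interrupts disabled if they were already disabled when it was called.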

--- bill

tested again with inline forcer

#include <Wiring_digital.c>
// included this library mostly to include turnOffPWM function, since it's quite long
#ifdef __AVR_ATmega2560__
  const bool SafeMode = true;
#else
  const bool SafeMode = false;
#endif


class dWriteF
{
public:
  dWriteF(uint8_t pin);       //constructor (no destructor, the compiler didn't like mine and didn't seem to care if one was given)
inline void Write(bool val); __attribute__((always_inline));       //contains all the code that needs to repeat to set a digital output after class is constructed
private:
                       // used in disabling PWM, not sure how this part works
uint8_t bitF;                           //declared as bitF instead of bit to avoid conflict, "bit" lights up like a keyword, so...
                          //byte storing pin's port
volatile uint8_t *out;          // address of output register
byte error1;   
// Variable to store possible error if constructor is supplied an invalid pin number, 
};


dWriteF::dWriteF(uint8_t pin)
    {     uint8_t timer = digitalPinToTimer(pin);
          bitF = digitalPinToBitMask(pin); 
          uint8_t port = digitalPinToPort(pin);
          if (port == NOT_A_PIN) error1 = 100; //sets error 100
	                                     // If the pin that support PWM output, we need to turn it off before doing a digital write.
          if (timer != NOT_ON_TIMER) turnOffPWM(timer);
          out = portOutputRegister(port); 
          //returns address of port register
              }

void dWriteF::Write(bool val){  
  
  if (error1 == 100) return;  //EDIT,RETURNS just like library function
  
  
  uint8_t oldSREG;
   
  if( SafeMode ){
     oldSREG = SREG;
    cli();
  }


  if (val == LOW) {
    *out &= ~bitF;      // sends value using bitmask to register
  } else {                
    *out |= bitF;   
  }
  
  if( SafeMode ) SREG = oldSREG;
}
 
 
 
 //-----------------------------------------------------------------------------------------
 
 #define LED 13
void setup() {pinMode(LED, OUTPUT);
Serial.begin(115200);
}


dWriteF LED13 (LED);
unsigned long time1, time2;


void loop(){
time1 = micros();


  //1
LED13.Write(HIGH);
LED13.Write(LOW);     
//2
LED13.Write(HIGH);
LED13.Write(LOW);     
//3
LED13.Write(HIGH);
LED13.Write(LOW);     
//4
LED13.Write(HIGH);
LED13.Write(LOW);     
//5
LED13.Write(HIGH);
LED13.Write(LOW);     
////////////////1
  //1
LED13.Write(HIGH);
LED13.Write(LOW);     
//2
LED13.Write(HIGH);
LED13.Write(LOW);     
//3
LED13.Write(HIGH);
LED13.Write(LOW);     
//4
LED13.Write(HIGH);
LED13.Write(LOW);     
//5
LED13.Write(HIGH);
LED13.Write(LOW);     
////////////////1

time2 = micros();

Serial.println(time1);
Serial.println(time2);
delay(1000);

}

giving this result on serial monitor

64
84
1000388
1000412
2001252
2001272
3002116
3002140
4002980
4003004
5003840
5003860
6004708
6004728
7005580
7005600
8006448
8006468
9007308
9007332
10008184
10008204
11009156
11009176

20 us and 20 calls.
- Still 1 us performance, no different than before inline or force inline.
I had to look up “force inline GCC” and an example showed to put it like so

inline void Write(bool val);__attribute__((always_inline));

Is this correct? I’m having a hard time believing that I'm getting 600 ns of overhead on an inline (manual inlines showed 400 ns, as opposed to 1000 ns).
NOTE: also added code for safe mode
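On the placement question, a minimal sketch of how the GCC attribute is normally attached to a member declaration. `InlineDemo` is a made-up class used only for illustration; the point is that there is no semicolon between the parameter list and `__attribute__`. With the extra semicolon the attribute appears to be parsed as a separate empty declaration and so would not apply to Write().

```cpp
#include <stdint.h>

class InlineDemo {  // hypothetical class, for illustration only
public:
  InlineDemo(volatile uint8_t *reg, uint8_t mask) : out(reg), bitF(mask) {}

  // The attribute is part of this declaration: one semicolon, at the very end.
  inline void Write(bool val) __attribute__((always_inline));

private:
  volatile uint8_t *out;  // address of the output register
  uint8_t bitF;           // bit mask for the pin
};

// The body must be visible at the point of the call (same file or header),
// otherwise the compiler has nothing to inline.
inline void InlineDemo::Write(bool val) {
  if (val) {
    *out = static_cast<uint8_t>(*out | bitF);
  } else {
    *out = static_cast<uint8_t>(*out & ~bitF);
  }
}
```

Whether inlining actually changes the timing is best confirmed by looking at the generated assembler.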

seeing i had a discrepancy, I looked up arduino and inlining.

  1. Using inline functions

About 8 byte program memory can be saved when functions are made inline. It depends on the number of times the function is called and on the size of the function, whether it makes sense. By using refactoring, code is split into small member functions, which is part of Extreme Programming. When these functions are inline, no extra code will be generated.
Note that member functions defined inside their class are implicit inline. At the Arduino, we can use explicit only at declarations of constructors.
Inline library functions must be placed in the header.

if this be true, has any of you run into this problem? I also saw something about setting inline parameters

would this work as a workaround?

westfw:
Note that most of the faster alternatives rely on on both the pin number and the value being 'written' being constant. If they're not, you won't get nearly as much improvement (if any.) AFAIK, no one has implemented a variation that handles all the different cases of constant/variable pinnumber/value at the maximum theoretical speed: you either get "fast" if everything is constant, or "slow" (or "doesn't work") if anything is non-constant. The original arduino implementation actually isn't "awful", considering what it does.

Given the AVR internal design limitations,
it is impossible to get the best/fastest i/o performance when things are not
known at compile time.

The best you can do is the indirect port i/o like the way the FIO code does it.
That gives you most of the performance of direct port i/o while still allowing
pins to be configured at run time.

If you want to configure the pins at runtime and get the best performance possible,
you pretty much have to do it the way the FIO code does it.

CodeNewton,
you are going to need to start looking at the actual assembler output.
The only way to truly optimize this type of stuff is to look at the actual code generated
by the compiler.
You are also going to have to decide if you want this code to be interrupt safe.
It isn't about which AVR; it is a decision of whether the code is interrupt safe or not.
Without being interrupt safe the code will be faster, but you will not be able to use it with
certain other libraries.

Any runtime tests in Write() cost time.
If you are going to have an error check then make the check based on 0/non-zero
it will be faster than comparing to an absolute value like 100
This is why I say you need to start looking at the assembler output
as you can see the effects of code changes on the generated code.
If it were me, I'd ditch the error code and check and simply make the out pointer
point to something safe, like maybe even bitF.
Given a bad pin # will make the code not work, who cares if "not working"
silently does nothing or corrupts bitF. Either one is safe and the latter
will be faster as there would be no check in Write()

--- bill

Here is the new definition of the class. I had to declare a bool error because of the order, but it is now as fast as the manual inline...

class dWriteF
{
public:
  dWriteF(uint8_t pin);       //constructor (no destructor, the compiler didn't like mine and didn't seem to care if one was given)
inline void Write(bool val);__attribute__((always_inline));       //contains all the code that needs to repeat to set a digital output after class is constructed // set to always inline
private:                    
uint8_t bitF;                   //declared as bitF instead of bit to avoid conflict, "bit" lights up like a keyword, so...    
volatile uint8_t *out;          // address of output register
bool error; 


};


dWriteF::dWriteF(uint8_t pin)
    {     uint8_t timer = digitalPinToTimer(pin);
          bitF = digitalPinToBitMask(pin); 
          uint8_t port = digitalPinToPort(pin);
          error = (port == NOT_A_PIN);   // always initialise the flag, also for valid pins
	                                     // If the pin that support PWM output, we need to turn it off before doing a digital write.
          if (timer != NOT_ON_TIMER) turnOffPWM(timer);
          out = portOutputRegister(port); 
          //returns address of port register
          if (error != 0)out = &bitF;
              }

 void dWriteF::Write(bool val){  
  
  uint8_t oldSREG;
   
  if( SafeMode ){
     oldSREG = SREG;
    cli();
  }


  if (val == LOW) {
    *out &= ~bitF;      // sends value using bitmask to register
  } else {                
    *out |= bitF;   
  }
  
  if( SafeMode ) SREG = oldSREG;
}

serial Monitor

64
96
1000396
1000432
2001268
2001300
3002148
3002180
4003024
4003060

I do notice a variation here, but it looks like about 32 us

800ns per call (40 calls)
about 7.5 times as fast as digitalWrite