How might I bit bang this faster?

I have a function which bit bangs an SPI-like protocol, it currently takes 8us to complete a clocking cycle of "rise,high,drop,low,rise" on the CLK_pin_set pin. I have verified this with an oscilloscope.

I want to get it to run faster, though. Commenting out the delayMicroseconds() calls completely didn't change the speed.

uint8_t shiftBitMask=0b10000000;
  for(uint8_t i=0; i<TotalSendLength;i++){//iterates over bytes in message
    shiftBitMask=0b10000000;
    for(uint8_t j=0; j<8; j++){//iterates over bits in byte
      FastDWrite(MOSI_pin_set,(shiftBitMask & SendingArray[i])!=0 ); //write MOSI high (1) if relevant bit is high, write it low otherwise
      //delayMicroseconds(1);//this delay might be un-necessary
      FastDWrite(CLK_pin_set,1);//slave will read MOSI at this instant, and start setting MISO just after this instant
      //delayMicroseconds(us_delay);//adjust length for minimum to still be reliable, gives slave time to switch MISO to correct state to reply
      if(FastDRead(MISO_pin_set)){//if MISO reads high
        RecdInternal[i]=RecdInternal[i] | (shiftBitMask);
      }else{
        RecdInternal[i]=RecdInternal[i] & (~shiftBitMask);//anded with not of shiftBitMask
      }
      FastDWrite(CLK_pin_set,0);
      shiftBitMask=shiftBitMask>>1;//shift to the next less significant bit
    }
  }

Functions used:

struct PinInfo {
  volatile uint8_t *reg;
  uint8_t mask;
};

const PinInfo unoPins[] = {
  {&PORTD, (uint8_t) 0b11111110},
  {&PORTD, (uint8_t) 0b11111101},
  {&PORTD, (uint8_t) 0b11111011},
  {&PORTD, (uint8_t) 0b11110111},
  {&PORTD, (uint8_t) 0b11101111},
 //and so on to save space in this forum post
};

void FastDWrite(uint8_t Pin,uint8_t Val){
  if (Val == 0) {
    *(unoPins[Pin].reg) &= unoPins[Pin].mask;
  } else {
    *(unoPins[Pin].reg) |= ~unoPins[Pin].mask;
  }
}

uint8_t FastDRead(uint8_t Pin){
  switch (Pin){
    case 0:
      return (PIND & _BV (0)) != 0; // digitalRead (0); note != 0, so a high pin reads as true
    case 1:
      return (PIND & _BV (1)) != 0; // digitalRead (1);
//and so on to save space in this forum post
  }
  return 0;//default for unhandled pins, avoids an undefined return value
}

I have already done tests which show that:

while(1){
FastDWrite(4,1);
FastDWrite(4,0);
}

can run at 2.6MHz, so how might I get my more complex bit banging up to a higher speed? For various reasons I don't want to use the hardware SPI peripheral for this: it isn't quite SPI, and those pins are already in use for other things. I'm interested in understanding how to make this bit banging faster, if at all possible.

I had a go at looking at the disassembled .elf of my code, but it was tricky to find which part of the assembly corresponds to this section of code (only some functions and variables seem to get labelled there).

But with my fast digital read and write calls taking about 100ns each (measured by seeing how much adding them slowed that while(1) example), I struggle to see why each pass of the j=0; j<8 loop should take a whole 80 times that 100ns (around 130 ATmega328P clock cycles, in other terms).
Thank you

A call and return in 100ns at 16MHz clock speed? Not likely. More likely you ran into compiler optimization differences.

Is that just a pulse?

You could try some things. Often bit banged stuff is very literal, tuned for speed not ease of maintenance or communication with other humans.

Lose the array references by working with a copy. It is conceivable that the optimiser might do that for you.

Lose the function calls to fastWrite and fastRead and replace them with direct manipulation at the exact point of those function calls in your code. You are only talking about two pins.

You might be able to write some macros that take care of some of the details, but they should expand to close-to-the-metal operations.

You've got room in the program memory and a text editor, so you might consider unrolling the inner loop and doing everything more literally.

Don't do that until you're sure of your algorithm, as doing that, or really any of the things I've said, turns fixing, enhancing or otherwise modifying the code into a nightmare.

In general, generalized things take longer. They are easier to write and read as code goes, but we are going for speed not beauty.

a7

Move to a faster processor with more capable timer/counter peripherals?


During the tests of the while(1) loop I tried that and used the underlying port manipulation directly, but there was no speed difference, still 2.6MHz in either case. I'll try the change in the more complex situation, but if the function calls were the same speed as direct port changes in the basic circumstance, won't they be here too?

See my comment. Your method of estimation is flawed. Unless you compare assembly listings, you can't be sure.

Some people will argue with me, but you can use assembly language and do whatever you like. Then you are really only limited by the MCU capability.

Having said that, the compiler is a top notch assembly language writer...

What is MOSI_pin_set?

In order to get 2.6MHz, the pin and bit value both have to be constants, so that the compiler can (hopefully) emit an SBI or CBI instruction. In your SPI-like code, the bit value is certainly not a constant, and perhaps the port isn't either, so it's slower.

Without compilable code, it's hard to say whether it could be made faster.

FWIW, this isn't a NeoPixel driver, is it? AKA WS2812B?

8 MHz

void setup() 
{
  pinMode(13,OUTPUT);
  noInterrupts();
}

void loop() 
{
  while(true)
  {
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
    PINB = 1<<5;
  }
}

A general rule for performance is to not do runtime lookups for things that are not changing and are known at compile time.

If you want speed, you should do as much as you can up front (at compile time is ideal), or at least up front, outside the realtime runtime path.

In your current code, you are calling FastDWrite() to do pin i/o.
While it is doing raw register accesses, it is paying a runtime penalty to look up the register address and bit mask every time, even though they never change. Array lookups and structure member references are painful on the AVR, given its limited instruction set and limited address indexing modes. This is particularly true if the structure is not a global variable.
You should do as much as you can up front,
i.e. look up the register address and its bit mask once.
Likely you can do this one time and save it away, since the pins are not ever changing.
For example, you could do the lookups, save the results, and pass those into a function or macro instead of calling a function and passing in a pin number.
In fact the Arduino core has functions & macros to do the pin to register address and bitmask lookups, so you don't have to create your own.
I have done this in a few libraries.
I call it "indirect port i/o". While it isn't quite as fast as direct port i/o (which only works with hard coded constants), it is nearly as fast but allows a sketch to define the pins.

Also, another thing to keep in mind is that
when you use constants, the compiler optimizer can do some pretty intense optimizations.
This is likely why you saw such better speed when you called FastDWrite() with constants.
The optimizer may have inlined the function and removed the array lookups, since it knows that the values and the lookup table are not changing; it can look up the values at compile time and generate code to use them, rather than generating code to look them up from the table at runtime.

Also, if you have to use variables, make them global.
The AVR has limited addressing and indexing modes, and when a variable is global its address is known at compile time, which lets the compiler generate smaller/faster code.

Are you using linux?
If so I can give you a script that you can install into your IDE that will automatically generate listing and symbol table files each time you do a build.

--- bill

Ok, so it looks like I've halved the time things take by making MOSI_pin_set and the other pin numbers into const uint8_t rather than plain uint8_t.

I tried static inline before the FastDWrite and FastDRead function definitions, but that changed nothing.

I stopped using the arrays inside the inner for loop and took local copies; that also contributed to the time halving, and making them local rather than global improved that further.

I turned off interrupts for the duration of the sending, which prevented the Arduino's millis clock from adding occasional randomly timed extra delays between actions.

I've currently got this:

uint8_t oldSREG = SREG;//optionally disable interrupts during send
  cli();//optionally disable interrupts during send
  for(uint8_t i=0; i<TotalSendLength;i++){//iterates over bytes in message
    uint8_t shiftBitMask=0b10000000;//defined locally for speed purposes
    uint8_t ByteInUseSend=SendingArray[i];
    uint8_t ByteInUseRec=0;
    for(uint8_t j=0; j<8; j++){//iterates over bits in byte
      FastDWrite(MOSI_pin_set,((shiftBitMask & ByteInUseSend)!=0) ); //write MOSI high (1) if relevant bit is high, write it low otherwise
      
      FastDWrite(CLK_pin_set,1);//slave will read MOSI at this instant, and start setting MISO just after this instant
      
      FastDWrite(CLK_pin_set,0);
      if(FastDRead(MISO_pin_set)){//if MISO reads high
        ByteInUseRec |= shiftBitMask;//set the relevant received bit
      }else{
        ByteInUseRec &= ~shiftBitMask;//clear it otherwise
      }
      shiftBitMask=shiftBitMask>>1;//shift to the next less significant bit
    }
    RecdInternal[i]=ByteInUseRec;
  }
  SREG = oldSREG;//restores old interrupts after optionally disabling interrupts during send

Is there anything I could do to speed up further, or is this the plausible limit?

Oddly it seems the low part of the clock cycle is longer when MOSI is low or falling than when it is high or rising, does that offer any clues?

Can a lower-clock-cycle expression be used for any of the following actions:
ByteInUseRec=ByteInUseRec | (shiftBitMask);
ByteInUseRec=ByteInUseRec & (~shiftBitMask);
shiftBitMask=shiftBitMask>>1;

Thanks

EDIT:

found one good thing

if(shiftBitMask & ByteInUseSend){
  FastDWrite(MOSI_pin_set, 1);
}else{
  FastDWrite(MOSI_pin_set, 0);
}

seems a faster replacement for

FastDWrite(MOSI_pin_set,((shiftBitMask & ByteInUseSend)!=0) );

Made one other edit; I think I might now have optimised as far as things can go:

Replaced the lookup based FastDWrite with this

static inline uint8_t FastDWrite(uint8_t Pin,uint8_t Val){
  if (Val == 0) {
    switch (Pin){
      case 0:  PORTD &= ~_BV(0); break; // digitalWrite (0, LOW);
      case 1:  PORTD &= ~_BV(1); break; // digitalWrite (1, LOW);
      case 2:  PORTD &= ~_BV(2); break; // digitalWrite (2, LOW);
      case 3:  PORTD &= ~_BV(3); break; // digitalWrite (3, LOW);
      case 4:  PORTD &= ~_BV(4); break; // digitalWrite (4, LOW);
      case 5:  PORTD &= ~_BV(5); break; // digitalWrite (5, LOW);
      case 6:  PORTD &= ~_BV(6); break; // digitalWrite (6, LOW);
      case 7:  PORTD &= ~_BV(7); break; // digitalWrite (7, LOW);
      case 8:  PORTB &= ~_BV(0); break; // digitalWrite (8, LOW);
      case 9:  PORTB &= ~_BV(1); break; // digitalWrite (9, LOW);
      case 10: PORTB &= ~_BV(2); break; // digitalWrite (10, LOW);
      case 11: PORTB &= ~_BV(3); break; // digitalWrite (11, LOW);
      case 12: PORTB &= ~_BV(4); break; // digitalWrite (12, LOW);
      case 13: PORTB &= ~_BV(5); break; // digitalWrite (13, LOW);
      case 14: PORTC &= ~_BV(0); break; // digitalWrite (A0, LOW);
      case 15: PORTC &= ~_BV(1); break; // digitalWrite (A1, LOW);
      case 16: PORTC &= ~_BV(2); break; // digitalWrite (A2, LOW);
      case 17: PORTC &= ~_BV(3); break; // digitalWrite (A3, LOW);
      case 18: PORTC &= ~_BV(4); break; // digitalWrite (A4, LOW);
      case 19: PORTC &= ~_BV(5); break; // digitalWrite (A5, LOW);
    }
  }else{
    switch (Pin){
      case 0:  PORTD |= _BV(0); break; // digitalWrite (0, HIGH);
      case 1:  PORTD |= _BV(1); break; // digitalWrite (1, HIGH);
      case 2:  PORTD |= _BV(2); break; // digitalWrite (2, HIGH);
      case 3:  PORTD |= _BV(3); break; // digitalWrite (3, HIGH);
      case 4:  PORTD |= _BV(4); break; // digitalWrite (4, HIGH);
      case 5:  PORTD |= _BV(5); break; // digitalWrite (5, HIGH);
      case 6:  PORTD |= _BV(6); break; // digitalWrite (6, HIGH);
      case 7:  PORTD |= _BV(7); break; // digitalWrite (7, HIGH);
      case 8:  PORTB |= _BV(0); break; // digitalWrite (8, HIGH);
      case 9:  PORTB |= _BV(1); break; // digitalWrite (9, HIGH);
      case 10: PORTB |= _BV(2); break; // digitalWrite (10, HIGH);
      case 11: PORTB |= _BV(3); break; // digitalWrite (11, HIGH);
      case 12: PORTB |= _BV(4); break; // digitalWrite (12, HIGH);
      case 13: PORTB |= _BV(5); break; // digitalWrite (13, HIGH);
      case 14: PORTC |= _BV(0); break; // digitalWrite (A0, HIGH);
      case 15: PORTC |= _BV(1); break; // digitalWrite (A1, HIGH);
      case 16: PORTC |= _BV(2); break; // digitalWrite (A2, HIGH);
      case 17: PORTC |= _BV(3); break; // digitalWrite (A3, HIGH);
      case 18: PORTC |= _BV(4); break; // digitalWrite (A4, HIGH);
      case 19: PORTC |= _BV(5); break; // digitalWrite (A5, HIGH);
    }
  }
  return 0;
}

I'm getting 400ns clock high and just over 1us clock low now.

Do you need to generate pulses at will, or a continuous pulse train (rectangular wave)?

It's still a function that can handle any pin. Can you be sure the compiler, even with all the help you give it, is able to do as well as six macros, one each for setting and clearing the three pins you actually use?

Above my pay grade, but what I can glean from the heavies right here seems to suggest it is worth thinking about.

Have you looked at the compiler output?

a7

Years ago I did a very fancy macro based pin i/o package for an Arduino glcd library.
I called the AVR macro pin i/o package "avrio".
It reduced the code to as small and as fast as could be done within the AVR instruction set.

It was thousands of lines of macros in a single header file with many nested macros that did all the lookup work at compile time so that the final compiled code could be single bit set/clear instructions when the pin an value parameters were constants.

The API included function-like macros that mimicked the Arduino digital i/o library, i.e.

avrio_digitalWrite(pin, pinval)
avrio_digitalWritepin(pin, pinval)
avrio_digitalWrite8Pins(pin0, pin1, pin2, pin3, pin4, pin5, pin6, pin7, byteval)

which also supported multi pin writes like byte writes.
The macros would automagically boil this down, allowing the compiler and optimizer to generate the minimal number of AVR instructions:
a single 8 bit write, two nibble writes, or multiple bit sets/clears,
whatever would work with the fewest AVR instructions.
In order to get byte or nibble operations the pins used had to be adjacent,
but the macros figured all that out.

The macros could use the Arduino core macros to get pin and bit information which is what is used by the avrio macros at compile time.

There is lots and lots of macro magic going on, and nearly a year of work went into getting it all working. Lots of time was spent looking at compiler output and logic analyzer traces for timing for the openGLCD library.

While this was part of a glcd library (openGLCD), it isn't specific to that application.
At one point I created a separate project for avrio called mcu_io on google code.
Here is a link to what remains of that:
https://code.google.com/archive/p/mcu-io/
It is licensed GPL v3.0

And here is a link to the openGLCD library that uses avrio.h
https://bitbucket.org/bperrybap/openglcd/src/master/
avrio.h is down under include.

---- bill


This is mostly unnecessary in the post-LTO world. The compiler will happily optimize constant-offset array accesses to the Arduino pin table, and produce optimal code without the usual macro nonsense.

boolean NonConstantUsed( void ) __attribute__ (( error("") ));
void LTO_Not_Enabled(void) __attribute__ (( error("") ));

#define digitalWriteFast(pin, val) \
    if (__builtin_constant_p(pin)) {_dwfast(pin, val);} else { NonConstantUsed(); }

static inline __attribute__((always_inline)) void _dwfast(int pin, int val) {
    uint8_t mask = digital_pin_to_bit_mask_PGM[pin];
    uint8_t port = digital_pin_to_port_PGM[pin];
    volatile uint8_t *outport = (volatile uint8_t*)port_to_output_PGM[port];
    if (digital_pin_to_bit_mask_PGM[0] == 0x99)  // Never true
        LTO_Not_Enabled();
    if (val) {
        *outport |= mask;
    } else {
        *outport &= ~mask;
    }
}
#define digitalReadFast(pin)  \
    (__builtin_constant_p(pin) ? (!!_drfast(pin)) : NonConstantUsed())

static inline __attribute__((always_inline)) uint8_t _drfast(uint8_t pin) {
    uint8_t mask = digital_pin_to_bit_mask_PGM[pin];
    uint8_t port = digital_pin_to_port_PGM[pin];
    volatile uint8_t *inport = (volatile uint8_t*)port_to_input_PGM[port];
    if (digital_pin_to_bit_mask_PGM[0] == 0x99)  // Never true
        LTO_Not_Enabled();
    return (*inport & mask);
}

This gives you portability across different AVR boards, and has also been implemented for the new-style AVRs as well. See https://github.com/WestfW/Duino-hacks/tree/master/fastdigitalIO
(I don't know how much it will help in this case, since you don't have constant bit values. Time to disassemble the code and see what the compiler is actually doing, rather than "guessing" at the C level.)
Writing code at the "digitalWriteFast()" level tends to be nice because there is always SOMEONE who will go off and implement that for new architectures as well.

See also Nerd Ralph: Fastest AVR software SPI in the West

(Note that LTO doesn't seem to be used in the non-AVR world. There were vague reports of "issues" with ARM, but I don't recall ever seeing details.)

Are you saying the compiler now knows how to reach down into tables at compile time that are declared as PROGMEM? That didn't use to be the case.

Yeah I did my stuff nearly 15 years ago.

But, at least when I tested years ago, there were a few small changes needed to the declaration of the pin tables in the variant files to allow it to work.
The PROGMEM stuff was the issue.
I tried for years to get the Arduino.cc devel team to make a few small changes to allow this to work, with some conditionals that could be set to disable the PROGMEM declaration for special purposes.
It didn't affect anything already existing, but they had no interest in making the simple changes.

In my case I needed something better than single pin as I also needed the ability to optimize multi pin accesses for both reads and writes.
The macros I have are smart enough to use the absolute minimal number of instructions; i.e. when all the pins are in proper order in a single register it will do a single byte access.
It was portable on all 8bit AVR processors across every AVR platform I tested on at the time.
i.e. Arduino.cc, Teensy, Sanguino, SleepingBeauty, 1284p, Tiny, etc...
Some of those were on the old "hardware" cores that were pre Board manager support.

It was incredible to see such a large amount of source code boil down to a single AVR instruction.

--- bill

As fast as possible is not always a good thing.
For example, I have found that even using shiftOut() on some platforms like the ESP parts, it can go so fast that it is faster than the device can handle.
Which can be annoying, since it means you have to code it yourself with a small added delay.

--- bill

I think since my post #12 I've had this about as fast as I need. A bit taking in the 1.5us to 2us range isn't bad by my reckoning.

Thanks for the further thoughts,

westfw, is your post #16 code effectively the same underneath as mine in post #12? Just formatted in a way which is shorter in source code but compiles to the same thing?

2us range - 500 kHz - too slow.
The approach at post #9 gives you 8 MHz.

If you want it done really fast, you MUST abandon methods that take a pin as a parameter, and work only with pins as predefined constants.
As we can see from the comparison of posts #9 and #12, this allows you to speed up the code by an order of magnitude.