Optimization of LED driver

To drive addressable LEDs I am using the LED driver function developed by Kevin Darrah.
See his video on youtube:

It works perfectly: the fastest driver in the world, and it does not require any libraries.
BUT this function is tied to the Uno and to pin 8 (bit 0 of PORTB), which basically means that only one LED strip can be used.
Removing the pin 8 restriction requires somewhat more complex bitwise operations, and to make room for them some NOPs have to be removed. Since this function is based on timing, the total duration must stay the same, i.e. "adding something here means removing something there". Kevin developed this function by watching an oscilloscope to adjust the timing.
Question: has somebody already made such an adjustment? Or, since I do not have an oscilloscope, could somebody who has one do it?
Ideally the pin number would be an argument to this function.
Please do not post advice like "use a library such as NeoPixel".
(This is a question for experts, not in the "advice for dummies" category.)

//WS2812 Driver Function
void vRGB_update() {
  // abRGB[] holds the brightness bytes for all LEDs, 3 bytes per LED
  // each byte is a 0..255 setpoint for one colour channel of one LED
  byte ExistingPort, WS2812pinHIGH;//local variables here to speed up pinWrites
    
  noInterrupts();                   // kill the interrupts while we send the bit stream out...
  ExistingPort = PORTB;             // save the status of the entire PORT B - lets us write to the entire port without messing up the other pins on that port
  WS2812pinHIGH = PORTB | 1;        // this gives us a byte we can use to set the whole PORTB with the WS2812 pin HIGH
  int bitStream = NUMBER_OF_LEDS * 3; // total bytes in the LED string

  // This for loop runs through all of the bits (8 at a time) to set the WS2812 pin ON/OFF times
  for (int i = 0; i < bitStream; i++) {

    PORTB = WS2812pinHIGH;//bit 7  first, set the pin HIGH - it always goes high regardless of a 0/1 
    
    //here's the tricky part, check if the bit in the byte is high/low then write that status to the pin
    // (abRGB[i] & B10000000) will strip away the other bits in abRGB[i], so here we'll be left with B10000000 or B00000000
    // then it's easy to check if the bit is high or low by logical-AND'ing that with the bit mask "&& B10000000", this gives 1 or 0
    // if it's a 1, we'll OR that with the Existing port, thus keeping the pin HIGH, if 0 the pin is written LOW 
    PORTB = ((abRGB[i] & B10000000) && B10000000) | ExistingPort; 
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");                  // these are NOPS - these let us delay clock cycles for more precise timing 
    PORTB = ExistingPort;                                                    // okay, here we know we have to be LOW regardless of the 0/1 bit state
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");// minimum LOW time for pin regardless of 0/1 bit state

    // then do it again for the next bit and so on... see the last bit though for a slight change
    PORTB = WS2812pinHIGH; // bit 6
    PORTB = ((abRGB[i] & B01000000) && B01000000) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");
    PORTB = ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");

    PORTB = WS2812pinHIGH; //bit 5
    PORTB = ((abRGB[i] & B00100000) && B00100000) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");
    PORTB = ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");

    PORTB = WS2812pinHIGH;//bit 4
    PORTB = ((abRGB[i] & B00010000) && B00010000) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");
    PORTB = ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");

    PORTB = WS2812pinHIGH;//bit 3
    PORTB = ((abRGB[i] & B00001000) && B00001000) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");
    PORTB = ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");

    PORTB = WS2812pinHIGH;//bit 2
    PORTB = ((abRGB[i] & B00000100) && B00000100) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");
    PORTB = ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");

    PORTB = WS2812pinHIGH;//bit 1
    PORTB = ((abRGB[i] & B00000010) && B00000010) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");
    PORTB = ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t");

    PORTB = WS2812pinHIGH;//bit 0
    __asm__("nop\n\t");//on this last bit, the check is much faster, so had to add a NOP here
    PORTB = ((abRGB[i] & B00000001) && B00000001) | ExistingPort;
    __asm__("nop\n\t""nop\n\t""nop\n\t""nop\n\t""nop\n\t"); 
    PORTB = ExistingPort;//note there are no NOPs after writing the pin LOW, this is because the FOR Loop uses clock cycles that we can use instead of the NOPS
  } // for loop
  interrupts();//enable the interrupts

// all done!
}//void vRGB_update
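
A note on where the pin 8 restriction comes from: it is the constant `1` in `PORTB | 1` plus the `&&` trick, which always yields 0 or 1, i.e. bit 0 of the port. As far as I can tell from the code, the LOW level also relies on pin 8 already being LOW when `ExistingPort` is captured. Below is a minimal sketch of one way to generalise this to any pin of the same port, by precomputing a HIGH byte and a LOW byte outside the timing-critical loop. The names `PortLevels`, `makeLevels`, `levelForBit` and `pinMask` are hypothetical and not part of Kevin Darrah's driver, and nothing here has been measured on hardware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the original driver.
// Holds the two port bytes needed inside the timing-critical loop.
struct PortLevels {
    uint8_t high;  // port value with the LED data pin driven HIGH
    uint8_t low;   // port value with the LED data pin driven LOW
};

// portSnapshot: value read from the port register before sending (e.g. PORTB)
// pinMask: exactly one bit set, e.g. 1 << 0 for Uno pin 8, 1 << 3 for Uno pin 11
inline PortLevels makeLevels(uint8_t portSnapshot, uint8_t pinMask) {
    assert(pinMask != 0 && (pinMask & (pinMask - 1)) == 0);  // single bit only
    return PortLevels{
        static_cast<uint8_t>(portSnapshot | pinMask),
        static_cast<uint8_t>(portSnapshot & ~pinMask)};
}

// Port value for the data-dependent part of one bit:
// stay HIGH for a 1 bit, drop LOW for a 0 bit. No shift is needed,
// because the data bit only selects between two precomputed bytes.
inline uint8_t levelForBit(const PortLevels& lv, uint8_t dataByte, uint8_t bitMask) {
    return (dataByte & bitMask) ? lv.high : lv.low;
}
```

In the driver this would replace `WS2812pinHIGH` with `lv.high`, `ExistingPort` with `lv.low`, and the middle write with something like `PORTB = (abRGB[i] & B10000000) ? lv.high : lv.low;`. The compiler may emit a different number of cycles for that selection than for the original expression, so the NOP counts would still need re-checking. Making the port itself (PORTB vs PORTD) a run-time argument is a separate problem, because a write through a pointer is typically slower than a direct write to a fixed port register.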

You are aware that, since the WS2811 chips all run at the same maximum frequency, any bit-banged function will take the same amount of time (for the same number of LEDs)? And that your only gain can be program memory space? Or do you think there is enough time to write to 2 pins simultaneously?
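
To put rough numbers on that point, here is a small worked calculation. It assumes the usual WS2812 figures of an 800 kHz data rate (1.25 µs per bit), 24 bits per LED and a 16 MHz Uno clock; these are typical datasheet values, not measurements of this setup, and the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed typical figures, not measured on this setup.
const uint32_t kBitTimeNs  = 1250;      // 800 kHz data rate
const uint32_t kBitsPerLed = 24;        // 3 colour bytes per LED
const uint32_t kCpuHz      = 16000000;  // Uno clock

// Time to clock out one full frame, in microseconds (reset pause excluded).
inline uint32_t frameTimeUs(uint32_t ledCount) {
    assert(ledCount <= 100000);  // keeps the result well inside 32 bits
    uint64_t ns = static_cast<uint64_t>(ledCount) * kBitsPerLed * kBitTimeNs;
    return static_cast<uint32_t>(ns / 1000);
}

// CPU cycles available for each transmitted bit.
inline uint32_t cyclesPerBit() {
    return static_cast<uint32_t>(
        static_cast<uint64_t>(kCpuHz) * kBitTimeNs / 1000000000ULL);
}
```

With these assumptions a 188-LED strip takes about 5.64 ms per frame whatever driver is used, and each bit has a budget of 20 clock cycles, which is why every instruction added to the loop has to be paid for with a removed NOP.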

The amount of memory is a limitation, you are right, but only if we need a separate RGB array for each LED strip. My algorithm could be the following: use the RGB array for LED strip 1, then change its contents and use it for LED strip 2, and so on.
The beauty of this driver is that it requires very little memory. It leaves space for your code.

I have implemented an algorithm that uses 188 LEDs on one strip, and I am using TWO RGB arrays (for restoring the previous state, for calculating changes, etc.). It works perfectly. There is enough memory despite the quite complex algorithms for "light music".
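
The RAM side of this can be checked with simple arithmetic. The sketch below assumes 3 bytes per LED and the 2048 bytes of SRAM of the Uno's ATmega328P; the function names are hypothetical, and the figure it returns is an upper bound, since the stack and all other variables also live in that space.

```cpp
#include <cassert>
#include <cstdint>

const int32_t kUnoSramBytes = 2048;  // ATmega328P SRAM size

// Bytes needed for one RGB array (3 bytes per LED).
inline int32_t rgbArrayBytes(uint16_t ledCount) {
    return static_cast<int32_t>(ledCount) * 3;
}

// SRAM left over after allocating arrayCount RGB arrays.
// A negative result means the arrays alone do not fit.
inline int32_t sramLeft(uint16_t ledCount, uint8_t arrayCount) {
    assert(arrayCount > 0);
    return kUnoSramBytes - rgbArrayBytes(ledCount) * arrayCount;
}
```

For 188 LEDs one array is 564 bytes, so two arrays take 1128 bytes and leave at most 920 bytes for everything else, while four such arrays would already exceed the Uno's SRAM.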

Regarding timing: I think so, but I cannot answer properly, as the code is not mine and was developed by watching an oscilloscope and adjusting the timing with inline NOPs. All I can say is that it works, and works perfectly.
This is exactly my problem: I have no way to measure. In theory, deeper knowledge of the execution times of particular instructions would help. I suspect a bitwise operation like "|" or "&" executes in one or two Uno clock cycles, but I do not know.
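
For reference, the cycle counts in question are published in the AVR instruction set manual. The small lookup below lists the values as I recall them for the ATmega328P and converts cycles to nanoseconds at 16 MHz. Treat the table as an assumption to verify against the manual, and note that what the compiler actually emits for a C expression such as `a | b` can only be confirmed from the disassembly. The names `InstrCost`, `cyclesFor` and `cyclesToNs` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct InstrCost {
    const char* mnemonic;
    uint8_t cycles;
};

// Cycle counts recalled from the AVR instruction set manual for the
// ATmega328P; an assumption to verify, not a measurement.
static const InstrCost kCosts[] = {
    {"NOP", 1}, {"AND", 1}, {"OR", 1},  {"ANDI", 1}, {"ORI", 1},
    {"LSL", 1}, {"IN", 1},  {"OUT", 1}, {"LD", 2},   {"ST", 2},
};

// Returns the cycle count for a mnemonic, or -1 if it is not in the table.
inline int cyclesFor(const char* mnemonic) {
    for (const InstrCost& c : kCosts) {
        if (std::strcmp(c.mnemonic, mnemonic) == 0) return c.cycles;
    }
    return -1;
}

// Converts a cycle count to nanoseconds for a given CPU clock.
inline double cyclesToNs(uint32_t cycles, uint32_t cpuHz = 16000000) {
    assert(cpuHz != 0);
    return cycles * 1e9 / cpuHz;
}
```

On these figures a register-to-register OR or AND costs the same as one NOP (62.5 ns at 16 MHz), so swapping one for the other keeps the timing, while loading a byte from the array costs two.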

If you want to use the same array to write to 2 different pins you can, of course, and yes, that would save RAM!

I don't know the execution times of the individual instructions either; you could see if you can find the author.

Yes, the driver does need very little memory, but the other drivers don't actually use an awful lot of program memory either. They just look a lot bigger because they include the functions for all board variants as well. If you want to write to more than one strip using the same array, how about writing to the same pin and using an AND-gate TTL chip to route the signal to a particular strip?
Keep writing to the same pin, connect that pin to one input of each of 2 different AND gates, and pull the other input of the gate you want high as a 'gate select'. That will work.
(Attachment: Strip Select.JPG)
Just an 'obsolete' 74x08 and you'll be sorted.
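
The gating idea is easy to model in software. The sketch below is only a logic model of the proposed wiring, assuming one select line per strip and up to four strips (a 74x08 contains four 2-input AND gates); `gatedOutputs` and `selectMask` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Logic model of the AND-gate routing: strip k sees (data AND select_k).
// selectMask: bit k set means the select input of gate k is pulled HIGH.
// Returns one bit per gate output.
inline uint8_t gatedOutputs(bool data, uint8_t selectMask) {
    assert(selectMask < 16);  // four gates in one 74x08
    return data ? selectMask : 0;
}
```

On the Arduino side the select lines would be ordinary digital outputs set before each call to `vRGB_update()`, so the timing-critical loop itself stays untouched.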

Thank you for the idea; however, it seems strange to add a new circuit while leaving other pins unused.
I am also not ready to do soldering. I assumed there would be something ready-made for Arduino, but a quick search turned up only schematics.

But returning to my initial question: I would be able to adjust the instructions that set the bits if I knew their execution times. For example, how long does "|" take to execute compared with an inline NOP?
I tried to find some information without success, but most likely it exists somewhere, or somebody knows.

On how long "|" takes compared with an inline NOP: I think the extra bit shift required to route the bit to a different pin takes 1 cycle per (left) shift. That is what I googled and got from the datasheet. You could experiment a little by just removing a NOP where you think you need more time. I had some trouble figuring that out, since the pin starts high and the extra time is spent working out whether it goes low or not.
The thing is that you would need to create a whole separate function (just adding a parameter won't work; the extra condition or call would slow things down even more), which comes at the expense of program memory. Or you could write to 2 pins at once (this can be done, I'd say, and would be pretty cool, but it is not what you were after), but then you would need a separate array (actually you'd have to combine the outgoing bits so that 2 bits are ready to be written), and that would cost extra RAM, which is your scarcest resource at the moment.
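
To illustrate what "combining the outgoing bits" for two strips could look like, here is a minimal sketch that builds the port byte for one bit position from two data bytes. The function and parameter names are hypothetical, both strips are assumed to sit on the same port, and whether this can be computed inside the per-bit cycle budget (rather than precomputed into extra RAM, as the reply suggests) has not been tested.

```cpp
#include <cassert>
#include <cstdint>

// Builds the port value for the data-dependent part of one bit when two
// strips on the same port are driven at once.
// portLow:  port value with both data pins LOW
// byteA/B:  current data byte for strip A and strip B
// bitMask:  which bit of the data bytes is being sent (0x80 first)
// pinMaskA/B: single-bit masks of the two data pins
inline uint8_t combinedLevel(uint8_t portLow, uint8_t byteA, uint8_t byteB,
                             uint8_t bitMask, uint8_t pinMaskA, uint8_t pinMaskB) {
    assert((portLow & (pinMaskA | pinMaskB)) == 0);  // both pins start LOW
    uint8_t out = portLow;
    if (byteA & bitMask) out |= pinMaskA;  // strip A stays HIGH for a 1 bit
    if (byteB & bitMask) out |= pinMaskB;  // strip B stays HIGH for a 1 bit
    return out;
}
```

Precomputing one such port byte per transmitted bit would need 8 bytes of buffer for every pair of data bytes, which is where the extra RAM cost mentioned above comes from.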

Thanks,
this is how I see it too. I will do some experiments.