Updating RGBW LED strips in parallel; fast memory shuffle (with inline assembler?)

Hello everyone,

I am trying to update four RGBW LED strips in parallel with an Arduino Nano.
The strips are wired to digital pins 0-3, which correspond to bits 0-3 of the I/O register PORTD. (Image in the attachment)

The strip type is SK6812 RGBW, but I don't think that matters much here. (Datasheet in attachment)

The important thing is that in order to update one LED you need to send it 32 bits of data in quick succession, as described in the datasheet.
I've managed to do that by preparing an array of 32 bytes (named LED[32]) which holds the data for one LED of each strip. These values are then written to the I/O register PORTD to drive the pins high and low.
The LED[32] array looks like this:

Color order, from LED[0] up to LED[31] (LSB to MSB within each color):
w (white)
b (blue)
r (red)
g (green)

Pins 4-7 are read once at the beginning and written back unchanged with every byte (marked X below), so they keep their state. An entry such as W3_0 is bit 0 of the white value for strip 3.

bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0
X X X X W3_0 W2_0 W1_0 W0_0
X X X X W3_1 W2_1 W1_1 W0_1
X X X X W3_2 W2_2 W1_2 W0_2
X X X X W3_3 W2_3 W1_3 W0_3
... ... ... ... ... ... ... ...
X X X X r3_4 r2_4 r1_4 r0_4
X X X X r3_5 r2_5 r1_5 r0_5
X X X X r3_6 r2_6 r1_6 r0_6
X X X X r3_7 r2_7 r1_7 r0_7

This data needs to be calculated before each LED is written.
The time between writing two LEDs must be less than 80 µs!
On the 16 MHz Arduino that is 1280 cycles.
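
To make the budget concrete, here is a small sketch of the arithmetic. The function name cycles_in_window is just a placeholder and not part of my sketch. If the whole window were spent on the calculation, 1280 cycles for 32 output bytes would leave about 40 cycles per byte, or 320 cycles per color; loop overhead and loading the values come out of the same budget.

```c
#include <stdint.h>

/* Number of CPU cycles that fit into a time window.
   cpu_hz and window_us are illustrative parameter names. */
static uint32_t cycles_in_window(uint32_t cpu_hz, uint32_t window_us)
{
    /* cycles per microsecond multiplied by the window length */
    return (cpu_hz / 1000000UL) * window_us;
}

/* Average cycles available per output byte if the window is shared
   evenly by num_bytes entries of the LED[] array. */
static uint32_t cycles_per_byte(uint32_t cpu_hz, uint32_t window_us, uint32_t num_bytes)
{
    return cycles_in_window(cpu_hz, window_us) / num_bytes;
}
```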

At this point my calculation is not fast enough.

The LED data is stored in an array named LEDs[NUM_LEDS][4][4].
The first dimension selects the LED, the second one the color (0 = red, 1 = green, 2 = blue, 3 = white), and the last one the strip.
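
As a rough sketch of that layout: the color indices below are the ones used in the showPixel code further down, while the value of NUM_LEDS, the enum names and the helper set_pixel are only assumptions for illustration.

```c
#include <stdint.h>

#define NUM_LEDS 30  /* hypothetical strip length, pick your own */

/* Color index in the second dimension (order taken from showPixel) */
enum { COLOR_RED = 0, COLOR_GREEN = 1, COLOR_BLUE = 2, COLOR_WHITE = 3 };

/* LEDs[led][color][strip]: one byte per color and strip */
static uint8_t LEDs[NUM_LEDS][4][4];

/* Hypothetical helper: set all four colors of one LED on one strip */
static void set_pixel(uint8_t ledNr, uint8_t strip,
                      uint8_t r, uint8_t g, uint8_t b, uint8_t w)
{
    LEDs[ledNr][COLOR_RED][strip]   = r;
    LEDs[ledNr][COLOR_GREEN][strip] = g;
    LEDs[ledNr][COLOR_BLUE][strip]  = b;
    LEDs[ledNr][COLOR_WHITE][strip] = w;
}
```

With this layout the four bytes that one shuffle call needs (same LED, same color, strips 0-3) sit next to each other in memory.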

My code to write to all LEDs:

static inline __attribute__ ((always_inline)) void showPixel() {
  // Send 32 bits down every strip. Each pixel is 32 bits wide (8 bits each for R, G, B and W).
  uint8_t bit;
  uint8_t onPixel, offPixel; // value written to PORTD for the high and the low phase
  cli(); // no interrupts
  offPixel = PIXEL_PORT;  // save the current output of port D
  offPixel &= 0xf0; // keep bits 4-7 of port D at their original value, LED pins low: 0bxxxx0000
  onPixel = offPixel | 0x0f;  // LED pins high, other I/O pins as they were: 0bxxxx1111

  for (uint8_t ledNr = 0; ledNr < NUM_LEDS; ledNr++) {
    // transpose the four colors of this LED into LED[0..31]
    shuffle(0, LEDs[ledNr][3][0], LEDs[ledNr][3][1], LEDs[ledNr][3][2], LEDs[ledNr][3][3], offPixel);  // white
    shuffle(8, LEDs[ledNr][2][0], LEDs[ledNr][2][1], LEDs[ledNr][2][2], LEDs[ledNr][2][3], offPixel);  // blue
    shuffle(16, LEDs[ledNr][0][0], LEDs[ledNr][0][1], LEDs[ledNr][0][2], LEDs[ledNr][0][3], offPixel); // red
    shuffle(24, LEDs[ledNr][1][0], LEDs[ledNr][1][1], LEDs[ledNr][1][2], LEDs[ledNr][1][3], offPixel); // green

    bit = 32;
    while (bit--) { // send out the 32 bytes, LED[31] first
      sendBitX4_lower(LED[bit], onPixel, offPixel);
    }
  }
  sei(); // re-enable interrupts
}

My shuffle function:

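static inline __attribute__ ((always_inline)) void shuffle(uint8_t bit, uint8_t v0, uint8_t v1, uint8_t v2, uint8_t v3, uint8_t IOpins) {
  uint8_t res, pos, mask = 8;  // start at bit 3: the three lowest bits of each color are not transferred
  pos = bit + 8;               // one past the last entry for this color
  // The three lowest bits are always sent as 0; pins 4-7 keep their saved state
  LED[bit++] = IOpins;
  LED[bit++] = IOpins;
  LED[bit++] = IOpins;
  while (bit < pos) {
    res = IOpins;              // bits 4-7: the output that was present
    if (v3 & mask) res |= 8;   // strip 3 -> pin 3
    if (v2 & mask) res |= 4;   // strip 2 -> pin 2
    if (v1 & mask) res |= 2;   // strip 1 -> pin 1
    if (v0 & mask) res |= 1;   // strip 0 -> pin 0
    mask <<= 1;
    LED[bit] = res;
    bit++;
  }
}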

It can be a bit hard to understand what needs to be done in the shuffle function. I've tried to draw it so it is hopefully easier to follow (attachment Shuffle.pdf).
Essentially the calculation is split into 4 parts, one for each color. Each shuffle call writes 8 bytes of the LED[32] array. The process looks a bit like transposing a matrix: each byte of LED[32] contains bits from 4 different bytes of the LEDs array, starting with the LSB in LED[0] and moving up to the MSB in LED[7], and so on for the next color.
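
For anyone who prefers code to the drawing, here is a plain C reference of the transposition for one color. It is only a sketch to show the intended result, not an optimized version: the name shuffle8 and the parameter ioPins are made up for this example, and it handles all 8 bits of each value.

```c
#include <stdint.h>

static uint8_t LED[32];  /* one output byte per transmitted bit */

/* Transpose one color. v0..v3 are that color's value for strips 0..3.
   Writes LED[first] .. LED[first + 7], least significant bit first.
   ioPins holds the saved state of pins 4-7 in the upper nibble. */
static void shuffle8(uint8_t first, uint8_t v0, uint8_t v1,
                     uint8_t v2, uint8_t v3, uint8_t ioPins)
{
    for (uint8_t i = 0; i < 8; i++) {
        /* collect the current lowest bit of every strip into bits 0-3 */
        uint8_t res = (uint8_t)((v0 & 1) |
                                ((v1 & 1) << 1) |
                                ((v2 & 1) << 2) |
                                ((v3 & 1) << 3));
        LED[first + i] = (uint8_t)(res | ioPins);
        /* move the next bit of every value into position 0 */
        v0 >>= 1;
        v1 >>= 1;
        v2 >>= 1;
        v3 >>= 1;
    }
}
```

For example, shuffle8(0, 0x01, 0x02, 0x80, 0xFF, 0xA0) leaves 0xA9 in LED[0]: bit 0 is set for strips 0 and 3, and the upper nibble 0xA is passed through unchanged.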

I've tried different versions of this, some with bit shifting, some with pointers walking through the arrays, but this one was the fastest.

My question is: is it actually possible to do this calculation within that cycle budget? And if so, how?
Probably with inline assembler, but I'm just getting into that...
Thanks for your help. If you're interested, we could refine this and make it available to everyone :)

SK6812RGBW-datasheet short.pdf (386 KB)

Shuffle.pdf (424 KB)