digitalWrite() "loop hoisting"

Hi,

In a program for the Nokia 5110 LCD display (84x48 pixels), this loop was used to clear the screen:

for(int i=0; i<504; i++) LcdWriteData(0x00);

On an Arduino Uno the loop takes 87436μs, which translates to 11.44fps.
This is the LcdWriteData() function it uses:

void LcdWriteData(byte dat)
{
  digitalWrite(DC, HIGH); //DC pin is low for commands
  digitalWrite(CE, LOW);
  shiftOut(DIN, CLK, MSBFIRST, dat); //transmit serial data
  digitalWrite(CE, HIGH);
}
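
For anyone who wants to redo the numbers: the frame rate quoted above is just the reciprocal of the measured time per frame. A minimal sketch of that conversion, which also compiles on a PC; the helper name framesPerSecond() is made up for illustration and is not part of the original sketch:

```cpp
#include <stdint.h>

// Hypothetical helper (not from the original sketch):
// convert a measured time per frame in microseconds to frames per second.
double framesPerSecond(uint32_t frameMicros)
{
  return 1000000.0 / frameMicros;
}
```

With the measured 87436μs this gives roughly 11.44fps, matching the figure above.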

Yesterday I dove into the implementations of digitalWrite() and shiftOut() and found quite some potential for improvement. A series of changes took the frame rate from 11.44 to 23.74 and then to 206.27fps for just clearing/filling the screen, and finally to 116.93fps for drawing a complete frame.

For posting here I removed a superfluous comment and command from yesterday's code and changed the data type of the bit iteration loop variable to uint8_t, resulting in 121.01fps. This is the code between the microsecond measurements:

 unsigned long t0=micros(); // micros() returns unsigned long
//  for(int i=0; i<504; i++) LcdWriteData(0x00); // clear LCD
  digitalWrite(DC, HIGH); //DC pin is low for commands
  digitalWrite(CE, LOW);
  uint8_t bitD = digitalPinToBitMask(DIN);
  uint8_t bitC = digitalPinToBitMask(CLK);
  volatile uint8_t *outD=portOutputRegister(digitalPinToPort(DIN));
  volatile uint8_t *outC=portOutputRegister(digitalPinToPort(CLK));
  for(int i=0; i<504; i++) 
  {
   uint8_t val=frame[i];
   for(uint8_t j=0; j<8; ++j, val>>=1)
   {
    // digitalWrite(DIN, !!(val & (1 << (7 - j))));
    uint8_t oldSREG = SREG;
    cli();
    // DIN, now pin 7, is no PWM pin and valid 
    if ((val & 0x01)!=0)
      *outD |= bitD;
    else
      *outD &= ~bitD;
    SREG = oldSREG;

//    digitalWrite(CLK, HIGH);
    cli();
    // CLK, pin 8, is no PWM pin and valid
    *outC |= bitC;
    SREG = oldSREG;

//    digitalWrite(CLK, LOW);
    cli();
    *outC &= ~bitC;
    SREG = oldSREG;
   }
  }
  digitalWrite(CE, HIGH);
 unsigned long t1=micros(); // micros() returns unsigned long

The technique I used is called loop hoisting (also known as loop-invariant code motion): it moves loop-invariant code out of the loops. In the code above this was the determination of the bit mask and the output register for the CLK and DIN pins. In addition, shiftOut() was sped up a bit. I also made sure to use pins that are not PWM pins (I had to switch from D9 to D7 for that, see the added cable from D7 to D9 below), because that allows skipping the test on whether the pin is a PWM pin (and turning off its timer). And since I know that I provide valid pin numbers, I could skip the test for a valid pin number as well.
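
To show the hoisting idea in isolation, here is a minimal sketch that runs on a PC instead of the Uno. Everything in it is a stand-in I made up for illustration: fakePort plays the role of a port output register, pinToMask() plays the role of digitalPinToBitMask(), and lookups merely counts how often the lookup runs. It is not the Arduino core code.

```cpp
#include <stdint.h>

// Stand-ins so the sketch runs on a PC (assumptions, not Arduino core code):
volatile uint8_t fakePort = 0;  // plays the role of a port output register
unsigned lookups = 0;           // counts pin -> bit mask lookups

// Plays the role of digitalPinToBitMask(); the mapping is simplified.
uint8_t pinToMask(uint8_t pin)
{
  ++lookups;
  return (uint8_t)(1u << (pin & 7));
}

// Naive version: the loop-invariant lookup is repeated on every pin write.
void pulseNaive(uint8_t pin, uint16_t n)
{
  for (uint16_t i = 0; i < n; ++i)
  {
    fakePort = (uint8_t)(fakePort | pinToMask(pin));            // pin HIGH
    fakePort = (uint8_t)(fakePort & (uint8_t)~pinToMask(pin));  // pin LOW
  }
}

// Hoisted version: the lookup is done once, before the loop.
void pulseHoisted(uint8_t pin, uint16_t n)
{
  const uint8_t mask = pinToMask(pin);
  for (uint16_t i = 0; i < n; ++i)
  {
    fakePort = (uint8_t)(fakePort | mask);            // pin HIGH
    fakePort = (uint8_t)(fakePort & (uint8_t)~mask);  // pin LOW
  }
}
```

For 504 pulses the naive version performs 1008 lookups and the hoisted version exactly one, while both leave the simulated port in the same state.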

In general I would not do stuff like that and would use digitalWrite() as is, with all its built-in checks.
But for time-critical stuff like drawing an image I do not feel guilty about loop hoisting in a self-made digitalWrite(), given the big increase in fps (11.44->206.27 for clear/fill display).

Hermann.

Nice, but was there a question?

aarg:
Nice, but was there a question?

Good question; maybe this needs to be moved to another section, sorry.

Thanks for moving the thread.

There is a mistake in the code shown above: it outputs the bits LSBFIRST although MSBFIRST is needed.
This little diff corrects it:

$ diff ~/Arduino/sketch_feb27b/sketch_feb27b.ino ~/Arduino/sketch_feb28a/sketch_feb28a.ino 
92c92
<    for(uint8_t j=0; j<8; ++j, val>>=1)
---
>    for(uint8_t j=0; j<8; ++j, val<<=1)
98c98
<     if ((val & 0x01)!=0)
---
>     if ((val & 0x80)!=0)
$
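
To make the effect of the two bit orders visible, here is a small PC-side sketch. It assumes, as the MSBFIRST requirement above suggests, that the receiver treats the first bit it gets as bit 7; the function sendAndReceive() is a made-up model for illustration and not part of the sketch files.

```cpp
#include <stdint.h>

// Hypothetical model: send val bit by bit, either MSB first or LSB first,
// to a receiver that is assumed to shift incoming bits in MSB first.
uint8_t sendAndReceive(uint8_t val, bool msbFirst)
{
  uint8_t received = 0;
  for (uint8_t j = 0; j < 8; ++j)
  {
    uint8_t bit;
    if (msbFirst)
    {
      bit = (val & 0x80) ? 1 : 0;  // corrected sender: test top bit, shift left
      val <<= 1;
    }
    else
    {
      bit = (val & 0x01) ? 1 : 0;  // buggy sender: test bottom bit, shift right
      val >>= 1;
    }
    received = (uint8_t)((received << 1) | bit);  // first bit ends up as bit 7
  }
  return received;
}
```

Under that assumption, sending 0xC1 MSB first arrives as 0xC1, while sending it LSB first arrives bit-reversed as 0x83.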

I further commented out two “SREG = oldSREG;” and “cli();” statements, which increased the frame rate from 121.01 to 142.94fps.
You can see the correct display below (all 6 “rows” of 8 pixels are upside down compared to before).

If you want to play with it, here are the two files needed, based on code from this 3-part YouTube video:
https://stamm-wilbrandt.de/en/forum/5110/sketch_feb28a.ino
https://stamm-wilbrandt.de/en/forum/5110/font.h

What I find amazing is the very short time that CLK goes from LOW to HIGH and then back to LOW.

This is the C++ code piece:

*outC |= bitC;
*outC &= ~bitC;
SREG = oldSREG;

I compiled with “-S” to get the generated assembler code; this is the corresponding code piece:

#clocks
2       ld r9,Z
1       or r9,r21
2       st Z,r9
        .loc 1 112 0
2       ld r9,Z
1       and r9,r11
2       st Z,r9
        .loc 1 113 0
1       out __SREG__,r10

r11 is the one's complement of r21:

1       mov r11,r21
1       com r11

So the CLK pin is HIGH for 2+1+2=5 clock cycles @16MHz, which is 0.3125μs only …
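
The conversion from clock cycles to time can be written down as a tiny helper. This is only a sketch of the arithmetic above; the name cyclesToMicros() is made up, and the 16MHz clock is the one mentioned in the post.

```cpp
#include <stdint.h>

// Hypothetical helper: duration of a number of CPU clock cycles in microseconds.
constexpr double cyclesToMicros(uint32_t cycles, uint32_t cpuHz)
{
  return cycles * 1000000.0 / cpuHz;
}
```

cyclesToMicros(5, 16000000UL) gives the 0.3125μs from above (ld + and + st = 2+1+2 cycles).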

Hermann.

The above routine, put into a function “drawFrame()”, is so quick that “delay(40)” was needed to see “something”.
The code below scrolls the frame 1 pixel to the right and then computes and draws new pixels for three functions.
It is no full-blown graphics library yet, but it shows that stuff can be built easily; here is a video of the code below:

// 240 step demo:  https://www.youtube.com/watch?v=FOr447w9HpE
 for(uint8_t z=0; z<240; ++z) 
 {
  // scroll 48x84 display 1 pixel to the right
  for(uint8_t i=0; i<84; ++i)
  {
    for(int8_t j=5; j>=0; --j)
    {
      frame[j*84+i]<<=1;
      if ((j>0) && (frame[(j-1)*84+i]&0x80))
        frame[j*84+i]|=0x01;    
    }
  }  

  // calculate new y-coordinates for 3 functions
  uint8_t a = 14+11*sin(z*0.25);
  uint8_t b = 42+11*cos(z*0.25);
  int8_t c = 11*cos(z*0.25);
  c = 70 + ((c<0) ? -11 : +11);
  
  // drawPixel(0,y) for three functions
  frame[a]|=0x01;
  frame[b]|=0x01;
  frame[c]|=0x01;
  
  drawFrame(frame);

  // allows to "see" something  
  delay(40);
 }

Coordinate system for 48x84 display:

 83|
 82|
...|
...|
  1|
  0|
   +-------
    01...44
         67

Row y bytes and bits:

byte 0*84+y   1*84+y   2*84+y   3*84+y   4*84+y   5*84+y
   +--------+--------+--------+--------+--------+--------+
  y|01234567 01234567 01234567 01234567 01234567 01234567
   +--------+--------+--------+--------+--------+--------+
    01234567 89012345 67890123 45678901 23456789 01234567
               111111 11112222 22222233 33333333 44444444
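
Going by the layout above, a pixel (x,y) lives in byte (x/8)*84+y at bit x%8. A minimal sketch of a pixel setter built on that; the helper names pixelByteIndex(), pixelBitMask() and setPixel() are made up for illustration, and only the frame buffer name is taken from the code earlier in the thread.

```cpp
#include <stdint.h>

uint8_t frame[504];  // 6 groups of 84 bytes, same name as in the code above

// Hypothetical helpers following the byte/bit layout shown above.
// x is 0..47 (8 pixels per byte), y is 0..83 (one byte per y within a group).
uint16_t pixelByteIndex(uint8_t x, uint8_t y)
{
  return (uint16_t)((x / 8) * 84 + y);
}

uint8_t pixelBitMask(uint8_t x)
{
  return (uint8_t)(1u << (x % 8));
}

// Set pixel (x,y); coordinates outside the display are ignored.
void setPixel(uint8_t x, uint8_t y)
{
  if (x >= 48 || y >= 84) return;
  frame[pixelByteIndex(x, y)] |= pixelBitMask(x);
}
```

With this, setPixel(0, a) does the same as the frame[a]|=0x01 used for drawPixel(0,y) in the demo loop.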

Hermann.