Go Down

Topic: Enhanced LiquidCrystal (Read 45157 times) previous topic - next topic

jrraines

I finally got back to this. I tried unrolling the loops in write4bits and write8bits using the same idea, too.
Code: [Select]
void LiquidCrystal::write4bits(uint8_t value, uint8_t mode) {

digitalWrite(_rs_pin,mode);
// for (uint8_t i = 0; i < 4; i++) {   digitalWrite(_data_pins[i], (value >> i) & 0x01); }
 register uint8_t v=value;
 uint8_t *pinptr = _data_pins;
 digitalWrite(*pinptr++, v & 01 );
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
pulseEnable();
}


on the teensy++2.0 with the nonAxman LCD :

4bits       290
8bits       314
userbusy 192

pjrc

digitalWrite takes 0 for low and any non-zero value for high, so instead of "( v >>= 1) & 01", you could use "v & 2", "v & 4" and "v & 8".

jrraines

I tried that with disappointing results: it is about 60 bytes longer (that seems odd; it should only need 7 more constants for the 8 bit version and should be able to save a rotate each time), and not much faster.

4bit         288
8bit         311
userbusy 192

I wonder if the userbusy version has actually maxed out speed for sending 3200 characters.

pjrc

I believe might be more opportunities for speed improvement.

I downloaded the "LiquidCrystalBulletproof.zip", which appears to be the code before unrolling those loops.  Any chance you could upload the latest?  Could you also upload the benchmark program, or maybe even add it into the examples directory?

Here's a few more ideas...

Why are 3 digitalWrite calls necessary in pulseEnable?  Would it make more sense to write en LOW inside init, and then require only 2 calls to digitalWrite?  Especially on Arduino where digitalWrite is so slow, this could really help.

Perhaps writing to the rs_pin could be moved out of write4bits and write8bits, one level up in call hierarchy?  In 4 bit mode, that will avoid needlessly writing it twice, plus the extra overhead of passing it into another function, twice.  It looks like send and init are the only places that call write4bits, and multiple calls all pass the same value.  Who knows, this might even reduce code size?

Maybe pulseEnable could be folded into write4bits?  However, I'd expect the improvement may be minor, as the compiler is already doing optimizations there.

It might even make sense to create 2 send functions, like send_data() and send_cmd().  Every call to send() passes either HIGH or LOW constants.  Together with the idea of moving the rs_pin write up into send(), it would eliminate lots of parameter passing that is only constants.

In the the 1st line of write(), is _setCursFlag more likely to be zero?  If so, testing it first would be a win.  Then again, I'm not easily seeing how _scroll_count changes, but if either of these is usually zero, you want to test the one more likely to be zero first, since the compiler skips the 2nd test if the first indicates the condition won't call setCursor.

Also perhaps worth trying might be calling setCursor() in the only 2 places where _setCursFlag is set to 1.  Then again, if the user is writing strings that they know fit, which seems likely, this could be slower?

Ok, that's all the ideas I have right now, but I'll probably think of more.....



pjrc

If the userFunc check were dropped, inlining checkLcdBusyFlag or just folding that code into send() might speed things up too?  It writes to rs_pin, so it'd go before the digitialWrite to _rs_pin at the beginning of send... fitting together with the idea of moving the rs_pin wiring out of write4bits and eliminating passing it as a constant all the way through the call hierarchy.  That would get rid of a lot of redundant writes to rs_pin as well, since there be only 1 in each copy of send (send_data, send_cmd) right after the busy check/wait.  If the checkLcdBusyFlag code is folded into send, the case where _rw_pin is used, in send_cmd it won't be necessary to re-write rs_pin because the busy check leaves it low.  Unfortunately, the speed critical data case will still need it written to high after the busy check, but this should still be a net reduction in digitalWrite calls, plus savings in function calls and parameter passing.


jrraines

The benchmark is in the middle of my LCDtest program. The numbers I post are always for a 20x4 LCD.
Code: [Select]
   lcd.home();
   int length = nRows * nColumns;
   lcd.setCursor(0,0);
   char text[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
   text[length] = '\0';
     
   lcd.print(text);  //here we are printing the entire screen with one long string  lets you be sure that linewrap happens exactly as it should
   delay(2000);
  uint8_t a =0;
 if ((nColumns == 8) && (nRows == 2) ) a=1;  //This is a trick to make a 'crazy 8' 16x1 LCD work reasonably in a right to left language; I suppose it could be added to the library routine
                                             // as a conditional but  that seems like too much support. what happens is that we write the 2nd line first; when that fills with (8) characters
                                             // it wraps onto the first line. This means that lcd.print("abcdefghijklmno"); will print from right to left as you desire.
   lcd.clear();
   lcd.rightToLeft();
   lcd.setCursor(nColumns-1,a); //now we print it in right to left mode to test the same thing
   lcd.print (text);
   delay(3000);
   lcd.leftToRight();
//BENCHMARK actually starts here but uses the string set up just above:    
   long startTime=millis();    //let's try to benchmark how fast we can go; this will give us an idea about speed of the various interfaces 4/8 bit and checking busy flag or not:
   uint8_t repetitions = 20;
   char blanks[]="                                                                                                                                                                                                                                                                                                                                                 ";
   blanks[length] = '\0';
   while (repetitions--) {  //fill every screen pixel with text, then fill every pixel with blanks and repeat.
     lcd.setCursor(0,0);
     lcd.print(text);
     lcd.setCursor(0,0);
     lcd.print(blanks);
   }
   long endTime = millis();
   lcd.clear();
   lcd.setCursor(0,0);
   lcd.print("Benchmark took ");
   lcd.setCursor(0,1);
   lcd.print(endTime - startTime);
   lcd.print(" millisecs.");
   delay(5000);
   

jrraines

scroll count is used to peg the columns,line values to the screen rather than the LCD's RAM. it is modified in scroll left and right and zeroed in clear and home. Since sensible code (as opposed to test code) has little use for scrolling, it should be zero almost always.

the set cursor flag is used to fix line wrap. when I write a character to the screen, I check to see if I'm at the end of a line, so that the next character might need to be preceded by a setCursor to position it at the start of the next line rather than the line1,line3,line2,line4 order that otherwise occurs. That happens in write. An intervening user generated setCursor clears it.

jrraines

its been a while, but I THINK the reason I seem to do a redundant digitalWrite(en,LOW) in pulse enable has to do with the fact that 40x4 and 27x4 LCDs have 2 HD44780 chips and 2 enable lines.

jrraines

I actually thought I'd posted this yesterday. Puzzling over why using &02,&04 didn't gain much I created a version with (v>>=1) in write4bits and the &'s in write8bits and looked at the disassembly
Code: [Select]

void LiquidCrystal::write4bits(uint8_t value, uint8_t mode) {
    308:      dc 01             movw      r26, r24
                       "mov __tmp_reg__, %1"  "\n\t"
                       "call _digitalWrite"
                       : "+z" (tmp1)
                       : "r" (tmp2)
                       #endif
                 );
    30a:      16 96             adiw      r26, 0x06      ; 6
    30c:      8c 91             ld      r24, X
    30e:      16 97             sbiw      r26, 0x06      ; 6
    310:      e8 2f             mov      r30, r24
    312:      04 2e             mov      r0, r20
    314:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
digitalWrite(_rs_pin,mode);
// for (uint8_t i = 0; i < 4; i++) {   digitalWrite(_data_pins[i], (value >> i) & 0x01); }
 register uint8_t v=value;
 uint8_t *pinptr = _data_pins;
 digitalWrite(*pinptr++, v & 01 );
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
    318:      96 2f             mov      r25, r22
    31a:      91 70             andi      r25, 0x01      ; 1
    31c:      51 96             adiw      r26, 0x11      ; 17
    31e:      8c 91             ld      r24, X
    320:      51 97             sbiw      r26, 0x11      ; 17
    322:      e8 2f             mov      r30, r24
    324:      09 2e             mov      r0, r25
    326:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    32a:      66 95             lsr      r22
    32c:      96 2f             mov      r25, r22
    32e:      91 70             andi      r25, 0x01      ; 1
    330:      52 96             adiw      r26, 0x12      ; 18
    332:      8c 91             ld      r24, X
    334:      52 97             sbiw      r26, 0x12      ; 18
    336:      e8 2f             mov      r30, r24
    338:      09 2e             mov      r0, r25
    33a:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
    33e:      66 95             lsr      r22
    340:      96 2f             mov      r25, r22
    342:      91 70             andi      r25, 0x01      ; 1
    344:      53 96             adiw      r26, 0x13      ; 19
    346:      8c 91             ld      r24, X
    348:      53 97             sbiw      r26, 0x13      ; 19
    34a:      e8 2f             mov      r30, r24
    34c:      09 2e             mov      r0, r25
    34e:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
    352:      70 e0             ldi      r23, 0x00      ; 0
    354:      75 95             asr      r23
    356:      67 95             ror      r22
    358:      61 70             andi      r22, 0x01      ; 1
digitalWrite(_rs_pin,mode);
// for (uint8_t i = 0; i < 4; i++) {   digitalWrite(_data_pins[i], (value >> i) & 0x01); }
 register uint8_t v=value;
 uint8_t *pinptr = _data_pins;
 digitalWrite(*pinptr++, v & 01 );
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
    35a:      53 96             adiw      r26, 0x13      ; 19
    35c:      11 96             adiw      r26, 0x01      ; 1
    35e:      8c 91             ld      r24, X
    360:      11 97             sbiw      r26, 0x01      ; 1
    362:      53 97             sbiw      r26, 0x13      ; 19
    364:      e8 2f             mov      r30, r24
    366:      06 2e             mov      r0, r22
    368:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
 digitalWrite(*pinptr++,( v >>= 1) & 01 );
pulseEnable();
    36c:      cd 01             movw      r24, r26
}
    36e:      bb cf             rjmp      .-138          ; 0x2e6 <_ZN13LiquidCrystal11pulseEnableEv>


Code: [Select]
void LiquidCrystal::write8bits(uint8_t value, uint8_t mode) {
    490:      ef 92             push      r14
    492:      ff 92             push      r15
    494:      0f 93             push      r16
    496:      1f 93             push      r17
    498:      8c 01             movw      r16, r24
    49a:      f6 2e             mov      r15, r22
    49c:      e4 2e             mov      r14, r20
checkLcdBusyFlag();
    49e:      e4 df             rcall      .-56           ; 0x468 <_ZN13LiquidCrystal16checkLcdBusyFlagEv>
                       "mov __tmp_reg__, %1"  "\n\t"
                       "call _digitalWrite"
                       : "+z" (tmp1)
                       : "r" (tmp2)
                       #endif
                 );
    4a0:      f8 01             movw      r30, r16
    4a2:      86 81             ldd      r24, Z+6      ; 0x06
    4a4:      e8 2f             mov      r30, r24
    4a6:      0e 2c             mov      r0, r14
    4a8:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
 digitalWrite(*pinptr++, value & 01 );
 digitalWrite(*pinptr++,value & 0x2 );
 digitalWrite(*pinptr++,value & 0x4 );
 digitalWrite(*pinptr++,value & 0x8 );
 digitalWrite(*pinptr++,value & 0x10 );
 digitalWrite(*pinptr++,value & 0x20 );
    4ac:      9f 2d             mov      r25, r15
    4ae:      91 70             andi      r25, 0x01      ; 1
    4b0:      f8 01             movw      r30, r16
    4b2:      81 89             ldd      r24, Z+17      ; 0x11
    4b4:      e8 2f             mov      r30, r24
    4b6:      09 2e             mov      r0, r25
    4b8:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    4bc:      9f 2d             mov      r25, r15
    4be:      92 70             andi      r25, 0x02      ; 2
    4c0:      f8 01             movw      r30, r16
    4c2:      82 89             ldd      r24, Z+18      ; 0x12
    4c4:      e8 2f             mov      r30, r24
    4c6:      09 2e             mov      r0, r25
    4c8:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    4cc:      9f 2d             mov      r25, r15
    4ce:      94 70             andi      r25, 0x04      ; 4
    4d0:      f8 01             movw      r30, r16
    4d2:      83 89             ldd      r24, Z+19      ; 0x13
    4d4:      e8 2f             mov      r30, r24
    4d6:      09 2e             mov      r0, r25
    4d8:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    4dc:      9f 2d             mov      r25, r15
    4de:      98 70             andi      r25, 0x08      ; 8
    4e0:      f8 01             movw      r30, r16
    4e2:      84 89             ldd      r24, Z+20      ; 0x14
    4e4:      e8 2f             mov      r30, r24
    4e6:      09 2e             mov      r0, r25
    4e8:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    4ec:      9f 2d             mov      r25, r15
    4ee:      90 71             andi      r25, 0x10      ; 16
    4f0:      f8 01             movw      r30, r16
    4f2:      85 89             ldd      r24, Z+21      ; 0x15
    4f4:      e8 2f             mov      r30, r24
    4f6:      09 2e             mov      r0, r25
    4f8:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    4fc:      9f 2d             mov      r25, r15
    4fe:      90 72             andi      r25, 0x20      ; 32
    500:      f8 01             movw      r30, r16
    502:      86 89             ldd      r24, Z+22      ; 0x16
    504:      e8 2f             mov      r30, r24
    506:      09 2e             mov      r0, r25
    508:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    50c:      9f 2d             mov      r25, r15
    50e:      90 74             andi      r25, 0x40      ; 64
    510:      f8 01             movw      r30, r16
    512:      87 89             ldd      r24, Z+23      ; 0x17
    514:      e8 2f             mov      r30, r24
    516:      09 2e             mov      r0, r25
    518:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
    51c:      f0 e8             ldi      r31, 0x80      ; 128
    51e:      ff 22             and      r15, r31
    520:      09 5e             subi      r16, 0xE9      ; 233
    522:      1f 4f             sbci      r17, 0xFF      ; 255
    524:      f8 01             movw      r30, r16
    526:      81 81             ldd      r24, Z+1      ; 0x01
    528:      07 51             subi      r16, 0x17      ; 23
    52a:      10 40             sbci      r17, 0x00      ; 0
    52c:      e8 2f             mov      r30, r24
    52e:      0f 2c             mov      r0, r15
    530:      0e 94 e6 0c       call      0x19cc      ; 0x19cc <_digitalWrite>
 digitalWrite(*pinptr++,value & 0x40 );
 digitalWrite(*pinptr++,value & 0x80 );
pulseEnable();
    534:      c8 01             movw      r24, r16
    536:      d7 de             rcall      .-594          ; 0x2e6 <_ZN13LiquidCrystal11pulseEnableEv>
}
    538:      1f 91             pop      r17
    53a:      0f 91             pop      r16
    53c:      ff 90             pop      r15
    53e:      ef 90             pop      r14
    540:      08 95             ret

00000542 <_ZN13LiquidCrystal4sendEhh>:
}


I don't really understand avr assembler in detail but I'm pleased to see that the compiler understood that (v >>=1) is a single shift instruction. I'm puzzled at the symmetrical add/subtract in the 4bit version; it seems like it must be pulling something off the stack frame and resetting the stack pointer. And I'm amazed that the compiler is smart/tricky enough to turn the call to pulse enable at the end of write4bits into a jump, letting pulse enable do the return, popping the return address that was pushed when write4bits was called !

That bit of trickiness does not get used in write8bits. I suppose that implies that the size of the routine has gotten beyond what the optimizer looks at. Maybe it would be vice versa if I'd reversed which method was in 4bits and which one was in 8bits.

The 8bit routine seems to suffer from more setup/cleanup than the 4bit routine. the reason for that is not obvious to me.

pjrc

Quote
That bit of trickiness does not get used in write8bits. I suppose that implies that the size of the routine has gotten beyond what the optimizer looks at. Maybe it would be vice versa if I'd reversed which method was in 4bits and which one was in 8bits.


Most likely it's the call to pulseEnable that prevents the compiler from applying the same optimizations.  If you try moving it outside write8bits, there's a pretty good chance you'll see similar optimization as write4bits.

jrraines

Paul continued to work on this and considerably speeded things up. It seems like we hit a limit for speed at about 190 on the teensy with userBusy. 190 millisecs / 3200 characters works out to about 58 usec per character. The HD44780 spec sheet indicates about 43 usec per character. I think the arduino gets ready for the next character, switches all the data pins to input, finds the busy flag during the 43 usec.

Then it has to switch the pinModes back, discard the second nibble, put the new character on the data pins and clock that in by pulsing enable. That part doesn't overlap the 43 usec and must add about 15 usec. With this API (as opposed to one that specified the pin numbers in some other fashion) I don't think it can go much faster.

Anyway, in the next 24-48 hrs I will finish the full test sequence on Paul's version and post it. for the non-Axman LCD what i see now is:
Teensy
userBusy    189
4 + RW       219
8 + RW       296
4  no RW   1207
8  no RW   1212

Mega
userBusy    316
4 + RW       468
8 + RW       500
4  no RW   1349
8  no RW   1314

I think what I will do is test and post this, then promptly begin to remove much of the 8-bit specific code; preserve the 8 bit forms of the constructor for compatibility but actually use 4 bits internally. this should save some space and not really hurt functionality for anyone as far as I can see. Paul suggested this and with no real advantage to 8 bit mode in any way, that direction makes sense to me. If anyone sees a downside to that i would be interested to know.

floresta

Quote
preserve the 8 bit forms of the constructor for compatibility but actually use 4 bits internally

I thought about this a while ago but thought it was a bit sneaky.

Don

jrraines

When there was a significant performance advantage for 8 bit in some situations it would have been sneaky. Increasingly the situation is that the 8 bit code:
1. is rarely used
2. adds length, wastes resources
3. adds complexity to make reading the code harder.

pjrc

Yes, I would agree, this version is probably about as fast as it's ever going to get, at least without ugly tricks like hard coded pins or assembly language.

You've already incorporated so much feedback from me and others, and I can understand wanting to do one last round of final testing and wrap everything up.  But if I could talk you into reconsidering the userbusy API again, I just hope there's a simpler way.

Perhaps instead of userbusy, the same goals could be accomplished by making send() public?  Then an example could be crafted which creates a subclass where send() overridden?  That would probably be even faster than userbusy, since the 8 digitalWrite() and enable pulse could use digitalWriteFast, as well as reading the busy bit and switching pin modes.

Regarding not using 8 bit mode, perhaps it could be perceived as "sneaky", but if it's well documented that the 4 extra wires aren't actually used, that seems perfectly fair.  I believe pretty much anybody would appreciate the code being much smaller and simpler.  If they care about speed, I think they'll really appreciate all your hard work that's gone into benchmarking that proves using only 7 pins is actually fastest.

It's really great work you've done here.  A lot of users are going to benefit once this gets merged into an official Arduino release!

jrraines

Making send public rather than the user busy callback fcn is an interesting idea.

After all, every pin in the interface gets manipulated to test the busy flag. An example of how that could be overridden with digitalWriteFast and another example with port commands would make it perhaps more approachable than the callback fcn is.

I don't know if anyone else has actually used the user defined busy routine and its kind of ugly, so I haven't been motivated to fiddle with it.
User callback fcns are the basis for the arduino's approach to user supplied interrupt routines, but the way it works here may actually be harder to understand.

I have thought I was done several times already. I think introducing the little benchmark (which after all, is not all that important; all the interfaces go faster than the eye can follow) has kept this fun and interesting much longer than it otherwise might have been. The project has served to teach me quite a lot about C++ and the arduino that I wouldn't have run across otherwise and provides a practical focus when I read about the language that is much more useful than the umpteenth discussion of circles, squares, and triangles inheriting from shape.

Go Up