Cosa/Boosting LCD 1602 performance (7X SR4W, 6.5X 4-bit parallel, 1.7-2.3X I2C)

Character LCD modules based on the HD44780 controller, such as the 1602/1604/2004 displays, are usually connected to the Arduino over a parallel interface (4 or 8-bit). This takes a lot of pins. To leave more pins for the application, the LCD can be connected through an I2C IO expander instead. The major drawbacks are the low update speed and the blocking IO of the I2C implementation in Wiring.

Looking into the implementation (LiquidCrystal_I2C) I came across a bottleneck. When sending a command/data byte to the display, the library issues four I2C transmissions. Each transmission consists of the I2C address of the IO expander and the byte to be written to the port. There are four because 1) the LCD is accessed in 4-bit mode, and 2) the LCD enable pin has to be toggled for each nibble.

A simple optimization is to merge these into a single I2C transmission: the I2C address followed by the four bytes to be written (in sequence) to the port. This saves 3 address bytes and the start-stop-ack-arbitration time of three transmissions. It is possible because the required delay between LCD commands (37 us) is much shorter than the time between I2C transmissions or between bytes in a transmission (min 200 us and 100 us respectively).
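The merge can be sketched as a helper that expands one LCD byte into the four port values. This is a minimal sketch, not the Cosa or LiquidCrystal_I2C code: the function name `lcd_expand` and the pin mapping (P0 = RS, P2 = EN, P3 = backlight, P4..P7 = D4..D7) are assumptions, and the mapping differs between expander modules.

```cpp
#include <stdint.h>
#include <stddef.h>

// Assumed PCF8574 wiring (varies between modules):
// P0 = RS, P1 = RW (kept low), P2 = EN, P3 = backlight, P4..P7 = LCD D4..D7.
const uint8_t PIN_RS = 0x01;
const uint8_t PIN_EN = 0x04;
const uint8_t PIN_BL = 0x08;

// Expand one LCD byte into the four port values that clock it in:
// high nibble with EN high, then EN low; low nibble likewise.
// Returns the number of port bytes written to out (always 4).
size_t lcd_expand(uint8_t value, bool is_data, bool backlight, uint8_t out[4])
{
  uint8_t ctrl = (is_data ? PIN_RS : 0) | (backlight ? PIN_BL : 0);
  uint8_t hi = (value & 0xF0) | ctrl;
  uint8_t lo = (uint8_t)((value << 4) & 0xF0) | ctrl;
  out[0] = hi | PIN_EN;  // high nibble, enable high
  out[1] = hi;           // falling edge latches the high nibble
  out[2] = lo | PIN_EN;  // low nibble, enable high
  out[3] = lo;           // falling edge latches the low nibble
  return 4;
}
```

With the stock Wire library the four values would then go out in one transmission, for example `Wire.beginTransmission(addr); Wire.write(out, 4); Wire.endTransmission();`, instead of four separate ones.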

Read more about this on http://cosa-arduino.blogspot.se/.

There is also an optimized version of 4-bit parallel access that achieves 541 fps. The improvements are 80+ % compared to the New LiquidCrystal library.

Cheers!

Yep. I did the same thing a while back.
I've got an update for the new liquidcrystal library that does this.
It's part of an update that also auto detects and supports the MCP23008 chip.
I saw the effective byte transfer times from sketch through Print class to LCD
go from 957us to 544us, or from around 31 to 54 FPS.

If I understand your 4 bit fps number correctly, I would have expected much better than only 50% improvement
for 4 bit mode since the optimized 3 wire shift register code is already doing better than the 444 fps number
you quoted.

Can you run LCDiSpeed sketch (included in the examples directory of the New LiquidCrystal library) on your library?
It reports all the byte transfer numbers and FPS rates normalized so that the display size is not a factor.
I'm curious if you are seeing the same/similar numbers for i2c and what numbers you get for
the optimized parallel mode.

--- bill

You could look into using the MCP23017 I/O expander. Use one 8-bit port for an 8-bit LCD interface and the other to implement the RS, RW, and E. This makes the programming a lot simpler and you can implement more than one E pin to run several LCDs if you wish. It would also get rid of half of your overhead.

Could one of you explain why it is so important to speed up the byte transfer speed when you then have to wait a much longer time for the device to be ready for the next piece of information?

Couldn't you simply subtract the extra time the I2C implementation takes from the delay times that you insert between the bytes that you send to the LCD controller? That would take care of half the problem with the single-port I2C devices and all of the problem with the two-port devices.

Don

Hi Bill, thanks for the feedback. Your questions got me thinking about whether there was more to be done and whether I had missed something.

For the I2C IO expander the exec-delay (37 us) could be removed as there is plenty of time between port writes. This then gave approx. 53 fps.

The 4-bit direct port could be optimized further to 523 fps. There was also a bug, as I had been a bit sloppy and forgot that the port update should be synchronized (interrupts turned off). There are a few further optimizations to try before going to assembly.

Cheers!

Don, that was an interesting I/O expander. The only problem I see with this device is the need to provide a register address in the protocol. The GPIO ports may be organized so that a 16-bit port write is reduced to a four byte transmission (I2C address, register address, data1, data2). And the device requires multiple transmissions to flip any port bit, if I understand the spec correctly. This actually makes this expander slower than the PCF8574 even though it is 16-bit. But the SPI version could compensate for this with its higher serial bit-rate (4 MHz, which is 40X compared to 100 kHz I2C).

Could one of you explain why it is so important to speed up the byte transfer speed when you then have to wait a much longer time for the device to be ready for the next piece of information?

Couldn't you simply subtract the extra time the I2C implementation takes from the delay times that you insert between the bytes that you send to the LCD controller?

This is more or less one of the optimizations I did lately where I simply removed the exec-delay (37 us) for the I2C IO expander adapter. This is not needed as the I2C serialization will give at least a 200+ us delay between two port updates (in a single transmission) which gives the LCD controller plenty of time.

The enable pulse should be at least 230 ns and as the time between two updates by two adjacent bytes in the same transmission is at least 100 us there is no need for an additional delay.
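The timing argument can be checked with a little arithmetic. The sketch below is an illustration only; the function names are made up here, and it uses the pure bit-clock lower bound of 9 clock periods per byte (8 data bits plus ACK), which at 100 kHz comes out slightly below the 100 us quoted above.

```cpp
#include <stdint.h>

// Time in nanoseconds to clock one I2C byte: 8 data bits plus the ACK bit.
// This is the pure bit-clock figure; driver overhead only makes it longer.
uint32_t i2c_byte_time_ns(uint32_t scl_hz)
{
  return (uint32_t)(9ULL * 1000000000ULL / scl_hz);
}

// True if holding a port value for one I2C byte time satisfies
// a minimum hold requirement given in nanoseconds.
bool hold_time_ok(uint32_t scl_hz, uint32_t min_ns)
{
  return i2c_byte_time_ns(scl_hz) >= min_ns;
}
```

At 100 kHz each port value is held for at least 90 us, which covers both the 230 ns enable pulse and the 37 us execution time. A complete LCD byte takes four port writes, so the gap between two LCD bytes is four times that.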

Cheers!

floresta:
You could look into using the MCP23017 I/O expander. Use one 8-bit port for an 8-bit LCD interface and the other to implement the RS, RW, and E. This makes the programming a lot simpler and you can implement more than one E pin to run several LCDs if you wish. It would also get rid of half of your overhead.

While it may be a bit simpler, I don't think it really makes programming a lot simpler.
Running in 4 bit mode, and having to share the output port between the 4 data lines
and the control lines and backlight control is not that difficult.
The MCP23017/MCP23008 will have more i2c overhead than the PCF8574.
This is because those chips are more flexible.
Because of that flexibility, they have control/configuration registers.
Because of these registers,
the first byte transferred to the chip always goes to the address pointer to select which register
you really wanted to write.
This address register normally increments after every write.
You can put the MCP chips into BYTE mode which disables this increment.
This allows writing to the OLAT register with back to back writes to the chip
the way the PCF8574 works.
However, even if you put the chip into BYTE mode, you still have to send an
extra byte to the chip each transmission to initially select the OLAT register.
So I don't think using a MCP23017 in 8 bit mode would be faster than using a PCF8574
in 4 bit mode.
ex:
4 bit PCF8574 to transfer a byte/command to LCD:

  • start
  • data byte: 4 bit LCD data/control E high
  • data byte: 4 bit LCD data/control E low
  • data byte: 4 bit LCD data/control E high
  • data byte: 4 bit LCD data/control E low
  • end

4 bit mode MCP23008 in BYTE mode:

  • start
  • data byte: 0x0A to point to OLAT
  • data byte: 4 bit LCD data/control E high
  • data byte: 4 bit LCD data/control E low
  • data byte: 4 bit LCD data/control E high
  • data byte: 4 bit LCD data/control E low
  • end

8 bit mode MCP23017 in bank = 1, BYTE mode:

  • start
  • data byte: 0x0A to point to OLATA
  • data byte: LCD data byte
  • end
  • start
  • data byte: 0x1A to point to OLATB
  • data byte: LCD control byte with E high
  • data byte: LCD control byte with E low
  • end

Oddly enough, while counter intuitive,
it looks like 4 bit mode on a MCP23008 will be faster
than 8 bit mode on the MCP23017. But both MCP chips will be slower
than the PCF8574 because there is simply more i2c overhead with those
chips.
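The three sequences above can be compared by counting bytes on the wire. This is a rough sketch: the struct and function names are hypothetical, the slave address byte of each transmission is counted, and the time is a lower bound of 9 clock periods per byte that ignores start/stop conditions and driver overhead.

```cpp
#include <stdint.h>

// Bytes on the I2C bus to deliver one LCD byte, counting the slave
// address byte that opens each transmission.
struct WireCost {
  uint8_t transmissions;  // start..end sequences
  uint8_t bytes;          // address + register pointer + port data bytes
};

const WireCost PCF8574_4BIT  = { 1, 1 + 4 };
const WireCost MCP23008_4BIT = { 1, 1 + 1 + 4 };
const WireCost MCP23017_8BIT = { 2, (1 + 1 + 1) + (1 + 1 + 2) };

// Lower bound in microseconds: 9 clock periods per byte (8 bits + ACK).
uint32_t wire_time_us(const WireCost& c, uint32_t scl_khz)
{
  return (uint32_t)c.bytes * 9u * 1000u / scl_khz;
}
```

At 100 kHz this gives 450 us, 540 us and 630 us respectively, so under these assumptions the ordering matches the reasoning above: PCF8574 first, then the MCP23008 in 4 bit mode, then the MCP23017 in 8 bit mode.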

One thing that could really speed things up on the MCP chips would be to bump
the speed of the i2c bus up since those chips can handle 1.7 MHz clock rates vs
the standard/default 100 kHz.

Could one of you explain why it is so important to speed up the byte transfer speed when you then have to wait a much longer time for the device to be ready for the next piece of information?
Couldn't you simply subtract the extra time the I2C implementation takes from the delay times that you insert between the bytes that you send to the LCD controller? That would take care of half the problem with the single-port I2C devices and all of the problem with the two-port devices.

In fm's NewLiquidCrystal library, the byte/command delays are inside the interface layers themselves and do
take into consideration the interface transfer time as well as the actual time
of the LCD library code as well.
So for example on i2c there is no added delay between LCD data byte transfers.

Because of the optimizations already in place, the only area left to optimize for LCD data transfers
for i2c, is the time to get the control and data information to the LCD.

The times I quoted are not actual byte transfer times but an averaged and normalized
effective byte transfer time based on updating the full display. This time includes
all the overhead to get from the sketch through the LCD code, over the i2c bus and
to the LCD.

Eliminating extra i2c starts/stops and extra byte transfers is actually pretty significant
as you can see from the 16x2 frame rate numbers.
It goes from 31 to 54 FPS on the PCF8574.
Just having to do the send for the extra byte for the address register on the MCP23008 drops the 54FPS to 45FPS.

--- bill

In the latest 1602 LCD I2C optimization and tuning, the bus clock was changed from the standard 100 kHz to 400 kHz. The device driver performs correctly on an MJKDZ module connected to an Arduino Nano/Iteadstudio Nano IO Shield.

The frame rate was pushed to 133 fps, giving a 4.2X performance improvement compared to the original LiquidCrystal library (31 fps). The improvement compared to 100 kHz is "only" 2.5X (from 53 fps) even though the bus clock frequency is increased by 4X.
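One way to read the 2.5X figure is to assume that the frame time is the sum of a part that scales with the bus clock and a fixed part (library and driver overhead). The sketch below solves that two-point model; the model itself and the names `Split` and `split_frame_time` are assumptions for illustration, not measurements.

```cpp
// Split a frame time into a bus-clock-dependent part and a fixed part,
// assuming T(clock) = fixed + bus / (clock / slow_clock).
struct Split {
  double fixed_us;  // does not change with the bus clock
  double bus_us;    // clock-dependent part, measured at the slower clock
};

Split split_frame_time(double fps_slow, double fps_fast, double clock_ratio)
{
  double t_slow = 1e6 / fps_slow;  // frame time in us at the slow clock
  double t_fast = 1e6 / fps_fast;  // frame time in us at the fast clock
  Split s;
  s.bus_us = (t_slow - t_fast) / (1.0 - 1.0 / clock_ratio);
  s.fixed_us = t_slow - s.bus_us;
  return s;
}
```

For 53 fps at 100 kHz and 133 fps at 400 kHz this gives roughly 15.1 ms of clock-dependent time and 3.7 ms of fixed time per frame. Under this model the fixed part is what keeps a 4X clock from yielding a 4X frame rate.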

More details on the blog Cosa: Object-Oriented LCD management

Cheers!

Interesting. I thought about trying that to see if it worked,
but since it is way beyond the specs on the datasheets I've seen, I never actually tried it.
I wonder how stable it is, particularly with multiple devices on bus.

--- bill

bperrybap:
Interesting. I thought about trying that to see if it worked,
but since it is way beyond the specs on the datasheets I've seen, I never actually tried it.
I wonder how stable it is, particularly with multiple devices on bus.

I understand your concern. I checked the PCF8574 spec and it seems like most of them sold can handle 400 kHz. As I ran the initial test on an Arduino Nano/Nano IO Board for a few hours without any problems, I thought it would be interesting to report.

Didn't think about additional devices, so I have now set up an Arduino Mega (cheap Chinese clone/Funduino with bad contacts :wink:) on an I2C bus on a breadboard with an RTC DS1307, a digital compass HMC5883L and the IO expander to the LCD. The wiring has a total length of 20 cm in 3 sections. Lots of pF and bad contacts.

Anyway, it is running now and has done so for the last 45 minutes.

I'll get back to you later when I have stressed it some more,

Mikael

All the datasheets that I've been able to find (TI, Philips, & NXP) show 100 kHz as max,
which is probably the max for the lower voltages.
I'm curious, which manufacturers were you able to find that show 400 kHz?

--- bill

bperrybap:
All the datasheets that I've been able to find (TI, Philips, & NXP) show 100 kHz as max,
which is probably the max for the lower voltages.
I'm curious, which manufacturers were you able to find that show 400 kHz?

Hi again. Below are the data sheets I read (Philips/NXP). It is the PCA version that is specified for 400 kHz. I cannot really read the text on the MJKDZ module I am using, so I would not say this works for all I2C IO expanders. Maybe I was just lucky.

Anyway the test is still going strong :wink:

Hmm, there seems to be a 1 MHz version as well.

Mikael

http://ics.nxp.com/products/gpio.expanders/i2c/
http://www.nxp.com/documents/brochure/NXP_Journal_2012_0918.pdf
http://www.nxp.com/documents/data_sheet/PCA8574_PCA8574A.pdf

Changed the breadboard wires on the test setup (Arduino Mega with three I2C devices on the bus as above) to extra long ones (20+ cm) and continued the test run (CosaLCDspeed.ino). Still no hiccups after nearly four hours, so I think I can say it is stable at 400 kHz for at least hobby/education setups.

Would be fun to increase the I2C bus frequency until it breaks. Saving that for a rainy day :wink:

Cheers!

Good work kowalski, an interesting thread.

Although I think no one will use 130 frames per second on a character based display.
I would prefer to think of this optimization as minimizing the average time per char

31 fps == 33 millis/char
53 fps == 19 millis/char
130 fps == 7.7 millis/char
update: way off math eliminated (see below)

Faster times mean more time to make measurements and to do math.
Note the recent divmod10() optimization discussion, which decreased the time to print numbers substantially ("divmod10() : a fast replacement for /10 and %10 (unsigned)" in the Libraries section of the Arduino Forum). As the Print class (print.cpp) is the base class for lcd.print, combining these efforts could be very interesting.

For graphics displays the increased performance is evident.

robtillaart:
31 fps == 33 millis/char
53 fps == 19 millis/char
130 fps == 7.7 millis/char

Your math is way off here.
The times you calculated are not per character they are per
full frame/display of characters.
Example:
53fps is much faster than 19 ms/char.
53fps is 19 ms per "full frame" of characters which on a 16x2 display
is 32 characters and 2 set address commands (which take the same time as a data write).
So there are 34 bytes being transferred to a 16x2 display every frame.

The optimization of not disconnecting from the i2c slave between LCD nibble updates
changes the per byte/character transfer time from around 957us down to around 543us on a PCF8574.
Bumping the clock above the default 100 kHz rate to 400 kHz chops that down again by about 2.5x.
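The conversion between frame rate and per-byte time can be sketched as follows. The function names are made up for this example, and counting one set address command per row is an assumption generalized from the 16x2 case described above.

```cpp
#include <stdint.h>

// Bytes sent per full-frame update: one per character cell plus
// one set address command per row (assumed; 2 for a 16x2 display).
uint16_t bytes_per_frame(uint8_t cols, uint8_t rows)
{
  return (uint16_t)(cols * rows + rows);
}

// Effective time per byte in microseconds for a given frame rate.
double byte_time_us(double fps, uint8_t cols, uint8_t rows)
{
  return 1e6 / fps / bytes_per_frame(cols, rows);
}
```

For a 16x2 display, 54 fps corresponds to about 545 us per byte and 31 fps to about 949 us per byte, which is in line with the measured 543 us and 957 us.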

The LCDiSpeed sketch included as an example in fm's library is very useful
for getting all this kind of timing information in a real operating environment.
It also calculates and displays timing information that can be compared across any sized display.
The sketch displays:

 * - Single byte transfer speed (ByteXfer)
 *		This is the time it takes for a single character to be sent from
 *		the sketch to the LCD display.
 *
 * - Frame/Sec (FPS)
 *		This is the number of times the full display can be updated
 *		in one second. 
 *     
 * - Frame Time (Ftime)
 *		This is the amount of time it takes to update the full LCD display.
 *
 * The sketch will also report "independent" FPS and Ftime values.
 * These are timing values that are independent of the size of the LCD under test.
 * Currently they represent the timing for a 16x2 LCD
 * The value of always having numbers for a 16x2 display
 * is that these numbers can be compared to each other since they are
 * independent of the size of the actual LCD display that is running the test.
 * i.e. you also get 16x2 timing information even if the display is not 16x2
 *
 * All times & rates are measured and calculated from what a sketch "sees"
 * using the LiquidCrystal API.
 * It includes any/all s/w overhead including the time to go through the
 * Arduino Print class and LCD library.
 * The actual low level hardware times are obviously lower.

--- bill

robtillaart:
Good work kowalski, an interesting thread.

Although I think no one will use 130 frames per second on a character based display.
I would prefer to think of this optimization as minimizing the average time per char
...
Faster times mean more time to make measurements and to do math.
Note the recent divmod10() optimization discussion, which decreased the time to print numbers substantially ("divmod10() : a fast replacement for /10 and %10 (unsigned)" in the Libraries section of the Arduino Forum). As the Print class (print.cpp) is the base class for lcd.print, combining these efforts could be very interesting.

For graphics displays the increased performance is evident.

robtillaart, thanks for your interest and encouragement!

Making more processing time available is exactly one of my intentions with the LCD optimization. Another is the careful design of the Cosa IOStream::Device and the abstract LCD interface together, allowing many more devices within that interface. The topic title is not very good ;-/

I have been following the divmod10 optimization thread with interest. In Cosa I have simply used the AVR standard functions for binary-to-string conversion; itoa, ltoa, utoa, ultoa.

http://www.nongnu.org/avr-libc/user-manual/group__avr__stdlib.html#ga4f6b3dd51c1f8519d5b8fce1dbf7a665

This is where the optimization should go, I believe, but while waiting to see the improvements it would be interesting to adapt that solution to the Cosa IOStream class.

http://dl.dropboxusercontent.com/u/993383/Cosa/doc/html/dd/d83/classIOStream.html

I have not yet pushed "the ultimate optimization" of the LCD driver for I2C. This is when the output to the device becomes asynchronous and works in the background. The Cosa TWI device driver supports this, and the application will then be allowed to continue with other work while data is transferred to the display.

The benchmark that writes characters to the display will not show any improvements as it saturates the I2C bus. There is only a very small fraction of the benchmark that could run concurrently with the transfer.

The benchmark that writes numbers could run the binary-to-textual conversion in parallel.

There is also yet another I2C level optimization possible for string output. Currently the Cosa LCD driver implements only IOStream::putchar() and handles puts() and write() as a sequence of putchar(). The function writes the character to the LCD but also handles form-feed, carriage-return-line-feed and a few other control characters (something that LiquidCrystal does not). It is possible to write numbers directly as the string will not contain any control characters. The only issue is text clipping or wrapping. This implies that the whole string could be translated to a single larger I2C block and written as one transaction. This removes the I2C addressing per digit character. This is the same as the nibble optimization only on the next transaction level.
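The string-level batching could look something like the sketch below, which packs the port values for a whole string into one buffer so that it can be sent as a single transaction. It is an illustration only: `lcd_pack_string` is a hypothetical helper, the PCF8574 pin mapping is an assumption, and control characters, clipping and wrapping are not handled.

```cpp
#include <stdint.h>
#include <stddef.h>

// Assumed PCF8574 wiring: P0 = RS, P2 = EN, P3 = backlight, P4..P7 = D4..D7.
const uint8_t RS = 0x01, EN = 0x04, BL = 0x08;

// Fill buf with the port values for every character in s (4 per character),
// written as data with the backlight on.
// Returns the number of bytes produced, or 0 if buf is too small.
size_t lcd_pack_string(const char* s, uint8_t* buf, size_t max)
{
  size_t n = 0;
  for (; *s != 0; s++) {
    if (n + 4 > max) return 0;
    uint8_t c = (uint8_t) *s;
    uint8_t hi = (c & 0xF0) | RS | BL;
    uint8_t lo = (uint8_t)(c << 4) | RS | BL;
    buf[n++] = hi | EN;  // high nibble, enable high
    buf[n++] = hi;       // enable low
    buf[n++] = lo | EN;  // low nibble, enable high
    buf[n++] = lo;       // enable low
  }
  return n;
}
```

A string of n characters then costs 1 + 4n bytes on the bus instead of 5n. Note that the stock Arduino Wire library on AVR buffers only 32 bytes per transmission, so with that library a longer block would have to be sent in chunks.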

Cheers!

Your math is way off here.

:blush: :blush: :blush:
Thanks for the correction, I updated my post and struck through the faulty math.

Still, I like to think of it in time/char, as frame time is dependent on the size of the frame whereas time/char is not.

There is also yet another I2C level optimization possible for string output. Currently the Cosa LCD driver implements only IOStream::putchar() and handles puts() and write() as a sequence of putchar(). The function writes the character to the LCD but also handles form-feed, carriage-return-line-feed and a few other control characters (something that LiquidCrystal does not). It is possible to write numbers directly as the string will not contain any control characters. The only issue is text clipping or wrapping. This implies that the whole string could be translated to a single larger I2C block and written as one transaction. This removes the I2C addressing per digit character. This is the same as the nibble optimization only on the next transaction level.

that would really speed up things!

robtillaart:

Your math is way off here.

:blush: :blush: :blush:
Thanks for the correction, I updated my post and struck through the faulty math.

Still, I like to think of it in time/char, as frame time is dependent on the size of the frame whereas time/char is not.

That is why I wrote LCDiSpeed to report timing 3 different ways:

  • per byte, which is not dependent on frame size
  • per frame, which is dependent on frame size (reported in both FPS and actual time)
  • per "iFrame", which is what the frame time is on a 16x2 display regardless of the actual size of the display in use.

That way you get whatever you want/need.

--- bill

Some further development with I2C LCD adapters.

I recently updated the Cosa I2C driver and did a refactoring of the TWI::Slave class for ATtiny. As a spin-off I created a Virtual LCD class that sends "commands" via TWI to an ATtiny84 running the LCD driver. This allows reducing the number of bytes transmitted even further. From the original 4 transmissions with 2 bytes (address and port value), to the optimization for the IO expander with a single 5 byte message (address and four port values) and now down to a single 2 byte message (address and character to print on the LCD).

Running the LCD driver on the ATtiny84 at 8 MHz in 4-bit parallel mode gives a frame rate of 413 fps. Running the I2C Slave Virtual LCD on the ATtiny84 gives approx. 72 fps. This includes the Cosa I2C driver ISR pushing an event and the dispatching of the event to the adapter. The current max with the I2C IO expander is 53 fps @ 100 kHz, so this is another 35+ % improvement.

Further improvements are possible (when using an ATtiny as LCD slave) as the IOStream::Device functions puts() and write() can use single messages. Also number conversion could be moved to the slave by sending binary numbers instead of characters.

Cheers!

Below is the LCD/TWI slave sketch which is running on the ATtiny84. This is a simple command interpreter that handles the LCD operations. The design is event driven: the ISR pushes an event for incoming TWI requests, and these end up in the implementation of the method on_request().

#include "Cosa/TWI.hh"
#include "Cosa/Watchdog.hh"
#include "Cosa/LCD/Driver/HD44780.hh"

HD44780::Port port;
HD44780 lcd(&port);

class LCDslave : public TWI::Slave {
private:
  static const uint8_t BUF_MAX = 64;
  uint8_t m_buf[BUF_MAX];

public:
  LCDslave() : TWI::Slave(0x5A) 
  {
    set_write_buf(m_buf, sizeof(m_buf));
    set_read_buf(m_buf, sizeof(m_buf));
  }

  virtual void on_request(void* buf, size_t size);
};

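void
LCDslave::on_request(void* buf, size_t size)
{
  // Ignore empty requests; m_buf would only hold stale data
  if (size == 0) return;

  // A non-zero first byte means the whole message is text to print
  char c = (char) m_buf[0];
  if (c != 0) {
    lcd.putchar(c);
    for (size_t i = 1; i < size; i++)
      lcd.putchar(m_buf[i]);
    return;
  }

  // A zero first byte introduces a command; the size selects the format
  if (size == 2) {
    // Two bytes: backlight/display control
    uint8_t cmd = m_buf[1];
    switch (cmd) {
    case 0: lcd.backlight_off(); return;
    case 1: lcd.backlight_on(); return;
    case 2: lcd.display_off(); return;
    case 3: lcd.display_on(); return;
    }
  }
  else if (size == 3) {
    // Three bytes: set cursor position (x, y)
    uint8_t x = m_buf[1];
    uint8_t y = m_buf[2];
    lcd.set_cursor(x, y);
  }
}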

LCDslave slave;

void setup()
{
  Watchdog::begin();
  lcd.begin();
  lcd.puts_P(PSTR("CosaLCDslave"));
  slave.begin();
}

void loop()
{
  Event event;
  Event::queue.await(&event);
  event.dispatch();
}
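For completeness, the message formats that on_request() accepts can be sketched from the master side. The encoder functions and the `LcdCommand` enum below are hypothetical helpers written for this illustration, not part of Cosa; only the byte layout is taken from the slave sketch above.

```cpp
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Command codes as interpreted by the slave sketch above.
enum LcdCommand {
  BACKLIGHT_OFF = 0,
  BACKLIGHT_ON = 1,
  DISPLAY_OFF = 2,
  DISPLAY_ON = 3
};

// Text message: the characters as-is (first byte is non-zero).
// Returns the message length, or 0 if s is empty or does not fit.
size_t encode_text(const char* s, uint8_t* msg, size_t max)
{
  size_t len = strlen(s);
  if (len == 0 || len > max) return 0;
  memcpy(msg, s, len);
  return len;
}

// Control message: zero byte followed by the command code (2 bytes).
size_t encode_command(LcdCommand cmd, uint8_t* msg, size_t max)
{
  if (max < 2) return 0;
  msg[0] = 0;
  msg[1] = (uint8_t) cmd;
  return 2;
}

// Cursor message: zero byte followed by x and y (3 bytes).
size_t encode_set_cursor(uint8_t x, uint8_t y, uint8_t* msg, size_t max)
{
  if (max < 3) return 0;
  msg[0] = 0;
  msg[1] = x;
  msg[2] = y;
  return 3;
}
```

Since the slave buffer in the sketch is 64 bytes, a master using these helpers would keep each text message at or below that size.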

If you are looking for a fast low pin count interface to an LCD (can't be lower than a single pin),
you might be interested in this recent activity:
https://bitbucket.org/fmalpartida/new-liquidcrystal/pull-request/1/adding-an-optimized-implementation-of/diff#comment-366944
Although the interface uses a single pin, it can transfer bytes in 92us for a frame rate close to 320 FPS,
which is about 3.6 times faster than the standard LiquidCrystal library using 6 pins!
This is a great example of how inefficient the Arduino core routines like digitalWrite() are.
It is about 6 times faster than the optimized i2c i/o expander interface.

While more components and a bit more complex than using something like a PCF8574 i/o expander chip,
the total component cost should be lower given
595s can be had for about (USD) 20cents and transistors are about 2-3 cents
and caps and resistors are about 1 cent - all quantity 1 from places like tayda.

--- bill

bperrybap:
If you are looking for a fast low pin count interface to an LCD (can't be lower than a single pin),
you might be interested in this recent activity:
https://bitbucket.org/fmalpartida/new-liquidcrystal/pull-request/1/adding-an-optimized-implementation-of/diff#comment-366944
Although the interface uses a single pin, it can transfer bytes in 92us for a frame rate close to 320 FPS,
which is about 3.6 times faster than the standard LiquidCrystal library using 6 pins!
This is a great example of how inefficient the Arduino core routines like digitalWrite() are.
It is about 6 times faster than the optimized i2c i/o expander interface.

Hi Bill.

I have followed some of the development on the New LiquidCrystal library and the hardware support. Great job!! Very inspiring.

I thought of doing a version with a 595 connected to SPI. It would require two more pins, but at full speed the transfer rate could be 4 MHz, giving 4-5 us per byte. That is hard to beat in cost/performance. Using an ATtiny at a dollar is more expensive but gives a lot of interesting options. An interesting challenge.

The poor performance of Arduino/Wiring and the lack of abstraction/structure was actually what got me started on what became the Cosa project. By chance I stumbled upon Arduino last year during the summer vacation. The work with Cosa started in late November.

Anyway, the latest LCD slave is more of a test run of the TWI slave, LCD driver and event framework on an ATtiny84. I needed a test example, and pushing I2C further seemed like fun. Moving an interface between two micro-controllers is also an interesting challenge. I hope to add some tooling for this so that it becomes easier. Something along the lines of IDL/CORBA, etc.

Cheers!