Cosa/Boosting LCD 1602 performance (7X SR4W, 6.5X 4-bit parallel, 1.7-2.3X I2C)

Good work kowalski, an interesting thread.

Although I think no one will use 130 frames per second on a character based display.
I would prefer to think of this optimization as minimizing the average time per char

31 fps == 33 millis/char
53 fps == 19 millis/char
130 fps == 7.7 millis/char
update: way off math eliminated (see below)

Faster times mean more time to make measurements and to do math.
Note the recent divmod10() optimization discussion, which substantially decreased the time to print numbers (see the thread "divmod10() : a fast replacement for /10 and %10 (unsigned)" in Libraries on the Arduino Forum). As the print.cpp class is the base class for lcd.print, combining these efforts could be very interesting.

For graphics displays the increased performance is evident.

robtillaart:
31 fps == 33 millis/char
53 fps == 19 millis/char
130 fps == 7.7 millis/char

Your math is way off here.
The times you calculated are not per character they are per
full frame/display of characters.
Example:
53fps is much faster than 19 ms/char.
53fps is 19 ms per "full frame" of characters which on a 16x2 display
is 32 characters and 2 set address commands (which take the same time as a data write).
So there are 34 bytes being transferred to a 16x2 display every frame.
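
As a quick check of the arithmetic above, here is a minimal sketch that converts a frame rate into an average time per transferred byte. The function names are made up for this example, and counting one set address command per row is an assumption generalized from the 16x2 case.

```cpp
#include <cstdint>

// Bytes sent per full frame: every character plus one set address
// command per row (assumption: one command per row, as on a 16x2).
constexpr uint32_t bytesPerFrame(uint32_t cols, uint32_t rows)
{
  return cols * rows + rows;
}

// Average time per transferred byte in microseconds for a frame rate.
constexpr double microsPerByte(double fps, uint32_t cols, uint32_t rows)
{
  return 1e6 / (fps * bytesPerFrame(cols, rows));
}
```

For a 16x2 display this gives roughly 949us per byte at 31 fps, 555us at 53 fps and 226us at 130 fps.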

The optimization of not disconnecting from the i2c slave between LCD nibble updates
changes the per byte/character transfer time from around 957us down to around 543us on a PCF8574.
Bumping the clock from the default 100 kHz rate to 400 kHz chops that down again by about 2.5x.

The LCDiSpeed sketch included as an example in fm's library is very useful
for getting all this kind of timing information in a real operating environment.
It also calculates and displays timing information that can be compared across any sized display.
The sketch displays:

 * - Single byte transfer speed (ByteXfer)
 *		This is the time it takes for a single character to be sent from
 *		the sketch to the LCD display.
 *
 * - Frame/Sec (FPS)
 *		This is the number of times the full display can be updated
 *		in one second. 
 *     
 * - Frame Time (Ftime)
 *		This is the amount of time it takes to update the full LCD display.
 *
 * The sketch will also report "independent" FPS and Ftime values.
 * These are timing values that are independent of the size of the LCD under test.
 * Currently they represent the timing for a 16x2 LCD
 * The value of always having numbers for a 16x2 display
 * is that these numbers can be compared to each other since they are
 * independent of the size of the actual LCD display that is running the test.
 * i.e. you also get 16x2 timing information even if the display is not 16x2
 *
 * All times & rates are measured and calculated from what a sketch "sees"
 * using the LiquidCrystal API.
 * It includes any/all s/w overhead including the time to go through the
 * Arduino Print class and LCD library.
 * The actual low level hardware times are obviously lower.

--- bill

robtillaart:
Good work kowalski, an interesting thread.

Although I think no one will use 130 frames per second on a character based display.
I would prefer to think of this optimization as minimizing the average time per char
...
Faster times mean more time to make measurements and to do math.
Note the recent divmod10() optimization discussion, which substantially decreased the time to print numbers (see the thread "divmod10() : a fast replacement for /10 and %10 (unsigned)" in Libraries on the Arduino Forum). As the print.cpp class is the base class for lcd.print, combining these efforts could be very interesting.

For graphics displays the increased performance is evident.

robtillaart, thanks for your interest and encouragement!

Making more processing time available is exactly one of my intentions with the LCD optimization. Another is the careful design of the Cosa IOStream::Device and LCD abstract interfaces together, allowing many more devices within that interface. The topic title is not very good ;-/

I have been following the divmod10 optimization thread with interest. In Cosa I have simply used the AVR standard functions for binary-to-string conversion; itoa, ltoa, utoa, ultoa.

http://www.nongnu.org/avr-libc/user-manual/group__avr__stdlib.html#ga4f6b3dd51c1f8519d5b8fce1dbf7a665

I believe this is where the optimization should go, but while waiting to see the improvements there, it would be interesting to adapt that solution to the Cosa IOStream class.

http://dl.dropboxusercontent.com/u/993383/Cosa/doc/html/dd/d83/classIOStream.html

I have not yet pushed "the ultimate optimization" of the LCD driver for I2C. This is when the output to the device becomes asynchronous and works in the background. The Cosa TWI device driver supports this, and the application will then be allowed to continue with other work while data is transferred to the display.

The benchmark that writes characters to the display will not show any improvements as it saturates the I2C bus. There is only a very small fraction of the benchmark that could run concurrently with the transfer.

The benchmark that writes numbers could run the binary-to-textual conversion in parallel.

There is also yet another I2C level optimization possible for string output. Currently the Cosa LCD driver implements only IOStream::putchar() and handles puts() and write() as a sequence of putchar(). The function writes the character to the LCD but also handles form-feed, carriage-return-line-feed and a few other control characters (something that LiquidCrystal does not). It is possible to write numbers directly as the string will not contain any control characters. The only issue is text clipping or wrapping. This implies that the whole string could be translated to a single larger I2C block and written as one transaction. This removes the I2C addressing per digit character. This is the same as the nibble optimization only on the next transaction level.
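
To make the idea concrete, here is a minimal sketch of translating a whole string into one block of IO expander port values. It is not the Cosa implementation. The pin mapping (RS, EN and backlight on the low bits, D4-D7 on the high nibble) is an assumption that varies between adapter boards, and the function name is made up.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed PCF8574 wiring; real adapter boards differ.
constexpr uint8_t RS = 0x01;  // register select: 1 = data
constexpr uint8_t EN = 0x04;  // enable strobe
constexpr uint8_t BL = 0x08;  // backlight on

// Encode a string without control characters as port values, four per
// character: high nibble with EN high then low, low nibble likewise.
// Returns the number of port values written, or 0 if the buffer is too
// small to hold the whole string.
size_t encodeText(const char* s, uint8_t* out, size_t outMax)
{
  size_t n = 0;
  for (; *s != 0; s++) {
    const uint8_t c = static_cast<uint8_t>(*s);
    const uint8_t hi = static_cast<uint8_t>((c & 0xF0) | RS | BL);
    const uint8_t lo = static_cast<uint8_t>(((c << 4) & 0xF0) | RS | BL);
    if (n + 4 > outMax) return 0;
    out[n++] = hi | EN;
    out[n++] = hi;
    out[n++] = lo | EN;
    out[n++] = lo;
  }
  return n;
}
```

The resulting buffer would then go out as a single write transaction. With a 32 byte buffer, an 8 character string becomes one message of 1 + 32 bytes instead of eight messages of 5 bytes each.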

Cheers!

Your math is way off here.

:blush: :blush: :blush:
Thanks for the correction, I updated my post and struck through the faulty math.

Still, I like to think of it in time/char, as frame time is dependent on the size of the frame where time/char is not.

There is also yet another I2C level optimization possible for string output. Currently the Cosa LCD driver implements only IOStream::putchar() and handles puts() and write() as a sequence of putchar(). The function writes the character to the LCD but also handles form-feed, carriage-return-line-feed and a few other control characters (something that LiquidCrystal does not). It is possible to write numbers directly as the string will not contain any control characters. The only issue is text clipping or wrapping. This implies that the whole string could be translated to a single larger I2C block and written as one transaction. This removes the I2C addressing per digit character. This is the same as the nibble optimization only on the next transaction level.

that would really speed up things!

robtillaart:

Your math is way off here.

:blush: :blush: :blush:
Thanks for the correction, I updated my post and struck through the faulty math.

Still, I like to think of it in time/char, as frame time is dependent on the size of the frame where time/char is not.

That is why I wrote LCDiSpeed to report timing 3 different ways:

  • per byte, which is not dependent on frame size
  • per frame, which is dependent on frame size. (reported in both FPS and actual time)
  • per "iFrame" which is what the frame time is on a 16x2 display regardless of the actual size of the display in use.

That way you get whatever you want/need.
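
For illustration only, the three views can be derived from a single measured byte time as sketched below. This is not the LCDiSpeed source. The struct and function names are made up, and it assumes a frame is all characters plus one set address command per row.

```cpp
#include <cstdint>

// The three ways of reporting timing, derived from one byte time.
struct LcdTiming {
  double byteXferUs;    // per byte, independent of display size
  double frameTimeUs;   // per frame on the display under test
  double fps;           // frames per second on the display under test
  double iFrameTimeUs;  // per "iFrame": frame time on a 16x2 display
  double iFps;          // frames per second on a 16x2 display
};

LcdTiming lcdTiming(double byteXferUs, uint32_t cols, uint32_t rows)
{
  // Assumption: characters plus one set address command per row.
  const double frameBytes = static_cast<double>(cols * rows + rows);
  const double iFrameBytes = 16 * 2 + 2;
  LcdTiming t;
  t.byteXferUs = byteXferUs;
  t.frameTimeUs = byteXferUs * frameBytes;
  t.fps = 1e6 / t.frameTimeUs;
  t.iFrameTimeUs = byteXferUs * iFrameBytes;
  t.iFps = 1e6 / t.iFrameTimeUs;
  return t;
}
```

With a byte time of 543us, a 20x4 display comes out at about 22 FPS, while the 16x2 "iFrame" figure is about 54 FPS on any display.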

--- bill

Some further development with I2C LCD adapters.

I recently updated the Cosa I2C driver and did a refactoring of the TWI::Slave class for ATtiny. As a spin-off I created a Virtual LCD class that sends "commands" via TWI to an ATtiny84 running the LCD driver. This allows reducing the number of bytes transmitted even further. From the original 4 transmissions with 2 bytes (address and port value), to the optimization for the IO expander with a single 5 byte message (address and four port values) and now down to a single 2 byte message (address and character to print on the LCD).
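
The reduction can be tallied with a small sketch that counts the bytes on the bus for each of the three schemes, address byte included. The function names are made up for this example, and start/stop conditions and acknowledge bits are ignored.

```cpp
#include <cstddef>

// Original: four messages of 2 bytes (address and port value) per character.
constexpr size_t bytesPerPortWrite(size_t chars) { return chars * 4 * 2; }

// IO expander optimization: one 5 byte message (address and four port
// values) per character.
constexpr size_t bytesPerCharMessage(size_t chars) { return chars * 5; }

// Virtual LCD: one 2 byte message (address and character) per character.
constexpr size_t bytesVirtualLcd(size_t chars) { return chars * 2; }
```

For the 32 characters of a 16x2 display that is 256, 160 and 64 bytes respectively.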

Running the LCD driver on the ATtiny84 at 8 MHz in 4-bit parallel mode gives a frame rate of 413 fps. Running the I2C Slave Virtual LCD on the ATtiny84 gives approx. 72 fps. This includes the Cosa I2C driver ISR pushing an event and the dispatching of the event to the adapter. The current max with the I2C IO expander is 53 fps @ 100 kHz, so this is another 35+ % improvement.

Further improvements are possible (when using an ATtiny as LCD slave) as the IOStream::Device functions puts() and write() can use single messages. Also number conversion could be moved to the slave by sending binary numbers instead of characters.

Cheers!

Below is the LCD/TWI slave sketch which is running on the ATtiny84. This is a simple command interpreter to handle the LCD operations. The design is event driven: the ISR pushes an event for incoming TWI requests, and these end up in the implementation of the method on_request().

#include "Cosa/TWI.hh"
#include "Cosa/Watchdog.hh"
#include "Cosa/LCD/Driver/HD44780.hh"

HD44780::Port port;
HD44780 lcd(&port);

class LCDslave : public TWI::Slave {
private:
  static const uint8_t BUF_MAX = 64;
  uint8_t m_buf[BUF_MAX];

public:
  LCDslave() : TWI::Slave(0x5A) 
  {
    set_write_buf(m_buf, sizeof(m_buf));
    set_read_buf(m_buf, sizeof(m_buf));
  }

  virtual void on_request(void* buf, size_t size);
};

void
LCDslave::on_request(void* buf, size_t size)
{
  // A non-zero first byte means the message is text: print every byte
  char c = (char) m_buf[0];
  if (c != 0) {
    lcd.putchar(c);
    for (size_t i = 1; i < size; i++)
      lcd.putchar(m_buf[i]);
    return;
  }
  // A zero first byte marks a command; two bytes: zero and command code
  if (size == 2) {
    uint8_t cmd = m_buf[1];
    switch (cmd) {
    case 0: lcd.backlight_off(); return;
    case 1: lcd.backlight_on(); return;
    case 2: lcd.display_off(); return;
    case 3: lcd.display_on(); return;
    }
  }
  // Three bytes: zero and the cursor position (x, y)
  else if (size == 3) {
    uint8_t x = m_buf[1];
    uint8_t y = m_buf[2];
    lcd.set_cursor(x, y);
  }
}

LCDslave slave;

void setup()
{
  Watchdog::begin();
  lcd.begin();
  lcd.puts_P(PSTR("CosaLCDslave"));
  slave.begin();
}

void loop()
{
  Event event;
  Event::queue.await(&event);
  event.dispatch();
}
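
For the master side, the message format decoded by on_request() above could be produced as in the sketch below. This is not the Cosa client code. The function and enum names are made up, and only the byte layout is taken from the slave sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Command codes as decoded by the switch in on_request().
enum Command : uint8_t {
  BACKLIGHT_OFF = 0,
  BACKLIGHT_ON = 1,
  DISPLAY_OFF = 2,
  DISPLAY_ON = 3
};

// Command message: a zero byte followed by the command code.
size_t encodeCommand(Command cmd, uint8_t* msg)
{
  msg[0] = 0;
  msg[1] = cmd;
  return 2;
}

// Set cursor message: a zero byte followed by x and y.
size_t encodeSetCursor(uint8_t x, uint8_t y, uint8_t* msg)
{
  msg[0] = 0;
  msg[1] = x;
  msg[2] = y;
  return 3;
}

// Text message: the characters themselves. The first byte is never zero,
// which is how the slave tells text from commands.
// Returns 0 for an empty string or one that does not fit.
size_t encodeTextMessage(const char* s, uint8_t* msg, size_t msgMax)
{
  const size_t n = strlen(s);
  if (n == 0 || n > msgMax) return 0;
  memcpy(msg, s, n);
  return n;
}
```

The message size should not exceed the 64 byte buffer (BUF_MAX) that the slave registers for incoming data.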

If you are looking for a fast low pin count interface to an LCD (can't be lower than a single pin),
you might be interested in this recent activity:
https://bitbucket.org/fmalpartida/new-liquidcrystal/pull-request/1/adding-an-optimized-implementation-of/diff#comment-366944
Although the interface uses a single pin, it can transfer bytes in 92us for a frame rate close to 320 FPS,
which is about 3.6 times faster than the standard LiquidCrystal library using 6 pins!
This is a great example of how inefficient the Arduino core routines like digitalWrite() are.
It is about 6 times faster than the optimized i2c i/o expander interface.

While more components and a bit more complex than using something like a PCF8574 i/o expander chip,
the total component cost should be lower given
595s can be had for about (USD) 20cents and transistors are about 2-3 cents
and caps and resistors are about 1 cent - all quantity 1 from places like tayda.

--- bill

bperrybap:
If you are looking for a fast low pin count interface to an LCD (can't be lower than a single pin),
you might be interested in this recent activity:
https://bitbucket.org/fmalpartida/new-liquidcrystal/pull-request/1/adding-an-optimized-implementation-of/diff#comment-366944
Although the interface uses a single pin, it can transfer bytes in 92us for a frame rate close to 320 FPS,
which is about 3.6 times faster than the standard LiquidCrystal library using 6 pins!
This is a great example of how inefficient the Arduino core routines like digitalWrite() are.
It is about 6 times faster than the optimized i2c i/o expander interface.

Hi Bill.

I have followed some of the development on the New LiquidCrystal library and the hardware support. Great job!! Very inspiring.

I thought of doing a version with a 595 connected to SPI. It would require two more pins, but at full speed the transfer rate could be 4 MHz, giving 4-5 us per byte. That is hard to beat in cost/performance. Using an ATtiny at a dollar is more expensive but gives a lot of interesting options. An interesting challenge.

The poor performance of Arduino/Wiring and the lack of abstraction/structure was actually what got me started on what became the Cosa project. By chance I stumbled upon Arduino last year during the summer vacation. The work with Cosa started in late November.

Anyway, the latest LCD slave is more a test run of the TWI slave, LCD driver and event framework on an ATtiny84. I needed a test example and pushing I2C further seemed like fun. Moving an interface between two micro-controllers is also an interesting challenge. I hope to add some tooling for this so that it becomes easier. Something along the lines of IDL/Corba, etc.

Cheers!

@kowalski

I actually did this with 2 595's to be able to utilize all 8 bits of the LCD.
Since the transfer rate is so high, a delay of about 30us must be added to the code. This resulted in an average write speed of about 38-40us per byte sent to the LCD. I had trouble with missed letters when I tried to go down to the lowest spec of 37us delay (in total).

Here's my schematic (please ignore the resistor net as it wasn't tested. R3 resistor was also unnecessary as my LCD already has a 100ohm built in resistor):

Running prints of 80 chars starting with a random number:

And you can find a simple/limited library attached:

LiquidCrystal_SPI.zip (1.76 KB)

@TheCoolest

That was exactly what I was considering :wink: Great job!! See if I can get my hands on a few 595s and do a prototype board.

Added a picture of my setup with an Arduino Nano talking over TWI with an ATtiny84 running the LCD driver.

Cheers!

Thanks. I got my 20x4 LCD thinking it supported SPI (as the ebay title said it did, and I still had no idea what's what)
And I found that the LiquidCrystal_I2C library I downloaded was awfully slow. About 1ms to send a complete byte or a command, and that is after I optimized it a little bit by removing the unnecessary delays and an extra expander write which wasn't needed.
Filling the screen with 80 chars takes about 78ms, that's insane. With the SPI method it takes just over 3ms for 80 chars, that's a huge improvement.
Frees up a ton of processor time for other important tasks :slight_smile:
I too want to build a small backpack for this LCD to go into the project I'm making right now. The only benefit to I2C I can think of right now is that it is probably less susceptible to interference and long wires than the SPI.

TheCoolest:
The only benefit to I2C I can think of right now is that it is probably less susceptible to interference and long wires than the SPI.

I think the biggest benefit to I2C is if you need to interface to multiple devices since no additional pins are needed.

The Cosa I2C slave LCD driver is now completed. The initial design has been refactored to a new Virtual LCD class (VLCD) which allows any Cosa LCD device driver to be connected (not just the HD44780 driver). The VLCD class contains two parts; 1) the client part acts as a LCD proxy, translating LCD API calls to I2C messages, 2) the server part acts as an adapter that decodes the I2C messages and calls the LCD implementation.

Below is the CosaLCDslave sketch. It uses the new Virtual LCD class and binding to the HD44780 driver with the 4-bit parallel port IO. This sketch is compiled for an ATtiny84 in the example above but may be compiled for any Cosa supported Arduino.

#include "Cosa/Watchdog.hh"
#include "Cosa/LCD/Driver/HD44780.hh"
#include "Cosa/VLCD.hh"

// Use a 4-bit parallel port for the HD44780 LCD (16X2 default)
HD44780::Port port;
HD44780 lcd(&port);

// And use the LCD for the implementation of the Virtual LCD slave
VLCD::Slave vlcd(&lcd);

void setup()
{
  Watchdog::begin();
  lcd.begin();
  vlcd.begin();
}

void loop()
{
  Event event;
  Event::queue.await(&event);
  event.dispatch();
}

The benchmark CosaLCDspeed.ino binds to the Virtual LCD and runs the measurements. It is the Arduino Nano in the picture above that runs this sketch. See the code on github.

https://github.com/mikaelpatel/Cosa/blob/master/examples/TWI/CosaLCDslave/CosaLCDslave.ino

By implementing the IOStream::Device methods puts(), puts_P() and write() the performance can be boosted to 50-98% of the performance of the I2C IO expander at 400 kHz. Below are some results from the benchmarking. The first table shows the performance (operations per second/frames per second), and compares the 4-bit and I2C IO expander implementations (at 100 kHz and 400 kHz).

The above results are used as the baseline for the comparison with the second table below, which covers the ATtiny84 (internal clock 8 MHz) compiled version and the VLCD version. The comparison is between the 4-bit implementation and the VLCD implementation (with optimizations).

VLCD may be viewed as a "template" for how to construct I2C slave devices.
http://dl.dropboxusercontent.com/u/993383/Cosa/doc/html/d1/d1f/classVLCD.html

Cheers!

The next step is to implement a Cosa USI based TWI master for ATtiny and porting the LCD support. Below is the LCD benchmark running on a LCD with I2C IO expander and an ATtiny85 (internal clock 8 MHz, internal pull-up).

The picture shows 39 operations per second (32 characters plus 2 set cursor per op). The result for standard Arduino (Uno, Nano, etc) is 53 fps.

The latest I2C optimizations include packaging a larger I2C block (32 IO expander commands for 8 characters) on puts() and write().

Cheers!

Here are the numbers from the latest improvements of the Cosa LCD device driver. The table also contains the ratio compared to the New LiquidCrystal Library benchmark.
https://bitbucket.org/fmalpartida/new-liquidcrystal/wiki/Home#!performance-and-benchmakrs
Please note that the ATtiny84/85 benchmarking uses the internal 8 MHz clock.

The following I2C optimizations are included:

  1. Packaging I2C IO expander updates to a single TWI message for putchar(). To send a byte (data or command) to the LCD, four TWI messages (address and 1 byte data) were previously sent (LiquidCrystal_I2C). This is compressed to a single TWI message with the address and the four bytes needed to send the byte (via the 4-bit parallel interface) to the LCD.

  2. Packaging multiple encoded bytes into a single message for puts(). This applies the first optimization to a sequence of characters sent to the I2C IO expander, which (again) allows the TWI address to be removed. The default internal buffer size is 32 bytes. This gives a 7 byte address reduction for an 8 byte string.

The second optimization shows up in the puts() to puts_P() ratio as program strings may contain control characters and are not compressed. Ratio 77/60 = 1.28X further improvement. This also shows up as an improvement when printing numbers (dec/bin in benchmark).

For ATmega with TWI hardware the processor will go into sleep mode during the wait for the completion of the I2C operation (write). A further optimization would be to allow the processor to continue and only sync when a new operation is issued. This would require some additional buffering. The Cosa TWI driver allows asynchronous calls but this feature is not yet used by the LCD driver. The current ATtiny USI based TWI is a bit-banging implementation with micro-second level delays. A redesign of the Cosa RTC (micro second level timer) for ATtiny is necessary to allow asynchronous TWI operation with ISR. This is due to a timer conflict.

The last column in the table above contains the results when using an ATtiny84 as an I2C LCD adapter and reducing the I2C message communication even further. The improvement is then 2.3X.

Read more on the blog Cosa: Object-Oriented LCD management

A port adapter for SR (74HC595) will be added to the Cosa LCD device driver library soon. Waiting for some more hardware to play with :wink:

Cheers!

Cool stuff.
I'm curious what core and i2c library you are using for the attiny.

--- bill

bperrybap:
Cool stuff.
I'm curious what core and i2c library you are using for the attiny.

@bperrybap

Thanks for your interest in this project.

I use the MIT ATtiny core by David Mellis. It is more or less only for the compiler settings, fuse bits, and build in the Arduino IDE. All code is Cosa. None of the Arduino "core" code or libraries are used (except for main() and init() ;-). Same goes for "Mighty".

Cosa is an OO-framework. It supports the major Arduino ATmega/ATtiny within the framework itself with a Board abstraction. Cosa contains a newly written SPI and TWI class library. For ATtiny the implementation is USI based. It supports all SPI modes and both TWI master and slave devices. I find the standard Arduino/Wiring/dtools/AVR TWI a bit difficult to work with :wink: and too low level and slow. Cosa InputPin and OutputPin operations are between 3-5X faster than Arduino/Wiring. They are also object-oriented and symbolic which makes configuration and reuse much easier.

I post Cosa updates and improvements on Cosa: An Object-Oriented Platform for Arduino programming - Libraries - Arduino Forum

Cheers!

Received a bunch of 74HC595's today (ebay: $2 for 10 pcs) so now I could add and benchmark a shift register based port version for the Cosa LCD support. It uses basically the same method as suggested by @Nadir and above by @TheCoolest. It uses three pins: data, clock and latch. The latch signal is also used for the LCD enable. Below is the 3-wire schematic from the codegoogle arduinoshiftreglcd project page (now in the Google Code Archive).


http://forum.arduino.cc/index.php/topic,15364.msg112755.html#msg112755

The port is used a bit differently for further optimization (later on ;-). Below is the updated LCD benchmark with the initial result for the SR3W support added.

The table values are operations per second. For putchar, puts and puts_P this corresponds to frames per second on a 16X2 LCD with two set_cursor. The uint16_t dec benchmark is the number of 4 digit decimal print plus set_cursor operations per second. The uint16_t bin benchmark is the number of 14 digit binary number print (total 16 characters with 0x-prefix) plus set_cursor operations per second.

This SR3W implementation uses the Cosa OutputPin serialization function and is "high-level" (i.e. not PORT direct) as the 4-bit parallel version optimization. SPI could be used to boost performance further.

void 
HD44780::SR3W::write4b(uint8_t data)
{
  m_port.data = data;
  m_sda.write(m_port.as_uint8, m_scl);
  m_en.toggle();
  m_en.toggle();
}
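
As a rough picture of what the serialization step in write4b() does, here is a minimal bit-banged shift out that can be run on a PC. It is not the Cosa OutputPin code. FakePin and shiftOut8 are made-up names, and MSB-first order is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for an output pin that records every level written to it.
struct FakePin {
  std::vector<bool> levels;
  void write(bool value) { levels.push_back(value); }
};

// Shift one byte out, most significant bit first (assumed order):
// set the data pin, then pulse the clock pin high and low.
template<class DataPin, class ClockPin>
void shiftOut8(DataPin& sda, ClockPin& scl, uint8_t value)
{
  for (uint8_t mask = 0x80; mask != 0; mask >>= 1) {
    sda.write((value & mask) != 0);
    scl.write(true);
    scl.write(false);
  }
}
```

After the eight clock pulses the value sits in the shift register, and the two enable toggles in write4b() then pulse the shared latch/enable line.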

Using the different LCD port adapters is easy. The LCD driver is a single source for all versions. It is only the port adapter that needs implementing. This is one of the great OOP design patterns: delegation. Below is a snippet from the LCD benchmark.

// Select the LCD device for the benchmark
#include "Cosa/LCD/Driver/HD44780.hh"
// HD44780::Port port;
HD44780::SR3W port;
// HD44780::MJKDZ port;
// HD44780::DFRobot port;
HD44780 lcd(&port);
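
The delegation itself can be sketched in a few lines. The class names below (PortAdapter, CharLcd, RecordingPort) are made up for this example and are not the Cosa classes. Only the idea of a write4b() port function is borrowed from the SR3W code above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical port adapter interface: the only hardware specific part.
class PortAdapter {
public:
  virtual ~PortAdapter() {}
  virtual void write4b(uint8_t nibble) = 0;
};

// Single source driver: splits each byte into two nibbles and
// delegates the actual transfer to whatever adapter it was given.
class CharLcd {
public:
  explicit CharLcd(PortAdapter* port) : m_port(port) {}
  void write(uint8_t data)
  {
    m_port->write4b(data >> 4);
    m_port->write4b(data & 0x0F);
  }
private:
  PortAdapter* m_port;
};

// Adapter used for testing on a PC: records the nibbles it receives.
class RecordingPort : public PortAdapter {
public:
  std::vector<uint8_t> nibbles;
  void write4b(uint8_t nibble) override { nibbles.push_back(nibble); }
};
```

Swapping the hardware then only means passing a different adapter object to the driver, which is what changing the commented lines in the snippet above does.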

The HD44780 LCD device driver implements the abstract class LCD and can be replaced by any other Cosa LCD device driver implementations in the benchmark source code. Again by changing only a few lines. Below is yet another snippet:

// #include "Cosa/LCD/Driver/PCD8544.hh"
// PCD8544 lcd;
// #include "Cosa/LCD/Driver/ST7565.hh"
// ST7565 lcd;
// #include "Cosa/VLCD.hh"
// VLCD lcd;

These are all implementations of the LCD interface and are all benchmarked with the same code, basically by commenting in/out the LCD to test. Below is a link to the benchmark sketch.

After benchmarking the different LCD port alternatives we can conclude that the Shift Register method has the best cost/performance and can match parallel access methods with a much lower pin count. It would be interesting to see this as part of future Arduino boards/shields.

Cheers!

That's looking very cool, kowalski.
I'm not 100% sure how to read the values in your comparison charts. Do the numbers represent milliseconds?
With a shift register connected to an SPI bus you should be able to write a byte to the LCD (if it's in 4bit mode it'll take 2 writes) in about 4-8µs. In my (very barebones) library I'm adding a 30µs delay after each write. Another solution would be to record micros, and on the following write make sure you wait the necessary 38µs. Your code should not take any longer than 38µs per write.

BTW, I've just ordered 10 PCBs of the design in this post minus the resistor ladder, and I'm waiting to get them (very excited :)).

TheCoolest:
That's looking very cool, kowalski.
I'm not 100% sure how to read the values in your comparison charts. Do the numbers represent milliseconds?

Thanks! Updated the post with more details on the table. The numbers are operations per second.

TheCoolest:
With a shift register connected to an SPI bus you should be able to write a byte to the LCD (if it's in 4bit mode it'll take 2 writes) in about 4-8µs. In my (very barebones) library I'm adding a 30µs delay after each write. Another solution would be to record micros, and on the following write make sure you wait the necessary 38µs. Your code should not take any longer than 38µs per write.

When the port implementation starts to take less time than the LCD command execution time, a delay becomes necessary. I believe the SPI hardware is better used for other modules. The LCD execution time allows us to use "pipelining" without delays. Filling the SR with the next value is allowed to take as long as the execution time. The Cosa SR3W solution above takes advantage of this "non-delay pipeline" variant.
There is basically no difference between the parallel and serial solution if the serialization is performed during the execution of the previous command.
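
Both approaches come down to waiting only for whatever is left of the execution time once the work for the next write is done. A minimal sketch of that calculation, with a made-up function name and the execution time passed in as a parameter:

```cpp
#include <cstdint>

// Microseconds still to wait before the next LCD write, given when the
// previous write was issued and the current time. Unsigned subtraction
// keeps the result correct when the microsecond counter wraps around.
constexpr uint32_t remainingWaitUs(uint32_t lastWriteUs, uint32_t nowUs,
                                   uint32_t execTimeUs)
{
  return (nowUs - lastWriteUs) >= execTimeUs
    ? 0
    : execTimeUs - (nowUs - lastWriteUs);
}
```

On an Arduino this could be used as delayMicroseconds(remainingWaitUs(last, micros(), 38)) just before each write. When the serialization already takes longer than the execution time, the result is zero and no delay is needed at all.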

TheCoolest:
BTW, I've just ordered 10 PCBs of the design in this post minus the resistor ladder, and I'm waiting to get them (very excited :)).

Keep us posted!