Fast ST7735R SPI rendering via Uno.

Hi there,

I got into Arduino about a week ago, and found some nice recreational programming from digging into the depths of the TFT screen that I got with it. The result is a test repository of different rendering samples that try to drive the screen as fast as possible. See here:

I presume this is a common topic among Arduino developers since the reference implementations are not particularly optimized, so I decided to give it a closer look, and found it to be quite an educational weekend project and an intro to Arduino.

If there are experienced developers out there who could improve the SD card and TFT library performance beyond what is posted in the GitHub repository, I'd be very interested to know! As of now, I seem to be stuck at the limit of hardware SPI at 8MHz (clock divider 2), and I wonder if it could be made any faster. Is there a computable theoretical limit on how much data can be transferred via SPI to a device? Or is that device specific? Is there a way to remove the divider and drive the SPI bus at 16MHz?

As of now I seem to be able to peak at 160*128*2 bytes/frame * 18.5 frames/second = 740 KBytes/second worth of SPI communication to the TFT screen when filling the screen with a single color. How does that sound? I wonder if there are any low-level knobs that could be leveraged to squeeze the bandwidth even higher?
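As a rough cross-check of that figure, here is a small host-side sketch in plain C++ (not Arduino code). The function names are made up for illustration, and it assumes 2 bytes per pixel (RGB565) and 1 KByte = 1024 bytes.

```cpp
#include <cstdint>

// Bytes sent per full-screen frame (hypothetical helper, not from the repository).
constexpr uint32_t frameBytes(uint32_t width, uint32_t height, uint32_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// Throughput in KBytes/second, assuming 1 KByte = 1024 bytes.
constexpr double throughputKBytes(uint32_t bytesPerFrame, double framesPerSecond) {
    return bytesPerFrame * framesPerSecond / 1024.0;
}
```

With a 160 x 128 panel this gives 40960 bytes per frame, and 18.5 frames/second works out to 740 KBytes/second.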

I'm not really sure what your benchmark does exactly. Maximum SPI clock on an Uno is half the master oscillator frequency, which is usually 16MHz. So you're limited to an 8MHz SPI clock, which in turn sets an upper limit of 1,000,000 bytes/sec (about 976 kbytes/sec) transfer rate. Of course there is overhead so you'll never reach that level.
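The limit follows from one bit per SPI clock and eight bits per byte. A minimal host-side sketch of that arithmetic (the function name is hypothetical, and the result ignores all per-byte overhead):

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on SPI payload rate in bytes/second:
// SPI clock = CPU clock / divider, one bit per SPI clock, 8 bits per byte.
inline uint32_t spiMaxBytesPerSecond(uint32_t cpuHz, uint32_t divider) {
    assert(divider > 0);
    return cpuHz / divider / 8;
}
```

For a 16MHz Uno with divider 2 this returns 1000000 bytes/sec, which is 976 kbytes/sec when a kbyte is taken as 1024 bytes.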

With the monochrome display I've been using it takes 2.08ms to clear the 1K of screen memory. That works out to about 480 kbytes/sec.

You could use a different processor (or overclock an Uno) to get a higher SPI clock frequency but you're ultimately limited by the SPI interface specifications of the ST7735 itself. It requires a minimum clock period of 66ns, which is about 15MHz.
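To illustrate how the 66ns minimum period interacts with the CPU clock, here is a hedged host-side sketch that picks the smallest ATmega328 SPI divider (2, 4, 8, 16, 32, 64 or 128) that still respects a device's minimum clock period. The function name is made up, and the 32MHz case in the usage note is purely hypothetical.

```cpp
#include <cstdint>

// Returns the smallest SPI clock divider whose resulting clock period is
// at least minPeriodNs, or 0 if none of the dividers is slow enough.
inline uint32_t pickSpiDivider(uint32_t cpuHz, uint32_t minPeriodNs) {
    const uint32_t dividers[] = {2, 4, 8, 16, 32, 64, 128};
    for (uint32_t d : dividers) {
        // period_ns = d * 1e9 / cpuHz; compare cross-multiplied to stay in integers
        const uint64_t lhs = static_cast<uint64_t>(d) * 1000000000ULL;
        const uint64_t rhs = static_cast<uint64_t>(minPeriodNs) * cpuHz;
        if (lhs >= rhs) return d;
    }
    return 0;
}
```

At 16MHz the result is 2 (a 125ns period, well above 66ns). For a hypothetical 32MHz part, divider 2 would give 62.5ns, which is too fast, so the sketch returns 4.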

The ST7735 supports an 8-bit parallel interface which would be faster. But an Uno doesn't lend itself to that sort of connection.

I'm not very experienced though; I only discovered Arduino on a random internet search about nine months ago. I'm still enjoying a steep learning curve.

Thanks for the reply. It does sound like I'm pretty close to the limit then.

I'm interested in knowing more about the 8-bit parallel interface. There are a lot of pins on the Uno, so physically I could wire those up even for 8-bit wide data communication (I'm not really using them for anything else). Do you know if there are other limitations that would make that infeasible with the Uno? Would some other Arduino board be able to do a parallel interface?

How did you do your benchmark? I'm wondering what you're doing that is so much more efficient than the simple code I have to clear a screen. Surely pulling data from an SD card and writing it to a color TFT is more complicated than sending out a sequence of zeros. So what is the reason for this difference?

clbri:
I'm interested in knowing more about the 8-bit parallel interface.

Start reading the datasheet.

The parallel interfaces (ST7735R supports a number of them) are intended for processors that expose their data buses. The Uno doesn't do this; with the ATmega328 you interface with the outside world only through port data pins. It's possible to use digital output pins to mimic the D0-D7, RD and WR lines. You'd have to toggle the WR line yourself, but on the other hand you'd be writing 8 bits at a time, so it would probably be faster. The ST7735 actually supports 8, 9, 16 and 18 bit parallel interfaces. But with only 20 digital pins on an Uno you're not going to be able to take advantage of all of those, unless you add some external hardware. There are other Arduino processors that have more digital pins though.

Another consideration is the display itself. Are the parallel interface lines accessible? There are also pins on the controller for setting the interface. Can you get to those on whatever TFT board you have?

I'm probably missing something though.

In the Github URL, every test is a benchmark of its own. The benchmark structure is a simple

uint32_t x = micros();
// do the drawing in the test.
uint32_t y = micros();
totalTime = (y-x);
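One detail worth noting about this pattern: micros() returns a 32-bit counter that wraps roughly every 70 minutes, and the unsigned subtraction still gives the right elapsed time across a wrap. A small host-side sketch of that property (the helper name is made up for illustration):

```cpp
#include <cstdint>

// Elapsed microseconds between two readings of a wrapping 32-bit counter.
// Unsigned arithmetic is modulo 2^32, so the result is correct even if
// the counter rolled over once between 'start' and 'end'.
constexpr uint32_t elapsedMicros(uint32_t start, uint32_t end) {
    return end - start;
}
```

This only holds while the measured interval itself is shorter than one full wrap of the counter.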

The first test (Test1_FillScreen) fills the whole screen with a single color. That test takes about 54 msecs to do the clear. The code for the test is here: ST7735R/Test1_FillScreen.h at master · juj/ST7735R · GitHub and the underlying fill implementation is here: ST7735R/ST7735R_TFT.h at master · juj/ST7735R · GitHub

For the SD card streaming case, the test code is here: ST7735R/Test7_Draw565.h at master · juj/ST7735R · GitHub, and the underlying implementation here: ST7735R/ST7735R_TFT.h at master · juj/ST7735R · GitHub . Streaming a full screen image takes about 187 milliseconds.

Not sure if I do anything special in either of the benchmarks there, the code in both cases boils down to a hot loop that does the per-pixel work. There is the "NOP trick" with the SPI, which is documented here: ST7735R/ST7735R_TFT.h at master · juj/ST7735R · GitHub, although it's probably not particularly new (some code snippets I googled off the web gave the idea).

It sounds like using the 8-bit parallel interface might then be technically feasible, except that instead of having hardware run the transfer as with SPI, it would effectively be software bit-banging, just 8 bits wide. Indeed the physical packaging may be a problem. The display looks like this on the front: https://dl.dropboxusercontent.com/u/40949268/dump/IMG_20150508_221839.jpg, and on the back: https://dl.dropboxusercontent.com/u/40949268/dump/IMG_20150508_221908.jpg. The parallel interface from the datasheet is not visibly accessible, although the whole left column of pins is unlabeled and undocumented as far as I can tell (they do not mirror the right column of pins, e.g. for symmetry; I tried that). Not sure if those are there just for structural support, or if they are connected to something.

Thanks for the response. I had just finished figuring out what the difference is. Not waiting for SPI completion -- that's something I had thought about but hadn't pursued. I had already improved the speed so much and didn't really need more. But it was still in the back of my mind.

I just tried it quick and dirty with my display code and with little effort bumped it up from 480 kb/sec to 668 kb/sec. But I think there is some nuance to how this is done that I would have to understand better. Nice work.

There may be naked TFT modules available that you could play with. If you like soldering, that is. Something like this perhaps.

Heh, I think I fried an Adafruit 0.96" monochrome OLED display just earlier today by soldering. I last touched a soldering iron some 20 years ago, and now I realize that's not a learn-once skill like riding a bicycle! Probably should practice this on some old circuitry first.. o_O

Check out NewhavenDisplay. They have TFTs with bare cables for parallel interface. It looks like they use a different controller, but it appears very similar. They also sell adapter boards with ZIF connectors so you don't have to try your luck with soldering 0.5mm pitch connectors. Mouser carries their products with what I presume is less expensive shipping.

Thanks, NewhavenDisplay has some very interesting items. I think I'll check out the parallel interface route once I find a nice display to try it on.

To try to answer the original question:

The fastest SPI write that can be achieved on a 16MHz UNO is 1 byte every 18 processor clock cycles. This equates to 2.25us per 16 bit colour value for each pixel. Thus a 320 x 240 display can be cleared at best in about 175ms, or for a smaller display a 160 x 128 pixel area in about 46ms. This is 889 kBytes per second.
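These numbers can be reproduced with a few lines of host-side C++. This is only a sketch of the arithmetic; the function names are hypothetical, and it assumes 2 bytes per pixel and no per-window setup overhead.

```cpp
#include <cassert>
#include <cstdint>

// SPI bytes per second when every byte costs a fixed number of CPU cycles.
inline double bytesPerSecond(double cpuHz, double cyclesPerByte) {
    assert(cyclesPerByte > 0.0);
    return cpuHz / cyclesPerByte;
}

// Milliseconds needed to send width*height pixels at 16 bits per pixel.
inline double clearTimeMs(uint32_t width, uint32_t height,
                          double cpuHz, double cyclesPerByte) {
    const double totalBytes = static_cast<double>(width) * height * 2.0;
    return totalBytes / bytesPerSecond(cpuHz, cyclesPerByte) * 1000.0;
}
```

At 16MHz and 18 cycles per byte this gives about 888889 bytes/sec, 172.8ms for 320 x 240 and 46.08ms for 160 x 128.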

As the size of area to be cleared decreases the overhead of setting up the plotting window area starts to dominate.

This is the most optimised library I have seen, it has been hand tuned for performance with the UNO/ATmega328 processors.

Better performance can be obtained with 16 bit parallel TFT's but then a Mega is needed to drive all those lines.

I think it's actually 17:

for (row=0; row<240; row=row+1){
SPDR = dataArray[(x+320)+0]; nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;
SPDR = dataArray[(x+320)+1]; nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;
SPDR = dataArray[(x+320)+2]; nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;
SPDR = dataArray[(x+320)+3]; nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;
:
:
SPDR = dataArray[(x+320)+319]; nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;nop; 
}

Will take about (1/16000000 * 15 * 320) + 6uS for each pass thru loop (306uS), x 240 rows, total 73.44mS

Now the question is, how much memory do you have available for byte dataArray[ ]?
320 x 240 = 76800 bytes
16 bit color, twice as much.
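To put those sizes next to the Uno's memory: the ATmega328P has 2048 bytes of SRAM. A hedged host-side sketch (helper names are made up) that computes the full-frame size and how many whole rows would fit if all of SRAM were available, which it never is in practice since the stack and other variables live there too:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kUnoSramBytes = 2048;  // ATmega328P SRAM size

// Bytes needed to hold a full frame.
inline uint32_t frameBufferBytes(uint32_t width, uint32_t height, uint32_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// Whole rows that fit in sramBytes (optimistic upper bound).
inline uint32_t maxBufferedRows(uint32_t sramBytes, uint32_t width, uint32_t bytesPerPixel) {
    assert(width > 0 && bytesPerPixel > 0);
    return sramBytes / (width * bytesPerPixel);
}
```

So even in the best case only 3 rows of a 320-pixel-wide, 16-bit image fit at once, which is why data has to be streamed rather than buffered.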

CrossRoads:
I think it's actually 17:

The SPI write takes up one clock cycle, then I found by experimenting that there must be at least 17 clock cycles before the next SPI write, thus 18 cycles per SPI byte transmit.

A 320 x 240 screen with 16 bit colour pixels can therefore theoretically be cleared in 172.8 ms.

inline void TFT_ILI9341::spiWrite16(uint16_t data, int16_t count)
{
  uint8_t temp;
  __asm__ __volatile__
  (
    " sbiw %[count],0\n" // test count
    " breq 2f\n" // if == 0 then done

    "1: out %[spi],%[hi]\n" // write SPI data

    " adiw r24,0\n" // 2
    " adiw r24,0\n" // 4
    " adiw r24,0\n" // 6
    " adiw r24,0\n" // 8
    " adiw r24,0\n" // 10
    " adiw r24,0\n" // 12
    " adiw r24,0\n" // 14
    " adiw r24,0\n" // 16
    " nop \n" // 17

    " out %[spi],%[lo]\n" // write SPI data

    " adiw r24,0 \n" // 2
    " adiw r24,0 \n" // 4
    " adiw r24,0 \n" // 6
    " adiw r24,0 \n" // 8
    " adiw r24,0 \n" // 10
    " adiw r24,0 \n" // 12
    " nop \n" // 13

    " sbiw %[count],1\n" // 15 decrement count
    " brne 1b\n" // 17 if != 0 then loop

    "2:\n"

    : [temp] "=d" (temp), [count] "+w" (count)
    : [spi] "i" (_SFR_IO_ADDR(SPDR)), [lo] "r" ((uint8_t)data), [hi] "r" ((uint8_t)(data>>8))
    :
  );
}

I ran this code in a test with an ILI9341 based display, and the 320 x 240 screen is reported as cleared in 174.0ms (this includes some loop and range checking overhead). If the delays between the SPI transmits are reduced by one cycle then the SPI transaction gets corrupted.

So the maximum frame rate for screen clears on a 160 x 128 screen will be 21.7 fps, thus the clbri code (18.5fps) is running at 85% of the maximum rate, which is pretty good and is probably the best that can be achieved when checking the SPIF flag in the SPSR register.
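The 21.7 fps and 85% figures follow from the 46.08ms minimum clear time for 160 x 128 at 18 cycles per byte. A minimal sketch of that comparison in host-side C++, with hypothetical helper names:

```cpp
#include <cassert>

// Frames per second if each frame takes frameMs milliseconds.
inline double framesPerSecond(double frameMs) {
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}

// Fraction of the theoretical frame rate that a measured rate achieves.
inline double fractionOfLimit(double measuredFps, double limitFps) {
    assert(limitFps > 0.0);
    return measuredFps / limitFps;
}
```

With 46.08ms per frame the limit is about 21.7 fps, and 18.5 fps is roughly 0.85 of that.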

P.S. In the example I suspect that the "x+320" probably should have read "row*320"...

Thanks for correcting my bad arithmetic.

rowboteer:
This is the most optimised library I have seen, it has been hand tuned for performance with the UNO/ATmega328 processors.

How fast is it?

jboyton:
How fast is it?

really fast :
PDQ_GFX

That looks like it's at the limit for SPI on a 16MHz board. 5.7 FPS is essentially the same as 1/173ms.

rowboteer:
Better performance can be obtained with 16 bit parallel TFT's but then a Mega is needed to drive all those lines.

Those controllers also support 8 and 9 bit parallel interfaces which aren't out of the question for a processor that has fewer I/O pins.

I'm currently working with a display and using its 8-bit parallel interface. It's connected to a stock Uno. Granted, it uses 11 of the Uno's usually available 18 I/O pins, but it's feasible to do. It takes 3 processor cycles to write a byte. When clearing the screen, the data rate was 2278 x 10^3 bytes/sec, about 2.5 times as fast as the theoretical maximum for SPI.

jboyton:
I'm currently working with a display and using its 8-bit parallel interface.

I think you will be quite disappointed at the performance with an 8 bit interface. The good point about SPI is that the data is sent in the "background" after a single write instruction, so the processor can go and do something else instead of waiting. This is how the PDQ_GFX library gets such good line plotting performance; an 8 bit parallel version will not get the same performance here.

Most 8 bit display shields for the UNO split the byte across two AVR 8 bit ports, so performance really drops due to the byte manipulation required. The UNO only has one complete 8 bit port mapped to the I/O pins, and two of those pins are inconveniently the serial Tx and Rx.
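To show where the extra work comes from, here is a host-side sketch of a split-port write. The wiring is an assumption for illustration only (data bits 0-1 on PB0-PB1, data bits 2-7 on PD2-PD7, leaving PD0/PD1 free for serial), and the Ports struct is a stand-in for the real PORTB/PORTD registers so the code runs on a PC.

```cpp
#include <cstdint>

// Stand-ins for the AVR PORTB and PORTD output registers (hypothetical).
struct Ports {
    uint8_t portB;
    uint8_t portD;
};

// Write one data byte across two ports without disturbing the other pins.
// Each port needs its own mask, merge and store, instead of one plain write.
inline void writeSplitByte(Ports &p, uint8_t value) {
    p.portB = static_cast<uint8_t>((p.portB & ~0x03) | (value & 0x03)); // bits 0-1
    p.portD = static_cast<uint8_t>((p.portD & ~0xFC) | (value & 0xFC)); // bits 2-7
}
```

On a whole port the same byte would be a single register store, which is the difference being described.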

The way to go to get better than SPI performance with an AVR processor is a 16 bit interface on a Mega; then the screen can be cleared or rectangles filled at 2 megapixels per second (set up the 16 bit colour on the ports and then toggle the write strobe in a fast loop).
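The idea of "set the colour once, then just strobe" can be sketched with a mock bus on a PC. Everything here is hypothetical: MockBus16 stands in for two 8-bit port registers plus a WR pin, and on real hardware the strobe would be a pair of pin writes.

```cpp
#include <cstdint>

// Mock 16-bit parallel bus: holds the data lines and counts write strobes.
struct MockBus16 {
    uint16_t data = 0;
    uint32_t strobes = 0;
    void strobeWrite() { ++strobes; }  // stands in for pulsing the WR line
};

// Fill 'count' pixels with one colour: data lines are set up once,
// then each pixel costs only one write strobe.
inline void fillPixels(MockBus16 &bus, uint16_t colour, uint32_t count) {
    bus.data = colour;
    while (count--) {
        bus.strobeWrite();
    }
}

// Milliseconds to fill 'pixels' at a given pixel rate.
inline double fillTimeMs(uint32_t pixels, double pixelsPerSecond) {
    return pixels / pixelsPerSecond * 1000.0;
}
```

At the quoted 2 megapixels per second, a 320 x 240 fill (76800 pixels) would take about 38.4ms.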

rowboteer:
I think you will be quite disappointed at the performance with an 8 bit interface.

I understand what you're saying. The parallel interface I have requires three processor cycles per write whereas an SPI write is only one. The clear screen is kind of a special case. But in general I'm only getting better performance because I haven't fully optimized my display code.

It's a moot point. Although this controller (ST7565) has a native SPI interface mode, the display manufacturer decided, for some unknown reason, not to expose the pin that allows selection of SPI mode. It's particularly annoying since there is an unused pin in the FFC cable. Regardless, I am forced to use 8-bit parallel. I could add an external shift register to save pins. To get SPI performance would require additional logic.

I wish I could find a display like this that has an (available) SPI interface.

I really would not worry too much about the smaller TFT displays. Just make the best use of your compiler. e.g. use inline functions or macros.

The theoretical best speeds can be obtained with USART_MSPI for serial and "whole port" for 8-bit or 16-bit parallel. Obviously you can only get whole ports with a PROMINI or MEGA2560. Most displays want to run at 3.3V so you need level shifters for 5V 16MHz operation. Mind you, most mega328p will run at 3.3V and 16MHz quite happily.

Look at the Adafruit libraries. They tend to be well written.

The UNO needs a split port for 8-bit. Hence SPI is a realistic competitor for speed.
The MEGA2560 has been carefully designed to prevent you using USART_MSPI on any of its 4 usarts.

David.

david_prentice:
The UNO needs a split port for 8-bit.

Not if you aren't using the UART in the sketch.