New Jpeg decoder library - based on tjpgd

@David,
I am interested in the performance of the STM32F767 for decoding PNG compressed images. It seemed to do well on jpegs.

Can you try the attached and tell me the time for rendering the panda.png image from SPIFFS?

ESP32 takes 432ms

The PNG decoder uses lots of RAM, at least 40 kbytes in total, but it does not need to buffer the whole image.

Edit: I am guessing that the STM does not support SPIFFS so you will need an array based version...

png_test_2x.zip (265 KB)

@David,

I have created a FLASH array version to test. See attached.

P.S. ESP32 draws test image in 300ms

Edit: change the arraySize type to 32 bits at line 49 in the support functions to render larger arrays:

void load_file(const uint8_t* arrayData, uint32_t arraySize)

png_test_3.zip (59.8 KB)

I made a regular SD.h program that reads the PNGs from the SD card.

// regular Arduino prog with <SD.h>
//

#define USE_ADAFRUIT_GFX // Comment out to use TFT_eSPI

#define USE_LINE_BUFFER  // Enable for faster rendering

#if defined(USE_ADAFRUIT_GFX)
#define TFT_CS  10 //5
#define TFT_DC  9  //26
#define TFT_RST 8  //27
#define SD_CS   4
#include <Adafruit_ILI9341.h>
Adafruit_ILI9341 tft(TFT_CS, TFT_DC, TFT_RST);
#else
#include <TFT_eSPI.h>              // Hardware-specific library
TFT_eSPI tft = TFT_eSPI();         // Invoke custom library
#endif

#include <SD.h>
#include <math.h>

#include "pngle.h"

#define LINE_BUF_SIZE 64  // Render time in ms vs buffer size: per-pixel = 524, 16 = 406, 32 = 386, 64 = 375, 128 = 368, 240 = 367, no draw = 324 (51ms v 200ms)
int16_t px = 0, sx = 0;
int16_t py = 0, sy = 0;
uint8_t pc = 0;
uint16_t lbuf[LINE_BUF_SIZE];

int16_t png_dx = 0, png_dy = 0;

// Define corner position
void setPngPosition(int16_t x, int16_t y)
{
    png_dx = x;
    png_dy = y;
}

// Draw pixel - called by pngle
void pngle_on_draw(pngle_t *pngle, uint32_t x, uint32_t y, uint32_t w, uint32_t h, uint8_t rgba[4])
{
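    // Pack the 8-bit-per-channel RGBA pixel into 16-bit RGB565:
    // top 5 bits of red, top 6 of green, top 5 of blue.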
    uint16_t color = (rgba[0] << 8 & 0xf800) | (rgba[1] << 3 & 0x07e0) | (rgba[2] >> 3 & 0x001f);

#if !defined(USE_ADAFRUIT_GFX) && defined(USE_LINE_BUFFER)
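    // pushImage() sends the buffer bytes as-is here, so pre-swap each
    // 16-bit colour into the display's big-endian byte order.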
    color = (color << 8) | (color >> 8);
#endif

    if (rgba[3] > 127) { // Transparency threshold (no blending yet...)

#ifdef USE_LINE_BUFFER // This must handle skipped pixels in transparent PNGs
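        // Contiguous pixels on the same row are collected in lbuf and
        // flushed as a 1 pixel high bitmap when the buffer fills or the
        // run is broken (e.g. by a skipped transparent pixel or a new row).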
        if ( pc >= LINE_BUF_SIZE) {
#ifdef USE_ADAFRUIT_GFX
            tft.drawRGBBitmap(png_dx + sx, png_dy + sy, lbuf, LINE_BUF_SIZE, 1);
#else
            tft.pushImage(png_dx + sx, png_dy + sy, LINE_BUF_SIZE, 1, lbuf);
#endif
            px = x; sx = x; sy = y; pc = 0;
        }

        if ( (x == px) && (sy == y) && (pc < LINE_BUF_SIZE) ) {
            px++;
            lbuf[pc++] = color;
        }
        else {
#ifdef USE_ADAFRUIT_GFX
            tft.drawRGBBitmap(png_dx + sx, png_dy + sy, lbuf, pc, 1);
#else
            tft.pushImage(png_dx + sx, png_dy + sy, pc, 1, lbuf);
#endif
            px = x; sx = x; sy = y; pc = 0;
            px++; lbuf[pc++] = color;
        }
#else
        tft.drawPixel(x, y, color);
#endif
    }
}

void setup()
{
    Serial.begin(115200);
    tft.begin();

    if (!SD.begin(SD_CS)) {
        Serial.println("SD Mount Failed");
        return;
    }

}

void loop()
{
    load_file_SD(0, 0, "/PNGs/panda2.png");
    load_file_SD(0, 0, "/PNGs/PngSuite.png");
    load_file_SD(0, 0, "/PNGs/M81.png");
    load_file_SD(0, 0, "/PNGs/test.png");
}

void load_file_SD(int x, int y, const char *path)
{
    File file = SD.open(path);
    if (!file) {
        Serial.print(path);
        Serial.println(": Failed to open file for reading");
        return;
    }
    int32_t sz = file.size();
    
    tft.fillScreen(0);
    uint32_t t = millis();

    setPngPosition(x, y);
    pngle_t *pngle = pngle_new();
    pngle_set_draw_callback(pngle, pngle_on_draw);

    // Feed data to pngle
    uint8_t buf[1024];

    int remain = 0;
    int len;
#if !defined(USE_ADAFRUIT_GFX) && !defined(USE_LINE_BUFFER)
    tft.startWrite(); // Crashes Adafruit_GFX
#endif

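    // pngle_feed() returns how many bytes it consumed; any unconsumed
    // tail is moved to the front of buf and topped up on the next read.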
    while ((len = file.read(buf + remain, sizeof(buf) - remain)) > 0) {
        int fed = pngle_feed(pngle, buf, remain + len);
        if (fed < 0) {
            Serial.print("ERROR: ");
            Serial.println(pngle_error(pngle));
            break;
        }

        remain = remain + len - fed;
        if (remain > 0) memmove(buf, buf + fed, remain);
    }
#ifdef USE_LINE_BUFFER
    // Draw any remaining buffered pixels - pngle gives no warning that the image has ended...
    if (pc) {
#ifdef USE_ADAFRUIT_GFX
        tft.drawRGBBitmap(png_dx + sx, png_dy + sy, lbuf, pc, 1);
#else
        tft.pushImage(png_dx + sx, png_dy + sy, pc, 1, lbuf);
#endif
        pc = 0;
    }
#endif
#if !defined(USE_ADAFRUIT_GFX) && !defined(USE_LINE_BUFFER)
    tft.endWrite();
#endif
    pngle_destroy(pngle);
    file.close();
    t = millis() - t;
    Serial.print(path);
    Serial.print(": (");
    Serial.print(sz); 
    Serial.print(" bytes) : ");
    Serial.print(t); 
    Serial.println(" ms");
    delay(2000);
}

Your png_test3 sketch just needed print() instead of printf() and my Nucleo-144 pin defines.
And uint32_t in void load_file(const uint8_t* arrayData, uint32_t arraySize)

png_test3 took 464 ms for PngSuite_png and 508 ms for panda2_png. -O3 is about 20% faster than -Os.
png_test_SD reported:

/PNGs/panda2.png: (113485 bytes) : 1003 ms
/PNGs/PngSuite.png: (2262 bytes) : 481 ms
/PNGs/M81.png: (87182 bytes) : 825 ms
/PNGs/test.png: (9468 bytes) : 214 ms

I used IrfanView to save the PNG files as JPG (80% quality). The resultant files are tiny.

    Directory: C:\Users\David Prentice\Pictures\jpegs2


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        05-Nov-19     18:56           9203 M81.jpg
-a----        05-Nov-19     18:57          12901 panda2.jpg
-a----        05-Nov-19     18:57           5180 PngSuite.jpg
-a----        05-Nov-19     18:57           6530 test.jpg

I have also saved them at 95% quality to compare the sizes:

    Directory: C:\Users\David Prentice\Pictures\jpegs3


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        05-Nov-19     20:27          22849 M81.jpg
-a----        05-Nov-19     20:27          27473 panda2.jpg
-a----        05-Nov-19     20:27           9123 PngSuite.jpg
-a----        05-Nov-19     20:28          11525 test.jpg

Using my tjpgd_SD program with the 95% JPG files, I get 413ms, 474ms, 162ms, 264ms.

David.

If PNG decoding needs 40kB of SRAM, I have no chance of running this on an Xmega, mega4809 or mega2560, or even a Cortex-M0.

I will put all the PNGs into Flash tomorrow. I expected PNGLE to be similar in speed to TJPG.
And I will put the JPEG3 (95%) into Flash too.

It will be interesting to see the F767 performance.

David.

David, thanks for running the tests and for the results.

Which processor does the Nucleo board have? I believe they are available with different processors.

The motivation for decoding PNGs is actually not size, speed or quality, but that weather image overlays are only available in PNG format. To overlay cloud or rain radar imagery on a Google map I need the transparency capability. The test.png image should render the icon without a background.

There are many Nucleo boards. I have several Nucleo-64s. The F767 board is a Nucleo-144, i.e. LQFP144.

All Nucleo boards have an on-board STLink debugger-programmer.
All the Nucleo-64 boards have the same Arduino pinout.
The Nucleo-144 boards tend to have a few extras e.g. Ethernet, USB-OTG, ...

It makes it very easy to compare Cortex-M4, M3 and M0 chips. The ST Arduino core supports most chips, and it is relatively easy to add an unsupported one.

I will have a look at the PNG algorithm. I presume that it comes down to Lempel-Ziv or equivalent.

Incidentally, my show_tjpg_P program does not seem to be giving very good figures today. I don't usually use Adafruit_ILI9341.

David.

I have run the PNG tests with an MCUFRIEND parallel TFT on a 240MHz ESP32, a 216MHz STM32F767 and a 168MHz STM32F446:

240MHz ESP32
PngSuite_png 256x256 319 ms [2262]    
panda2_png 240x270 358 ms [113485]    
M81_png 240x240 309 ms [87182]    
test_png 200x200 189 ms [9468]    

M81_jpg 240x240 1/1 129 ms [22849]
panda 240x320 1/1 137 ms [12659]
panda2_jpg 240x270 1/1 147 ms [27473]
panda2_70_jpg 240x270 1/1 120 ms [12901]
PngSuite_jpg 256x256 1/1 112 ms [9123]
test_jpg 200x200 1/1 81 ms [11525]


216MHz STM32F767
PngSuite_png 256x256 183 ms [2262]    
panda2_png 240x270 206 ms [113485]    
M81_png 240x240 174 ms [87182]    
test_png 200x200 108 ms [9468] 

M81_jpg 240x240 1/1 67 ms [22849]
panda 240x320 1/1 70 ms [12659]
panda2_jpg 240x270 1/1 76 ms [27473]
panda2_70_jpg 240x270 1/1 61 ms [12901]
PngSuite_jpg 256x256 1/2 45 ms [9123]
test_jpg 200x200 1/1 42 ms [11525]
   

168MHz STM32F446:
PngSuite_png 256x256 413 ms [2262]    
panda2_png 240x270 454 ms [113485]    
M81_png 240x240 389 ms [87182]    
test_png 200x200 241 ms [9468]

I am surprised by the STM32F767 performance relative to the Cortex-M4 STM32F446.
It outperforms the ESP32 too.

Note that all of the JPG files are much more compact than the PNGs. The sheer number of input bytes makes a difference to the throughput.

Oops. The F446 and F767 were using 240x320 TFTs. Hence the PngSuite_jpg is scaled 1/2.
The ESP32 was using a 320x480 TFT. Hence 1/1.

I suspect that the PNG decompression algorithm could be improved.
Chan's TJPG certainly works well on 32-bit targets.
PICOJPEG is faster on 8-bit targets.

David.

@David, thanks for the results. The STM32F767 has very good performance. I see it has a hardware Jpeg codec, so it would handle jpegs extremely fast if that could be employed.

The ESP32 is adequate in my application because it can use WiFi to fetch the rain radar images and they only update every 10 minutes, so speed is not an issue.

The PNG decoder decompresses from a streaming source, so there is no need to save the complete image prior to decompression. This is ideal when pulling images from the web, as sketched below.
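
As an illustrative sketch only: feeding pngle straight from an HTTP response on the ESP32 might look like this. The buffer size is arbitrary, error handling is simplified, and pngle_on_draw is assumed to be a callback like the one in the SD sketch above:

#include <WiFi.h>
#include <HTTPClient.h>
#include "pngle.h"

void fetch_and_decode(const char *url)
{
    HTTPClient http;
    http.begin(url);
    if (http.GET() == HTTP_CODE_OK) {
        pngle_t *pngle = pngle_new();
        pngle_set_draw_callback(pngle, pngle_on_draw);

        WiFiClient *stream = http.getStreamPtr();
        uint8_t buf[512];
        int remain = 0;
        while (http.connected()) {
            int len = stream->read(buf + remain, sizeof(buf) - remain);
            if (len <= 0) break; // simplified: a robust version would wait for more data
            int fed = pngle_feed(pngle, buf, remain + len);
            if (fed < 0) break;  // decode error, see pngle_error()
            remain = remain + len - fed;
            if (remain > 0) memmove(buf, buf + fed, remain); // keep unconsumed tail
        }
        pngle_destroy(pngle);
    }
    http.end();
}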

I tried a few performance tweaks to get PngSuite.png down to 273 ms, but visually the speed improvement is not noticeable when streaming from the web.

I updated to the latest miniz release but saw no difference. At some point I will turn the PNG decoder into a library.

Thanks again for your testing.

Hi @Bodmer, and all users,

I've tried to implement TJpg inside my ESPFastSSD1331 library for ESP8266 and ESP32, but I can't compile it because there are some static functions from which the compiler can't see the variables declared in the header file. Can you please help me figure out how to compile it?

I've tried to create a pointer to the class and use it as static, but without success; I'm no expert on this or on callbacks.

My library is quite long, around 5000 lines in the .cpp file and 1000 lines in the .h file, so I've reproduced the problem in a new sketch with just my class and nothing from my library apart from setup and initialisation in the main sketch (which can be replaced with the Adafruit library). My library is big because I've added a lot of functionality while maintaining high speed and hardware optimisations. It even contains a bitmap renderer, a class called RawVideoBitmap and another class called RawVideoJpeg. With these two classes I can play videos based on bitmap file sequences (on an ESP8266 @ 160 MHz) at a very high frame rate, about 100 fps at 96x64 in 16-bit colour, so I need to insert a delay to slow the video down to its original frame rate of 24 fps.

The RawVideoJpeg class can do the same with jpeg images. Currently I've used your JPEGDecoder, embedded inside my library; with this class I can play videos based on jpeg images at a frame rate of about 38-40 fps.

Now I think an explanation is needed. At the start of this project I managed to read a sequence of bitmap (.bmp) images from SPIFFS and SD card to form a video. Using SPIFFS I had a frame rate of about 30 fps, but from SD card only a low 12 fps, because the ESP8266 needs to open, read and close a lot of files. So I decided to create an encoder that builds a single file with a header containing all the information needed to play the video; then, starting from byte 100, all the images are inserted sequentially, already converted into the right format, e.g. RGB565, RGB666, RGB888, or black-and-white (for monochrome OLEDs like the SSD1306).
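
As a rough sketch of that idea (the field names and sizes here are my guesses, not the actual .rvb/.rvj layout):

// Hypothetical layout of the 100-byte header described above
struct RawVideoHeader {
    char     magic[4];      // file signature, e.g. "RVB0" or "RVJ0"
    uint32_t frameCount;    // number of frames in the file
    uint16_t width;         // frame width in pixels
    uint16_t height;        // frame height in pixels
    uint8_t  format;        // 0 = RGB565, 1 = RGB666, 2 = RGB888, 3 = mono
    uint8_t  fps;           // original frame rate
    uint8_t  reserved[86];  // pads the header to the 100-byte frame offset
};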

On the ESP side (with RawVideoBitmap) nothing needs to be converted: just read a full frame into an array in RAM, set the address window of the display, and send the whole frame with a single pushColors(intPixel, intPixelLen) call, as sketched below. This increased the frame rate a lot, from 12 to 100+ fps.
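
For example, one frame of that fast path might look roughly like this, assuming a TFT_eSPI-style API, a 96x64 RGB565 video and an already-open file positioned at the frame:

static uint16_t frame[96 * 64];              // one full frame in RAM
file.read((uint8_t *)frame, sizeof(frame));  // read the whole frame in one go
tft.setAddrWindow(0, 0, 96, 64);             // set the address window once
tft.pushColors(frame, 96 * 64, false);       // push the frame with one call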

RawVideoBitmap is capable of a very high frame rate because it reads one frame at a time and writes it to the display. RawVideoJpeg is slower because it needs to read and write MCUs separately, and this slows things down even though there is less data to read/write.

To encode RawVideoBitmap and RawVideoJpeg files I've created two interactive sketches with serial input, in which the user sets the name of a frame and the name of the output encoded video file, which is saved on the SD card itself. These files have a .rvb extension for RawVideoBitmap and .rvj for RawVideoJpeg.

After a lot of problems with the ESP8266 crashing from WDT resets, I finally got both of them working. They can encode even long videos; the maximum I've tried is 20000 frames. But because the ESP needs to open, read and close a lot of files it is pretty slow: encoding videos of 500-1000 frames is no problem, just a few minutes, but 20000 frames required 4 hours, which is a very long time.

So I decided to develop both encoders on my PC using B4J, and after a lot of tests both work well and are very fast, even though my PC is old, just a dual-core with 2 GB of RAM. Both encoders show a real-time preview while encoding, and as I said they are very fast: encoding 20000 frames at 96x54 required just 5-6 minutes.

RawVideoEncoder works only with regular JFIF files; Exif is not supported, but I think even the Arduino encoder does not support it, just like 'progressive' files. Right?

Because I've also developed an audio library for ESP8266 and ESP32, I've tried to play video with audio out of the box by connecting my Raspberry Pi Zero W Pimoroni pHAT DAC to the ESP8266. Playing a RawVideoBitmap video without audio I get 104 fps; with 22050 Hz 16-bit audio it works well, and I need to add some small delays after each frame. Note that I read a .wav file (extracted from the video) from the SD card, so synchronising audio and video requires fine-tuned (microsecond) delays. With 44100 Hz 16-bit audio it works but with some clicks: the video frame rate is lower than the original file, about 20-22 fps, so I need to speed things up a bit to get audio CD quality without clicks.

RawVideoBitmap files are very large; RawVideoJpeg files are small, depending on the jpeg image quality.
The same video requires 82 MB using bitmaps and 11 MB using jpegs, depending on the compression ratio; the figures I've quoted use a quality of 2 on a scale of 0-32 (0 means best quality).

That I"ve missed is that both encoders need a serie of BITMAP or JPEG files numbered in sequene, eg. Frame1.bmp, Frame2.bmp etc.... The encoder will use these numbers to know a frame order. All frames must be the same size in pixels, for bitmaps 24bit colour not compressed, for jpegs only JFIF (no Exif).

To do this my steps are:

  • download a video from YouTube or other sources; because Exif is not supported, camera jpegs can't be used without converting them first
  • on Linux I use avconv, ffmpeg or mencoder to extract (and resize) all the frames from the downloaded .mp4 video; a single terminal command extracts all frames in .bmp or .jpg format, and for the latter the quality can be chosen. The same applies to extracting the audio, which can be resampled to a lower sample rate.
  • put all the extracted images on the SD card, ideally after formatting it with SD Card Formatter.
  • put the SD card in the PC, launch the right encoder, and select the first and last frames, e.g. MyVideo-1.jpg and MyVideo-1000.jpg to encode 1000 frames. It is also possible to encode just part of a video, e.g. from frame 280 to 400, or to encode a full video (or part of one) in reverse, e.g. from frame 800 to frame 50, in which case the frame order is reversed. The encoder also has a stop button to halt the encoding process; in that case the resulting video is not corrupt, it just contains only the frames already encoded, and the header is refreshed accordingly.

When reading back on the Arduino or ESP side, the first thing to do is open the file and read the header; using the header structure this happens in just one line of code, as below. After this I have the file size, width, height, frame count and other data useful for playing the video file.
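
With a header struct like the hypothetical RawVideoHeader sketched earlier, that one line could be:

RawVideoHeader hdr;
file.read((uint8_t *)&hdr, sizeof(hdr));  // one call fills every header field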

With my classes, after I open the file it is verified and kept open. The class has a play() command on which I set the first and last frames to play, and there is also a setFrame(framenumber) to show an exact frame. The frames can be rendered in any order, from beginning to end and vice versa, or just a section of the video, and I can even select frames randomly. Using RawVideoBitmap, drawing a frame on my small SSD1331 0.96-inch 96x64 16-bit colour OLED requires just 9.8 ms.

I will release both encoders on GitHub in the next months if I solve some problems. The B4J code is not open source for now, maybe in the future; I will just release the .jar files, which work on Windows, Linux and Mac, but I will release the ESP8266 RawVideoJpegEncoder sketch as open source. As for the RawVideoBitmapEncoder, something went wrong that I still need to understand: after several days of developing it, the last time I opened it in the Arduino IDE it had been deleted, zero bytes, so I don't have it any more. I tried some utilities to recover deleted files, but with no luck at all.

Please help me get TJpg working inside my library so I can increase the frame rate a bit and play video together with audio. Many thanks.

Sorry for my bad English, but it is not my native language; I'm Italian.

david_prentice:
I have always understood that

%d is native int

%ld is int32_t
%lld is int64_t



It seems unambiguous for (signed) `int`, but less clear for `unsigned int`.

If `unsigned int` follows the same style as native `int`, that means the (native 16-bit int) AVR compiler expects `uint16_t` for %u, and the (native 32-bit int) PC, ARM, ESP32 and ESP8266 compilers expect `uint32_t` for %u.

Likewise, a 64-bit int compiler would expect `uint64_t`, and would probably push `uint64_t` for every variadic argument to printf(), e.g. %u, %lu, %llu.

I have never used a 64-bit int compiler.

Of course the sizeof(int) is implementation dependent.
The implementation of printf() is probably more a matter of historic convention than strict language rules.

David.

It may be easier to remember these rules (since these are the actual types expected by the printf family):

%u   %d   (unsigned) int
%lu  %ld  (unsigned) long int
%llu %lld (unsigned) long long int

C has nice macros for each of the types defined in stdint.h/inttypes.h, such as PRIu16 to print a uint16_t:

printf("%" PRIu16 "\n", (uint16_t)x);

The macros simply expand to a format string with a native printf format type; PRIu32 might expand to "lu" or "u" (or something else), depending on the platform.
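
For example, this program prints a uint32_t with the correct length modifier on a 16-bit-int AVR and a 32-bit-int ARM or ESP target alike:

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint32_t n = 123456;
    printf("n = %" PRIu32 "\n", n); // the macro supplies the right length modifier
    return 0;
}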

@maxmeli123
Post your library to Github as Open Source and I will have a look at it and advise.

Hi @Bodmer, many thanks for your reply,

my library is incomplete, so I cannot post it on GitHub yet, but I've already created a repository to hold it and the other libraries I've developed for ESP8266, which I am adapting to ESP32 these days. All these libraries are open source and I will post the full code; only the desktop encoder with a GUI, developed with B4J, is not open source. All the other code is completely open.

Here you can find my GitHub repositories; there is not much content yet because I have to release the OLED library before the other libraries I've developed for various devices and in various languages.

If you want, I can send my sketch for reading encoded video files, in which I've included RawVideoJpeg.cpp, RawVideoJpeg.h, tjpgd.c and tjpgd.h; but in the sketch and in the RawVideoJpeg class there are some instances of my ESPFastSSD1331 library. It all works but does not actually show images (so that part is disassembled), nor videos, because I cannot finish the jpeg video part. I've also done the same with bitmap images, so I have a sketch to read a RawVideoBitmap; this works fully, at a very high frame rate of about 100 fps on my small OLED (a 320x240 TFT would surely require more time), but again in the RawVideoBitmap class I pass in an instance of my OLED library from the sketch.

I can also send you the encoders that run on Windows, Linux and Mac (you need Java installed; not yet tested on Mac). Using them is very simple: just set the start and end frame images from disk or SD card, then press the ENCODE button; it asks you for the video name, starts encoding the series of images into a video file, and shows you a real-time preview while it encodes. For now there is a limit of 320x240 maximum video output, but this limit can be removed. And note that the RawVideoJpeg encoder does not support Exif, only JFIF.

I can send it to you if you want to help me, but you will have to change the OLED declarations, the initialisation, and the reference passed to these classes to use your OLED or TFT library.

Finally, if you have an SSD1331 OLED I can even send you my ESPFastSSD1331 library, though as I said it is still incomplete and has a lot of commented-out lines.

Now I have found a way to compile: I created a pointer to the RawVideoJpeg class and declared it extern, so now I can reach the class variables inside the static functions used by the TJpgDec decoder. I can access and read them, but I cannot write to them, so when I increment the pointer it stays at zero, and in the end the decoder does not return JDR_OK but JDR_INP (error code 2), because I can't fill the decoder's input array.
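
For what it's worth, the usual way to bridge a C-style decoder callback to a C++ instance is a static trampoline function plus a static instance pointer, roughly like this (the names and the input function signature are illustrative; check your tjpgd version):

#include "tjpgd.h"

class RawVideoJpeg {
public:
    RawVideoJpeg() { instance = this; }        // register the active instance

    // Static trampoline with the C signature the decoder expects;
    // it just forwards to a member function of the registered instance.
    static size_t jd_input(JDEC *jd, uint8_t *buf, size_t len) {
        return instance->readData(buf, len);
    }

private:
    size_t readData(uint8_t *buf, size_t len); // real work, can touch members
    static RawVideoJpeg *instance;             // set before decoding starts
};

RawVideoJpeg *RawVideoJpeg::instance = nullptr;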

Because I use it inside a class, I removed the tft_output callback completely and use the jd_output function directly.
The array containing a jpeg image inside the video file is good. I've tried printing every element to the serial log: it starts correctly with the SOI bytes 0xFF, 0xD8, 0xFF, 0xE0; from byte 5 (zero-based) it correctly contains the JFIF signature; and the array ends correctly with the EOI bytes 0xFF, 0xD9. I can even extract the width and height without the TJpgDec decoder, using my own code that parses the array (or file) to validate it and return the width and height.

Note that all of this already works with the old JPEGDecoder embedded directly inside the RawVideoJpeg class, but I get only 38-40 fps. That is a high frame rate compared to the original 24 fps video, but because I also read a .wav audio file from SD and send it to an external DAC, it is too slow and needs speeding up.

Please help me; my library required very hard work. I started it about two years ago and have worked many days (and nights) on it trying to improve the speed. Now it is very, very fast, hence its name, but currently it can't render images or videos.

Many thanks

@Bodmer, you can see my library and my small color oled in action here:

and here:

But that is a very old version, with which I achieved about 29-30 fps reading from SPIFFS (which is faster than SD). It does not even use my RawVideoJpeg and RawVideoBitmap classes, so the library just opens, reads and closes about 30 files every second. Now the frame rate for the same video has increased from 30 fps to 104 fps, and I read it not from SPIFFS but from SD using the SdFat library.

Because my old library release achieved about 30 fps from SPIFFS reading images one by one, while the same code using SD achieved 12 fps, I'm pretty sure that with my new classes I could achieve about double the frame rate reading from SPIFFS, so about 200 fps. But flash memory is limited in space for storing bitmaps; jpegs require 1/8 of the space while maintaining quality close to the original bitmaps, though only for small videos.

I noticed that an ESP with 4 MB SPIFFS is very fast, while 16 MB is 4 times slower, so the latter can't really be used to play videos, or only at a relatively low frame rate.

Many thanks

@maxmeli123
As you have had success with the JPEGDecoder library, I would stick with that and complete the rest of your library. The frame rate improvement from TJpg_Decoder is unlikely to be significant on such a small screen.

In your project's case it is probably best to eventually integrate the underlying TJpgDec C code directly into your library. See the documentation on that website.

I am not able to help further as my time is now allocated to some urgent paid work. Good luck with your project.

I have added a new "ESP32_Dual_Core_Flash_Jpg" example to the TJpg_Decoder library that will only run on the ESP32.

The ESP32 has two processor cores (0 and 1). Arduino sketches run on core 1; however, it is possible to assign tasks to core 0 from the sketch.

The new example uses core 0 to decode a Jpeg image while core 1 renders it to the screen. This means the two activities run in parallel, reducing the time it takes to get a Jpeg image displayed.
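
For anyone curious, pinning a task to core 0 with the ESP32 Arduino core looks roughly like this; the task body and the handshake flag are illustrative, not the actual example code:

#include <Arduino.h>

volatile bool frameReady = false;   // simple handshake flag between the cores

// Hypothetical decode task pinned to core 0
void decodeTask(void *param) {
  for (;;) {
    if (!frameReady) {
      // ... decode the next part of the Jpeg here ...
      frameReady = true;            // tell core 1 that data is ready
    }
    vTaskDelay(1);                  // yield so the core 0 idle task can run
  }
}

void setup() {
  // (function, name, stack bytes, parameter, priority, handle, core id)
  xTaskCreatePinnedToCore(decodeTask, "decode", 4096, NULL, 1, NULL, 0);
}

void loop() {                       // loop() runs on core 1
  if (frameReady) {
    // ... render the decoded data to the TFT here ...
    frameReady = false;             // hand the buffer back to core 0
  }
}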

The new example is a copy of an existing single core library sketch called "Flash_Jpg", modified to run on the two processors. The original sketch took 103 ms to decode and render a 240x320 image on an SPI ILI9341 at a 40 MHz clock rate; this total comprises 66 ms to decode the image and 37 ms to transfer the decoded image to the TFT.

The dual core sketch takes 66 ms in total to both decode and render the Jpeg, because the decoding and rendering happen on different processors.

Clearly other graphics operations could also be sped up. Using two processors in this way requires an understanding of how task flows are managed by mutex flags. The rendering performance gains will also depend on what other tasks are allocated to processor 0, for example WiFi and BLE management.

That is impressive.

Especially the simplicity of handling the two cores.

David.