The JPEG format of the ESP32-Camera

Does the ESP32 camera have an embedded compressor in its hardware chip to create the JPEG image format?

A JPEG image is always compressed.

How? Can you give me more information?


A note about the JPEG decoding algorithm.
Copyright 1999 Cristi Cuturicu.

You get this file for free, so you cannot make any legal claims against me.
If you don't agree, read no more.
No warranty is provided with this doc; there might be bugs or errors in it
(although I've tried to avoid them), so use the information contained in this
file at your own risk.
This is NOT official documentation; for further information please refer
to the JPEG ISO standard.
All product names mentioned in this file are trademarks or registered trademarks
of their respective owners.
You are free to distribute it, as long as you do not modify it.

First, a word about this doc
This doc tries to explain the JPEG compression algorithm. I'm not an expert in
this field; I just needed this info for my own JPEG decoder.
A while ago I wanted to write my own JPEG decoder, so I looked on the net for
a good doc that explained JPEG compression, and particularly the JPG file
format, and apart from the standard I couldn't find one.
(The ISO-ITU JPEG standard = ISO standard 10918-1, or CCITT recommendation
"Information Technology - Digital compression and coding of continuous-tone
still images - Requirements and guidelines".)
Though this standard is quite complete, a lot of its 186 pages are not that
interesting, and I had to dig through it, and then write my own JPG viewer,
to extract from it the main thing I needed :
    The Baseline Sequential DCT JPG compression.
So I thought that a short (but detailed enough) doc might be useful to others.

Mainly because the majority of JPG files are Baseline Sequential JPGs, this
doc covers only Baseline Sequential JPG compression, and particularly the JFIF
implementation of it.
It DOES NOT cover Progressive or Hierarchical JPG compression.
(For more details about these, read the itu-1150 standard.)

I thought it would be easier for the reader to understand JPG compression if
I explain the steps of the JPG encoder.
(The decoder performs the encoder's steps, but in reverse order, of course.)


1) The affine transformation in colour space :  [R G B] -> [Y Cb Cr]

(It is defined in the CCIR Recommendation 601)

(R,G,B are 8-bit unsigned values)

	| Y  |     |  0.299       0.587       0.114 |   | R |     | 0 |
	| Cb |  =  |- 0.1687    - 0.3313      0.5   | * | G |   + |128|
	| Cr |     |  0.5       - 0.4187    - 0.0813|   | B |     |128|

The new value Y = 0.299*R + 0.587*G + 0.114*B  is called the luminance.
It is the value used by monochrome monitors to represent an RGB colour.
Physiologically, it represents the intensity of an RGB colour as perceived by
the eye.
You see that the formula for Y is like a weighted filter with different weights
for each spectral component: the eye is most sensitive to the green component,
then to the red component, and least to the blue component.

The values Cb =  - 0.1687*R - 0.3313*G + 0.5   *B + 128
	   Cr =    0.5   *R - 0.4187*G - 0.0813*B + 128
are called the chrominance values and represent two coordinates in a system
which measures the nuance and saturation of the colour (approximately, these
values indicate how much blue and how much red is in that colour).
These two coordinates are shortly called the chrominance.

[Y,Cb,Cr] to [R,G,B] Conversion (The inverse of the previous transform)
RGB can be computed directly from YCbCr ( 8-bit unsigned values) as follows:

R = Y                    + 1.402  *(Cr-128)
G = Y - 0.34414*(Cb-128) - 0.71414*(Cr-128)
B = Y + 1.772  *(Cb-128)
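
For instance, a minimal C sketch of this inverse transform could look like the
following (the clamp() helper and the use of floating point are my own choices
for the example, not something mandated by the standard):

#include <stdint.h>

static uint8_t clamp(double v) {
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Convert one (Y,Cb,Cr) sample (8-bit unsigned) back to (R,G,B). */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                  uint8_t *r, uint8_t *g, uint8_t *b) {
    *r = clamp(y + 1.402   * (cr - 128));
    *g = clamp(y - 0.34414 * (cb - 128) - 0.71414 * (cr - 128));
    *b = clamp(y + 1.772   * (cb - 128));
}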

A note relating Y,Cb,Cr to the human visual system
The eye, particularly the retina, has two kinds of cells as visual analyzers:
cells for night vision, which perceive only shades of gray ranging from intense
white to the darkest black, and cells for day vision, which perceive colour.
The first cells, given an RGB colour, detect a gray level similar to that given
by the luminance value.
The second cells, responsible for the perception of colour nuance, detect a
value related to that of the chrominance.

2) Sampling

The JPEG standard takes into account the fact that the eye seems to be more
sensitive to the luminance of a colour than to its nuance.
(The black-and-white cells have more influence than the day-vision cells.)

So, in most JPGs, the luminance is taken for every pixel, while the chrominance
is taken as an average value for a 2x2 block of pixels.
Note that the chrominance does not necessarily have to be taken as an average
over a 2x2 block (it could be taken for every pixel), but good compression
results are achieved this way, with almost no loss in the visual perception of
the newly sampled image.

A note: the JPEG standard specifies that for every image component (like, for
example, Y) two sampling coefficients must be defined: one for horizontal
sampling and one for vertical sampling.
These sampling coefficients are defined in the JPG file relative to the
maximum sampling coefficient (more on this later).

3) Level shift
All 8-bit unsigned values (Y,Cb,Cr) in the image are "level shifted": they are
converted to an 8-bit signed representation, by subtracting 128 from their value.

4) The 8x8 Discrete Cosine Transform (DCT)

The image is broken into 8x8 blocks of pixels, and the DCT transform is
applied to each 8x8 block. Note that if the X dimension of the original image
is not divisible by 8, the encoder should make it divisible, by filling the
remaining right columns (until X becomes a multiple of 8) with the right-most
column of the original image.
Similarly, if the Y dimension is not divisible by 8, the encoder should fill
the remaining lines with the bottom-most line of the original image.
The 8x8 blocks are processed from left to right and from top to bottom.

A note: since a pixel in the 8x8 block has 3 components (Y,Cb,Cr), the DCT
is applied separately to three 8x8 blocks:
  The first 8x8 block contains the luminance of the pixels in the original
   8x8 block,
  The second 8x8 block contains the Cb values of the pixels in the original
   8x8 block,
  And, similarly, the third 8x8 block contains the Cr values.

The purpose of the DCT transform is that instead of processing the original
samples, you work with the spatial frequencies present in the original image.
These spatial frequencies are closely related to the level of detail present in
an image. High spatial frequencies correspond to high levels of detail, while
lower frequencies correspond to lower levels of detail.

The DCT transform is very similar to the 2D Fourier transform: it shifts from
the sample domain (the original 8x8 block) to the frequency domain (the new
8x8 = 64 coefficients which represent the amplitudes of the spatial
frequencies).
The mathematical definition of Forward DCT (FDCT) and Inverse DCT (IDCT) is :

	   c(u,v)     7   7                 2*x+1                2*y+1
F(u,v) = --------- * sum sum f(x,y) * cos (------- *u*PI)* cos (------ *v*PI)
	     4       x=0 y=0                 16                   16

 u,v = 0,1,...,7

	  { 1/2 when u=v=0
 c(u,v) = { 1/sqrt(2) when  u=0, v!=0
          { 1/sqrt(2) when u!=0, v=0
	  {  1 otherwise

	   1     7   7                      2*x+1                2*y+1
f(x,y) =  --- * sum sum c(u,v)*F(u,v)*cos (------- *u*PI)* cos (------ *v*PI)
	   4    u=0 v=0                      16                   16
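
Just to make the formula concrete, here is a direct (and deliberately slow) C
sketch of the FDCT above; a real codec would use a fast factorization such as
AA&N instead:

#include <math.h>

/* in:  8x8 block of level-shifted samples
   out: 8x8 block of DCT coefficients */
void fdct_8x8(const double in[8][8], double out[8][8])
{
    const double PI = 3.14159265358979323846;
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++) {
            /* c(u,v) = cu*cv : 1/2 if u=v=0, 1/sqrt(2) if exactly one is 0, 1 otherwise */
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += in[x][y] * cos((2*x + 1) * u * PI / 16.0)
                                    * cos((2*y + 1) * v * PI / 16.0);
            out[u][v] = cu * cv * sum / 4.0;
        }
}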


Applying these formulas directly is computationally expensive, and faster
algorithms have been developed for the forward and inverse DCT. A notable one,
called AA&N, leaves only 5 multiplies and 29 adds to be done in the DCT itself.
More info and an implementation of it can be found in the free JPEG
encoder/decoder software made by the Independent JPEG Group (IJG), in their C
source.

5) The zig-zag reordering of the 64 DCT coefficients

So, after we performed the DCT transform over a block of 8x8 values, we have
a new 8x8 block.
Then, this 8x8 block is traversed in zig-zag like this :

(The numbers in the 8x8 block indicate the order in which we traverse the
bidimensional 8x8 matrix)
		  0, 1, 5, 6,14,15,27,28,
		  2, 4, 7,13,16,26,29,42,
		  3, 8,12,17,25,30,41,43,
		  9,11,18,24,31,40,44,53,
		 10,19,23,32,39,45,52,54,
		 20,22,33,38,46,51,55,60,
		 21,34,37,47,50,56,59,61,
		 35,36,48,49,57,58,62,63

As you see , first is the upper-left corner (0,0), then the value at (0,1),
then (1,0) then (2,0), (1,1), (0,2), (0,3), (1,2),  (2,1), (3,0) etc.

After we are done traversing the 8x8 matrix in zig-zag, we have a vector
with 64 coefficients (0..63).
The reason for this zig-zag traversal is that we traverse the 8x8 DCT
coefficients in order of increasing spatial frequency. So we get a vector
sorted by spatial frequency: the first value in the vector (at index 0)
corresponds to the lowest spatial frequency present in the image - it's called
the DC term. As we increase the index in the vector, we get values
corresponding to higher and higher frequencies (the value at index 63
corresponds to the amplitude of the highest spatial frequency present in the
8x8 block).
The rest of the DCT coefficients are called AC terms.
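
A small C sketch of the reordering, using the same traversal-order table shown
above (order[r][c] is the position in the output vector of block element (r,c);
the coefficients are shown as int here just to keep the example short):

/* order[r][c] = position in the output vector of block element (r,c),
   exactly the table printed above */
static const int order[8][8] = {
    { 0, 1, 5, 6,14,15,27,28},
    { 2, 4, 7,13,16,26,29,42},
    { 3, 8,12,17,25,30,41,43},
    { 9,11,18,24,31,40,44,53},
    {10,19,23,32,39,45,52,54},
    {20,22,33,38,46,51,55,60},
    {21,34,37,47,50,56,59,61},
    {35,36,48,49,57,58,62,63}
};

void zigzag_block(const int block[8][8], int vector[64]) {
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            vector[order[r][c]] = block[r][c];
}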

6) Quantization

At this stage, we have a sorted vector with 64 values corresponding to the
amplitudes of the 64 spatial frequencies present in the 8x8 block.

These 64 values are quantized: each value is divided by the corresponding
divisor from a vector of 64 values - the quantization table - and then rounded
to the nearest integer:

 for (i = 0 ; i<=63; i++ )
   vector[i] = (int) (vector[i] / quantization_table[i] + 0.5)

(Adding 0.5 rounds the positive values to the nearest integer; a real encoder
rounds the negative values symmetrically.)

Here is an example of the quantization table for luminance (Y) given in an
annex of the JPEG standard. (It is given in the form of an 8x8 block; in order
to obtain a 64-value vector it should be zig-zag reordered.)
 16 11 10 16 24  40  51  61
 12 12 14 19 26  58  60  55
 14 13 16 24 40  57  69  56
 14 17 22 29 51  87  80  62
 18 22 37 56 68  109 103 77
 24 35 55 64 81  104 113 92
 49 64 78 87 103 121 120 101
 72 92 95 98 112 100 103 99
This table is based upon "psychovisual thresholding"; it has "been used with
good results on 8-bit per sample luminance and chrominance images".
Most existing encoders use simple multiples of this example, but the values are
not claimed to be optimal (an encoder can use ANY OTHER quantization table).
The table is specified in the JPG file with the DQT (Define Quantization Table)
marker. Most commonly there is one table for Y, and another one for the
chrominance (Cb and Cr).

The quantization process plays the key role in JPEG compression.
It is the process which removes the high frequencies present in the original
image -- and, in consequence, the high detail.
We do this because the eye is much more sensitive to lower spatial frequencies
than to higher frequencies, so we can remove the higher frequencies with very
little visual loss.
This is done by dividing the values at high indexes in the vector (the
amplitudes of higher frequencies) by larger values than those used to divide
the amplitudes of lower frequencies.
The bigger the values in the quantization table are, the bigger the error
(and in consequence the visual error) introduced by this lossy process, and
the lower the visual quality.

Another important fact is that in most images the colour varies slowly from one
pixel to another, so most images will have a small quantity of high detail
-> a small amount (small amplitudes) of high spatial frequencies - but they have
a lot of image information contained in the low spatial frequencies.

In consequence, in the new quantized vector, at the high spatial frequencies,
we'll have a lot of consecutive zeroes.

7)  The Zero Run Length Coding (RLC)

Now we have a quantized vector with a lot of consecutive zeroes. We can exploit
this by run-length coding the consecutive zeroes.
IMPORTANT: you'll see later why, but here we skip the encoding of the first
 coefficient of the vector (the DC coefficient), which is coded a bit
differently. (I'll present its coding later in this doc.)
Let's consider the original 64-value vector as a 63-value vector (the 64-value
vector without the first coefficient).

Say that we have 57,45,0,0,0,0,23,0,-30,-8,0,0,1,0,0,0,0,0,0,...,0 (only
zeroes until the end)

Here it is how the RLC JPEG compression is done for this example :

(0,57) ; (0,45) ; (4,23) ; (1,-30) ; (0,-8) ; (2,1) ; EOB

As you see, for each value different from 0 we encode the number of consecutive
zeroes PRECEDING that value, and then we add the value.
Another note: EOB is short for End Of Block; it's a special coded value
(a marker). If we've reached a position in the vector from which we have only
zeroes until the end of the vector, we mark that position with EOB and finish
the RLC compression of the quantized vector.

[Note that if the quantized vector doesn't finish with zeroes (the last
element is not 0), we will not have the EOB marker.]

ACTUALLY, EOB has (0,0) as an equivalent and it will be (later) Huffman coded
like (0,0), so we'll encode :
 (0,57) ; (0,45) ; (4,23) ; (1,-30) ; (0,-8) ; (2,1) ; (0,0)

Another MAJOR thing: say that somewhere in the quantized vector
 we have: 57, eighteen zeroes, 3, 0,0,0,0, 2, thirty-three zeroes, 895, EOB

The JPG Huffman coding imposes the restriction (you'll see later why) that
the number of previous 0s must be coded as a 4-bit value, so it cannot exceed
the value 15 (0xF).

So, the previous example would be coded as :
    (0,57) ; (15,0) (2,3) ; (4,2) ; (15,0) (15,0) (1,895) ; (0,0)

(15,0) is a special coded value which indicates that 16 consecutive zeroes
follow. Note: 16 zeroes, not 15 zeroes.
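
A rough C sketch of this zero run-length coding step (the rlc_pair struct and
the function name are mine, just to hold the (run, value) pairs for the
example):

typedef struct { int run; int value; } rlc_pair;   /* (nr_of_preceding_zeroes, value) */

/* ac[] = the 63 quantized AC coefficients (vector[1..63]);
   out[] must have room for 63 pairs; returns the number of pairs written */
int rlc_encode(const int ac[63], rlc_pair out[63]) {
    int n = 0, run = 0;
    for (int i = 0; i < 63; i++) {
        if (ac[i] == 0) { run++; continue; }
        while (run > 15) {                      /* (15,0) stands for 16 zeroes */
            out[n].run = 15; out[n].value = 0; n++;
            run -= 16;
        }
        out[n].run = run; out[n].value = ac[i]; n++;
        run = 0;
    }
    if (run > 0) {                              /* trailing zeroes -> EOB = (0,0) */
        out[n].run = 0; out[n].value = 0; n++;
    }
    return n;
}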

8) The final step === Huffman coding

First an IMPORTANT note: instead of storing the actual value, the JPEG standard
specifies that we store the minimum size in bits in which we can keep that value
(it's called the category of that value) and then a bit-coded representation
of that value, like this:

	     Values             Category        Bits for the value
		0                   0                   -
	      -1,1                  1                  0,1
	   -3,-2,2,3                2              00,01,10,11
     -7,-6,-5,-4,4,5,6,7            3    000,001,010,011,100,101,110,111
       -15,..,-8,8,..,15            4       0000,..,0111,1000,..,1111
      -31,..,-16,16,..,31           5     00000,..,01111,10000,..,11111
      -63,..,-32,32,..,63           6                   .
     -127,..,-64,64,..,127          7                   .
    -255,..,-128,128,..,255         8                   .
    -511,..,-256,256,..,511         9                   .
   -1023,..,-512,512,..,1023       10                   .
  -2047,..,-1024,1024,..,2047      11                   .
  -4095,..,-2048,2048,..,4095      12                   .
  -8191,..,-4096,4096,..,8191      13                   .
 -16383,..,-8192,8192,..,16383     14                   .
-32767,..,-16384,16384,..,32767    15                   .
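
The category and the bit-coded representation can be computed like this (a
sketch; the negative values use the "value minus 1, kept on 'category' low
bits" rule that the table above implies):

/* Category = minimum number of bits needed to hold the value (see table above). */
int category_of(int value)
{
    int cat = 0;
    int v = (value < 0) ? -value : value;
    while (v) { cat++; v >>= 1; }
    return cat;
}

/* Bit-coded representation: the value itself if positive, otherwise (value - 1)
   kept on 'cat' low bits.  E.g. category_of(-30) == 5, bits_of(-30, 5) == 00001. */
unsigned bits_of(int value, int cat)
{
    if (value >= 0) return (unsigned)value;
    return (unsigned)(value - 1) & ((1u << cat) - 1);
}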

In consequence for the previous example:
    (0,57) ; (0,45) ; (4,23) ; (1,-30) ; (0,-8) ; (2,1) ; (0,0)

let's encode ONLY the right value of these pairs, except for the pairs that are
special markers like (0,0) or (if we had any) (15,0):

    57 is in the category 6 and it is bit-coded 111001 , so we'll encode it
like (6,111001)
    45 , similar, will be coded as (6,101101)
    23  ->  (5,10111)
   -30  ->  (5,00001)
    -8  ->  (4,0111)
     1  ->  (1,1)

And now , we'll write again the string of pairs:

   (0,6), 111001 ; (0,6), 101101 ; (4,5), 10111; (1,5), 00001; (0,4) , 0111 ;
       (2,1), 1 ; (0,0)

The pairs of 2 values enclosed in parentheses can be represented in one byte,
because each of the 2 values can be represented in a nibble (the count of
previous zeroes is never larger than 15, and neither is the category of the
numbers [numbers encoded in a JPG file are in the range -32767..32767]).
In this byte, the high nibble holds the number of previous 0s, and the
low nibble is the category of the new non-zero value.

The FINAL step of the encoding consists of Huffman encoding this byte, and then
writing into the JPG file, as a stream of bits, the Huffman code of this byte,
followed by the remaining bit-representation of that number.

For example, let's say that for byte 6 ( the equivalent of (0,6) ) we have a
Huffman code = 111000;
    for byte 69 = (4,5) (for example) we have 1111111110011001
             21 = (1,5)    ---  11111110110
             4  = (0,4)    ---  1011
             33 = (2,1)    ---  11011
              0 = EOB = (0,0) ---  1010

The final stream of bits written in the JPG file on disk for the previous example
of 63 coefficients (remember that we've skipped the first coefficient ) is
      111000 111001  111000 101101  1111111110011001 10111   11111110110 00001
         1011 0111   11011 1   1010

The encoding of the DC coefficient
DC is the coefficient in the quantized vector corresponding to the lowest
frequency in the image (the 0 frequency), and (before quantization) it is
mathematically equal to (the sum of the 8x8 image samples) / 8.
(It's like an average value for that block of image samples.)
It is said that it contains a lot of the energy present in the original 8x8
image block. (It usually takes large values.)
The authors of the JPEG standard noticed that there's a very close connection
between the DC coefficients of consecutive blocks, so they decided to encode
in the JPG file the difference between the DCs of consecutive 8x8 blocks.
(Note: consecutive 8x8 blocks of the SAME image component, like consecutive
8x8 blocks for Y, or consecutive blocks for Cb, or for Cr.)

Diff = DC(i) - DC(i-1)

So the DC of the current block, DC(i), will be equal to:  DC(i) = DC(i-1) + Diff

And in JPG decoding you start from 0: you consider the DC coefficient before
the first block to be 0 (DC(0) = 0), and then you add to the current value the
value decoded from the JPG file (the Diff value).

SO, in the JPG file, the first coefficient (the DC coefficient) is actually
this difference, and it is Huffman encoded DIFFERENTLY from the AC coefficients.

Here it is how it's done:
(Remember that we now code the Diff value)

Diff corresponds, as you've seen before, to a representation made of a category
and its bit-coded representation.
In the JPG file only the category value will be Huffman encoded, like this:

Diff = (category, bit-coded representation)
Then Diff will be coded as (Huffman_code(category) , bit-coded representation)

For example, if Diff is equal to -511 , then Diff  corresponds to
                    (9, 000000000)
Say that 9 has a Huffman code = 1111110
(In the JPG file, there are 2 Huffman tables for an image component: one for DC
and one for AC)

In the JPG file, the bits corresponding to the DC coefficient will be:
	       1111110 000000000
And,applied to this example of DC and to the previous example of ACs, for this
vector with 64 coefficients, THE FINAL STREAM OF BITS written in the JPG file
will be:

   1111110 000000000 111000 111001  111000 101101  1111111110011001 10111
       11111110110 00001 1011 0111   11011 1   1010

(In the JPG file , first it's encoded DC then ACs)

THE HUFFMAN DECODER (A brief summary) for the 64 coefficients (A Data Unit)
of an image component (For example Y)

So when you decode a stream of bits from the image in the JPG file, you'll do:

Init DC with 0.

1) First decode the DC coefficient :
	 a) Fetch a valid Huffman code (check that it exists in the Huffman
                                           DC table)
         b) See to what category this Huffman code corresponds
         c) Fetch N = category bits, and determine what value is represented
           by (category, the N bits fetched) = Diff
         d) DC += Diff
         e) Write DC into the 64 vector :      " vector[0]=DC "

2) Decode the 63 AC coefficients :

------- FOR every AC coefficient UNTIL (EOB_encountered OR AC_counter=64)

       a) Fetch a valid Huffman code (check it in the AC Huffman table)
       b) Decode that Huffman code : the Huffman code corresponds to a pair
          (nr_of_previous_0, category)
[Remember: EOB_encountered = TRUE if (nr_of_previous_0, category) = (0,0) ]

       c) Fetch N = category bits, and determine what value is represented by
              (category, the N bits fetched) = AC_coefficient
       d) Write into the 64 vector a number of zeroes = nr_of_previous_0
       e) Increment AC_counter by nr_of_previous_0
       f) Write AC_coefficient into the vector and advance the counter:
                  " vector[AC_counter]=AC_coefficient ; AC_counter++ "

Next Steps
So, now we have a 64-element vector. We'll do the reverse of the steps presented
in this doc:

1) Dequantize the 64 vector : "for (i=0;i<=63;i++) vector[i]*=quant[i]"
2) Re-order the 64 vector from zig-zag into an 8x8 block
3) Apply the Inverse DCT transform to the 8x8 block

Repeat the above process [Huffman decoder, steps 1), 2) and 3)] for every
8x8 block of every image component (Y,Cb,Cr).

4) Up-sample if needed
5) Level shift the samples (add 128 to all the 8-bit signed values in the 8x8
blocks resulting from the IDCT transform)
6) Transform YCbCr to RGB

7) And VOILA ... the JPG image

The JPEG markers, or how the image information is organized in the JPG file
(the byte level)
NOTE: The JPEG/JFIF file format uses Motorola byte order for words, NOT Intel
order, i.e. high byte first, low byte last (e.g. the word FFA0 will be written
in the JPEG file in the order FF at the lower offset, A0 at the higher offset).

The JPG standard specifies that the JPEG file is composed mostly of pieces
called segments.
A segment is a stream of bytes with length <= 65535. The beginning of a segment
is specified by a marker.
A marker = 2 bytes beginning with 0xFF (the C hexadecimal notation for 255)
and ending with a byte different from 0 and 0xFF.
Ex: 'FFDA' , 'FFC4', 'FFC0'.
Each marker has a meaning: the second byte (different from 0 and 0xFF) specifies
what that marker does.
For example, there is a marker which specifies that you should start the decoding
process; it is called (in the JPG standard's terminology):
        SOS = Start Of Scan = 'FFDA'

Another marker, called DQT = Define Quantization Table = 0xFFDB, does what its
name says: it specifies that in the JPG file, after the marker (and after 3 bytes,
more on this later), there follow 64 bytes = the coefficients of the quantization
table.

If, during the processing of the JPG file, you encounter a 0xFF followed by a
byte different from 0 (I've told you that the second byte of a marker is not 0),
and this byte has no marker meaning (you cannot find a marker corresponding to
that byte), then the 0xFF byte you encountered must be ignored and skipped.
(In some JPGs, sequences of consecutive 0xFF bytes are there for filling
purposes and must be skipped.)

You see that whenever you encounter 0xFF, you check the next byte and see
whether that 0xFF has a marker meaning or must be skipped.
What happens if we actually need to encode the 0xFF byte in the JPG file
as a *usual* byte (not a marker, or a filling byte)?
(Say we need to write a Huffman code which begins with 11111111 (8 bits of 1)
at a byte alignment.)
The standard says that we simply make the next byte 0, and write the sequence
'FF00' in the JPG file.
So when your JPG decoder meets the 2-byte 'FF00' sequence, it should consider
it just one byte, 0xFF, treated as a usual byte.

Another thing: you realise that these markers are byte-aligned in the JPG file.
What happens if, during your Huffman encoding and inserting of bits into the JPG
file's bytes, you have not finished filling a byte but you need to write a
marker which is byte-aligned?
For the byte alignment of the markers, you SET THE REMAINING BITS UNTIL THE
BEGINNING OF THE NEXT BYTE TO 1, then you write the marker at the next byte.
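
An encoder-side sketch of a bit writer that obeys both rules (0x00 stuffing
after a data byte of 0xFF, and padding with 1-bits before a marker); emit_byte()
stands for whatever routine writes one byte to the output file:

extern void emit_byte(unsigned char b);        /* hypothetical output routine */

static unsigned bit_buf = 0;                   /* bits accumulated so far, MSB first */
static int      bit_cnt = 0;

void put_bits(unsigned bits, int n)            /* write the n low bits, MSB first */
{
    for (int i = n - 1; i >= 0; i--) {
        bit_buf = (bit_buf << 1) | ((bits >> i) & 1);
        if (++bit_cnt == 8) {
            emit_byte((unsigned char)bit_buf);
            if (bit_buf == 0xFF) emit_byte(0x00);   /* byte stuffing: FF -> FF 00 */
            bit_buf = 0;
            bit_cnt = 0;
        }
    }
}

void flush_before_marker(void)                 /* pad the current byte with 1-bits */
{
    while (bit_cnt != 0) put_bits(1, 1);
}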

A short explanation of some important markers found in a JPG file.

SOI = Start Of Image = 'FFD8'
 This marker must be present in any JPG file *once*, at the beginning of the file.
(Any JPG file starts with the sequence FFD8.)
EOI = End Of Image = 'FFD9'
  Similar to SOI: any JPG file ends with FFD9.

RSTi = FFDi (where i is in range 0..7)  [ RST0 = FFD0, RST7=FFD7]
     = Restart Markers
These restart markers are used for resync. They appear at regular intervals
in the JPG stream of bytes, inside the entropy-coded data (after SOS).
(They appear in the order: RST0 -- interval -- RST1 -- interval -- RST2 --...
                      ...-- RST6 -- interval -- RST7 -- interval -- RST0 --...)
(Note: a lot of JPGs don't have restart markers.)

The problem with these markers is that they interrupt the normal bit order in
the JPG's Huffman-encoded bitstream.
Remember that, for the byte alignment of the markers, the remaining bits are set
to 1, so your decoder has to skip, at regular intervals, the useless filling
bits (those set to 1) and the RST markers.

At the end of this doc, I've included a very well written technical explanation
of the JPEG/JFIF file format, written by Oliver Fromme, the author of the QPEG
viewer. There you'll find a pretty good and complete definition of the markers.

But, anyway, here is a list of markers you should check:

SOF0 = Start Of Frame 0 = FFC0
SOS  = Start Of Scan    = FFDA
APP0 = it's the marker used to identify a JPG file which uses the JFIF
    specification       = FFE0
COM  = Comment          = FFFE
DNL  = Define Number of Lines    = FFDC
DRI  = Define Restart Interval   = FFDD
DQT  = Define Quantization Table = FFDB
DHT  = Define Huffman Table      = FFC4

The Huffman table stored in a JPG file
Here is how JPEG implements the Huffman tree: instead of a tree, it defines
a table in the JPG file after the DHT (Define Huffman Table) marker.
NOTE: The length of the Huffman codes is restricted to 16 bits.

Basically there are 2 types of Huffman tables in a JPG file : one for DC and
one for AC (actually there are 4 Huffman tables: 2 for the DC and AC of the
luminance, and 2 for the DC and AC of the chrominance).

They are stored in the JPG file in the same format, which consists of:
1) 16 bytes :

byte i contains the number of Huffman codes of length i (length in bits),
 i ranging from 1 to 16
2) A table with length (in bytes) = sum of nr_codes_of_length_i,

which contains at location [k][j]  (k in 1..16, j in 0..(nr_codes_with_length_k-1))
the BYTE value associated to the j-th Huffman code of length k.
(For a fixed length k, the values are stored sorted by the value of the Huffman
code.)

From this table you can find the actual Huffman code associated to a particular
byte value. Here is an example of how the actual code values are generated:

Ex:  (Note: The number of codes for a given length are here for this particular
      example to figure it out, they can have any other values)
SAY that,

         For length 1 we have nr_codes[1]=0, we skip this length
         For length 2 we have 2 codes  00
         For length 3 we have 3 codes  100
         For length  4 we have 1 code  1110
         For length  5 we have 1 code  11110
         For length  6 we have 1 code  111110
         For length  7 we have 0 codes  -- skip
 (if we had 1 code for length 7,
          we would have                1111110)
         For length  8 we have 1 code  11111100 (You see that the code is still
                                                 shifted to left though we skipped
                                                 the code value for 7)
         For length 16, .... (the same thing)

I've told you that in the Huffman table in the JPG file are stored the BYTE values
for a given code.

For this particular example of Huffman codes:
Say that in the Huffman table in the JPG file on disk we have (after those 16
bytes which contain the number of Huffman codes of each length):
    45 57 29 17 23 25 34 28 53
These values correspond, given the particular lengths I gave you before,
to the Huffman codes like this :

    there's no value for a code of length 1
    for codes of length 2 : we have 45 57
    for codes of length 3 : 3 values (ex : 29,17,23)
    for codes of length 4 : only 1 value (ex: 25)
    for codes of length 5 : 1 value ( ex: 34)
    for codes of length 6 : 1 value ( ex: 28)
    for codes of length 7, again no value, skip to codes of length 8
    for codes of length 8 : 1 value (ex: 53)

  For codes of length 2:
      the value 45 corresponds to code 00
                57             to code 01
  For codes of length 3:
      the value 29 corresponds to code  100
                17       ----||---      101
                23       ----||---      110

(I've told you that for a given length the byte values are stored in order of
increasing value of the Huffman code.)
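
The rule for generating the actual codes from the 16 length counts (codes of
the same length are consecutive values, and the running value is shifted left
when moving to the next length) can be sketched like this; applied to the
counts of the example above, it reproduces exactly the codes 00, 01, 100, 101,
110, 1110, 11110, 111110, 11111100:

/* nr_codes[i] = number of Huffman codes of length i+1 (the 16 bytes after DHT).
   codes[]/lengths[] receive the generated codes, in the same order in which the
   symbol bytes are stored in the file.  A sketch of the rule described above.   */
void build_codes(const unsigned char nr_codes[16], unsigned codes[], int lengths[])
{
    unsigned code = 0;
    int k = 0;
    for (int len = 1; len <= 16; len++) {
        for (int i = 0; i < nr_codes[len - 1]; i++) {
            codes[k]   = code++;             /* consecutive values within one length */
            lengths[k] = len;
            k++;
        }
        code <<= 1;                          /* move on to the next (longer) length  */
    }
}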

Four Huffman tables, corresponding to the DC and AC tables of the luminance and
the DC and AC tables of the chrominance, are given in an annex of the JPEG
standard as a suggestion for the encoder.
 The standard says that these tables have been tested with good compression
results on a lot of images and recommends them, but the encoder can use any
other Huffman table. A lot of JPG encoders use these tables. Some of them offer
you an option: entropy optimization - if it's enabled, they'll use Huffman
tables optimized for that particular image.

The JFIF (JPEG File Interchange Format) file
	The JPEG standard itself is somehow very general; the JFIF implementation
is a particular case of this standard (and it is, of course, compatible with the
standard).
	  The JPEG standard specifies some markers reserved for applications
(by applications I mean particular implementations of the standard).
 Those markers are called APPn, where n ranges from 0 to 0xF ; APPn = FFEn.
 The JFIF specification uses the APP0 marker (FFE0) to identify a JPG file which
uses this specification.
 You'll see that the JPEG standard refers to "image components".
These image components can be (Y,Cb,Cr) or (Y,I,Q) or whatever.
 The JFIF implementation uses only (Y,Cb,Cr) for a truecolor JPG, or only Y for
a monochrome JPG.
 The JFIF specification itself is freely available on the net.

The sampling factors

Note: The following explanation covers the encoding of truecolor (3-component)
JPGs; gray-scaled JPGs have one component (Y), which is usually not
down-sampled at all and does not require any inverse transformation like the
inverse (Y,Cb,Cr) -> (R,G,B). In consequence, gray-scaled JPGs are the
simplest and easiest to decode: for every 8x8 block in the image you do the
Huffman decoding of the RLC-coded vector, then you reorder it from zig-zag,
dequantize the 64 vector, and finally you apply the inverse DCT to it and add
128 (level shift) to the new 8x8 values.

I've told you that the image components are sampled. Usually Y is taken for
every pixel, and Cb, Cr are taken for a block of 2x2 pixels.
But there are some JPGs in which Cb and Cr are taken for every pixel, and some
JPGs where Cb and Cr are taken every 2 pixels horizontally and every pixel
vertically.
The sampling factors for an image component in a JPG file are defined relative
to the highest sampling factor.

Here are the sampling factors for the most usual example:
	Y is taken for every pixel, and Cb, Cr are taken for a block of 2x2 pixels.
(The JFIF specification gives a formula for the sampling factors which I think
works only when the maximum sampling factor for each dimension, X or Y, is <= 2.
The JPEG standard does not restrict the sampling factors; it's more general.)

You see that Y will have the highest sampling rate :
	 For Y,    Horizontal sampling factor = 2  = HY
		   Vertical sampling factor   = 2  = VY
	 For Cb,   Horizontal sampling factor = 1  = HCb
		   Vertical sampling factor   = 1  = VCb
	 For Cr,   Horizontal sampling factor = 1  = HCr
		   Vertical sampling factor   = 1  = VCr
Actually this form of defining the sampling factors is quite useful.
The vector of 64 coefficients for an image component, Huffman encoded, is called
	DU = Data Unit (JPEG's standard terminology)

In the JPG file , the order of encoding Data Units is :
	 1) encode Data Units for the first image component:
			 for  (counter_y=1;counter_y<=VY;counter_y++)
				  for (counter_x=1;counter_x<=HY;counter_x++)
					 {  encode Data Unit for Y }

	 2) encode Data Units for the second image component:
			 for  (counter_y=1;counter_y<=VCb ;counter_y++)
				  for (counter_x=1;counter_x<=HCb;counter_x++)
					 {  encode Data Unit for Cb }

	 3) finally, for the third component, similar:
			 for  (counter_y=1;counter_y<=VCr;counter_y++)
				  for (counter_x=1;counter_x<=HCr;counter_x++)
					 {  encode Data Unit for Cr }

For the example I gave you (HY=2, VY=2 ; HCb=VCb=1 ; HCr=VCr=1)
here is a figure (I think it will clear things up for you) :
	  YDU YDU
	  YDU YDU    CbDU   CrDU
( YDU is a Data Unit for Y , and similarly CbDU is a DU for Cb, CrDU a DU for Cr )
This usual combination of sampling factors is referred to as 2:1:1 for both
the vertical and horizontal sampling factors.
And, of course, in the JPG file the encoding order will be :
	  YDU YDU YDU YDU CbDU CrDU

You know that a DU (64 coefficients) defines a block of 8x8 values, so here
we specified the encoding order for a block of 16x16 image pixels
(an image pixel = an (Y,Cb,Cr) pixel [my notation]) :
  Four 8x8 blocks of Y values (4 YDUs), one 8x8 block of Cb values (1 CbDU)
and one 8x8 block of Cr values (1 CrDU).

(Hmax = the maximum horizontal sampling factor , Vmax = the maximum vertical
sampling factor)
In consequence, for this example of sampling factors (Hmax=2, Vmax=2), the
encoder should process SEPARATELY every 16x16 (= Hmax*8 x Vmax*8) block of
image pixels, in the order mentioned.

This block of image pixels with the dimensions (Hmax*8,Vmax*8) is called, in
the JPG's standard terminology, an MCU = Minimum Coded Unit
For the previous example : MCU = YDU,YDU,YDU,YDU,CbDU,CrDU

Another example of sampling factors :
	  HY =1, VY=1
	  HCb=1, VCb=1
	  HCr=1, VCr=1
Figure/order :  YDU CbDU CrDU
You see that here is defined an 8x8 image pixel block (MCU) with 3 8x8 blocks:
	 one for Y, one for Cb and one for Cr (There's no down-sampling at all)
Here (Hmax=1,Vmax=1) the MCU has the dimension (8,8), and MCU = YDU,CbDU,CrDU

For gray-scaled JPGs you don't have to worry about the order of encoding
data units in an MCU. For these JPGs, an MCU = 1 Data Unit (MCU = YDU)

In the JPG file, the sampling factors for every image component are defined
after the marker SOF0 = Start Of Frame 0 = FFC0

A brief scheme of decoding a JPG file
The decoder reads the sampling factors from the JPG file, finds out the
dimensions of an MCU (Hmax*8, Vmax*8) and, from these, how many MCUs are in the
whole image, then decodes every MCU present in the original image (a loop over
all these blocks, or until the EOI marker is found [it should be found when the
loop finishes, otherwise you'll get an incomplete image]). It decodes an MCU
by decoding every Data Unit in the MCU in the order mentioned before, and
finally writes the decoded (Hmax*8 x Vmax*8) truecolor pixel block into the
(R,G,B) image buffer.
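
The MCU bookkeeping mentioned here is just a couple of divisions; a sketch
(width and height come from the SOF0 segment, Hmax and Vmax from the sampling
factors):

/* Number of MCUs in the image, counting partial MCUs at the right/bottom edge. */
int count_mcus(int image_width, int image_height, int Hmax, int Vmax)
{
    int mcu_w   = Hmax * 8;
    int mcu_h   = Vmax * 8;
    int per_row = (image_width  + mcu_w - 1) / mcu_w;   /* round up */
    int per_col = (image_height + mcu_h - 1) / mcu_h;
    return per_row * per_col;
}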

MPEG-1 video and JPEG
An interesting part of the MPEG-1 specification (and probably MPEG-2) is that
it relies heavily on the JPEG specification.
It uses a lot of the concepts presented here. The reason is that every 15 frames,
or whenever it's needed, there's an independent frame called an I-frame
(Intra frame) which is JPEG coded.
(By the way, that 16x16 image pixel block example I gave you is called, in
MPEG's standard terminology, a macroblock.)
Except for the algorithms for motion compensation, MPEG-1 video relies a lot on
the JPG specification (the DCT transform, quantization, etc.).

Hope you're ready now to start coding your JPG viewer or encoder.

About the author of this doc
My name is Cristi Cuturicu.
I'm a student at University Politehnica in Bucharest (UPB), Department of
Computer Science.
I'm not an expert in compression, I made a JPEG encoder/decoder because I
needed it for a project.

You can contact me by e-mail: (school email)
			or (preferably)

A technical explanation of the JPEG/JFIF file format,
written by Oliver Fromme, the author of the QPEG viewer
Legal NOTE: The legal rules mentioned in the disclaimer at the top of this file
also apply to the following information, so neither Oliver Fromme nor I
can be held responsible for errors or bugs in the following information.

The author of the following information is:
   Oliver Fromme
   Leibnizstr. 18-61
   38678 Clausthal

JPEG/JFIF file format:

  - header (2 bytes):  $ff, $d8 (SOI) (these two identify a JPEG/JFIF file)
  - for JFIF files, an APP0 segment is immediately following the SOI marker,
	see below
  - any number of "segments" (similar to IFF chunks), see below
  - trailer (2 bytes): $ff, $d9 (EOI)

Segment format:

  - header (4 bytes):
	   $ff     identifies segment
		n      type of segment (one byte)
	   sh, sl  size of the segment, including these two bytes, but not
			   including the $ff and the type byte. Note, not Intel order:
			   high byte first, low byte last!
  - contents of the segment, max. 65533 bytes.

  - There are parameterless segments (denoted with a '*' below) that DON'T
	have a size specification (and no contents), just $ff and the type byte.
  - Any number of $ff bytes between segments is legal and must be skipped.
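
A sketch of the loop a decoder uses to walk these segments (read_byte(),
read_word() and skip_bytes() are hypothetical input routines; read_word()
reads the high byte first, as noted above):

extern int  read_byte(void);                 /* hypothetical input routines */
extern int  read_word(void);                 /* big-endian: high byte first */
extern void skip_bytes(int n);

void walk_segments(void)
{
    for (;;) {
        if (read_byte() != 0xFF) continue;   /* hunt for the next marker byte      */
        int type = read_byte();
        if (type == 0xFF) continue;          /* fill bytes between segments: skip  */
        if (type == 0xD9) return;            /* EOI: done                          */
        if (type == 0xD8 || type == 0x01 ||
            (type >= 0xD0 && type <= 0xD7))  /* SOI, TEM, RST0..RST7: no contents  */
            continue;
        int length = read_word();            /* includes these two length bytes    */
        /* a real decoder dispatches on 'type' here (SOF0, DHT, DQT, SOS, APP0...);
           after SOS the entropy-coded data follows and is handled separately      */
        skip_bytes(length - 2);
    }
}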

Segment types:

   *TEM   = $01   usually causes a decoding error, may be ignored

	SOF0  = $c0   Start Of Frame (baseline JPEG), for details see below
	SOF1  = $c1   ditto
	SOF2  = $c2   usually unsupported
	SOF3  = $c3   usually unsupported

	SOF5  = $c5   usually unsupported
	SOF6  = $c6   usually unsupported
	SOF7  = $c7   usually unsupported

	SOF9  = $c9   for arithmetic coding, usually unsupported
	SOF10 = $ca   usually unsupported
	SOF11 = $cb   usually unsupported

	SOF13 = $cd   usually unsupported
	SOF14 = $ce   usually unsupported
	SOF15 = $cf   usually unsupported

	DHT   = $c4   Define Huffman Table, for details see below
	JPG   = $c8   undefined/reserved (causes decoding error)
	DAC   = $cc   Define Arithmetic Table, usually unsupported

   *RST0  = $d0   RSTn are used for resync, may be ignored
   *RST1  = $d1
   *RST2  = $d2
   *RST3  = $d3
   *RST4  = $d4
   *RST5  = $d5
   *RST6  = $d6
   *RST7  = $d7

	SOI   = $d8   Start Of Image
	EOI   = $d9   End Of Image
	SOS   = $da   Start Of Scan, for details see below
	DQT   = $db   Define Quantization Table, for details see below
	DNL   = $dc   usually unsupported, ignore

	DRI   = $dd   Define Restart Interval, for details see below
	DHP   = $de   ignore (skip)
	EXP   = $df   ignore (skip)

	APP0  = $e0   JFIF APP0 segment marker, for details see below
	APP15 = $ef   ignore

	JPG0  = $f0   ignore (skip)
	JPG13 = $fd   ignore (skip)
	COM   = $fe   Comment, for details see below

 All other segment types are reserved and should be ignored (skipped).

SOF0: Start Of Frame 0:

  - $ff, $c0 (SOF0)
  - length (high byte, low byte), 8+components*3
  - data precision (1 byte) in bits/sample, usually 8 (12 and 16 not
	supported by most software)
  - image height (2 bytes, Hi-Lo), must be >0 if DNL not supported
  - image width (2 bytes, Hi-Lo), must be >0 if DNL not supported
  - number of components (1 byte), usually 1 = grey scaled, 3 = colour YCbCr
	or YIQ, 4 = colour CMYK
  - for each component: 3 bytes
	 - component id (1 = Y, 2 = Cb, 3 = Cr, 4 = I, 5 = Q)
	 - sampling factors (bit 0-3 vert., 4-7 hor.)
	 - quantization table number

  - JFIF uses either 1 component (Y, greyscaled) or 3 components (YCbCr,
	sometimes called YUV, colour).

APP0: JFIF segment marker:

  - $ff, $e0 (APP0)
  - length (high byte, low byte), must be >= 16
  - 'JFIF'#0 ($4a, $46, $49, $46, $00), identifies JFIF
  - major revision number, should be 1 (otherwise error)
  - minor revision number, should be 0..2 (otherwise try to decode anyway)
  - units for x/y densities:
	 0 = no units, x/y-density specify the aspect ratio instead
	 1 = x/y-density are dots/inch
	 2 = x/y-density are dots/cm
  - x-density (high byte, low byte), should be <> 0
  - y-density (high byte, low byte), should be <> 0
  - thumbnail width (1 byte)
  - thumbnail height (1 byte)
  - n bytes for thumbnail (RGB 24 bit), n = width*height*3

  - If there's no 'JFIF'#0, or the length is < 16, then it is probably not
	a JFIF segment and should be ignored.
  - Normally units=0, x-dens=1, y-dens=1, meaning that the aspect ratio is
	1:1 (evenly scaled).
  - JFIF files including thumbnails are very rare, the thumbnail can usually
	be ignored.  If there's no thumbnail, then width=0 and height=0.
  - If the length doesn't match the thumbnail size, a warning may be
	printed, then continue decoding.

DRI: Define Restart Interval:

  - $ff, $dd (DRI)
  - length (high byte, low byte), must be = 4
  - restart interval (high byte, low byte) in units of MCU blocks,
	meaning that every n MCU blocks a RSTn marker can be found.
	The first marker will be RST0, then RST1 etc, after RST7
	repeating from RST0.

DQT: Define Quantization Table:

  - $ff, $db (DQT)
  - length (high byte, low byte)
  - QT information (1 byte):
	 bit 0..3: number of QT (0..3, otherwise error)
	 bit 4..7: precision of QT, 0 = 8 bit, otherwise 16 bit
  - n bytes QT, n = 64*(precision+1)

  - A single DQT segment may contain multiple QTs, each with its own
	information byte.
  - For precision=1 (16 bit), the order is high-low for each of the 64 words.

DAC: Define Arithmetic Table:
 Current software does not support arithmetic coding for legal reasons.
 JPEG files using arithmetic coding can not be processed.

DHT: Define Huffman Table:

  - $ff, $c4 (DHT)
  - length (high byte, low byte)
  - HT information (1 byte):
	 bit 0..3: number of HT (0..3, otherwise error)
	 bit 4   : type of HT, 0 = DC table, 1 = AC table
	 bit 5..7: not used, must be 0
  - 16 bytes: number of symbols with codes of length 1..16, the sum of these
	bytes is the total number of codes, which must be <= 256
  - n bytes: table containing the symbols in order of increasing code length
	(n = total number of codes)

  - A single DHT segment may contain multiple HTs, each with its own
	information byte.

COM: Comment:

  - $ff, $fe (COM)
  - length (high byte, low byte) of the comment = L+2
  - The comment = a stream of bytes with the length = L

SOS: Start Of Scan:

  - $ff, $da (SOS)
  - length (high byte, low byte), must be 6+2*(number of components in scan)
  - number of components in scan (1 byte), must be >= 1 and <=4 (otherwise
	error), usually 1 or 3
  - for each component: 2 bytes
	 - component id (1 = Y, 2 = Cb, 3 = Cr, 4 = I, 5 = Q), see SOF0
	 - Huffman table to use:
	- bit 0..3: AC table (0..3)
	- bit 4..7: DC table (0..3)
  - 3 bytes to be ignored for baseline (they hold the spectral selection
	start/end and successive approximation, which are fixed for baseline JPEG)

  - The image data (scans) is immediately following the SOS segment.

Thanks. As I understood, do these steps apply to the captured image? Therefore, the resulting image has the .jpeg extension?

As far as I know, the ESP32-cam snapshots are in JPEG format,
which is, as shown by the above ancient document, non-trivial.

I wonder why you are asking.

If you want to shrink the size of the snapshots, further compression is not worth the trouble,
because the JPEG data is already compressed,
so most repetitions and unnecessary information have already been removed.

If you need smaller snapshots, decrease the quality or the dimensions.
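
On the ESP32 side those two knobs are part of the esp32-camera configuration; a
sketch (camera init and pin setup omitted):

#include "esp_camera.h"

// In camera_config_t, before esp_camera_init():
//    config.frame_size   = FRAMESIZE_QVGA;    // smaller dimensions (320x240)
//    config.jpeg_quality = 20;                // 0-63, higher number = lower quality = smaller file

// Or at run time, through the sensor interface:
void shrinkSnapshots() {
  sensor_t *s = esp_camera_sensor_get();
  if (s != NULL) {
    s->set_framesize(s, FRAMESIZE_QVGA);       // reduce the resolution
    s->set_quality(s, 20);                     // reduce the JPEG quality
  }
}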

If I used another format like PNG, is there any capability to compress this type or another type?

So you want to change the JPEG compression to PNG compression?
Judging from your basic understanding of the matter, that will be a major task.

Why do you need smaller image files?

JPEG and PNG are pretty good at compression ratios, so the only way to get smaller pictures is

  • lower the quality
  • lower the size of the picture

It might help us if you would say what you want to use the ESP32CAM for, what the application is, etc.

There are other ways you can access the captured image, including raw pixels, BMP format, grayscale etc.

Partly described here with examples: GitHub - espressif/esp32-camera
This link is helpful, too: esp32_cam acces and process image

This code captures an image and prints it out on the serial port, in grayscale:

#if !defined ESP32
#error This sketch is only for an ESP32Cam module
#endif

#include "esp_camera.h"
// #include "camera_pins.h"

// ---------------------------------------------------------------
//                           -SETTINGS
// ---------------------------------------------------------------
// from
const char* stitle = "ESP32Cam-demo-gs";               // title of this sketch
const char* sversion = "10Jul21";                      // Sketch version

// Camera related
const bool flashRequired = 1;                        // If flash to be used when capturing image (1 = yes)
const framesize_t FRAME_SIZE_IMAGE = FRAMESIZE_QVGA;// Image resolution: was QQVGA 160x120
//               default = "const framesize_t FRAME_SIZE_IMAGE = FRAMESIZE_VGA"
//               160x120 (QQVGA), 128x160 (QQVGA2), 176x144 (QCIF), 240x176 (HQVGA),
//               320x240 (QVGA), 400x296 (CIF), 640x480 (VGA, default), 800x600 (SVGA),
//               1024x768 (XGA), 1280x1024 (SXGA), 1600x1200 (UXGA)
#define PIXFORMAT PIXFORMAT_GRAYSCALE;               // image format, Options =  YUV422, GRAYSCALE, RGB565, JPEG, RGB888                                                         
#define WIDTH 320                                    // image size
#define HEIGHT 240

int cameraImageExposure = 0;                         // Camera exposure (0 - 1200)   If gain and exposure both set to zero then auto adjust is enabled
int cameraImageGain = 0;                             // Image gain (0 - 30)

const int TimeBetweenStatus = 600;                     // speed of flashing system running ok status light (milliseconds)

const int indicatorLED = 33;                           // onboard small LED pin (33)

const int brightLED = 4;                               // onboard Illumination/flash LED pin (4)

const int iopinA = 13;                                 // general io pin 13
const int iopinB = 12;                                 // general io pin 12 (must not be high at boot)
const int iopinC = 16;                                 // input only pin 16 (used by PSRam but you may get away with using it for a button)

const int serialSpeed = 115200;                        // Serial data speed to use

// camera settings (for the standard - OV2640 - CAMERA_MODEL_AI_THINKER)
// see:
// set camera resolution etc. in 'initialiseCamera()' and 'cameraImageSettings()'
#define PWDN_GPIO_NUM     32      // power to camera (on/off)
#define RESET_GPIO_NUM    -1      // -1 = not used
#define XCLK_GPIO_NUM      0
#define SIOD_GPIO_NUM     26      // i2c sda
#define SIOC_GPIO_NUM     27      // i2c scl
#define Y9_GPIO_NUM       35
#define Y8_GPIO_NUM       34
#define Y7_GPIO_NUM       39
#define Y6_GPIO_NUM       36
#define Y5_GPIO_NUM       21
#define Y4_GPIO_NUM       19
#define Y3_GPIO_NUM       18
#define Y2_GPIO_NUM        5
#define VSYNC_GPIO_NUM    25      // vsync_pin
#define HREF_GPIO_NUM     23      // href_pin
#define PCLK_GPIO_NUM     22      // pixel_clock_pin

// ******************************************************************************************************************

#include "driver/ledc.h"      // used to configure pwm on illumination led

// Used to disable brownout detection
#include "soc/soc.h"
#include "soc/rtc_cntl_reg.h"

// sd-card
#include "SD_MMC.h"                         // sd card - see
#include <SPI.h>
#include <FS.h>                             // gives file access 
#define SD_CS 5                             // sd chip select pin = 5

// Define some global variables:
uint32_t lastStatus = millis();           // last time status light changed status (to flash all ok led)
uint32_t lastCamera = millis();           // timer for periodic image capture
bool sdcardPresent;                       // flag if an sd card is detected
int imageCounter;                         // image file name on sd card counter
uint32_t illuminationLEDstatus;           // current brightness setting of the illumination led

void setup() {

  Serial.begin(serialSpeed);                     // Start serial communication

  Serial.println("\n");                      // line feeds
  Serial.printf("Starting - %s - %s \n", stitle, sversion);
  // Serial.print("Reset reason: " + ESP.getResetReason());

  WRITE_PERI_REG(RTC_CNTL_BROWN_OUT_REG, 0);     // Turn-off the 'brownout detector'

  // small indicator led on rear of esp32cam board
  pinMode(indicatorLED, OUTPUT);

  digitalWrite(indicatorLED, LOW);              // small indicator led on
  digitalWrite(indicatorLED, HIGH);             // small indicator led off

  // set up camera
  Serial.print(("\nInitialising camera: "));
  if (initialiseCamera()) {
    Serial.println("OK");
  }
  else {
    Serial.println("failed");
  }

  // define i/o pins
  pinMode(indicatorLED, OUTPUT);            // defined again as sd card config can reset it
  digitalWrite(indicatorLED, HIGH);         // led off = High
  pinMode(iopinA, OUTPUT);                  // pin 13 - free io pin, can be used for input or output
  pinMode(iopinB, OUTPUT);                  // pin 12 - free io pin, can be used for input or output (must not be high at boot)
  pinMode(iopinC, INPUT);                   // pin 16 - free input only pin

  // startup complete
  Serial.println("\nSetup complete...");

}  // setup

void loop() {

  //  //  demo to Capture an image and save to sd card every 5 seconds (i.e. time lapse)
  //      if ( ((unsigned long)(millis() - lastCamera) >= 5000) && sdcardPresent ) {
  //        lastCamera = millis();     // reset timer
  //        storeImage();              // save an image to sd card
  //      }
  if ( (unsigned long)(millis() - lastCamera) >= 15000UL) { //15 sec
    lastCamera = millis();        // reset timer
    capture_still();              // collect and send image out serial port
  }

  // flash status LED to show sketch is running ok
  if ((unsigned long)(millis() - lastStatus) >= TimeBetweenStatus) {
    lastStatus = millis();                                               // reset timer
    digitalWrite(indicatorLED, !digitalRead(indicatorLED));              // flip indicator led status
  }

}  // loop

// ******************************************************************************************************************

// ----------------------------------------------------------------
//                        Initialise the camera
// ----------------------------------------------------------------
// returns TRUE if successful

bool initialiseCamera() {

  camera_config_t config;

  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sscb_sda = SIOD_GPIO_NUM;
  config.pin_sscb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 20000000;               // XCLK 20MHz or 10MHz for OV2640 double FPS (Experimental)
  config.pixel_format = PIXFORMAT;              // Options =  YUV422, GRAYSCALE, RGB565, JPEG, RGB888
  config.frame_size = FRAME_SIZE_IMAGE;         // Image sizes: 160x120 (QQVGA), 128x160 (QQVGA2), 176x144 (QCIF), 240x176 (HQVGA), 320x240 (QVGA),
  //              400x296 (CIF), 640x480 (VGA, default), 800x600 (SVGA), 1024x768 (XGA), 1280x1024 (SXGA),
  //              1600x1200 (UXGA)
  config.jpeg_quality = 10;                     // 0-63 lower number means higher quality
  config.fb_count = 1;                          // if more than one, i2s runs in continuous mode. Use only with JPEG

  // check the esp32cam board has a psram chip installed (extra memory used for storing captured images)
  //    Note: if not using "AI thinker esp32 cam" in the Arduino IDE, PSRAM must be enabled
  if (!psramFound()) {
    Serial.println("Warning: No PSRam found so defaulting to image size 'CIF'");
    config.frame_size = FRAMESIZE_CIF;
  }

  //#if defined(CAMERA_MODEL_ESP_EYE)
  //  pinMode(13, INPUT_PULLUP);
  //  pinMode(14, INPUT_PULLUP);

  esp_err_t camerr = esp_camera_init(&config);  // initialise the camera
  if (camerr != ESP_OK) {
    Serial.printf("ERROR: Camera init failed with error 0x%x", camerr);
  }

  cameraImageSettings();                        // apply custom camera settings

  return (camerr == ESP_OK);                    // return boolean result of camera initialisation
}  // initialiseCamera

// ----------------------------------------------------------------
//                   -Change camera image settings
// ----------------------------------------------------------------
// Adjust image properties (brightness etc.)
// Defaults to auto adjustments if exposure and gain are both set to zero
// - Returns TRUE if successful
// BTW - some interesting info on exposure times here:

bool cameraImageSettings() {

  sensor_t *s = esp_camera_sensor_get();
  // something to try?:     if (s->id.PID == OV3660_PID)
  if (s == NULL) {
    Serial.println("Error: problem reading camera sensor settings");
    return 0;
  }

  // if both set to zero enable auto adjust
  if (cameraImageExposure == 0 && cameraImageGain == 0) {
    // enable auto adjust
    s->set_gain_ctrl(s, 1);                       // auto gain on
    s->set_exposure_ctrl(s, 1);                   // auto exposure on
    s->set_awb_gain(s, 1);                        // Auto White Balance enable (0 or 1)
  } else {
    // Apply manual settings
    s->set_gain_ctrl(s, 0);                       // auto gain off
    s->set_awb_gain(s, 1);                        // Auto White Balance enable (0 or 1)
    s->set_exposure_ctrl(s, 0);                   // auto exposure off
    s->set_agc_gain(s, cameraImageGain);          // set gain manually (0 - 30)
    s->set_aec_value(s, cameraImageExposure);     // set exposure manually  (0-1200)
  }

  return 1;
}  // cameraImageSettings

//    // More camera settings available:
//    // If you enable gain_ctrl or exposure_ctrl it will prevent a lot of the other settings having any effect
//    // more info on settings here:
//    s->set_gain_ctrl(s, 0);                       // auto gain off (1 or 0)
//    s->set_exposure_ctrl(s, 0);                   // auto exposure off (1 or 0)
//    s->set_agc_gain(s, cameraImageGain);          // set gain manually (0 - 30)
//    s->set_aec_value(s, cameraImageExposure);     // set exposure manually  (0-1200)
//    s->set_vflip(s, cameraImageInvert);           // Invert image (0 or 1)
//    s->set_quality(s, 10);                        // (0 - 63)
//    s->set_gainceiling(s, GAINCEILING_32X);       // Image gain (GAINCEILING_x2, x4, x8, x16, x32, x64 or x128)
//    s->set_brightness(s, cameraImageBrightness);  // (-2 to 2) - set brightness
//    s->set_lenc(s, 1);                            // lens correction? (1 or 0)
//    s->set_saturation(s, 0);                      // (-2 to 2)
//    s->set_contrast(s, cameraImageContrast);      // (-2 to 2)
//    s->set_sharpness(s, 0);                       // (-2 to 2)
//    s->set_hmirror(s, 0);                         // (0 or 1) flip horizontally
//    s->set_colorbar(s, 0);                        // (0 or 1) - show a testcard
//    s->set_special_effect(s, 0);                  // (0 to 6?) apply special effect
//    s->set_whitebal(s, 0);                        // white balance enable (0 or 1)
//    s->set_awb_gain(s, 1);                        // Auto White Balance enable (0 or 1)
//    s->set_wb_mode(s, 0);                         // 0 to 4 - if awb_gain enabled (0 - Auto, 1 - Sunny, 2 - Cloudy, 3 - Office, 4 - Home)
//    s->set_dcw(s, 0);                             // downsize enable? (1 or 0)?
//    s->set_raw_gma(s, 1);                         // (1 or 0)
//    s->set_aec2(s, 0);                            // automatic exposure sensor?  (0 or 1)
//    s->set_ae_level(s, 0);                        // auto exposure levels (-2 to 2)
//    s->set_bpc(s, 0);                             // black pixel correction
//    s->set_wpc(s, 0);                             // white pixel correction

// ----------------------------------------------------------------
//      -access image as greyscale data - i.e. http://x.x.x.x/
// ----------------------------------------------------------------

bool capture_still() {
  uint16_t x, y;

  Serial.print("***** greyscale\n");
  camera_fb_t *frame = esp_camera_fb_get();

  if (!frame)
    return false;

  int npix = 0;
  // for each pixel in image
  for (size_t i = 0; i < frame->len; i++) {
    x = i % WIDTH;                                  // x position in image
    y = floor(i / WIDTH);                           // y position in image
    byte pixel = frame->buf[i];                     // pixel value

    // show data
    Serial.print((unsigned int)pixel);
    Serial.print(" ");                              // separate the pixel values
    if (x == WIDTH - 1) Serial.print("\n");         // new line at the end of each image row
    npix++;
  }
  Serial.print("***** ");                           // separator
  Serial.print(npix);
  Serial.print(" pixels\n");
  esp_camera_fb_return(frame);                      // return storage space
  return true;
}  // capture_still

// ******************************************************************************************************************
// end

Do you mean downsampling? How can I do that? Are there any functions or code that I can start with?

Image scaling is a different can of worms.
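
That said, if all you need is a rough reduction of a grayscale frame (like the
one captured by the sketch above), a minimal 2x box downsample could look like
this; it assumes one byte per pixel and a caller-provided output buffer:

#include <stdint.h>

// Halve a grayscale image by averaging each 2x2 block of pixels.
// src holds w*h bytes (e.g. frame->buf with PIXFORMAT_GRAYSCALE),
// dst must hold (w/2)*(h/2) bytes.  A sketch only.
void downsample2x(const uint8_t *src, int w, int h, uint8_t *dst) {
  for (int y = 0; y < h / 2; y++) {
    for (int x = 0; x < w / 2; x++) {
      int sum = src[(2*y)   * w + 2*x] + src[(2*y)   * w + 2*x + 1]
              + src[(2*y+1) * w + 2*x] + src[(2*y+1) * w + 2*x + 1];
      dst[y * (w / 2) + x] = (uint8_t)(sum / 4);
    }
  }
}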

(post deleted by author)

Why do you open yet another thread for the same problem?

Sorry, did you mean to add this question to the previous question?!

From my understanding, this is exactly the same question as last time,
but that post has disappeared.

It is an extension of your initial post, so I think one thread would be better.
You can change the title, if your focus changes, but there is no reason to
duplicate any information in a new thread.

But that is just my opinion.

Perhaps decide which post you want replies to ?

The subject seems to be the same.



Could you take a few moments to Learn How To Use The Forum.

Other general help and troubleshooting advice can be found here.
It will help you get the best out of the forum.

That line of code passes the array, that contains the image, to the routine that writes the image to disk.

This line of code, just after the file write;

printarrayHEX(fb->buf, fb->len);

Would print the image to serial monitor as a series of HEX bytes with this routine;

void printarrayHEX(uint8_t *buff, uint32_t len) {
  uint32_t index;
  uint8_t buffdata;

  for (index = 0; index < len; index++) {
    buffdata = buff[index];
    if (buffdata < 16) Serial.print(F("0"));   // pad single-digit values with a leading zero
    Serial.print(buffdata, HEX);
    Serial.print(F(" "));
  }
}

So that could be adapted to read or do whatever you want with the image.

Why did you choose HEX bytes?