Help with optimizing code

Hi guys,

I’d like to optimize this piece of code so it executes as fast as possible.
It basically converts a 8bit color into 12bit.
In 8bit, the RGB is as follow:
R = 3bit
G = 3 bit
B = 2 bit
And I’m upscaling it to 12bit:
R = 4bit
G = 4bit
B = 4bit
Here is the code:

	int R=(color&0xE0)>>5;
	if (R>=4) R=((R*2)+1); else R=R<<1;
	int G=(color&0x1C)>>2;
	if (G>=4) G=((G*2)+1)<<4; else G=G<<5;
	int B=(color&0x03)*5;

How can I optimize this to work more efficiently and thus faster?

Thank you,
RI

The most optimum way is to not do it. Why do you think you need to?

It's hard to beat bit shifting for speed.

Table lookup.

I need to convert the colors to make it backward compatible with my old code/LCD screens. My old LCD screens used 8bit color mode and the new ones use 12bit. I guess table lookup would work. It would consume 512bytes of RAM, right? If I made the table PROGMEM, would I be taking a penalty in speed vs using RAM? I am using a ATMega2560, so I do have a little bit of breathing on RAM, but would rather not use it if PROGMEM would give me the same results.

newhobby:
I guess table lookup would work.
It would consume 512bytes of RAM, right?

Not 512. There are only 7 possible values for each of Red and Green, so the table only needs to be 7 bytes. You can use it to generate the values for Red, then for the values of Green. The table for Blue only needs to be 4 bytes, as there are only two bits for the Green value. Then you can compine each byte to form a 12 bit RGB value.

Or, you could multiply the R and G values by 2 (shift left 1), giving you possible values of 0, 2, 4, 6, 8, 10, 12, 14.

Then mutiply the G value by 5., giving you possible values of 0, 5, 10, 15.

   int temp = G;
   G = g << 1;
   G = G + temp;

lar3ry: Or, you could multiply the R and G values by 2 (shift left 1), giving you possible values of 0, 2, 4, 6, 8, 10, 12, 14.

Then mutiply the G value by 5., giving you possible values of 0, 5, 10, 15.

This is actually what I am doing, if you look at the code I posted in the original code. Except I'm doing one more check to be able to add 1 to get the 15 for R and G or the colors would not be correct with the result being 14. If I separate into 3 tables, I would still have to apply the masks and shift right to get the pure value of each color, so the calculation would still be intense and I would only be trading the shift left and 1 comparison for a table lookup. I'll test and see if there is any improvement.

newhobby:
This is actually what I am doing, …

You’re right! Silly me. Anyway, I came up with another method you can test. This runs in 997 milliseconds on an Arduino Uno R3. If you uncomment the Serial.print lines, you can check for accuracy of conversion (you will probably want to tweak the Blue).

int col;
int rgb;

void setup() {
  Serial.begin(19200);
}

void loop() {
  int elapsed = millis();
  int r;
  int g;
  int b;

  for (int j = 1; j <1000; j++) { // run this 1000 times
    for (int i=0;i<256;i++) {      // convert every possible value
      col = i;
//      Serial.print(col,HEX);
//      Serial.print("   ");
      r = (col &0xE0) * 16;
//      Serial.print(r,HEX);
//      Serial.print("   ");
      g = (col & 0x1C) * 8;
//      Serial.print(g,HEX);
//      Serial.print("   ");
      col = (col & 0x03) * 5;
//      Serial.print(col,HEX);
//      Serial.print("   ");
      col = col + r + g;
//      Serial.print(col,HEX);
//      Serial.println();
    }
  }

  Serial.println (millis() - elapsed); // show the elapsed time
  while (1) {                              // stop while you check the values.
  }
}

newhobby: If I made the table PROGMEM, would I be taking a penalty in speed vs using RAM?

As I recall there is a one clock cycle penalty (62.5 nS) to access a byte of PROGMEM vs RAM. Since the lookup will be faster than what you are currently doing this probably won't matter too much.

do you have timing results of your code?

use uint8_t instead of int (aka int16_t ) will make the biggest difference as then the compiler can use single instructions instead of two to do most of the calcs.

ps. for speed and size i'd go with the lookup table as well.

your original code timed 6.85 us, optimized timed 4.21 us in test program below. (-38%)

volatile int color;
int R,G,B;
byte RR,GG,BB;

void setup()
{
  Serial.begin(115200);
  Serial.println("start");

  // ORIGINAL CODE
  uint32_t start = micros();
  for (int i = 0; i < 30000; i++)
  {
    color = i & 0xFF;

    R=(color&0xE0)>>5;
    if (R>=4) R=((R*2)+1); 
    else R=R<<1;

    G=(color&0x1C)>>2;
    if (G>=4) G=((G*2)+1)<<4; 
    else G=G<<5;

    B=(color&0x03)*5;
  }
  uint32_t duration = micros() - start;
  Serial.print("org: ");
  Serial.println(duration/30000.0);
  Serial.println();

  // OPTIMIZED VERSION
  start = micros();
  for (int i = 0; i < 30000; i++)
  {
    color = i & 0xFF;

    RR = (color & 0xE0) >> 4;
    if (RR >= 8) RR++;

    GG = (color & 0x1C) << 3;
    if (GG >= 0x70) GG += 16;

    BB = (color & 0x03) * 5;
  }
  duration = micros() - start;
  Serial.print("new: ");
  Serial.println(duration/30000.0);
  Serial.println();


  // COMPARE LOOP
  for (int i = 0; i < 256; i++)
  {
    color = i & 0xFF;

    RR = (color & 0xE0) >> 4;
    if (RR >= 8) RR++;

    GG = (color & 0x1C) << 3;
    if (GG >= 0x70) GG += 16;

    BB = (color&0x03) *5;

    R=(color&0xE0)>>5;
    if (R>=4) R=((R*2)+1); 
    else R=R<<1;

    G=(color&0x1C)>>2;
    if (G>=4) G=((G*2)+1)<<4; 
    else G=G<<5;

    B= (color & 0x03) * 5;

    if ( (R != RR) || (G != GG ) || (B !=BB))
    {
      Serial.println("error");
    }

    Serial.print(i);
    Serial.print('\t');
    Serial.print(R, DEC);
    Serial.print('\t');
    Serial.print(RR, DEC);
    Serial.print('\t');
    Serial.print(G, DEC);
    Serial.print('\t');
    Serial.print(GG, DEC);
    Serial.print('\t');
    Serial.print(B, DEC);
    Serial.print('\t');
    Serial.print(BB, DEC);
    Serial.print('\n');
  }

  Serial.println("done");
}

void loop()
{
}

robtillaart: your original code timed 6.85 us, optimized timed 4.21 us in test program below. (-38%)

using uint8_t instead of int

orig 6.85, swap to uint8_t = 3.07

win ;-)

haven't tried Robs optimised code, or using a lookup, but using the smallest sized var you can get away with helps somuch on our poor 8bit micros.

and Rob’s 4.21us goes down to 2.32us when using smaller vars.

having peered at the assembly code, i’m not sure the lookup will be quicker now.

orig 16bit ints, doing the lines

R=(color&0xE0)>>5;
if (R>=4) R=((R*2)+1);
else R=R<<1;

gives

 1c6:   a0 91 6d 01     lds     r26, 0x016D
 1ca:   b0 91 6e 01     lds     r27, 0x016E
 1ce:   a0 7e           andi    r26, 0xE0       ; 224
 1d0:   b0 70           andi    r27, 0x00       ; 0
 1d2:   55 e0           ldi     r21, 0x05       ; 5
 1d4:   b6 95           lsr     r27
 1d6:   a7 95           ror     r26
 1d8:   5a 95           dec     r21
 1da:   e1 f7           brne    .-8             ; 0x1d4 <setup+0x68>
 1dc:   9d 01           movw    r18, r26
 1de:   22 0f           add     r18, r18
 1e0:   33 1f           adc     r19, r19
 1e2:   a4 30           cpi     r26, 0x04       ; 4
 1e4:   b1 05           cpc     r27, r1
 1e6:   1c f0           brlt    .+6             ; 0x1ee <setup+0x82>
 1e8:   f9 01           movw    r30, r18
 1ea:   31 96           adiw    r30, 0x01       ; 1
 1ec:   01 c0           rjmp    .+2             ; 0x1f0 <setup+0x84>
 1ee:   f9 01           movw    r30, r18

and using uint8_t for the same snippet of three lines above gives

 1be:   80 91 6d 01     lds     r24, 0x016D
 1c2:   82 95           swap    r24
 1c4:   86 95           lsr     r24
 1c6:   87 70           andi    r24, 0x07       ; 7
 1c8:   84 30           cpi     r24, 0x04       ; 4
 1ca:   20 f0           brcs    .+8             ; 0x1d4 <setup+0x68>
 1cc:   58 2f           mov     r21, r24
 1ce:   55 0f           add     r21, r21
 1d0:   5f 5f           subi    r21, 0xFF       ; 255
 1d2:   02 c0           rjmp    .+4             ; 0x1d8 <setup+0x6c>
 1d4:   58 2f           mov     r21, r24
 1d6:   55 0f           add     r21, r21

notice how the optimiser can swap the top and bottom half of reg 24 ( in the second chunk of assembly at line 1c2, then the single logical shift right on 1c4, those two lines are doing the same as the small loop between 1d2 and 1da in the first code ( which loops 5 times !).
the &0xE0 can now be done in the second bit of code with the line at 1c6, as the bits of the colour info we want are now in the lower end of reg 24. a single instruction as against the lines 1ce and 1d0 in the first bit of code.

edits to correct my bad spelling and lack of spaces due to dodgy keyboard !

Thanks all for the help!!!
Unfortunately, I should have done my homework before I asked the question :frowning:
I noticed after I edited the code that the bottleneck isn’t really on the color calculation.
Although it helped a bit, it didn’t help really substantially.
I did a quick comparison and it took 420ms to display 131x131 pixels with the original code and 391ms with Rob’s optmized using byte size.
So, my bottleneck is really the SPI transmission.
I’m using soft SPI. So, how can I improve?
I did try hardware SPI some time ago and there was not much difference back then. Maybe I was doing something wrong back then too.
Here is the code I’m using to test:

#define BL0 cbi(PORTE,4);
#define BL1 sbi(PORTE,4);
#define CS0 cbi(PORTE,5);
#define CS1 sbi(PORTE,5);
#define CLK0 cbi(PORTG,5);
#define CLK1 sbi(PORTG,5);
#define SDA0 cbi(PORTE,3);
#define SDA1 sbi(PORTE,3);
#define RESET0 cbi(PORTH,3);
#define RESET1 sbi(PORTH,3);

#define sbi(port,bitnum)  port |= _BV(bitnum)
#define cbi(port,bitnum)  port &= ~(_BV(bitnum))

void setup()
{
  Serial.begin(57600);
  unsigned long start = millis();
  for (byte i=0;i<132;i++)
    for (byte j=0;j<132;j++)
      SendColor12BitOptimized(0xff);
  Serial.println(millis()-start);
  start = millis();
  for (byte i=0;i<132;i++)
    for (byte j=0;j<132;j++)
      SendColor12Bit(0xff);
  Serial.println(millis()-start);
  
}

void loop()
{
}

void SendData(byte data)
{
  CLK0;
  SDA1;
  CLK1;
  ShiftBits(data);
}

void SendCMD(byte data)
{
  CLK0;
  SDA0;
  CLK1;
  ShiftBits(data);
}

void ShiftBits(byte b)
{
  byte Bit;
  
  for (Bit = 0; Bit < 8; Bit++)
  {
    CLK0;
    if((b&0x80)>>7)
    {
      SDA1;
    }
    else
    {
      SDA0;
    }
    CLK1;
    b <<= 1;
  }
}

void SendColor12BitOptimized(byte color)
{
  byte RR = (color & 0xE0) >> 4;
  if (RR >= 8) RR++;
  byte GG = (color & 0x1C) << 3;
  if (GG >= 0x70) GG += 16;
  byte BB = (color & 0x03) * 5;
  SendData(RR);
  SendData(GG+BB);
}

void SendColor12Bit(byte color)
{
  int R=(color&0xE0)>>5;
  if (R>=4) R=((R*2)+1); else R=R<<1;
  int G=(color&0x1C)>>2;
  if (G>=4) G=((G*2)+1)<<4; else G=G<<5;
  int B=(color&0x03)*5;
  SendData(R);
  SendData(G+B);
}

Definitely use hardware SPI. It should be much faster. If it isn't post that code.

I did try the hardware SPI way back then, but this is a 9-bit SPI, so in order to make it work, I need to send the 1st bit manually and then shift the rest of the 8bit through hardware SPI. In order to manually use MOSI and SCLK pins, I have to set SPCR=0, manipulate the pin status, sett SPCR=50 and shift the bits through hardware SPI, which at the end of the day, was takes just as long if not longer. I'm going to scope the soft SPI when I get home.

What device is this? It is unusual to have 9-bit SPI. Any reason you can't send 16 bits for example?

  for (Bit = 0; Bit < 8; Bit++)
  {
    CLK0;
    if((b&0x80)>>7)

Well, you don’t need the shift there. "if (b & 0x80) " will work just fine, because of the way that C defines “if.”
However, I’d be surprised if this isn’t already omitted by the optimizer…

It's one of those small Nokia LCD screens. I know... unusual, but they use 9-bit SPI :( I would hate to change my hardware at this point.

See if this is any faster:

#define BL0 cbi(PORTE, 0x10);
#define BL1 sbi(PORTE, 0x10);
#define CS0 cbi(PORTE, 0x20);
#define CS1 sbi(PORTE, 0x20);
#define CLK0 cbi(PORTG, 0x20);
#define CLK1 sbi(PORTG, 0x20);
#define SDA0 cbi(PORTE, 0x08);
#define SDA1 sbi(PORTE, 0x08);
#define RESET0 cbi(PORTH, 0x08);
#define RESET1 sbi(PORTH, 0x08);

#define sbi(port, bv)  port |= bv
#define cbi(port, bv)  port &= ~bv

void setup()
{
  Serial.begin(57600);
  unsigned long start = millis();
  for (byte i=0;i<132;i++)
    for (byte j=0;j<132;j++)
      SendColor12BitOptimized(0xff);
  Serial.println(millis()-start);
  start = millis();
  for (byte i=0;i<132;i++)
    for (byte j=0;j<132;j++)
      SendColor12Bit(0xff);
  Serial.println(millis()-start);
  
}

void loop()
{
}

static void SendData(const byte data)
{
  CLK0;
  SDA1;
  CLK1;
  ShiftBits(data);
}

static void SendCMD(const byte data)
{
  CLK0;
  SDA0;
  CLK1;
  ShiftBits(data);
}

static void ShiftBits(const byte b)
{
  for (byte bit = 0; bit != 8; ++bit, b <<= 1)
  {
    CLK0;
    if(b & 0x80)
      SDA1;
    else
      SDA0;
    CLK1;
  }
}

void SendColor12BitOptimized(const byte &color)
{
  byte RR = (color & 0xE0) >> 4;
  if (RR >= 8)
    ++RR;
  SendData(RR);

  RR = (color & 0x1C) << 3;
  if (RR >= 0x70)
    RR += 16;

  RR += (color & 0x03) * 5;
  SendData(RR);
}

void SendColor12Bit(byte color)
{
  int R=(color&0xE0)>>5;
  if (R>=4) R=((R*2)+1); else R=R<<1;
  int G=(color&0x1C)>>2;
  if (G>=4) G=((G*2)+1)<<4; else G=G<<5;
  int B=(color&0x03)*5;
  SendData(R);
  SendData(G+B);
}