and Rob's 4.21us goes down to 2.32us when using smaller (8-bit) vars.
Having peered at the assembly code, I'm not sure the lookup will be quicker now.
The original, with 16-bit ints, doing the lines
R=(color&0xE0)>>5;
if (R>=4) R=((R*2)+1);
else R=R<<1;
gives
1c6: a0 91 6d 01 lds r26, 0x016D
1ca: b0 91 6e 01 lds r27, 0x016E
1ce: a0 7e andi r26, 0xE0 ; 224
1d0: b0 70 andi r27, 0x00 ; 0
1d2: 55 e0 ldi r21, 0x05 ; 5
1d4: b6 95 lsr r27
1d6: a7 95 ror r26
1d8: 5a 95 dec r21
1da: e1 f7 brne .-8 ; 0x1d4 <setup+0x68>
1dc: 9d 01 movw r18, r26
1de: 22 0f add r18, r18
1e0: 33 1f adc r19, r19
1e2: a4 30 cpi r26, 0x04 ; 4
1e4: b1 05 cpc r27, r1
1e6: 1c f0 brlt .+6 ; 0x1ee <setup+0x82>
1e8: f9 01 movw r30, r18
1ea: 31 96 adiw r30, 0x01 ; 1
1ec: 01 c0 rjmp .+2 ; 0x1f0 <setup+0x84>
1ee: f9 01 movw r30, r18
and using uint8_t for the same snippet of three lines above gives
1be: 80 91 6d 01 lds r24, 0x016D
1c2: 82 95 swap r24
1c4: 86 95 lsr r24
1c6: 87 70 andi r24, 0x07 ; 7
1c8: 84 30 cpi r24, 0x04 ; 4
1ca: 20 f0 brcs .+8 ; 0x1d4 <setup+0x68>
1cc: 58 2f mov r21, r24
1ce: 55 0f add r21, r21
1d0: 5f 5f subi r21, 0xFF ; 255
1d2: 02 c0 rjmp .+4 ; 0x1d8 <setup+0x6c>
1d4: 58 2f mov r21, r24
1d6: 55 0f add r21, r21
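For anyone who wants to try both variants, here is a minimal sketch of the same three lines wrapped in two functions, one with int and one with uint8_t. The function names are made up for illustration and are not from the original sketch. It only checks that both versions give the same answer for every colour byte; the timings and the assembly will depend on your compiler and flags.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit version: the same three lines, with int variables
   (int is 16 bits on the AVR). Name is hypothetical. */
int expand_red_int(int color)
{
    int R = (color & 0xE0) >> 5;      /* top three bits -> 0..7 */
    if (R >= 4) R = ((R * 2) + 1);
    else R = R << 1;
    return R;
}

/* 8-bit version: identical logic, but every value fits in one register. */
uint8_t expand_red_u8(uint8_t color)
{
    uint8_t R = (uint8_t)((color & 0xE0) >> 5);
    if (R >= 4) R = (uint8_t)((R * 2) + 1);
    else R = (uint8_t)(R << 1);
    return R;
}

/* Walk all 256 colour bytes and check the two versions agree. */
void check_versions_agree(void)
{
    for (int c = 0; c < 256; c++) {
        assert(expand_red_int(c) == expand_red_u8((uint8_t)c));
    }
}
```

Calling check_versions_agree() once from a test program is enough to confirm that narrowing the type does not change the result, since R never goes above 15.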
Notice how the optimiser can swap the top and bottom halves of r24 (the swap at 1c2 in the second chunk of assembly) and then do a single logical shift right at 1c4. Those two instructions do the same job as the small loop between 1d2 and 1da in the first chunk (which loops 5 times!).
The &0xE0 can now be done in the second chunk by the andi at 1c6, with the mask changed to 0x07, because the bits of the colour info we want are now in the low end of r24. That is a single instruction, as against the two at 1ce and 1d0 in the first chunk.
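To convince yourself the swap / lsr / andi trick really is the same as (color&0xE0)>>5, here is a rough C model of those three instructions. The helper names (avr_swap, avr_lsr and so on) are made up for this sketch; they just mimic what the AVR instructions do to one 8-bit register and ignore the status flags.

```c
#include <assert.h>
#include <stdint.h>

/* Model of AVR "swap": exchange the high and low nibbles of a register. */
uint8_t avr_swap(uint8_t r)
{
    return (uint8_t)((r << 4) | (r >> 4));
}

/* Model of AVR "lsr": shift right one bit, zero comes in at the top. */
uint8_t avr_lsr(uint8_t r)
{
    return (uint8_t)(r >> 1);
}

/* The sequence at 1c2, 1c4 and 1c6 in the uint8_t listing. */
uint8_t red_field_asm(uint8_t color)
{
    uint8_t r = avr_swap(color);   /* bits 7..4 move down to bits 3..0 */
    r = avr_lsr(r);                /* old bits 7..5 now sit in bits 2..0 */
    return (uint8_t)(r & 0x07);    /* andi 0x07 throws away the rest */
}

/* What the C source asks for. */
uint8_t red_field_c(uint8_t color)
{
    return (uint8_t)((color & 0xE0) >> 5);
}

/* Compare the two for every possible colour byte. */
void check_red_field(void)
{
    for (int c = 0; c < 256; c++) {
        assert(red_field_asm((uint8_t)c) == red_field_c((uint8_t)c));
    }
}
```

The mask has to come after the shuffle here, because swap and lsr leave the old low bits sitting in bits 3 to 6; the andi with 0x07 is what clears them.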
edits to correct my bad spelling and lack of spaces due to dodgy keyboard !