# quickly "reversing" a byte?

Is there a very fast way to "reverse" a byte? Like turning 01110010 into 01001110 where bit 0 goes where bit 7 was, bit 1 goes where bit 6 was etc. I could do it in a loop of course but I'm looking for an alternative that only takes a few clock cycles. I'm sure that one of the many instructions can do this in just a few cycles, but I have no idea which one it is.

Throw memory at it use a lookup table!

Something like this?

``````char rev[256];

rev[1] = 128;
rev[2] = 64;
rev[3] = 192;
...

var = rev[var];
...
``````

I hadn't thought about doing it that way. I don't know a single thing about assembly or how a compiler would assemble that code, but my guess would be that it would need to put rev[y] into a register out of memory (LDS? 2 cycles?), and then store it into var (STS? 2 cycles?). There seem to be a few version of STS and LDS which take either 1 or 2 cycles. Not sure which one is used in this case. 256 bytes of RAM is no problem but this solution isn't very elegant. I found an instruction called SWAP which swaps the nibbles of a byte so I'm sure there's an instruction to swap every single bit. Can't find it though...

I found this function in my sketch directory, but it’s been a long time sense I played with it so I don’t even know if it works. Feel free to play with it.

``````byte flipByte(byte c)
{
c = ((c>>1)&0x55)|((c<<1)&0xAA);
c = ((c>>2)&0x33)|((c<<2)&0xCC);
c = (c>>4) | (c<<4) ;

return c;
}
``````

It does indeed work, although I haven't a clue as to how you came up with it :). However, bitshifts and bitwise operators take a clock cycle each so this would take at least 13 cycles. I still don't know how long the code I posted above would take so this might actually be faster. I need an assembly guru to look at the code.

On a 16MHz Arduino, you normally have more time than memory. Even if you put the transpose table into PROGMEM.
If you need to be fast, just use wires instead of a controller.

It would take 75 cycles to complete, but generates a heck of a lot of program code.

``````  b6:	9e 01       	movw	r18, r28
b8:	22 0f       	add	r18, r18
ba:	33 1f       	adc	r19, r19
bc:	2a 7a       	andi	r18, 0xAA	; 170
be:	ce 01       	movw	r24, r28
c0:	95 95       	asr	r25
c2:	87 95       	ror	r24
c4:	85 75       	andi	r24, 0x55	; 85
c6:	28 2b       	or	r18, r24
c8:	82 2f       	mov	r24, r18
ca:	90 e0       	ldi	r25, 0x00	; 0
cc:	9c 01       	movw	r18, r24
ce:	22 0f       	add	r18, r18
d0:	33 1f       	adc	r19, r19
d2:	22 0f       	add	r18, r18
d4:	33 1f       	adc	r19, r19
d6:	2c 7c       	andi	r18, 0xCC	; 204
d8:	95 95       	asr	r25
da:	87 95       	ror	r24
dc:	95 95       	asr	r25
de:	87 95       	ror	r24
e0:	83 73       	andi	r24, 0x33	; 51
e2:	28 2b       	or	r18, r24
c = (c>>4) | (c<<4) ;
e4:	02 2f       	mov	r16, r18
e6:	10 e0       	ldi	r17, 0x00	; 0
e8:	c8 01       	movw	r24, r16
ea:	a4 e0       	ldi	r26, 0x04	; 4       +28
ec:	95 95       	asr	r25
ee:	87 95       	ror	r24
f0:	aa 95       	dec	r26
f2:	e1 f7       	brne	.-8      	; 0xec <loop+0x42>   +5*3 + 4*1
f4:	f4 e0       	ldi	r31, 0x04	; 4                   + 1
f6:	00 0f       	add	r16, r16
f8:	11 1f       	adc	r17, r17
fa:	fa 95       	dec	r31
fc:	e1 f7       	brne	.-8      	; 0xf6 <loop+0x4c>     +5*3 + 4*1
fe:	08 2b       	or	r16, r24 +1
``````

That comes to 68, and if you add the rcall and ret instructions you get a total of 75.

This on the other hand:

``````byte flipByte(byte c){
char r=0;
for(byte i = 0; i < 8; i++){
r <<= 1;
r |= c & 1;
c >>= 1;
}
return r;
}
``````

Generates much less code, and as such the compiler optimises it as an inline function (or at least it did in my test), giving it a slightly smaller footprint of 72 cycles

``````  b0:	60 e0       	ldi	r22, 0x00	; 0
b2:	90 e0       	ldi	r25, 0x00	; 0
b4:	66 0f       	add	r22, r22
b6:	82 2f       	mov	r24, r18
b8:	81 70       	andi	r24, 0x01	; 1
ba:	68 2b       	or	r22, r24
bc:	26 95       	lsr	r18
be:	9f 5f       	subi	r25, 0xFF	; 255
c0:	98 30       	cpi	r25, 0x08	; 8
c2:	c1 f7       	brne	.-16     	; 0xb4 <loop+0xa>
``````

If it doesn’t put it inline, then you can force it to be by using this: (which due to the small size wouldn’t be a problem).

``````inline byte flipByte(byte c); __attribute__((always_inline))
inline byte flipByte(byte c){
``````

A 256byte lookup table would however be far faster, however at 4.5uS for either of the other two suggestions, it is done faster than a digitalWrite()!

Cool, maybe someday I'll be able to figure all that out by myself. So a lookup table it is then. The way I currently have my setup wired, it takes 2.2 seconds to do a certain operation. In order to make the wiring far neater and easier, I'm going to have to reverse the way they connect to the ports, hence this post. In one of these 2.2 second operations I'm going to have to reverse 153600 bytes, so it better be fast! If you say that a lookup table is faster than 4.5 uS then that will add less than half a second onto the time, which is acceptable.

With one of those functions you are looking at 0.7s extra to do that many bytes.

A lookup table in the SRAM would take 2cycles per byte or 0.0192s to do 153600 bytes.

A lookup table in the Program Memory would take 7 cycles/byte, or 0.0672s for all bytes.

A Quick example of using a progmem lookup ttable

``````void setup(){
Serial.begin(38400);
}

byte reverse[256] PROGMEM = {
0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
};

byte data = 0;
void loop() {
for(byte i = 0; i < 256;i++){
Serial.println(data);
delay(100);
}
}
``````

I found these nifty versions. Not tried on Arduino platforms though.

``````unsigned char b; // reverse this (8-bit) byte

b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
``````
``````unsigned int v;     // input bits to be reversed
unsigned int r = v; // r will be reversed bits of v; first get LSB of v
int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end

for (v >>= 1; v; v >>= 1)
{
r <<= 1;
r |= v & 1;
s--;
}
r <<= s; // shift when v's highest bits are zero
``````
``````[quote author=pYro_65 link=topic=117966.msg888087#msg888087 date=1344593439]
Not tried on Arduino platforms though.

[code]unsigned char b; // reverse this (8-bit) byte

b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
``````

[/code]
[/quote]

That one would be horrendously slow as it is designed for a 64bit system. It takes the arduino very many instructions to do just one calculation with unsigned long longs, let alone that one. It works well for 32bit computers though.
The second one uses 16bit math which is again not advantageous for an 8 bit system to use.
Good tricks for a computer, or even an ARM, not so good for an 8bit microcontroller.

I once worked on a bit-slice DSP (the technique is core to the FFT agorithm) that simply had bit-reverse as a separate data path, with the data lines physically swapped over; effectively a sixteen bit output port wired to another sixteen bit input port.

Worth a try on a Mega, if you had two eight bit ports to spare.

That one would be horrendously slow as it is designed for a 64bit system. It takes the arduino very many instructions to do just one calculation with unsigned long longs, let alone that one. It works well for 32bit computers though.
The second one uses 16bit math which is again not advantageous for an 8 bit system to use.
Good tricks for a computer, or even an ARM, not so good for an 8bit microcontroller.

True, but as for the second loop, it uses sizeof to help determine number of bits, replace int with char.

Unsigned long long's? I didn't even know they existed! Also, thanks for the code Tom, now I won't have to write the whole lookup table.

EDIT: @AWOL My original thought was that maybe there was a command that could do just that, connect pin 0 of a port to pin 7, pin 1 to pin 6 etc. I'm actually using an ATxmega256D3 rather than an ATmega which has a lot of stuff built in that I don't even understand, so maybe it's still possible?

..some bit reversing tricks for 8bitters:
http://www.piclist.com/techref/microchip/math/bit/revbits.htm