quickly "reversing" a byte?

Is there a very fast way to "reverse" a byte? Like turning 01110010 into 01001110 where bit 0 goes where bit 7 was, bit 1 goes where bit 6 was etc. I could do it in a loop of course but I'm looking for an alternative that only takes a few clock cycles. I'm sure that one of the many instructions can do this in just a few cycles, but I have no idea which one it is.

Throw memory at it use a lookup table!

Something like this?

char rev[256];

rev[1] = 128;
rev[2] = 64;
rev[3] = 192;
...

var = rev[var];
...

I hadn't thought about doing it that way. I don't know a single thing about assembly or how a compiler would assemble that code, but my guess would be that it would need to put rev[y] into a register out of memory (LDS? 2 cycles?), and then store it into var (STS? 2 cycles?). There seem to be a few version of STS and LDS which take either 1 or 2 cycles. Not sure which one is used in this case. 256 bytes of RAM is no problem but this solution isn't very elegant. I found an instruction called SWAP which swaps the nibbles of a byte so I'm sure there's an instruction to swap every single bit. Can't find it though...

I found this function in my sketch directory, but it’s been a long time sense I played with it so I don’t even know if it works. Feel free to play with it.

byte flipByte(byte c)
     {
       c = ((c>>1)&0x55)|((c<<1)&0xAA);
       c = ((c>>2)&0x33)|((c<<2)&0xCC);
       c = (c>>4) | (c<<4) ;

       return c;
     }

It does indeed work, although I haven't a clue as to how you came up with it :). However, bitshifts and bitwise operators take a clock cycle each so this would take at least 13 cycles. I still don't know how long the code I posted above would take so this might actually be faster. I need an assembly guru to look at the code.

On a 16MHz Arduino, you normally have more time than memory. Even if you put the transpose table into PROGMEM.
If you need to be fast, just use wires instead of a controller.

It would take 75 cycles to complete, but generates a heck of a lot of program code.

  b6:	9e 01       	movw	r18, r28
  b8:	22 0f       	add	r18, r18
  ba:	33 1f       	adc	r19, r19
  bc:	2a 7a       	andi	r18, 0xAA	; 170
  be:	ce 01       	movw	r24, r28
  c0:	95 95       	asr	r25
  c2:	87 95       	ror	r24
  c4:	85 75       	andi	r24, 0x55	; 85
  c6:	28 2b       	or	r18, r24
  c8:	82 2f       	mov	r24, r18
  ca:	90 e0       	ldi	r25, 0x00	; 0
  cc:	9c 01       	movw	r18, r24
  ce:	22 0f       	add	r18, r18
  d0:	33 1f       	adc	r19, r19
  d2:	22 0f       	add	r18, r18
  d4:	33 1f       	adc	r19, r19
  d6:	2c 7c       	andi	r18, 0xCC	; 204
  d8:	95 95       	asr	r25
  da:	87 95       	ror	r24
  dc:	95 95       	asr	r25
  de:	87 95       	ror	r24
  e0:	83 73       	andi	r24, 0x33	; 51
  e2:	28 2b       	or	r18, r24
       c = (c>>4) | (c<<4) ;
  e4:	02 2f       	mov	r16, r18
  e6:	10 e0       	ldi	r17, 0x00	; 0
  e8:	c8 01       	movw	r24, r16                       
  ea:	a4 e0       	ldi	r26, 0x04	; 4       +28
  ec:	95 95       	asr	r25
  ee:	87 95       	ror	r24
  f0:	aa 95       	dec	r26
  f2:	e1 f7       	brne	.-8      	; 0xec <loop+0x42>   +5*3 + 4*1
  f4:	f4 e0       	ldi	r31, 0x04	; 4                   + 1
  f6:	00 0f       	add	r16, r16
  f8:	11 1f       	adc	r17, r17
  fa:	fa 95       	dec	r31
  fc:	e1 f7       	brne	.-8      	; 0xf6 <loop+0x4c>     +5*3 + 4*1
  fe:	08 2b       	or	r16, r24 +1

That comes to 68, and if you add the rcall and ret instructions you get a total of 75.

This on the other hand:

byte flipByte(byte c){
  char r=0;
  for(byte i = 0; i < 8; i++){
    r <<= 1;
    r |= c & 1;
    c >>= 1;
  }
  return r;
}

Generates much less code, and as such the compiler optimises it as an inline function (or at least it did in my test), giving it a slightly smaller footprint of 72 cycles

  b0:	60 e0       	ldi	r22, 0x00	; 0
  b2:	90 e0       	ldi	r25, 0x00	; 0
  b4:	66 0f       	add	r22, r22
  b6:	82 2f       	mov	r24, r18
  b8:	81 70       	andi	r24, 0x01	; 1
  ba:	68 2b       	or	r22, r24
  bc:	26 95       	lsr	r18
  be:	9f 5f       	subi	r25, 0xFF	; 255
  c0:	98 30       	cpi	r25, 0x08	; 8
  c2:	c1 f7       	brne	.-16     	; 0xb4 <loop+0xa>

If it doesn’t put it inline, then you can force it to be by using this: (which due to the small size wouldn’t be a problem).

inline byte flipByte(byte c); __attribute__((always_inline))
inline byte flipByte(byte c){

A 256byte lookup table would however be far faster, however at 4.5uS for either of the other two suggestions, it is done faster than a digitalWrite()!

Cool, maybe someday I'll be able to figure all that out by myself. So a lookup table it is then. The way I currently have my setup wired, it takes 2.2 seconds to do a certain operation. In order to make the wiring far neater and easier, I'm going to have to reverse the way they connect to the ports, hence this post. In one of these 2.2 second operations I'm going to have to reverse 153600 bytes, so it better be fast! :smiley: If you say that a lookup table is faster than 4.5 uS then that will add less than half a second onto the time, which is acceptable.

With one of those functions you are looking at 0.7s extra to do that many bytes.

A lookup table in the SRAM would take 2cycles per byte or 0.0192s to do 153600 bytes.

A lookup table in the Program Memory would take 7 cycles/byte, or 0.0672s for all bytes.

A Quick example of using a progmem lookup ttable

void setup(){
  Serial.begin(38400);
}

byte reverse[256] PROGMEM = {
  0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,  
  0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,  
  0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,  
  0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,  
  0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,  
  0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA, 
  0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,  
  0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE, 
  0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1, 
  0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,  
  0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5, 
  0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD, 
  0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,  
  0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB, 
  0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,  
  0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF 
}; 

byte data = 0;
void loop() {
  for(byte i = 0; i < 256;i++){
    data = pgm_read_byte(&(reverse[i]));
    Serial.println(data);
    delay(100);
  }
}

I found these nifty versions. Not tried on Arduino platforms though.

unsigned char b; // reverse this (8-bit) byte
 
b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
unsigned int v;     // input bits to be reversed
unsigned int r = v; // r will be reversed bits of v; first get LSB of v
int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end

for (v >>= 1; v; v >>= 1)
{   
  r <<= 1;
  r |= v & 1;
  s--;
}
r <<= s; // shift when v's highest bits are zero
[quote author=pYro_65 link=topic=117966.msg888087#msg888087 date=1344593439]
Not tried on Arduino platforms though.

[code]unsigned char b; // reverse this (8-bit) byte
 
b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;

[/code]
[/quote]

That one would be horrendously slow as it is designed for a 64bit system. It takes the arduino very many instructions to do just one calculation with unsigned long longs, let alone that one. It works well for 32bit computers though.
The second one uses 16bit math which is again not advantageous for an 8 bit system to use.
Good tricks for a computer, or even an ARM, not so good for an 8bit microcontroller.

I once worked on a bit-slice DSP (the technique is core to the FFT agorithm) that simply had bit-reverse as a separate data path, with the data lines physically swapped over; effectively a sixteen bit output port wired to another sixteen bit input port.

Worth a try on a Mega, if you had two eight bit ports to spare.

That one would be horrendously slow as it is designed for a 64bit system. It takes the arduino very many instructions to do just one calculation with unsigned long longs, let alone that one. It works well for 32bit computers though.
The second one uses 16bit math which is again not advantageous for an 8 bit system to use.
Good tricks for a computer, or even an ARM, not so good for an 8bit microcontroller.

True, but as for the second loop, it uses sizeof to help determine number of bits, replace int with char.

Unsigned long long's? I didn't even know they existed! Also, thanks for the code Tom, now I won't have to write the whole lookup table.

EDIT: @AWOL My original thought was that maybe there was a command that could do just that, connect pin 0 of a port to pin 7, pin 1 to pin 6 etc. I'm actually using an ATxmega256D3 rather than an ATmega which has a lot of stuff built in that I don't even understand, so maybe it's still possible?

..some bit reversing tricks for 8bitters:
http://www.piclist.com/techref/microchip/math/bit/revbits.htm