Slower than expected swap() test

Most of you are aware of the swap that doesn't use a temporary variable. I tested two swap() versions by passing references to two different arrays: 1) using bitwise XOR which avoids a temporary variable, and 2) a version that does use the temporary. My expectation was that the XOR would be faster. It actually is 4 times slower. I may have set the test up incorrectly. Either way, some of you assembly gurus know the reason for this result and, perhaps, would share it with me.

void swap(unsigned int& a, unsigned int& b)
{
    b=(a+b)-(a=b);    // This takes 8 seconds

    a ^= b;         // This takes about 12 seconds
    b ^= a;
    a ^= b;
 /*
  int temp;         // This takes 4 seconds

  temp = b;
  b = a;
  a = temp;
*/
}

void setup() {
  unsigned int a[100];
  unsigned int b[100];
  int i, j;
  unsigned long start, end;

  Serial.begin(115200);
  for (i = 0, j = 100; i < 100; i++, j--) {
    a[i] = i;
    b[i] = j;
  }

  for (i = 0; i < 100; i++) {
    Serial.print("a[");
    Serial.print(i);
    Serial.print("] = ");
    Serial.print(a[i]);
    Serial.print("   b[] = ");
    Serial.println(b[i]);
  }

  start = millis();
  for (j = 0; j < 30001; j++) {
    for (i = 0; i < 100; i++) {
      swap(a[i], b[i]);
    }
  }
  end = millis();
  Serial.println( (end - start) / 1000);
  Serial.println("=======================");

  for (i = 0; i < 100; i++) {
    Serial.print("a[");
    Serial.print(i);
    Serial.print("] = ");
    Serial.print(a[i]);
    Serial.print("   b[] = ");
    Serial.println(b[i]);
  }

}



void loop() {

}

Edit: I added a third method

Did you try bytes? After all with ints, the exclusive OR has to be done in two chunks.

I was curious so I wrote two functions based on your code

void swapXOR(unsigned int& a, unsigned int& b)
{
  a ^= b;         // This takes about 12 seconds
  b ^= a;
  a ^= b;
}

void swapSimple(unsigned int& a, unsigned int& b)
{
  int temp;
  temp = b;
  b = a;
  a = temp;
}

and then I look at the assembly language generated for both

Disassembly of swapXOR:

   0: fc 01       movw r30, r24
   2: db 01       movw r26, r22
   4: 20 81       ld r18, Z
   6: 31 81       ldd r19, Z+1 ; 0x01
   8: 8d 91       ld r24, X+
   a: 9c 91       ld r25, X
   c: 11 97       sbiw r26, 0x01 ; 1
[color=blue]   e: 28 27       eor r18, r24
  10: 39 27       eor r19, r25
[/color]  12: 31 83       std Z+1, r19 ; 0x01
  14: 20 83       st Z, r18
  16: 8d 91       ld r24, X+
  18: 9c 91       ld r25, X
  1a: 11 97       sbiw r26, 0x01 ; 1
[color=blue]  1c: 82 27       eor r24, r18
  1e: 93 27       eor r25, r19
[/color]  20: 8d 93       st X+, r24
  22: 9c 93       st X, r25
  24: 20 81       ld r18, Z
  26: 31 81       ldd r19, Z+1 ; 0x01
[color=blue]  28: 28 27       eor r18, r24
  2a: 39 27       eor r19, r25
[/color]  2c: 31 83       std Z+1, r19 ; 0x01
  2e: 20 83       st Z, r18
  30: 08 95       ret

Disassembly of swapSimple:

   0: fb 01       movw r30, r22
   2: 20 81       ld r18, Z
   4: 31 81       ldd r19, Z+1 ; 0x01
[color=green]   6: fc 01       movw r30, r24[/color]
   8: 40 81       ld r20, Z
   a: 51 81       ldd r21, Z+1 ; 0x01
[color=green]   c: fb 01       movw r30, r22[/color]
   e: 51 83       std Z+1, r21 ; 0x01
  10: 40 83       st Z, r20
[color=green]  12: fc 01       movw r30, r24
[/color]  14: 31 83       std Z+1, r19 ; 0x01
  16: 20 83       st Z, r18
  18: 08 95       ret

Initial guts feeling given swapXOR has 30 steps while the swapSimple has only 18 is that the simple swap will be faster

Indeed we can also see that the compiler understands we are dealing with xor on 2 bytes and does the EOR instruction 6 times (highlighted in blue)

In the second code, we can see the compiler makes smart use of the Z register to exchange the values and the MOVW instruction which makes a copy of one register pair into another register pair so dealing in 1 cycle (according to the documentation) with 2 bytes - highlighted in Green

In Arduino language, register pairs R26:R27, R28:R29 and R30:R31 have extra names in assembler: X, Y and Z. These pairs are 16-bit pointer registers, able to point to adresses SRAM locations (or program memory location for Z).

if you want to be thorough and know how much faster, then you need to look at each line and determine the number of cycle for each instruction to execute in the code, add that up.

I admit I've been lazy and did not do the timing work - happy with the knowledge gained here :))

KeithRB:
Did you try bytes? After all with ints, the exclusive OR has to be done in two chunks.

Switching data types to byte results in:

  • single statement 8 down to 7 seconds
  • XOR 12 8
  • temporary 4 2

Thanks to both of you...worth the experiment.
J-M-L: It's been a long time since I've done any assembler (Z80). I really need to get back into that. Thanks for taking the time to check.

with pleasure - it's always nice to challenge your brain and understand really what's going on.

we have all the tools in our hands.