Loop duration different - increment vs decrement

Hi,

I have written a simple program that measures how long a loop takes when its index is incremented versus when it is decremented, over the same range of index values.
To measure the duration I am using micros(), which according to the specs has a resolution of 4 µs.
The result is very strange: the incrementing loop takes longer than the decrementing loop over the same range of index values. Do you have any explanation for this?

Here is the code and the output on the serial port:

int i,j;
unsigned long micro1,micro2,micro3,micro4;

void setup() {
Serial.begin (57600);
}

void loop() 
{
micro1 = micros ();
for (i=0;i<=32766;i++)
// do nothing
micro2 = micros();
micro3 = micros();
for (j=32766;j>=0;j--)
// do nothing
micro4 = micros ();
Serial.print("time increment: ");
Serial.println(micro2-micro1);
Serial.print("time decrement: ");
Serial.println(micro4-micro3);
delay (1000);
}


AND THE RESULT
...

time decrement: 144232
time increment: 150432
time decrement: 144232
time increment: 150424
time decrement: 144232
time increment: 150432
time decrement: 144224
time increment: 150428
time decrement: 144232
time increment: 150432
time decrement: 144236

If my maths is correct, the difference is 6200 microseconds, or 99200 clock cycles at 16 MHz. That is almost exactly 3 clock cycles for each of the 32767 iterations (0 to 32766 inclusive).
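
The arithmetic can be checked with a few lines of portable C++. This is only a sketch: the two timings are copied from the output above, the 16 MHz clock is the figure assumed in this post, and the constant and function names are made up for illustration. It counts 32767 iterations, since the index runs from 0 to 32766 inclusive.

```cpp
#include <cstdint>

// Timings copied from the serial output in the first post.
constexpr std::uint32_t kIncrementMicros = 150432;  // "time increment"
constexpr std::uint32_t kDecrementMicros = 144232;  // "time decrement"
constexpr std::uint32_t kCyclesPerMicro  = 16;      // assumes a 16 MHz clock
constexpr std::uint32_t kIterations      = 32767;   // index values 0..32766

// Total extra clock cycles spent by the incrementing loop.
constexpr std::uint32_t extraCycles() {
    return (kIncrementMicros - kDecrementMicros) * kCyclesPerMicro;
}

// Extra clock cycles per loop iteration.
constexpr double extraCyclesPerIteration() {
    return static_cast<double>(extraCycles()) / kIterations;
}

// 6200 microseconds at 16 cycles per microsecond.
static_assert(extraCycles() == 99200, "unexpected cycle count");
```

The per-iteration figure comes out slightly above 3, so the incrementing loop costs roughly three more clock cycles per pass in this measurement.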

Perhaps incrementing an int takes longer than decrementing it.

I wonder what would happen if you use a byte for the counter and count 0 to 255 and 255 to 0?

...R

Looks like some optimization. Perhaps branch prediction?

First of all, check the syntax of your "for" statements. You are missing a semicolon ';' or an empty block {} after each one, so the micro2 and micro4 assignments are executed inside the loops as the loop bodies. The loops do not 'do nothing'. Also, even though both loops look equal in execution time from the C point of view, the compiler's optimizer does not have to produce the same code for both. If you check the ASM listing you will see a different number of instructions for each loop, fewer for the decrement.

114:	80 91 ca 01 	lds	r24, 0x01CA
 118:	90 91 cb 01 	lds	r25, 0x01CB
 11c:	8f 3f       	cpi	r24, 0xFF	; 255
 11e:	9f 47       	sbci	r25, 0x7F	; 127
 120:	a1 f0       	breq	.+40     	; 0x14a <loop+0x52>
 122:	0e 94 5f 01 	call	0x2be	; 0x2be <micros>
 126:	60 93 c0 01 	sts	0x01C0, r22
 12a:	70 93 c1 01 	sts	0x01C1, r23
 12e:	80 93 c2 01 	sts	0x01C2, r24
 132:	90 93 c3 01 	sts	0x01C3, r25
 136:	80 91 ca 01 	lds	r24, 0x01CA
 13a:	90 91 cb 01 	lds	r25, 0x01CB
 13e:	01 96       	adiw	r24, 0x01	; 1
 140:	90 93 cb 01 	sts	0x01CB, r25
 144:	80 93 ca 01 	sts	0x01CA, r24
 148:	e5 cf       	rjmp	.-54     	; 0x114 <loop+0x1c>
 162:	90 93 c9 01 	sts	0x01C9, r25
 166:	80 93 c8 01 	sts	0x01C8, r24
 16a:	80 91 c8 01 	lds	r24, 0x01C8
 16e:	90 91 c9 01 	lds	r25, 0x01C9
 172:	97 fd       	sbrc	r25, 7
 174:	10 c0       	rjmp	.+32     	; 0x196 <loop+0x9e>
 176:	0e 94 5f 01 	call	0x2be	; 0x2be <micros>
 17a:	60 93 b8 01 	sts	0x01B8, r22
 17e:	70 93 b9 01 	sts	0x01B9, r23
 182:	80 93 ba 01 	sts	0x01BA, r24
 186:	90 93 bb 01 	sts	0x01BB, r25
 18a:	80 91 c8 01 	lds	r24, 0x01C8
 18e:	90 91 c9 01 	lds	r25, 0x01C9
 192:	01 97       	sbiw	r24, 0x01	; 1
 194:	e6 cf       	rjmp	.-52     	; 0x162 <loop+0x6a>
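
The missing-semicolon point is visible in the listing, where both loops contain a call to micros. It can also be reproduced off the board with a small portable C++ sketch. Everything named here is hypothetical: fakeMicros is a stand-in that only counts how often it is called, not the Arduino function.

```cpp
// Hypothetical stand-in for Arduino's micros(): it only counts its calls.
static unsigned long g_microsCalls = 0;
unsigned long fakeMicros() { return ++g_microsCalls; }

// Same shape as the original sketch: no ';' or {} after the for header,
// so the assignment on the following line becomes the loop body.
unsigned long callsWithoutSemicolon() {
    g_microsCalls = 0;
    unsigned long stamp = 0;
    for (int i = 0; i <= 32766; i++)
        // do nothing
        stamp = fakeMicros();
    (void)stamp;  // silence "unused" warnings
    return g_microsCalls;
}

// With an explicit empty body, the assignment runs once, after the loop.
unsigned long callsWithEmptyBody() {
    g_microsCalls = 0;
    unsigned long stamp = 0;
    for (int i = 0; i <= 32766; i++) {
        // do nothing
    }
    stamp = fakeMicros();
    (void)stamp;  // silence "unused" warnings
    return g_microsCalls;
}
```

callsWithoutSemicolon() reports one call per iteration, while callsWithEmptyBody() reports a single call. In the real sketch, note that once the body is truly empty an optimizing compiler may remove the loop altogether unless the counter is declared volatile, so a fixed version might measure almost nothing.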

The difference is certainly that it is much easier, and more efficient, to test for zero than to test for 32766 at the end of each iteration of the loop. Testing for 32766 requires fetching the value 32766 into a register, then comparing it to the loop index, then doing a conditional branch, which is at least 3 clocks total. I'd bet the compare to zero is implicit, and only the branch instruction is needed, which would be only 1 clock.
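
For what it is worth, the listing posted earlier shows the two tests directly: the incrementing loop compares the index against 0x7FFF with cpi and sbci and then uses breq, while the decrementing loop tests only the sign bit with sbrc r25, 7. The portable C++ sketch below checks that these two rewrites agree with the original conditions for every 16-bit value. The function names are invented for illustration, and it assumes a 16-bit two's complement int, as in the sketch.

```cpp
#include <cstdint>

// j >= 0 on a 16-bit signed int holds exactly when bit 15 is clear,
// i.e. bit 7 of the high byte (what "sbrc r25, 7" looks at).
constexpr bool signBitClear(std::int16_t j) {
    return (static_cast<std::uint16_t>(j) & 0x8000u) == 0;
}

// i <= 32766 fails only for 32767 (0x7FFF), so the compiler can test
// "i != 0x7FFF" with a two-byte compare followed by breq.
constexpr bool notInt16Max(std::int16_t i) {
    return i != 0x7FFF;
}

// Compares both rewrites with the plain conditions over all 16-bit values.
bool rewritesMatchForAllValues() {
    for (std::int32_t v = -32768; v <= 32767; ++v) {
        const std::int16_t x = static_cast<std::int16_t>(v);
        if (signBitClear(x) != (x >= 0)) return false;
        if (notInt16Max(x) != (x <= 32766)) return false;
    }
    return true;
}
```

So the test against zero is not free, but it needs one instruction where the test against the constant needs two before the branch.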

Regards, Ray L.

Thanks a lot to all of you. Indeed, that makes a lot of sense. Florin