Tips: For Loop - Incrementing Vs Decrementing

I’ve seen discussions where different compilers are supposed to provide different optimization for different increment methods - A quick and easy experiment to see what the Arduino IDE was like that showed something I did NOT expect!

Objective 1: Which is better, ++i, i++ or i += 1?

Objective 2: Which is better, --i, i-- or i -= 1?

Test code…

/*
  Test2: Run some timing tests
*/

#define LOOP_COUNT 999999

long lMicroStart;
long lMicroStop;
char x;

void setup() 
{

  // Enable Serial Logging
  Serial.begin(9600);
  Serial.println("Hello world!");
  Serial.println();
}

void loop() 
{
  //TESTS
  lMicroStart = millis();
  for (long i = 0; i < LOOP_COUNT; ++i) x ^= i;
  lMicroStop = millis();
  Serial.print("++i: ");
  Serial.println(lMicroStop-lMicroStart);

  lMicroStart = millis();
  for (long i = 0; i < LOOP_COUNT; i++) x ^= i;
  lMicroStop = millis();
  Serial.print("i++: ");
  Serial.println(lMicroStop-lMicroStart);

  lMicroStart = millis();
  for (long i = 0; i < LOOP_COUNT; i += 1) x ^= i;
  lMicroStop = millis();
  Serial.print("i += 1: ");
  Serial.println(lMicroStop-lMicroStart);

  Serial.println();

  lMicroStart = millis();
  for (long i = LOOP_COUNT-1; i >= 0; --i) x ^= i;
  lMicroStop = millis();
  Serial.print("--i: ");
  Serial.println(lMicroStop-lMicroStart);
  
  lMicroStart = millis();
  for (long i = LOOP_COUNT-1; i >= 0; i--) x ^= i;
  lMicroStop = millis();
  Serial.print("i--: ");
  Serial.println(lMicroStop-lMicroStart);
  
  lMicroStart = millis();
  for (long i = LOOP_COUNT-1; i >= 0; i -= 1) x ^= i;
  lMicroStop = millis();
  Serial.print("i -= 1: ");
  Serial.println(lMicroStop-lMicroStart);

  Serial.println();
}

Result for me on an ATmega168…

Hello world!

++i: 1761
i++: 1761
i += 1: 1763

--i: 881
i--: 881
i -= 1: 883

The IDE presumably optimized the code well as ++i, i++ and i += 1 all gave pretty much the same times as each other. Same for --i, i-- and i -= 1.

What WAS a surprise to me was that the – was TWICE as fast as the ++! So, rather than doing “for (x=0;x<10;x++)” use “for (x=9;x>=0;x–)” 8)

I’m not surprised in the slightest. The Zero flag is set upon decrement of the register containing the count vs loading and comparing the count to a value to set the Zero flag, Less instructions to decrement.

Good point Lloyd, but in that case shouldn’t starting at a negative value and incrementing up to zero be just as fast as decrementing down? On my Duemilanove the increments take 880 and decrements take 440. But adding this code:

  lMicroStart = millis();
  for (long i = -LOOP_COUNT; i < 0; i++) x ^= i;
  lMicroStop = millis();
  Serial.print("- i++: ");
  Serial.println(lMicroStop-lMicroStart);

takes 691.

Pete

You’re right it should as the instruction set does increment/decrement setting the same Zero flag. It’s just I haven’t seen a non optimizing compiler handle increments as well.

You could try changing the compilers optimization setting in “platforms.txt” to something a little more aggressive and recompile, test.

lloyddean: I'm not surprised in the slightest. The Zero flag is set upon decrement of the register containing the count vs loading and comparing the count to a value to set the Zero flag, Less instructions to decrement.

Sure - When the final Assembly [machine code / hex / whatever] is produced, 1 instruction would be faster than 2 - knowing the assembly that is produced from the 'C' is the key :roll_eyes: Nice to see that the IDE does great optimization!

I'd be interested in seeing more info on the actual Assembly produced from common 'C' code if a reference exists or in looking at an assembly reference & IDE if such exists!

el_supremo:
Good point Lloyd, but in that case shouldn’t starting at a negative value and incrementing up to zero be just as fast as decrementing down? On my Duemilanove the increments take 880 and decrements take 440. But adding this code:

  lMicroStart = millis();

for (long i = -LOOP_COUNT; i < 0; i++) x ^= i;
  lMicroStop = millis();
  Serial.print("- i++: ");
  Serial.println(lMicroStop-lMicroStart);



takes 691.

Pete

Given your middle value result, I’m assuming that the actual assembly is 4 bytes Vs 2 for my examples, but your increment a negative found a method that takes 3 bytes [or something like that - would love to compare the actual assembly - I can grab the .HEX file, but I need time to build, compare, repeat… A hex reference and memory map would help if one exists]

I figured out how to get assembler code from the compiler. I first did it without the optimizer flag -Os and all the loops looked roughly the same.
With the -Os flag added it changes a lot. Here is just the loops for the assembler code of the C statement shown in the comment:

// for (long i = 0; i < LOOP_COUNT; ++i) x ^= i;
.L2:
	eor r18,r24
	adiw r24,1
	adc r26,__zero_reg__
	adc r27,__zero_reg__
	cpi r24,lo8(999999)
	ldi r19,hi8(999999)
	cpc r25,r19
	ldi r19,hlo8(999999)
	cpc r26,r19
	ldi r19,hhi8(999999)
	cpc r27,r19
	brne .L2
	
// for (long i = -LOOP_COUNT; i < 0; i++) x ^= i;
.L4:
	eor r18,r24
	adiw r24,1
	adc r26,__zero_reg__
	adc r27,__zero_reg__
	sbiw r24,0
	cpc r26,__zero_reg__
	cpc r27,__zero_reg__
	brne .L4
	
//   for (long i = LOOP_COUNT-1; i >= 0; --i) x ^= i;
.L6:
	eor r18,r24
	sbiw r24,1
	sbc r26,__zero_reg__
	sbc r27,__zero_reg__
	brcc .L6

I’m not familiar enough with this assembler code to figure out exactly why the decrement is so much faster than either of the increment loops.
If nobody else can decode this, I’ll see if I can figure it out tomorrow.

Pete

Whoops, I made an assumption and forgot to ask - what board are you compiling for?

Duemilanove - I should have mentioned that!

Pete

el_supremo: I figured out how to get assembler code from the compiler. I first did it without the optimizer flag -Os and all the loops looked roughly the same. With the -Os flag added it changes a lot. Here is just the loops for the assembler code of the C statement shown in the comment:

...

I'm not familiar enough with this assembler code to figure out exactly why the decrement is so much faster than either of the increment loops. If nobody else can decode this, I'll see if I can figure it out tomorrow.

Pete

Thanks Pete - That's awesome!

Was around 25 years ago when I wrote an application in machine code on an Apple //e so I'm a bit rusty, but the theory is still the same - The speed can be related to the number of lines - each line would generally take 1 clock cycle to execute, thus the 5 lines for the decrement is just under half the time of the 12 lines used in the increment, with the 8 lines falling in the middle!

PS: I found the ATMEL codes if you want to determine exactly what happens!

lloyddean: Whoops, I made an assumption and forgot to ask - what board are you compiling for?

Mine is an ATmega168

el_supremo: I'm not familiar enough with this assembler code to figure out exactly why the decrement is so much faster than either of the increment loops. If nobody else can decode this, I'll see if I can figure it out tomorrow.

Pete

For "--i" I think this is ...

eor r18,r24 = That's the "x ^= i" sbiw r24,1 = i must be across r27,26,25 & 24 - Subtract 1 from r25 & 24 (in place with carry) sbc r26, 0 = Do the subtraction for the next higher byte [just using the carry] sbc r27, 0 = Do the subtraction for the final byte [again just using the carry] brcc = Repeat it again if the carry is clear

[if the final subtraction has a carry, the final subtraction crossed 0!]

[u]Looking at the assembly, I can't see any looping with a "long" count as being faster![/u]

PS: I'd expect that changing the long to an int would give just three lines...

eor r18,r24 = That's the "x ^= i" sbiw r24,1 = i must be across 25 & 24 - Subtract 1 from r25 & 24 (in place with carry) brcc = Repeat it again if the carry is clear

And comparing that to the "++i" examples, we have the added load (ldi) and compare (cpi/cpc) much like Lloyd expected ;)

PS: Once again - Excellent compilation/optimization!

Sorky:
So, rather than doing “for (x=0;x<10;x++)” use “for (x=9;x>=0;x–)”

I just want to point out that the second version only works if x is signed. After all, if it is unsigned then it will always be >= 0.

So that’s a trap. And it means you have lost half of your “counting space”. So for example, if you want to count to 200, then x can’t be a char, because you can’t have a char as +200. So you have to use an int. And now it is slower again.

[quote author=Nick Gammon link=topic=88126.msg662168#msg662168 date=1327207480] I just want to point out that the second version only works if x is signed. After all, if it is unsigned then it will always be >= 0.

So that's a trap. And it means you have lost half of your "counting space". So for example, if you want to count to 200, then x can't be a char, because you can't have a char as +200. So you have to use an int. And now it is slower again. [/quote]

Good point on the signed Vs unsigned [lucky I used a long which is signed ;)]!

As the "sdiw" works on two bytes, it may be just as fast on an int as a char - I think I've opened a can of works - Must look at more of the assembly [I think the geek in me is showing ;)]

el_supremo: I figured out how to get assembler code from the compiler.

You (or my inner geek) inspired me! I did a search and figured out how to get the assembly as well

// Hold shift while building
// observe the temporary directory location in the messages
// in the temporary directory do the following
avr-objdump -S *.elf > assembly.txt

Is that the same as you did or is there a better way?

Interesting… I’ll let the results speak for themselves!

for (int i = LOOP_COUNT; i > 0; --i) x ^= i;

  lMicroStart = millis();
 234:	0e 94 3d 02 	call	0x47a	; 0x47a <millis>
 238:	60 93 4a 01 	sts	0x014A, r22
 23c:	70 93 4b 01 	sts	0x014B, r23
 240:	80 93 4c 01 	sts	0x014C, r24
 244:	90 93 4d 01 	sts	0x014D, r25
 248:	90 91 52 01 	lds	r25, 0x0152
 24c:	83 e6       	ldi	r24, 0x63	; 99
  for (int i = LOOP_COUNT; i > 0; --i) x ^= i;
 24e:	98 27       	eor	r25, r24
 250:	81 50       	subi	r24, 0x01	; 1
 252:	e9 f7       	brne	.-6      	; 0x24e <loop+0x190>
 254:	90 93 52 01 	sts	0x0152, r25
  lMicroStop = millis();

for (char i = LOOP_COUNT; i > 0; --i) x ^= i;

  lMicroStart = millis();
 234:	0e 94 3d 02 	call	0x47a	; 0x47a <millis>
 238:	60 93 4a 01 	sts	0x014A, r22
 23c:	70 93 4b 01 	sts	0x014B, r23
 240:	80 93 4c 01 	sts	0x014C, r24
 244:	90 93 4d 01 	sts	0x014D, r25
 248:	90 91 52 01 	lds	r25, 0x0152
 24c:	83 e6       	ldi	r24, 0x63	; 99
  for (char i = LOOP_COUNT; i > 0; --i) x ^= i;
 24e:	98 27       	eor	r25, r24
 250:	81 50       	subi	r24, 0x01	; 1
 252:	e9 f7       	brne	.-6      	; 0x24e <loop+0x190>
 254:	90 93 52 01 	sts	0x0152, r25
  lMicroStop = millis();

for (unsigned char i = LOOP_COUNT; i > 0; --i) x ^= i;

  lMicroStart = millis();
 234:	0e 94 3d 02 	call	0x47a	; 0x47a <millis>
 238:	60 93 4a 01 	sts	0x014A, r22
 23c:	70 93 4b 01 	sts	0x014B, r23
 240:	80 93 4c 01 	sts	0x014C, r24
 244:	90 93 4d 01 	sts	0x014D, r25
 248:	90 91 52 01 	lds	r25, 0x0152
 24c:	83 e6       	ldi	r24, 0x63	; 99
  for (unsigned char i = LOOP_COUNT; i > 0; --i) x ^= i;
 24e:	98 27       	eor	r25, r24
 250:	81 50       	subi	r24, 0x01	; 1
 252:	e9 f7       	brne	.-6      	; 0x24e <loop+0x190>
 254:	90 93 52 01 	sts	0x0152, r25
  lMicroStop = millis();

NOTE: I changed the original “for (int i = LOOP_COUNT-1; i >= 0; --i) x ^= i;” to the above because the original produced logic with a load and compare!

Moral of the story - to really optimize the performance of your code, get to know the assembler that it becomes!

I’d probably use -

for ( unsigned char i = LOOP_COUNT; i--; ) { x ^= i; }

NOTE: that is post decrement, meaning check if ‘i’ is non-zero first then decrement it’s value. No more awkward code.

lloyddean: NOTE: that is post decrement, meaning check if 'i' is non-zero first then decrement it's value. No more awkward code.

Not sure that the pre/post affects the compare - possible I guess - depends on the IDE? Easy enough now to just check the assembly to see if there is a difference!

You've actually just hit the subject that got me doing this testing in the first place - Several forums argue that ++i is better than i-- based on the premise that the assembly for ++i would be a simple increment in place, where the i++ would require the value to be copied to another location (for the original return value) and then incremented... Just search ++i Vs i++ and you'll find plenty of stuff... I can see how that may apply so I figured - why not test it! Then I figured I'd try the decrement and found the big improvement and figured I'd better share ;)

PS: Different hardware and different compilers (or even different optimization settings) can give different assembly, so the for me the answer is, have a look at the assembly for a few different ways and take the shortest!

Is that the same as you did or is there a better way?

Different, not necessarily better. In the IDE Tools|Preferences I turned on verbose output for the compiler. From that I found the temporary directory where the intermediate output was going and it also shows the compiler commands used. I then set up my own command line call to the compiler and added -S for the assembler output which gives this monstrous command:

C:\arduino-1.0\hardware\tools\avr\bin\avr-gcc -S -Os -mmcu=atmega328p -DF_CPU=16000000L -DARDUINO=100 -IC:\arduino-1.0\hardware\arduino\cores\arduino -IC:\arduino-1.0\hardware\arduino\variants\standard incrtime.cpp

However, I did this using the scripting language TCL and its command shell called WISH so it wasn't as complicated as it looks. That command produces incrtime.s which is the assembler code.

Pete