Obscure missed compiling optimisation...

I am messing around with optiboot to make it smaller for an ATTiny, and have found one place where I can save 8 byte (ends up being 64 as it allows the bootloader to be moved one page closer to the end)

To save space, i first declared this:

typedef union {
	uint16_t integer;
	uint8_t array[2];
}twoByte;

It replaces things like:
someUint16 = (someUint8 << 8 ) | someOtherUint8;
To be:
someTwoByte.array[1] = someUint8;
someTwoByte.array[0] = someOtherUint8;
And can then be used directly as a int, massively reducing the code.

There is one bit where I am trying to use it however where the compiler could make an easy optimisation, but doesn't:

      bufPtr = buff;
      addrPtr = (uint16_t)(void*)address;
      ch = SPM_PAGESIZE / 2;
      do {
        twoByte a; //Again by using a union, code length is slashed, this time by 16 bytes.
        a.array[0] = *bufPtr++;
        a.array[1] = *bufPtr++;
        __boot_page_fill_short((uint16_t)(void*)addrPtr,a.integer);
        addrPtr += 2;
      } while (--ch);

Becomes:

    1e8e:	a0 e0       	ldi	r26, 0x00	; 0
    1e90:	b1 e0       	ldi	r27, 0x01	; 1
//start of the do while loop here
		twoByte a; //Again by using a union, code length is slashed, this time by 16 bytes.
		a.array[0] = *bufPtr++;
    1e92:	6c 90       	ld	r6, X ; brne jumps to here
		a.array[1] = *bufPtr++;
    1e94:	11 96       	adiw	r26, 0x01	; 1
    1e96:	7c 90       	ld	r7, X
    1e98:	11 97       	sbiw	r26, 0x01	; 1 - Subtract one
    1e9a:	12 96       	adiw	r26, 0x02	; 2 - Then add two?!?!?!?!?!
...
    1eae:	89 f7       	brne	.-30     	; //end of the do-while loop

For some reason it not only decides to use the "adiw" instruction, but also decides to essentially do "*bufPtr -1 +2", rather than just "*bufPtr + 1".

The amusing thing is that if I modify the code above to be this:

        twoByte a; //Again by using a union, code length is slashed, this time by 16 bytes.
        a.array[0] = *bufPtr++;
        a.array[1] = *bufPtr;//++;

It does what it is supposed to do, and uses this:

    1e8e:	a0 e0       	ldi	r26, 0x00	; 0
    1e90:	b1 e0       	ldi	r27, 0x01	; 1
//start of the do while loop here
		twoByte a; //Again by using a union, code length is slashed, this time by 16 bytes.

		a.array[0] = *bufPtr++;
    1e92:	6d 90       	ld	r6, X+
		a.array[1] = *bufPtr;//++;
    1e94:	7c 90       	ld	r7, X
...
    1eae:	89 f7       	brne	.-30     	; //end of the do-while loop

Notice the use of X+ instead of adiw, which is a great optimisation but doesn't help me unless I can find a way to make it generate this:

    1e8e:	a0 e0       	ldi	r26, 0x00	; 0
    1e90:	b1 e0       	ldi	r27, 0x01	; 1
//start of the do while loop here
		twoByte a; //Again by using a union, code length is slashed, this time by 16 bytes.

		a.array[0] = *bufPtr++;
    1e92:	6d 90       	ld	r6, X+
		a.array[1] = *bufPtr++;
    1e94:	7c 90       	ld	r7, X+
...
    1eae:	89 f7       	brne	.-30     	; //end of the do-while loop

Using X+ for both.

Could someone explain this one to me? as I can't make heads or tails of it, let alone fix it.

Well, i managed to find an optimisation elsewhere which saved the bytes, so fixing this isn't needed, though would be useful.

Well it would help if the Arduino project used more modern compilers. I downloaded the 1.0.1 IDE for Linux, and the compiler that they include in the tar ball is 4.3.2 which was released in August 27, 2008.

The avr-gcc that is in the Fedora 14 distribution that I normally run is 4.5.3, which was released on April 28th, 2011.

The avr-gcc that is in the Fedora 17 distribution that I've been trying to run is 4.6.3, which was released on March 1st, 2012.

GCC 4.7.1 was released on June 14th, 2012. I don't have a 4.7 built off hand.

Just for fun, I tried some small code on the 3 compilers, and the 4.3.2 code was 124 bytes, the 4.5.3 code was 104 bytes, and the 4.6.3 code was 84 bytes. Now obviously, it depends on what is being optimized.

I have some 85's I'm keen on using, might be useful for me too, can you post your current code?

Typically, expressions are represented as a tree and the optimizer works through the tree trying to flatten and prune. My gut tells me the optimizer is limited to 2 passes through the tree and only "sees" a relatively short distance from any given node. There also appear to be a few case specific deficiencies in the optimizer. In other words, the optimizer is good but not great and occasionally does rather dumb things.

There is one bit where I am trying to use it however where the compiler could make an easy optimisation, but doesn't:

I strongly suggest you not bother with it. In my experience, any tweaks you apply to get the optimizer to behave a certain way adversely affect other nearby code. In other words, what you are trying to do is annoyingly similar to Whack-A-Mole; smack down one bit of "bad" code only to have another pop up someplace else.

Yeah, I am going to leave this one be. I managed to find something unneeded elsewhere which saved an equivalent amount of space allowing me to move the bootloader up one page (64bytes of extra space!).

pYro_65:
I have some 85's I'm keen on using, might be useful for me too, can you post your current code?

I will be posting my code after a little more testing to make sure it works. It was built for the Tiny84, but should with a couple of extra #ifdef statements support the tiny 85 as I believe both work in an almost identical fashion.

Edit:
Timer 1 is different between the tiny84 and tiny85, but with a tiny modification, I have managed to compile an identical bootloader for the Tiny85 - totally untested though.

Believe me, GCC has many, many passes. :slight_smile:

I suspect it may be aliasing issues, where the compiler has to be real careful so as not to mess up calculations. Improving the alias detection has been one of the ongoing projects.

It may be as I mentioned before, it is fixed in newer releases. However, if it is not fixed, you need to report the bug via GCC's bugzilla process in order to get it looked at. This includes having a complete source file so that the compiler maintainers can reproduce it. Unfortunately, in order to simplify things for newbies, the Arduino IDE hides a lot of the details from you, and it makes it harder.

[quote author=Tom Carpenter link=topic=117713.msg885759#msg885759 date=1344388461]
]There is one bit where I am trying to use it however where the compiler could make an easy optimisation, but doesn't:[/quote]

It really depends on the specifics. Without a complete source to a function in question that shows the bug, it could be anything. Believe me, what people think are easy optimizations, often aren't. Sometimes it is a case, the maintainer didn't think of, other times it may be that optimizing that case is priority #565, and the developer has only gotten to #267.

Just a quick thank you - for introducing me to the union trick, it has been immediatley useful to me.

Duane B
rcarduino.blogspot.com

DuaneB:
Just a quick thank you - for introducing me to the union trick, it has been immediatley useful to me.

Duane B

I should mention that using unions is the preferred way. If you take random pointers and convert them to char * pointers to get to the individual bytes without using a union, you will get yourself in trouble, particularly with newer compilers. This is non-standard behaviour and a number of people find that it just sort of happened to work in a previous revision, and it doesn't in the newer revision, and their code was broken in the first place.

Hi,
I have gone for this -

union time16_t{
uint16_t theTime;
struct {
 uint8_t theLowByte; // little endian ordering
 uint8_t theHighByte;
 };
};

Results are as expected -

// nonsensical snippet to test compiler output using union to access individual
// bytes of 2 byte int. 

// result = single load, compare and store, Yay !

 if(currentTime.theHighByte > 100)
     1c4:	80 91 3a 01 	lds	r24, 0x013A
     1c8:	85 36       	cpi	r24, 0x65	; 101
     1ca:	38 f0       	brcs	.+14     	; 0x1da <loop+0xb6>
  {
    if(currentTime.theLowByte < 20)
     1cc:	80 91 39 01 	lds	r24, 0x0139
     1d0:	84 31       	cpi	r24, 0x14	; 20
     1d2:	18 f4       	brcc	.+6      	; 0x1da <loop+0xb6>
    {
     currentTime.theHighByte = 1;   
     1d4:	81 e0       	ldi	r24, 0x01	; 1
     1d6:	80 93 3a 01 	sts	0x013A, r24
    }
}

It should cut down quite a bit of code where I do not need the resolution offered by the least significant byte in comparisons and range checks but still need maximum accuracy in later operations.

Great, Thanks

Duane B

rcarduino.blogspot.com

DuaneB:
Hi,
I have gone for this -

union time16_t{

uint16_t theTime;
struct {
uint8_t theLowByte; // little endian ordering
uint8_t theHighByte;
};
};

Just bear in mind that this may have to be adjusted if you ever move your code off Arduinos. Some processors are big endian (or bi-endian, and the RTOS/OS you are using might be running in big-endian mode). Also using unions will make your code slower if you are running on a machine with 16/32/64-bit native integers.