Odd timing behavior on Uno with IDE 1.0.5

I have a decent logic analyzer (Logic16 to be more precise). Using the attached code, and with the logic analyzer taking samples at 16Mhz, it shows 0.125us pulse width and a .375us inter-pulse interval on the pin no matter what TEST_VALUE is set at. The code is functional in a first sense. A pulse train is recorded, so it is going around on the while(1) loop. It was supposed to be a simple test of a more complex design issue, but I must be doing something in ignorance or stupidity. Any thoughts?

#define TEST_VALUE 1000000

int main( void )
{
  unsigned long test_delay;
  
  DDRC   |= (1 << 1);                    // Enable pin 15 for OUTPUT
  
  while( 1 )
  {
    PORTC  |= (1 << 1);                  // Set pin 15 HIGH
    test_delay = TEST_VALUE;
    
    while( test_delay > 0)
    {
      test_delay--;
    }
    
    PORTC  &= ~(1 << 1);                  // Set pin 15 LOW
    test_delay = TEST_VALUE;
    
    while( test_delay > 0)
    {
      test_delay--;
    }
  }
}

Did you use pinMode on the pin to make it an output?
Wouldn't affect the timing I suppose.
I wonder if the compiler optimiezes this away

while( test_delay > 0)
{
test_delay--;
}

since it doesn't really do anything.
Maybe use delayMicroseconds instead.

You could also use
PINC = PINC | 0b00000001;
to make PORTC bit 0 wiggle in the while loop

while( test_delay > 0)
{
test_delay–;
PINC = PINC | 0b00000001;
}

and make TEST_VALUE a little smaller to compensate.

What I want to do is make a little busy loop for my own version of a delay. So I wrote something simple with the intent to vary the delay value and measure the differences via the logic analyzer to get a calibration. But I got this rather odd behavior instead. Perhaps it is the optimizer.

Compiler thinks the loop is pointless since the variable's result is never used elsewhere. Try declaring the variable static volatile.

Yep. This is what the compiler generated:

000000a6 <main>:
  a6:	39 9a       	sbi	0x07, 1	; 7
  a8:	41 9a       	sbi	0x08, 1	; 8
  aa:	41 98       	cbi	0x08, 1	; 8
  ac:	fd cf       	rjmp	.-6      	; 0xa8 <main+0x2>

No counting there.

Try making the variable volatile (not static) and moving it outside main, like this:

#define TEST_VALUE 1000000
volatile unsigned long test_delay;

int main( void )
...

That generates a lot more code.

That's it! Thanks, Nick.

There's no doubt some fixed overhead in this primitive scheme. As the count shortens much below 100, the relationship becomes non-linear. This is on a 16Mhz Uno.

test_count timing(with rounding)

10000 21271us
1000 2128us
100 214us
10 23us
2 5.8us

Here’s the machine code from using “volatile” with the cycle counts in brackets:

000000a6 <main>:
  a6:   39 9a           sbi     0x07, 1 ; 7   (2)
  a8:   20 e4           ldi     r18, 0x40       ; 64   (1)
  aa:   32 e4           ldi     r19, 0x42       ; 66   (1)
  ac:   4f e0           ldi     r20, 0x0F       ; 15   (1)
  ae:   50 e0           ldi     r21, 0x00       ; 0   (1)
  b0:   41 9a           sbi     0x08, 1 ; 8   (2)
  b2:   20 93 00 01     sts     0x0100, r18   (2)
  b6:   30 93 01 01     sts     0x0101, r19   (2)
  ba:   40 93 02 01     sts     0x0102, r20   (2)
  be:   50 93 03 01     sts     0x0103, r21   (2)
  c2:   13 c0           rjmp    .+38            ; 0xea <main+0x44>   (2)
  c4:   80 91 00 01     lds     r24, 0x0100   (2)
  c8:   90 91 01 01     lds     r25, 0x0101   (2)
  cc:   a0 91 02 01     lds     r26, 0x0102   (2)
  d0:   b0 91 03 01     lds     r27, 0x0103   (2)
  d4:   01 97           sbiw    r24, 0x01       ; 1   (2)
  d6:   a1 09           sbc     r26, r1   (1)
  d8:   b1 09           sbc     r27, r1   (1)
  da:   80 93 00 01     sts     0x0100, r24   (2)
  de:   90 93 01 01     sts     0x0101, r25   (2)
  e2:   a0 93 02 01     sts     0x0102, r26   (2)
  e6:   b0 93 03 01     sts     0x0103, r27   (2)
  ea:   80 91 00 01     lds     r24, 0x0100   (2)
  ee:   90 91 01 01     lds     r25, 0x0101   (2)
  f2:   a0 91 02 01     lds     r26, 0x0102   (2)
  f6:   b0 91 03 01     lds     r27, 0x0103   (2)
  fa:   00 97           sbiw    r24, 0x00       ; 0   (2)
  fc:   a1 05           cpc     r26, r1   (1)
  fe:   b1 05           cpc     r27, r1   (1)
 100:   09 f7           brne    .-62            ; 0xc4 <main+0x1e>   (1/2)
 102:   41 98           cbi     0x08, 1 ; 8   (2)
 104:   20 93 00 01     sts     0x0100, r18   (2)
 108:   30 93 01 01     sts     0x0101, r19   (2)
 10c:   40 93 02 01     sts     0x0102, r20   (2)
 110:   50 93 03 01     sts     0x0103, r21   (2)
 114:   13 c0           rjmp    .+38            ; 0x13c <main+0x96>   (2)
 116:   80 91 00 01     lds     r24, 0x0100   (2)
 11a:   90 91 01 01     lds     r25, 0x0101   (2)
 11e:   a0 91 02 01     lds     r26, 0x0102   (2)
 122:   b0 91 03 01     lds     r27, 0x0103   (2)
 126:   01 97           sbiw    r24, 0x01       ; 1   (2)
 128:   a1 09           sbc     r26, r1   (1)
 12a:   b1 09           sbc     r27, r1   (1)
 12c:   80 93 00 01     sts     0x0100, r24   (2)
 130:   90 93 01 01     sts     0x0101, r25   (2)
 134:   a0 93 02 01     sts     0x0102, r26   (2)
 138:   b0 93 03 01     sts     0x0103, r27   (2)
 13c:   80 91 00 01     lds     r24, 0x0100   (2)
 140:   90 91 01 01     lds     r25, 0x0101   (2)
 144:   a0 91 02 01     lds     r26, 0x0102   (2)
 148:   b0 91 03 01     lds     r27, 0x0103   (2)
 14c:   00 97           sbiw    r24, 0x00       ; 0   (2)
 14e:   a1 05           cpc     r26, r1   (1)
 150:   b1 05           cpc     r27, r1   (1)
 152:   09 f4           brne    .+2             ; 0x156 <main+0xb0>   (1/2)
 154:   ad cf           rjmp    .-166           ; 0xb0 <main+0xa>   (2)
 156:   df cf           rjmp    .-66            ; 0x116 <main+0x70>   (2)

Loading a long, adding one, comparing and then storing it is going to take a few cycles as you can see, so an overhead isn’t too surprising.

Nick, while your code fixes the problem, I am curious as to why the OP code does not work.
On face value it should work as intended and similar constructs would probably be quite common out there in Arduino land. Why doesn't it work? (Obviously using a volatile variable is the reason, but why does it need to be volatile and an internally declared variable not suitable).

The compiler is free to optimize away unused variables. And if "used" is only in the sense of adding 1 to it (but never using the result) then it is unused.

However adding volatile tells the compiler that this might have a side-effect (eg. an output pin) and therefore not to be too clever about optimizing it away.

An alternative might have been to print the result, unless the compiler is smart enough to pre-compute what that loop would have made it anyway.

why does it need to be volatile

Volatile tells the compiler that the variable may change (not will, only may) at any time. So, the compiler must generate code to fetch the value every time the variable is referenced. It can not use a cached value.

Fetching the value every time is going to, obviously, take longer than accessing a cached value.

While not disputing your explanation, this would seem to be a fault with the compiler.

And if "used" is only in the sense of adding 1 to it (but never using the result) then it is unused.

The evaluation "while ( test_delay > 0)" would seem to be a use of the variable and therefore worthy(vital) of inclusion in the compiled code.

@PaulS

So, the compiler must generate code to fetch the value every time the variable is referenced. It can not use a cached value.

Does this mean that a volatile variable is stored in flash while a "normal" variable is store in SRAM and a fair amount of extra code is necessary to read from flash? (I take "cached" to mean stored in SRAM).

Does this mean that a volatile is stored in flash while a "normal" is store in SRAM and a fair amount of extra code is necessary to read from flash? (I take "cached" to mean stored in SRAM).

No. All variables are are stored in SRAM. If the compiler knows that a variable can not change outside of the scope of a function, the compiler might store the value in a register. Register access is far faster than SRAM access.

If the compiler knows that a variable can change outside the scope of a function, it won't store the value in a register. Therefore, it will need to access the value from SRAM every time.

It may also need to generate extra code, since it can not optimize away apparently useless code.

lemming:
The evaluation “while ( test_delay > 0)” would seem to be a use of the variable and therefore worthy(vital) of inclusion in the compiled code.

Not at all. For example:

int main (void)
  {
  int a = 3;
  }

That does not generate any code to load 3 into “a”. Why? Because “a” is never used. Think of “used” in the sense of “being output in some way”.

Interestingly, I was right about the compiler knowing what test_delay was:

#define TEST_VALUE 1000000

int main( void )
{
  unsigned long test_delay;
  
  DDRC   |= (1 << 1);                    // Enable pin 15 for OUTPUT
  
  while( 1 )
  {
    PORTC  |= (1 << 1);                  // Set pin 15 HIGH
    test_delay = TEST_VALUE;
    
    while( test_delay > 0)
    {
      test_delay--;
    }
    
    PORTC  &= ~(1 << 1);                  // Set pin 15 LOW
    test_delay = TEST_VALUE;
    
    while( test_delay > 0)
    {
      test_delay--;
    }
  Serial.println (test_delay);
  }
  
}

Generates:

000000be <main>:
  be:	39 9a       	sbi	0x07, 1	; 7
  c0:	41 9a       	sbi	0x08, 1	; 8
  c2:	41 98       	cbi	0x08, 1	; 8
  c4:	84 e9       	ldi	r24, 0x94	; 148
  c6:	91 e0       	ldi	r25, 0x01	; 1
  c8:	40 e0       	ldi	r20, 0x00	; 0
  ca:	50 e0       	ldi	r21, 0x00	; 0
  cc:	60 e0       	ldi	r22, 0x00	; 0
  ce:	70 e0       	ldi	r23, 0x00	; 0
  d0:	2a e0       	ldi	r18, 0x0A	; 10
  d2:	30 e0       	ldi	r19, 0x00	; 0
  d4:	0e 94 a9 02 	call	0x552	; 0x552 <_ZN5Print7printlnEmi>
  d8:	f3 cf       	rjmp	.-26     	; 0xc0 <main+0x2>

Whilst I can’t explain some of those figures, clearly there is no looping there. After all the compiler looks at:

    while( test_delay > 0)
    {
      test_delay--;
    }

So what must test_delay be? Zero. So the loop can be discarded.

lemming:
The evaluation "while ( test_delay > 0)" would seem to be a use of the variable and therefore worthy(vital) of inclusion in the compiled code.

I'm not as smart as the compiler or the people who wrote it. Looks like the compiler sees that before the while loop the value is set to a constant (remember that defines are marcos) and then compared to the value 0. It's smart enough to figure out that evaluation ahead of time.

The next time test_delay is accessed, after the while loop, the current value is overwritten.

The only code inside the loop changed the value of test_delay which was not used after the scope of the while loop.

So it thought "hey, silly human, they didn't mean to put this pointless loop here!"

Seems rather extreme, but I think the gcc command line has optimizations set pretty high as I recall.