[SOLVED] Strange problem with volatile register being decremented inside an ISR

Hi all,

I have a problem which I think I know what the problem is, but have no clue what to do about it.

I have a timer driven ISR which will light an LED (infrared actually - but that doesn't matter) and I "push in" a pulse count into a volatile uint16_t which the ISR uses to count pulses. Outside the ISR in the real world, I run a simple loop to check when the uint16_t is decremented to 0 (telling me the ISR is done and ready for more).

At this point it may help to look at the code snippet (first, the ISR):

ISR (TIMER1_COMPA_vect)
{
    if (count) { // if a count was pushed in...
        state ? IO_PORT &= ~_BV (IR_BIT) : onoff ? IO_PORT |= _BV (IR_BIT) : IO_PORT &= ~_BV (IR_BIT);
        count -= state; // decrement count (on times only)
        state = state ? 0 : 1; // if led was on set it off, else set it on
    }
}

"count" is a uint16_t (volatile) and state is a uint8_t (also volatile) which only has the value 0 or 1.

Now here's what "pushes" values into "count" and then waits for it to finish:

void send_cmd (const uint16_t *command)
{
    uint16_t *ptr;

    IO_DDR |= _BV (IR_BIT); // set IR drive pin as an output

    onoff = 0; // reset on/off LED polarity

    ptr = (uint16_t *) command; // point to command table

    while ((count = pgm_read_word (ptr++))) { // push pulse count in
        onoff = ((ptr - command) % 2); // set LED on or off
        while (count);
        if (count) { fprintf (stdout, "FAILED!!! %u\n", count); }
    }

    IO_DDR &= ~_BV (IR_BIT); // set IR drive pin back to an input
}

Just look at the while loop (that's the only "important" part). I get a 16 bit value from PROGMEM, then stuff it into "count", then wait until the ISR decrements it down to zero (the "while (count);" statement in the middle of the while loop).

I was looking at the generated waveforms on the silly-scope and occasionally if would "screw up". I eventually found that putting ANOTHER "while (count);" in the loop (that is, two in a row) stopped the problem, but I didn't like it because I was MASKING a deeper problem.

So I stuck the printf line inside as a diagnostic and what do I find? Whenever it fails, "count" has a value of 255 when the previous line (while (count):wink: said it was zero.

Ah ha! Now I see (I think I see) what's happening. The two 8 bit halves of the uint8_t are not being updated together (atomically??)

So what I need to know is how to decrement "count" and when it tests as zero, it really is zero to any and all who look at it.

Any help or advice will be greatly appreciated. Thanks!

Accesses to entities shared with ISRs (bigger than 1 byte (like ints))
have to be guarded by disabling interrupts before and enabling after the read/write.

Another (unrelated) thing which is interesting is that for a different project, I had to modify the code above to use 16 bit pulse counts. So I just changed everything related to "uint16_t" and it worked just fine.

Then I said "WHAT THE ?????" IT shouldn't work!!!

Since I generate the even/odd (LED on/off) value by subtracting the current pointer from the start pointer and then doing a "% 2" (mod 2) it shouldn't work because the pointer gets moved by TWO for each "ptr++" instead of ONE if a uint8_t is used.

So I made a test program in Linux/GCC and this is what I got (first the code):

#include <stdio.h>
#include <inttypes.h>

uint16_t cmd[] = { // dummy data
    // header
    0x0060, 0x0018,
    // command
    0x0030, 0x0018, 0x0030, 0x0018, 0x0030, 0x0018, 0x0018, 0x0018, // E
    0x0030, 0x0018, 0x0030, 0x0018, 0x0018, 0x0018, 0x0018, 0x0018, // C
    // end of command
    0x0000,
};

volatile uint16_t count;
volatile uint8_t onoff;

int main (int argc, char *argv[])
{
    uint16_t *ptr;
    uint16_t delta;
    onoff = 0;
    ptr = cmd;

    while ((count = *ptr++)) {
        onoff = ((ptr - cmd) % 2);
        delta = (ptr - cmd);
        fprintf (stdout, "data: %u, ptr: 0x%08x, cmd: 0x%08x, delta: %u, onoff: %u\n", count, ptr, cmd, delta, onoff);
    }

    return 0;
}

What it prints:

** **root@michael:~/c-progs# ./test data: 96, ptr: 0x00600bc2, cmd: 0x00600bc0, delta: 1, onoff: 1 data: 24, ptr: 0x00600bc4, cmd: 0x00600bc0, delta: 2, onoff: 0 data: 48, ptr: 0x00600bc6, cmd: 0x00600bc0, delta: 3, onoff: 1 data: 24, ptr: 0x00600bc8, cmd: 0x00600bc0, delta: 4, onoff: 0 data: 48, ptr: 0x00600bca, cmd: 0x00600bc0, delta: 5, onoff: 1 data: 24, ptr: 0x00600bcc, cmd: 0x00600bc0, delta: 6, onoff: 0 data: 48, ptr: 0x00600bce, cmd: 0x00600bc0, delta: 7, onoff: 1 data: 24, ptr: 0x00600bd0, cmd: 0x00600bc0, delta: 8, onoff: 0 data: 24, ptr: 0x00600bd2, cmd: 0x00600bc0, delta: 9, onoff: 1 data: 24, ptr: 0x00600bd4, cmd: 0x00600bc0, delta: 10, onoff: 0 data: 48, ptr: 0x00600bd6, cmd: 0x00600bc0, delta: 11, onoff: 1 data: 24, ptr: 0x00600bd8, cmd: 0x00600bc0, delta: 12, onoff: 0 data: 48, ptr: 0x00600bda, cmd: 0x00600bc0, delta: 13, onoff: 1 data: 24, ptr: 0x00600bdc, cmd: 0x00600bc0, delta: 14, onoff: 0 data: 24, ptr: 0x00600bde, cmd: 0x00600bc0, delta: 15, onoff: 1 data: 24, ptr: 0x00600be0, cmd: 0x00600bc0, delta: 16, onoff: 0 data: 24, ptr: 0x00600be2, cmd: 0x00600bc0, delta: 17, onoff: 1 data: 24, ptr: 0x00600be4, cmd: 0x00600bc0, delta: 18, onoff: 0** **
OK now have I been drinking too much, or do I really see that the pointer "ptr" increments by TWO, but subtracting the ptr from the start yields a difference of ONE???

It is working the way I wanted it to work, but it seems to me that it SHOULDN"T be working. What the heck???

Whandall:
Accesses to entities shared with ISRs (bigger than 1 byte (like ints))
have to be guarded by disabling interrupts before and enabling after the read/write.

What you're saying seems to make sense, but I wonder why? There is nothing else running that would touch the variables (the code is compiled as a .cpp file using AVR-GCC and I know exactly what's running and where and when - it has no Arduino libraries in it).

Now when you say "disable / enable interrupts..... after the read/write", what do you mean by "read/write"?

Do you mean when I decrement the counter inside the ISR or when I test it for zero (or all of the above)?

The thing is, testing the variable for zero is asynchronous with it being decremented, so it could be "partially decremented" (like maybe the lower 8 bits underflowed and I read it at the exact moment before the borrow is subtracted from the high 8 bits, so my "snapshot" sees zero, but in reality it went from 257 to 256).

How can disabling interrupts anywhere keep me from accidentally "seeing" the variable "in the middle" of the update process?

Another strange thing is that some of the counts being "pushed" in are larger even than 512 (there's a 1760 [0x06E0] in there) so I would expect to see random failures with the value of 0x00FF, 0x01FF,... up to 0x05FF depending on which underflow/borrow I happened to stumble into... right?

Whandall:
Accesses to entities shared with ISRs (bigger than 1 byte (like ints))
have to be guarded by disabling interrupts before and enabling after the read/write.

Searching Google for something dusty I found in the back of my mind.... is THIS what my problem is and what I need:

24.4.7.2 Atomic Types

To avoid uncertainty about interrupting access to a variable, you can use a particular data type for which access is always atomic: sig_atomic_t. Reading and writing this data type is guaranteed to happen in a single instruction, so there’s no way for a handler to run “in the middle” of an access.

The type sig_atomic_t is always an integer data type, but which one it is, and how many bits it contains, may vary from machine to machine.

Data Type: sig_atomic_t

This is an integer data type. Objects of this type are always accessed atomically.

In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.

For all I know that's the way pointers to objects should behave.

By looking at the individual addresses used for the objects,
one easily forgets about them being pointers to objects, not bytes.

Pointers to ints increment and decrement by adding 2 to the byte address.
So subtracting pointers to ints generate a 'delta-index' value,
I think it has to be that way to be consistent with the array notation ptr[].

If you try to mix pointers to different objects in expressions,
the compiler will force you to specify the intended interpretation.

For the interrupt problem see http://www.atmel.com/webdoc/AVRLibcReferenceManual/group__util__atomic.html

volatile uint26_t count;

...

    while ((count = pgm_read_word (ptr++))) { // push pulse count in
        onoff = ((ptr - command) % 2); // set LED on or off
        while (1)
        {
            uint16_t cnt;
            noInterrupts();
            cnt = count;
            Interrupts();
            if (!cnt)
                break;
        }
        if (count) { fprintf (stdout, "FAILED!!! %u\n", count); }
    }

Regards,
Ray L.

Whandall:
For the interrupt problem see http://www.atmel.com/webdoc/AVRLibcReferenceManual/group__util__atomic.html

Well, I found and read the link you gave me. Problem is, I don't get it.

Could you clarify or give me a small example?

I managed to set it up so that it fails reliably (that's a laugh!) so when I find something that works (other than my double-test kludge), I'll know it.

Ugh... need to get some sleep maybe. I've been up for almost 48 hours straight (and my wife isn't in labor!) :slight_smile:

RayLivingston:

volatile uint26_t count;

...

while ((count = pgm_read_word (ptr++))) { // push pulse count in
       onoff = ((ptr - command) % 2); // set LED on or off
       while (1)
       {
           uint16_t cnt;
           noInterrupts();
           cnt = count;
           Interrupts();
           if (!cnt)
               break;
       }
       if (count) { fprintf (stdout, "FAILED!!! %u\n", count); }
   }




Regards,
Ray L.

Well, I think we have a winner here! It works!

I had to do it slightly differently though (of course "cnt" is declared in this function as a uint16_t.)

        do {
            cli ();
            cnt = count;
            sei ();
        } while (cnt);

So, I thought "maybe it's just the delay of all that code running that's allowing the count decrement to finish. So I tested it.

First I commented out the cli and sei - problem came back. Well, maybe cli and sei take enough time to hide it, so I changed the cli to an sei (same clock cycles). Problem was there.

Then put it back the way it's supposed to be - works fine.

Lastly, instead of running real command strings through it, I ran a whole range of "count" cycle counts from 10 to 10000 and setup a counter to increment whenever it failed. It took a while to finish, but when it did.... error count was zero!

You, sir, have solved my problem, and since I see how and why it works, I know it's not a kludge (even though it looks like one at a glance).

Thank you!

You were given the correct answer in the very first response you received, from Whandall....

The reason it works has NOTHING to do with delays. It has to do with the fact that it takes multiple instructions to read or write ANY variable which is more than 8 bits, and an interrupt can occur after only the first byte has been read or written, so you can sometimes read one byte from before the just before the variable was changed, and the other byte from just after the variable was changed by the ISR. If the value in the variable is 0x0100, you will sometimes read it as 0x0000.

Regards,
Ray L.

RayLivingston:
You were given the correct answer in the very first response you received, from Whandall....

The reason it works has NOTHING to do with delays. It has to do with the fact that it takes multiple instructions to read or write ANY variable which is more than 8 bits, and an interrupt can occur after only the first byte has been read or written, so you can sometimes read one byte from before the just before the variable was changed, and the other byte from just after the variable was changed by the ISR. If the value in the variable is 0x0100, you will sometimes read it as 0x0000.

Regards,
Ray L.

Yes, you're right. But the first response stated a fact rather than providing an example. Maybe I'm just tired, but I didn't get it at first.

In fact, I feel foolish doing the "range of counts" test because the problem had nothing to do with the actual count, but rather whenever at random I would asynchronously try to read the value of a variable that was only "half cooked".

Any value at any random time could produce the problem - I see that now. I've been up for two days and my brain is stuck in that "too tired to sleep and too tired to function" mode.

Maybe I should just set sleep mode, turn off the lights and the computer and try to get some sleep.
With my luck I'll probably forget to enable my interrupts first and be stuck sleeping forever...

Once I do fall asleep, I'll likely be asleep all weekend. Darn. WOW (waste of weekend).

Whandall:
Accesses to entities shared with ISRs (bigger than 1 byte (like ints))
have to be guarded by disabling interrupts before and enabling after the read/write.

Thank you for your reply. Now that I know what's going on, I understand what you wrote. I didn't get it at first.

(at least I clicked a karma++ for you!) :slight_smile:

Thank you.

Sometimes its hard for me to make myself understand, sorry.
The link to the Atmel documentation had the full explanation and an example.

Whandall:
Thank you.

Sometimes its hard for me to make myself understand, sorry.
The link to the Atmel documentation had the full explanation and an example.

Atmel's docs are not written all that well (IMO). They suffer from the typical "explained by a person who already understands it" problem.

Lots of times a short example or an analogy goes much further than hard facts.

For example, I explain bit shifting by using an analogy of a wood board with holes in it, filled with pencils. A pencil in a hole is a "1" and an empty hole is a "0". Then, rather than trying to explain the difference between "shifting" and "rotating" (along with the "huh? what?" factor of "borrow or carry through the carry bit in the processor status register), I simply take the pencils and move them all left or right and drop the ones that no longer have a hole on the floor (the "bit bucket"). It's visually obvious that shifting enough times in either direction results eventually in zero.

But bit ROTATION - I take the end pencil and place it in the vacated hole on the other end - again it's visually obvious that if you rotate an 8 bit register 8 times, you end up with the same number.

I've always hated school and college teaching methods... the professor throws gibberish up on the board, we all scramble to copy it to our notebooks, then he asks "any questions"? Um... I don't even get it well enough to HAVE a question to ask!

So, I compensate by trying to teach the way I would want to be taught... to UNDERSTAND it, not just to MEMORIZE it.

Sorry for the ramble. I STILL haven't gotten to sleep and I feel like a zombie. Oh boy when I finally crash it's not going to be fun.

Anyway, thanks for your help and input. 95% of my problem of not "getting it" was that I'm dead tired.

Thanks again!

RayLivingston:
Regards,
Ray L.

By the way, since I need to test if "count" is zero in several different places, I ended up making a "busywait" function. It's clean, simple and solid a a rock. Hasn't glitched once (of course it won't but I'm still keeping an eye on it).

// wait for ISR to finish (atomic 16 bit read)
void busy_wait (void)
{
    uint16_t tmp;
    do {
        cli ();
        tmp = count;
        sei ();
    } while (tmp);
}

Nice, clean and simple, and it exits with interrupts enabled (which is how I needed it).

Isn't it best to save the current register setting then return it to that state afterwards?

uint8_t oldSREG = SREG;
cli();
. . .
SREG = oldSREG;

LarryD:
Isn't it best to save the current register setting then return it to that state afterwards?

uint8_t oldSREG = SREG;
cli();
. . .
SREG = oldSREG;

Interesting thought.. However, looking at the bits in SREG, I don't see any that I need to care about ( I think).

Bit 7 – I: Global Interrupt Enable (this is the one I'm setting and clearing)
Bit 6 – T: Bit Copy Storage (not doing any bit load / bit stores - don't care about this one) (oh wait... does stuff like "PORTB |= _BV (0)" count as a "bit store"?)
Bit 5 – H: Half Carry Flag (not doing any BCD - don't care).
Bit 4 – S: Sign Bit (everything in my code is unsigned - don't care)
Bit 3 – V: Two’s Complement Overflow Flag (Not doing any math other than compares, increments and decrements - don't care).
Bit 2 – N: Negative Flag (branch instructions use this - don't care)
Bit 1 – Z: Zero Flag (ditto)
Bit 0 – C: Carry Flag (rotates and larger-than-8-bit operations use this - I may care about this one)

I guess, all in all... why not preserve SREG? A few extra clock cycles doesn't mean anything since all the code is timed by a timer interrupt. As long as I don't take longer than the interrupt interval, it doesn't matter at all.

Or I could be completely wrong...........

LarryD:
Isn't it best to save the current register setting then return it to that state afterwards?

uint8_t oldSREG = SREG;
cli();
. . .
SREG = oldSREG;

I thought about this... why should I preserve SREG? All I'm doing is setting or clearing the I bit (bit 7).

In fact, if something changes the OTHER bits, using a shadow copy of SREG will blow those changes away.

Right? (Someone - anyone - chime in here?)

If you wrote code that could be called from a variety of other places, some of which might have interrupts disabled already, then you'd want to preserve the interrupt flag and not simply set it after you do your operation. For example, if you were to call your routine from the ISR you probably wouldn't want it to reenable interrupts.

In your case it's very unlikely you'll ever do this so you can safely skip the saving of SREG.