Strange Error: Program takes the wrong branch!

Hello everyone,

I've been using Arduinos for quite a while now, but now stumled upon a very strange behaviour.
I'm using Arduino Uno or Nano (choose one...), to originally implement a control for a robot chassis with 2 drives (this explains the naming of "cnt_l" and "cnt_r" as being the left and right motor, originally.
Now, for demonstration, I replaced the external interrupts by just a timer interrupt every 2 msec's which should sufficiently represent the original interrupt frequency.
Now the interesting part is obvusly the "pulses" variable. This will be originally have values around ~1030 in the original application. By choice, I've set it to 1028 here.
The rest of the sketch should be quite easy to read; there are many Serial.println's for analysis, an ISR to just increment cnt_l and cnt_r, and some evaluation of cnt_l and cnt_r against "pulses".
Now this is the point where the "magic" happens: with "pulses" being 1028, you should see the board running into a wrong path after 4 loops (or so... can possibly vary?), as the "greater or equal" compare of "cnt_l" (being 1024, interestingly!) against "pulses" (being "1028" or whatever you enetered) gives a wrong "True" !!

Here is the code:

#include <MsTimer2.h>

// these are the questionable variables:
volatile unsigned int cnt_l=0, cnt_r=0, pulses=1028;

//volatile unsigned long cnt_l=0, cnt_r=0;

// these are for debugging only:
bool once[3] = {false, false, false};
unsigned int rot_cnt=0;

void setup() {
  // put your setup code here, to run once:
Serial.begin(115200);
pinMode(0,INPUT_PULLUP);
pinMode(12,OUTPUT);
pinMode(13,OUTPUT);

// for testing - also happens the same with external interrupts:
MsTimer2::set(2,countOnTimer);
}

void loop() {

  static int loopcnt=0;
  static boolean started=false;
  
  Serial.print("** Loop ");
  Serial.print(++loopcnt);
  Serial.println(" : **");
  Serial.print("pulses=");
  Serial.println(pulses);
  MsTimer2::start();
  started = true;

  while ( started ) {
    // this is where fault may happen -- variables cnt_l and cnt_r are incremented by interrupts: ...
    if ( cnt_l >= pulses ) {
      // it jumps in here at cnt_l=1024 instead of cnt_l >=1083 !!! 
      // sometimes also jumps in here at cnt_l=256 !!!
     if (! (once[0]) ) {
      Serial.print("LEFT Stop - cnt_l: ");
      Serial.println(cnt_l);
      Serial.print("- Req'd Pulses: ");
      Serial.println(pulses);
      Serial.println();
      once[0]=true;
      }
      if ( (cnt_l == 1024) || (cnt_l == 256) ) {
        Serial.println("FAIL!");
        Serial.println(cnt_l);
        while (true) {
        delay(1);
        }
      }
    }
    // or also here fault may happen: ...
    if (cnt_r >= pulses) {
      // it jumps in here at cnt_r=1024 instead of cnt_r >=1083 !!! 
      // sometimes also jumps in here at cnt_r=256 !!!
      if (! (once[1]) ) {
      Serial.print("RIGHT Stop - cnt_r: ");
      Serial.println(cnt_r);
      Serial.print("- Req'd Pulses: ");
      Serial.println(pulses);
      Serial.println();
      once[1]=true;
      }
      if ( (cnt_r == 1024) || (cnt_r ==256) ) {
        Serial.println("FAIL!");
        Serial.println(cnt_r);
        while (true) {
          delay(1);
        }
      }
     }
     if ( (cnt_l >= pulses) && (cnt_r >= pulses) ) {
      // possibly also here the fault may happen, but should be covered by the upper 2 appearances already: ...
      if (!(once[2]) ){
      Serial.println("LEFT+RIGHT Stop ");
      Serial.print("- cnt_r: ");
      Serial.println(cnt_r);
      Serial.print("- cnt_l: ");
      Serial.println(cnt_l);
      Serial.print("- Req'd Pulses: ");
      Serial.println(pulses);
      Serial.println();
      }
      digitalWrite(13,0);
      digitalWrite(12,0);

      if ( (cnt_r == 1024) || (cnt_r ==256) ) {
        Serial.println("FAIL!");
        Serial.println(cnt_r);
        while (true) {
          delay(1);
        }
      }
      if ( (cnt_l == 1024) || (cnt_l == 256) ) {
        Serial.println("FAIL!");
        Serial.println(cnt_l);
        while (true) {
        delay(1);
        }
      }
       Serial.print(++rot_cnt);
       Serial.println(" - Rotation STOP!");
       Serial.print("cnt_l=");
       Serial.println(cnt_l);
       Serial.print("cnt_r=");
       Serial.println(cnt_r);
       Serial.print("pulses=");
       Serial.println(pulses);
       
       // let motors run out:
       delay(300);
       Serial.println("Go on ...");
       MsTimer2::stop();
       cnt_l=0;
       cnt_r=0;
       started=false;
       once[0]=false;
       once[1]=false;
       once[2]=false;
     }
  }

Serial.println("END OF TURN");

digitalWrite(12,0);
digitalWrite(13,0);
delay(500);
}

void countOnTimer() {
  cnt_l++;
  cnt_r++;
}

Here is the serial output I see:

** Loop 1 : **
pulses=1028
LEFT Stop - cnt_l: 1028

  • Req'd Pulses: 1028

RIGHT Stop - cnt_r: 1028

  • Req'd Pulses: 1028

LEFT+RIGHT Stop

  • cnt_r: 1030
  • cnt_l: 1031
  • Req'd Pulses: 1028

1 - Rotation STOP!
cnt_l=1033
cnt_r=1034
pulses=1028
Go on ...
END OF TURN
** Loop 2 : **
pulses=1028
LEFT Stop - cnt_l: 1028

  • Req'd Pulses: 1028

RIGHT Stop - cnt_r: 1028

  • Req'd Pulses: 1028

LEFT+RIGHT Stop

  • cnt_r: 1030
  • cnt_l: 1031
  • Req'd Pulses: 1028

2 - Rotation STOP!
cnt_l=1033
cnt_r=1034
pulses=1028
Go on ...
END OF TURN
** Loop 3 : **
pulses=1028
RIGHT Stop - cnt_r: 1028

  • Req'd Pulses: 1028

LEFT+RIGHT Stop

  • cnt_r: 1028
  • cnt_l: 1029
  • Req'd Pulses: 1028

3 - Rotation STOP!
cnt_l=1031
cnt_r=1032
pulses=1028
Go on ...
END OF TURN
** Loop 4 : **
pulses=1028
LEFT Stop - cnt_l: 1024

  • Req'd Pulses: 1028

FAIL!
1024

  • You can see, it runs for 3 turns correctly, and fails in round 4.
  • Only changing "pulses" to "1029" instaed, will lead to abortion after 16 rounds.
    Partial output here:

...
15 - Rotation STOP!
cnt_l=1034
cnt_r=1035
pulses=1029
Go on ...
END OF TURN
** Loop 16 : **
pulses=1029
LEFT Stop - cnt_l: 1024

  • Req'd Pulses: 1029

FAIL!
1024

Please feel free to try other values of "pulses".

This behaviour is a total mystery to me - the common thing is, it always sees the respective counter at 1024!!! In the real application, this leads to the motor stopping prematurely.
I don't think the program has a logical error, since it runs for a varying number of turns.
I can also exclude it is HW (power supply) related, since it happens as well on an Arduino Uno, as on a bare-naked Arduino Nano.
More, I tried to put "pulses" definition in different places (local to "loop", or global), and also set the variables to different types, as well as making them "volatile" or not: No general change, it just changes the number of cycles after which the behaviour appears.

  • In the real application, I also observed drop-outs at the counter value "256" . Thus the related "if" decision. In this test, I haven't seen faults with "256" counts occurring however.

I think it must be related to Atmega 328's internals of Interrupt handling, or maybe to something in the Ardino (1.8.x) IDE .
To me, it looks a bit as if the "Equal" bit in the condition register would toggle when too many bits somewhere else are toggling at once (sorry for the possibly inaccurate wording)...

Any help would be greatly appreciated - this haunts me for days and weeks now...

Kind regards

Ansgar

An int is two bytes so you should temporarily disable interrupts while reading the values of cnt_l and cnt_l because the interrupt might update part of the value while you are trying to retrieve it.

Something like this

void countOnTimer() {
  cnt_l++;
  cnt_r++;
  newValue = true;
}

and, in loop()

if (newValue == true) {
  noInterrupts();
    latestCntL = cnt_l;
    latestCntR = cnt_r;
    newValue = false;
  interrupts();
}

then use the values in latestCntL etc in your calculations

I prefer not to set the counter values back to zero - just allow them to roll over. If you use subtraction - for example latestCntL - previousCntL you will get the correct result even when the value rolls over.

...R

Hello Robin,

OK, thanks so much, that does the trick in the first place!

Some more calculations and measurements to share:
I've measured the time: it consumes some ~4 microseconds to disable, copy the variables as you mentioned, and re-enable interrupts.
Then, my whole example "while" loop runs ~660 times before one timer count happens at this rate. So even if the real interrupt rate would go up to 2x 15 kHz at full motor speed which is then a total factor of 60 (now we have ~500 Hz), we still would have 10-11 "while" loop runs between single counts. And likely, I won't lose a count even then, since from one falling edge to the next, at 15kHz it is 66 microseconds which is well enough to pass through at least one "while" loop run.
But one must keep track of it, as soon as the unit gets more work to do than just spinning the wheels and counting. There will be distance sensors polling, ground tracking, ... you name it.

I'll consider the other hint concerning zeroing / not zeroing the counters... saves a little bit of time, if not resetting them, but on the other hand makes bugtracking a little easier if you have "absolute" values.

One side question remains to be cleared: what happens if acually having 2 interrupt( line)s set up?
Will they interfere? What's the best way to keep them apart? I mean, what if the 2nd interrupt would in worst case fall consistently into the time window of handling the 1st (disable/copy/re-enable)? Then I would lose counts...

best regards

Ansgar

stargar:
Then, my whole example "while" loop runs ~660 times before one timer count happens at this rate. So even if the real interrupt rate would go up to 2x 15 kHz at full motor speed which is then a total factor of 60 (now we have ~500 Hz), we still would have 10-11 "while" loop runs between single counts. And likely, I won't lose a count even then, since from one falling edge to the next, at 15kHz it is 66 microseconds which is well enough to pass through at least one "while" loop run.
But one must keep track of it, as soon as the unit gets more work to do than just spinning the wheels and counting. There will be distance sensors polling, ground tracking, ... you name it.

You have lost me with that. I'm not sure if you are saying there is plenty of time or not enough time.

A variation on my suggested code is to have the ISR only set newValue = true after 8 or 16 pulses.

I don't think you have said what is generating the pulses or how many pulses there are per revolution.

...R

Hello Robin,

Sorry I was kind of thinking out loud... I meant to say, normally there is well enough time to get around, even at very high speeds of the vehicle. I think I have actually measured frequencies up to 12kHz coming in from each of the motors' incremental encoders at full speed, so 15kHz estimate was with a bit of safety margin.

The only thing that's now left to check is how it will behave with 2 external interrupt channels. I don't like the thought of losing some interrupts while working away the others...

BR

stargar

stargar:
The only thing that's now left to check is how it will behave with 2 external interrupt channels. I don't like the thought of losing some interrupts while working away the others...

An ISR should never consume so much time that other interrupts can not be serviced in a reasonable time. There is almost never any time when that becomes an issue, with proper design.

Hello Robin,

well... but now just as an imaginated experiment, think of 2 ISRs set up on 2 Interrupt lines, running in perfect sync, while interrupt No. 2 would always be just a couple of microseconds behind No. 1, at ever the same distance in time.
The first interrupt gets triggered, you find out it happened, then switch off interrupts with noInterrupts() and grab the new counter value. In this time, before re-enabling the interrupt, the 2nd line falls to LOW. Then the interrupts are re-enabled, and the 2nd trigger event has not been seen... and because always at the same distance behind Interrupt No.1, it would just never be seen.

Or would Interrupt line No.2 then trigger an event while just re-enabling interrupts, if the line is still held LOW and interrupt type is set to FALLING?
Then that would be perfectly fine, because if in real life Interrupt No.2 would arrive just during the short time Interrupt No.1 is being handled (and IRQs are OFF), it would in any case stay LOW significantly longer than the "off" period, and would then be seen right after. Going to check this on a piece of HW...

(Side note: I think I can't use interrupt type LOW, since the LOW time is much too long, and a lot of interrupts would be generated then.)

BR

stargar

The event that causes the interrupt will be recognized even while interrupts are disabled and it will be acted on as soon as they are re-enabled. However if two events on the same pin occur while interrupts are disabled the second event will be lost.

...R

Hello Robin,

that sounds good. I think it is solved by that.
However, what I learned additionally is the matter with the "atomic operations" within the ISRs, at least for variables that are used also outside the ISR. Atomic operations can only be done on 8-bit variable types, so an "add" or "increment" can only be used on byte-sized variables reasonably in ISRs, not on "integer"s or "long"s.

I'll redo my code in this sense, using just a boolean value as a flag (also byte-size...) from the ISR, and do all the counting in the regular subroutine then.
Possibly with using only atomic operations, one doesn't even need to disable interrupts.
Case closed!

Thanks and best regards

stargar

stargar:
However, what I learned additionally is the matter with the "atomic operations" within the ISRs, at least for variables that are used also outside the ISR. Atomic operations can only be done on 8-bit variable types, so an "add" or "increment" can only be used on byte-sized variables reasonably in ISRs, not on "integer"s or "long"s.

I'll redo my code in this sense, using just a boolean [...]

There is no need. Accessing non-atomic variables from an ISR is a common and often necessary practice. All you have to do is disable interrupts in the main code for the short time that the variable is copied.

stargar:
One side question remains to be cleared: what happens if acually having 2 interrupt( line)s set up?
Will they interfere? What's the best way to keep them apart? I mean, what if the 2nd interrupt would in worst case fall consistently into the time window of handling the 1st (disable/copy/re-enable)? Then I would lose counts...

AFAIK: if you have two separate things on two separate interrupt lines, and it might be possible for both interrupts to happen effectively simultaneously, then yes it's possible that the second one might be missed. That is: glitches are possible. It's usually the case that when we are dealing with hardware, we are dealing with an imperfect world and have to code for it.

The usual method is to make your interrupt handlers as small and fast as possible. This is why the only thing you generally do in an interrupt handler is increment some variables and set a one-byte flag (so that reading it is atomic).

If the code absolutely, positively must not glitch, then I'd guess that the only solution involves adding some additional electronics - some sort of latching gate setup that the interrupt handler reads. But even then, it's might still be possible for the system to glitch - it's just that the 'window' can be reduced from microseconds down to nanoseconds, maybe.

There's no such thing as perfect, there is only mean time between failure.

-- EDIT --

Does the AVR do some sort of latching on interrupts? That is, is it the case that should an interrupt occur (ie: the line goes from LOW to HIGH for a RISING interrupt) during a nointerrupts block, it will trigger an interrupt as soon as interrupts are turned back on? In that case, provided that interrupts don't occur any faster than the interrupt code can handle them, it might be possible to become certain that the system will work reliably.

Generally, interrupts from events are latched. This is true on almost every processor I've ever seen. If both interrupt pins are tied together for example, both ISR will be called, one after the other.

stargar:
However, what I learned additionally is the matter with the "atomic operations" within the ISRs, at least for variables that are used also outside the ISR.

Isn't that what got us started? It is why I used noInterrupts() in the code in Reply #1. I thought I had covered the point in my first sentence. :slight_smile:

...R

To have safe access to state in the presence of interrupts you need a:

aarg:
Generally, interrupts from events are latched. This is true on almost every processor I've ever seen. If both interrupt pins are tied together for example, both ISR will be called, one after the other.

Makes sense, for exactly this reason.

Some types of interrupts are latched. Others aren't. Specifically, if you configure your External Interrupt to fire level-based instead of edge-based, it isn't latched.

to all who replied here:

many thanks - this was very comprehensive, and hopefully may serve as a valuable reference for others who need info on interrupts and their behavior.
Case closed :slight_smile:

Best regards

stargar