modulo - unexpected behaviour

While debugging another program, I found that some divisors make the modulo operator much slower than others, so I wrote a small test sketch to try to understand the pattern. Now I’m more confused.

Can anyone explain this one? Particularly, why do some of the time deltas appear to be negative? And why would modulo 4 take 3x longer than modulo 3?

Sketch:

void setup()
{
  TCCR4B = B00000001;
  Serial.begin(115200);

  byte SeqIndex = 0;

  for (int i = 1; i < 10; i++) {
    noInterrupts();
    byte temp = SeqIndex;
    unsigned int a = TCNT4;
    SeqIndex = (SeqIndex + 1) % i;
    unsigned int b = TCNT4;
    interrupts();
    Serial.print(SeqIndex, DEC);
    Serial.print(" : ");
    Serial.print(temp, DEC);
    Serial.print(" : ");
    Serial.print(i, DEC);
    Serial.print(" : ");
    Serial.println(b-a, DEC);
  }
}

void loop()
{
}

Output:

0 : 0 : 1 : 49
1 : 0 : 2 : 65344
2 : 1 : 3 : 42
3 : 2 : 4 : 130
4 : 3 : 5 : 65346
5 : 4 : 6 : 168
6 : 5 : 7 : 65496
7 : 6 : 8 : 65340
8 : 7 : 9 : 162

I'd be thinking the power-of-two were being optimized to inline bit-twiddling. Have a look at the C-code in the modulo function.

Can anyone explain this one?

I can’t, but I can offer a different perspective. I took your code and modified it for a Teensy (I don’t have an Arduino right now and I’m too lazy to hook up a bare-bones 328). My results are much better…

0 : 0 : 1 : 241
1 : 0 : 2 : 240
2 : 1 : 3 : 240
3 : 2 : 4 : 240
4 : 3 : 5 : 240
5 : 4 : 6 : 240
6 : 5 : 7 : 240
7 : 6 : 8 : 240
8 : 7 : 9 : 240

Particularly, why do some of the time deltas appear to be negative?

My guess is that timer 4 is in a mode where it resets to zero instead of rolling over. Setting TCCR4A will probably fix the problem.
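To illustrate why a delta can look negative: on AVR, unsigned int is 16 bits wide, so b-a wraps modulo 65536. That wrap gives the right answer when the counter simply rolls over between the two reads, but a huge value when the second reading really is smaller than the first (for example because the counter reset or is counting down). A minimal sketch in plain C++, where elapsed_ticks and self_check are hypothetical helpers and uint16_t stands in for AVR's 16-bit unsigned int:

```cpp
#include <cassert>
#include <cstdint>

// Elapsed ticks between two reads of a 16-bit counter.
// The cast back to uint16_t mimics AVR, where the subtraction of two
// unsigned ints is itself a 16-bit unsigned value that wraps at 65536.
uint16_t elapsed_ticks(uint16_t start, uint16_t stop) {
    return static_cast<uint16_t>(stop - start);
}

// Sanity checks that can be called once at start-up.
void self_check() {
    // Free-running counter that rolled over between the reads:
    // 36 ticks up to the roll-over plus 20 after it.
    assert(elapsed_ticks(65500, 20) == 56);
    // Second reading 192 counts below the first with no roll-over
    // in between: the wrapped result looks like a huge delay.
    assert(elapsed_ticks(200, 8) == 65344);
}
```

The 65344 in the original output is 65536 - 192, which is what you would see if the second reading were 192 counts below the first.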

I’d be thinking the power-of-two were being optimized to inline bit-twiddling

That optimization is usually only done if the right-side is a constant.

void setup()
{
  Serial.begin(115200);

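  TCCR3A = B00000000;  // waveform bits cleared: normal mode, counter rolls over at 65535
  TCCR3B = B00000001;  // clock select: system clock with no prescaler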
}

void loop()
{
  Serial.println( "========================================" );
  
  byte SeqIndex = 0;

  for (int i = 1; i < 10; i++) 
  {
    noInterrupts();
    byte temp = SeqIndex;
    unsigned int Start = TCNT3;
    SeqIndex = (SeqIndex + 1) % i;
    unsigned int Stop = TCNT3;
    interrupts();
    
    Serial.print(SeqIndex, DEC);
    Serial.print(" : ");
    Serial.print(temp, DEC);
    Serial.print(" : ");
    Serial.print(i, DEC);
    Serial.print(" : ");
    Serial.println( Stop-Start, DEC );
  }
  delay( 1000 );
}

(Sorry for delay, I had notification off on this thread.)

Thanks for reviewing. Your differing results are certainly more in line with what I would expect. The code I originally noticed the problem in (almost) definitely had Timer4 running in Normal mode. I'll try a few more things based on your findings and post back.

I tried looking for the modulo code too, but couldn't find it. In which file is it located?

Ok, with some more experimenting I’m starting to think that 240 system clock ticks is the normal amount of time to perform a modulo calculation. I suspect all the results that take less time are compiler optimizations for powers of 2, or other binary tricks that are beyond me.
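To put 240 ticks into wall-clock terms, here is a small conversion sketch in plain C++. It assumes a 16 MHz system clock, which is common on AVR boards but is not stated anywhere in this thread; ticks_to_microseconds and kAssumedCpuHz are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a 16 MHz system clock. Pass the real value if the board differs.
const uint32_t kAssumedCpuHz = 16000000UL;

// Converts a timer tick count to microseconds.
// prescaler is the timer clock divider (1 when TCCRnB = B00000001).
double ticks_to_microseconds(uint16_t ticks, uint16_t prescaler = 1,
                             uint32_t cpu_hz = kAssumedCpuHz) {
    assert(prescaler > 0 && cpu_hz > 0);
    return static_cast<double>(ticks) * prescaler * 1.0e6 / cpu_hz;
}
```

Under that assumption, ticks_to_microseconds(240) comes out at 15 microseconds per modulo, and a 7-tick result at well under half a microsecond.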

CB: You are correct, setting TCCRnA fixed the problem. Ironically, I solved this same problem for another forum member just a few days ago (where the default value of TCCRnA was presumed to be 0 but isn’t in Arduino-land).

Here is the latest code I was experimenting with…

void setup()
{
  TCCR3A = 0;
  TCCR3B = B00000001;
  Serial.begin(115200);

  byte SeqIndex = 0;

  for (int i = 1; i < 10; i++)
  {
    noInterrupts();
    unsigned int Start = TCNT3;
    SeqIndex = (SeqIndex + 1) % 3;  // performing modulo 4 instead of modulo 3 reduces Stop-Start to just 7
    unsigned int Stop = TCNT3;
    interrupts();
    Serial.print(SeqIndex, DEC);  // commenting out this line reduces Stop-Start to just 4, even for % 3
    Serial.print(" : ");
    Serial.println( Stop-Start, DEC );  // compiled verbatim as shown here, Stop-Start is 240
  }
}

void loop()
{
}

Note the comments. The time for % 3 vs. % 4 operations was the original case that drew my attention to this. I presume the compiler optimizes for % 4 because it’s a power of 2.
And similar theory about commenting out the Serial.print of SeqIndex; without referencing that variable again it probably gets a fancy optimization.

Anyway, I think I can call this one solved.

I presume the compiler optimizes for % 4 because it’s a power of 2

Exactly. For unsigned values, modulus by a power of two is equivalent to a bitwise AND with that power of two minus one. On a processor that doesn’t have a division instruction, the bitwise AND is considerably faster than modulus (which needs a division routine). But you already know that!
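The equivalence is easy to check by hand. A minimal sketch in plain C++, valid for unsigned values only; is_power_of_two, mod_pow2 and mask_matches_modulo are hypothetical helper names, not anything the compiler or the Arduino core provides:

```cpp
#include <cassert>
#include <cstdint>

// True when d has exactly one bit set (1, 2, 4, 8, ...).
bool is_power_of_two(uint8_t d) {
    return d != 0 && (d & (d - 1)) == 0;
}

// x % d for a power-of-two d, computed with a mask instead of a division.
uint8_t mod_pow2(uint8_t x, uint8_t d) {
    assert(is_power_of_two(d));
    return static_cast<uint8_t>(x & (d - 1));
}

// Compares the mask against the real % operator for every byte value.
bool mask_matches_modulo(uint8_t d) {
    for (int x = 0; x <= 255; ++x) {
        if (mod_pow2(static_cast<uint8_t>(x), d) != x % d) {
            return false;
        }
    }
    return true;
}
```

For d = 4 the mask is 3, so (SeqIndex + 1) % 4 becomes (SeqIndex + 1) & 3. There is no such shortcut for 3, which is why that case falls back to a full division.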

And similar theory about commenting out the Serial.print of SeqIndex; without referencing that variable again it probably gets a fancy optimization

Because the value of SeqIndex is never used, everything associated with SeqIndex is actually removed, including the modulo calculation itself; it’s as if SeqIndex never existed.
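One common way to stop a timed calculation from being removed like this is to route its result through a volatile variable, since the compiler must perform every volatile read and write. A minimal sketch in plain C++, not the code from this thread; g_divisor, g_sink and step_index are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical globals. volatile tells the compiler that each access is
// observable, so it can neither skip the write nor assume the read value.
volatile uint8_t g_divisor = 3;
volatile uint8_t g_sink = 0;

// Advances a sequence index the same way the test sketch does.
uint8_t step_index(uint8_t index) {
    uint8_t d = g_divisor;  // volatile read: divisor is unknown at compile time
    assert(d != 0);
    uint8_t next = static_cast<uint8_t>((index + 1) % d);
    g_sink = next;          // volatile write: the result cannot be discarded
    return next;
}
```

Note that reading the divisor through a volatile also hides the constant from the compiler, so a divisor of 4 would no longer be turned into a mask. Keep the divisor as a literal and only store the result in g_sink if the goal is to time the optimized case.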

Anyway, I think I can call this one solved

That’s always a good feeling!