Why are global variables slowing down my loop?

Hi everybody,
I’m porting an old PIC sketch of mine to an Arduino Mega and I’m having a very strange problem.
The sketch needs loop() to run in less than 1 millisecond, for robotics purposes.

The problem is that the loop, unlike on the PIC, takes about 6000 microseconds. I can assure you that the old PIC did it in approx. 300 microseconds, and it even had a slower clock.

The problem lies in global variables. If I declare global variables and modify them from the loop, the loop slows down and takes more than 6000 microseconds.
If I declare local variables in the loop and modify them, the total running time of the loop drops to 300 microseconds…
The trouble is that I need at least one global variable, because a later part of the sketch has an SPI interrupt that must send that variable.

To be clearer: if I declare “counter[20]” inside the loop I reach the goal of staying below one millisecond, but if I declare it before the loop (globally), the loop takes almost 6 milliseconds. The same thing happens with the variable “counter0”.
How is this even possible? I need “counter0” to be both global and volatile!

#Edit 1:
I deleted every Serial command and tried driving a pin high and low to measure the loop time with an oscilloscope. The loop still lasts WAY LONGER if it uses global variables. The problem is not Serial.

#Edit 2:
Posted photos of the println result in the 2 different cases. In the screenshots you can see where the variables are declared, and the serial monitor output of the one and only Serial.println in the sketch, which shows the duration of the loop in MICROSECONDS.
I don’t need to complete the code, since right now I am only using the piece that I’ve posted here.

#define PC7 (PINC & 128) //PC7
#define PC6 (PINC & 64) //PC6
#define PC5 (PINC & 32)//PC5
#define PC4 (PINC & 16) //PC4
#define PC3 (PINC & 8) //PC3
#define PC2 (PINC & 4) //PC2
#define PC1 (PINC & 2) //PC1
#define PC0 (PINC & 1) //PC0

void setup() {
  Serial.begin(9600);
DDRC = 0;
}

//DECLARE THEM HERE AND GET 6000+ MICROSECONDS
  byte counter[20] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
  byte counter0;


void loop() {
//DECLARE THEM HERE AND GET 300 MICROSECONDS
//byte counter[20] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
//byte counter0;


  long t1 = micros();
  for (byte i = 0; i <= 250; i++) {
    counter[0] += !PC7;
    counter[1] += !PC6;
    counter[2] += !PC5;
    counter[3] += !PC4;
    counter[4] += !PC3;
    counter[5] += !PC2;
    counter[6] += !PC1;
    counter[7] += !PC0;
    counter[8] += !PC7;
    counter[9] += !PC6;
    counter[10] +=!PC5;
    counter[11] += !PC4;
    counter[12] += !PC3;
    counter[13] += !PC2;
    counter[14] += !PC1;
    counter[15] += !PC0;
    counter[16] += !PC7;
    counter[17] += !PC6;
    counter[18] += !PC5;
    counter[19] += !PC4;
  }
  counter0 = counter[0];
  long t2 = micros();
  Serial.println(t2 - t1);

}

Please help me!

This is not a good way to measure the time

Serial.println(micros() - t1);

Change it to

t2 = micros();
Serial.println(t2 - t1);

and see what happens
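
A related detail when timing with micros(): it returns an unsigned long that wraps around after roughly 70 minutes, so the timestamps are better kept in unsigned variables than in long. Here is a minimal sketch of the subtraction, written as plain C++ and assuming a 32-bit counter as on AVR boards. elapsedMicros and checkWraparound are made-up names for illustration, not Arduino functions:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: elapsed time between two micros()-style timestamps.
// Unsigned subtraction works modulo 2^32, so the result stays correct even
// if the counter wrapped around once between 'start' and 'end'.
uint32_t elapsedMicros(uint32_t start, uint32_t end) {
    return end - start;
}

// Small self-check showing the wraparound case.
void checkWraparound() {
    // Counter was 16 ticks before overflow, then 16 ticks after it.
    assert(elapsedMicros(0xFFFFFFF0u, 0x00000010u) == 32u);
}
```

In the sketch this would mean declaring t1 and t2 as unsigned long before subtracting them.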

...R

Just tried it, and this isn't the solution... I can't even declare long t1 and long t2 outside the loop. I have to declare them in the loop to avoid 5 milliseconds of slowdown!

What if you try?  Serial.begin(115200);

AWOL: What if you try?  Serial.begin(115200);

Nope, I deleted every Serial command and drove a pin high and low to measure the loop time with an oscilloscope. The problem is not Serial. The problem is in the variable declaration.

basshunter: The problem is in the variable declaration.

While it may appear to you to be the case, that is just nonsense. You're missing something and you aren't showing us enough that we might see it.

In the example you provided above, in addition to the issue Robin pointed out, it may be the case that with the local variable version the compiler is simply removing all of that code since it doesn't see you using any of those values.

Why don't you provide a more complete example that shows what is really going on? Show us some output that is leading you to believe that your code is running slower.

Stack-pointer relative addressing vs absolute addressing on some architectures could account for a fraction of the discrepancy. I'd be inclined to put the benchmark code in a function with the array pointer as an argument.
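
Something along these lines is what that could look like. This is only a rough sketch, written as plain C++ so it also compiles on a PC: fakePinC is a made-up stand-in for the real PINC register, and sampleInto is a hypothetical name. The bit order follows the posted sketch (element 0 reads PC7, element 1 reads PC6, and so on, repeating every 8 elements):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the PINC input register (volatile, like the real one).
volatile uint8_t fakePinC = 0;

// Benchmark body as a function: the array is reached through a pointer,
// so the same code runs whether the caller's array is global or local.
void sampleInto(uint8_t *counter, size_t length, uint8_t passes) {
    assert(counter != nullptr);
    for (uint8_t pass = 0; pass < passes; pass++) {
        for (size_t k = 0; k < length; k++) {
            // counter[0] uses bit 7, counter[1] bit 6, ... counter[7] bit 0, then repeat.
            const uint8_t mask = static_cast<uint8_t>(0x80u >> (k % 8));
            counter[k] += !(fakePinC & mask);  // count samples where the pin reads low
        }
    }
}
```

Calling it once with a global array and once with a local one should exercise the same instructions in both cases, provided the compiler does not inline and specialize the call.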

You're iterating over a 20-element array 250 times, so you're doing that 5000 times.

Each one is... at least 3 instructions, for 15000 instructions minimum. That's just under 1 ms on a 16 MHz processor.

Since you report that it comes in at 300us with local variables, that makes me suspect that in the local variable version, it's just getting optimized out...
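
The same estimate as a few lines of code, assuming one clock cycle per instruction (which is optimistic on AVR, where loads and stores to SRAM take two cycles each). estimateMicros is just a name made up for this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Lower-bound estimate of the run time in microseconds, assuming every
// instruction costs exactly one clock cycle (an optimistic assumption).
double estimateMicros(uint32_t passes, uint32_t statementsPerPass,
                      uint32_t instructionsPerStatement, uint32_t cpuHz) {
    assert(cpuHz > 0);
    const double instructions =
        static_cast<double>(passes) * statementsPerPass * instructionsPerStatement;
    return instructions * 1e6 / cpuHz;
}
```

estimateMicros(250, 20, 3, 16000000) works out to 937.5, so a measured 300 us cannot include three instructions for each of those statements.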

DrAzzy: You're iterating over a 20-element array 250 times, so you're doing that 5000 times.

Each one is... at least 3 instructions, for 15000 instructions minimum. That's just under 1 ms on a 16 MHz processor.

Since you report that it comes in at 300us with local variables, that makes me suspect that in the local variable version, it's just getting optimized out...

If what you are saying is true, I have no chance of having a global variable and keeping my loop under 1 millisecond.

You're iterating over a 20-element array 250 times,

sp. "251". ;)

I have no chance of having a global variable and keeping my loop under 1 millisecond

You could use a faster processor and do your own back-of-a-beermat calculations.

It's worse than that. The only reason the version with local variables appears to run faster is that it isn't doing the same work. It can't be: there's not enough time for it to do so. The compiler is optimizing it all out, because the values stored in counter are never used in loop() and, being local, can't affect any other function.

If you made it check the contents of counter (even with a test that wouldn't ever be true) and potentially do something that could impact something outside of loop based on that, that would ensure it couldn't optimize them out... And then the local variable version would be just as slow.
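
A minimal sketch of that idea, written as plain C++ so it can be tried on a PC. fakePort and sink are hypothetical stand-ins (on the Arduino they would be PINC and some real output register or volatile global), and countLowSamples is a made-up name. Storing the result into a volatile variable is an observable side effect, so the compiler is not allowed to drop the counting that produces it:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins: an input register and an externally visible sink.
volatile uint8_t fakePort = 0;
volatile uint8_t sink = 0;

// Counts how many of 'samples' reads see bit 7 low, using a local counter.
uint8_t countLowSamples(uint8_t samples) {
    assert(samples > 0);
    uint8_t counter = 0;  // local, so it could be discarded if never used
    for (uint8_t i = 0; i < samples; i++) {
        counter += !(fakePort & 0x80);
    }
    sink = counter;  // volatile store: the counting above must now really happen
    return counter;
}
```

Without the store to sink (and with the return value ignored), an optimizing compiler would be free to keep only the volatile reads and throw the additions away.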

Delta_G: While it may appear to you to be the case, that is just nonsense. You're missing something and you aren't showing us enough that we might see it.

In the example you provided above, in addition to the issue Robin pointed out, it may be the case that with the local variable version the compiler is simply removing all of that code since it doesn't see you using any of those values.

Why don't you provide a more complete example that shows what is really going on. Show us some output that is leading you to believe that your code is running slower.

Take a look at the attached images in the main post

AWOL: Stack-pointer relative addressing vs absolute addressing on some architectures could account for a fraction of the discrepancy. I'd be inclined to put the benchmark code in a function with the array pointer as an argument.

I'm not sure that it will actually work... I tried compiling it with global variables in Atmel Studio. Still no luck, although I got about 3000 microseconds instead of 5000. Still not enough for me...

AWOL: You could use a faster processor and do your own back-of-a-beermat calculations.

LOL man. How can a PIC18F2520 with an 8 MHz clock be faster than this mega2560?

Is it running the same code? Are you certain it's actually executing the code in that loop and not optimizing it out? It would have to be doing those assignments with a single instruction in order to be under 1 ms on an 8 MHz chip!

This is a classic problem with benchmarking - the clever compiler will optimize away the benchmark if it knows the code won't impact anything.

Ok, I think we're firmly in xy territory. Out.

Is there any other way to put a variable into memory without having to declare it globally? Maybe EEPROM?

Yes of course you can put it in EEPROM, but it'll be slower still.

Or you could look at what you want to do, and decide on the best way to do it.

Did you try the test I suggested above? To try actually using the elements of counter array after you’ve iterated over them 250 times?

Try this:

#define PC7 (PINC & 128) //PC7
#define PC6 (PINC & 64) //PC6
#define PC5 (PINC & 32)//PC5
#define PC4 (PINC & 16) //PC4
#define PC3 (PINC & 8) //PC3
#define PC2 (PINC & 4) //PC2
#define PC1 (PINC & 2) //PC1
#define PC0 (PINC & 1) //PC0

void setup() {
  Serial.begin(9600);
DDRC = 0;
}

//DECLARE THEM HERE AND GET 6000+ MICROSECONDS
  byte counter[20] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
  byte counter0;


void loop() {
//DECLARE THEM HERE AND GET 300 MICROSECONDS
//byte counter[20] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
//byte counter0;


  long t1 = micros();
  for (byte i = 0; i <= 250; i++) {
    counter[0] += !PC7;
    counter[1] += !PC6;
    counter[2] += !PC5;
    counter[3] += !PC4;
    counter[4] += !PC3;
    counter[5] += !PC2;
    counter[6] += !PC1;
    counter[7] += !PC0;
    counter[8] += !PC7;
    counter[9] += !PC6;
    counter[10] +=!PC5;
    counter[11] += !PC4;
    counter[12] += !PC3;
    counter[13] += !PC2;
    counter[14] += !PC1;
    counter[15] += !PC0;
    counter[16] += !PC7;
    counter[17] += !PC6;
    counter[18] += !PC5;
    counter[19] += !PC4;
  }
  //counter0 = counter[0];
  PINA = counter[0]; // This is externally visible, so it should ensure that the compiler can't optimize out all the work in the loop above.

  long t2 = micros();
  Serial.println(t2 - t1);

}

With that code, I suspect you’ll see very similar times regardless of where you declare the variable (and it will be slow in both places). If this is the case, it proves that the difference is not where the variables are declared, but rather whether that 251-iteration loop is happening at all or being removed by the compiler during optimization.

If you try doing something analogous on the PIC (i.e., write one pin based on the values left in counter after the 251-iteration loop), you’ll probably see a dramatic change in the execution time on the PIC as well.

Do it locally, but then only that function can see it.

How you declare the variables may be slowing things down too.

Post your code (use the code tag button </>), let us see.

I personally make everything global, so any part of a sketch can access anything.
Once it’s declared before setup(), it doesn’t need to be declared again, and if 2 or 4 bytes or whatever are needed anyway on each pass through loop, why declare it every time?
Hardware curmudgeon hat there.

You don’t want real variables in EEPROM. A write takes 3.3 ms, and the EEPROM only survives a limited number of writes over its lifetime.

DrAzzy:
Did you try the test I suggested above? To try actually using the elements of counter array after you’ve iterated over them 250 times?


PINA is a read-only register; I don’t get what you are doing. Anyway, I tried your code and I still get the same result: fast with local variables, slow with global ones.