ineffecient to access arrays out of sequence

Basically this is just a little observation I made when accessing arrays out of sequence.

I had this inside an ISR, and it seriously affected performance to the worse. I found it pretty strange as this code is just a little bit of the ISR. I tried to have array element 2 at the top (not that it was important at all of course).

  for ( int x=0 ; x<4 ; x++ )
  {
    for ( int y=0 ; y<4 ; y++ )
    {
      int cx = x+1;
      int cy = y+1;
      int fivecy = 4 - y; // 5-cy
      face[2][x][y] = cube[cx][fivecy][0]; // Top (face 2)
      face[0][x][y] = cube[5][cx][cy];
      face[1][x][y] = cube[cx][0][cy];
      face[3][x][y] = cube[0][cx][fivecy];
      face[4][x][y] = cube[fivecy][5][5-cx];
      face[5][x][y] = cube[cx][cy][5]; // bottom (face 5)
    }
  }

Putting it in sequence made it faster again. I thought it wouldn’t matter, but I guess the compiler does some optimizations when arrays are accessed in sequence?

I have since optimized my ISR to ~40% latency, but these two versions (array in sequence vs out-of-sequence), had a latency of 53 timer2 clock cycles (86% latency) and (seemingly) 56 clock cycles (should have been about 89%, although the program reported 94% I think it was, not sure why). Pretty much at the limit of missing an ISR while inside the ISR for a timer2 timeout of 63 clock cycles (every 1 ms, or 1000 Hz, timer load value of 193 (=256-63)).

However the last figure of 56 clock cycles probably wrapped around several times. Timing 1000 loops of the main loop took about 9 seconds with the array in sequence, while the other took 1 min. 18 seconds. So it had a real big performance hit, more than I’d think a difference of about 14% execution time for the main loop vs. 11% (or even 4%).

At least from what I get. I followed this excellent tutorial on Arduino timer interrupts btw: http://www.uchobby.com/index.php/2007/11/24/arduino-interrupts/ in case others are interested.

I thought it wouldn't matter, but I guess the compiler does some optimizations when arrays are accessed in sequence?

Well, I guess the 3-D array plane/row calculations involve multiplication if the accesses are out of numerical order, but in-order would allow the compiler to optimise and use simple additions of constants. But that's just a guess.

The linked tutorial on timer interrupts is excellent. Thank you for posting it.

One issue that adds to the "misery" is that you use 16-bit loop/index variables. On an 8-bit RISC mcu it makes a lot of sense to always use byte (8-bits) for variables (or constants) in the 0-255 range. The ATmega do not have native instructions for multiplication/division and so this is compiled to use add/subtract/shift instructions in a loop. Every excess bit will typically force an additional pass through the loop and so add to execution time.

Another typical example of the 8/16-bit abuse is the blinking Led sketch:

  int ledPin = 13;

The use of “int” suggests (to me) that the ATmega has 65536 (or at least more than 256) I/O pins. As you know it doesn't (nor do the mega). Also “int” is a signed quantity. Does this mean that if we use negative pin numbers (e.g. -13) that the behavior will be different? It will not (a non-existing pin will be referenced), but signed quantities need additional processing for sign extension and it affects performance negatively.

Also the use of “int” suggests to the compiler that ledPin is a variable. For a ledPin it is highly unlikely that you will reassign (swap pins) while your sketch is running. So rather the following definition should be used:

  const byte ledPin = 13;
  // or alternatively
  //#define ledPin (byte)13;

Using the “const” prefix will allow the compiler to better optimize your code and it will also detect and report errors if you accidentally change the “variable” in your code when no change was intended.

Unfortunately this 8/16-bit abuse is all over the Playground as well as in the core and libraries supplied with the Arduino.

@BenF

I think you're missing a few =, there.

int ledPin [glow]=[/glow] 13;

@BenF

I think the usual explanation is that artists just want to get their kinetic sculpture moving, they don’t want to have to learn the difference between 8 and 16 bit numbers, signed and unsigned etc. You just tell them that a whole number is an int(eger) and let them get on with it.

Whether that explanation satifies you or not possibly depends on what you think the Arduino is for.

Andrew

BenF:

One issue that adds to the “misery” is that you use 16-bit loop/index variables.

Thanks for the input. I’ve been wondering a little about that. So is it more effective to use an 8-bit variable as an index unless the array is bigger than 256 elements? And also for loop variables?

How about (in a for loop) signed vs unsigned variables? I guess one issue when using unsigned variables, you cant check for when it goes “below” (wraps) zero?

Also, I noticed the following takes slightly longer execution time:

  for (int x=3 ; x>=0 ; x--) {some code}

than

  for (int x=0 ; x<4 ; x++) {some code}

Also the use of [ch8220]int[ch8221] suggests to the compiler that ledPin is a variable.

Yes, I agree, I use the “const int ledPin” variety myself. I’m not sure how the Arduino libraries convert/access pins that are defined with an 8-bit variable vs 16-bit is though, would it be more effective to have a “const unsigned byte ledPin”?

I agree with all the notes on optimizations that BenF and others have raised.

I do want to put a little caution though... don't go way out of your way to find every tiny optimization like this until the algorithm basically works. Once the algorithm works, then attack the bottleneck or worst-performing part(s) first. A bottleneck is defined as a problem that affects throughput in a significant way.

By all means, if writing a fast timer interrupt handler, you will need to wring every last cycle out of the processor so you don't bring everything else to a crawl or overrun the interrupts.

However, replacing 'int' with 'byte' everywhere doesn't just save memory, it adds some amount of complexity, especially when you accidentally cause logic errors by shaving necessary range or sign bits. It's usually better to look for actual causes of performance degradation and attack those. Fixing a sort routine to use a better algorithm can shave a lot more time than turning indices into register byte variables. And spending time micro-optimizing bad algorithms is lost when you replace the whole thing with a better algorithm.

Again in brief, I agree with the best use of the hardware architecture in your code, but I caution, don't fall victim to "premature optimization."

Alright - it seems obvious that you are doing some sort of 3D work with the Arduino (unless I am just reading too much into things) - can you tell us more about the project? Sounds interesting...

cr0sh2:

I will make a more complete posting in this forum when I'm more ready. Not sure when that will be though, still some programming left to do. But yeah you're right :)