PROGMEM read time: too slow?

Hi,

I'm running some tests on reading data from a look-up table stored in PROGMEM space. I came across this old topic:

http://forum.arduino.cc/index.php?topic=134782.0

where they state that it normally takes 3 clock cycles to read from flash, compared to 2 cycles when reading from SRAM.

I have the following code, running on an Arduino Nano:

#include <avr/pgmspace.h>
#define LUT_SIZE    10
const byte lut[LUT_SIZE] PROGMEM = {0,1,2,3,4,5,6,7,8,9};
int nIters = 1000;  
unsigned long tStart, tEnd;

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
}

void loop() {
  // put your main code here, to run repeatedly:
 
  tStart = micros();
  for(int i = 0; i < nIters; ++i)
  {
    pgm_read_byte(lut + i%LUT_SIZE);
  }
  tEnd = micros();

  Serial.println((float)(tEnd - tStart)/nIters);
  delay(1000); 
}

What I find is that it takes approximately 14.66 microseconds per access, which is about 235 clock cycles at 16 MHz. That is quite a lot!

Am I doing something wrong here or is this just how it's supposed to be?

Thank you very much in advance!

const byte lut[LUT_SIZE] PROGMEM = {0,1,2,3,4,5,6,7,8,9};

Are you serious? Why do you need a lookup table when lut[ i ] = i?

Tinrik:
Am I doing something wrong here or is this just how it's supposed to be?

I find a time of 0.3 microseconds per access, using this code:

#include <avr/pgmspace.h>
#define LUT_SIZE    10
const byte lut[LUT_SIZE] PROGMEM = {0,1,2,3,4,5,6,7,8,9};
int nIters = 1000/LUT_SIZE;  
unsigned long tStart, tEnd;

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
}

void loop() {
  // put your main code here, to run repeatedly:
 
  tStart = micros();
  for(int i = 0; i < nIters; ++i)
  {
    pgm_read_byte(lut+0);
    pgm_read_byte(lut+1);
    pgm_read_byte(lut+2);
    pgm_read_byte(lut+3);
    pgm_read_byte(lut+4);
    pgm_read_byte(lut+5);
    pgm_read_byte(lut+6);
    pgm_read_byte(lut+7);
    pgm_read_byte(lut+8);
    pgm_read_byte(lut+9);
  }
  tEnd = micros();

  Serial.println((float)(tEnd - tStart)/(nIters*LUT_SIZE));
  delay(1000); 
}

0.3 µs at 16 MHz = 16 * 0.3 = 4.8 clock cycles.

Some of that time is spent (even in my version) on the for-loop itself: incrementing the index and jumping back to the first instruction in the loop.

What you are doing wrong is surely this division:

pgm_read_byte(lut + i%LUT_SIZE);

The operation "i%LUT_SIZE" needs a division to get the remainder, and ATmega controllers do NOT HAVE ANY HARDWARE DIVISION, so the division is done in software, which is VERY COSTLY when counted in clock cycles.

Many operations on Atmega controllers can be done in one clock cycle only, BUT NOT DIVIDING!

So if you want to do fast running code on Atmegas, AVOID DIVIDING!

AVOID DIVIDING!

Except for powers of two.
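The power-of-two exception works because, for an unsigned index, the remainder is just the low bits of the value, so a single AND replaces the division. A minimal sketch, assuming a hypothetical table size of 16 (kLutSizePow2 and wrapIndexPow2 are names invented for this example):

```cpp
#include <stdint.h>

// Hypothetical table size; it must be a power of two for the mask trick.
constexpr uint8_t kLutSizePow2 = 16;

// Equivalent to i % kLutSizePow2 for unsigned i, but needs no division:
// masking with (size - 1) keeps only the low bits of the index.
constexpr uint8_t wrapIndexPow2(uint16_t i) {
    return static_cast<uint8_t>(i & (kLutSizePow2 - 1));
}
```

With an unsigned index and a constant power-of-two size, the compiler will usually make this substitution on its own; writing the mask explicitly just makes the intent visible.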

I agree, you are timing the division there, not the program memory read. The datasheet states that a read from program memory takes one more clock cycle than a read from SRAM (and they should know). At 16 MHz one clock cycle is 62.5 ns. You don't need to write code to prove it. If your code does not give you that result, look at your code. The division by 10 would be an obvious problem area.
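If the table size has to stay at 10, one way to keep cycling through it without the modulo is a second index that wraps with a compare instead of a division. This is only a sketch of that idea, not code from the posts above; kLutSize and nextIndex are names made up for the example.

```cpp
#include <stdint.h>

// Table size from the original sketch (LUT_SIZE).
constexpr uint8_t kLutSize = 10;

// Advance a table index by one and wrap back to 0 at the end of the table.
// A compare and a conditional reset replace the software division of '%'.
inline uint8_t nextIndex(uint8_t idx) {
    ++idx;
    if (idx >= kLutSize) {
        idx = 0;
    }
    return idx;
}
```

In the timing loop this would mean keeping a uint8_t idx next to i, reading with pgm_read_byte(lut + idx), and then doing idx = nextIndex(idx) on every iteration.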

That was it, such a silly mistake! I will now think twice before using the modulo operator.

Thank you very much!