I had to work recently on a time sensitive project where I wanted to make many operations in less than 30 microseconds. Although some timing values can be found here and there in this forum, I thought that it could be useful to have a table showing the time taken to make some operations with an Arduino Uno (or other ATmega328-based Arduino running at 16 MHz). I thus tried to make my own measurements (see below for more details) and I ended up making somewhat detailed tables.
The first table shows the time taken to execute common Arduino commands. I also included few Assembly commands to show the potential speed gain for some operations. Also note that the time required for serial communications depends on the data rate and on whether the serial buffer is full or not.
The second and third tables show the calculation speed of an ATmega328-based Arduino running at 16 MHz. It is important to note that the actual time taken to make an operation depends on whether the variables are already in Register File or if they must be first transferred from RAM. I thus decided to make two tables:
This table assumes that all the variables must be fetched from RAM (worst case scenario):
This table assumes that all the variables are already available in the Register File (best case scenario):
I also attached the tables in pdf format to this message.
Method used to make the tables:
Disclaimer: I have a limited experience with AVR programming and GCC compiler. During my tests, I found that measuring the time taken to run some commands is tricky because the compiler can perform optimizations that can alter timing measurements. It is also difficult to get rid of any overhead and measure precisely the time taken for each operation. So, let me know if you believe that the method I used for timing measurements is wrong for some reason.
Here is the code I used for the timing measurements:
const int nb_of_interation_per_pass = 30000;
const int nb_of_pass = 100;
unsigned long initial_time = 0;
unsigned long final_time = 0;
float duration_dummy_loop = 0;
float duration =0;
float duration_sum = 0;
// All variables used for calculations are declared globally and volatile to minimize
// any possible compiler optimisation when performing the same operation multiple times.
volatile int dummy = 0;
volatile float float_1 = 0;
volatile float float_2 = 0;
volatile float float_3 = 0;
volatile long long_1 = 0;
volatile long long_2 = 0;
volatile long long_3 = 0;
volatile int int_1 = 0;
volatile int int_2 = 0;
volatile int int_3 = 0;
volatile byte byte_1 = 0;
volatile byte byte_2 = 0;
volatile byte byte_3 = 0;
volatile boolean bool_1 = 0;
volatile boolean bool_2 = 0;
volatile boolean bool_3 = 0;
void setup() {
Serial.begin (115200);
}
void loop() {
for (int j = 0; j < nb_of_pass ; j++) {
// STEP 1: We first calculate the time taken to run a dummy FOR loop to measure the overhead cause by the execution of the loop.
initial_time = micros();
for (int i = 0; i < nb_of_interation_per_pass ; i++) {
dummy++; // A dummy instruction is introduced here. If not, the compiler is smart enough to just skip the loop entirely...
}
final_time = micros();
duration_dummy_loop = float(final_time - initial_time)/nb_of_interation_per_pass; // The average duration of a dummy loop is calculated
dummy = 0;
// STEP 2 (optional): Pick some relevant random numbers to test the command under random conditions. Make sure to pick numbers appropriate for your command (e.g. no negative number for the command "sqrt()")
randomSeed(micros()*analogRead(0));
byte_1 = random(0,256);
byte_2 = random(1,256);
// STEP 3: Calculation of the time taken to run the dummy FOR loop and the command to test.
initial_time = micros();
for (int i = 0; i < nb_of_interation_per_pass ; i++) {
dummy++; // The dummy instruction is also performed here so that we can remove the effect of the dummy FOR loop accurately.
// **************** PUT YOUR COMMAND TO TEST HERE ********************
byte_3 = byte_1/byte_2; // Target command example
// **************** PUT YOUR COMMAND TO TEST HERE ********************
}
final_time = micros();
// STEP 4: Calculation of the time taken to run only the target command.
duration = float(final_time - initial_time)/nb_of_interation_per_pass - duration_dummy_loop;
duration_sum += duration;
dummy = 0;
Serial.print(j);
Serial.print(". ");
print_result(duration);
}
Serial.println();
Serial.println("********* FINAL AVERAGED VALUE *********** ");
print_result(duration_sum/nb_of_pass);
Serial.println("****************************************** ");
Serial.println();
duration_sum = 0;
delay(2000);
}
void print_result(float value_to_print){
Serial.print("Time to execute command: ");
Serial.print("\t");
Serial.print(value_to_print,3);
Serial.print(" us");
Serial.print("\t");
Serial.print(round(value_to_print*16));
Serial.println(" cycles");
}
The method I used can be summarized as follow:
- All variables are declared globally and volatile to minimize any possible compiler optimization.
- The time to run a dummy command a large number of times is measured using micros().
- The time to run the target command AND the dummy command the same number of times is measured using micros(). The time to run only the command is then obtained by a simple subtraction.
This technique should get rid of the overhead caused by the loop. The large number of iterations (I typically used 30000 iterations) also provides a relatively accurate timing despite the limited accuracy of micros(). After dividing by the number of iterations, the timing values can thus be measured to about ±0.02 us (~1/3 of a clock cycle). Also, I found that the timing of some operations depends on the values used to make the calculations (e.g. floating point operations, divisions, etc.). For these operations, I picked relevant random numbers and performed the timing calculation using at least 100 different values.
As all the variables are declared volatile, this code returns timing that includes the time required to read and write the variables to RAM. If we want only the time required by the Arithmetic Logic Unit to perform the operation, we have to subtract the time to read and write to RAM. The good news is that the time to access RAM can be measured easily. For example, writing to RAM corresponds to the time required to assign a known constant to a volatile variable (e.g. byte_1 = 128;).
From what I understand, the timing values I measured appear to be valid. For example, the measured times to access RAM (i.e. 2 clock cycles per byte) and to add two bytes (1 cycle if the bytes are in the Register) corresponds to that reported in the ATmega328 datasheet. Also, the time to execute asm("nop\n\t") is measured as 1 cycle as it should.
Comments and remarks are welcome.
Arduino calculation speed.pdf (139 KB)