Due not going as fast as it should??

I have a project that requires my microprocessor to be as fast as possible. I have a set of calculations that need to be done repeatedly. The end goal is to send signals to stepper motor drivers so they follow trajectories. I purchased an Arduino Due (upgrading from a Mega) because it is theoretically much faster.

However, after benchmarking the actual speeds of the two boards, I am finding that the Due is not nearly as fast as it should be.

From the clock speeds alone I would expect the Due to be 5.25x faster than the Mega (84 MHz / 16 MHz), and that is without taking into account the 32-bit processor versus the 8-bit one. I don't know the details, but from what I have read it is a big upgrade.

I ran identical code on both processors. The calculations take ~968 µs on the Due and ~2832 µs on the Mega, so by that measure the Due is not even 3x faster than the Mega. Below is the code.

float a0, a1, a2, a3, a4, a5, a6;     //polynomial coefficients
float thv = 360;            //middle angle
float ths = 0;             //initial theta
float thf = 0;            //final theta
float tf = 2;               //movement duration
float t=1.5;
float angle;
unsigned long start, stop, elapsed;   // micros() returns unsigned long; a float loses precision on large values
void setup()
{
  Serial.begin(9600);
}

void loop()
{
  start=micros();
  a0 = ths;
  a1 = 0;
  a2 = 0;
  a3 = (2 / pow(tf, 3)) * (32 * (thv - ths) - 11 * (thf - ths));
  a4 = -1 * (3 / pow(tf, 4)) * (64 * (thv - ths) - 27 * (thf - ths));
  a5 = (3 / pow(tf, 5)) * (64 * (thv - ths) - 30 * (thf - ths));
  a6 = -1 * (32 / pow(tf, 6)) * (2 * (thv - ths) - (thf - ths));
  angle = pow(t, 6) * a6 + pow(t, 5) * a5 + pow(t, 4) * a4 + pow(t, 3) * a3 + a0;   //calculate desired position
  stop=micros();
  elapsed=stop-start;
  Serial.println(elapsed);
  //takes ~968us on due
  //takes ~2832us on mega
}

Am I missing something?

Probably most of the calculation.

I doubt that a Mega could carry out that much floating-point arithmetic in 2832 µs.

For comparison, the AVR benchmark for pow(1.234, 5.678) lists 9293 cycles on AVR2 and 5047 cycles on AVR4.

pow() on a Due takes doubles as arguments, so it is mostly calculating twice as many bits AND wasting time converting back and forth between float and double...

And since all your values are currently constants, it's hard to say how much of your calculation is being done at compile time instead of by the AVR or ARM.

Because none of the computed values are used, the whole expression gets optimized to nothing (my guess).
I think the measured run time shows that quite clearly: one full round takes half the time of one pow().
How could that be, if it is computed at all?
And I doubt that pow() (even if called with constants as arguments) gets computed at compile time.
A disassembly of the generated code could show what's really executed.

I surrender to the very very clever compiler.
The code gets executed, but in a very optimized way.
Many of the above used floats seem to allow shortcuts in the computation,
and there are a lot of common subexpressions.

I tried a simpler test showing that the second of two identical pow() calls gets optimized away, unless you use a volatile argument:

float expo = 5.678;
float base = 1.234;
volatile float vBase = 1.234;
float result1;
float result2;
unsigned long start, stop, elapsed;

void setup(){
  Serial.begin(115200);

  start = micros();
  result1 = pow(base, expo) + 5;
  result2 = pow(base, expo);
  stop = micros();
  elapsed = stop - start;
  Serial.print("elapsed standard ");
  Serial.println(elapsed);

  start = micros();
  result1 = pow(vBase, expo) + 5;
  result2 = pow(vBase, expo);
  stop = micros();
  elapsed = stop - start;
  Serial.print("elapsed volatile ");
  Serial.println(elapsed);
}
void loop() {}

Output (times in µs):
elapsed standard 340
elapsed volatile 744

You guys rock. After removing the pow() calls my speeds are way faster: 32 µs on the Due and 400 µs on the Mega, so 12.5x faster, which is much closer to what I was expecting to get.

Man... The Due is turning out to be a bitch to code with. Don't use Serial.println because it literally takes 300x longer than SerialUSB.println! Don't use digitalWrite because it is even slower than the Mega's digitalWrite; you have to use this abstract REG_PIOC_CODR = 0x1 << 26; crap. Don't use pow() either because it is slower than on the Mega!

I bought it to blow the Mega's speed out of the water, but it is turning into a minefield.

Do you guys know of any other quirks of the Due that are actually slower than on the other Arduino models?

Interesting, I think I am only half following you though. Fortunately my project does not require anything to be raised to a power that is not an integer. So I guess I will just be typing out all of the individual multiplications for now instead of using the pow() function.

In the posted code you see two identical pow() calls embedded in different statements.

From the timing you can see that pow() only gets executed once in the first variant.

I forced different behaviour with the volatile in the second variant.

Same computation, but it takes more than twice as long, because it really gets computed twice.

I used the parameters from the AVR benchmarks, so 340 matches the 5047 cycles (about 315 µs at 16 MHz) quite well.
The different time bases (the AVR benchmark uses processor cycles) confused me at first.

For integer exponents you are better off using multiplication or writing a powI(float, int) function.

float expo = 2; // 5.678;
float base = 1.234;
volatile float vBase = base;
float result1;
float result2;
unsigned long start, stop, elapsed;

void setup() {
  Serial.begin(115200);
  Serial.print("base ");
  Serial.print(base, 3);
  Serial.print(" expo ");
  Serial.println(expo, 3);
  start = micros();
  result1 = pow(base, expo) + 5;
  result2 = pow(base, expo);
  stop = micros();
  pRes("standard", result1, result2);
  start = micros();
  result1 = pow(vBase, expo) + 5;
  result2 = pow(vBase, expo);
  stop = micros();
  pRes("volatile", result1, result2);
  start = micros();
  result1 = base * base + 5;
  result2 = base * base;
  stop = micros();
  pRes("stdmulti", result1, result2);
  start = micros();
  result1 = vBase * vBase + 5;
  result2 = vBase * vBase;
  stop = micros();
  pRes("volmulti", result1, result2);

}
void pRes(const char* tag, float r1, float r2) {   // const: string literals are passed in
  elapsed = stop - start;
  Serial.print("elapsed ");
  Serial.print(tag);
  Serial.print(" ");
  Serial.println(elapsed);
  if (r1 < r2) {
    Serial.println("Ups.");
  }
}
void loop() {}

Output (times in µs):
base 1.234 expo 2.000
elapsed standard 352
elapsed volatile 692
elapsed stdmulti 24
elapsed volmulti 40

For integer exponents you are better off using multiplication or writing a powI(float, int) function.

Are you sure about that? Most of the documentation I find says that powl() is for "long double" variables.
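
As far as I can tell, powI(float, int) in the suggestion above is just a proposed name for a helper you write yourself, not the standard powl() (which is indeed the long double variant of pow()). A minimal sketch of what such a helper could look like, assuming plain repeated multiplication is good enough for small exponents:

```cpp
// Hypothetical helper, not a library function: raise a float to an
// integer power by repeated multiplication.
float powI(float base, int exponent) {
    bool negative = exponent < 0;
    // Work with the magnitude of the exponent.
    unsigned int n = negative ? 0u - (unsigned int)exponent
                              : (unsigned int)exponent;
    float result = 1.0f;
    while (n > 0) {
        result *= base;
        --n;
    }
    // A negative exponent means the reciprocal of the positive power.
    return negative ? 1.0f / result : result;
}
```

With this, pow(t, 6) in the original sketch would become powI(t, 6). For exponents this small, typing out the multiplications directly should be at least as fast.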

@Soronemus: Exactly which values are variables in your calculation?

  a3 = (2 / pow(tf, 3)) * (32 * (thv - ths) - 11 * (thf - ths));
  a4 = -1 * (3 / pow(tf, 4)) * (64 * (thv - ths) - 27 * (thf - ths));
  a5 = (3 / pow(tf, 5)) * (64 * (thv - ths) - 30 * (thf - ths));
  a6 = -1 * (32 / pow(tf, 6)) * (2 * (thv - ths) - (thf - ths));
  angle = pow(t, 6) * a6 + pow(t, 5) * a5 + pow(t, 4) * a4 + pow(t, 3) * a3 + a0;

You know, regardless of whether pow() is "good" or not, this is not a very fast way to write the function; you're needlessly repeating work that you've already done. How about:

float fast_f(float tf, float t) {
    static float last_tf = 0.0;   // Remember last tf
    static float a0, a3, a4, a5, a6;   // a1 and a2 are always zero
    float tffr;    // running reciprocal power of tf
    float tp;      // running power of t
    float angle;   // result

// Calculate coefficients, based on tf, only if tf has changed
//  assumes that thv, ths, thf are constants (the globals from the original sketch).
#define c_theta(c1, c2) ((c1) * (thv - ths) - (c2) * (thf - ths))
    if (last_tf != tf) {
        last_tf = tf;  // Save this as "last" value
        a0 = ths;
        tf = 1 / tf;  // We actually need reciprocal powers of tf...
                      //   and multiplying is faster than dividing.
        tffr = (tf * tf * tf);   // Start with 1/(tf cubed)
        a3 = (2 * tffr) * c_theta(32, 11);
        tffr *= tf;   // 1/(tf**4)
        a4 = -1 * (3 * tffr) * c_theta(64, 27);
        tffr *= tf;   // 1/(tf**5)
        a5 = (3 * tffr) * c_theta(64, 30);
        tffr *= tf;   // 1/(tf**6)
        a6 = -1 * (32 * tffr) * c_theta(2, 1);
    }

    // Have coefficients, now calculate power series
    tp = t;      // t**1
    angle = a0;
// angle += a1 * tp;   // (zero)
    tp *= t;     // t**2
// angle += a2 * tp;   // (zero)
    tp *= t;     // t**3
    angle += a3 * tp;
    tp *= t;     // t**4
    angle += a4 * tp;
    tp *= t;     // t**5
    angle += a5 * tp;
    tp *= t;     // t**6
    angle += a6 * tp;

    return angle;
}

So, I did a bit more research, and wrote some benchmark code that should make it difficult for things to be optimized away...

  1. AVR has highly optimized math functions for both basic math and the pow() function.
  2. ARM Cortex M3 (Due) has highly optimized basic math, but non-optimized pow() function.
  3. ARM Cortex M0 (Zero) doesn't have any optimized floating point.
  4. As per previous messages, the ARM "pow()" function defaults to using/producing doubles, so it's significantly slower than using powf() (which uses floats.) Both of these are sort of pointless for integer exponents, though. You're better off implementing your own looped multiplication function. (which I did.)
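
To illustrate point 4, here is a minimal sketch of the two call paths. The wrapper names are made up; pow() and powf() are the standard C math functions, and the explicit casts only spell out the conversions that otherwise happen implicitly when float values are passed to the double version.

```cpp
#include <math.h>

// Double-precision path: the float arguments are widened to double,
// pow() works in double, and the result is narrowed back to float.
float power_via_double(float base, float expo) {
    return (float)pow((double)base, (double)expo);
}

// Single-precision path: powf() takes and returns float, so no
// widening or narrowing is involved.
float power_via_float(float base, float expo) {
    return powf(base, expo);
}
```

Both return the same value to within float precision, so switching to powf() should only change the run time, not the result in any way that matters at float accuracy.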

Here are some results. (times in microseconds for 1000 iterations.) (code attached)
powd_function is using the double pow() function.
powf_function is using the float powf() function.
powi_function is using a homebrew looping integer power function.
fast_function is the successive multiple version similar to what I posted above.

AVR Times:
powd_function: Total Time = 2753076
powf_function: Total Time = 2753080   [AVR doesn't support double (it is treated as float), so time is the same as float]
powi_function: Total Time = 390624
fast_function: Total Time = 129404

Due Times:
powd_function: Total Time = 1106214
powf_function: Total Time = 708751   [approx 3.9x faster than AVR]
powi_function: Total Time = 30736    [approx 12x faster than AVR]
fast_function: Total Time = 16156    [approx 8x faster than AVR]

Yes, I suppose the powf()/powd() relative performance is a little disappointing.
And I'm a little surprised that the fast_function() performance isn't much better than the powi() code (but twice as fast is ... not too bad). Note that this code hasn't been extensively tested to see whether the results are correct. There could be significant bugs, and there is room for more improvement.
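
For anyone who wants to write a similar benchmark: this is a minimal sketch of one way to keep the compiler from optimizing the work away. It is not the attached code. The function name and the choice of powf() are assumptions for illustration; on an Arduino you would wrap the call in micros() readings.

```cpp
#include <math.h>

// Inputs are volatile, so the compiler has to reload them on every
// iteration and cannot precompute or merge the powf() calls.
volatile float v_base = 1.234f;
volatile float v_expo = 5.678f;

// Hypothetical helper: run powf() `iterations` times and return the sum.
// Returning (and using) the sum keeps the loop from being discarded.
float run_powf_benchmark(int iterations) {
    float sum = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        sum += powf(v_base, v_expo);
    }
    return sum;
}
```

Printing the returned sum after the timing, as the earlier test does with its results, is what makes the value count as "used".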

pwrseries.zip (2.09 KB)