Topic: How fast does it need for +,-,*,/ at Arduino Due

DUEDUE

Hi

I would like to know how much time the processor needs for each operation (+, -, /, *). Where can I find that out?

RayLivingston

Integer addition, subtraction, and multiplication should all be single-clock operations.  Division likely takes longer.  Floating point operations are considerably slower still: the Due's Cortex-M3 has no FPU, so they are done by software library routines.  Study the SAM data sheet for details on integer operations.  You'll have to look at the generated assembler code for the floating point operations.

But, then, why do you care, unless you're doing a LOT of computation...

Regards,
Ray L.

RayLivingston

You can also very easily determine it empirically, by running a loop containing many iterations of a single operation and timing the loop execution.  Just be sure your code USES the results, so the code doesn't get optimized away.
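A minimal sketch of that idea, assuming float division is the operation of interest. The names `timeFloatDivisions` and `sink` are made up for illustration, and the non-Arduino branch is only a stand-in for micros() so the code can be desk-checked on a PC. The figure it prints still includes loop overhead and the extra memory traffic caused by `volatile`.

```cpp
#include <stdint.h>

#ifdef ARDUINO
#include <Arduino.h>
#else
#include <chrono>
// Host stand-in for Arduino's micros() (assumption: desk-checking only).
static unsigned long micros() {
    using namespace std::chrono;
    return (unsigned long)duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}
#endif

// Volatile sink: every result is stored, so the work counts as "used".
volatile float sink;

// Runs `iterations` float divisions and returns the elapsed microseconds.
unsigned long timeFloatDivisions(uint32_t iterations) {
    volatile float a = 1234.5f;   // volatile inputs prevent constant folding
    volatile float b = 6.789f;
    unsigned long start = micros();
    for (uint32_t i = 0; i < iterations; i++) {
        sink = a / b;
    }
    return micros() - start;      // includes loop and load/store overhead
}

#ifdef ARDUINO
void setup() {
    Serial.begin(115200);
    unsigned long us = timeFloatDivisions(100000UL);
    Serial.print("100000 float divisions incl. overhead: ");
    Serial.print(us);
    Serial.println(" us");
}

void loop() {}
#endif
```

Swapping the `/` for `+`, `-` or `*` gives the other three measurements.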

Regards,
Ray L.

RayLivingston

The instruction details are in section 12.9.  It does have hardware divide.  Of course, as with any cached processor, actual performance depends a LOT on whether the instructions and data are cached.  Most instructions are single-cycle when cached, but can take many cycles if not.  Actual performance is heavily dependent on the actual instruction sequence generated by the compiler, and where things are in memory.  Profiling your actual code is by FAR the best way to determine what the performance will be for your application.

Regards,
Ray L.

RayLivingston

DUEDUE wrote:
Well, I have to document some complicated code I wrote for the Arduino Due.

I would run a program like this to measure the time, but my Arduino Due doesn't work.
Can anyone try it for me?

For the +, -, /, * operations with float?

Thanks, dudes
Code:

volatile int x;        // volatile, so the compiler cannot discard the divisions
unsigned long time;
 
void setup ()
{
    Serial.begin(115200);
}
 
void loop()
{
    time = micros();
    for (int i = 0; i <= 100; i++) {
        for (int j = 1; j <= 100; j++) {   // start at 1: dividing by 0 is undefined
            x = i/j;
        }
    }
    time = micros() - time;
    Serial.print("Completion time: ");
    Serial.println(time);
    while (1) {}       // measure once, then halt
}

That code is not likely to give you a good idea of the actual arithmetic performance, as it will be spending more time executing the loop than doing the actual division.  You need to do enough operations in the loop to swamp the loop overhead.

Again, profiling the ACTUAL code is the best way to go.  Actual performance will be heavily impacted by the data access patterns, loop overhead, other operations, etc., etc.

Regards,
Ray L.

DUEDUE

Hi, thanks for your reply.

How can I profile my ACTUAL code?

RayLivingston

Write the code, and time how long it takes to run using millis() or micros().

Regards,
Ray L.

pjrc

#7
Jan 25, 2015, 06:39 am Last Edit: Jan 25, 2015, 06:42 am by Paul Stoffregen
The hardware integer divide instructions (UDIV does unsigned, SDIV does signed) each take 2 to 12 cycles on the Cortex-M3, depending on the operand values.

Typically, surrounding code includes several instructions to move registers around, initialize registers with constants, and lots of other minor but unexpected overhead.  You can see this stuff by using "objdump -d" to disassemble the .elf file to assembly.  But analyzing the assembly code can be quite tedious, especially if you're not familiar with ARM assembly.


pjrc

RayLivingston wrote:
Write the code, and time how long it takes to run using millis() or micros().

While this advice is good in concept, it's woefully lacking in the practical details necessary to get meaningful measurements.

There are many, many pitfalls.

For example, the compiler often will optimize code in unexpected ways.  Many people who've tried to measure how much time some particular arithmetic expression required ended up with zero time, because they used constants.  Even if you call a function, if the parameters are constants and the function is in the same file, the compiler can often propagate the constants and compute part or all of the result at compile time.

Even if you avoid letting the compiler pre-compute answers, the compiler can still sometimes identify common subexpressions and figure out crafty ways to compute something once and use it in 2 or more places.

The compiler can also move part or all of your computation outside of the region between where you measure the start and stop times.  If the results only depend on constants and in-memory data, the compiler "knows" memory doesn't magically change, so it can employ all sorts of crafty optimizations to do some of the work before or after any particular point in time.

Many of these optimizations are defeated by making variables "volatile", but doing so forces the compiler to make more memory accesses in many cases.  Sometimes this can result in much slower results than would be seen normally.

Even if all these issues are avoided, the resolution of micros() is 1 microsecond, which is 84 clock cycles on the 84 MHz Arduino Due.  That's far too coarse for measuring something as fast as basic math operations.  The micros() function contains quite a bit of code, as well, so it has considerable overhead.
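To put numbers on that, here is a small worked calculation, assuming the Due's 84 MHz core clock. The helper names are invented for this example and are not part of any Arduino API.

```cpp
#include <stdint.h>

const uint32_t kCpuHz = 84000000UL;   // Arduino Due core clock (84 MHz)

// CPU cycles that pass during one micros() tick.
uint32_t cyclesPerMicrosecond(uint32_t cpuHz) {
    return cpuHz / 1000000UL;
}

// Duration of `cycles` CPU cycles, in nanoseconds.
double cyclesToNanos(uint32_t cycles, uint32_t cpuHz) {
    return (double)cycles * 1e9 / (double)cpuHz;
}

// Back-to-back operations needed before the timed region spans
// `ticks` micros() ticks (more ticks means less quantization error).
uint32_t opsNeeded(uint32_t cyclesPerOp, uint32_t ticks, uint32_t cpuHz) {
    return ticks * cyclesPerMicrosecond(cpuHz) / cyclesPerOp;
}
```

With these numbers a 2-cycle operation lasts roughly 24 ns, so about 4200 of them are needed just to fill 100 ticks of micros(), which keeps the quantization error near 1%.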

If you try to do the operation many times using a loop, the looping overhead can greatly skew the results.  Branching in particular tends to be slow on pipelined processors.

Unrolling loops, which involves making many copies of the code, tends to give better results than looping.  But even that is error-prone on Arduino Due, because the CPU runs much faster than the flash memory.  There's a special buffer and small cache between the flash memory and CPU core.  Unrolled code tends to run fairly well, since the buffer pre-fetches ahead of where the CPU is executing, but it also tends to wipe cached data, adding extra latency to any branching.

My long-winded point is there are a LOT of pitfalls to measuring the speed of code on fast and complex processors like Arduino Due.  The general idea is good, but to suggest "Write the code, and time how long it takes" without any specific guidance to avoid all the issues is hardly helpful.

At the very least, meaningful benchmarking needs to be carefully tested first on similar code with predictable timing (perhaps by analyzing disassembly).  Then the code to be tested needs to be checked to make sure it doesn't trigger very different compiled code.  It's anything but easy to accurately measure by just writing some simple code and making a measurement without considering sources of error.

RayLivingston

All of which is why I said "Again, profiling the ACTUAL code is the best way to go."....

Regards,
Ray L.

robtillaart

RayLivingston wrote:
That code is not likely to give you a good idea of the actual arithmetic performance, as it will be spending more time executing the loop than doing the actual division.  You need to do enough operations in the loop to swamp the loop overhead.

Again, profiling the ACTUAL code is the best way to go.  Actual performance will be heavily impacted by the data access patterns, loop overhead, other operations, etc., etc.

Hi Ray,
it is simpler than you think. The trick is to make two time measurements, one after the other:
- one with 2 statements per iteration (e.g. 1000000 = 1M loops)
- one with 1 statement per iteration

Subtracting the two values automatically removes the overhead (which is identical for both loops),
leaving the actual time for 1M divisions.
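The two-measurement trick could look roughly like this. It is a sketch under assumptions: integer division as the statement, and names such as `perOperationNanos` invented for the example. The `volatile` variables keep the compiler from merging the two identical divisions, at the cost of extra loads and stores that end up in the result.

```cpp
#include <stdint.h>

// Time of one extra statement, in nanoseconds, from the two loop timings.
double perOperationNanos(unsigned long twoOpsMicros,
                         unsigned long oneOpMicros,
                         uint32_t iterations) {
    return ((double)twoOpsMicros - (double)oneOpMicros) * 1000.0
           / (double)iterations;
}

#ifdef ARDUINO
#include <Arduino.h>

const uint32_t N = 1000000UL;            // 1M loops, as suggested above
volatile int32_t a = 123456, b = 789;    // volatile inputs: no constant folding
volatile int32_t r1, r2;                 // volatile outputs: results are "used"

unsigned long timeOneDivision() {
    unsigned long start = micros();
    for (uint32_t i = 0; i < N; i++) { r1 = a / b; }
    return micros() - start;
}

unsigned long timeTwoDivisions() {
    unsigned long start = micros();
    for (uint32_t i = 0; i < N; i++) { r1 = a / b; r2 = a / b; }
    return micros() - start;
}

void setup() {
    Serial.begin(115200);
    unsigned long t1 = timeOneDivision();
    unsigned long t2 = timeTwoDivisions();
    Serial.print("approx. ns per division: ");
    Serial.println(perOperationNanos(t2, t1, N));
}

void loop() {}
#endif
```

It is worth checking the disassembly to confirm both loops were compiled the same way apart from the extra statement; otherwise the overhead does not cancel.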


Another trick I once used was the following. If you have an X MHz processor, make a loop of X million iterations (84 million on the 84 MHz Due). The elapsed time in seconds then equals the number of clock cycles used per iteration (including loop overhead). It is fun to, sort of, count the clock cycles that way.


Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

schwingkopf

I think the most direct way to measure code execution time is to toggle a digital output at the beginning and the end of the desired code snippet and measure the time difference with a scope. When using direct port manipulation this should give an accuracy of a couple of CPU cycles or better.

This way you are least invasive to the way the actual code is compiled.
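A rough sketch of the scope approach, assuming digital pin 13 on the Due, which maps to port B bit 27 (PB27). The float multiplication is only a placeholder for the code under test, and `pulseNanosToCycles` is a made-up helper for converting the scope reading.

```cpp
#include <stdint.h>

// Converts a pulse width read off the scope (in nanoseconds) to CPU cycles.
double pulseNanosToCycles(double pulseNanos, uint32_t cpuHz) {
    return pulseNanos * (double)cpuHz / 1e9;
}

#ifdef ARDUINO
#include <Arduino.h>

volatile float a = 3.25f, b = 1.5f, result;   // placeholder operands

void setup() {
    pinMode(13, OUTPUT);          // let the core configure the pin first
}

void loop() {
    PIOB->PIO_SODR = PIO_PB27;    // pin 13 high: a single register write
    result = a * b;               // code under test
    PIOB->PIO_CODR = PIO_PB27;    // pin 13 low
    delay(1);                     // gap so the scope can trigger on each pulse
}
#endif
```

Measuring an empty pulse first (the high write followed directly by the low write) gives the toggle overhead to subtract from later readings.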

robtillaart

The best way is to use an internal timer; then you can measure the number of clock cycles.
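One possible internal timer is the Cortex-M3 DWT cycle counter, read through the CMSIS register names. This is a sketch that assumes the counter is available on the Due's SAM3X8E; the post does not say which timer was meant, and `elapsedCycles` is a name invented here.

```cpp
#include <stdint.h>

// Cycles between two readings of a free-running 32-bit up-counter.
// Unsigned subtraction stays correct across a single wrap-around.
uint32_t elapsedCycles(uint32_t start, uint32_t end) {
    return end - start;
}

#ifdef ARDUINO
#include <Arduino.h>

volatile int32_t a = 123456, b = 789, result;   // placeholder operands

void setup() {
    Serial.begin(115200);

    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // power up the trace block (DWT)
    DWT->CYCCNT = 0;                                 // reset the cycle counter
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting CPU cycles

    uint32_t t0 = DWT->CYCCNT;
    uint32_t t1 = DWT->CYCCNT;     // back-to-back reads give the read overhead
    result = a / b;                // code under test
    uint32_t t2 = DWT->CYCCNT;

    Serial.print("overhead cycles: ");
    Serial.println(elapsedCycles(t0, t1));
    Serial.print("cycles incl. overhead: ");
    Serial.println(elapsedCycles(t1, t2));
}

void loop() {}
#endif
```

The count still includes the volatile loads and the store around the division, so it is an upper bound for the operation itself.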
Rob Tillaart

bobcousins

#13
Jan 26, 2015, 08:05 pm Last Edit: Jan 26, 2015, 08:06 pm by bobcousins
The best way is... who knows? As usual, we have too little information from the OP to go on.

Sounded a lot like one of those generic homework questions.




Please ask questions in the forum so everyone can benefit. PM me for paid work.
