RayLivingston:
Write the code, and time how long it takes to run using millis() or micros().
While this advice is good in concept, it's woefully lacking in the practical details needed to produce meaningful measurements.
There are many, many pitfalls.
For example, the compiler often optimizes code in unexpected ways. Many people who've tried to measure how long some particular arithmetic expression takes have ended up with zero time, because they used constant operands. Even if you call a function, if the parameters are constants and the function is in the same file, the compiler can often propagate the constants and compute part or all of the result at compile time.
Even if you keep the compiler from pre-computing answers, it can still spot common subexpressions and find crafty ways to compute a value once and reuse it in two or more places.
The compiler can also move part or all of your computation outside of the region between where you measure the start and stop times. If the results only depend on constants and in-memory data, the compiler "knows" memory doesn't magically change, so it can employ all sorts of crafty optimizations to do some of the work before or after any particular point in time.
Many of these optimizations are defeated by declaring variables volatile, but doing so forces the compiler to make extra memory accesses in many cases, which can make the measured code run much slower than it normally would.
Even if all these issues are avoided, the resolution of micros() is one microsecond, which is 84 clock cycles on the Arduino Due. That's far too coarse for measuring something as fast as a basic math operation. The micros() function also contains quite a bit of code, so the call itself has considerable overhead.
If you perform the operation many times in a loop, the looping overhead can greatly skew the results; branching in particular tends to be slow on pipelined processors.
Unrolling loops, which involves making many copies of the code, tends to give better results than looping. But even that is error-prone on Arduino Due, because the CPU runs much faster than the flash memory. There's a special buffer and small cache between the flash memory and CPU core. Unrolled code tends to run fairly well, since the buffer pre-fetches ahead of where the CPU is executing, but it also tends to wipe cached data, adding extra latency to any branching.
My long-winded point is there are a LOT of pitfalls to measuring the speed of code on fast and complex processors like Arduino Due. The general idea is good, but to suggest "Write the code, and time how long it takes" without any specific guidance to avoid all the issues is hardly helpful.
At the very least, meaningful benchmarking needs to be validated first on similar code with predictable timing (perhaps by analyzing the disassembly). Then the code under test needs to be checked to make sure it doesn't compile to something very different. Accurate measurement is anything but easy if you just write some simple code and take a reading without considering the sources of error.