Benchmarking the Due

The Due has a Flash buffer to speed up access (a "memory accelerator" rather than a cache), so a tight loop might benefit from it. Floating point requires function calls, since the Due's processor has no floating-point hardware, and those calls might not use the buffer effectively.

A long is 4 bytes.

The execution time is largely proportional to the number of instructions and memory accesses required. It can be hard to guess from the C source what assembler the compiler generates, so it is well worth looking at the generated code. Your integer test is likely done mostly in registers, and a smart optimiser could remove some of the steps altogether.

Remember that an 8-bit processor needs a lot of extra code to do 4-byte arithmetic, so it is far more than 4 times slower. When you add everything up, 280 times slower is not unreasonable.
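
To make the 4-byte point concrete, here is a rough sketch of a 32-bit multiply built only from 8x8-bit multiplies, which is about the best an 8-bit core can do. The name mul32_bytewise and the partial-product counter are made up for illustration, and I am assuming a CPU that at least has an 8x8 hardware multiply; the routine your compiler actually emits will look different.

```cpp
#include <stdint.h>

// Multiply two 32-bit values using only 8x8 -> 16 bit multiplies.
// Illustration only (hypothetical helper, not the compiler's real routine).
// *partial_products receives the number of 8x8 multiplies that were needed.
uint32_t mul32_bytewise(uint32_t a, uint32_t b, unsigned *partial_products)
{
    uint32_t result = 0;
    unsigned count = 0;

    for (int i = 0; i < 4; i++) {
        uint8_t ai = (uint8_t)(a >> (8 * i));       // byte i of a
        // Products with i + j >= 4 land entirely above bit 31, so skip them.
        for (int j = 0; i + j < 4; j++) {
            uint8_t bj = (uint8_t)(b >> (8 * j));   // byte j of b
            uint16_t p = (uint16_t)(ai * bj);       // one 8x8 multiply
            // On a real 8-bit CPU this is a multi-byte add with carries
            // into the right result bytes, not a single shift and add.
            result += (uint32_t)p << (8 * (i + j));
            count++;
        }
    }

    if (partial_products) {
        *partial_products = count;
    }
    return result;
}
```

One long multiply turns into 10 byte multiplies plus all the carrying and register shuffling around them, whereas a 32-bit core can typically do the whole thing in one instruction.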

I think you are right about float vs double. You could try sinf and cosf, which are the single-precision versions.
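
Something along these lines would let you compare the two. It is only a sketch: time_sinf and time_sin are names I made up, and it uses the standard clock() so that it runs anywhere; on the board you would probably take the timestamps with micros() instead. Note that each loop also includes one floating-point add, so the numbers are for the call plus the increment.

```cpp
#include <math.h>
#include <time.h>

// Volatile sinks stop the compiler from optimising the loops away.
static volatile float  sink_f;
static volatile double sink_d;

// Seconds spent on n calls to single-precision sinf().
double time_sinf(long n)
{
    clock_t start = clock();
    float x = 0.0f;
    for (long i = 0; i < n; i++) {
        sink_f = sinf(x);
        x += 0.001f;
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

// Seconds spent on n calls to double-precision sin().
double time_sin(long n)
{
    clock_t start = clock();
    double x = 0.0;
    for (long i = 0; i < n; i++) {
        sink_d = sin(x);
        x += 0.001;
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Run both with the same n and compare the results. If the float version is clearly faster, that supports the float vs double explanation.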