Go Down

Topic: ArduinoDUE - FloatingPointVS Integer math (Read 1 time) previous topic - next topic


Hi everybody,
I would like to ask you a simple question. Does anybody can explain me why floating point double precision math are almost 2x faster than integer math on ArduinoDUE?

I've implemented some digital signal processing on my arduino, like 5KS 16 bit sampling, quadrature demodulation, antialiasing filtering, downsampling, LP filtering, mean and standard deviation calculations. As I read on SAM3X8E manual, the uC has not a floating point unit, but it can emulate it. That's why many forums suggest to use integer math instead of float or double math, and that's the reason why the first implementation of the algorythm I did, used integer math.
But yesterday I was asking my self if I could get more precision using double math, even with some more computation time, and after some code upgrade, I got the final result:

  • more precision

  • half computation time

I'm wondered of the results I got and I just can not explain my self the reason why on an average of 11 algorythm executions, I get an average execution time of 714.18 [us] for integer math, against 355.28 [us] for double precision floating point math, i.e. 32bit native math VS 64bit emulated math.

I'm asking you this question, because this is part of my thesis work and I'm not able to justify this behavior, this is not the one expected.. Thanx


I also am baffled. That result makes no sense at all. The only way to really answer the question is to have an example though. We'd need to see some code that exhibits this behavior. Off the top of my head I can think of only a couple of possibilities:

1. Have you tested for proper output from both versions? That is, can you verify that both produce the right result?
2. The algorithm used for integer math was terrible compared to the one used for floating point. It is possible to write bad code in any language and at any time. So, make sure your algorithms are efficiently constructed.

Equivalent versions of an algorithm with one coded as integer math and the other FP should show the integer version to be many times faster on a Due.


maybe floating point is done in hardware and integer math (partly) in software?

Please post sample code so it can be verified.
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)


Hi guys, thank for your replies. The code is quite long, so I'll try to post an extract for the two version in two different posts.
Before I post the code, I want to grant you that I get the same result in both ways. The algorythm has been first written in Matlab and then translated in C. To compare the results, I get the raw samples printed on Serial port, I copy and paste them in Matlab passing them through matlab algorythm. Except some normal fluctuations on the 4th decimal, the result it's the same.

Now,think about I've dinamically allocated some vectors using polyAlloc() function: rawSignal, polySignal, polyIQ and polyResult. I've also succesfully acquired 5000 16 bit samples and stored them inside rawSignal vector. Then I call computeOutput() function, that uses the filter() function.


This is the integer math version:
Code: [Select]

#define nelems(x) sizeof(x)/sizeof(*x)

const int Fs = 5000; // Sample Rate
const int D = 5; // Down sampling factor
float T = 1.0; // Acquisition Time sec
double meanValue = 0.0;
double stdValue = 0.0;

int volatile sampleInd = 0; // Global Buffer Location
int * volatile rawSignal; // Acquisition RAM Buffer
int * polySignal[D]; // PolySignal branches
int * polyIQ[2]; // Quadrature filtered, downsampled, demodulated signals
unsigned int * polyResult; // Reconstructed, LP filtered signal

// I-Q quadrature oscillations
const float I[D] = {2.0, 0.61803398,-1.61803398,-1.61803398, 0.61803398};// Cosine
const float Q[D] = {0.0, 1.90211303, 1.17557050,-1.17557050,-1.90211303};// Sine

// AntiAliasing filter coefficients :: FIR window Kaiser, Fs=5000, Fc=300
const float AAF[15] = {0.01786240,0.03388546,0.05089129,0.06751284,0.08232631,0.09400803,0.10148476,0.10405780,0.10148476,0.09400803,0.08232631,0.06751284,0.05089129,0.03388546,0.01786240};
float * polyAAF[D]; // Polyphase downsampled impulse responses

// LP filter coefficients :: FIR window Kaiser, Fs=1000, Fc=15
const float LP[21] = {0.04144742,0.04320807,0.04483017,0.04629891,0.04760078,0.04872372,0.04965724,0.05039261,0.05092289,0.05124307,0.05135014,0.05124307,0.05092289,0.05039261,0.04965724,0.04872372,0.04760078,0.04629891,0.04483017,0.04320807,0.04144742};

void computeOutput(){
timemicros = micros();
rawSignal[0] = rawSignal[5]; // First value adjust

int offset = 0; // Offset
for(int h=0;h<Fs*T;h++)

for (int i=0;i<Fs*T/D;i++) // Initialize arrays
polyIQ[0][i] = 0;
polyIQ[1][i] = 0;

for (int j=0;j<D;j++)
// DownSampling
for (int i=j;i<Fs*T;i+=5)
polySignal[j][i/5] = rawSignal[i]-offset; // Downsampling and offset removal

for (int i=j;i<nelems(AAF);i+=5)
polyAAF[j][i/5] = AAF[i];

// AntiAlias Filtering
filter(polyAAF[j],nelems(AAF)/D,polySignal[j],(int) Fs*T/D);

// Quadrature Demodulation
for (int i=0;i<Fs*T/D;i++)
polyIQ[0][i] += I[j]*(polySignal[j][i]);
polyIQ[1][i] += Q[j]*(polySignal[j][i]);

// Norm calculation :: sqrt(I^2+Q^2)
for (int i=0;i<Fs*T/D;i++)
polyResult[i] = (unsigned int) sqrt(polyIQ[0][i]*polyIQ[0][i] + polyIQ[1][i]*polyIQ[1][i]);

// LP filtering
filter((float *) LP,nelems(LP),(int*) polyResult,(int) Fs*T/D);

// Elaborate stats

double convFactor = 2.0*4.0/65535.0; // 2 * FullScale / (2^16-1)
meanValue *= convFactor;
stdValue  *= convFactor;



bool polyAlloc()
if(NULL != rawSignal)
free(rawSignal); // Deallocate buffer memory
rawSignal = NULL;
polyResult= NULL;

rawSignal  = (int *)malloc(Fs*T*sizeof(int)); // Dynamic Memory Allocation
polyResult = (unsigned int *)malloc(Fs*T/D*sizeof(int));

polyIQ[0]  = (int *)malloc(Fs*T/D*sizeof(int));
polyIQ[1]  = (int *)malloc(Fs*T/D*sizeof(int));

bool polys = true;
for (int j=0;j<D;j++)
polySignal[j]=    (int *)malloc(Fs*T/D*sizeof(int));
polyAAF[j]   =  (float *)malloc(nelems(AAF)/D*sizeof(float));

polys &= polySignal[j] != NULL && polyAAF[j] != NULL;

return (rawSignal[0] != NULL && polyIQ[0] != NULL && polyIQ[1] != NULL && polyResult != NULL && polys);

void filter(float *Num, int n, int *vect, int vLen)
int average = 0; // Initial conditions calculation
for(int h=0;h<vLen;h++)

int numbuf[n]; // Initialize FIFO FIR buffer
for(int h=0;h<n;h++)

for (int h=0;h<vLen;h++){ // Filtering cycle
int FIR = 0;

for (int k=n-1;k>0;k--) // Cycle FIR buffer elements
numbuf[0] = vect[h]; // Push new element inside FIR buffer

for (int j=0;j<n;j++) // FIR convolution
FIR += (int) numbuf[j]*Num[j];

vect[h] = FIR; // Output calculation

Go Up