Offline
Newbie
Karma: 0
Posts: 24


« on: August 26, 2013, 10:42:58 am » 
Hi everybody, I would like to ask you a simple question. Does anybody can explain me why floating point double precision math are almost 2x faster than integer math on ArduinoDUE? I've implemented some digital signal processing on my arduino, like 5KS 16 bit sampling, quadrature demodulation, antialiasing filtering, downsampling, LP filtering, mean and standard deviation calculations. As I read on SAM3X8E manual, the uC has not a floating point unit, but it can emulate it. That's why many forums suggest to use integer math instead of float or double math, and that's the reason why the first implementation of the algorythm I did, used integer math. But yesterday I was asking my self if I could get more precision using double math, even with some more computation time, and after some code upgrade, I got the final result:  more precision
 half computation time
I'm wondered of the results I got and I just can not explain my self the reason why on an average of 11 algorythm executions, I get an average execution time of 714.18 [us] for integer math, against 355.28 [us] for double precision floating point math, i.e. 32bit native math VS 64bit emulated math. I'm asking you this question, because this is part of my thesis work and I'm not able to justify this behavior, this is not the one expected.. Thanx



Logged





Earth
Offline
Sr. Member
Karma: 14
Posts: 331
Arduino rocks


« Reply #1 on: August 26, 2013, 01:07:13 pm » 
I also am baffled. That result makes no sense at all. The only way to really answer the question is to have an example though. We'd need to see some code that exhibits this behavior. Off the top of my head I can think of only a couple of possibilities:
1. Have you tested for proper output from both versions? That is, can you verify that both produce the right result? 2. The algorithm used for integer math was terrible compared to the one used for floating point. It is possible to write bad code in any language and at any time. So, make sure your algorithms are efficiently constructed.
Equivalent versions of an algorithm with one coded as integer math and the other FP should show the integer version to be many times faster on a Due.



Logged





Global Moderator
Netherlands
Offline
Shannon Member
Karma: 224
Posts: 13915
In theory there is no difference between theory and practice, however in practice there are many...


« Reply #2 on: August 26, 2013, 01:27:40 pm » 
maybe floating point is done in hardware and integer math (partly) in software?
Please post sample code so it can be verified.



Logged





Offline
Newbie
Karma: 0
Posts: 24


« Reply #3 on: August 26, 2013, 04:17:45 pm » 
Hi guys, thank for your replies. The code is quite long, so I'll try to post an extract for the two version in two different posts. Before I post the code, I want to grant you that I get the same result in both ways. The algorythm has been first written in Matlab and then translated in C. To compare the results, I get the raw samples printed on Serial port, I copy and paste them in Matlab passing them through matlab algorythm. Except some normal fluctuations on the 4th decimal, the result it's the same.
Now,think about I've dinamically allocated some vectors using polyAlloc() function: rawSignal, polySignal, polyIQ and polyResult. I've also succesfully acquired 5000 16 bit samples and stored them inside rawSignal vector. Then I call computeOutput() function, that uses the filter() function.



Logged





Offline
Newbie
Karma: 0
Posts: 24


« Reply #4 on: August 26, 2013, 04:21:08 pm » 
This is the integer math version: #define nelems(x) sizeof(x)/sizeof(*x)
const int Fs = 5000; // Sample Rate const int D = 5; // Down sampling factor float T = 1.0; // Acquisition Time sec double meanValue = 0.0; double stdValue = 0.0;
int volatile sampleInd = 0; // Global Buffer Location int * volatile rawSignal; // Acquisition RAM Buffer int * polySignal[D]; // PolySignal branches int * polyIQ[2]; // Quadrature filtered, downsampled, demodulated signals unsigned int * polyResult; // Reconstructed, LP filtered signal
// IQ quadrature oscillations const float I[D] = {2.0, 0.61803398,1.61803398,1.61803398, 0.61803398};// Cosine const float Q[D] = {0.0, 1.90211303, 1.17557050,1.17557050,1.90211303};// Sine
// AntiAliasing filter coefficients :: FIR window Kaiser, Fs=5000, Fc=300 const float AAF[15] = {0.01786240,0.03388546,0.05089129,0.06751284,0.08232631,0.09400803,0.10148476,0.10405780,0.10148476,0.09400803,0.08232631,0.06751284,0.05089129,0.03388546,0.01786240}; float * polyAAF[D]; // Polyphase downsampled impulse responses
// LP filter coefficients :: FIR window Kaiser, Fs=1000, Fc=15 const float LP[21] = {0.04144742,0.04320807,0.04483017,0.04629891,0.04760078,0.04872372,0.04965724,0.05039261,0.05092289,0.05124307,0.05135014,0.05124307,0.05092289,0.05039261,0.04965724,0.04872372,0.04760078,0.04629891,0.04483017,0.04320807,0.04144742};
void computeOutput(){ timemicros = micros(); rawSignal[0] = rawSignal[5]; // First value adjust int offset = 0; // Offset for(int h=0;h<Fs*T;h++) offset+=rawSignal[h]; offset/=Fs*T; for (int i=0;i<Fs*T/D;i++) // Initialize arrays { polyIQ[0][i] = 0; polyIQ[1][i] = 0; } for (int j=0;j<D;j++) { // DownSampling for (int i=j;i<Fs*T;i+=5) polySignal[j][i/5] = rawSignal[i]offset; // Downsampling and offset removal for (int i=j;i<nelems(AAF);i+=5) polyAAF[j][i/5] = AAF[i];
// AntiAlias Filtering filter(polyAAF[j],nelems(AAF)/D,polySignal[j],(int) Fs*T/D); // Quadrature Demodulation for (int i=0;i<Fs*T/D;i++) { polyIQ[0][i] += I[j]*(polySignal[j][i]); polyIQ[1][i] += Q[j]*(polySignal[j][i]); } }
// Norm calculation :: sqrt(I^2+Q^2) for (int i=0;i<Fs*T/D;i++) polyResult[i] = (unsigned int) sqrt(polyIQ[0][i]*polyIQ[0][i] + polyIQ[1][i]*polyIQ[1][i]);
// LP filtering filter((float *) LP,nelems(LP),(int*) polyResult,(int) Fs*T/D);
// Elaborate stats statistics(); double convFactor = 2.0*4.0/65535.0; // 2 * FullScale / (2^161) meanValue *= convFactor; stdValue *= convFactor; printResults(); polyFreeAlloc(); timemicros=micros()timemicros; Serial.println(timemicros); }
bool polyAlloc() { if(NULL != rawSignal) { free(rawSignal); // Deallocate buffer memory free(polyResult); rawSignal = NULL; polyResult= NULL; } rawSignal = (int *)malloc(Fs*T*sizeof(int)); // Dynamic Memory Allocation polyResult = (unsigned int *)malloc(Fs*T/D*sizeof(int)); polyIQ[0] = (int *)malloc(Fs*T/D*sizeof(int)); polyIQ[1] = (int *)malloc(Fs*T/D*sizeof(int)); bool polys = true; for (int j=0;j<D;j++) { polySignal[j]= (int *)malloc(Fs*T/D*sizeof(int)); polyAAF[j] = (float *)malloc(nelems(AAF)/D*sizeof(float)); polys &= polySignal[j] != NULL && polyAAF[j] != NULL; } return (rawSignal[0] != NULL && polyIQ[0] != NULL && polyIQ[1] != NULL && polyResult != NULL && polys); }
void filter(float *Num, int n, int *vect, int vLen) { int average = 0; // Initial conditions calculation for(int h=0;h<vLen;h++) average+=vect[h]; average/=vLen; int numbuf[n]; // Initialize FIFO FIR buffer for(int h=0;h<n;h++) numbuf[h]=average;
for (int h=0;h<vLen;h++){ // Filtering cycle int FIR = 0;
for (int k=n1;k>0;k) // Cycle FIR buffer elements numbuf[k]=numbuf[k1]; numbuf[0] = vect[h]; // Push new element inside FIR buffer for (int j=0;j<n;j++) // FIR convolution FIR += (int) numbuf[j]*Num[j]; vect[h] = FIR; // Output calculation } }



Logged





Offline
Newbie
Karma: 0
Posts: 24


« Reply #5 on: August 26, 2013, 04:23:56 pm » 
And this is the double math version: #define nelems(x) sizeof(x)/sizeof(*x) const int Fs = 5000; // Sample Rate const int D = 5; // Down sampling factor float T = 1.0; // Acquisition Time sec double meanValue = 0.0; double stdValue = 0.0;
int volatile sampleInd = 0; // Global Buffer Location short int * volatile rawSignal; // Acquisition RAM Buffer double * polySignal[D]; // PolySignal branches double * polyIQ[2]; // Quadrature filtered, downsampled, demodulated signals double * polyResult; // Reconstructed, LP filtered signal
// IQ quadrature oscillations const double I[D] = { 2.000000000000000, // Cosine 0.618033988749895, 1.618033988749895, 1.618033988749895, 0.618033988749894}; const double Q[D] = {0.000000000000000, // Sine 1.902113032590307, 1.175570504584947, 1.175570504584946, 1.902113032590307};
// AntiAliasing filter coefficients :: FIR window Kaiser, Fs=5000, Fc=300 const double AAF[15] = {0.017862400522538, 0.033885458153087, 0.050891294195008, 0.067512846793389, 0.082326310000912, 0.094008030100344, 0.101484759026756, 0.104057802415933, 0.101484759026756, 0.094008030100344, 0.082326310000912, 0.067512846793389, 0.050891294195008, 0.033885458153087, 0.017862400522538}; double * polyAAF[D]; // Polyphase downsampled impulse responses
// LP filter coefficients :: FIR window Kaiser, Fs=1000, Fc=15 const double LP[21] = { 0.041447420224253, 0.043208077329578, 0.044830173046221, 0.046298913638535, 0.047600786034868, 0.048723722606653, 0.049657248993218, 0.050392612896925, 0.050922892035903, 0.051243079731859, 0.051350146923975, 0.051243079731859, 0.050922892035903, 0.050392612896925, 0.049657248993218, 0.048723722606653, 0.047600786034868, 0.046298913638535, 0.044830173046221, 0.043208077329578, 0.041447420224253};
bool volatile BUSY_STATE = false; // Internal state
unsigned long int timemicros;
void computeOutput(){ timemicros = micros(); rawSignal[0] = rawSignal[5]; // First value adjust short int offset = 0; // Offset for(int h=0;h<Fs*T;h++) offset+=rawSignal[h]; offset/=Fs*T; for (int i=0;i<Fs*T/D;i++) // Initialize arrays { polyIQ[0][i] = 0; polyIQ[1][i] = 0; } for (int j=0;j<D;j++) { // DownSampling for (int i=j;i<Fs*T;i+=5) polySignal[j][i/5] = (double)(rawSignal[i]offset); // Downsampling and offset removal for (int i=j;i<nelems(AAF);i+=5) polyAAF[j][i/5] = AAF[i];
// AntiAlias Filtering filter(polyAAF[j],nelems(AAF)/D,polySignal[j],(int) Fs*T/D); // Quadrature Demodulation for (int i=0;i<Fs*T/D;i++) { polyIQ[0][i] += I[j]*(polySignal[j][i]); polyIQ[1][i] += Q[j]*(polySignal[j][i]); } }
// Norm calculation :: sqrt(I^2+Q^2) for (int i=0;i<Fs*T/D;i++) polyResult[i] = (double) sqrt(polyIQ[0][i]*polyIQ[0][i] + polyIQ[1][i]*polyIQ[1][i]);
// LP filtering filter((double*) LP,nelems(LP),(double*) polyResult,(int) Fs*T/D);
// Elaborate stats statistics(); double convFactor = 2.0*4.0/65535.0; // 2 * FullScale / (2^161) meanValue *= convFactor; stdValue *= convFactor; printResults(); polyFreeAlloc(); timemicros=micros()timemicros; Serial.println(timemicros); }
bool polyAlloc() { if(NULL != rawSignal) { free(rawSignal); // Deallocate buffer memory free(polyResult); rawSignal = NULL; polyResult= NULL; } rawSignal = (short int *)malloc(Fs*T*sizeof(short int)); // Dynamic Memory Allocation polyResult = (double *)malloc(Fs*T/D*sizeof(double)); polyIQ[0] = (double *)malloc(Fs*T/D*sizeof(double)); polyIQ[1] = (double *)malloc(Fs*T/D*sizeof(double)); bool polys = true; for (int j=0;j<D;j++) { polySignal[j]= (double *)malloc(Fs*T/D*sizeof(double)); polyAAF[j] = (double *)malloc(nelems(AAF)/D*sizeof(double)); polys &= polySignal[j] != NULL && polyAAF[j] != NULL; } return (rawSignal[0] != NULL && polyIQ[0] != NULL && polyIQ[1] != NULL && polyResult != NULL && polys); }
void polyFreeAlloc() { for (int i=0;i<D;i++) { free(polySignal[i]); polySignal[i] = NULL; } free(polyIQ[0]); free(polyIQ[1]); polyIQ[0] = NULL; polyIQ[1] = NULL; }
void filter(double *Num, int n, double *vect, int vLen) { double average = 0.0; // Initial conditions calculation for(int h=0;h<vLen;h++) average+=vect[h]; average/=(double)vLen; double numbuf[n]; // Initialize FIFO FIR buffer for(int h=0;h<n;h++) numbuf[h]=average;
for (int h=0;h<vLen;h++){ // Filtering cycle double FIR = 0.0;
for (int k=n1;k>0;k) // Cycle FIR buffer elements numbuf[k]=numbuf[k1]; numbuf[0] = vect[h]; // Push new element inside FIR buffer for (int j=0;j<n;j++) // FIR convolution FIR += numbuf[j]*Num[j]; vect[h] = FIR; // Output calculation } }



Logged





Offline
Newbie
Karma: 1
Posts: 42


« Reply #6 on: August 26, 2013, 04:39:22 pm » 
I experienced once that there was a big time difference when dividing two int32 values compared to dividing two int64 values (don't remember the excact numbers, but was something like 1us to 40us). For all other operations it seemed to be like expected. Unless the division problem, in all my code examples int64 math was much quicker than float32 math and roughly two or three times slower than int32 math. So the problem could be performing divisions using data types larger than the atomic size which is 32bit for the DUE. The division of float32's however should be somehow an atomic operation. Any comments or tests on that? *update: you can also find where exactly the time difference is caused in your code by using stuff like: t1 = micros(); **code** t2 = micros(); Serial.println(t2t1); for every block of code.. I guess you will find an int64 division to be the crucial point. could you try?


« Last Edit: August 26, 2013, 04:55:11 pm by schwingkopf »

Logged





Offline
Newbie
Karma: 0
Posts: 24


« Reply #7 on: August 26, 2013, 04:41:27 pm » 
I forgot the statistics() function: void statistics() { // Mean value unsigned int meanSum = 0; for (int i=0;i<Fs*T/D;i++) meanSum += polyResult[i];
meanValue = (double)meanSum/(Fs*T/D); // Averaging reduces quantization noise by 1/sqrt(K) K=1000 => 30 dB => 5 LSBs // Standard deviation value double stdSum = 0; for (int i=0;i<Fs*T/D;i++) stdSum += ((double)polyResult[i]meanValue)*((double)polyResult[i]meanValue);
stdValue = sqrt(stdSum/(Fs*T/D1)); }



Logged





Offline
Newbie
Karma: 0
Posts: 24


« Reply #8 on: August 26, 2013, 04:51:12 pm » 
Hi schwingkopf, I've used almost no division in my code but only multiplications and sums, except some average calculation. In the integer math version there are only multiplication between int32 and float32, while in double version there are multiplication between double64..



Logged





0
Offline
God Member
Karma: 26
Posts: 623
Always making something...


« Reply #9 on: August 26, 2013, 05:11:35 pm » 
I have noticed on Teensy3 that 32 bit float support from newlib is particularly slow. Since Arduino Due is using the same C library, it's probably pretty similar. I have not yet investigated why.
One thing I do know: 32 bit division is extremely fast. It's handled in hardware using only 2 cycles.



Logged





Offline
Newbie
Karma: 0
Posts: 24


« Reply #10 on: August 26, 2013, 05:18:46 pm » 
Thanks for suggestions. Now it's quite late for me, so I'm going to sleep. Tomorrow I'll get the execution time for each step inside the code and I'll post the output. I don't know if it could be useful, but I'm using Atmel Studio Framework with VisualMicro plugin to compile and flash the code. Goodnight



Logged





