Go Down

### Topic: clock cycles for math functions (Read 10824 times)previous topic - next topic

#### bubulindo

#15
##### Jun 22, 2011, 12:53 pm
Say you need to know the sin(X), and you'll have a one degree (for the sake of the example) precision...
Code: [Select]

int sin[360];

sin[0] = 0;
sin[1] = 17; //multiplied by 1000
sin[2] = 35;
...

sin[90] = 1000;

//and so on, and so on...

This... is a hobby.

#### robtillaart

#16
##### Jun 22, 2011, 02:13 pm
no need for a 360 degree table ==> you need a table sinus[91]   0..90 all others can be mirrored.

(code not tested)
Code: [Select]

float sinus[91];   //0..90 -- can be an int*1000 as bubulindo stated too to make it faster or even  0..100 then it will fit in one byte, depends on the precission needed.

float _sin(int x)  // float x allows interpolation; left as an exercise
{
// handle negative values for x
if (x < 0) return -sin(-x);

// handle values above 360
if (x >=360) x %= 360;  // if prevents the expensive modulo if not needed

{
case 0: return sinus[x]; break;
case 1: return sinus[180-x]; break;
case 2: return -1 * sinus[x-180]; break;
case 3: return -1 * sinus[360-x]; break;
}
}

float _cos(int x)
{
return sin(x + 90);
}

float _tan(int x)
{
return sin(x)/cos(x);  // may return NaN  not a number
}
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

#### bubulindo

#17
##### Jun 22, 2011, 02:22 pm
I know... I was just showing how to set up one of such tables. You can also translate sin() into cos() with some arithmetic. But that wasn't the purpose.
This... is a hobby.

#### robtillaart

#18
##### Jun 22, 2011, 03:00 pm
sometimes I just lose myself in coding
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

#### Magician

#19
##### Jun 23, 2011, 04:41 am
I have a question for Experts in the area
Tweaking with integer FFT code (Board UNO, ATMega328, 16 MHz), I couldn't get results any better than 23 ms ( 128 points calculus ).  I'm using sine LUT, with pgm_read_word to get a value, and I know that reading happened 2 times in iner loop, which executed 127 times.

When I shoot down reading LUT at all, and just multiply with dummy constant instead of sine:
Code: [Select]

result show 9 ms. The question is :
- isn't it too much  16 ms / 254 = 63 usec per one pgm_read_word?

#20
##### Jun 23, 2011, 06:36 am

Your comments remove more code than just the pgm_read_word.  Each piece can potentially generate code depending on the outer context.  In other words, each of these is a potential code generator...

&Sinewave
N_WAVE/4
j+
[j] /* can also include a multiplication */

In addition, if the outer code is "complex", various register values may have to be shuffled around (push / pop / move).

Quote
result show 9 ms. The question is : - isn't it too much  16 ms / 254 = 63 usec per one pgm_read_word?

= 1008 clock cycles.  Does seem a bit high.

In any case, that may be a good spot for some hand optimization.

#### robtillaart

#21
##### Jun 23, 2011, 09:16 am
Is it possible to put the lookup table in RAM iso progmem and measure the timing.

Is your Sinewave table of the type float or int ?   // keep the math completely int or long if possible iso floats

How much entries are in the table?

j+N_WAVE/4
you could initialize a index k = j + NWAVE/4; and let it go with the flow:  e.g.  for (j=0; k = N_WAVE/k; j < 128; j++, k++).
Will not change very much but many drops make an ocean

Please post some code / at least the loop
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

#### westfw

#22
##### Jun 23, 2011, 09:42 am
pgmspace reads are actually quite fast (the instructions are read from pgmspace, so it has to be fast!)

Try this example (carefully checked to make sure the compiler doesn't optimize away anything that shouldn't be optimized away, and organized to make the disassembly somewhat reasonable...)
Code: [Select]
#include <avr/pgmspace.h>

void setup()
{
Serial.begin(19200);
}
volatile long a, b;

unsigned long *p = (unsigned long *)1000;

void loop()
{
unsigned long i, ramstart, ramend, pgmstart, pgmend;
b = 12354;

ramstart = millis();
for (i=0; i < 100000; i++) {
a = b;
}
ramend = millis();

pgmstart = millis();
for (i=0; i < 100000; i++) {
a = *p++;
}
pgmend = millis();

Serial.print("RAM Variable time: ");
Serial.println(ramend - ramstart);
Serial.print("pgmmem Variable time: ");
Serial.println(pgmend - pgmstart);

;
Serial.println();
Serial.println();
}

#### Magician

#23
##### Jun 23, 2011, 01:05 pmLast Edit: Aug 20, 2011, 01:21 am by Magician Reason: 1
Results for ram/pgm test
Quote
RAM Variable time: 182
pgmmem Variable time: 182
RAM Variable time: 182
pgmmem Variable time: 182

#### westfw

#24
##### Jun 23, 2011, 07:39 pm
I'm not familiar with FFT coding at all, but I rewrote your code slightly and gave it a try here, on an otherwise not-connected-to-anything Ardunio.

New code in the FFT loop:
Code: [Select]
while (l < fft_n) {
istep = l << 1;
for (m=0; m<l; ++m) {
j = m << k;
/* 0 <= j < N_WAVE/2 */
smart2++;

//test2 - NO_PGM
#define TEST2
#ifdef TEST2
#else
#endif
//  wr >>= 1;
//  wi >>= 1;

for (i=m; i<fft_n; i+=istep) { //istep = 2, 4, 8, 16 ......
j = i + l;

//test 3 - No multiplication.
#ifdef TEST3
tr = smart2; //FIX_MPY(wr,fr[j]) - FIX_MPY(wi,fi[j]);
ti = smart2; //FIX_MPY(wr,fi[j]) + FIX_MPY(wi,fr[j]);'
#else
tr = FIX_MPY(wr,fr[j]) - FIX_MPY(wi,fi[j]);
ti = FIX_MPY(wr,fi[j]) + FIX_MPY(wi,fr[j]);
#endif
qr = fr[i];
qi = fi[i];

Is that what you meant to have happen?  I'm not getting quite the same results that you reported (perhaps because I don't have a signal source connected?)

no tests:    Time FFT calc.: 23696    Time SQRT calc.: 8040
test2:    Time FFT calc.: 13044    Time SQRT calc.: 7988

I suspect that rather than pgm memory being slow, what's happening is that the compiler is being especially clever in the test case, and managing to completely remove one of your inner loops (or similar.)  I'm willing to disassemble and check, if you confirm that I'm on the right track so far.

#### robtillaart

#25
##### Jun 23, 2011, 07:58 pm
Quote
There are just 192 entries in table for now, would be more, so I couldn't move table in RAM - memory space is in high demand

Folding as I did above - makes it shorter, and you can leave out part of the code as your input will only be 0..360 if i'm right

The code that's left is shorter than writing out the complete table. As you use bytes the table costs 91 bytes of RAM.

Code: [Select]

int _sin(int x)
{
{
case 0: return sinus[x]; break;
case 1: return sinus[180-x]; break;
case 2: return -1 * sinus[x-180]; break;
case 3: return -1 * sinus[360-x]; break;
}
}
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

#### Magician

#26
##### Jun 23, 2011, 08:47 pm
To westfw:  yes, you are on a right track. I forget to mention test #4: when I cut off pgm_, and
multiplication I did one more move: change wi and wr to char (it was int when I get 9 ms),
but I didn't change pgm_read_word to pgm_read_byte, as it still mark out in my test.
Probably, this is why it 13 ms in your results.
To robtillaart:  table include 0 .. 270 in 192 points, algorithm works only if quantity of elements in the sine table > than fx[]. As I'm gonna to extend fx[] to 1024, (removing x[] - debug intermediate) I'd met really tough memory conditions.
I 've been looking on i-net, and find a book "math toolkit for real time programming", they folded
sine wave even more that 4x, up to 16x (0.. 22.5 degree)

#### westfw

#27
##### Jun 23, 2011, 10:56 pm
first thing noticed in analysis:
Without the pgm_read_word code, the resulting fix_fft() function ends up only containing two calls to the __mulsi3 function (internal multiply helper.)

#### robtillaart

#28
##### Jun 24, 2011, 11:18 am
Quote
I 've been looking on i-net, and find a book "math toolkit for real time programming", they folded
sine wave even more that 4x, up to 16x (0.. 22.5 degree)

yes I knew that can be done but the profit of the folding should make up the extra code for folding, that's why I never exceeded 4 times folding. 8x folding or more would need the same amount of bytes if you want to have an accuracy of 1 degree: Furthermore sin ( 2A ) = 2 sin A cos A => Two lookups. For 16x folding you need even more AFAIK.

But if the accuracy needed is less, very true
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

#### Magician

#29
##### Jun 25, 2011, 07:06 pm
The more weird results on testing PROGMEM:
1) If I declare array with : const prog_int8_t Sinewave[N_WAVE-N_WAVE/4] PROGMEM = {