I am moving from using DSPs to using micros, and trying to squeeze the most out of their capabilities.
Looking at the forum and documentation, I cannot find a clock cycle list for basic functions,
e.g. sqrt, sin, cos etc.
In most DSP applications, you can get a table.
E.g. a single atan on a TI67xx takes 167 clock cycles, but an array of atan with the correct lookup tables gets down to 3.5 clock cycles per result (two solutions every 7 cycles) + 26 cycles overhead (at a cost of 2k RAM).
Where can I get such a table for the Arduino?

3000+ clocks. No particular optimizations for multiple values, as far as I know.
The AVR has no hardware support for trig functions. Nor even floating point. Nor even fixed point Divide.
If you need a table of FP math function efficiencies, you probably need a different CPU.

Magician:
LUT tables stored in EEPROM, IMHO is elegant solution. 32 kB on UNO.
"Progmem" will do the job.

I think you meant flash... although, the 32kB are not real since the bootloader takes some of that and the user software is going to take another bit. But for a lookup table should be good enough.

Reading data from Flash lookup table is faster than reading data from EEPROM lookup table? Are you certain?

It would be pretty close, but PROGMEM is faster than reading the AVR's internal EEPROM. The EEPROM is treated as a peripheral, so you output the address to a couple of IO ports, twiddle some bits to start a read, and then find the result in another IO port. PROGMEM is actual memory (the PC has to access it anyway), so you just load the address into the Z registers and execute an LPM instruction (there's even an auto-increment mode.)
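
For reference, a flash lookup could look roughly like this. It is only a minimal sketch: the table name quarterSine and its four entries are made up for illustration, and the #else branch is just a shim so the same code also compiles off-target.

```cpp
#include <stdint.h>

#if defined(__AVR__)
#include <avr/pgmspace.h>   // PROGMEM, pgm_read_word
#else
// Off-target shim (assumption, for host builds only): plain memory reads.
#define PROGMEM
#define pgm_read_word(addr) (*(const uint16_t *)(addr))
#endif

// Hypothetical table: sin(0), sin(30), sin(60), sin(90) degrees, scaled by 1000.
const uint16_t quarterSine[4] PROGMEM = { 0, 500, 866, 1000 };

// Read one entry from flash. On AVR this goes through the Z register and LPM;
// a plain quarterSine[i] would read the data address space instead and
// return garbage.
uint16_t readQuarterSine(uint8_t i)
{
  return pgm_read_word(&quarterSine[i]);
}
```

readQuarterSine(1) gives 500, i.e. sin(30 degrees) * 1000.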

Both would be pretty fast compared to the existing SW trig functions.

(I suspect the person who thought EEPROM was too slow was thinking of external, serial, EEPROM.)

The way to write fast "signal processing" code on a device like Arduino is to try VERY HARD to stay AWAY from doing any floating point math at all. You're reading integers from input devices, and presumably writing integers to some output device, and converting to floating point in between is just a convenient crutch. (alas, VERY convenient. Or DSPs would never have added floating point.)
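
As an illustration of staying in integers, here is a sketch of a Q15 fixed-point multiply. The Q15 format and the name q15_mul are my choice for the example, not something from this thread; any other scaling works the same way.

```cpp
#include <stdint.h>

// Q15 fixed point: an int16_t value v stands for v / 32768.0,
// so 16384 is 0.5 and 8192 is 0.25.
// Multiply two Q15 numbers using integer operations only.
// Caveats: -1.0 * -1.0 (both inputs -32768) overflows and is not handled,
// and the shift of a negative value relies on arithmetic right shift,
// which avr-gcc and gcc provide.
int16_t q15_mul(int16_t a, int16_t b)
{
  int32_t p = (int32_t)a * (int32_t)b;   // 32-bit product, Q30
  return (int16_t)(p >> 15);             // scale back to Q15
}
```

The same function also scales a plain integer sample by a Q15 gain: q15_mul(1000, 16384) halves a reading of 1000 to 500 without any float in between.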

No need for a 360 degree table ==> you only need a table sinus[91] for 0…90; all other angles can be mirrored.

(code not tested)

float sinus[91]; // sin(0..90 degrees), must be filled before use -- can be an int*1000 as bubulindo stated too to make it faster, or even 0..100 so it fits in one byte, depending on the precision needed.
float _sin(int x) // float x allows interpolation; left as an exercise
{
  // handle negative values for x: sin(-x) = -sin(x)
  if (x < 0) return -_sin(-x);
  // handle values of 360 and above
  if (x >= 360) x %= 360; // the if avoids the expensive modulo when it is not needed
  switch (x / 90) // which quadrant?
  {
    case 0: return sinus[x];
    case 1: return sinus[180 - x];
    case 2: return -sinus[x - 180];
    default: return -sinus[360 - x]; // quadrant 3
  }
}
float _cos(int x)
{
  return _sin(x + 90); // cos(x) = sin(x + 90)
}
float _tan(int x)
{
  return _sin(x) / _cos(x); // _cos is 0 at 90 and 270, so this can return infinity there
}
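
The table above is declared but never filled. One way to do that, combined with the int*1000 variant mentioned in the comment, might look like the sketch below (also not run on a board). The names sinus1000, fillSinus1000 and isin1000 are made up for this example.

```cpp
#include <math.h>
#include <stdint.h>

int16_t sinus1000[91];   // sin(0..90 degrees) * 1000

// Fill the table once, e.g. from setup(). This uses the slow library sin(),
// but only 91 times at startup.
void fillSinus1000()
{
  for (int i = 0; i <= 90; i++) {
    sinus1000[i] = (int16_t)lround(1000.0 * sin(i * 3.14159265358979 / 180.0));
  }
}

// Integer sine: whole degrees in, sin * 1000 out, no float at lookup time.
int16_t isin1000(int x)
{
  if (x < 0) return -isin1000(-x);          // sin(-x) = -sin(x)
  if (x >= 360) x %= 360;                   // modulo only when needed
  if (x <= 90)  return sinus1000[x];        // first quadrant
  if (x <= 180) return sinus1000[180 - x];  // second quadrant, mirrored
  if (x <= 270) return -sinus1000[x - 180]; // third quadrant
  return -sinus1000[360 - x];               // fourth quadrant
}
```

After fillSinus1000(), isin1000(30) returns 500 and isin1000(270) returns -1000. The table costs 182 bytes of RAM; the values could instead be precomputed and placed in PROGMEM.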

I know... I was just showing how to set up one of such tables. You can also translate sin() into cos() with some arithmetic. But that wasn't the purpose.

I have a question for experts in the area.
Tweaking integer FFT code (UNO board, ATmega328, 16 MHz), I couldn't get results any better than 23 ms (128-point calculation). I'm using a sine LUT, with pgm_read_word to get a value, and I know that the read happens 2 times in the inner loop, which executes 127 times.
There are 254 readings overall.

When I skip reading the LUT altogether, and just multiply by a dummy constant instead of the sine:

wr = 5; //pgm_read_word(&Sinewave[j+N_WAVE/4]);
wi = 5; //-pgm_read_word(&Sinewave[j]);

the result is 9 ms. The question is:

isn't 16 ms / 254 = 63 usec too much for one pgm_read_word?
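
To put that figure into clock cycles, the arithmetic can be written as a small helper (a sketch; usToCycles is a made-up name, and 16 MHz is the UNO clock quoted above). 63 usec works out to about a thousand cycles, while the LPM instruction behind a flash read takes 3 cycles per byte.

```cpp
#include <stdint.h>

// Convert a duration in microseconds to CPU clock cycles.
// Assumes cpuHz is a whole number of MHz, as on a 16 MHz UNO.
uint32_t usToCycles(uint32_t us, uint32_t cpuHz)
{
  return us * (cpuHz / 1000000UL);
}
```

usToCycles(63, 16000000UL) returns 1008.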