Recently I put some work into generating sine waves intended for testing 24 bit audio hardware. Here's a quick message about this new work, in case anyone's interested....
Normally sine waves are generated on microcontrollers using a table lookup. That's perfect if the sine wave happens to be an exact division of the sample rate. But if you want to generate waveforms at any frequency, you end up needing points on the waveform that are "between" two entries in the table. The 2 common approaches are to simply use the nearest or prior table value, or to grab the nearest 2 values from the table and use linear interpolation.
But if you want a sine wave with extremely low distortion, where every sample is within +/- 1 LSB of an ideal sine wave at 16 or 20 or more bits of resolution, you'd need an extremely large table!
Sine can also be computed using a Taylor series approximation. The formula, where x is the angle in radians, is:
sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + (x^9)/9! - (x^11)/11! + ....
This series goes on forever, but each extra term makes the approximation rapidly converge to the true value. In doing quite a lot of testing, I discovered the C library function on Linux for sin() uses this approximation, to only the (x^7)/7! term. I also found a few sites talking about going to the (x^9)/9! term for "professional quality" audio.
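For reference, here is the series through the (x^11)/11! term written in plain double precision floating point, as a minimal sketch. The function name is made up for this example. The terms are nested (Horner form), so the big powers and factorials never appear directly: 6, 6*20, 6*20*42, 6*20*42*72 and 6*20*42*72*110 are 3!, 5!, 7!, 9! and 11!.

```c
#include <math.h>

/* sin(x) from the Taylor series, truncated after the x^11/11! term.
   Only meant for x between -pi/2 and +pi/2. */
double sin_taylor11(double x)
{
    double x2 = x * x;
    return x * (1.0 - x2 / 6.0
              * (1.0 - x2 / 20.0
              * (1.0 - x2 / 42.0
              * (1.0 - x2 / 72.0
              * (1.0 - x2 / 110.0)))));
}
```

Since the terms alternate in sign and shrink, the error from stopping here is no larger than the first omitted term, (x^13)/13!. At x = pi/2 that is roughly 5.7e-8, and it is far smaller for angles closer to zero.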
If you're still reading by this point, you're probably shaking your head, thinking this couldn't possibly be practical in a microcontroller. That's a complex equation with floating point numbers, and huge values in x^11 and 11!, since 11 factorial happens to be 39916800.
The code I'm sharing here implements this equation to the (x^11)/11! term using 32 bit integers, using only 12 multiply instructions, each of which executes in a single cycle on Cortex-M4. The add & subtract take zero CPU time, since those multiply instructions also come in flavors that do a multiply and accumulate, either positive or negative accumulate.
The Cortex-M4 multiplies perform a 32x32 to 64 bit multiply, and then discard the low 32 bits, with proper round off. That turns out to be exactly the right thing for managing the huge values of x raised to an increasing power, and the huge numbers of the factorials. Since those divisions are by constants, it's possible to multiply by the reciprocal to get the same effect.
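Here is a portable C model of that kind of multiply, plus a small example of dividing by a constant using its reciprocal. This is only a sketch for experimenting on a PC. The function names are made up, and on Cortex-M4 the instructions being modeled are presumably SMMULR (with SMMLAR and SMMLSR as the accumulate flavors).

```c
#include <stdint.h>

/* 32x32 to 64 bit signed multiply, round, keep only the high 32 bits.
   Assumes >> on a negative 64 bit value is an arithmetic shift, which
   is true for gcc and clang. */
static inline int32_t mul_hi_round(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)((product + 0x80000000LL) >> 32);
}

/* Divide by 6 (which is 3!) without a divide instruction.
   715827883 is 2^32 / 6, rounded to the nearest integer, so the
   high word of x * 715827883 is approximately x / 6. */
static inline int32_t divide_by_6_approx(int32_t x)
{
    return mul_hi_round(x, 715827883);
}
```

For example, treating 0x40000000 as 0.5 in a format where 0x80000000 would be 1.0, mul_hi_round(0x40000000, 0x40000000) gives 0x10000000. The result comes out with one less fractional bit than the inputs, which is why a few bit shifts are needed to keep the numerical ranges lined up.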
So, here is the optimized code:
On top of the 12 cycles for multiplies, there are a few bit shifts, and a quick conditional test which subtracts from a constant. That's necessary because this truncated Taylor series is only accurate if the angle is between -pi/2 and +pi/2. For the other half of the sine wave, that subtract maps back into the valid range, because the sine wave has symmetry.
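The symmetry trick can be sketched on its own like this. It assumes the 32 bit angle format where 0x40000000 is 90 degrees and 0x80000000 is 180 degrees, and the function name is made up for this example. It relies on sin(180 degrees - angle) being equal to sin(angle).

```c
#include <stdint.h>

/* Map any 32 bit angle to an equivalent signed angle between
   -90 and +90 degrees (-0x40000000 to +0x40000000) that has the
   same sine. */
int32_t fold_angle(uint32_t angle)
{
    if (angle >= 0x40000000u && angle < 0xC0000000u) {
        /* 90 to 270 degrees: reflect around 90 degrees. */
        angle = 0x80000000u - angle;
    }
    /* 270 to 360 degrees becomes -90 to 0 when viewed as a signed
       number (assumes the usual two's complement conversion). */
    return (int32_t)angle;
}
```

So 135 degrees (0x60000000) folds to +45 degrees, while 225 degrees (0xA0000000) and 315 degrees (0xE0000000) both fold to -45 degrees.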
This function takes a 32 bit angle, where 0 represents 0 degrees, and 0xFFFFFFFF is just before 360 degrees. So the input is perfect for a DDS phase accumulator. The output is a 32 bit signed integer, where 0x7FFFFFFF represents an amplitude of +1.0, and 0x80000001 represents -1.0.
This code will never return 0x80000000, so you don't need to worry about that case.
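In case the DDS part isn't familiar, here is a minimal sketch of a phase accumulator that produces angles in that 32 bit format, along with a helper to view the 32 bit output as a floating point number. The struct and function names are made up for this example, and the sine function itself is not included here.

```c
#include <stdint.h>

/* Hypothetical oscillator state: the current angle, and how far it
   advances on each audio sample. */
typedef struct {
    uint32_t phase;
    uint32_t increment;
} dds_t;

/* Phase step per sample: (frequency / sample rate) * 2^32, rounded.
   The frequency must be below the sample rate. */
uint32_t dds_phase_increment(double frequency_hz, double sample_rate_hz)
{
    return (uint32_t)(frequency_hz / sample_rate_hz * 4294967296.0 + 0.5);
}

/* Return the angle for this sample, then advance. The unsigned
   overflow wraps from just before 360 degrees back to 0 for free. */
uint32_t dds_next_phase(dds_t *osc)
{
    uint32_t angle = osc->phase;
    osc->phase += osc->increment;
    return angle;
}

/* View a sine output sample as -1.0 to +1.0, where 0x7FFFFFFF is +1.0. */
double sine_output_to_double(int32_t sample)
{
    return sample / 2147483647.0;
}
```

For example, 11025 Hz at a 44100 Hz sample rate gives an increment of exactly 0x40000000, so the angle steps through 0, 90, 180 and 270 degrees and then wraps back to 0.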
I did quite a lot of testing while working out these constants and the bit shifts for correct numerical ranges. I believe the top 25 bits are "perfect". Six of the low 7 bits are very close, but the approximation does diverge slightly as the magnitude of the angle approaches pi/2. The LSB is always zero, since the computation needs extra headroom to accommodate values representing up to ~1.57 (pi/2) before the later terms converge to the final accurate value.
For 8 bit AVR, this approach probably isn't practical. It probably isn't practical on Cortex-M0+ either, since there's no 32x32 multiply with 64 bit result. Cortex-M3 does have such a multiply, but not in the convenient version that rounds off and discards the low 32 bits. On Cortex-M4, this code runs very fast. In fact, when executing at 100 MHz or faster, it might even rival the table lookup, since non-sequential flash accesses (for the table) usually involve a few wait states for a cache miss. Then again, this code does have 6 integer constants, for the conversion to radians and the factorial coefficients... and depending on compiler flags and flash caching behavior, loading those 6 constants might be the slowest part of this algorithm?
I'm sure most people will still use table lookups, and maybe linear interpolation between 2 table entries. But I wanted to take a moment to share this anyway. Hope you find it interesting.