High precision sine wave synthesis

Recently I put some work into generating sine waves intended for testing 24 bit audio hardware. Here's a quick message about this new work, in case anyone's interested....

Normally sine waves are generated on microcontrollers using a table lookup. That's perfect if the sine wave happens to be an exact division of the sample rate. But if you want to generate waveforms at any frequency, you end up needing points on the waveform that are "between" two entries in the table. The 2 common approaches are to simply use the nearest or prior table value, or to grab the nearest 2 values from the table and use linear interpolation.
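
For reference, the interpolated lookup usually looks something like this rough sketch (the table size and names here are just for illustration):

#include <stdint.h>

// 257 entries, so sine_table[index + 1] never needs wrap-around logic
extern const int16_t sine_table[257]; // sine_table[i] = sin(2*pi*i/256) * 32767

int16_t sine_lookup(uint32_t ph)      // ph: 0 to 0xFFFFFFFF = 0 to 360 degrees
{
        uint32_t index = ph >> 24;          // top 8 bits select a table entry
        uint32_t frac = (ph >> 8) & 0xFFFF; // next 16 bits interpolate
        int32_t a = sine_table[index];
        int32_t b = sine_table[index + 1];
        return a + (((b - a) * (int32_t)frac) >> 16);
}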

But if you want a sine wave with extremely low distortion, where 16 or 20 or more bits are within +/- 1 from an ideal sine wave, you'd need an extremely large table!

Sine can be computed using a Taylor series approximation. The formula is (where x is the angle, in radians):

sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + (x^9)/9! - (x^11)/11! + ....

This series goes on forever, but each extra term makes the approximation converge rapidly to the true value. While doing quite a lot of testing, I discovered the C library sin() function on Linux uses this approximation, to only the (x^7)/7! term. I also found a few sites talking about going to the (x^9)/9! term for "professional quality" audio.

If you're still reading by this point, you're probably shaking your head, thinking this couldn't possibly be practical in a microcontroller. That's a complex equation with floating point numbers, and huge values in x^11 and 11!, since 11 factorial happens to be 39916800.

The code I'm sharing here implements this equation to the (x^11)/11! term using 32 bit integers, using only 12 multiply instructions, which execute in a single cycle each on Cortex-M4. The adds & subtracts take zero extra CPU time, since those multiply instructions also come in flavors that do a multiply and accumulate, with either positive or negative accumulate.

The Cortex-M4 multiplies perform a 32x32 to 64 bit multiply, and then discard the low 32 bits, with proper round off. That turns out to be exactly the right thing for managing the huge values of x raised to an increasing power, and the huge numbers of the factorials. Since those divisions are by constants, it's possible to multiply by the reciprocal to get the same effect.
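
In the code below, those operations show up as the multiply_32x32_rshift32_rounded() family of functions. They're thin inline-asm wrappers around the SMMULR, SMMLAR and SMMLSR instructions, roughly like this (following the Audio library's dspinst.h):

// (a * b + 0x80000000) >> 32, using SMMULR
static inline int32_t multiply_32x32_rshift32_rounded(int32_t a, int32_t b)
{
        int32_t out;
        asm volatile("smmulr %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
        return out;
}

// sum + ((a * b + 0x80000000) >> 32), using SMMLAR
static inline int32_t multiply_accumulate_32x32_rshift32_rounded(int32_t sum, int32_t a, int32_t b)
{
        int32_t out;
        asm volatile("smmlar %0, %2, %3, %1" : "=r" (out) : "r" (sum), "r" (a), "r" (b));
        return out;
}

// sum - ((a * b + 0x80000000) >> 32), using SMMLSR
static inline int32_t multiply_subtract_32x32_rshift32_rounded(int32_t sum, int32_t a, int32_t b)
{
        int32_t out;
        asm volatile("smmlsr %0, %2, %3, %1" : "=r" (out) : "r" (sum), "r" (a), "r" (b));
        return out;
}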

So, here's the optimized code:
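
// High accuracy 11th order Taylor Series Approximation
// input is 0 to 0xFFFFFFFF, representing 0 to 360 degree phase
// output is 32 bit signed integer, top 25 bits should be very good
static int32_t taylor(uint32_t ph)
{
        int32_t angle, sum, p1, p2, p3, p5, p7, p9, p11;

        if (ph >= 0xC0000000 || ph < 0x40000000) {
                angle = (int32_t)ph; // valid from -90 to +90 degrees
        } else {
                angle = (int32_t)(0x80000000u - ph);
        }
        p1 =  multiply_32x32_rshift32_rounded(angle << 1, 1686629713);
        p2 =  multiply_32x32_rshift32_rounded(p1, p1) << 3;
        p3 =  multiply_32x32_rshift32_rounded(p2, p1) << 3;
        sum = multiply_subtract_32x32_rshift32_rounded(p1 << 1, p3, 1431655765);
        p5 =  multiply_32x32_rshift32_rounded(p3, p2) << 1;
        sum = multiply_accumulate_32x32_rshift32_rounded(sum, p5, 286331153);
        p7 =  multiply_32x32_rshift32_rounded(p5, p2);
        sum = multiply_subtract_32x32_rshift32_rounded(sum, p7, 54539267);
        p9 =  multiply_32x32_rshift32_rounded(p7, p2);
        sum = multiply_accumulate_32x32_rshift32_rounded(sum, p9, 6059919);
        p11 = multiply_32x32_rshift32_rounded(p9, p2);
        sum = multiply_subtract_32x32_rshift32_rounded(sum, p11, 440721);
        return sum <<= 1;
}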

On top of the 12 cycles for multiplies, there are a few bit shifts, and a quick conditional test which subtracts from a constant. That's necessary because the Taylor series approximation applies only if the angle is between -pi/2 and +pi/2. For the other half of the sine wave, that subtract maps the angle back into the valid range, because the sine wave has symmetry.

This function takes a 32 bit angle, where 0 represents 0 degrees, and 0xFFFFFFFF is just before 360 degrees. So the input is perfect for a DDS phase accumulator. The output is a 32 bit signed integer, where 0x7FFFFFFF represents an amplitude of +1.0, and 0x80000001 represents -1.0.

This code will never return 0x80000000, so you don't need to worry about that case.
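
So, for example, a DDS update loop could look something like this (an illustrative sketch, not part of any library; 440 Hz at a 44.1 kHz sample rate):

void fill_block(int32_t *buffer)
{
        static uint32_t phase = 0;
        const uint32_t increment = (uint32_t)(440.0 * 4294967296.0 / 44100.0 + 0.5); // 440 Hz at 44.1 kHz

        for (int i = 0; i < 128; i++) {
                buffer[i] = taylor(phase); // full scale 32 bit signed sine
                phase += increment;        // the 2^32 wrap is exactly 360 degrees
        }
}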

I did quite a lot of testing while working out these constants and the bit shifts for correct numerical ranges. I believe the top 25 bits are "perfect". Six of the low 7 bits are very close, but the approximation does diverge slightly as the angle approaches pi/2 magnitude. The LSB is always zero, since the computation needs to have extra overhead range to accommodate values representing up to ~1.57 (pi/2) before the latter terms converge to the final accurate value.

For 8 bit AVR, this approach probably isn't practical. It probably isn't practical on Cortex-M0+ either, since there's no 32x32 multiply with 64 bit result. Cortex-M3 does have such a multiply, but not in the convenient version that rounds off and discards the low 32 bits. On Cortex-M4, this code runs very fast. In fact, when executing at 100 MHz or faster, it might even rival the table lookup, since non-sequential flash accesses (for the table) usually involve a few wait states for a cache miss. Then again, this code does have 6 integer constants, for the conversion to radians and the factorial coefficients... and depending on compiler flags and flash caching behavior, loading those 6 constants might be the slowest part of this algorithm?

I'm sure most people will still use table lookups, and maybe linear interpolation between 2 table entries. But I wanted to take a moment to share this anyway. Hope you find it interesting.

Very interesting,

What I do not understand is why you used the extra shifts in some places and not in others.
Is that to keep the constants in 32 bits?


sintaylor(0..PI/2) is 30% slower than sin() on an UNO (160 vs 120 microseconds).
One needs to go to the (x^11)/11! term to keep the accuracy OK.

double sintaylor(const double angle)
{
  //  sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + (x^9)/9! - (x^11)/11! + ....
  //  Each term is the previous one times a2 / ((2k)*(2k+1)),
  //  hence the divisors 2*3=6, 4*5=20, 6*7=42, 8*9=72, 10*11=110.

  double a2 = angle * angle;
  double x = angle;

  double rv = x;            // first term: x

  x = x * (1.0 / 6) * a2;   // x^3/3!
  rv -= x;

  x = x * (1.0 / 20) * a2;  // x^5/5!
  rv += x;

  x = x * (1.0 / 42) * a2;  // x^7/7!
  rv -= x;

  x = x * (1.0 / 72) * a2;  // x^9/9!
  rv += x;

  x = x * (1.0 / 110) * a2; // x^11/11!
  rv -= x;

  return rv;
}

I think you could reuse some vars in your code to minimize the footprint.

I could only squeeze one shift out, around p1. (I cannot verify it.)

	p1 =  multiply_32x32_rshift32_rounded(angle << 2, 1686629713);
	p2 =  multiply_32x32_rshift32_rounded(p1, p1) << 1;
	p3 =  multiply_32x32_rshift32_rounded(p1, p2) << 2;
	sum = multiply_subtract_32x32_rshift32_rounded(p1, p3, 1431655765);  // here

As there is less shifting, might it affect precision?

The number of bit shifts isn't a problem because the Cortex-M4 has a barrel shifter. A shift of any number of bits is completed in one clock cycle (LSLS instruction).

Pete

el_supremo:
The number of bit shifts isn't a problem because the Cortex-M4 has a barrel shifter. A shift of any number of bits is completed in one clock cycle (LSLS instruction).

Pete

Thanks, I expected that ... so 1 shift less is just 1 tick less per call.

Still, the question of precision stands?

so 1 shift less is just 1 tick less per call.

I don't think there's a "tick" involved. A shift by a different number of bits still executes in the same single clock cycle.

Still, the question of precision stands?

Yeah. I'm slowly working at figuring out what the constants are and how the precision is maintained (and not lost).

Pete

robtillaart:
Still, the question of precision stands?

Yes, I believe you have a good point.

But the 2 bit shift can't be done to the angle before the first multiply, because it's a 31 bit number at that point. Doing it afterwards would put p1 into 2.30 format instead of 3.29. That looks like it would save a shift before the instruction which computes x - x^3/3!, and you should still get the same p2 & p3 terms for the rest of the computation just by adjusting the number of shifts.

I think that probably works out to give the same result, with 1 shift operation saved. :slight_smile:

It does work, with 1 less shift operation!! I verified the p2 & p3 terms retain precision. I also did a quick check on real hardware, though admittedly this was only the top 16 bits fed to a SGTL5000 codec (not going to win any audiophile awards) and viewed on an 8 bit oscilloscope with at best about 12 bits in its high-res acquisition mode.

Here's the latest code. I added comments with the fixed point format at each step, which hopefully might make reading the code and understanding the shifts a little easier.

// High accuracy 11th order Taylor Series Approximation
// input is 0 to 0xFFFFFFFF, representing 0 to 360 degree phase
// output is 32 bit signed integer, top 25 bits should be very good
static int32_t taylor(uint32_t ph)
{
        int32_t angle, sum, p1, p2, p3, p5, p7, p9, p11;

        if (ph >= 0xC0000000 || ph < 0x40000000) {                            // ph:  0.32
                angle = (int32_t)ph; // valid from -90 to +90 degrees
        } else {
                angle = (int32_t)(0x80000000u - ph);                        // angle: 2.30
        }
        p1 =  multiply_32x32_rshift32_rounded(angle, 1686629713) << 2;        // p1:  2.30
        p2 =  multiply_32x32_rshift32_rounded(p1, p1) << 1;                   // p2:  3.29
        p3 =  multiply_32x32_rshift32_rounded(p2, p1) << 2;                   // p3:  3.29
        sum = multiply_subtract_32x32_rshift32_rounded(p1, p3, 1431655765);   // sum: 2.30
        p5 =  multiply_32x32_rshift32_rounded(p3, p2) << 1;                   // p5:  5.27
        sum = multiply_accumulate_32x32_rshift32_rounded(sum, p5, 286331153);
        p7 =  multiply_32x32_rshift32_rounded(p5, p2);                        // p7:  8.24
        sum = multiply_subtract_32x32_rshift32_rounded(sum, p7, 54539267);
        p9 =  multiply_32x32_rshift32_rounded(p7, p2);                        // p9: 11.21
        sum = multiply_accumulate_32x32_rshift32_rounded(sum, p9, 6059919);
        p11 = multiply_32x32_rshift32_rounded(p9, p2);                       // p11: 14.18
        sum = multiply_subtract_32x32_rshift32_rounded(sum, p11, 440721);
        return sum <<= 1;                                                 // return:  1.31
}

Turns out the p5 term can also avoid the extra shift, just by adjusting the remaining coefficients. :slight_smile:

// High accuracy 11th order Taylor Series Approximation
// input is 0 to 0xFFFFFFFF, representing 0 to 360 degree phase
// output is 32 bit signed integer, top 25 bits should be very good
static int32_t taylor(uint32_t ph)
{
        int32_t angle, sum, p1, p2, p3, p5, p7, p9, p11;

        if (ph >= 0xC0000000 || ph < 0x40000000) {                            // ph:  0.32
                angle = (int32_t)ph; // valid from -90 to +90 degrees
        } else {
                angle = (int32_t)(0x80000000u - ph);                        // angle: 2.30
        }
        p1 =  multiply_32x32_rshift32_rounded(angle, 1686629713) << 2;        // p1:  2.30
        p2 =  multiply_32x32_rshift32_rounded(p1, p1) << 1;                   // p2:  3.29
        p3 =  multiply_32x32_rshift32_rounded(p2, p1) << 2;                   // p3:  3.29
        sum = multiply_subtract_32x32_rshift32_rounded(p1, p3, 1431655765);   // sum: 2.30
        p5 =  multiply_32x32_rshift32_rounded(p3, p2);                        // p5:  6.26
        sum = multiply_accumulate_32x32_rshift32_rounded(sum, p5, 572662306);
        p7 =  multiply_32x32_rshift32_rounded(p5, p2);                        // p7:  9.23
        sum = multiply_subtract_32x32_rshift32_rounded(sum, p7, 109078534);
        p9 =  multiply_32x32_rshift32_rounded(p7, p2);                        // p9: 12.20
        sum = multiply_accumulate_32x32_rshift32_rounded(sum, p9, 12119837);
        p11 = multiply_32x32_rshift32_rounded(p9, p2);                       // p11: 15.17
        sum = multiply_subtract_32x32_rshift32_rounded(sum, p11, 881443);
        return sum <<= 1;                                                 // return:  1.31
}

Paul, would there be any benefit to using the SMLAL instruction to do a signed 32 bit product and accumulate in 64 bits with one adjustment at the end to extract the correct bits? The lack of multiply-subtract can be handled by making the appropriate constants negative. If you can arrange the products so that they all have the same Q format, would SMLAL allow retention of even more precision?

My brain hurts. :slight_smile:

Pete

Nice work Paul,

you can possibly remove another shift at the end of the algorithm.

sum = multiply_subtract_32x32_rshift32_rounded(sum, p11, 881443);
return sum <<= 1; // return: 1.31

==>

sum = multiply_subtract_32x32_rshift32_rounded(sum, p11, 1762886); // or 1762885 / ...87
return sum;

I noticed that return sum <<= 1; would always return an even number.

"I noticed that return sum <<= 1; would always return an even number."
Yes - that is multiplying by 2, so it would make an even number.

CrossRoads:
"I noticed that return sum <<= 1; would always return an even number."
Yes - that is multiplying by 2, so it would make an even number.

True :slight_smile:
What I wanted to say is that the last bit is not significant any more after that shift, and with the variation I proposed, that last bit could become significant if the changed constant in the line above were odd.

That constant is one of { 1762885, 1762886, 1762887 }.

I am seeing a problem here. You are doing some great work getting accuracy in the sine wave amplitude but will the inaccuracy in the timing not be the limiting factor in measuring the purity of the sine wave?

Suppose you are getting timing accuracy of 1 part per million. If you measure a sine wave for 1 second, the inaccuracy in timing at the end of the duration will be 1 microsecond. If you are measuring a 1 kHz sine wave, this will translate to an error at the zero crossing of sin(2*pi * 1000 Hz * 1 sec * 10^-6) = 0.006283.

Put another way, the widening of the spectral line at 1 kHz will be dominated by the jitter of your timing base. One way that jitter is quantified is by specifying the width of the fundamental spectral line.

el_supremo:
Paul, would there be any benefit to using the SMLAL instruction to do a signed 32 bit product and accumulate in 64 bits with one adjustment at the end to extract the correct bits?

I believe you'd have to go that route, or something like it, if you wanted to extend this to 13th, 15th, 17th, 19th order to end up with more than 25 accurate bits. I haven't explored that, but from my experience with other DSP algorithms, even though the 64+32x32 instruction is single cycle, a good number of extra instructions end up being used to manage the 64 bit numbers. It's pretty similar to going from all 8 bit to 16 bit on AVR.
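
Just to sketch the shape that would take inside taylor() (untested, only to show the idea; Pete's negative-constant trick replaces the subtracts, and the compiler emits SMLAL for each 64 bit accumulate):

	// in place of the multiply_accumulate/subtract steps:
	int64_t acc = (int64_t)p1 << 32;     // the x term, 2.30 in the high word
	acc += (int64_t)p3 * -1431655765;    // - x^3/3!  (one SMLAL)
	acc += (int64_t)p5 * 572662306;      // + x^5/5!  (one SMLAL)
	// ... p7, p9, p11 with alternating signs ...
	int32_t sum = (int32_t)(acc >> 32);  // low bits retained until this one extraction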

robtillaart:
what I wanted to say, is the last bit would not be significant anymore

Truth is, the bottom 5-6 bits aren't significant before the final 1 bit shift. That's partly due to a few bits of round-off error in the products (p1, p2, p3, etc), but mostly due to using the Taylor series up to only its 11th order.

To get all 32 bits significant and accurate, you'd need to use at least a few more terms of the Taylor series, and you'd have to carry more than 32 bit results throughout the computation. The final 1 bit shift pales in comparison.

by that shift and by the variation I proposed that last bit could become significant if the changed constant in the line above would be odd.

That constant is one in the range { 1762885 1762886 1762887 }

There's a problem with this. The sum is still 2.30 format. If you adjust the constant so the final multiply which computes the (x^11/11!) term ends up in 1.31 format, you'll get a totally wrong result when subtracting from the 2.30 sum.

So far, I've worked with the assumption 2.30 format is required, because the first term of the Taylor series is just the angle in radians, which spans a range of -1.571 to +1.571 (pi/2). Maybe there's some crafty way to rearrange things to avoid this?

charliesixpack:
but will the inaccuracy in the timing not be the limiting factor in measuring the purity of the sine wave?

That's a good point. I have been considering only amplitude!

The phase accumulator is 32 bits. But that in itself doesn't mean the phase error is 1/2^32. It scales with the ratio of frequency to sample rate.

Suppose you're generating 44.1 kHz sample rate data for a 440 Hz sine wave (note A4). That's not a perfect integer multiple like 1 kHz. Ideally, you'd increment the 32 bit phase by 42852281.41 for each sample. Truncating or rounding off the 0.41 does indeed lead to phase error. But it's on approximately the same scale as the errors in amplitude.

But this does scale with frequency. If you have a high sample rate and you try to generate a very low frequency, indeed the phase increment becomes a fairly small number, relative to the 25 significant amplitude bits! :frowning:

Hmmm... maybe more work needed.....

Suppose you're generating 44.1 kHz sample rate data for a 440 Hz sine wave (note A4). That's not a perfect integer multiple like 1 kHz. Ideally, you'd increment the 32 bit phase by 42852281.41 for each sample. Truncating or rounding off the 0.41 does indeed lead to phase error. But it's on approximately the same scale as the errors in amplitude.

You can use Bresenham (line/circle/ellipse) techniques to sum the errors and do corrections once in a while, keeping it all integer.

I used that technique for timing too: Precision timing issues - Project Guidance - Arduino Forum
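
For the 440 Hz / 44.1 kHz example that could look like this (untested sketch, names just for illustration):

// ideal increment = 2^32 * 440 / 44100 = 42852281 remainder 18140/44100
static uint32_t phase = 0;
static uint32_t err = 0;

int32_t next_sample(void)
{
        int32_t s = taylor(phase);
        phase += 42852281;   // integer part of the increment
        err += 18140;        // accumulate the fractional part
        if (err >= 44100) {  // carry: add the deferred phase count
                err -= 44100;
                phase += 1;
        }
        return s;
}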

You are considering the theoretical accuracy of producing a sine wave over one cycle. I am asking what difference does it make if you will never be able to measure that accuracy. Even if you measure the accuracy using the same time base, the jitter will be different when measured at a different time. As I previously noted, the phase for the cycle 1 second removed from the start gives a huge error especially for higher frequencies. A sine wave with zero amplitude error produced with a time base with jitter will have a large amplitude error when measured with a perfect time base.

What measurement will be used to verify your accuracy? There are those that think that the only benefit of 24 bits over 16 bits is greater dynamic range.

charliesixpack:
You are considering the theoretical accuracy of producing a sine wave over one cycle.

Well yeah, the function takes a one-cycle phase input and gives an amplitude output.

Hopefully it's easy to see how that generates a continuous waveform when called from the DDS code with a phase accumulator.

I am asking what difference does it make if you will never be able to measure that accuracy.

Suppose you build an audio shield with a 24 bit DAC. Or suppose someone else built one, and you wanted to try testing it. If you check out that project, you can see he's done some testing with sine waves, looping the DAC output back to the ADC input.

So part of the motivation here is to be able to create arbitrary sine waves that are fully 24 bit accurate, at least in the pure digital domain. When passed through real hardware that will add amplitude errors and clock jitter, the idea here is to at least be able to believe the data didn't have such problems before hardware tried to turn it into analog signals.

There's also a purely academic interest here, at least on my part. Maybe you'd scoff at that notion, but I think it's interesting to dig into these algorithms and learn more, even if the accuracy is beyond even the very best hardware today's technology can produce. I also want to share this, without knowing if anyone else will really ever use it for anything practical. Questions regularly come up on these forums about generating sine waves, and the answer is almost always fairly low res table lookup, so maybe others will find this interesting, even if they don't have a practical way to measure anything on this level in the analog domain.

What measurement will be used to verify your accuracy? There are those that think that the only benefit of 24 bits over 16 bits is greater dynamic range.

Honestly, I'm not sure. I did build up one of William's 24 bit boards, and I have extra parts to build a second one. I am curious to try connecting one board's output to the input of another, rather than simple analog loopback where the same chip with the same crystal is doing both DAC and ADC.

Beyond that, truth is I don't have any solid plans for more rigorous testing. I've toyed with the idea of maybe renting high-end gear for a month from a place like TestEquity. Because this relates to Teensy, maybe PJRC could even write off that expense... I can probably talk Robin into renting something for 3-digit dollars, but buying something at 4-5 digits ain't gonna happen! But that day is a long way off, if ever. At this moment, I don't even believe the digital data I'm generating is necessarily as good as the Cirrus CS4272 chip claims to be!

Even if this never gets used for any practical purpose, at least I think it's interesting, and I believe I've maybe even learned a thing or two along the way......

I did not mean to imply that there is not value in the exercise and I am sorry if I came across that way. No scoffing here. I was just wondering where it was leading.

As far as generating sine waves goes, I did some investigating using an unstable IIR filter to generate a sine wave. Since I used floating point math, the result was not as efficient as table lookup, but it worked pretty well nonetheless. I did not have the ambition to pursue the possibility of using integer math.
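
The recurrence is the standard two-pole resonator, y[n] = 2*cos(w)*y[n-1] - y[n-2], with the poles sitting right on the unit circle. A minimal sketch of the idea (not my exact code):

#include <math.h>

// Marginally stable 2-pole IIR oscillator: y[n] = 2cos(w)*y[n-1] - y[n-2]
// produces sin(n*w) when seeded with the two previous samples. In floating
// point the amplitude slowly drifts, so long runs need renormalization.
static double s1, s2, k;

void osc_init(double freq, double sample_rate)
{
        double w = 2.0 * M_PI * freq / sample_rate;
        k  = 2.0 * cos(w);
        s1 = sin(-w);       // y[-1]
        s2 = sin(-2.0 * w); // y[-2]
}

double osc_next(void)
{
        double y = k * s1 - s2; // one multiply, one subtract per sample
        s2 = s1;
        s1 = y;
        return y;
}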