Optimizing additive synthesis /phase accumulators

Hey guys,

So... my latest project is a replica movie prop. It's got servos, touch sensors, a shift register, and a dac running on hardware SPI.

You can read more about it here:

Anyway, the sound effect it plays is generated via additive synthesis. Using a lot of fixed point math, a sine table, and 16 bit phase accumulators, I add two to six sine waves together, and modulate their pitch and volume with two more, in addition to modulation I perform using the current speed setting of the prop. In total, I have ten phase accumulators which re being updated, and I do the frequency modulation by adjusting the phase accumulator step values directly... which I can do because it turns out there is a 1:1 mapping of step value to frequency, calculated like so: (frequency/samplerate)*65536. The samples generated in this manner are then fed into a 1024 byte buffer which my DAC interrupt increments an index into as it pulls data from it.

The problem is, my code isn't fast enough. My DAC needs to be updated 31,250 times a second so that I can generate high pitched chirping noises, but my testing quickly revealed that the update function is lagging behind quite a bit. How much exactly, I'm not sure, but when I tried to display how many samples behind the function calculated it was based on the indexes into the buffer the numbers were all over the place, and my main loop slowed to a crawl when I allowed it to update the entire buffer if need be. I managed work around the issue by letting the function that generates the samples to only update one quarter of the buffer at a time, and that seems to have worked because my sounds are largely repetitive, but it's a kludge, and I fear that it is impacting the quality of my sound output, and causing my servos to jitter occasionally.

So here is the code which generates the samples for my DAC. If you have any suggestions for how to make this like 4x as fast, I'd love to hear it:

const byte sintable[256] = { // According to a post on AVRfreaks, a sine table in PROGMEM is 5x slower to access than an array in ram.
  127,130,133,136,139,143,146,149,152,155,158,161,164,167,170,173,
  176,179,182,184,187,190,193,195,198,200,203,205,208,210,213,215,
  217,219,221,224,226,228,229,231,233,235,236,238,239,241,242,244,
  245,246,247,248,249,250,251,251,252,253,253,254,254,254,254,254,
  255,254,254,254,254,254,253,253,252,251,251,250,249,248,247,246,
  245,244,242,241,239,238,236,235,233,231,229,228,226,224,221,219,
  217,215,213,210,208,205,203,200,198,195,193,190,187,184,182,179,
  176,173,170,167,164,161,158,155,152,149,146,143,139,136,133,130,
  127,124,121,118,115,111,108,105,102,99,96,93,90,87,84,81,
  78,75,72,70,67,64,61,59,56,54,51,49,46,44,41,39,
  37,35,33,30,28,26,25,23,21,19,18,16,15,13,12,10,
  9,8,7,6,5,4,3,3,2,1,1,0,0,0,0,0,
  0,0,0,0,0,0,1,1,2,3,3,4,5,6,7,8,
  9,10,12,13,15,16,18,19,21,23,25,26,28,30,33,35,
  37,39,41,44,46,49,51,54,56,59,61,64,67,70,72,75,
  78,81,84,87,90,93,96,99,102,105,108,111,115,118,121,124}; 

  void updateSoundBuffer() {
    
      const byte phaseAccumulators = 10;

      const word phaseFrequencyBase[phaseAccumulators] = {4429/2, 8860/2, 1930, 3942, 5933, 7884, 9835, 11786, 2, 1}; // 2000, 1000
      const word phaseStepBase[phaseAccumulators] = {(phaseFrequencyBase[0]/31250.0)*65536, // sine_a1
                                                     (phaseFrequencyBase[1]/31250.0)*65536, // sine_a2
                                                     (phaseFrequencyBase[2]/31250.0)*65536, // sine_b1
                                                     (phaseFrequencyBase[3]/31250.0)*65536, // sine_b2
                                                     (phaseFrequencyBase[4]/31250.0)*65536, // sine_b3
                                                     (phaseFrequencyBase[5]/31250.0)*65536, // sine_b4
                                                     (phaseFrequencyBase[6]/31250.0)*65536, // sine_b5
                                                     (phaseFrequencyBase[7]/31250.0)*65536, // sine_b6
                                                     (phaseFrequencyBase[8]/31250.0)*65536, // volume oscillation 25%-100% at 2hz-1hz.
                                                     (phaseFrequencyBase[9]/31250.0)*65536  // pitch oscillation +-50hz at 1hz.
                                                    };            

      static word phase[phaseAccumulators];                                               // Phase accumulator. Used as index into sine table with >> 8. [0..65535]
      word phaseStep[phaseAccumulators];                                                  // Step value by which to increment phase[] each sample. 
      
      const int pitchChange = 2000;//1400*1.5;                                                   // How much the frequency of the sound effect should rise between 0% extension and 100% extension.
      const int pitchStepChangeMax = (pitchChange / 31250.0) * 65536;                     // This value, scaled by the current PKE speed, is added to the step values for the phase accumulators which generate the sine waves used to recreate the PKE's sound.  It is another way of representing the frequency change when the PKE speeds up.
      int pitchStepChange;                                                                // The calculated amount by which we should adjust the step values for all the frequency phase accumulators. 
      
      const int pitchOscillation = 50;                                                    // How much the frequency of the sound effect should rise/fall. (100hz)
      const int pitchStepOscillationMax = (pitchOscillation / 31250.0) * 65536;             
      int pitchStepOscillation;
      
      const int volumeOscillationChange = 10;                                             // How much the frequency of the volume oscillation changes as the speed of the PKE increases.
      const int volumePhaseStepChangeMax = (volumeOscillationChange / 31250.0) * 65536;
      
      byte volumePhaseMultiplier = 0;
      
      int difference;                                                                     // The number of samples which need to be updated this loop.

      byte sine;                                                                          // Temp variable used to hold a value read from the sine table.
      
      static byte noiseIndex = 0;                                                         // Index into noise table incremented once per reading.  Automatically loops.
      static byte oldnoise;                                                               // Used to high-pass filter noise.
      byte noise;                                                                         // Temp variable used to hold a value read from the noise table.

      byte pkeVolumeFixed;                                                                // 0.256 fixed-point representation of current PKE volume level.
      byte volume;                                                                        // 0.256 fixed-point representation of current volume level after 1-2hz oscillation has been applied, and just before it is applied to the current sample being rendered.
      
      const word sectionLength[4] = {256, 128, 256, 384};                                 // The number of samples in each of the four sections which make up the sound effect.  (5 sines, 5 sines w/ noise, 5 sines, 2 sines)
      static double sectionDivider = 1;                                                   // Section length randomly varies between /1 /2 and /4. This value signifies the number of bits to shift the values by to accomplish that. [0..2]
      static word sectionIndex = 0;                                                       // Incremented as we step through each section.
      static byte section;                                                                // The section currently being rendered.
      static byte pulseCount;                                                             // A pulse is a group of 5 sections.  We count how many have been rendered so we can change the timing between pulses every second or so.
      
      static int nextIndex = 0;                                                           // The next sample in the buffer which needs to be updated.
      long sample = 0;                                                                    // The new sample to be placed into the buffer.  We may be able to reduce this to a byte or a word.  

      byte i, n;                                                                          // Temp variables used for loops.
      
      //static long largest;
      

     // Calculate volume change based on current PKE speed.

       //pkeVolume = pkeVolume * (0.25 + 0.75*(pkeSpeed*pkeSpeed));
       pkeVolume = pkeVolume * (0.2 + 0.8*(pkeSpeed*pkeSpeed));
       pkeVolumeFixed = pkeVolume * 255;

    // Update the step values for the phase accumulators:
    // We need not calculate these each sample because they only change as the speed of the PKE changes, which can only happen during the main loop.
            
    // Calculate base step values for sines, using current PKE speed:
      pitchStepChange = pitchStepChangeMax * pkeSpeed;
      for (n = 0; n < 8; n++) { // Adjust pitch.
        phaseStep[n] = phaseStepBase[n] + pitchStepChange;
      }
    
     // Calculate step value for volume oscillator, using current PKE speed.  The volume oscillates more slowly as the PKE speeds up, so volumePhaseStepChangeMax is a negative number.
       phaseStep[8] = phaseStepBase[8] + (volumePhaseStepChangeMax * pkeSpeed); 
          
     // Calculate step value for pitch oscillator:
       phaseStep[9] = phaseStepBase[9];

      
    difference = soundBufferIndex - nextIndex; // Determine how many samples behind we are.
    if (difference < 0) { difference = SOUNDBUFFERSIZE + difference; } // Handle case where sample playing is before next sample to be updated in the buffer.
    if (difference >= 128) { difference = 128; } // Temp code to prevent rendering too many samples at once.
    for (int i = 1; i < difference; i++) { // If there is a difference of 1 or more, fill in the gap.

      // Compute the sample. 
      // We have approximately 512 cycles, on average, in which to do this.
      
      // Add sines.
        switch (section) {      
  
          case 0: // Two sines.
          case 2:
          
            sample = sintable[phase[0]>>8]; // Read sample from sine table.  Compute index from fixed point 256.256 phase accumulator. 
            sample = sample + sintable[phase[1]>>8];
            
            sample = sample - 128*2; // Center sample around 0.  We need to do this before dividing to get the "average" because we aren't dividing by the same number as the number of samples, and the volume is being reduced a bit, so the math doesn't work out otherwise.
                                      
            sample = sample >> 2; // Divide by 4.
            
            break;
        
          case 1: // Two sines plus noise.

            sample = sintable[phase[0]>>8];
            sample = sample + sintable[phase[1]>>8];

            noise = pgm_read_byte(noisetable + noiseIndex); // Read sample from noise table.
            noiseIndex++; // Increment index into noise table.  Stored in a byte so it automatically loops.
            
            oldnoise = (oldnoise + noise) >> 1; // Average noise with oldnoise to high-pass filter it.  Could apply this filter to LUT in advance.  Try sound effect without this filter, see if it sounds okay?
            
            sample = sample + oldnoise;

            sample = sample - 128*3; // Center sample around 0.
            
            sample = sample >> 2; // Divide by 4. 
      
            break;
      
          case 3: // Six sines.
        
            sample = sintable[phase[2]>>8];
            sample = sample + sintable[phase[3]>>8];
            sample = sample + sintable[phase[4]>>8];
            sample = sample + sintable[phase[5]>>8];
            sample = sample + sintable[phase[6]>>8];
            sample = sample + sintable[phase[7]>>8];
            
            sample = sample - 128*6; // Center sample around 0.
            sample = sample >> 3; // Divide by 8, because dividing by 6 is too costly.
            
            break;
            
        }
        
        // At this point, sample is an signed value -128..128. 

/*
        if (DEBUG == 1) {
           Serial.print("sample = ");
           Serial.println(sample);
        }   
*/                
      // Oscillate volume:
      // (At a rate of 1-2hz, from 25% to 100%. Volume oscillates more slowly as PKE speeds up.  volume = ([0..255] * (64*256 + 192*[0..256]))>>16
      
        sine = sintable[phase[8]>>8]; // [0..255]
        
        // Adjust maximum swing:
          // sine = 128 + sine>>1; // [128..255] 
          //sine = 192 + sine>>2; // [192..255] 
          sine = 128 + sine>>2; // [128..192] used to avoid need to clamp below.
          //sine = 160 + sine>>2; // [160..224] used to avoid need to clamp below.
        
        volume = ((long) pkeVolumeFixed * sine) >> 8; // [0..255]
        
        //volume = ((long)pkeVolumeFixed * (64*255 + 192*sine))>>16; // volume = [0..255] 
        //volume = pkeVolume * sine // [0..255]
        //volume = pkeVolume * (64 + (192*sine)>>8) // [0..255]
        
      // Adjust volume of sample.
      
        //sample = sample * pkeVolume;
        //sample = (sample * pkeVolumeFixed) >> 8;
        
        //sample = (sample * volume) >> 8; // sample = [-32640..32385]/256 = [-128..127] 
        //sample = sample<<2; // Multiply volume of sample by 4.
        
        sample = (sample * volume) >> 6; // Divide by lower value to maximize volume, which on average is around 64, not 256.
        
        // Clamp sample to range -128, 127  

//          if (sample < -128) { sample = -128; }
//          if (sample > 127) { sample = 127; }
        
        /*

        
        // Display highest value for sample generated so far.
        // Result = 67
        
        if (DEBUG == 1) {
          if (sample > largest) { 
            largest = sample; 
            Serial.print("sample = ");
            Serial.println(sample);
          }
        }   
        
        */

        
      // Place sample in the sound buffer.
      // Sound buffer requires 8-bit samples [0..255]

        soundBuffer[nextIndex] = sample + 128;

        nextIndex++; // Increment the index.
        nextIndex = nextIndex & 1023; // Wrap nextIndex when it reaches the end of the buffer.  Bitwise AND is much faster than a modulo.

              
      // Update the phase accumulators. (In other words, increment the fixed-point index into the sine tables.)
        
        // Compute pitch oscillation. 
          pitchStepOscillation = (long)(pitchStepOscillationMax * (sintable[phase[9]>>8]-128)) >> 8; 
          //pitchStepOscillation = (long)(pitchStepOscillationMax * (sintable[phase[9]>>8])) >> 8; 

        for (n = 0; n < 8; n++) {
          phase[n] = phase[n] + phaseStep[n] + pitchStepOscillation; // Add step to accumulator.  No if statement or modulus is needed because it will wrap automatically once it reaches 65535.
        }              
        
        phase[8] = phase[8] + phaseStep[8]; // Volume oscillator
        //phase[8] = phase[8] + phaseStep[8]<<volumePhaseMultiplier; // Volume oscillator
        phase[9] = phase[9] + phaseStep[9]; // Pitch oscillator

        
      // Update which section we are in.
      
        sectionIndex++; // Increment the number of samples we have rendered in this section.
        if (sectionIndex >= (sectionLength[section]*sectionDivider)) { // If we have rendered all the samples needed for this section, move on to the next section.

          sectionIndex = 0;
          section++;
          
          if (section == 4) { // If we have gone past the last section, select a new sectionDivider, and start again from the first section.
            
            pulseCount++; // Keep track of the number of pulses (chord pairs) we have rendered at this speed.
            
            if (pulseCount >= 32) { 
              
              pulseCount = 0;        
              
              // At a 31250hz sample rate, and standard pulses being 1024 samples long, there are around 30 pulses per second. 
              // If we want to change the pulse rate every second or so, (as defined by sectionLength[section]>>sectionDivider) then changing the rate every 32 pulses will do the job.
              // Since sectionDivider allows us to speed up the pulses, if we want to wait for the same period of time before changing speeds, we must remember to multiply the desired pulseCount by sectionDivider, 
              // or divide pulseCount by sectionDivider before comparing it.
              
              // Random() is a slow function but since we're only using it once in a while, it might be okay.  

              switch ((int)(pkeSpeed*3)) { // With an (int) conversion, fractional portions are discarded, so everything except pkeSpeed = 1.0 is mapped to 0, 1, or 2.
              
                case 0: // Select from any of the three speeds at the "slowest" (quietest, lowest pitch) setting.
              
                  //sectionDivider = random(0, 3); // random(0, 3) returns a float greater than or equal to 0 and less than 3, so the range when typecast to (int) will be [0..2].
                  //noise = pgm_read_byte(noisetable + noiseIndex);
                  
                  // sectionDivider = 1;
                      
                  //sectionDivider = 0.75 + random(0,3)*0.5;
                  sectionDivider = 0.75 + random(0, 10)*0.04;
                  //volumePhaseMultiplier = random(0,3);
                  
                  break;
                
                case 1: // At "medium" setting, select from the two fastest speeds at random.
              
                  //sectionDivider = random(1, 3); 
                  //sectionDivider = 0.875;
                  
                  //sectionDivider = 0.75 + random(0,2)*0.5;
                  sectionDivider = 0.75 + random(0, 10)*0.02;
                  //volumePhaseMultiplier = random(1,3);
                  
                  break;
               
                case 2: // At the "highest" setting, use only the fastest speed.
                case 3: // Only when pkeSpeed = 1.0 will the result be 3.
            
                  sectionDivider = 0.75 + random(0, 10)*0.01; 
                  //volumePhaseMultiplier = 2;
                  break;
                  
              }
              
            }
            
            section = 0; 
            
          }

        }        
      
        
      }    


      /*
      if ((DEBUG == 1) && (difference > 100)) {
        Serial.print("samples behind = ");
        Serial.println(difference);
        Serial.println(soundBufferIndex);
        Serial.println(nextIndex);
        Serial.println();
      }    
      */
    
  }

A lot of code...

How are you compiling it?

Frome what I read - to improve by 4x requires an algorithm change,
or a simplicifation in the inner loop.

It would help you think about the time spent in the code if you calculated the number of times each line of code is executed..

removed bug comment as not sure anymore...

How am I compiling it? I'm not sure what you mean. I'm using the Arduino IDE if that's what you're asking.

As for how much each line of code is executed...

The function is called once per main loop. The main loop reads the analog inputs, and has a couple 1ms delays in it to do so. Around 64 samples will be played in the time it takes for those delays to execute, but if the function is fast enough it should be able to catch up.

There is a bit of setup code at the start of the function for variables which change only when the speed changes. Since the speed can change only outside the function, and when the analog inputs are read, I can avoid having to recalculate those variables as long as I'm in the function refilling the buffer.

Most of the rest of the code in the function is executed once per sample. The oscillators are updated once per sample as well, and there are ten of those. Now that I think about it, I suppose I could unroll that oscillator update loop and save a few cycles there by executing one fewer conditional jump per sample.

At the end of the function there's some code with some conditionals. This is what I'm reffering to:

        sectionIndex++; // Increment the number of samples we have rendered in this section.
        if (sectionIndex >= (sectionLength[section]*sectionDivider)) { // If we have rendered all the samples needed for this section, move on to the next section.

I suppose I could precalculate sectionLength[section]*sectionDivider. The sectionlength and sectiondivider can't change until that condition is met and I've moved on to the next section.

The code inside that conditional is executed roughly only once every 1024 samples though, so unless that is really really slow I shouldn't need to worry too much about it.

You calculate the phase[0...9] as 16 bit numbers but only use one byte in the code in the inner loop .. phase[x]>>8

You would make the code easier to execute (so faster) if you stored the bit of it you need to use.
ie calculate highPhase[i] = phase[i]>>8 outside the loop.

I cannot see why sample should be a long
It contains the sum of upto 6 byte values, divided and shifted to be 0 centered, then multiplied by a byte volume.
It should be a signed 16 bit int...
This code would be 2 to 4x quicker to execute

      // Adjust volume of sample.
        sample = (sample * volume) >> 6; // Divide by lower value to maximize volume, which on average is around 64, not 256.
           sample = sample + sintable[phase[3]>>8];

might be 2x faster

Similarly
volume = ((long) pkeVolumeFixed * sine) >> 8; // [0..255]
is doing a 32bit multiply, when a 16 bit result would do I think.
If the compiler generates a call to the 32 bit multiply this is costly > 32 clocks.
A 32 bit multiply is 4x slower than a sixteen bit multiply which is 4x slower than a 8 bit multiply which is 2 clocks.

The only other oportunity in the code for improvement that I can see is the logic around the sectionIndex, section number and the switch(section).

If you could pull this logic out of the sample loop, and have three functions one for section02(), section1() and section2() you might get a speed up.

This does depend on the rest of the codes structure.

Is there a simulator environment in which you can profile the execution of the code?

Professionally, i refuse to optimise without measurement.

in this play environment it is fun almost because the measurement tools seem to be missing, but this is a lot of code to get running fast and correctly - and profile tools would help a lot.

good luck.

So my comments in 2nd reply (written b4 i read your reply) seem to be ok:)

For tools, you can see an assembler listing of the code.

If you search the forums for avr-objdump i think you will find instructions (hopefiully for your OS and arduino version / environment).

Then each byte of op-code is a clock, and a few instructions (like the 8 bit multiply variants) are 2, and the obscure flash reading ones etc.. you need the databook for.

It is easy to count up where the time goes.

-- i am off for the day.

Good luck again.

You calculate the phase[0...9] as 16 bit numbers but only use one byte in the code in the inner loop ..
You would make the code easier to execute (so faster) if you stored the bit of it you need to use.
outside the loop.

The phase accumulators are incremented after each sample is calculated in advance of the next sample to be rendered. I'm pretty sure I can't do that calculation outside the main loop.

I cannot see why sample should be a long
It contains the sum of upto 6 byte values, divided and shifted to be 0 centered, then multiplied by a byte volume.
It should be a signed 16 bit int...

You may be right about that. It's hard to keep track of how large the values will get. I think I made it a long at some point to make sure that an overflow wasn't the problem. When I first tried this code after writing it, it didn't work at all. That's why there's a few floating point values used in there in the code I commented out. I was trying to simplify things until I found the source of the problem, then I went back and optimized it again. I'll double check that and the other variables to see what I can make smaller.

If you could pull this logic out of the sample loop, and have three functions one for section02(), section1() and section2() you might get a speed up.

Yeah, I was looking at that after I posted the code. Not really sure how to go about optimizing that in a clean manner. Right now it's like:

For {
switch {
case:
case:
case:
}
}

But to avoid the case in the loop I'd need to flip it around something like this:

repeat {
switch {
case:
repeat {
} until we'vew rendered all samples in this section
case:
repeat {
} until we'vew rendered all samples in this section
case:
repeat {
} until we'vew rendered all samples in this section
} until we've rendered all samples

And all those case statements would have a ton of repeated code in them. Which isn't good if I want to be able to maintain the code and tweak it. Then I'd have to copy all the tweaks three times every time. And I've been doing tons of tweaking to get the sound right.

Is there a simulator environment in which you can profile the execution of the code?
Professionally, i refuse to optimise without measurement.
in this play environment it is fun almost because the measurement tools seem to be missing, but this is a lot of code to get running fast and correctly - and profile tools would help a lot.

I don't know of any such tool. But I guess I should look for that thing you mentioned which will let me look at the generated assembly code. I think it's gonna be hard to find the relevant section to look at in that though. I imagine it's not going to have any of the variable names and be a mess of stuff being pushed and popped off the stack and I won't be able to make heads or tails of it.

Hm... one thing I can do with the phase accumulators is not increment them all every loop. As I mentioned, I could unroll that loop where I increment the accumulators. But rather than do that, instead I could increment the accumulators I use in the switch statement that selects which ones to combine. I think I did it the way I did originally because I was trying to make the code simpler and more generalized, and also because I thought it might improve the sound quality, but now I think that's probably not necessary.

One new simple change...

      static double sectionDivider = 1;                                                   // Section length randomly varies between /1 /2 and /4. This value signifies the number of bits to shift the values by to accomplish that. [0..2]
      static word sectionIndex = 0;                                                       // Incremented as we step through each section.
      static byte section;                                                                // The section currently being rendered.
        sectionIndex++; // Increment the number of samples we have rendered in this section.
        if (sectionIndex >= (sectionLength[section]*sectionDivider)) { // If we have rendered all the samples needed for this section, move on to the next section.
          sectionIndex = 0;
          section++;

Is doing a float (double is not implemented on AVR-GCC) multiply each time through the loop. That is a couple of hundred cycles.

You could count down from sectionLength [ section ] *sectionDivider to 0.

=======
The comment about restructuring is a bit of a guess. I thought the outer function might look something like this

      const word sectionLength[4] = {256, 128, 256, 384};                                 // The number of samples in each of the four sections which make up the sound effect.  (5 sines, 5 sines w/ noise, 5 sines, 2 sines)
      static double sectionDivider = 1;                                                   // Section length randomly varies between /1 /2 and /4. This value signifies the number of bits to shift the values by to accomplish that. [0..2]
      static byte pulseCount;                                                             // A pulse is a group of 5 sections.  We count how many have been rendered so we can change the timing between pulses every second or so.


      calcSection02(sectionLength[0]*sectionDivider) ;
      calcSection1(sectionLength[1]*sectionDivider) ;
      calcSection02(sectionLength[2]*sectionDivider) ;
      calcSection3(sectionLength[3]*sectionDivider) ;

      pulseCount++; // Keep track of the number of pulses (chord pairs) we have rendered at this speed.
            
      if (pulseCount >= 32) { 
          pulseCount = 0;        
          // At a 31250hz sample rate, and standard pulses being 1024 samples long, there are around 30 pulses per second. 
          // If we want to change the pulse rate every second or so, (as defined by sectionLength[section]>>sectionDivider) then changing the rate every 32 pulses will do the job.
          // Since sectionDivider allows us to speed up the pulses, if we want to wait for the same period of time before changing speeds, we must remember to multiply the desired pulseCount by sectionDivider, 
          // or divide pulseCount by sectionDivider before comparing it.
              
          // Random() is a slow function but since we're only using it once in a while, it might be okay.  
          switch ((int)(pkeSpeed*3)) { // With an (int) conversion, fractional portions are discarded, so everything except pkeSpeed = 1.0 is mapped to 0, 1, or 2.
              
             case 0: // Select from any of the three speeds at the "slowest" (quietest, lowest pitch) setting.
                  // sectionDivider = 1;
                sectionDivider = 0.75 + random(0, 10)*0.04;
                break;
                
             case 1: // At "medium" setting, select from the two fastest speeds at random.
                sectionDivider = 0.75 + random(0, 10)*0.02;
                break;
               
             case 2: // At the "highest" setting, use only the fastest speed.
             case 3: // Only when pkeSpeed = 1.0 will the result be 3.
                sectionDivider = 0.75 + random(0, 10)*0.01; 
                break;
          }
      }

and the calcSection02() etc would then count down and contain only the code for their part.

I am not sure if this works out as I can not see the surrounds of tour code.

===

The output of avr-objdump is your code intermingled with the assembler.
I agree the address and stack maniplulations are a pain to follow but the code as comments makes it easier to follow.

One new simple change...
Is doing a float (double is not implemented on AVR-GCC) multiply each time through the loop. That is a couple of hundred cycles.

That's one of the changes I mentioned above. I got rid of it by storing the multiplied value in a temporary variable.

Here's the latest code. It seems much faster than the original code:

void updateSoundBuffer() {
    
      const byte phaseAccumulators = 10;

      const word phaseFrequencyBase[phaseAccumulators] = {4429/2, 8860/2, 1930, 3942, 5933, 7884, 9835, 11786, 1, 1}; // 2000, 1000
      const word phaseStepBase[phaseAccumulators] = {(phaseFrequencyBase[0]/31250.0)*65536, // sine_a1
                                                     (phaseFrequencyBase[1]/31250.0)*65536, // sine_a2
                                                     (phaseFrequencyBase[2]/31250.0)*65536, // sine_b1
                                                     (phaseFrequencyBase[3]/31250.0)*65536, // sine_b2
                                                     (phaseFrequencyBase[4]/31250.0)*65536, // sine_b3
                                                     (phaseFrequencyBase[5]/31250.0)*65536, // sine_b4
                                                     (phaseFrequencyBase[6]/31250.0)*65536, // sine_b5
                                                     (phaseFrequencyBase[7]/31250.0)*65536, // sine_b6
                                                     (phaseFrequencyBase[8]/31250.0)*65536, // volume oscillation 25%-100% at 2hz-1hz.
                                                     (phaseFrequencyBase[9]/31250.0)*65536  // pitch oscillation +-50hz at 1hz.
                                                    };            

      static word phase[phaseAccumulators];                                               // Phase accumulator. Used as index into sine table with >> 8. [0..65535]. No modulus is needed with these accumulators as they automatically wrap back to 0 once they hit 65535.
      word phaseStep[phaseAccumulators];                                                  // Step value by which to increment phase[] each sample. 
      
      const int pitchChange = 2000;//1400*1.5;                                            // How much the frequency of the sound effect should rise between 0% extension and 100% extension.
      const int pitchStepChangeMax = (pitchChange / 31250.0) * 65536;                     // This value, scaled by the current PKE speed, is added to the step values for the phase accumulators which generate the sine waves used to recreate the PKE's sound.  It is another way of representing the frequency change when the PKE speeds up.
      int pitchStepChange;                                                                // The calculated amount by which we should adjust the step values for all the frequency phase accumulators. 
      
      const int pitchOscillation = 50;                                                    // How much the frequency of the sound effect should rise/fall. (100hz)
      const int pitchStepOscillationMax = (pitchOscillation / 31250.0) * 65536;           // At a pitch oscillation of 50hz up/down, this value will be 104.    
      int pitchStepOscillation;
      
      const int volumeOscillationChange = 10;                                             // How much the frequency of the volume oscillation changes as the speed of the PKE increases.
      const int volumePhaseStepChangeMax = (volumeOscillationChange / 31250.0) * 65536;
      
      byte volumePhaseMultiplier = 0;
      
      int difference;                                                                     // The number of samples which need to be updated this loop.

      byte sine;                                                                          // Temp variable used to hold a value read from the sine table.
      
      static byte noiseIndex = 0;                                                         // Index into noise table incremented once per reading.  Automatically loops.
      static byte oldnoise;                                                               // Used to high-pass filter noise.
      byte noise;                                                                         // Temp variable used to hold a value read from the noise table.

      byte pkeVolumeFixed;                                                                // 0.256 fixed-point representation of current PKE volume level.
      byte volume;                                                                        // 0.256 fixed-point representation of current volume level after 1-2hz oscillation has been applied, and just before it is applied to the current sample being rendered.
      
      const word sectionLengthBase[4] = {256, 128, 256, 384};                             // The number of samples in each of the four sections which make up the sound effect.  (5 sines, 5 sines w/ noise, 5 sines, 2 sines)
      static word sectionLength = 256;                                                    // Desired length of section currently being rendered.

      static double sectionDivider = 1;                                                   // Section length randomly varies between /1 /2 and /4. This value signifies the number of bits to shift the values by to accomplish that. [0..2]
      static word sectionIndex = 0;                                                       // Incremented as we step through each section.
      static byte section;                                                                // The section currently being rendered.
      static byte pulseCount;                                                             // A pulse is a group of 5 sections.  We count how many have been rendered so we can change the timing between pulses every second or so.
      
      static int nextIndex = 0;                                                           // The next sample in the buffer which needs to be updated.
      int sample = 0;                                                                     // The new sample to be placed into the buffer. 

      byte i, n;                                                                          // Temp variables used for loops.
      
      //static long largest;


     // Calculate volume change based on current PKE speed.

       //pkeVolume = pkeVolume * (0.25 + 0.75*(pkeSpeed*pkeSpeed));
       pkeVolume = pkeVolume * (0.2 + 0.8*(pkeSpeed*pkeSpeed));
       pkeVolumeFixed = pkeVolume * 255;

    // Update the step values for the phase accumulators:
    // We need not calculate these each sample because they only change as the speed of the PKE changes, which can only happen during the main loop.

    // Calculate base step values for sines, using current PKE speed:
      pitchStepChange = pitchStepChangeMax * pkeSpeed;
      for (n = 0; n < 8; n++) { // Adjust pitch.
        phaseStep[n] = phaseStepBase[n] + pitchStepChange;
      }
    
     // Calculate step value for volume oscillator, using current PKE speed.  The volume oscillates more slowly as the PKE speeds up, so volumePhaseStepChangeMax is a negative number.
       phaseStep[8] = phaseStepBase[8] + (volumePhaseStepChangeMax * pkeSpeed); 
          
     // Calculate step value for pitch oscillator:
       phaseStep[9] = phaseStepBase[9];
    difference = soundBufferIndex - nextIndex; // Determine how many samples behind we are.
    if (difference < 0) { difference = SOUNDBUFFERSIZE + difference; } // Handle case where sample playing is before next sample to be updated in the buffer.
    if (difference >= 64) { difference = 64; } // Temp code to prevent rendering too many samples at once.

    for (int i = 1; i < difference; i++) { // If there is a difference of 1 or more, fill in the gap.

      // Compute the sample. 
      // We have approximately 512 cycles, on average, in which to do this.
      
      // Compute pitch oscillation. 
      
        sine = sintable[phase[9]>>8]; // Get value of pitch oscillator.
        phase[9] += phaseStep[9]; // Increment pitch oscillator phase accumulator.
        
        pitchStepOscillation = (pitchStepOscillationMax * (sine-128)) >> 8; // [13208, -13312]>>8 = [51, -52] // Calculate by how much to adjust the oscillator steps.

      // Add sines.

        switch (section) {      
  
          case 0: // Two sines.
          case 2:
          
            sample = sintable[phase[0]>>8]; // Read sample from sine table.  Compute index from fixed point 256.256 phase accumulator. 
            sample = sample + sintable[phase[1]>>8];
            
            sample = sample - 128*2; // Center sample around 0.  We need to do this before dividing to get the "average" because we aren't dividing by the same number as the number of samples, and the volume is being reduced a bit, so the math doesn't work out otherwise.
                                      
            sample = sample >> 2; // Divide by 4.

            phase[0] += phaseStep[0] + pitchStepOscillation; // Add step to accumulator. 
            phase[1] += phaseStep[1] + pitchStepOscillation; 
            
            break;
        
          case 1: // Two sines plus noise.

            sample = sintable[phase[0]>>8];
            sample = sample + sintable[phase[1]>>8];

            noise = pgm_read_byte(noisetable + noiseIndex); // Read sample from noise table.
            noiseIndex++; // Increment index into noise table.  Stored in a byte so it automatically loops.
            
            //oldnoise = (oldnoise + noise) >> 1; // Average noise with oldnoise to high-pass filter it.  Could apply this filter to LUT in advance.  Try sound effect without this filter, see if it sounds okay?
            
            sample = sample + noise; //oldnoise;

            sample = sample - 128*3; // Center sample around 0.
            
            sample = sample >> 2; // Divide by 4. 

            phase[0] += phaseStep[0] + pitchStepOscillation; 
            phase[1] += phaseStep[1] + pitchStepOscillation; 
      
            break;
      
          case 3: // Six sines.
        
            sample = sintable[phase[2]>>8];
            sample = sample + sintable[phase[3]>>8];
            sample = sample + sintable[phase[4]>>8];
            sample = sample + sintable[phase[5]>>8];
            sample = sample + sintable[phase[6]>>8];
            sample = sample + sintable[phase[7]>>8];
            
            sample = sample - 128*6; // Center sample around 0.
            sample = sample >> 3; // Divide by 8, because dividing by 6 is too costly.

            phase[2] += phaseStep[2] + pitchStepOscillation; 
            phase[3] += phaseStep[3] + pitchStepOscillation; 
            phase[4] += phaseStep[4] + pitchStepOscillation; 
            phase[5] += phaseStep[5] + pitchStepOscillation; 
            phase[6] += phaseStep[6] + pitchStepOscillation; 
            phase[7] += phaseStep[7] + pitchStepOscillation; 
            
            break;
            
        }

        // At this point, sample is an signed value -128..128. 
        /*
        if (DEBUG == 1) {
           Serial.print("sample = "); Serial.println(sample);
        }   
        */                

      // Oscillate volume:
      // (At a rate of 1-2hz, from 25% to 100%. Volume oscillates more slowly as PKE speeds up.  volume = ([0..255] * (64*256 + 192*[0..256]))>>16
      
        // Get value of volume oscillator, and increment its phase accumulator.
          sine = sintable[phase[8]>>8]; // [0..255]
          phase[8] += phaseStep[8];
          
        // Adjust min/max swing:
          sine = 128 + sine>>2; // [128..192] (Setting 192 as max allows us to avoid need to clamp sample below.)
        
        // Calculate new volume:
          volume = ((word)pkeVolumeFixed * sine) >> 8; // [0..255]
        
      // Adjust volume of sample.

        // sample = (sample * volume) >> 8; // sample = [-32640..32385]/256 = [-128..127] 
        // sample = sample<<2; // Multiply volume of sample by 4, since samples tend to max out around 63, rather than 255 which is where we want them.
        
        sample = (sample * volume) >> 6; // Same as above two equations, but in one step.  


      // Display highest value for sample generated so far:
      // (test result = 67)
        
        /*
        if (DEBUG == 1) {
          if (sample > largest) { 
            largest = sample; 
            Serial.print("sample = "); Serial.println(sample);
          }
        }        
        */


      // Place sample in the sound buffer.
      // Sound buffer requires 8-bit samples. [0..255]

        soundBuffer[nextIndex] = sample + 128;

        nextIndex++; // Increment the index.
        nextIndex = nextIndex & 1023; // Wrap nextIndex when it reaches the end of the buffer.  Bitwise AND is much faster than modulo.

      // Update which section we are in.

        sectionIndex++; // Increment the number of samples we have rendered in this section.
        if (sectionIndex >= sectionLength) { // If we have rendered all the samples needed for this section, move on to the next section.

          sectionIndex = 0;
          section++;
          
          if (section == 4) { // If we have gone past the last section, select a new sectionDivider, and start again from the first section.
            
            pulseCount++; // Keep track of the number of pulses (chord pairs) we have rendered at this speed.
            
            if (pulseCount >= 32) { 
              
              pulseCount = 0;        
              
              // At a 31250hz sample rate, and standard pulses being 1024 samples long, there are around 30 pulses per second. 
              // If we want to change the pulse rate every second or so, (as defined by sectionLength[section]>>sectionDivider) then changing the rate every 32 pulses will do the job.
              // Since sectionDivider allows us to speed up the pulses, if we want to wait for the same period of time before changing speeds, we must remember to multiply the desired pulseCount by sectionDivider, 
              // or divide pulseCount by sectionDivider before comparing it.
              
              // Random() is a slow function but since we're only using it once in a while, it might be okay.  

              switch ((int)(pkeSpeed*3)) { // With an (int) conversion, fractional portions are discarded, so everything except pkeSpeed = 1.0 is mapped to 0, 1, or 2.
              
                case 0: // Select from any of the three speeds at the "slowest" (quietest, lowest pitch) setting.
              
                  //sectionDivider = random(0, 3); // random(0, 3) returns a float greater than or equal to 0 and less than 3, so the range when typecast to (int) will be [0..2].
                  //noise = pgm_read_byte(noisetable + noiseIndex);
                  
                  // sectionDivider = 1;
                      
                  //sectionDivider = 0.75 + random(0,3)*0.5;
                  sectionDivider = 0.5 + random(0, 10)*0.05;
                  //volumePhaseMultiplier = random(0,3);
                  
                  break;
                
                case 1: // At "medium" setting, select from the two fastest speeds at random.
              
                  //sectionDivider = random(1, 3); 
                  //sectionDivider = 0.875;
                  
                  //sectionDivider = 0.75 + random(0,2)*0.5;
                  //sectionDivider = 0.5 + random(0, 10)*0.025;
                  //volumePhaseMultiplier = random(1,3);
                  
                  break;
               
                case 2: // At the "highest" setting, use only the fastest speed.
                case 3: // Only when pkeSpeed = 1.0 will the result be 3.
            
                  sectionDivider = 0.75 + random(0, 10)*0.01; 
                  //volumePhaseMultiplier = 2;
                  break;
                  
              }
              
            }
            
            section = 0; 
            
          }

          sectionLength = sectionLengthBase[section]*sectionDivider; // Calculate the length of the section we are about to render.

        }        

        
      }    


      /*
      if ((DEBUG == 1) && (difference > 100)) {
        Serial.print("samples behind = "); Serial.println(difference);
        Serial.println(soundBufferIndex);
        Serial.println(nextIndex);
        Serial.println();
      }    
      */
    
  }

Man the post size limit here is absurdly small.

Anyway...

The code seems much faster now, though this has changed the character of the sound that's being produced. Also, it has also sped up the animation of my leds. Or rather, it's no longer slowing them down. The LEDs are supposed to be updated at 30hz, and I hadn't noticed the original sound update code was running so slowly that it was affecting them.

I guess I'll have to retweak the sound after I do some more testing and optimizaton. I think I will count the number of samples rendered so far and the number of samples played so far, and compare the two by dividing one by the other to get a ratio which will tell me just how much faster the playback speed is than my update speed. Maybe I can now get it closer to the ideal I was shooting for which was to have it sound identical to the film, though.

Oh, and as for avr-objdump, I've been looking into that but have yet to find where the Arduino IDE is sticking the generated object files so I can disassemble them.

This thread talks about where the compilation directory is..

http://arduino.cc/forum/index.php/topic,46733.0.html

Good to hear the code is running faster :).