Was trying to create a faster bit shift, and failed...

Hey guys,

In my program I’m outputting samples to a dac for audio, and to do volume control requires a bit shift. I’ve learned that each bit shifted by is 1 instruction, so if I shift by a lot of bits that’s a lot of instructions. Also, bit shifting to do volume control doesn’t exactly provide smooth volume adjustment.

So, I got to thinking about how I might speed this up, and how I might gain fine control over the volume. I got to thinking about lookup tables, and came up with an idea, which I decided would probably not work with any old divisor as I’d hoped, but I know it will work when dividing by powers of 2, so I decided to benchmark it. But I’m not getting the results I’d hoped to see.

I thought this method would be a bit faster because memory lookups are only 2 cycles each, and the bitwise OR should be 1, so that’s 5… and the shift by 8 and the bit mask should be optimized out I think. The bitwise shift by 7 on the other hand would be 7 instructions.

But that estimate seems to be way off, because my lookup table method is actually twice as slow as the bit shifting.

Anyway, here’s my benchmark code:

void setup() {                

  unsigned int time;
  unsigned int x, y;
  uint16_t tmp;
  uint16_t dh[256];
  uint8_t dl[256];
  
  for (x=0; x<256; x++) {
    dl[x] = x >> 7;
    dh[x] = (x<<8) >> 10;
  }
 
  Serial1.begin(9600); // Set up Serial library for debugging via UART1
 
   time = millis();
   
   for (y=0; y<5; y++) {
   for (x=0; x<65535; x++) {
     tmp = x >> 10;
   }
   }
   
   time = millis() - time;
   
   Serial1.print("tmp = ");
   Serial1.println(tmp, DEC);
   Serial1.print("time = ");
   Serial1.println(time, DEC);
   
   time = millis();

   for (y=0; y<5; y++) {   
   for (x=0; x<65535; x++) {
     tmp = dh[x>>8] | dl[x&0xFF];
   }
   }
   
   time = millis() - time;
   
   Serial1.print("tmp = ");
   Serial1.println(tmp, DEC);
   Serial1.print("time = ");
   Serial1.println(time, DEC);   
  
}

// the loop routine runs over and over again forever:
void loop() {

}

So what do you think? Is the stuff in the brackets of my arrays being optimized out as I think it is, or is that where all the extra cycles come from?

I thought this method would be a bit faster because memory lookups are only 2 cycles each,

For each byte. Plus a few instructions to load the address in X, Y, or Z. Plus a few more instructions to calculate the offset from the base address.

and the bitwise OR should be 1,

For each byte.

so that's 5...

Not even close.

The bitwise shift by 7 on the other hand would be 7 instructions.

For each byte. Not considering other possible optimizations like using SWAP to shift by 4 bits.

What exactly are you trying to do? Shift a 16 bit value by 7 bits?

This looks to be a continued discussion of some kind. For those of us not aware of it's origins perhaps a little more discussion here of the goals?

Target board? Bit count of shifted value and it's result?

The board I'm using is one I built. It has an Atmega1284p and a dac on board, and I'm outputting 12 bit sound samples to it.

I was just looking for a fast way to adjust the volume. Ie, divide a 12 bit sample in a 16 bit variable by some arbitrary value.

Bit shifting is one way to do this, but it doesn't allow for fine control, and the compiler probably can't optimize it much if the value to be shifted by is variable. So in theory it might take up to 15 clock cycles when the volume is turned almost all the way down.

So I was just trying to see if I could come up with a faster way to adjust the volume. I'm sure a lot of work was done in this area years ago by people coding music players for old PCs and Amigas. I thought someone might know some trick that was used. My own efforts to find a trick failed miserably as you have seen. :-)

Bit shifting is one way to do this, but it doesn't allow for fine control, and the compiler probably can't optimize it much if the value to be shifted by is variable.

Exactly. The code is essentially a for loop shifting one bit at a time.

So in theory it might take up to 15 clock cycles when the volume is turned almost all the way down.

Is 15 clock cycles too much? What is the maximum number of clock cycles the application can tolerate?

I'm sure a lot of work was done in this area years ago by people coding music players for old PCs and Amigas.

The "sound system" on very old PCs had no volume control and the Amiga had a dedicated sound processor (Paula?).

A digital potentiometer is probably the cheapest easiest way to get processor controlled volume.

Thanks for the quick explanation.

Any such ‘tricks’ would be processor dependent and wouldn’t necessarily apply to any other processor. For instance the Amiga’s 32 bit 68000 has 32 bit wide registers and can do ‘integer’ division of 12 bit value in a single data register with several instructions at 16 MHz.

The "sound system" on very old PCs had no volume control and the Amiga had a dedicated sound processor

I am not talking about making beeps with the PC speaker. I am talking about wavetable synthesis. Ie, MOD or S3M files. These most certainly had volume control, on an individual sample level. One of the first ones I used could even use the PC speaker for output via PCM.

As for the Amiga, I don't know much about its sound processor.

What's the audio sample rate? If it is 44.1K samples/second (as on a CDROM) then you have 22.67 uS available to do everything you need, including getting the byte, shifting it, and outputting it. Time will be tight.

What was your debug output anyway?

lloyddean: Thanks for the quick explanation.

Any such 'tricks' would be processor dependent and wouldn't necessarily apply to any other processor. For instance the Amiga's 32 bit 68000 has 32 bit wide registers and can do 'integer' division of 12 bit value in a single data register with several instructions at 16 MHz.

Not necessarily true. I used one of their tricks on an earlier project for computing which sample to choose from a precomputed sine wave in order to generate and mix several tones. And bitwise math tricks are often portable.

Shifting right 4 bits:

 unsigned int j = i >> 4;
  86:	60 91 60 00 	lds	r22, 0x0060   (2)
  8a:	70 91 61 00 	lds	r23, 0x0061   (2)
  8e:	84 e0       	ldi	r24, 0x04	; 4  (1) x 4
  90:	76 95       	lsr	r23  (1) x 4
  92:	67 95       	ror	r22  (1) x 4
  94:	8a 95       	dec	r24  (1) x 4
  96:	e1 f7       	brne	.-8      	; 0x90 <setup+0x1e> (2, 2, 2, 1)

I count 27 machine cycles there.

You we're talking amplitude modification.

The Amiga's Paula Chip has four DMA-driven 8-bit PCM sample sound channels.

I don’t remember what the debug output was. I thnk it was like 400ms for my version and 172 for the bit shift or something.

As for the sample rate, I had to forego 44khz and go with 22khz. And yes, time is tight, that’s why I’m trying to optimize the lib, but it seems that the creator of WaveHC did his job because beyond converting the thing to use the SPI bus and direct port manipulation, I haven’t really been able to improve upon things.

Here’s the pertinent audio code btw:

//------------------------------------------------------------------------------
// timer interrupt for DAC
ISR(TIMER1_COMPA_vect) {
  if (!playing) return;

  if (playpos >= playend) { // If we've reached the end of the buffer...
    if (sdstatus == SD_READY) {
    
      // swap double buffers
      playpos = sdbuff;
      playend = sdend;
      sdbuff = sdbuff != buffer1 ? buffer1 : buffer2;
      
      sdstatus = SD_FILLING;
      // interrupt to call SD reader
	    TIMSK1 |= _BV(OCIE1B);
    }
    else if (sdstatus == SD_END_FILE) {
      playing->stop();
      return;
    }
    else {
      // count overrun error if not at end of file
      if (playing->remainingBytesInChunk) {
        playing->errors++;
      }
      return;
    }
  }

  uint8_t dh, dl;
  if (playing->BitsPerSample == 16) {
  
    // 16-bit is signed
    dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1.  So xoring 101 with 000 gives 101.  While xoring 101 with 100 gives 001.  I guess fipping this one bit is all that is needed to convert from signed to unsigned! 
    dl = playpos[0];
    playpos += 2;
  }
  else {
  
    // 8-bit is unsigned
    dh = playpos[0];
    dl = 0;
    playpos++;
  }
  
#if DVOLUME

	// If I move this volume code into the SD card reader code then there will be a significant delay when volume is changed.

	//uint16_t tmp = (dh << 8) | dl;
	//tmp >>= playing->volume;
	//dh = tmp >> 8;
	//dl = tmp;

  	//uint16_t tmp = ((dh << 8) | dl) >> 4;

	// All these bit shifts are expensive.  A shift by 4 requires 4 clocks to execute!	

 	uint16_t tmp = (dh << 4) | (dl >> 4);
	tmp >>= playing->volume;

#endif //DVOLUME

/*
  // dac chip select low
  mcpDacCsLow();
  
  // send DAC config bits
  mcpDacSdiLow();
  mcpDacSckPulse();  // DAC A
  mcpDacSckPulse();  // unbuffered
  mcpDacSdiHigh();
  mcpDacSckPulse();  // 1X gain
  mcpDacSckPulse();  // no SHDN
  
  // send high 8 bits
  mcpDacSendBit(dh,  7);
  mcpDacSendBit(dh,  6);
  mcpDacSendBit(dh,  5);
  mcpDacSendBit(dh,  4);

  mcpDacSendBit(dh,  3);
  mcpDacSendBit(dh,  2);
  mcpDacSendBit(dh,  1);
  mcpDacSendBit(dh,  0);
  
  // send low 4 bits
  mcpDacSendBit(dl,  7);
  mcpDacSendBit(dl,  6);
  mcpDacSendBit(dl,  5);
  mcpDacSendBit(dl,  4);
  
  // chip select high - done
  mcpDacCsHigh();
*/



     // Take the SS pin low to select the DAC.
             
     		PORTB &= ~0b10000; // Direct port access.  Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p  

     // Send the 16 bits needed to reconfigure the DAC.
		
		//SPI.transfer((tmp >> 8)|0b00110000);
		//SPI.transfer(tmp);

#if DVOLUME 

    		SPDR = (tmp >> 8)|0b00110000;
    		while (!(SPSR & _BV(SPIF)));
		
    		SPDR = tmp;
    		while (!(SPSR & _BV(SPIF)));

#else

    		SPDR = 0b00110000 | (dh >> 4);
    		while (!(SPSR & _BV(SPIF)));
		
    		SPDR = (dh << 4) | (dl >> 4);
    		while (!(SPSR & _BV(SPIF)));

#endif

     // Take the SS pin high to de-select the DAC:
      	     
      	PORTB |= 0b10000;

}

DVOLUME is 0 currently so the version I’m using doesn’t adjust the volume. If I can’t have smooth volume control I don’t think I want it at all. I can’t really spare the cpu cycles.
Something might be able to be done with reading the values directly from the array with an int cast rather than putting the values in dh and dl, but I don’t know if that would help at all.

Why aren't you using the SPI port? Can't get faster than the hardware specifically designed just for that.

Speaking as a compiler guy (but one who has so far not looked at the AVR instruction set), if you are running out of cycles doing a simple shift, it is time to consider going to a faster processor, such as an ARM (the Due, Teensy 3.0). I believe Arm/thumb has a single instruction that does a shift, though you need to look at the data sheets for the respective processors whether the shift occurs a bit/few bits at a time or is done via a barrel shifter in a fixed number of cycles.

The barrel shifter can add a lot of real estate to the chip, and presumably low cost/low power devices might not consider it a must have feature, and fall back to doing a single bit shift in a microcoded loop.

  // send high 8 bits
  mcpDacSendBit(dh,  7);
  mcpDacSendBit(dh,  6);
  mcpDacSendBit(dh,  5);
  mcpDacSendBit(dh,  4);

  mcpDacSendBit(dh,  3);
  mcpDacSendBit(dh,  2);
  mcpDacSendBit(dh,  1);
  mcpDacSendBit(dh,  0);
  
  // send low 4 bits
  mcpDacSendBit(dl,  7);
  mcpDacSendBit(dl,  6);
  mcpDacSendBit(dl,  5);
  mcpDacSendBit(dl,  4);

boggle

Well, problem solved then. That’s incredibly slow compared to using SPI.

Two solutions:

  • Use SPI, the time you save will give you time to adjust the volume
  • Recode for each volume level, eg.
  // send high 8 bits
  //  <---- send a zero here
  mcpDacSendBit(0,  7);

 // remaining 11 bits
  mcpDacSendBit(dh,  7);
  mcpDacSendBit(dh,  6);
  mcpDacSendBit(dh,  5);
  mcpDacSendBit(dh,  4);

  mcpDacSendBit(dh,  3);
  mcpDacSendBit(dh,  2);
  mcpDacSendBit(dh,  1);
  mcpDacSendBit(dh,  0);
  
  // send low 4 bits
  mcpDacSendBit(dl,  7);
  mcpDacSendBit(dl,  6);
  mcpDacSendBit(dl,  5);

 // <--- drop low-order bit

Make a function for each of the possible 12 volume levels (each bit shift). Or use a switch statement. Whatever.

CrossRoads: Why aren't you using the SPI port? Can't get faster than the hardware specifically designed just for that.

I am. Check the code again, the non-spi stuff is commented out. That's the original code the library used, because it used the SPI for reading the SD card. Since I am using a 1284p I have two USARTs to play with and am using one for the SD card, SPI for the DAC and TLC5947 led drivers, and the other UART I'm using for serial debugging. (Thank god I found a way to make USART1 work for that cause debugging by blinking an LED really sucked.)

Make a function for each of the possible 12 volume levels (each bit shift). Or use a switch statement. Whatever.

Interesting idea, but as I mentioned, that code is commented out and the SPI is far faster. But there's still shifts needed to do the SPI stuff that might be optimized out to some degree. And there's still volume control which would be nice to have. But not at the expense of 47 cycles or whatever it was someone mentioned.

I don’t suppose you want to store 12 copies of the file? One for each volume level? (precomputed)

"But there’s still shifts needed to do the SPI stuff that might be optimized out to some degree. "
Can you point to where that is in the code? I went thru the first page of the thread, maybe I’m just tired (been designing & routing all night) and I didn’t see SPI.transfer() anywhere.

Ah - buried at the end of reply #11. ok.

Yes, well.

@OP: when you post code on the forum, for advice, and it isn't colour-coded, perhaps you could omit lengthy sequences which are just comments, such as:

/*
  // dac chip select low
  mcpDacCsLow();
  
  // send DAC config bits
  mcpDacSdiLow();
  mcpDacSckPulse();  // DAC A
  mcpDacSckPulse();  // unbuffered
  mcpDacSdiHigh();
  mcpDacSckPulse();  // 1X gain
  mcpDacSckPulse();  // no SHDN
  
  // send high 8 bits
  mcpDacSendBit(dh,  7);
  mcpDacSendBit(dh,  6);
  mcpDacSendBit(dh,  5);
  mcpDacSendBit(dh,  4);

  mcpDacSendBit(dh,  3);
  mcpDacSendBit(dh,  2);
  mcpDacSendBit(dh,  1);
  mcpDacSendBit(dh,  0);
  
  // send low 4 bits
  mcpDacSendBit(dl,  7);
  mcpDacSendBit(dl,  6);
  mcpDacSendBit(dl,  5);
  mcpDacSendBit(dl,  4);
  
  // chip select high - done
  mcpDacCsHigh();
*/

OK, it's a comment. But we look at a lot of code each day, and this is a bit of a trick doing that.