bitshifting problem - slows down

Hi all,

I'm working on a synthesizer project where I'm using the DDS method (digital direct synthesis) to produce a very high frequency resolution. Here's a description of a similar project: http://interface.khm.de/index.php/lab/experiments/arduino-dds-sinewave-generator/ Everything worked fine till I wanted to use a 128byte wabetable instad of a 256byte one. For the new wavetable I hade to make a bitshift to the right of 25 bits, because I needed just 7bits (128steps) to step through the whole wavetable. But now, the ISR needs to much time, and the whole aplication gets very slow. That meens that the main loop get started too infrequently. I think I's because of the bitshifting that takes too much time, but why is it a problem with bitshifting 25bits and is no problem with 24bits. I'm very pleasd about any suggestion!

thanx lukas

volatile byte icnt;
volatile unsigned long phaccu = 0;     // pahse accumulator in 32Bit
volatile unsigned long tword_m = 0;     // dds tuning word m in 32Bit

ISR(TIMER2_OVF_vect) {

    phaccu = phaccu + tword_m; 

    //icnt=phaccu >> 24;     // use upper 8 bits for phase accu as frequency information for 256byte wavetable
    icnt=phaccu >> 25;         // use upper 7 bits for phase accu as frequency information for 128byte wavetable

    //OCR2A=(float) pgm_read_byte_near(sine256 + icnt);
    OCR2A=(float) pgm_read_byte_near(sine128 + icnt);

   if(count++ == 125) {     //timing for the main loop
        divide = true;      //main loop starts when "divide == true"
        count=0;
    }
}

is this line - OCR2A=(float) pgm_read_byte_near(sine128 + icnt); - not the problem?

An explicit convertion to float may cost time too : line - OCR2A=(float) pgm_read_byte_near(sine128 + icnt); - As OCR2A is an "integer" type the question is if the conversion to float is needed.

the line - icnt=phaccu >> 25; - might be sped up --- see code -- by mapping the var also to a byte array in an union so only 1 shift is needed iso 25

you might give it a try...

union t
{
  byte x[4];
  unsigned long phaccu;
};

volatile byte icnt;
volatile t temp;
temp.phaccu = 0;

volatile unsigned long tword_m = 0;     // dds tuning word m in 32Bit

ISR(TIMER2_OVF_vect) {

    phaccu = phaccu + tword_m; 

    //icnt=phaccu >> 24;     // use upper 8 bits for phase accu as frequency information for 256byte wavetable
    //icnt=phaccu >> 25;     // use upper 7 bits for phase accu as frequency information for 128byte wavetable

        icnt = temp.x[0] >> 1;    //  icnt=phaccu >> 25 


    //OCR2A=(float) pgm_read_byte_near(sine256 + icnt);
    OCR2A=(float) pgm_read_byte_near(sine128 + icnt);

   if(count++ == 125) {     //timing for the main loop
        divide = true;      //main loop starts when "divide == true"
        count=0;
    }
}

And as robtillaart already mentioned, you've got a lot of work ahead of you to speedup the code if you plan to do high frequency work as you mentioned. Some things to consider:

  • The extra 128 bytes of a 256 table is negligible if the code is already debugged and working - you may wish to go back to it and optimize after you're seeing positive results (the number 256 also gives you some machine code optimizations you may need later, like no masking of BYTEs, saving an instruction or two now and then).

  • Float on the Arduino is bad - a huge amount of instructions, and if you can avoid it until absolutely necessary, you'll see a big difference in speed.

Back to this question

but why is it a problem with bitshifting 25bits and is no problem with 24bits.

24 bits is exactly 3 bytes , for computers a "round" number, which can be optimized (by a similar assembly trick as I proposed) the long is 4 bytes and shifting 24 means just "copy" the first byte of the 4

25 bits is 3 1/8 byte, not easy optimizable ... so probably it isn't. (I hope my handcoded optimization works.)

A quick test on an 328 16Mhz

void setup()
{  
  Serial.begin(115200);
  volatile unsigned long x = 0xAA555555;
  volatile byte y = 0;
  
  unsigned long start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = x >> 24;
  }
  Serial.println(millis() - start);
  
  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = x >> 25;
  }
  Serial.println(millis() - start);
  
}

void loop()
{}

Output
168
1246

SO the shift 25 takes > 7 x much time

Now with the optimizing trick:

union t
{
    byte s[4];
    unsigned long x;
} volatile temp;

void setup()
{  
  Serial.begin(115200);
  temp.x = 0xAA555555;
  volatile byte y = 0;
  
  unsigned long start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.x >> 24;
  }
  Serial.println(millis() - start);

  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.s[0];
  }
  Serial.println(millis() - start);

  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.x >> 25;
  }
  Serial.println(millis() - start);
  
  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.s[0] >> 1;
  }
  Serial.println(millis() - start);
  
}

void loop()
{}

Output
168
107
1245
126

This means the optimization of the shift 25 will probably work in your code too - factor 10! that is more than I expected.
Furthermore the shift 24 can be faster too; from 168 → 107 = ~35% faster

think this helps :wink:

fun!

Incredible results - I did not expect the bit shift to be so slow (or put another way, the compiler optimization for the other items to be so good).

I added one test to the list, using a bitfield struct combined with the union:

union t2
{
    struct
    {
      byte s: 7;
      byte y: 25;
    } z;
    unsigned long x;
} volatile temp2;

...

temp2.x = 0xAA555555;
  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp2.z.s; 
  }
  Serial.println(millis() - start);

However, the result was 113 - not a great deal diff from 126.

byte y: 25; a bit with the size of 25 bits ??? that is quite good compression scheme :slight_smile: I understand it is a filler

The test missing is a test to see shifting from 1-31 [ OK we did 1 allready]

With a variable (add this to prev sketch

  for (int a=0; a <32; a++)
  {
    start = millis();
    for (long  i=0; i< 100000L; i++)
    {
      y = temp.x >> a;
    }
    Serial.println(millis() - start);
  }

176 - 0
221
264
308
352

396 - 5
441
484
529
572

615 - 10
661
704
749
793

835 - 15
880
923
968
1013

1057 - 20
1100
1143
1188
1233

1277 -25
1321
1365
1408
1452

1496 - 30
1540
Average = 858

for shift 25 : use a constant: 1246 use a variable: 1277

tried to optimize with the var.

  for (int a=20; a <32; a++)
  {
    start = millis();
    for (long  i=0; i< 100000L; i++)
    {
      if (a < 25)     // should be 24 in fact !! can also be optimized.
        y = temp.x >> a;
      else
        y = temp.s[0] >> (a-24);
    }
    Serial.println(millis() - start);
  }

Output:
1088 - 20
1131
1176
1219
1264 - 24
201 - 25 optimized !! YES !!
232
264
295
327
357
391

Think if the union is extended to include an unsigned int too, the code could map shift 16-23 upon 0-7 (int) and shift 24-31 upon 0-7 (byte)

OK, next version with an int and byte array in the union

union t
{
  byte s[4];
  int  t[2];
  unsigned long x;
} volatile temp;

void setup()
{  
  Serial.begin(115200);

  temp.x = 0xAA555555;
  volatile byte y = 0;

  unsigned long start = millis();

  for (long  i=0; i< 100000L; i++)
  {
    y = temp.x >> 24;
  }
  Serial.println(millis() - start);

  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.s[0];
  }
  Serial.println(millis() - start);


  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.x >> 25;
  }
  Serial.println(millis() - start);

  start = millis();
  for (long  i=0; i< 100000L; i++)
  {
    y = temp.s[0] >> 1;
  }
  Serial.println(millis() - start);

  Serial.println("====================");
  Serial.println("Optimized shift 0-31");
  Serial.println("====================");
  for (int a=0; a <32; a++)
  {
    start = millis();
    for (long  i=0; i< 100000L; i++)
    {
      if (a > 23) y = temp.s[0] >> (a-24);
      else if (a > 15) y = temp.t[0] >> (a-16);
      else y = temp.x >> a;
    }
    Serial.println(millis() - start);
  }
}

void loop()
{
}

Output:

Optimized shift 0-31

227
271
314
359
402
447
490
535
578
623
665
710
755
798
842
887
208
239
270
302
334
364
397
428
176
208
238
271
302
334
364
397
Average = 429.2 (compared to Average = 858 above)

Observations & Conclusions:

  1. the shift 8-15 are relatively high
  2. looking at all possible values for the shift speed is up factor 2.

Final version; replace relevant part

This optimizes the shift 8…15, and the average is further down!

  Serial.println("====================");
  Serial.println("Optimized shift 0-31");
  Serial.println("====================");
  for (int a=0; a <32; a++)
  {
    start = millis();
    for (long  i=0; i< 100000L; i++)
    {
      if (a > 23) y = temp.s[0] >> (a-24);
      else if (a > 15) y = temp.t[0] >> (a-16);
      else if (a > 7) 
      {
        temp.s[3] = temp.s[2];
        temp.s[2] = temp.s[1];
        temp.s[1] = temp.s[0];
        temp.s[0] = 0;
        y = temp.x >> (a-8);
      }
      else y = temp.x >> a;
    }
    Serial.println(millis() - start);
  }

Output:
258
302
347
389
433
478
522
567
352
396
440
485
527
573
616
661
213
244
277
307
339
371
402
434
183
214
245
277
308
340
372
402
Average: 384

Compared to the original 858 above = 384/858 => factor 2.23 on average

If speed is an issue the rightshift can be optimized quite a bit, (factor depends on the #bits).

I leave the (investigation and) optimizing of the leftshift open for the reader. :wink:

An explicit convertion to float may cost time too : line - OCR2A=(float) pgm_read_byte_near(sine128 + icnt); - As OCR2A is an "integer" type the question is if the conversion to float is needed.

Unfortunately, I think I have to do this, because the whole code looks like this:

OCR2A=(float) pgm_read_byte_near(sine128 + icnt) * adsr;

adsr is a float varable to controle the volume of the wavetable. If anyone has a suggestion for optimizing, I would be very happy!

robtillaart, thanx a lot for this detailed optimizing reserch! But i don't know how i can deal with the byte-array in my case. How can I do this line:

phaccu = phaccu + tword_m;

if "tword_m" is a float and "phaccu" is a byte-array, without a lot of other bitshifting. The purpos of this line is that the "phaccu" get filled constantly with the step of "tword_m" and when it's over the 32bits, it starts from the beginning.

(sorry for my terrible english, I hope you can understand what I mean!) :)

if "tword_m" is a float and "phaccu" is a byte-array, without a lot of other bitshifting. The purpos of this line is that the "phaccu" get filled constantly with the step of "tword_m" and when it's over the 32bits, it starts from the beginning.

In your original code, tword_m is not a float. If tword_m is an array step size, it must be an integer type.

adsr is a float varable to controle the volume of the wavetable. If anyone has a suggestion for optimizing, I would be very happy!

A fixed-point variable?

In your original code, tword_m is not a float.

Oh yea, your right, I mean tword_m is a unsigned long. But I still don't know how to do the line

phaccu = phaccu + tword_m;

when the phaccu is a byte-array, without other bitshifting operations for tword_m.

A fixed-point variable?

Hmm this sounds very obvious! I realy dont need so many digits after the comma :slight_smile: But I’m sorry, what are fixed-point variables again! I’m quite a newbee in programming! :blush:

But I still don’t know how to do the line…when the phaccu is a byte-array, without other bitshifting operations for tword_m.

You can’t add set a whole array to anything in one statement. So, the fundamental question is meaningless.

Please explain what you are trying to do.

You can't add set a whole array to anything in one statement.

Yea, thats exactly my problem. :)

The two main operations i have to do witch phaccu and tword_m are these:

phaccu = phaccu + tword_m; icnt=phaccu >> 25;

robtillaart made the suggestion to do the bitshifting with phaccu as a byte array instead of an unsigned long. That's why I'm looking for a way to do the addition of a "unsigned long" with a byte-array. And for me, that seams to get more time-consuming than the bitshift of 25.

That's why I'm looking for a way to do the addition of a "unsigned long" with a byte-array.

The second "byte array" is just another unsigned long. That's how phase accumulators work.

Ah ok, now I got it! I didn't know the funktion of union. Ok, thats a really cool tool!:)

Thnx everybody for helping!

robtillaart, thanx a lot for this detailed optimizing reserch! But i don't know how i can deal with the byte-array in my case. How can I do this line:

Here a rewrite of your original routine, I added the "union trick", leaving your code intact except for the calculation of icnt so minimal change. If phaccu is not used outside this routine it can be even a tiny bit faster.

Please check if it is fast enough to solve your original problem.

volatile byte icnt;

// added to speed up shift 
union t
{
  byte b[4];
  unsigned long l;
} volatile temp;

unsigned long phaccu = 0;      // phase accumulator in 32Bit

volatile unsigned long tword_m = 0;     // dds tuning word m in 32Bit

ISR(TIMER2_OVF_vect)
{
        phaccu = phaccu + tword_m; 
        temp.l = phaccu;    // work copy

    // icnt=phaccu >> 24;     // use upper 8 bits for phase accu as frequency information for 256byte wavetable
        // icnt = temp.b[0];

    // icnt=phaccu >> 25;         // use upper 7 bits for phase accu as frequency information for 128byte wavetable
        icnt = temp.b[0] >> 1;

    //OCR2A=(float) pgm_read_byte_near(sine256 + icnt);
    OCR2A = pgm_read_byte_near(sine128 + icnt);   

   if (count++ == 125) {     //timing for the main loop
        divide = true;      //main loop starts when "divide == true"
        count=0;
    }
}

OCR2A=(float) pgm_read_byte_near(sine128 + icnt) * adsr; adsr is a float varable to controle the volume of the wavetable. If anyone has a suggestion for optimizing, I would be very happy!

What is the range of adsr ?

[u]Assumption based on discussion so far[/u] [u]adsr is a float that represents the volume and it goes from 0.0 - 1.0 you can use a byte instead, lets name it adsrBYTE range 0..255.[/u] [u]If the volume is read from analogRead(pin) you have a value from 0..1023 which needed to be divided by 4 to get adsrBYTE;[/u] [u]OCR2A = pgm_read_byte_near(sine128 + icnt) * adsrBYTE / 256;[/u]