Pages: 1 2 [3]   Go Down
Author Topic: Another optimization question - can I speed up this 32 bit multiply?  (Read 2228 times)
0 Members and 1 Guest are viewing this topic.
Global Moderator
Offline Offline
Brattain Member
*****
Karma: 474
Posts: 18696
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

Does that mean the \t isn't needed?  

Being after the newline, effectively it adds whitespace at the start of the next line. The optiboot example had spaces before the opcodes. I presume that if you have something before a space it is a label. Personally I like to see my lines in one place. In other words, if you want leading spaces you have them on that line, not as a trailing space on the previous line.
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1355
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Does that mean the \t isn't needed?  

Being after the newline, effectively it adds whitespace at the start of the next line. The optiboot example had spaces before the opcodes. I presume that if you have something before a space it is a label. Personally I like to see my lines in one place. In other words, if you want leading spaces you have them on that line, not as a trailing space on the previous line.

Makes sense to me.
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1355
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Just some disassemblies I did while trying to optimize the C code:

Here was the original code, where I did the obvious thing and shifted out the unnecessary bits and moved the high byte into place for my 12-bit sample:

Code:
// Calculate 12-bit sample:
sample = ((0X80 ^ playpos[1])<<4) | playpos[0]>>4;
    1e64: 81 81        ldd r24, Z+1 ; 0x01
    1e66: 80 58        subi r24, 0x80 ; 128
    1e68: 48 2f        mov r20, r24
    1e6a: 50 e0        ldi r21, 0x00 ; 0
    1e6c: b4 e0        ldi r27, 0x04 ; 4
    1e6e: 44 0f        add r20, r20
    1e70: 55 1f        adc r21, r21
    1e72: ba 95        dec r27
    1e74: e1 f7        brne .-8      ; 0x1e6e <__vector_13+0xfc>
    1e76: 80 81        ld r24, Z
    1e78: 90 e0        ldi r25, 0x00 ; 0
    1e7a: a4 e0        ldi r26, 0x04 ; 4
    1e7c: 95 95        asr r25
    1e7e: 87 95        ror r24
    1e80: aa 95        dec r26
    1e82: e1 f7        brne .-8      ; 0x1e7c <__vector_13+0x10a>
    1e84: 48 2b        or r20, r24
    1e86: 59 2b        or r21, r25


Then I figured that if I were optimizing it by hand, surely it would be easier to optimize out a shift by 8 instead, so I tried that:

Code:
sample = 0X80 ^ playpos[1];
    1e64: 81 81        ldd r24, Z+1 ; 0x01
sample = sample << 8;
    1e66: 38 2f        mov r19, r24
    1e68: 30 58        subi r19, 0x80 ; 128
    1e6a: 20 e0        ldi r18, 0x00 ; 0
sample = sample | playpos[0];
    1e6c: 80 81        ld r24, Z
    1e6e: 48 2f        mov r20, r24
    1e70: 50 e0        ldi r21, 0x00 ; 0
    1e72: 42 2b        or r20, r18
    1e74: 53 2b        or r21, r19
sample = sample >> 4;
    1e76: a4 e0        ldi r26, 0x04 ; 4
    1e78: 56 95        lsr r21
    1e7a: 47 95        ror r20
    1e7c: aa 95        dec r26
    1e7e: e1 f7        brne .-8      ; 0x1e78 <__vector_13+0x106>

I haven't actually counted the cycles, but I see one less jump in there and I think four fewer loops as a result.  And it looks simpler.
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1355
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

And here's the code if I generate a 16 bit sample instead:

Code:
sample = 0X80 ^ playpos[1];
    1e64: 81 81        ldd r24, Z+1 ; 0x01
sample = sample << 8;
    1e66: 38 2f        mov r19, r24
    1e68: 30 58        subi r19, 0x80 ; 128
    1e6a: 20 e0        ldi r18, 0x00 ; 0
sample = sample | playpos[0];
    1e6c: 80 81        ld r24, Z
    1e6e: 48 2f        mov r20, r24
    1e70: 50 e0        ldi r21, 0x00 ; 0
    1e72: 42 2b        or r20, r18
    1e74: 53 2b        or r21, r19

Now why generate a 16 bit sample if I need a 12 bit sample?  Well, I need to do a shift later for the volume anyway, and the 8x16->24 multiply routine can handle a 16 bit sample without overflowing...  So maybe if I shift by 12...

Code:
uint32_t tmp;
MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
    1e86: 2f 85        ldd r18, Y+15 ; 0x0f
    1e88: 00 27        eor r16, r16
    1e8a: 24 9f        mul r18, r20
    1e8c: c0 01        movw r24, r0
    1e8e: a0 2f        mov r26, r16
    1e90: b0 2f        mov r27, r16
    1e92: 25 9f        mul r18, r21
    1e94: 90 0d        add r25, r0
    1e96: a1 1d        adc r26, r1
    1e98: b0 1f        adc r27, r16
    1e9a: 11 24        eor r1, r1

//sample = tmp >> 8; // For 12-bit sample input.  This should be optimized to just copying some bytes.
sample = tmp >> 12; // For 16-bit sample input.  (output 12 bit sample)
    1e9c: 2c e0        ldi r18, 0x0C ; 12
    1e9e: b6 95        lsr r27
    1ea0: a7 95        ror r26
    1ea2: 97 95        ror r25
    1ea4: 87 95        ror r24
    1ea6: 2a 95        dec r18
    1ea8: d1 f7        brne .-12      ; 0x1e9e <__vector_13+0x12c>
    1eaa: 28 2f        mov r18, r24

Hm, well how does that compare to when I shifted before I did the volume adjustment?

Code:
// Adjust 12-bit sample by volume:
uint32_t tmp;
MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
    1e9c: 2f 85        ldd r18, Y+15 ; 0x0f
    1e9e: 00 27        eor r16, r16
    1ea0: 24 9f        mul r18, r20
    1ea2: c0 01        movw r24, r0
    1ea4: a0 2f        mov r26, r16
    1ea6: b0 2f        mov r27, r16
    1ea8: 25 9f        mul r18, r21
    1eaa: 90 0d        add r25, r0
    1eac: a1 1d        adc r26, r1
    1eae: b0 1f        adc r27, r16
    1eb0: 11 24        eor r1, r1
sample = tmp >> 8; // This should be optimized to just copying some bytes.
    1eb2: 89 2f        mov r24, r25
    1eb4: 9a 2f        mov r25, r26
    1eb6: ab 2f        mov r26, r27
    1eb8: bb 27        eor r27, r27
    1eba: 28 2f        mov r18, r24

Hm...

I wonder how the shift by 12 compares to doing a shift by 8 and then a shift by 4?

Code:
// Adjust sample by volume:
uint32_t tmp;
MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume
    1e86: 2f 85        ldd r18, Y+15 ; 0x0f
    1e88: 00 27        eor r16, r16
    1e8a: 24 9f        mul r18, r20
    1e8c: c0 01        movw r24, r0
    1e8e: a0 2f        mov r26, r16
    1e90: b0 2f        mov r27, r16
    1e92: 25 9f        mul r18, r21
    1e94: 90 0d        add r25, r0
    1e96: a1 1d        adc r26, r1
    1e98: b0 1f        adc r27, r16
    1e9a: 11 24        eor r1, r1

sample = tmp >> 8; // For 12-bit sample input.  This should be optimized to just copying some bytes.
    1e9c: 89 2f        mov r24, r25
    1e9e: 9a 2f        mov r25, r26
    1ea0: ab 2f        mov r26, r27
    1ea2: bb 27        eor r27, r27
    1ea4: 9c 01        movw r18, r24
//sample = tmp >> 12; // For 16-bit sample input.  (output 12 bit sample)
sample = sample >> 4; // For 16-bit sample input.  (output 12 bit sample)
    1ea6: 84 e0        ldi r24, 0x04 ; 4
    1ea8: 36 95        lsr r19
    1eaa: 27 95        ror r18
    1eac: 8a 95        dec r24
    1eae: e1 f7        brne .-8      ; 0x1ea8 <__vector_13+0x136>
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1355
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Shift by 12:

Code:
(1)    1e9c: 2c e0        ldi r18, 0x0C ; 12
(1)    1e9e: b6 95        lsr r27
(1)    1ea0: a7 95        ror r26
(1)    1ea2: 97 95        ror r25
(1)    1ea4: 87 95        ror r24
(1)    1ea6: 2a 95        dec r18
(1/2) 1ea8: d1 f7        brne .-12      ; 0x1e9e <__vector_13+0x12c>
(1)    1eaa: 28 2f        mov r18, r24

I count 8 cycles in that loop, x12, -1 so 95 cycles +1 for the mov, so 96.

And now the shift by 8 then a shift by 4:

Code:
(1)    1e9c: 89 2f        mov r24, r25
(1)    1e9e: 9a 2f        mov r25, r26
(1)    1ea0: ab 2f        mov r26, r27
(1)    1ea2: bb 27        eor r27, r27
(1)    1ea4: 9c 01        movw r18, r24
sample = sample >> 4; // For 16-bit sample input.  (output 12 bit sample)
(1)    1ea6: 84 e0        ldi r24, 0x04 ; 4
(1)    1ea8: 36 95        lsr r19
(1)    1eaa: 27 95        ror r18
(1)    1eac: 8a 95        dec r24
(1/2) 1eae: e1 f7        brne .-8      ; 0x1ea8 <__vector_13+0x136>

Jeez, I can already tell before I even add it up that that is going to be way less.

Let's see, 5 cycles for the shift by 8, and 6 cycles in that loop x4, -1 (unless I shouldn't count the ldi instruction?  I can't tell where the jump is jumping back to but I think maybe it's to lsr)
That gives me a total of.... 28 cycles!  Holy crap that is 3x as fast.

I guess the lesson here is if there's a shift by 8 that can be done, do it explicitly, because the compiler ain't gonna do it for ya.  And don't shift by 4 twice if you can do a shift by 8 instead.
Logged

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1355
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

New and improved hardware SPI DAC update code for WaveHC, with fine volume control:

(volume needs to be changed to a uint8_t in the wave class if it isn't, I forget what it defaulted to)

Code:
// Beware hidden spaces after \'s in defines!
// I guess no ; is needed after the asm because when you call MultiU8X16to24() you will put one after it, and that will just be replaced with the code below.

#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"mul %A1, %B2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)


//------------------------------------------------------------------------------
// timer interrupt for DAC
ISR(TIMER1_COMPA_vect) {
  if (!playing) return;

  if (playpos >= playend) { // If we've reached the end of the buffer...
    if (sdstatus == SD_READY) {
   
      // swap double buffers
      playpos = sdbuff;
      playend = sdend;
      sdbuff = sdbuff != buffer1 ? buffer1 : buffer2;
     
      sdstatus = SD_FILLING;
      // interrupt to call SD reader
    TIMSK1 |= _BV(OCIE1B);
    }
    else if (sdstatus == SD_END_FILE) {
      playing->stop();
      return;
    }
    else {
      // count overrun error if not at end of file
      if (playing->remainingBytesInChunk) {
        playing->errors++;
      }
      return;
    }
  }

  //uint8_t dh, dl;

  uint16_t sample;

  if (playing->BitsPerSample == 16) {
 
    // 16-bit is signed
    //dh = 0X80 ^ playpos[1]; // Flip most significant bit. ^ = XOR, and XOR sets bits that are the same to be 0, and different to be 1.  So xoring 101 with 000 gives 101.  While xoring 101 with 100 gives 001.  I guess fipping this one bit is all that is needed to convert from signed to unsigned!
    //dl = playpos[0];
   
// Calculate 16-bit sample:

sample = 0X80 ^ playpos[1];
sample = sample << 8;
sample = sample | playpos[0];

playpos += 2;

  }
  else {
 
    // 8-bit is unsigned
    //dh = playpos[0];
    //dl = 0;

// Calculate 16-bit sample:
sample = playpos[0]<<8;   

    playpos++;

  }
 
#if DVOLUME

// dh, dl -> 16 bit tmp:
//uint16_t tmp = (dh << 8) | dl;
//tmp >>= playing->volume;

// dh, dl -> 12 bit tmp:
  //uint32_t tmp = (dh << 4) | (dl >> 4);
//tmp = (tmp * playing->volume) >> 10; // Volume is 10 bit, so 0..1023.  4095*1023 / 1024 = 4091 max.

// Adjust sample by volume:
uint32_t tmp;
MultiU8X16to24(tmp, playing->volume, sample); // (uint32_t) tmp = (uint16_t) sample * (uint8_t) volume

sample = tmp >> 8; // Divide by 256 to finish adusting sample by volume.   
sample = sample >> 4; // Convert 16 bit sample to 12 bit.

// It is 3x faster to shift by 8 and then a shift by 4 than to let the compiler figure out it can do the same with a shift by 12!

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
      PORTB &= ~0b10000;

     // Send 16 bits to the DAC.
// First 4 bits are configuration bits. 0011 = (DAC A, unbuffered, 1x gain, not muted)
// Last 12 bits specify the output voltage. (MSB first) 

//SPI.transfer((tmp >> 8)|0b00110000);
//SPI.transfer(tmp);

#if DVOLUME

    SPDR = (sample >> 8)|0b00110000; // sample is 12-bit
    while (!(SPSR & _BV(SPIF)));

    SPDR = sample; //SPDR = tmp;
    while (!(SPSR & _BV(SPIF)));

#else

// This code is broken because I no longer store dh and dl seperately.

    SPDR = 0b00110000 | (dh >> 4);
    while (!(SPSR & _BV(SPIF)));

    SPDR = (dh << 4) | (dl >> 4);
    while (!(SPSR & _BV(SPIF)));

#endif

     // Take the SS pin high to de-select the DAC:
      PORTB |= 0b10000;

}

This could probably be optimized quite a bit more if pure assembler was used stating from the point where the sample data was loaded into the sample variable.  If some of those shifts could be removed that would save a lot of cycles.
« Last Edit: November 06, 2012, 06:41:13 am by scswift » Logged

Valencia, Spain
Offline Offline
Faraday Member
**
Karma: 143
Posts: 5316
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

This thread leads me to another question:

If the chip has a hardware multiplier which takes two clock cycles...why doesn't the compiler use it to do shift operations instead of creating a loop of single-bit shifts?

Logged

No, I don't answer questions sent in private messages (but I do accept thank-you notes...)

Manchester, New Hampshire
Offline Offline
Edison Member
*
Karma: 4
Posts: 1355
Propmaker
View Profile
 Bigger Bigger  Smaller Smaller  Reset Reset

Hm... tried to get rid of one assembly instruction and have the OR operation done to r19 instead of copying it to another register and doing it on that, but the compiler instead added a movw command earlier in the code to move the data into new registers anyway:

Code:
sample = tmp >> 8; // For 12-bit sample input.  This should be optimized to just copying some bytes.
    1e9c: 89 2f        mov r24, r25
    1e9e: 9a 2f        mov r25, r26
    1ea0: ab 2f        mov r26, r27
    1ea2: bb 27        eor r27, r27
    1ea4: 9c 01        movw r18, r24
//sample = tmp >> 12; // For 16-bit sample input.  (output 12 bit sample)
sample = sample >> 4; // For 16-bit sample input.  (output 12 bit sample)
    1ea6: 84 e0        ldi r24, 0x04 ; 4
    1ea8: 36 95        lsr r19
    1eaa: 27 95        ror r18
    1eac: 8a 95        dec r24
    1eae: e1 f7        brne .-8      ; 0x1ea8 <__vector_13+0x136>

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
      PORTB &= ~0b10000;
    1eb0: 2c 98        cbi 0x05, 4 ; 5
//SPI.transfer(tmp);

#if DVOLUME

//SPDR = (tmp >> 8)|0b00110000; // tmp = 12-bit.  Does the C compiler optimize out the shift here?
    SPDR = (sample >> 8)|0b00110000; // sample = 12-bit
    1eb2: 83 2f        mov r24, r19
    1eb4: 80 63        ori r24, 0x30 ; 48
    1eb6: 8e bd        out 0x2e, r24 ; 46
    while (!(SPSR & _BV(SPIF)));
    1eb8: 0d b4        in r0, 0x2d ; 45
    1eba: 07 fe        sbrs r0, 7
    1ebc: fd cf        rjmp .-6      ; 0x1eb8 <__vector_13+0x146>

Code:
sample = tmp >> 8; // Divide by 256 to finish adusting sample by volume.   
    1e9c: 89 2f        mov r24, r25
    1e9e: 9a 2f        mov r25, r26
    1ea0: ab 2f        mov r26, r27
    1ea2: bb 27        eor r27, r27
sample = sample >> 4; // Convert 16 bit sample to 12 bit.
    1ea4: 24 e0        ldi r18, 0x04 ; 4
    1ea6: 96 95        lsr r25
    1ea8: 87 95        ror r24
    1eaa: 2a 95        dec r18
    1eac: e1 f7        brne .-8      ; 0x1ea6 <__vector_13+0x134>
// It is 3x faster to shift by 8 and then a shift by 4 than to let the compiler figure out it can do the same with a shift by 12!

#endif //DVOLUME

     // Take the SS pin low to select the DAC. (Bit 4 on port B is the hardware SPI slave select pin on the Atmega1284p)
      PORTB &= ~0b10000;
    1eae: 2c 98        cbi 0x05, 4 ; 5
//SPI.transfer((tmp >> 8)|0b00110000);
//SPI.transfer(tmp);

#if DVOLUME

sample = sample | 0b0011000000000000;
    1eb0: 90 63        ori r25, 0x30 ; 48
SPDR = sample >> 8;
    1eb2: 9e bd        out 0x2e, r25 ; 46
    //SPDR = (sample >> 8)|0b00110000; // sample is 12-bit
    while (!(SPSR & _BV(SPIF)));
    1eb4: 0d b4        in r0, 0x2d ; 45
    1eb6: 07 fe        sbrs r0, 7
    1eb8: fd cf        rjmp .-6      ; 0x1eb4 <__vector_13+0x142>
Logged

Global Moderator
Offline Offline
Brattain Member
*****
Karma: 474
Posts: 18696
Lua rocks!
View Profile
WWW
 Bigger Bigger  Smaller Smaller  Reset Reset

If the chip has a hardware multiplier which takes two clock cycles...why doesn't the compiler use it to do shift operations instead of creating a loop of single-bit shifts?

I don't know. Possibly we don't have the latest version of the compiler. Possibly that particular optimization was omitted from the code generation section.
Logged

Pages: 1 2 [3]   Go Up
Jump to: