To shift or not to shift?

Dividing by 128 would be better as it can be replaced by a bitshift. You would need to adjust 270 and 50 by a factor 128/100..
Bitshifts are far faster than division (and also more predictable).

With AVR processors, that is worth testing before making such assertions.

Maybe so, but the complexity of the code required to perform the bitshift division will gobble up whatever is gained by the bitshift. The simplicity and transparency of the other suggestions provided in this thread is worth a couple of "lost" (or well spend?) clocks - if any are lost at all.

Responding on an Arduino forum to a question about a modest AVR processor with...

mov edx, -1840700269
mov eax, edi
imul    edx
lea eax, [rdx+rdi]
sar eax, 2
sar edi, 31
sub eax, edi

...Intel assembly is ... well ... many things. Let's just go with "a fallacy".

On the one hand you tell us to trust the compiler.

Then tell us not to trust the compiler.

At this point your posts are "noise". I'm splitting the thread and moving the remnants to Bar Sport.

A petty excuse.

The thread stays here until there is more than "I so say" as evidence.

I deleted all my posts on this as you seem not to be able to take my inputs seriously.

The correct choice is to ask a moderator to remove the thread. Someone other than me will decide what to do.

I am taking you seriously. i hope you take the time to run the test. Everyone using an AVR processor will benefit.

What you fail to realize is that AVR processors do not have efficient barrel shifters. Subtracting two numbers has the same cost has a single bit shift. It is possible that dividing by a constant like 100 is about the same cost as shifting by 7 bits.

Here's some AVR code demonstrating what actually happens...

  uint8_t x8;
  uint16_t x16;
  uint32_t x32;

;; First, lets try 8bit values.

  x8 = PORTB   ;; Get a fresh value to prevent optimization
  80:   85 b1           in      r24, 0x05
  PORTD = x8 / 128;
;; Not only is the compiler smart enough to turn the division into a shift,
;;  it's also smart enough to know that shifting an 8bit number right 7 bits
;;  is the same as just shifting the high bit left into a new register.
;;  (Pretty impressive.)
  82:   88 1f           adc     r24, r24  ; shift high bit into carry
  84:   88 27           eor     r24, r24  ; zero the reg
  86:   88 1f           adc     r24, r24  ; shift the carry into reg
  88:   8b b9           out     0x0b, r24
  x8 = PORTB;
  8a:   85 b1           in      r24, 0x05
;; Division by 100 is apparently too difficult to optimize, so a standard
;;  division function is called.  __udivmodqi4 is 24 bytes of code.
  PORTD = x8 / 100;
  8c:   64 e6           ldi     r22, 0x64       ; 100
  8e:   0e 94 83 00     call    0x106   ; 0x106 <__udivmodqi4>
  92:   8b b9           out     0x0b, r24

;; Now for 16 bits.
  x16 = PINB * PIND;  // create an unknown 16bit value.
  94:   83 b1           in      r24, 0x03
  96:   99 b1           in      r25, 0x09
  98:   89 9f           mul     r24, r25
  9a:   c0 01           movw    r24, r0
  9c:   11 24           eor     r1, r1
  PORTD = x16 / 128;
;; again, we do some optimized shifting in the "opposite" direction.
  9e:   88 0f           add     r24, r24
  a0:   89 2f           mov     r24, r25
  a2:   88 1f           adc     r24, r24
  a4:   99 0b           sbc     r25, r25
  a6:   91 95           neg     r25
  a8:   8b b9           out     0x0b, r24
  x16 = PINB * PIND;
  aa:   83 b1           in      r24, 0x03
  ac:   99 b1           in      r25, 0x09
  ae:   89 9f           mul     r24, r25
  b0:   c0 01           movw    r24, r0
  b2:   11 24           eor     r1, r1
  PORTD = x16 / 100;
  b4:   64 e6           ldi     r22, 0x64       ; 100
  b6:   70 e0           ldi     r23, 0x00
;; But divide by 100 is still generic divide. __udivmodhi4 is 40 bytes of code.
  b8:   0e 94 8f 00     call    0x11e   ; 0x11e <__udivmodhi4>
  bc:   6b b9           out     0x0b, r22       ; 11

;; Now for 32bit
  x32 = PINB * PIND;
  be:   83 b1           in      r24, 0x03
  c0:   99 b1           in      r25, 0x09
  c2:   89 9f           mul     r24, r25
  c4:   c0 01           movw    r24, r0
  c6:   11 24           eor     r1, r1
  c8:   09 2e           mov     r0, r25
  ca:   00 0c           add     r0, r0
  cc:   aa 0b           sbc     r26, r26
  ce:   bb 0b           sbc     r27, r27
;; divide by 128 is still optimized to a shift, but the shift is no longer
;;  "clever."  Just a brute-force 32bit shift, done 7 times in a loop :-(
;;  (The loop is 14 bytes of code.)
  PORTD = x32 / 128;
  d0:   37 e0           ldi     r19, 0x07
  d2:   b6 95           lsr     r27
  d4:   a7 95           ror     r26
  d6:   97 95           ror     r25
  d8:   87 95           ror     r24
  da:   3a 95           dec     r19
  dc:   d1 f7           brne    .-12            ; 0xd2 <main+0x52>
  de:   8b b9           out     0x0b, r24
  x32 = PINB * PIND;
  e0:   63 b1           in      r22, 0x03
  e2:   89 b1           in      r24, 0x09
  e4:   68 9f           mul     r22, r24
  e6:   b0 01           movw    r22, r0
  e8:   11 24           eor     r1, r1
  ea:   07 2e           mov     r0, r23
  ec:   00 0c           add     r0, r0
  ee:   88 0b           sbc     r24, r24
  f0:   99 0b           sbc     r25, r25
PORTD = x32 / 100;
;; And divide by 100 is still a generic divide function.  divmodsi4 is 68 bytes
  f2:   24 e6           ldi     r18, 0x64       ; 100
  f4:   30 e0           ldi     r19, 0x00
  f6:   40 e0           ldi     r20, 0x00
  f8:   50 e0           ldi     r21, 0x00
  fa:   0e 94 a3 00     call    0x146   ; 0x146 <__udivmodsi4>
  fe:   2b b9           out     0x0b, r18
}
 100:   90 e0           ldi     r25, 0x00  ;; (return 0)
 102:   80 e0           ldi     r24, 0x00
 104:   08 95           ret
4 Likes

FWIW, ARM CM0 (which has a barrel shifter but no divide instruction) still uses a division function for divide by 100. Since ARM doesn't "DO" 8bit stuff, it uses the same 266byte 32bit division routine for all three variable lengths. (wouldn't THAT be an interesting thing to optimize!)

ARM CM4 has a divide instruction, and still uses shifting for /128 and the division instruction for /100.

3 Likes

Here a small program to prove that division by 100 takes far more time than division by 128 on Uno. Division by 100 takes 13.80 micro seconds. Division by 128 takes 1.32 micro seconds.
7 bit shifts to the right take 1.32 micro seconds as well.
I must admit, this does not prove that it is implemented with bit shift, you would have to look into the assembly code like @westfw did.
@Coding_Badly: Point stands that division by 128 is far faster than division by 100.
It might be that division by 128 is actually implemented like:
Take high byte.
Copy into int.
Bitshift left.
Add most significant byte from low byte.
This is just a theory that can explain that division by 16 takes more time.... (2.36 micro seconds) even though only 4 right shifts instead of 7 would be needed. Division by 256 is even faster (0.84). That would be: take high byte.

I see now that this is exactly what @westfw has shown...

const int inputPin = A3;
const int denom = 128;  //change to 100 if interested
const bool bitshift = true;

void setup() {
  Serial.begin(9600);
}

void loop() {
    uint16_t x[100];
    randomSeed(analogRead(inputPin));  
    for (int i=0; i<100; i++) {
        x[i] = random(65535);
    }
    uint32_t t_start = micros();
    if (bitshift) {
        for (int i=0; i<100; i++) {
            x[i] = x[i] >> 7;
        }
    }
    else {
        for (int i=0; i<100; i++) {
            x[i] = x[i]/denom;  
        }
    }
    uint32_t t_end = micros();
    for (int i=0; i<100; i++) {
        Serial.println(x[i]);
    }
    Serial.println(t_end - t_start);
    delay(500);
}
1 Like

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.