fast divide of signed number using SHIFT ?

I thought it might be quicker to use shift to divide my signed number. I want the quotient and the remainder.

I've already changed from divide by 10 to divide by 8 to make it easier. And for other reasons I've changed from int to char. So these timing are mostly char.

My remainder code is possibly very clunky? there might be a very simple method.

I got the quotient part from here:

I think my timing code is OK ??? variables are volatile. I think I've avoided the optimizer spoiling it.

one of my loops:

for (int i = 0; i < 1000; i++) {
x= i % 128 -120;
q= (x + ((x>>7)&0b00000111))>>3;
r= ((x>>4)& 0b11111000)|(x & 0b00000111);
}

my results:

1000 loops just assign x 828
1000 loops divide by 10 with / 7404
1000 loops divide by 8 with / 1388
1000 loops divide by 8 with >> 2976
1000 loops divide by 8 unsigned >> 2212

1000 loops q,r /10 %10 13972
1000 loops q,r /8 %8 2020
1000 loops q,r by >> 3800

1000 loops q,r /8%8 INT 18180
1000 loops q,r by >> INT 5508

So it looks like:

for CHAR
/ and % are damn fast when 8 is the divisor
not so fast with 10.
q = x/8 beats beats q= x>>3 this surprised me

for INT
using >> beats / % by a fair bit

DIVIDE_SHIFT.ino (4.04 KB)

It could be that the compiler already optimizes “divide by a constant 8” into shifts.
Inspect the object code...

Looking at object code is too deep for me, was just hoping someone might know or have an idea about the reason.

Todays timings indicate that >> is not optimised for char (converts to int ??) but /8, /4 is.

These results are net ( take away the assign loop time)

1000 loops char /4 516 net
1000 loops char /8 576 net
1000 loops char /16 632 net
1000 loops char /32 708 net

1000 loops char >>2 708 net
1000 loops char >>3 1396 net
1000 loops char >>4 1712 net
1000 loops char >>5 2032 net

1000 loops int /4 948 net
1000 loops int /8 1648 net
1000 loops int /16 1960 net
1000 loops int /32 2272 net
1000 loops int /10 15512 net

1000 loops int >>2 764 net
1000 loops int >>3 1464 net
1000 loops int >>4 1776 net
1000 loops int >>5 2092 net

jimmer:
I thought it might be quicker to use shift to divide my signed number. I want the quotient and the remainder.

What makes you think that ?

Deva_Rishi:
What makes you think that ?

I was expecting / to be a dumb general purpose algorithm. So eg I wasn't expecting /8 to be loads faster than /10.

The divide operator is verrryyyy sloooooowwwww so using right shift should be maybe 2 orders of magnitude faster.

I don't think the compiler optimizes divisions by powers of 2.

...R

Looking at object code is too deep for me, was just hoping someone might know or have an idea about the reason.

The very best way to know the reason is to look at the assembly language listing output of the compiler. It will be obvious whether the compiler chooses to use shift operators versus the divide function.

The Arduino IDE doesn't make getting that output easy, but Atmel Studio does. So, you can spend your time fiddling around with little programs and getting nowhere, or learning a technique that is actually useful.

I won't say that's a bad idea, but Priorities. Learning atmel studio and a new assembly language and maybe c+ is a lot of work to answer a curiosity.

Summary of results :

/8 (also /4 /16 etc) is about 10x faster than /10 for both char and int

char/8 is just over twice as fast as int/8

byte/8 is faster still (20%) than char/8

is slow with char (same speed as int so probably converts to int)

jimmer:
I won't say that's a bad idea, but Priorities. Learning atmel studio and a new assembly language and maybe c+ is a lot of work to answer a curiosity.

Summary of results :

/8 (also /4 /16 etc) is about 10x faster than /10 for both char and int

char/8 is just over twice as fast as int/8

byte/8 is faster still (20%) than char/8

is slow with char (same speed as int so probably converts to int)

If the divisor is not a power of 2, then your code must include SUBTRACTS of the divisor for those parts that are not power of two. Used all the time on ancient microprocessors.

Paul

Here is my small contribution. Note how division of long variables does not appear to be optimized for a power-of-two divisor.

Results ...

Starting DivideSpeedTest.ino
All tests are for 1000 iterations and time is microsecs

INTEGER TESTS ---
divide by 17 ...
Time 17548   Total -726000

divide by 16 ...
Time 3036   Total -771000                            
                                                     
Right shift 4 ...                                    
Time 2848   Total -772000                            

LONG TESTS ---
divide by 17 ...                                                                
Time 43132   Total -726000                                                      
                                                                                
divide by 16 ...                                                                
Time 42716   Total -771000                                                      
                                                                                
Right shift 4 ...                                                               
Time 3856   Total -772000                                                       
                                                                                
 ----DONE----

Program

// python-build-start
// action, upload
// board, arduino:avr:mega:cpu=atmega2560
// port, /dev/ttyACM0
// ide, 1.8.6
// python-build-end

// arduino:avr:uno
// arduino:avr:mega:cpu=atmega2560

// program to explore the speed difference using divide or right shift
//      https://forum.arduino.cc/index.php?topic=586534.0




unsigned long startMicros;
unsigned long endMicros;
int numRepeats = 1000;


void setup() {
    Serial.begin(500000);
    Serial.println("Starting DivideSpeedTest.ino");

    Serial.println("All tests are for 1000 iterations and time is microsecs");

    Serial.println("\nINTEGER TESTS ---");
    integerTests();
    Serial.println("\nLONG TESTS ---");
    longTests();

}

void integerTests() {
    volatile int myVal = -12345;
    volatile int answer;
    long total;

    Serial.println("divide by 17 ...");
    total = 0;
    startMicros = micros();
    for (int n = 0; n < numRepeats; n++) {
        answer = myVal / 17L;
        total += answer;
    }
    endMicros = micros();
    Serial.print("Time "); Serial.print(endMicros - startMicros);
    Serial.print("   Total "); Serial.println(total);
    Serial.println();

    Serial.println("divide by 16 ...");
    total = 0;
    startMicros = micros();
    for (int n = 0; n < numRepeats; n++) {
        answer = myVal / 16L;
        total += answer;
    }
    endMicros = micros();
    Serial.print("Time "); Serial.print(endMicros - startMicros);
    Serial.print("   Total "); Serial.println(total);
    Serial.println();

    Serial.println("Right shift 4 ...");
    total = 0;
    startMicros = micros();
    for (int n = 0; n < numRepeats; n++) {
        answer = myVal >> 4;
        total += answer;
    }
    endMicros = micros();
    Serial.print("Time "); Serial.print(endMicros - startMicros);
    Serial.print("   Total "); Serial.println(total);

}

void longTests() {
    volatile long myVal = -12345;
    volatile long answer;
    long total;

    Serial.println("divide by 17 ...");
    total = 0;
    startMicros = micros();
    for (int n = 0; n < numRepeats; n++) {
        answer = myVal / 17L;
        total += answer;
    }
    endMicros = micros();
    Serial.print("Time "); Serial.print(endMicros - startMicros);
    Serial.print("   Total "); Serial.println(total);
    Serial.println();

    Serial.println("divide by 16 ...");
    total = 0;
    startMicros = micros();
    for (int n = 0; n < numRepeats; n++) {
        answer = myVal / 16L;
        total += answer;
    }
    endMicros = micros();
    Serial.print("Time "); Serial.print(endMicros - startMicros);
    Serial.print("   Total "); Serial.println(total);
    Serial.println();

    Serial.println("Right shift 4 ...");
    total = 0;
    startMicros = micros();
    for (int n = 0; n < numRepeats; n++) {
        answer = myVal >> 4;
        total += answer;
    }
    endMicros = micros();
    Serial.print("Time "); Serial.print(endMicros - startMicros);
    Serial.print("   Total "); Serial.println(total);

    Serial.println("\n ----DONE----");

}

void loop() {
}

...R

Yep, I found that too but forgot to mention because I was interested in the small fast numbers.

Truly horrendous times for long!

I found all sorts of clever solutions for dividing by specific numbers. by 10 is particularly useful and there is a whole raft of stuff on that. The simplest general purpose speedup if you know you are going to be repeatedly dividing by a value is to create a float = 1/value and multiply by that. See below, still slow but only half as slow.

1000 loops x=(i%125)-62; 15144
1000 loops long q= x/4 40296 net
1000 loops long q= x/8 40204 net
1000 loops long q= x/16 40104 net
1000 loops long q= x/32 40008 net

1000 loops long q= x/10 40016 net
1000 loops long q=x*over10 18576 net

1000 loops long q= x>>2 1892 net
1000 loops long q= x>>3 2344 net
1000 loops long q= x>>4 2784 net
1000 loops long q= x>>5 3232 net

Fast divides threads, awesome code.

Arduino Forum > Development > Other Software Development > divmod10() : a fast replacement for /10 and %10 (unsigned)
http://forum.arduino.cc/index.php?topic=167414.0

Arduino Forum > Development > Other Software Development > divu5() and divu3(). Fast replacements for unsigned division by 3, 5, 6, 10, 15
http://forum.arduino.cc/index.php?topic=172635.0

Looking at object code is too deep for me...
:
Learning atmel studio and a new assembly language and maybe c+ is a lot of work to answer a curiosity.

It's not THAT bad. You CAN disassemble the binary from the Arduino IDE, you just have to learn where stuff is. And you don't need to "learn" the assembly language to be able to tell the difference between shifting and division...

Sigh. Here:

  char c;
   int i;
  long u;
  c = PORTB;
  i = (c + (PORTB << 8)) + micros();
  u = (i + (PORTB << 16)) + millis();
  PORTD = c / 8;
  PORTD = i / 8;
  PORTD = u / 8;
  PORTD = c / 10;
  PORTD = i / 10;
  PORTD = u / 10;

Produces:

  PORTD = c / 8;
 230:   8c 2f           mov     r24, r28
 232:   c7 ff           sbrs    r28, 7
 234:   02 c0           rjmp    .+4             ; 0x23a <main+0x116>
 236:   87 e0           ldi     r24, 0x07       ; 7
 238:   8c 0f           add     r24, r28
 23a:   85 95           asr     r24              ;;;  ***** SHIFTS *****
 23c:   85 95           asr     r24
 23e:   85 95           asr     r24
 240:   8b b9           out     0x0b, r24       ; 11
  PORTD = i / 8;
 242:   c8 01           movw    r24, r16
 244:   17 fd           sbrc    r17, 7
 246:   07 96           adiw    r24, 0x07       ; 7
 248:   23 e0           ldi     r18, 0x03       ; 3
 24a:   95 95           asr     r25              ;;;  ***** SHIFTS in a loop *****
 24c:   87 95           ror     r24
 24e:   2a 95           dec     r18
 250:   e1 f7           brne    .-8             ; 0x24a <main+0x126>
 252:   8b b9           out     0x0b, r24       ; 11
  PORTD = u / 8;
 254:   c7 01           movw    r24, r14
 256:   b6 01           movw    r22, r12
 258:   28 e0           ldi     r18, 0x08       ; 8
 25a:   30 e0           ldi     r19, 0x00       ; 0
 25c:   40 e0           ldi     r20, 0x00       ; 0
 25e:   50 e0           ldi     r21, 0x00       ; 0
 260:   0e 94 71 01     call    0x2e2   ; 0x2e2 <__divmodsi4>  ;;;;   ****** Divide *****
 264:   2b b9           out     0x0b, r18       ; 11
  PORTD = c / 10;
 266:   8c 2f           mov     r24, r28
 268:   6a e0           ldi     r22, 0x0A       ; 10
 26a:   0e 94 4f 01     call    0x29e   ; 0x29e <__divmodqi4>  ;;;;   ****** Divide *****
 26e:   8b b9           out     0x0b, r24       ; 11
  PORTD = i / 10;
 270:   c8 01           movw    r24, r16
 272:   6a e0           ldi     r22, 0x0A       ; 10
 274:   70 e0           ldi     r23, 0x00       ; 0
 276:   0e 94 5d 01     call    0x2ba   ; 0x2ba <__divmodhi4>  ;;;;   ****** Divide *****
 27a:   6b b9           out     0x0b, r22       ; 11
  PORTD = u / 10;
 27c:   c7 01           movw    r24, r14
 27e:   b6 01           movw    r22, r12
 280:   2a e0           ldi     r18, 0x0A       ; 10
 282:   30 e0           ldi     r19, 0x00       ; 0
 284:   40 e0           ldi     r20, 0x00       ; 0
 286:   50 e0           ldi     r21, 0x00       ; 0
 288:   0e 94 71 01     call    0x2e2   ; 0x2e2 <__divmodsi4>  ;;;;   ****** Divide *****
 28c:   2b b9           out     0x0b, r18       ; 11

Interestingly (?) divide-by-8 of an unsigned quantity results in shifts for all variable sizes...

The divide operator is verrryyyy sloooooowwwww

Note that this is a CPU-dependent statement. Divide is pretty fast on a Due with it's ARM CM3 chip (but NOT on an Arduino Zero, where the CM0+ cpu lacks a divide instruction.)

Also note that right-shifting a signed integer is "implementation defined" in C. It might not be equivalent to division on some cpus.

Right shifting is never equivalent to division for negative numbers, as C has truncate-to-zero
semantics for divide, whereas shifting right truncates to floor. Most division hardware has
truncate to zero semantics.

Often when dividing in the integer domain you get a more intuitive result by adding in
half the power of two:

int divide8 (int a)
{ 
  return (a+4) >> 3 ;
}

This is roughly equivalent to rounding to nearest, although note divide8(4) rounds to +1,
divide8(-4) rounds to 0.

Note the same trick is useful for with '/', but that its more complicated as there are
different cases for positive and negative.

int divide7(int a)
{
  if (a < 0)
    return (a-3)/7 ;
  else
    return (a+3)/7 ;
}

Right shifting is never equivalent to division for negative numbers, as ... shifting right truncates to floor.

Thus, "implementation defined" (and a good example of why!) The "arithmetic shift" instruction of your processor may or may not round differently. I believe that it would also be permissible under the "implementation defined" header if right shift of a signed quantity worked exactly like an unsigned shift (ie not at all what people expect...)
If you look more closely than 'shifts vs divide function' at the assembly code above, you'll notice that the divide-by-8 isn't JUST a shift right...