Speeding up division

I took a shot at a faster division algorithm for unsigned ints, but am not certain that I really want to believe the timings I ended up with - and would appreciate some peer review. Here's my complete test code (n:numerator, d:denominator, q:quotient):

volatile unsigned q = 999;          /* Quotient (result) variable          */
/*-------------------------------------------------------------------------*/
/* udiv() - performs unsigned int division                                 */
/*-------------------------------------------------------------------------*/
unsigned udiv(unsigned n,unsigned d)
{  unsigned char s = 1;             /* Shift counter for early out testing */
   unsigned r = 1, m = ~(~0u >> 1); /* Bottom and top bit masks            */
   /* unsigned */ q = 0;            /* Quotient initialized to zero        */
   if (d)                           /* If denominator isn't zero           */
   {  if (n >= d)                   /* If numerator > denominator          */
      {  while (!(n & m)) m >>= 1;  /* Find numerator MSB                  */
         while (!(d & m))           /* Until denominator MSB matches mask  */
         {  d <<= 1;                /* Shift denominator left one bit pos  */
            r <<= 1;                /* Shift quotient bit to suit          */
            ++s;                    /* Increase shift counter              */
         }                          /*  end: denominator MSB search        */
         while (n && s--)           /* Until numerator zero or end shift   */
         {  if (n >= d)             /* If n < d then n is remainder (done) */
            {  n -= d;              /* Subtract partial quotient and       */
               q |= r;              /* Insert bit into quotient            */
            }                       /*  end: if n >= d                     */
            d >>= 1;                /* Shift divisor right one position    */
            r >>= 1;                /* Shift quotient bit to match         */
         }                          /*  end: until numerator zero          */
      }                             /*  end: if numerator > denominator    */
      else q = 0;                   /* Numerator < denominator => zero     */
   }                                /*  end: if not dividing by zero       */
   else q = ~0;                     /* Very ugly - division by zero!       */
   return q;                        /* Return the quotient to the caller   */
}                                   /*  end: udiv()                        */
/*-------------------------------------------------------------------------*/
void setup(void)
{  unsigned n,d;
   unsigned long t;
   float a;

   Serial.begin(115200);
   t = micros();
   
   for (n=0; n<=100; n++)
   {  for (d=0; d<=100; d++)
      {  // printf("%u/%u = %u\n",n,d,udiv(n,d));
         q = udiv(n,d);
      }
   }
   
   t = micros() - t;
   a = (float)t / 101 / 101;
   
   Serial.print("elapsed = ");
   Serial.print(t);
   Serial.print("usec, average = ");
   Serial.print(a);
   Serial.println("usec");
}
void loop(void)
{  Serial.println(q);
   for (;;);
}

Here's what I got on my Serial monitor:

elapsed = 78580usec, average = 7.70usec
1

Edit: provided q with an initial value, added comments to udiv().

My test shows that the compiler-supplied division with that datatype would take 13.2 uS each, so your code appears to be about twice as fast.

Disclaimer: I haven't tested it against all possible values to make sure it works correctly.

Nick - I just tried it with the following addition following the udiv() call:

         if (q != (n/d))
         {  Serial.print("Error at ");
            Serial.print(n);
            Serial.print("/");
            Serial.println(d);
         }

and got no reported errors.

That's still not all possible values though. I must be getting tired. :grin:

Me too. I'm up to numerator 1618 (out of 65535), but might have to go to bed before it finishes.

best to check division with multiplication - two division routines might have the same bug due to using the same approach, less likely that multiply and divide are both broken in complementary ways.

Also this is definitely a candidate for asm.

Other thread: http://arduino.cc/forum/index.php/topic,92684.new.html#new
Regarding results, according to AVR200

Code Size Execution Time(Words) (Cycles)

16 / 16 = 16 + 16 bit unsigned (Code Optimized) 19 243
16 / 16 = 16 + 16 bit unsigned (Speed Optimized) 196 173
16 / 16 = 16 + 16 bit signed (Code Optimized) 39 255

So the best results 173 x 62.5 ns = 10812.5, which is slower than posted 7.7 usec.
To find a cause of a difference, I'd dig up recent updates on "nested loops", could be that formula
"a = (float)t / 101 / 101" is not quite correct?

At 62.5ns per clock (16MHz), I'm having a really hard time believing those timings are correct. I'd like to see the generated assembly because I've got 32 clocks on average just to find the MSB, and that alone is 2uS.

The other issue is that division is very dependent, both in hardware and software, on the values given. Values with fewer bits take less time. Try all 4 billion combinations. If the answer really is 7.7uS per op, you'll be done in 9 or 10 hours. If not, well, you were wrong about the timing :wink:

At 62.5ns per clock (16MHz), I'm having a really hard time believing those timings are correct.

It's running a correctness check (and will be for a while). When that finishes, I'll set up a 64-bit usec clock and re-run the timing calculations (micros() overflows every hour and 11 minutes or so).

I hear you on the MSB search. There are a number of things that make estimating interesting: anytime the denominator/divisor is larger than the numerator/dividend, the routine simply returns zero; as the numerator grows larger less preliminary shifting is needed; as the denominator grows larger the number of comparisons and subtractions decreases; and the calculation process always watches for an "early" end.

If the answer really is 7.7uS per op, you'll be done in 9 or 10 hours. If not, well, you were wrong about the timing smiley-wink

I don't mind being wrong - so far it's something I've managed to survive - and if I am, I'd rather find out soonest and learn from the experience than have it catch me by surprise later. :grin:

I'm up to 26720 x n (8 hours later). Still got a bit to go. No errors that I can see so far.

So far so good. It got through the functional validation without any errors, and is now running a timing metric over all values.