Is the Russian Peasant Division really faster than the AVR200 or / operator?

From testing AVR200 vs the Russian Peasant Division algorithm vs the / operator in C the results seem to show that the Russian Peasant Division algorithm is the fastest:
RussianPeasantMultiplicationDivisionAssemblyArduino

The tests were made in an old version of the compiler and the different optimization levels were not tested.

Anyone willing to test it?

Can you provide a clear C implementation of this algorithm rather just assembly and
a rough description?

Is this unsigned only?

joogaa:
Anyone willing to test it?

I wrote some benchmark code for your 16-bit multiplication.
As your multiplication returns a 32-bit value, I've done a 'long' multiplication in Arduino for comparison and same results.

// benchmark test for the Russian peasant multiplication
#include <Arduino.h>
#include "rmulfun.h"

#define LOWBOUND 1210
#define HIGHBOUND 1520

unsigned int test1()
{
  unsigned int c=0;
  for (int i=LOWBOUND;i<HIGHBOUND;i++) 
  {
    for (int j=LOWBOUND;j<HIGHBOUND;j++)
    {
      uint32_t a = rmul16u(i, j);
      if (bitRead(a,1)==1) c++;
    }
  }
  return c;
}

unsigned int test2()
{
  unsigned int c=0;
  for (int i=LOWBOUND;i<HIGHBOUND;i++) 
  {
    for (int j=LOWBOUND;j<HIGHBOUND;j++)
    {
      uint32_t a = (long)i * j;
      if (bitRead(a,1)==1) c++;
    }
  }
  return c;
}

void setup()
 {
 Serial.begin(9600);
 Serial.println("Start");
 unsigned long t1=micros();
 unsigned int x=test1();
 t1=micros()-t1;
 unsigned long t2=micros();
 unsigned int y=test2();
 t2=micros()-t2;
 
 Serial.print("Test 1 result: ");
 Serial.print(x);
 Serial.print("  Duration: ");
 Serial.println(t1/1000000.0,6);

 Serial.print("Test 2 result: ");
 Serial.print(y);
 Serial.print("  Duration: ");
 Serial.println(t2/1000000.0,6);
}

void loop(){
}

Of course, it is impossible to write a software multiplication (even not in assembler) that is faster than the hardware multiplication that is built-in within the AVR controller.

But your 32-bit division really seems to beat the default AVR 32-bit division.
Sometimes, at least.

Here is some code for testing 16bit*16bit=32bit multiplication as well as unsigned long division:

// benchmark test for the Russian peasant multiplication
#include <Arduino.h>
#include "rmulfun.h"
#include "rdivfun.h"

#define LOWBOUND 1210
#define HIGHBOUND 1520

#define LOWDIVIDEND 123456789
#define HIGHDIVIDEND (LOWDIVIDEND+500)
#define LOWDIVISOR  234567
#define HIGHDIVISOR (LOWDIVISOR+500)

unsigned int testRussMul()
{
  unsigned int c=0;
  for (int i=LOWBOUND;i<HIGHBOUND;i++) 
  {
    for (int j=LOWBOUND;j<HIGHBOUND;j++)
    {
      uint32_t a = rmul16u(i, j);
      if (bitRead(a,1)==1) c++;
    }
  }
  return c;
}

unsigned int testArduMul()
{
  unsigned int c=0;
  for (int i=LOWBOUND;i<HIGHBOUND;i++) 
  {
    for (int j=LOWBOUND;j<HIGHBOUND;j++)
    {
      uint32_t a = (long)i * j;
      if (bitRead(a,1)==1) c++;
    }
  }
  return c;
}

unsigned int testRussDiv()
{
  unsigned int c=0;
  for (uint32_t i=LOWDIVIDEND;i<HIGHDIVIDEND;i++) 
  {
    for (uint32_t j=LOWDIVISOR;j<HIGHDIVISOR;j++)
    {
      uint32_t a,b;
      rdiv32u(&a, &b, i, j);
      if (bitRead(a,1)==1) c++;
    }
  }
  return c;
}

unsigned int testArduDiv()
{
  unsigned int c=0;
  for (uint32_t i=LOWDIVIDEND;i<HIGHDIVIDEND;i++) 
  {
    for (uint32_t j=LOWDIVISOR;j<HIGHDIVISOR;j++)
    {
      uint32_t a= i/j;
      if (bitRead(a,1)==1) c++;
    }
  }
  return c;
}


void setup()
 {
 Serial.begin(9600);
 Serial.println("Start");
 delay(100);  // Wait until serial message is sent
 unsigned long t1=micros();
 unsigned int mulRuss=testRussMul();
 t1=micros()-t1;
 unsigned long t2=micros();
 unsigned int mulArdu=testArduMul();
 t2=micros()-t2;
 
 unsigned long t3=micros();
 unsigned int divRuss=testRussDiv();
 t3=micros()-t3; 

 unsigned long t4=micros();
 unsigned int divArdu=testArduDiv();
 t4=micros()-t4; 
 
 Serial.print("Test Russian Pheasant Multiplication Result: ");
 Serial.print(mulRuss);
 Serial.print("  Duration: ");
 Serial.println(t1/1000000.0,6);

 Serial.print("Test Built-In Multiplication result: ");
 Serial.print(mulArdu);
 Serial.print("  Duration: ");
 Serial.println(t2/1000000.0,6);

 Serial.print("Test Russian Pheasant Division Result: ");
 Serial.print(divRuss);
 Serial.print("  Duration: ");
 Serial.println(t3/1000000.0,6);
 
 Serial.print("Test Built-In Division Result: ");
 Serial.print(divArdu);
 Serial.print("  Duration: ");
 Serial.println(t4/1000000.0,6);
}

void loop(){
}

Congratulations! Your 32-bit division really can be faster than what the AVR LIBC provides for '/' 32-bit divisions!

BUT: While the execution time of the built-in function always seems to be the same, this is not so with your algorithm. Your function needs more time the higher the resulting quotient is. Just change the parameters from the test code above against this:

#define LOWDIVIDEND 123456789
#define HIGHDIVIDEND (LOWDIVIDEND+500)
#define LOWDIVISOR  2
#define HIGHDIVISOR (LOWDIVISOR+500)

After doing so your new algorithm is SLOWER than the built-in function.

So with your code it depends on the division: It can be faster or slower than the built-in '/' function. It heavily depends on dividend and divisor (and the resulting quotient). The speed of the built-in function is always the same.

Please post the output from these tests.

I would like to know not only which is faster, but how much faster.

Also, I would like to know if the compiler optimizes division by a constant. For example, is division by 2 turned into a bit shift? What, if anything, is done with division by 10?

In particular, what is the fastest way to get at the base-10 digits of a number?

I will code it in C when I have time.

I made my tests on an old version of the compiler:

Time in microsec
/ is the compiler division
rdiv8u is the Russian Peasant Division
div8u is the AVR200b division

8 bit division
/ time = 140688 rdiv8u time = 41108 div8u time = 88068

rdiv16u is the Russian Peasant Division
div16u is the AVR200b division

16 bit division
rdiv16u time = 324 / time = 720 div16u time = 996

I did not benchmark the 32 bit division because with my current compiler version, the result of the division of two 32 bit numbers is incorrect, the compiler only uses the first 16 bits of each. Does this happen in the latest version? If so, then that explains the slower results.

I'm working on code even faster than this.

It's unsigned because it was made to be compared with the code from AVR200b. It should be easy to make a signed version, if the code is considered worth it.
If more people test it and find it accurate and fast then I can make a signed version.

That multiplication method isn't anything interesting at all. It's exactly the same as the decimal multiplication you probably learned in the 2nd grade. It's exactly the algorithm you would implement on a computer which doesn't have even a single byte multiplication capability.

I a still trying to figure out what I think about the division.

The Russian peasant multiplication consists of multiplying the multiplier by two, while dividing the multiplicand by two, until the multiplicand is less than or equal one. If the multiplicand is odd then it is added to the multiplier.

There are some demonstrations of this online, which don't do what the preceding statement says, but they do work. Apart from swapping the "mulitplier" and "multiplicand", as far as I can tell, it is exactly the same as normal multiplication.

The Russian peasant division consists of multiplying the quotient by two, while dividing the divisor by two, if the dividend is greater than the divisor then subtract the divisor from the dividend.
The remainder is the value of the dividend on the last subtraction.

This doesn't even make sense, and I can't find any examples or descriptions of this online at all.

joogaa:
Time in microsec
/ is the compiler division
rdiv8u is the Russian Peasant Division
div8u is the AVR200b division

8 bit division
/ time = 140688 rdiv8u time = 41108 div8u time = 88068

rdiv16u is the Russian Peasant Division
div16u is the AVR200b division

16 bit division
rdiv16u time = 324 / time = 720 div16u time = 996

I did not benchmark the 32 bit division because with my current compiler version, the result of the division of two 32 bit numbers is incorrect, the compiler only uses the first 16 bits of each. Does this happen in the latest version? If so, then that explains the slower results.

I've done some extended version of the benchmark posted above.
The rdiv32u() function seems to work fine, I cound not find any difference to the results using '/' with division of 'long' variables. The test whether the results are the same are also included in the benchmark sketch.

This is the Arduino code I used:

// benchmark test for the Russian peasant multiplication
#include <Arduino.h>
#include "rmulfun.h"
#include "rdivfun.h"
volatile int dummy=0;

#define LOWFACTOR1 (1234L+dummy)
#define HIGHFACTOR1 (LOWFACTOR1+500)
#define LOWFACTOR2 (4567L+dummy)
#define HIGHFACTOR2 (LOWFACTOR2+500)

#define LOWDIVIDEND (123456789L+dummy)
#define HIGHDIVIDEND (LOWDIVIDEND+500)
#define LOWDIVISOR  (2345678L+dummy)
//#define LOWDIVISOR  (7L+dummy)
#define HIGHDIVISOR (LOWDIVISOR+500)

void compareMulError(uint32_t loFactor1, uint32_t hiFactor1, uint32_t loFactor2, uint32_t hiFactor2)
{
  long sameresult=0,errors=0;
  for (int i=loFactor1;i<hiFactor1;i++) 
  {
    for (int j=loFactor2;j<hiFactor2;j++)
    {
      uint32_t a = rmul16u(i, j);
      if (a== (long)i*j) sameresult++;
      else errors++;
    }
  }
  Serial.print("Multiplication Same Result: ");Serial.println(sameresult);
  Serial.print("Multiplication Error Count: ");Serial.println(errors);
  if (errors==0) Serial.println("Multiplication Test PASSED OK");
  else Serial.println("Multiplication Test ERROR");
  Serial.println();
}

void compareDivError(uint32_t loDividend, uint32_t hiDividend, uint32_t loDivisor, uint32_t hiDivisor)
{
  long sameresult=0,errors=0;
  for (uint32_t i=loDividend;i<hiDividend;i++) 
  {
    for (uint32_t j=loDivisor;j<hiDivisor;j++)
    {
      uint32_t a,b;
      rdiv32u(&a, &b, i, j);
      if (a== i/j) sameresult++;
      else errors++;
    }
  }
  Serial.print("Division Same Result: ");Serial.println(sameresult);
  Serial.print("Division Error Count: ");Serial.println(errors);
  if (errors==0) Serial.println("Division Test PASSED OK");
  else Serial.println("Division Test ERROR");
  Serial.println();
}

long testRussMul(uint32_t loFactor1, uint32_t hiFactor1, uint32_t loFactor2, uint32_t hiFactor2)
{
  long c=0;
  for (int i=loFactor1;i<hiFactor1;i++) 
  {
    for (int j=loFactor2;j<hiFactor2;j++)
    {
      uint32_t a = rmul16u(i, j);
      c+= bitRead(a,0)+bitRead(a,31);
    }
  }
  return c;
}

long testArduMul(uint32_t loFactor1, uint32_t hiFactor1, uint32_t loFactor2, uint32_t hiFactor2)
{
  long c=0;
  for (int i=loFactor1;i<hiFactor1;i++) 
  {
    for (int j=loFactor2;j<hiFactor2;j++)
    {
      uint32_t a = (long)i * j;
      c+= bitRead(a,0)+bitRead(a,31);
    }
  }
  return c;
}

long testRussDiv(uint32_t loDividend, uint32_t hiDividend, uint32_t loDivisor, uint32_t hiDivisor)
{
  long c=0;
  for (uint32_t i=loDividend;i<hiDividend;i++) 
  {
    for (uint32_t j=loDivisor;j<hiDivisor;j++)
    {
      uint32_t a,b;
      rdiv32u(&a, &b, i, j);
      c+= bitRead(a,0)+bitRead(a,31);
    }
  }
  return c;
}

long testArduDiv(uint32_t loDividend, uint32_t hiDividend, uint32_t loDivisor, uint32_t hiDivisor)
{
  long c=0;
  for (uint32_t i=loDividend;i<hiDividend;i++) 
  {
    for (uint32_t j=loDivisor;j<hiDivisor;j++)
    {
      uint32_t a= i/j;
      c+= bitRead(a,0)+bitRead(a,31);
    }
  }
  return c;
}

void showResult(char* testname, long result, float time)
{
  Serial.print("Test: ");
  Serial.print(testname);
  Serial.print("   Result: ");
  Serial.print(result);
  Serial.print("   Time: ");
  Serial.println(time,6);
}


void setup()
{
  long result;
  unsigned long time;
  Serial.begin(9600);

  Serial.println("\nStart Multiplication Compare Test:");
  compareMulError(LOWFACTOR1, HIGHFACTOR1, LOWFACTOR2, HIGHFACTOR2);
  Serial.println("Start Multiplication Benchmark Test:");
  delay(100);  // Wait until serial message is sent
  
  time=micros();
  result=testRussMul(LOWFACTOR1, HIGHFACTOR1, LOWFACTOR2, HIGHFACTOR2);
  time=micros()-time;
  showResult("Russian MUL",result,time/1000000.0);
  delay(100);  // Wait until serial message is sent
  
  time=micros();
  result=testArduMul(LOWFACTOR1, HIGHFACTOR1, LOWFACTOR2, HIGHFACTOR2);
  time=micros()-time;
  showResult("Arduino MUL",result,time/1000000.0);
  delay(100);  // Wait until serial message is sent
  
  Serial.println("\nStart Division Compare Test:");
  compareDivError(LOWDIVIDEND, HIGHDIVIDEND, LOWDIVISOR, HIGHDIVISOR);
  Serial.println("Start Division Benchmark Test:");
  delay(100);  // Wait until serial message is sent
  
  time=micros();
  result=testRussDiv(LOWDIVIDEND, HIGHDIVIDEND, LOWDIVISOR, HIGHDIVISOR);
  time=micros()-time;
  showResult("Russian DIV",result,time/1000000.0);
  delay(100);  // Wait until serial message is sent

  time=micros();
  result=testArduDiv(LOWDIVIDEND, HIGHDIVIDEND, LOWDIVISOR, HIGHDIVISOR);
  time=micros()-time;
  showResult("Arduino DIV",result,time/1000000.0);
  delay(100);  // Wait until serial message is sent

}

void loop(){
}

The output created from this code is:

Start Multiplication Compare Test:
Multiplication Same Result: 250000
Multiplication Error Count: 0
Multiplication Test PASSED OK

Start Multiplication Benchmark Test:
Test: Russian MUL   Result: 62500   Time: 4.666188
Test: Arduino MUL   Result: 62500   Time: 1.970244

Start Division Compare Test:
Division Same Result: 250000
Division Error Count: 0
Division Test PASSED OK

Start Division Benchmark Test:
Test: Russian DIV   Result: 0   Time: 6.368096
Test: Arduino DIV   Result: 0   Time: 10.360857

So if the result of the divisions is a relatively small number, the rdiv32u() function can be faster than the '/' division.

BUT: If the resulting number is bigger, i.e. chaning the divisor to a small number like:

#define LOWDIVISOR  (7L+dummy)

then the benchmark result changes to:

Start Multiplication Compare Test:
Multiplication Same Result: 250000
Multiplication Error Count: 0
Multiplication Test PASSED OK

Start Multiplication Benchmark Test:
Test: Russian MUL   Result: 62500   Time: 4.666188
Test: Arduino MUL   Result: 62500   Time: 1.970244

Start Division Compare Test:
Division Same Result: 250000
Division Error Count: 0
Division Test PASSED OK

Start Division Benchmark Test:
Test: Russian DIV   Result: 123512   Time: 13.541772
Test: Arduino DIV   Result: 123512   Time: 10.699304

While the time for the Arduino '/' division stays nearly the same, not depending much on how big the resulting number is (10.360857 seconds vs. 10.699304 seconds), the calculation time with the rdiv32u() division will depend very much on the dividend, divisor and result calculated: While the calculation is fast when the resulting number is small (6.368096 seconds), the time will increase while the division result is increasing (13.541772 seconds).

BTW: When developing the benchmark code I had to do some extra coding to make the Arduino compiler really calculate at execution time and prevent the compiler from including the final result for '/' divisions calculated at compile time, which would lead to a timing, that 250000 'long' divisions would just need 4 microseconds.