Go Down

Topic: divmod10() : a fast replacement for /10 and %10 (unsigned) (Read 25763 times) previous topic - next topic

Paul Stoffregen


'all' we need to do is document up and make available in a form every one can use easily.
thats that hard part,


How about the Print.cpp and Print.h Tom posted on reply #51 or I posted on #49 and #54 ?

All you have to do to use these is drop them into your Arduino's hardware/arduino/cores/arduino folder (or hardware/teensy/cores/teensy), replacing the Print.cpp and Print.h that are already there.

On a Mac, you'll need to control-click on Arduino and choose "show package contents", then navigate or search until you find the folder with Print.cpp and Print.h.  I believe it's Contents/Resources/Java/hardware/arduino/cores/arduino, but I don't have a Mac handy at the moment to check.

Admittedly, the copy I posted on #54 had test code enabled.  Here's a "final" version with all the test stuff commented, as ready-to-use as it gets.  These are tested with Teensy but probably will work with official Arduino boards.

aarondc

When Arduino release a new IDE, etc, those libraries you have upgraded are potentially going to get blown away, and require work to reinstall / overwrite.

One potential solution would be if you could set up pathing so modified libraries were linked if they existed in rather than default libraries.
Windows serial port monitor: Tellurium | Arduino serial port debugging library: DBG | Cusom LCD char generator | Technical questions will only be answered in forum threads

Paul Stoffregen

I restarted my exhaustive test, using the asm macro version.  In a few days I should be able to confirm if it matches the C library version for all 2^32 possible inputs....

robtillaart

Code: [Select]
Unless there's more algorithm magic, I believe it's safe to say this 83 cycle version is the fastest possible implementation.
That's why I asked several posts back if one of the earlier C version - shorter in terms of C - could be faster is asm as it might be less statements.

Code: [Select]
In a few days I should be able to confirm if it matches the C library version for all 2^32 possible inputs....
how do you do this test? Generate strings and compare them?
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Paul Stoffregen


how do you do this test? Generate strings and compare them?


Compute the quotient and modulus using both and check if either is different.  Basically, like this:

Code: [Select]

                q1 = n / 10;
                m1 = n - q1 * 10;
                q2 = n;
                divmod10_asm(q2, tmp, m2);
                if (q1 != q2 || m1 != m2) {
                        errorcount++;
                }


Here is the complete code.   It should run on any AVR-based Arduino compatible board.

Paul Stoffregen

Opps, forgot this.  To run it on non-Teensy boards, you'll also have to add this .h file.  Or you could comment out the stuff that prints dots and blinks the LED, but that would not be any fun.

robtillaart


would it not be faster to
Code: [Select]

for(unsigned long(n = 0; n < 0xFFFFFFFF; n++)
{
  divmod10_asm(n, d, m);
  if (n != (10*d + m)) Serial.println(i);
}

as division is quite expensive as we know ;)
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Paul Stoffregen

That would be much faster, aside from a couple minor code problems.  But does only comparing the output to itself really form a valid test?

The slow way will complete in a few days, which really isn't very long in the grand scheme of things.

robtillaart

Quote
But does only comparing the output to itself really form a valid test?

no there is a small chance in theory that there exist a (d, m) pair that would also solve the n = 10*d + m equation   (e.g. n = 100, d = 0,  m = 100)
but given the fact that the code is a white box and we know how the code works this chance is minimal.

To prevent this theoretical issue one could add a test for m to be smaller than 10.

Code: [Select]

for(unsigned long(n = 0; n < 0xFFFFFFFF; n++)
{
  divmod10_asm(n, d, m);
  if (n != (10*d + m) || m >= 10 ) Serial.println(n);
}
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

Tom Carpenter

Even still, using a formula to prove itself is circular reasoning, which is not a proof at all. (http://en.wikipedia.org/wiki/Circular_reasoning)
~Tom~

Paul Stoffregen

The verification only needs to be run once, so it's probably not worth optimizing.  It's already been running over 12 hours and will be done in a couple days.

I believe a much better optimization effort would involve looking into why Arduino's Print takes 15 to actually do something with each digit that's generated in only 5 us.

stimmer

Is there any reason for avoiding using mul in the assembly code? I know not all atmels have mul (I think some attinys don't) but it's there on the Uno and Mega.

Here's a version using mul and fixed point arithmetic:
Code: [Select]

void setup() {
  Serial.begin(115200);
}

void loop() {
 
  uint32_t i=1;
  uint32_t q=0;
  uint8_t r=1;
  while(1){
    uint32_t j=i;
    uint8_t k;
    uint8_t t;
    asm volatile(
   
    // this code fragment works out i/10 and i%10 by calculating i*(51/256)*(256/255)/2 == i*51/510 == i/10
   
    // by "j.k" I mean 32.8 fixed point, j is integer part, k is fractional part
    // j.k = ((j+1.0)*51.0)/256.0
    // (we add 1 because we will be using the floor of the result later)
   
      " ldi %2,51     \n\t"
      " mul %A0,%2    \n\t"
      " clr %A0       \n\t"
      " add r0,%2     \n\t"     
      " adc %A0,r1    \n\t"
      " mov %1,r0     \n\t"
      " mul %B0,%2    \n\t"
      " clr %B0       \n\t"
      " add %A0,r0    \n\t"
      " adc %B0,r1    \n\t"
      " mul %C0,%2    \n\t"
      " clr %C0       \n\t"
      " add %B0,r0    \n\t"
      " adc %C0,r1    \n\t"
      " mul %D0,%2    \n\t"
      " clr %D0       \n\t"
      " add %C0,r0    \n\t"
      " adc %D0,r1    \n\t"
      " clr r1        \n\t" 
     
     
    // j.k = j.k*256.0/255.0
    // (actually the result will always be slightly low, which we need because we use the floor of the result)
     
      " add %1,%A0    \n\t"
      " adc %A0,%B0   \n\t"
      " adc %B0,%C0   \n\t"
      " adc %C0,%D0   \n\t"
      " adc %D0,r1    \n\t"
      " add %1,%B0    \n\t"
      " adc %A0,%C0   \n\t"
      " adc %B0,%D0   \n\t"
      " adc %C0,r1    \n\t"
      " adc %D0,r1    \n\t"
      " add %1,%D0    \n\t"
      " adc %A0,r1    \n\t"
      " adc %B0,r1    \n\t"
      " adc %C0,r1    \n\t"
      " adc %D0,r1    \n\t"     
     
   
    // j.k = j.k / 2.0   
   
      " lsr %D0       \n\t"
      " ror %C0       \n\t"
      " ror %B0       \n\t"
      " ror %A0       \n\t"
      " ror %1        \n\t"   
   
   
   // now i*51.0/510.0 < j.k < (i+1.0)*51.0/510.0
   // therefore j == i/10 (since j == floor(j.k))
   // and (k*10)>>8 == i % 10
     
      " ldi %2,10     \n\t"
      " mul %1,%2     \n\t"
      " mov %1,r1     \n\t"
     
      " clr r1        \n\t"


     
      :"+r"(j),"=d"(k),"=d"(t)
      :
      :  );
     
      if((j!=q) || (k!=r)){
        Serial.print(i);Serial.print(" ");
        Serial.print(j);Serial.print(" ");
        Serial.print(k);Serial.println(" ");
      }
      if((i & 16777215)==0){Serial.print(i);Serial.print(" ");Serial.println(millis());
        if(i==0)return;}
      i++;
      r++;if(r==10){r=0;q++;}
     
  }
}


I've tested it on various values but the full test will take 6 hours to run.

My way of testing for correctness is to keep track of the expected quotient and remainder, incrementing the remainder each time the divisor is incremented and when the remainder gets to 10, increase the quotient:

Code: [Select]


  uint32_t i=1;
  uint32_t q=0;
  uint8_t r=1;
  while(1){

[[ work out div/mod 10 of i here and check against q and r]]

      i++;
      r++;if(r==10){r=0;q++;}
     
  }



This is fast and simple, and I hope it's agreed that it is obviously a correct way of checking.
Due VGA library - http://arduino.cc/forum/index.php/topic,150517.0.html

robtillaart

#72
Jun 26, 2013, 10:23 pm Last Edit: Jun 26, 2013, 10:26 pm by robtillaart Reason: 1
@stimmer
Can you wrap your code in a similar function as divmod10() and run a performance test for comparison?

please use Toms code as starter - http://forum.arduino.cc/index.php?topic=167414.msg1288779#msg1288779 -
Rob Tillaart

Nederlandse sectie - http://arduino.cc/forum/index.php/board,77.0.html -
(Please do not PM for private consultancy)

stimmer

As a normal function it would be dominated by overhead and marshalling. As an inline function or macro it should take 3us. The calculation is 43 instructions long, 5 of which are mul, so it takes 48 cycles in total. In a quick real-world test the result I got was 3.1245us as an inline function.

The full 32-bit range test has now completed without error :)
Due VGA library - http://arduino.cc/forum/index.php/topic,150517.0.html

Tom Carpenter

Lol. This is starting to get insane.

Credit where credit is due, thats an amazing optimisation stimmer  :D.
~Tom~

Go Up