Arduino generating prime numbers

tmd3:
It ran significantly faster with Numbler4 on the Uno, with execution time less than 40% of the BigNumber version. It ran a little slower than BigNumber on the ARMs. On the Due, Numbler4 ran about 5-8% slower. There was less difference with the Stellaris, but it was still bit slower. I can qualitatively understand how eliminating a lot of modulo operations can speed things up on the Uno; my understanding of the internals of the ARMs is too thin to help me explain the results with the Due and Stellaris. I'm fool enough to speculate, though: TI says that the Stellaris' Cortex-M3 has hardware divide, and Atmel says the same thing about the SAM3X. My guess is that division on those processors is fast enough to be competitive with the alternative.\

All right, it looks like we nailed it.

If you want even more speed out of BigNumber, get rid of the Knuth trick in the division routine and instead use my method (in Numbler4) for getting at the next digit of the quotient. I don't know if that will speed up the division much, but it will certainly speed up the task of getting at the remainder. This is because if you do things my way, you are using long division, and long division gives you the remainder directly. Knuth division doesn't, hence the need for an extra multiplication step in the modulo routine in BigNumber. To clarify:
Dividing BigNumber x by BigNumber y gives you (x/y), so if it's modulo you're after, you end up computing x-((x/y)*y).
With Numbler4, computing x/y automatically gives you x%y as a by-product.

If you want more speed than that, try replacing the long division with what is called "scratch" or "galley" division. I would have done it myself, but I'm no good with pointers.

One more tweak. This one doesn't use BigNumber, and it runs nearly twice as fast.

We used BigNumber to get around the fact that intermediate results can overflow an unsigned long long in the Miller-Rabin test. That test requires calculating ab % n for large numbers. Using numbers that are less than 1016 - 54 bits - the results always fit into a long long, but the intermediate products that generate the exponentials can be bigger than 264. BigNumber gets around the restriction, but at some cost in speed and storage.

This code uses function "mulmodll()," that produces the product of two unsigned long long's, modulo a third unsigned long long, without overflowing on intermediate products. It generates the product internally as two unsigned long long's, and calculates the modulo by repeated shifts and mods. It has its own overhead - it has to take % as many as seven times. But, it's working in native number formats, rather than in BCD, and it goes faster.

This code depends heavily on the fact that numbers bigger than 1016 aren't investigated. That fact simplifies the 128-bit multiplication, in that certain intermediate results can't overflow, and it simplifies the mod operations, too. The code isn't general. It's assumptions are that the lower half of any argument times the upper half of any argument will be less than 263, and that the modulus will always be less than 256. With 1016 = 253.15, the problem fits, with a bit or two to spare.

Results: Running on the Uno, 100000 primes starting at 961,748,929, and comparing to the fiftieth million from this list - The first fifty million primes - found an exact match, with calculation time averaging about 170 ms per prime. Using the tests of reply #80, calculation time for 16-digit numbers added up to 134,578 ms, for an improvement of a factor of a little better than 5 over using BigNumber, and about 1.9 times as fast as using Numbler4. For 12-digit numbers, calculation time was 52,472 ms, for a similar speed improvement. For numbers near the top of the range, starting at 9,999,999,999,990,001, the average calculation time for 100 primes was around 1.5 seconds - about three times what Nick was looking for.

The Due was ferocious. It averaged 10.6 ms for 1,000 12-digit primes starting at 100000000001; 25.8 ms for 16-digit starting at 1000000000000001, and 33.4 starting at 9900000000036791. My Due ran these tests, and then stopped working - I'm hoping that it's just bored.

There's room for improvement here. I think there's a sweet spot for the number of trial divisions, I think it depends to some extent on the size of the numbers under test, and I don't think I've found it here. There may be a way to reduce the number of modulo calls in function powmodll(), but I can't think of any yet.

Here's the code. The included file is attached at reply #70:

PROGMEM prog_uchar primes[] = {
    3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 
   37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 
   83, 89, 97,101,103,107,109,113,127,131,137,
  139,149,151,157,163,167,173,179,181,191,193,
  197,199,211,223,227,229,233,239,241,251
};
const byte nPrimes = sizeof(primes)/sizeof(primes[0]);

#include "Pseudoprimes.h";
const uint16_t nPSP = (sizeof(psps)/sizeof(psps[0])) >> 3;
  
uint64_t candidate;

unsigned long start;
long found = 0; // Number of primes found

void setup() 
{
  Serial.begin(115200); 
  Serial.println ("Starting.");

  candidate = 961748927ULL;
}  // end setup()

void loop() 
{
  uint64_t nextPSP;
  int nextPSPindex;
  for (int i=0;i<nPSP;i++) {
    nextPSP = fetchPSP(i);
    if ((nextPSP == candidate) || (nextPSP > candidate)) {
      nextPSPindex = i;
      break;
    }
    nextPSP = 0;
    nextPSPindex = 0;
  }
  start = millis();
  
  for ( ; candidate < 9999999999999999ULL; candidate += 2) {
    if (nextPSP != candidate) {
      if (isPrime(candidate)) {
        found++;
        showPrime (candidate);
      }
    }
    else {
      nextPSPindex++;
      if (nextPSPindex < nPSP) {
        nextPSP = fetchPSP(nextPSPindex);
      }
    }
  }
  found = 1;
  showPrime(1);
  found++;
  showPrime(2);
  candidate = 3;  // back to start
}  // end loop()

bool Miller(uint64_t source)
{
  if(source <= 3) {
    return true;
  }
  if((source & 1) == 0) {
    return false;
  }

  uint64_t d = source - 1;
  int s = 0;

  while((d & 1) == 0)
  {
    d >>= 1;
    s ++;
  }

  uint8_t a;
  const byte bases[] = {2, 3, 5, 7,};// 11, 13, 17, 19, 23 };
  const byte nbases = sizeof(bases)/sizeof(bases[0]);
  for(int i = 0; i < nbases; i++)
  {
    a = bases[i];
    uint64_t x = powmodll(a, d, source);
   if((x == 1) || (x == (source - 1))) {
      continue;
    }

    bool stillOK = false;
    for(int r = 1; r < s; r++)
    {
      x = mulmodll(x, x, source);
      if(x == 1)
        return false;
      if(x == (source - 1)) {
        stillOK = true;
        break;
      }
    }

    if(!stillOK)
      return false;
  }

  return true;
}  // end Miller()

bool isPrime(uint64_t n) {
  if (!check(n)) {
    return false;
  }
  if (n > 0x10000) {
    if (!Miller (n)) {
      return false;
    }
  }
  return true;
}

void showPrime (uint64_t n) 
{
  printll(n);
  Serial.print(" is prime. Took ");
  Serial.print(millis () - start);
  Serial.print (" mS. Found ");
  Serial.print (found);
  Serial.println (" primes.");
  start = millis ();

}  // end showIfPrime()

bool check(uint64_t n) {
  if (n <= 3) {
    return true;
  }
  if (!(n & 1)) {
    return false;
  }
  uint8_t i = 1;
  uint8_t f = pgm_read_byte_near(primes);
  while ((i < nPrimes) && (f < n)) {
    if ((n % f) == 0) {
      return false;
    }
    f = pgm_read_byte_near(primes+i);
    i++;
  }
  return true;
} // end check()

uint64_t fetchPSP(int i) {
  i <<= 3;
  uint64_t n = 0;
  for (int8_t k=0;k<sizeof(n);k++) {
    n = n <<= 8;
    n |= pgm_read_byte_near(psps+i+k);
  }
  return n;
} // end fetchPSP()

void printll(uint64_t n) {
  // print an unsigned long long
  if (n == 0) {
    Serial.print('0');
    return;
  }
  const byte DIGITS = 21;
  char s[DIGITS+1];
  for (byte i=0;i<DIGITS;i++) {
    s[i] = 'X';
  }
  s[DIGITS] = 0;
  byte i = DIGITS-1;
  while (n > 0) {
    s[i] = (n % 10) | '0';
    n = n / 10;
    i--;
  }
  i++;
  char *z = &s[i];
  Serial.print(z);
} // end printll()

uint64_t powmodll(uint8_t a, uint64_t b, uint64_t c) {
  // Calculate a**b % c,
  uint64_t x = 1;
  uint64_t y = a;

  while(b > 0) {
    if(b & 1) {
       x = mulmodll(x, y, c);
      }
      y = mulmodll(y, y, c);
      b >>= 1;
    }
    return x;
} // end powmodll()

uint64_t mulmodll(uint64_t a, uint64_t b, uint64_t c) {
  // Calculate a*b %c
  uint64_t imh; // Intermediate results for multiplication
  uint64_t iml;
  uint64_t imm;

  // Multiply a and b
  // Routine is not general:
  //   relies on fact that MSB of inputs is zero
  // Product stored as two unsigned long long's
  imm = (a & 0xffffffffUL) * (b >> 32);
  iml = (a >> 32) * (b & 0xffffffffUL);
  imm += iml; // overflow for general inputs
  iml = (a & 0xffffffffUL) * (b & 0xffffffffUL);
  imm += iml >> 32;
  iml = (iml & 0xffffffffUL) + (imm << 32);
  imh = (a >> 32) * (b >> 32);
  imh += imm >> 32;

  if (imh == 0) {
    return (iml % c);
  }
  for (uint8_t i=0;i<8;i++) {
    imh = imh << 8;
    imh |= iml >> 56;
    iml = iml << 8;
    if ((imh >> 56) != 0) {
      if (imh > c) {
        imh %= c;
      }
      else {
        if (imh == c) {
          imh = 0;
        }
      }
    }
  }
  if (imh > c) {
    imh %= c;
  }
  return imh;
} // end mulmodll()

Here's Uno output for comparison to reply #72, with clock speed the standard 16 MHz:

1000000000054141 is prime. Took 1433 mS. Found 1 primes.
1000000000054147 is prime. Took 778 mS. Found 2 primes.
1000000000054259 is prime. Took 1594 mS. Found 3 primes.

deep bow :slight_smile:
well done!

update: 7 modulo's ....

wrt mulmodll()

Isn't there a theorem that says: x = (a*b) % c <==> x = ((a%c) * (b%c) ) %c ?
That might speed it up ..

Thanks.

Wikipedia says so - Modulo - Wikipedia. But, in this program, whenever (a*b)%c is calculated, a is either a small prime, or it's %c; and b is either equal to a, or it's %c. So, already a%c=a, and b%c=b whenever big numbers are multiplied and modded. I can't see how to gain ground using that equivalency.

Can you elaborate?

you're right,
if (a%c) == a and (b%c) == b
there is no gain with this formula.


my idea was to rewrite

uint64_t mulmodll(uint64_t a, uint64_t b, uint64_t c) {
  // Calculate a*b %c
  uint64_t imh; // Intermediate results for multiplication
  uint64_t iml;
  uint64_t imm;

  // Multiply a and b
  // Routine is not general:
  //   relies on fact that MSB of inputs is zero
  // Product stored as two unsigned long long's
  imm = (a & 0xffffffffUL) * (b >> 32);
  iml = (a >> 32) * (b & 0xffffffffUL);
  imm += iml; // overflow for general inputs
  iml = (a & 0xffffffffUL) * (b & 0xffffffffUL);
  imm += iml >> 32;
  iml = (iml & 0xffffffffUL) + (imm << 32);
  imh = (a >> 32) * (b >> 32);
  imh += imm >> 32;

  if (imh == 0) {
    return (iml % c);
  }
  for (uint8_t i=0;i<8;i++) {  // <<<<<<<<< this loop can cause multiple % math
    imh = imh << 8;
    imh |= iml >> 56;
    iml = iml << 8;
    if ((imh >> 56) != 0) {
      if (imh > c) {
        imh %= c;  // <<<<<<<<<<<<<<<<<<<<<<<<<< here
      }
      else {
        if (imh == c) {
          imh = 0;
        }
      }
    }
  }
  if (imh > c) {
    imh %= c;
  }
  return imh;
} // end mulmodll()

to

uint64_t mulmodll(uint64_t a, uint64_t b, uint64_t c) 
{
  return ((a%c)*(b%c))%c;
}

I think I managed to clean up your print routine a little.
WARNING: untested code!

void printll(uint64_t n) {
  // print an unsigned long long
  char figs[21];
  unsigned long exa = 0, giga = 0, ones = 0;
  ones = n;
  if (n >= 1000000000) {
    n /= 1000000000;
    giga = n;
    ones -= 1000000000 * giga;
    if (n >= 1000000000) {
      exa = n / 1000000000;
      giga -= 1000000000 * exa;
    }
  }
  if (exa) sprintf(figs, "%d%09d%09d", exa, giga, ones);
  else if (giga) sprintf(figs, "%d%09d", giga, ones);
  else sprintf(figs, "%d", ones);
  Serial.print(figs);
}

tmd3:
We used BigNumber to get around the fact that intermediate results can overflow an unsigned long long in the Miller-Rabin test. That test requires calculating ab % n for large numbers. Using numbers that are less than 1016 - 54 bits - the results always fit into a long long, but the intermediate products that generate the exponentials can be bigger than 264. BigNumber gets around the restriction, but at some cost in speed and storage.

This code uses function "mulmodll()," that produces the product of two unsigned long long's, modulo a third unsigned long long, without overflowing on intermediate products. It generates the product internally as two unsigned long long's, and calculates the modulo by repeated shifts and mods. It has its own overhead - it has to take % as many as seven times. But, it's working in native number formats, rather than in BCD, and it goes faster.

This code depends heavily on the fact that numbers bigger than 1016 aren't investigated. That fact simplifies the 128-bit multiplication, in that certain intermediate results can't overflow, and it simplifies the mod operations, too. The code isn't general. It's assumptions are that the lower half of any argument times the upper half of any argument will be less than 263, and that the modulus will always be less than 256. With 1016 = 253.15, the problem fits, with a bit or two to spare.

If you are going to take many numbers modulo the same, uh, modulus, and within the limitations you describe, then we can optimize thus:
Let's call our modulus "m". We operate under the assumption that m < MAX/512, where MAX is the biggest number our datatype can hold. If m is bigger than MAX/512, our intermediate results are likely to overflow, and we can't have that.

In the situation described in the quote above, it appears we are safe, as MAX = (264)-1 > 1019, and m is at most 10**16. Let's continue with this.

As in this case our MAX is (264)-1, we make a table of 264 mod m, 272 mod m, 280 mod m, etc. To prevent overflows, we will also want the value of 2**63 mod m.

To find the value of ((264)*hi + lo) mod m, with hi < 264 and lo < 2**64, we use this algorithm:

split hi into bytes hib[0] (the lowest byte), through hib[7] (the highest byte)
let r = lo mod m
let r = r + (hib[0] * (2**64 mod m))
if (r >= 2**63) {
  let r = r - 2**63 (a bit mask will work here)
  let r = r + (2**63 mod m)
}
let r = r + (hib[1] * (2**72 mod m))
if (r >= 2**63) {
  let r = r - 2**63 (a bit mask will work here)
  let r = r + (2**63 mod m)
}
let r = r + (hib[2] * (2**80 mod m))
if (r >= 2**63) {
  let r = r - 2**63 (a bit mask will work here)
  let r = r + (2**63 mod m)
}
... until we've gone through all the bytes of hi
... and finally,
let result = r mod m

This kills most of the modulo operations.
Of course, it's untested, but I don't see why it wouldn't work.

Indeed, the Miller-Rabin test calculates a whole lot of (x*y) % n's, where n is the number under test. There's benefit to be had from doing something first with n if it will simplify the mod operations.

Here's "Barrett reduction," and it does just that, though without making a table: Barrett reduction - Wikipedia. It lets us calculate an integer function of the number under test, about the same size, and use it to get modulo results by multiplication, shifting, and subtraction. It adds complexity, but reduces the effort of multiple modulo calculations with the same modulus. The problem is that for this project, the multiplication yields a 3-uint64_t product. Routines for multiple uint64_t's are inefficient compared to calculating with native data types, as they can't see the carry bit. Barrett provides a speed improvement, but not the whopping acceleration that I hoped for. It takes about 60% as long to calculate a 16-digit prime, average, as the version from reply #81. It beats one second per prime, average, on the Uno.

Barrett numbers k and m are added the cheesy way - as global variables, just before the Miller-Rabin tests. There are a bunch of tweaks that would improve the speed a little bit, but the ones that I can see are in routines that run once per candidate, and they're unlikely to make a big difference. The code spends two-thirds of its time in function multll(), which multiplies two uint64_t's; I can't see way to improve that routine much. There may be ways that I can't see, or ways to eliminate calls to it.

Barrett reduction was something of a chore to implement. I suspect that it would be a lot easier with BigNumber, but it doing it would require us to get into the raisemod() function in number.c. I'm not sure I want to try it, and I suspect that it wouldn't beat this speed, anyway.

The code is so long that I can't fit it in with this post. It's attached, [Edit: attached file, changed this] along with the included file.

Here's output for the representative primes that have become the staple of this thread, from the Uno:

1000000000054141 is prime. Took 849 mS. Found 4 primes. 
1000000000054147 is prime. Took 432 mS. Found 5 primes. 
1000000000054259 is prime. Took 995 mS. Found 6 primes.

The Due, of course, ran very fast. Times were about 10 ms for integers in the high end of the 16-digit range. Accuracy looks great: five million primes, from the Due, exactly matched the 46th through 50th million primes from this page: The first fifty million primes.

[Edit: Fixed error in attached file.]

PrimesTrialPseudoBarrett.ino (7.55 KB)

Pseudoprimes.h (29.6 KB)

From multll():

*imh += 11ULL << 32;

Eleven??

Eleven bad. That should be a one. If that statement ever executed, it should have caused an error; nevertheless, a test with five million primes ran without a hitch last night.

Here's what I did to test it:

  • Added a "print something and stop" inside those brackets. The Due went to 120K primes found, and that didn't execute.
  • Generated a whole bunch of uint64_t random pairs, multiplied them, and checked about 40K results in Perl with Math::GMPz. Again, the print and stop didn't execute. All the products were correct.
  • Ran 100K random pairs, this time with the most significant nybble set to 0xF. Again, no print and stop, and products checked out.
  • Rewrote the routine to operate on bytes, work on nybbles, and generate integer products; ran it against all 65K combinations of inputs, and checked. Again, no print and stop, and all the products checked.
  • Multiplied 0xFFFFFFFFFFFFFFFF by itself. Nothing went wrong.

Apparently, that statement never executes - there's never a carry after the second addition to the intermediate product. That product captures the products of the high half of one input multiplied by the low half of the other. That happens twice, and those products are added, and that may generate a carry - that's the first test. Then the high half of the product of the low halves of the inputs is added to the result. I'd expect that addition to sometimes generate a carry, but it appears that it never does. I can see why it won't when the inputs are restricted to 54 bits, but I can't see how it never gets a carry with arbitrary inputs. If I can figure out why that test is unnecessary, I can remove it.

[Edit: Changed "11ULL" to "1ULL" in file attached to reply #87; fixed spelling.]

odometer:

[quote author=Nick Gammon link=topic=192209.msg1465943#msg1465943 date=1384298509]
No I didn't, the library basically uses the standard number.c file which I found somewhere.

I thought you said you modified that library for Arduino.
[/quote]

I took the standard number.c file and added a class wrapper around it so you didn't have to muck around with malloc and free all over the place. So the low-level computations are the same.

#include "Pseudoprimes.h";

Where is this file?

The text at reply #87, where the latest version of the sketch is attached, says that the included file - Pseudoprimes.h - still slumbers at reply #70. But it didn't mean to say that; it meant to say it was attached to some other reply, since it's not at #70, and it was at reply #69 last time I saw it.

Fixed #87; attached a copy of sketch and included file here.

PrimesTrialPseudoBarrett.ino (7.55 KB)

Pseudoprimes.h (29.6 KB)

Ah yes, that's a lot faster. :slight_smile:

Estimated Nick Gammon,

I LOVE prime numbers. In order to make the arduino works faster, I would like to program a prime number generating routine (just for fun purpose) using someting that make the arduino run faster.

Is there a way to do it in C? I mean a simple way, to program the arduino in C language to make it run faster calculus programs?
After many days of research, I did not found a tutorial clear enough. I even turning myself to PIC ,as there is lots of tutorial to program those on C with a 9$ programmer.

However, I think that with a FTDI breakout, a direct C programming should be easy.

Thank for helping me.
PS: I just love this project qnd the porting for big numbers!!!

One word: Mersenne!

How would you be able to generate prime numbers with the push of a button using an impulse counter?

Ok. So I am late to the party I have cobbled this together but have all 8 LEDs fully lit including the decimal points. It could be that I have a wrong connection

I have connected

VCC to 5V
GRD to GRD
DIN to pin 12
CS to pin 10 and
CLK to pin 13

If that is all good can anyone suggest where I have strayed?

Ok.

I think I have figured it out

I had

VCC to 5V
GRD to GRD
DIN to pin 12
CS to pin 10 and
CLK to pin 13

I have changed it to

VCC to 5V
GRD to GRD
DIN to pin 11
CS to pin 10 and
CLK to pin 13

and now it is counting primes.