Arduino generating prime numbers

If you want to modify my code further up you are welcome to demonstrate this.

I should point out that this is using long long type (64 bits) which probably makes the processor work fairly hard. I don't have memory for a lot of inverses, and indeed the problem with a prime number like 100000000063 is that you don't get any quick matches, so it has to do a lot of divisions (obviously excluding even numbers) to make sure it is prime.

?100000000063 / 4 = 79057

So I suppose it is OK that it takes 23 seconds to do 79057 long long divisions. But I am certainly open to faster methods, bearing in mind such methods only have 2 kB of RAM to work in.

First of all, I hope your square root is working properly.
If you are going to convert to a float before taking the square root, you lose precision, and possibly your square root will come out too low-- you know what that means.
In which case, you might want to get at your limit using something like:

limit = (long long)sqrt((float)n)
limit = max(limit, n/limit)

What I meant by avoiding division was this:

Begin by assuming that the division does come out even, and go ahead and divide.

For example, let's say you are testing 21005 for divisibility by 5. You and I can at once see it divides, but remember, the Arduino is testing 0x520D for divisibility by 0x05.

We go ahead and divide.

Hex number ends in: 0 1 2 3 4 5 6 7 8 9 A B C D E F
 Quintuple ends in: 0 5 A F 4 9 E 3 8 D 2 7 C 1 6 B
 
Which we can rearrange as:

 Quintuple ends in: 0 1 2 3 4 5 6 7 8 9 A B C D E F
Hex number ends in: 0 D A 7 4 1 E B 8 5 2 F C 9 6 3

Note that this, as well, is an arithmetic progression modulo sixteen.
Multiply the "quintuple" by 0xD to get the "original" number.
So we don't need the whole table, just the entry for 1.

0x520D ends in D. 0xD times 0xD gives us 0xA9. So our quotient ends in 9.
0x520D - (0x9 * 0x5) = 0x51E0, and we slough off the final 0.
0x51E ends in E. 0xE times 0xD gives us 0xB6. So our next quotient digit is 6.
0x51E - (0x6 * 0x5) = 0x500, and we slough off the final 0.
0x50 ends in 0. 0x0 times 0xD gives us 0x0. So our next quotient digit is 0.
0x50 - (0x0 * 0x5) = 0x50, and we slough off the final 0.
0x5 ends in 5. 0x5 times 0xD gives us 0x41. So our next quotient digit is 1.
0x5 - (0x1 * 0x5) = 0x0, so we're done.
The quotient is 0x1069.
Check by multiplication: 0x1069 * 0x5 = 0x520D

Because our dividend has only four digits, we know we must terminate within four digits. If the algorithm had not terminated soon enough, we would know that we were getting a garbage result, and that the division did not come out even after all.

But I am certainly open to faster methods, bearing in mind such methods only have 2 kB of RAM to work in.

you can check the 30K + i method as I referred to in #2 it should be a bit faster as it only tests 8 numbers of every 30 (is approx 1 in 4) where the 6k+i test 2 numbers in 6 (1 in 3) ~~> gain 10%?

Sure, but the version using longs does one every 1/2 seconds which I am fine with, and for the long long version saving 10% over 27 seconds isn't that impressive.

I was hoping for one of those "probability" primality tests to work, but my brain isn't quite up to the maths right now.

A substantial gain could be made if you could get from long long (64) to long (32)
Not tested, but just an idea

first one tests 6k-1 if (n % (k-1) == 0) return
then one test 6k+1 if (n % (k+1) ==0) retrun

One might do the following first:
let p = n /(k+1);
then n%(k+1) <==> n - p*(k+1);

This takes an equal amount of time, but with p we have an estimation for q = n/(k-1): q >= p

With this estimation we can simplify the #digits
then n%(k-1) <==> (n-p*(k-1)) % (k-1)

the multiplication p*(k-1) is relative fast and if (n- p*(k-1)) would fit in a long the remaining % would be 32 bit iso 64 bit.

Note a long % is 6 times faster than a long long %

Do you even need to use the divide or modulo operators?
If the / and % operators were forbidden to you, how then would you go about this?

a well known method without division is the sieve

  1. make a bool array on 2..n, set them all to true

  2. take next element from array (first elm == 2) and print it

  3. remove all multiples of this element from the array

  4. goto step 2

array = 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 ...
take next element (2)
remove all multiples
array = 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 ...
take next element (3)
remove all multiples
array = 5 7 11 13 15 17 19 23 25 29 31 35 ...
take next element (5)
remove all multiples
array = 7 11 13 15 17 19 23 29 31 ...... Note this partial list are all prime

the sieve is very slow in the begin and uses a lot of memory ( at least 1 bit per number). So an UNO with 2000 bytes RAM can hold max 16000 numbers
(we can skip the even ones) so 0.. 32000

I didn't mean "don't divide" or "don't calculate remainders".
What I meant was, don't use the divide or modulo operators provided as part of the programming language.
Again, see: old
Also see: Exact Division (GNU MP 6.2.1)

Again, see: http://www.steike.com/code/bits/division/

This trick is well known,
If I understand it correctly one needs to know the inverse of the divider in advance. As the prime generator goes into the millions the compiler cannot generate all this numbers in advance. The program would not fit the memory. Calculating them runtime is almost as expensive as the division in itself (I assume).

Also see: Exact Division (GNU MP 6.3.0)

assume one test: x % n == 0;
One could implement a quick check to see if it is a possible divider, just by checking if the last byte of x is a possible multiple of n. For this one a lookup table with the possible last bytes of multiples of n. 128 tables needed (odd numbers only) of 256 bits (=32bytes) ==> 2^7 * 2^5 = 4096 bytes. Too much for UNO, but doable for a MEGA,

pseudo-code
p = lastbyte of X;
if (array[n/2][p] == false) ==> division not possible

fascinating...

update: a test showed that all outcomes occur when looking at the last n bits (iso last digit in decimal system). This is due to the fact that odd numbers (which are checked as dividers) are relative prime to powers of 2 (basic number theory - long ago) which is the anount of possible values for the last n bits.
see simple example for 2 bits below

01 * 00 = 00 
01 * 01 = 01
01 *  10 = 10
01 * 11 = 11
11 * 00 = 00
11 * 01 = 11
11 * 10 = 10
11 * 11 = 01

All outcomes are in the last column, so it cannot be used as a selection.

In short this trick does not work for binary or hex representations, (Or convince me that I'm wrong :wink:

robtillaart:
a well known method without division is the sieve

  1. make a bool array on 2..n, set them all to true

I trust you are jesting here. I have 2 kB of RAM, and I am trying to calculate a 16 decimal digit prime number. Even representing each number as a bit, they are not going to fit.

It was not a solution for you , it is an prime find algorithm that doesn't use division.

However you could use RAM to hold as much primes as possible to test less numbers. An array of 900 primes would take 1800bytes RAM -
assume this fits your sketch you could do

bool isPrime(uint64_t number)
{
   for (int i=0; i<900; i++)
  {
    if (number % primeArray[i] == 0) return false;
  }
  for (uint32_t  cnum = 7001 ; cnum < 0xFFFFFFF0; cnum += 6) 
  {
    if (number % (cnum-1) == 0) return false;
     if (number % (cnum+1) == 0) return false;
   }
return true;
}

The 900th prime is 6997. // http://primes.utm.edu/lists/small/1000.txt
to test until 6997 with the 6+-1 method would 2332 steps, now it take 900. => factor 2.5 faster for that part.

Another optimization is coming soon (have to clean the code first)

The % operator is one of the slowest operators, especially for uint64_t.
Here a timing table of different (unsigned) types for 1000 % operations in micros (UNO).

u64 % u64:	414200
u64 % u32:	409600
u64 % u16:	409412
u64 % u8 :	416396
u32 % u32 :	42276
u32 % u16 :	41964
u32 % u8 :	43468
u16 % u16 :	16492
u16 % u8 :	16564
u8 % u8 :	16552

from the table we see

  1. the biggest type is what counts.
  2. 64 bit % is 10x slower than 32 bit
  3. 32 bit % is 2.5x slower than 16 bit
  4. 16 and 8 bit is equally fast.

For the prime generator / testing it therefor helps to keep the code 32 bit as far as possible and only go to 64 bit after 32 bits is exhausted.

This raised the question if it was possible to improve the 64bit % math?
I found a trick that will speed up the % operation if the denominator is in the uint16_t range (0..65535) with a factor 3++

It works as follows (explained with a decimal example)

43211234 % 13 The long number 43211234 is split in 3 overlapping parts which are % separately.
=> 4321 % 13 = 05 replace 4321 with 05
=> 0512 % 13 = 05 replace 0512 with 05
=> 0534 %13 = 1

if we do this trick on a uint64_t we can split it in three overlapping uint32_t parts.
To ensure results fit in the max denominator is an uint16_t

IN theory the timing changes from 414 uSec to 3x 42 usec (126) which is a factor 3.3

Proof of concept code (does also contain the modulo timing above)

//    FILE: faster_mod_uint64_t.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: proof of concept
//    DATE: 2013-10-30
//     URL:
//
// Released to the public domain
//

void setup() 
{
  Serial.begin(115200);
  Serial.println("Start ");

  volatile uint64_t num = 12345676123123123LL;
  volatile uint64_t cnum = 6666;
  volatile uint64_t x;

  x = num % (cnum - 1); 
  Serial.println(x);

  uint32_t start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num % (cnum - 1); 
  }
  uint32_t stop = micros();
  Serial.print("u64 % u64:\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num % ((uint32_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u64 % u32:\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num % ((uint16_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u64 % u16:\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num % ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u64 % u8 :\t");
  Serial.println(stop - start);

  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint32_t) num) % ((uint32_t)(cnum-1));
  }
  stop = micros();
  Serial.print("\nu32 % u32 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint32_t) num) % ((uint16_t)(cnum-1));
  }
  stop = micros();
  Serial.print("u32 % u16 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint32_t)num) % ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u32 % u8 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint16_t)num) % ((uint16_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("\nu16 % u16 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint16_t)num) % ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u16 % u8 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint8_t)num) % ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("\nu8 % u8 :\t");
  Serial.println(stop - start);

  // THE TRICK NEEDS AN UNION
  union 
  {
    uint64_t ll;
    struct
    {
      uint32_t lb;
      uint32_t lh;
    } 
    a;
    struct
    {
      uint16_t ia;
      uint32_t lm;
      uint16_t ib;
    } 
    b;
  } 
  tt;

  start = micros();
  for (int i=0; i<1000; i++)
  {
    if (cnum < 0xFFFF)
    {
      tt.ll = num;
      uint16_t t = cnum - 1;
      if (tt.a.lh >= t) tt.a.lh %= t;
      if (tt.b.lm >= t) tt.b.lm %= t;
      if (tt.a.lb >= t) x = tt.a.lb % t;
      else x = tt.a.lb;
    }
    else
    {
      x =  num % (cnum -1);
    }    
  }
  stop = micros();
  Serial.print("\nsmart:\t");
  Serial.println(stop - start);

  tt.ll = num;
  uint16_t t = cnum - 1;
  if (tt.a.lh >= t) tt.a.lh %= t;
  if (tt.b.lm >= t) tt.b.lm %= t;
  if (tt.a.lb >= t) x = tt.a.lb % t;
  else x = tt.a.lb;
  Serial.println(x);
}

void loop() 
{
}

The example sketch runs the modulo calculation in 123 usec which matches the theory quite well.

This way at least all numbers from 0..65535 can be tested faster.

Please give it a try

(not investigated if extended to byte is faster)

update: a similar trick speeds up division: u64 / u16 from 408 millis to 91 millis

//
//    FILE: faster_div_uint64_t.ino
//  AUTHOR: Rob Tillaart
// VERSION: 0.1.00
// PURPOSE: proof of concept
//    DATE: 2013-10-30
//     URL:
//
// Released to the public domain
//

void setup() 
{
  Serial.begin(115200);
  Serial.println("Start ");

  volatile uint64_t num = 12345676123123123LL;
  volatile uint64_t cnum = 6666;
  volatile uint64_t x;

  x = num / (cnum - 1); 
  Serial.println(x);

  uint32_t start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num / (cnum - 1); 
  }
  uint32_t stop = micros();
  Serial.print("u64 / u64:\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num / ((uint32_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u64 / u32:\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num / ((uint16_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u64 / u16:\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = num / ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u64 / u8 :\t");
  Serial.println(stop - start);

  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint32_t) num) / ((uint32_t)(cnum-1));
  }
  stop = micros();
  Serial.print("\nu32 / u32 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint32_t) num) / ((uint16_t)(cnum-1));
  }
  stop = micros();
  Serial.print("u32 / u16 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint32_t)num) / ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u32 / u8 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint16_t)num) / ((uint16_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("\nu16 / u16 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint16_t)num) / ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("u16 / u8 :\t");
  Serial.println(stop - start);


  start = micros();
  for (int i=0; i<1000; i++)
  {
    x = ((uint8_t)num) / ((uint8_t)(cnum - 1)); 
  }
  stop = micros();
  Serial.print("\nu8 / u8 :\t");
  Serial.println(stop - start);


  union 
  {
    uint64_t ll;
    struct
    {
      uint32_t lb;
      uint32_t lh;
    } 
    a;
    struct
    {
      uint16_t ia;
      uint32_t lm;
      uint16_t ib;
    } 
    b;
  } 
  tt;

  uint64_t quot;

  start = micros();
  for (int i=0; i<1000; i++)
  {
    quot = 0;
    if (cnum < 0xFFFF)
    {
      tt.ll = num;
      uint16_t t = cnum - 1;
      uint32_t q = tt.a.lh / t;
      quot = q;
      tt.a.lh -= q*t;
      quot <<= 32;

      q = tt.b.lm / t;
      quot += ((uint64_t)q) << 16;
      tt.b.lm -= q*t;

      q = tt.a.lb / t;
      quot += q;
      //tt.a.lb -= q*t;
    }
    else
    {
      x =  num / (cnum -1);
    }    
  }
  stop = micros();
  Serial.print("\nsmart:\t");
  Serial.println(stop - start);

  quot = 0;
  tt.ll = num;
  uint16_t t = cnum - 1;
  uint32_t q = tt.a.lh / t;
  tt.a.lh -= q*t;
  quot = ((uint64_t)q) << 32;

  q = tt.b.lm / t;
  tt.b.lm -= q*t;
  quot += ((uint64_t)q) << 16;

  q = tt.a.lb / t;
  tt.a.lb -= q*t;
  quot += q;
  Serial.println(quot);
}

void loop() 
{
}

Without a delay in the code, it took nearly 7.4 days to roll over.

This way at least all numbers from 0..65535 can be tested faster.

I tried a simple test of doing the calculations up to 2^32 using unsigned long and then switching to long long. It didn't make any noticeable difference (still took 27 seconds per prime, roughly).

Checking how long it takes to reach 89999999 shows that the early calculations are fast anyway (around 800 mS) so improving that 800 mS a bit doesn't change the overall result. It's the remaining 26 seconds we are worried about.

So what I think is needed is one of those primality probability tests, that we run half-a-dozen times, hopefully taking somewhat less than 27 seconds overall.

Let's put it this way, testing 2 out of 6 numbers up to 2^32 =

 (sqrt(2^32) ) / 4 = 16384 tests

Testing up to the target prime of 100000000063 requires:

 (sqrt (100000000063) ) / 4 = 79057  tests

Difference is 62673 tests. So there are still a lot needed doing with long long. Still that is only around 4 times as many tests. But we have to test more than one number (otherwise we wouldn't need any tests!). The average gap between primes is ln(p) namely:

 ln (100000000063) / 4 = 6.33

So given we need to test 6 numbers, and they take 4 times as long each, taking 27 seconds is in the ball park.

(I think I'm wrong about dividing by 4 above, but I'm waiting for someone to challenge me on that) :stuck_out_tongue:

What my above code does is speed up a part of the primality test when you are in the long long domain for the number under test. so every long long tested will benefit from that.
Testing ..65535 will reject a lot of numbers much faster, but not all that's true.

How much?:
for every prime long long found there will be 21845 faster tests (1/3 of 65535 using the 6k+-1 method)
savings per calculation in this range is 414-126 uSec = 288 usec
21845 * 288 usec = 6.291.360 usec or about 6.2 seconds per prime found. (>20% of the 27 seconds)

As there is also a gain - far more difficult to calculate - for all the non primes.
I expect this to be at least 10%, probably substantial more, as most non primes are filtered by the lower numbers.
but the proof is in the pudding test.

some first numbers of the pudding test

  1. reference version (I removed the EEPROMWriteAnything)
    ======================
    100000000003 is prime. Took 27744 mS. Found 5 primes.
    100000000019 is prime. Took 33283 mS. Found 6 primes.
    100000000057 is prime. Took 37525 mS. Found 7 primes.
    100000000063 is prime. Took 32755 mS. Found 8 primes.

  2. optimized division
    ========================
    100000000003 is prime. Took 23964 mS. Found 5 primes.
    100000000019 is prime. Took 29147 mS. Found 6 primes.
    100000000057 is prime. Took 30536 mS. Found 7 primes.
    100000000063 is prime. Took 28974 mS. Found 8 primes.

  3. 32bit cnum loop 1.0.4 (as the sqrt(64bit) is maximum a 32 bit)
    ===========================
    100000000003 is prime. Took 23510 mS. Found 5 primes.
    100000000019 is prime. Took 28685 mS. Found 6 primes.
    100000000057 is prime. Took 30016 mS. Found 7 primes.
    100000000063 is prime. Took 28521 mS. Found 8 primes.
    100000000069 is prime. Took 28520 mS. Found 9 primes.
    100000000073 is prime. Took 28520 mS. Found 10 primes.
    100000000091 is prime. Took 28524 mS. Found 11 primes.
    100000000103 is prime. Took 28533 mS. Found 12 primes.
    100000000129 is prime. Took 28593 mS. Found 13 primes.

  4. removed 2 lines delay(500)
    ===============================
    100000000003 is prime. Took 23510 mS. Found 5 primes.
    100000000019 is prime. Took 23685 mS. Found 6 primes.
    100000000057 is prime. Took 25017 mS. Found 7 primes.
    100000000063 is prime. Took 23520 mS. Found 8 primes.
    100000000069 is prime. Took 23521 mS. Found 9 primes.

Seems not to be the predicted 6 seconds in practice, but worth the diff??

It says "Found 5 primes" because I was compensating for not starting at 2. :slight_smile:

I appreciate the work you are putting into this, but to be honest an order of magnitude would excite me. That is, getting it down to 2 or so seconds.

So I really think the repeated division/modulus has to go out the window.

I tried the Miller-Rabin Primality Test from this page:

http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=primalityTesting

const long long FIRST_PRIME_TO_USE = 961748941LL;  

long long candidate;

void setup() {
  Serial.begin(115200); 
  
  while (!Serial) { }
  
  Serial.println ();
  Serial.println ();
  Serial.println ("Starting.");
  
  candidate = FIRST_PRIME_TO_USE;
  }

void loop() 
  {

  for ( ; candidate < 9999999999999999LL; candidate += 2)
    showIfPrime (candidate);
 
  candidate = 3;  // back to start!
  }  // end of loop


/* this function calculates (a*b)%c taking into account that a*b might overflow */
long long mulmod(long long a,long long b,long long c){
    long long x = 0,y=a%c;
    while(b > 0){
        if(b%2 == 1){
            x = (x+y)%c;
        }
        y = (y*2)%c;
        b /= 2;
    }
    return x%c;
}

/* This function calculates (ab)%c */
int modulo(int a,int b,int c){
    long long x=1,y=a; // long long is taken to avoid overflow of intermediate results
    while(b > 0){
        if(b%2 == 1){
            x=(x*y)%c;
        }
        y = (y*y)%c; // squaring the base
        b /= 2;
    }
    return x%c;
}

/* Miller-Rabin primality test, iteration signifies the accuracy of the test */
bool Miller(long long p,int iteration){
    if(p<2){
        return false;
    }
    if(p!=2 && p%2==0){
        return false;
    }
    long long s=p-1;
    while(s%2==0){
        s/=2;
    }
    for(int i=0;i<iteration;i++){
        long long a=rand()%(p-1)+1,temp=s;
        long long mod=modulo(a,temp,p);
        while(temp!=p-1 && mod!=1 && mod!=p-1){
            mod=mulmod(mod,mod,p);
            temp *= 2;
        }
        if(mod!=p-1 && temp%2==0){
            return false;
        }
    }
    return true;
}

const byte DIGITS = 9;

void showIfPrime (long long num) 
  {
  
  if (!Miller (num, 50))
    return;
    
  char buf [20] = { 0 };
  long long temp = num;
  byte pos = 0;
  
  // make printable
  for (pos = 0; pos < DIGITS; pos++)
    {
    if (temp == 0)
      break;
    char digit = temp % 10;
    buf [DIGITS - 1 - pos] = digit | '0';
    temp /= 10;
    }

// insert leading spaces
  for (; pos < DIGITS; pos++)
    buf [DIGITS - 1 - pos] = ' '; //  = 15;
 
 static int count = 0;
 
 Serial.print (' ');
 Serial.print (buf);
 if (++count >= 8)
   {
   Serial.println (' ');
   count = 0;  
   }
}  // end of showIfPrime

It runs quite fast (a line every second or so) so that's great, right? Except that not all of these are prime:

Starting.
 961748941 961748943 961748947 961748949 961748951 961748955 961748957 961748959 
 961748961 961748963 961748965 961748967 961748971 961748973 961748975 961748979 
 961748981 961748983 961748987 961748989 961748991 961748993 961748995 961748997 
 961748999 961749003 961749005 961749007 961749011 961749013 961749015 961749019 
 961749021 961749023 961749025 961749027 961749029 961749031 961749035 961749037 
 961749039 961749043 961749045 961749047 961749051 961749053 961749055 961749059 
 961749061 961749063 961749067 961749069 961749071 961749075 961749077 961749079 
 961749083 961749085 961749087 961749089 961749091 961749093 961749095 961749099 
 961749101 961749103 961749107 961749109 961749111 961749115 961749117 961749119 
 961749121 961749123 961749125 961749127 961749131 961749133 961749135 961749139 
 961749141 961749143 961749147 961749149 961749151 961749153 961749155 961749157 
 961749159 961749163 961749165 961749167 961749171 961749173 961749175 961749179 
 961749181 961749183 961749187 961749189 961749191

For example, the second one (961748943) is not prime.

Compare to:

               The Fiftieth 1,000,000 Primes (from primes.utm.edu)

 961748941 961748947 961748951 961748969 961748987 961748993 961749023 961749037 
 961749043 961749067 961749079 961749091 961749097 961749101 961749121 961749157 
 961749167 961749193 961749199 961749221 961749227 961749247 961749253 961749293

I thought that the data types in modulo were probably too small, so I made them larger, eg.

int modulo(long a,long b,long c){

Now, for no particularly obvious reason, it outputs nothing at all, ie.

Starting.

Hi Nick,

I agree, but I did not found the silver bullet yet (but I like optimizing / squeezing code as you know)

I'm still thinking that the result of num %(cnum-1) can be used for num % (cnum +1) or other way around as the results are close.
Moving from the 64 bit to 32 bit domain gains the factor of 10 6 you're looking for.

I tried the Miller-Rabin Primality Test from this page:

MR is a probability test, not an exact test and the reason why it is used is that it detects numbers that are easy to factorize . That is the key to modern encryption to have a number that are hard to factorize. Ideally these are the product of two large primes. In practice the number nerds found that the product of any two hard to factorize numbers work pretty well too. Stil it is a probability, not a proof for primality.

In the meantime found some more small optimizations resulting in:
5) %3 in faster version

100000000003 is prime. Took 23503 mS. Found 5 primes.
100000000019 is prime. Took 23677 mS. Found 6 primes.
100000000057 is prime. Took 25001 mS. Found 7 primes.
100000000063 is prime. Took 23514 mS. Found 8 primes.
100000000069 is prime. Took 23515 mS. Found 9 primes.

  1. cnum uint32_t

100000000003 is prime. Took 23409 mS. Found 5 primes.
100000000019 is prime. Took 23584 mS. Found 6 primes.
100000000057 is prime. Took 24914 mS. Found 7 primes.
100000000063 is prime. Took 23420 mS. Found 8 primes.
100000000069 is prime. Took 23420 mS. Found 9 primes.

  1. void showIfPrime (uint64_t num)

100000000003 is prime. Took 21912 mS. Found 5 primes.
100000000019 is prime. Took 22087 mS. Found 6 primes.
100000000057 is prime. Took 23417 mS. Found 7 primes.
100000000063 is prime. Took 21922 mS. Found 8 primes.
100000000069 is prime. Took 21923 mS. Found 9 primes.

to be continued...

I would have written code myself, but I was busy working on another project.

Since both of you gentlemen seem to be missing the point:

Yes, a lookup table will come in handy.
No, it will not need to have zillions of entries: 128 entries will do.

The lookup table will have entries for odd indices only: 0x01, 0x03, 0x05, and so on, up to 0xFF. For each entry, the index multiplied by the entry will end in ...01 (hexadecimal). The table will begin 0x01, 0xAB, 0xCD, 0xB7, ...

Let's try "dividing" 0x18979 by 0x107.
We begin by "dividing" 1 by 0x107. From our table we see the quotient ends in ...B7. In a sort of "mirror image" of how division is taught in school, we multiply 0x107 by 0xB7, giving us 0xBC01, which we then subtract from 0x0001, giving us 0x...FFFF4400. We drop the two zeros from the end, giving us 0x...FFFFFF44. We then divide this by 0x107. We multiply 0x44 by 0xB7, giving 0x309C, which tells us our next two hex digits are 9C. We multiply 0x107 by 0x9C, giving 0xA044, which we then subtract from 0x...FFFFFF44, giving 0x...FFFF5F00. (Our quotient so far is 0x...9CB7.) Again, we drop two zeros, this time from 0x...FFFF5F00, giving us 0x...FFFFFF5F. Now we can grab even more digits at a time. We will now grab four digits. We multiply 0xFF5F by 0x9CB7, giving 0x9C5470E9, so the four hex digits we grab are 70E9. We multiply 0x107 by 0x70E9, giving 0x73FF5F, and... I think we have enough digits for now. We now know that 1 "divided" by 0x107 equals 0x...70E99CB7.
Now, we multiply 0x18979 by 0x70E99CB7. We get 0xAD8C,0000,017F. Since we were only working to 8 hex digits of "precision", we can only trust the last 8 hex digits, so we have 0x0000017F. This is a reasonably small answer, given the size of our operands, and 0x17F times 0x107 does indeed equal 0x18979.
If we had tried dividing a different number, let's say 0x1897D, by 0x107 using this method, we would have got 0x1897D multiplied by 0x70E99CB7 equals 0xAD8D,C3A6,745B, which is obvious garbage: this tells us that dividing 0x1897D by 0x107 is itself a "garbage" operation (i.e. the result is not an integer, and therefore 0x1897D is not a multiple of 0x107).