Looking for variant approach/sketch

Now, everything is in line and thanks to @robtillaart for adjusting my test sketch of #21 to produce output which clearly shows that iterative method to compute factorial is faster than recursive method.

Remember that iterative uses less stack space which in real projects is often more important than the performance gain.

when stack and heap meet, they "exchange" information ...

Just because I am curious...
How fast does this run?

void factorialI(uint32_t &fac, uint8_t n)
{
  fac = n;
  while ((--n) > 1) fac *= n;
}

Usually optimization is done for sorting algorithms and the like.
Some scale with n, some with n^2, some with nlog(n).
For n=1000 that makes a huge difference.
But here n is small. And the differences found so far are also small. So all of this is just for fun...

for 12!, the above codes takes 921 cycles which is same as taken by the following codes of @robtillaart:

uint32_t factorialI(uint8_t n)
{
  uint32_t fac = n;
  while ((--n) > 1) fac *= n;
  return fac;
}

OUTPUT:

1011
479001600
921
479001600
921
479001600

Will for() loop for iteration instead of post decremental while() loop produce identical result?

1 Like

@GolamMostafa
If you want to improve on the iterative version you should find a way to make the multiplications faster as they are 32 bit. For the first 7 iteration 16 bit math would work and could be faster but it would definitely increase footprint.

More interesting is that the recursive version can be modified to use 50% less stack space. Price is a slower (expectation) and slightly larger footprint.

1 Like

Testing is faster than asking.

Mmm, almost any algorithm may be optimizeable, however as you said is it worth the effort? If code works and is fast enough, why optimize?

I recall we had a lot of fun several years ago optimizing divmod10. A function to do an integer div 10 and a mod 10 in one go. The gain was huge (about 4x faster IIRC) and with assembly even more. The function is meant e.g. to display integers faster. Saving time on IO, gives extra cycles for other tasks.

It would be intersting to see the application of the above function how it seperates a digit from 25 in one go (concurrent execution of % and / operators) instead of using the conventional two-line approach that I follow.

byte y = 25; //d0d1 = 25

byte d1 = y%10;  //d1 = 5
y = y/10
//------------
byte d0 = y%10;  //d0 = 2
y = y/10;
  • Interesting, when we use the code below, the scope measures 186ns. to get the answer. i.e facValue = factorialR(7); //5040
    When the code is written so optimization doesn't happen we get 32.9us.
    .
    So yes the compiler is sticking their grimy little fingers into the pie :smiling_face_with_sunglasses:

optimized 186ns verses not optimized 32.9us

facValue = factorialR(7);

optimized 61.8us verses not optimized 59.7us

facValue = factorialR(12);





optimized 186ns verses not optimized 28.6us

facValue = factorialI(7);


optimized 56.2us verses not optimized 56us

facValue = factorialI(12);

Optimized version...

uint32_t fac(uint8_t n) {
   static uint32_t facvals[]={1,2,6,24,120,etc};
   return facvals[n];
}

:+1:
Think it can't be faster, maybe add keyword inline?
the implementation misses fact(0) by the way

inline uint32_t fac(uint8_t n) {
   static uint32_t facvals[]={1, 1, 2, 6, 24, 120, etc};
   return facvals[n];
}