Double Precision Advice

I read that for the Arduino, there is no difference between float and double precision. Does this mean that double and float occupy the same number of bytes? If not, and there is no other significant difference between them, it seems the advice should always be to use flot. Is that a reasonable conclusion?

Secondly, I am having problems performing calculations on some big numbers (> 10^6) where there are significant changes in the 3rd decimal place and onwards. The values to the left of the dp remain unchanged; only the values to the right change.

Even when I perform two separate calculations (integer part and fraction part) and combine the result, the answer is no different from performing the operation on the original number. And this is a bit of a stumbling block.

The following sketch illustrates the problem:

#include <math.h>

double big = 10;
double small = 0.1;

void setup () {
  // fire up the serial interface for the monitor
  Serial.begin(9600);      // slower than we need but sound enough
  float t;
  double u;
  Serial.println("Double Precision");
  Serial.println();
  t=1234.1234;
  u=1234.1234;
  Serial.print("This is a big float number:  1234.1234           : "); Serial.println(t,9);
  Serial.print("This is a big double number: 1234.1234           : "); Serial.println(u,9);
  t=12345.12345;
  u=12345.12345;
  Serial.print("This is a big float number:  12345.12345         : "); Serial.println(t,9);
  Serial.print("This is a big double number: 12345.12345         : "); Serial.println(u,9);
  t=123456.123456;
  u=123456.123456;
  Serial.print("This is a big float number:  123456.123456       : "); Serial.println(t,9);
  Serial.print("This is a big double number: 123456.123456       : "); Serial.println(u,9);
  t=1234567.1234567;
  u=1234567.1234567;
  Serial.print("This is a big float number:  1234567.1234567     : "); Serial.println(t,9);
  Serial.print("This is a big double number: 1234567.1234567     : "); Serial.println(u,9);
  t=12345678.12345678;
  u=12345678.12345678;
  Serial.print("This is a big float number:  12345678.12345678   : "); Serial.println(t,9);
  Serial.print("This is a big double number: 12345678.12345678   : "); Serial.println(u,9);
  t=123456789.123456789;
  u=123456789.123456789;
  Serial.print("This is a big float number:  123456789.123456789 : "); Serial.println(t,9);
  Serial.print("This is a big double number: 123456789.123456789 : "); Serial.println(u,9);
  Serial.println();
}



void loop () {
  double j;
  // do some maths
  for (int i = 1; i < 5; i++)  {
    leadingZero(i); Serial.print(". ");
    //Serial.print("big: "); Serial.print(big,6); Serial.print("\tsmall: "); Serial.println(small,6);
    big = big*(i);
    small = small/(sqrt(i));
    j = combine(big, small);
    Serial.print("big: "); Serial.print(big,2); Serial.print("\t\tsmall: "); Serial.print(small,8); Serial.print("\tsum: "); Serial.println(j,8);
  }
  for (int i = 5; i < 11; i++)  {
    leadingZero(i); Serial.print(". ");
    //Serial.print("big: "); Serial.print(big,6); Serial.print("\tsmall: "); Serial.println(small,6);
    big = big*(i);
    small = small/(sqrt(i));
    j = combine(big, small);
    Serial.print("big: "); Serial.print(big,2); Serial.print("\tsmall: "); Serial.print(small,8); Serial.print("\tsum: "); Serial.println(j,8);
  }
  while (1);
}

// Add a small value to a big one and return the sum.
// (Parameter names now match the call combine(big, small).)
double combine (double b, double s) {
  double c;
  //Serial.print("BIG  : "); Serial.print(b, 6); Serial.print (" ");
  //Serial.print("small: "); Serial.print(s, 6); Serial.print (" ");
  c = b + s;
  //Serial.print("Sum  : "); Serial.print(c, 6); Serial.print (" ");
  return c;
}

// Print a number with a leading zero if it is below 10.
void leadingZero(int digits) {
  if (digits < 10) Serial.print('0');
  Serial.print(digits);
}

Here is the output:

Double Precision

This is a big float number : 1234.1234 : 1234.123413085
This is a big double number : 1234.1234 : 1234.123413085
This is a big float number : 12345.12345 : 12345.123046875
This is a big double number : 12345.12345 : 12345.123046875
This is a big float number : 123456.123456 : 123456.125000000
This is a big double number : 123456.123456 : 123456.125000000
This is a big float number : 1234567.1234567 : 1234567.125000000
This is a big double number : 1234567.1234567 : 1234567.125000000
This is a big float number : 12345678.12345678 : 12345678.000000000
This is a big double number : 12345678.12345678 : 12345678.000000000
This is a big float number : 123456789.123456789 : 123456792.000000000
This is a big double number : 123456789.123456789 : 123456792.000000000

  1. big: 10.00 small: 0.10000001 sum: 10.10000038
  2. big: 20.00 small: 0.07071068 sum: 20.07071113
  3. big: 60.00 small: 0.04082483 sum: 60.04082489
  4. big: 240.00 small: 0.02041242 sum: 240.02041625
  5. big: 1200.00 small: 0.00912871 sum: 1200.00915527
  6. big: 7200.00 small: 0.00372678 sum: 7200.00390625
  7. big: 50400.00 small: 0.00140859 sum: 50400.00000000
  8. big: 403200.00 small: 0.00049801 sum: 403200.00000000
  9. big: 3628800.00 small: 0.00016600 sum: 3628800.00000000
  10. big: 36288000.00 small: 0.00005250 sum: 36288000.00000000

The first part of the output simply holds an increasingly big number with an increasingly higher precision in both float and double variables and prints the result. Even the number "12345.12345" has problems in the 4th dp.

The second part of the output illustrates the huge loss of significance by iteration 7.

Is there any way of increasing the precision? An alternative library?

Ric

ricm:
Is there any way of increasing the precision?

Why?

start with changing all the 'double' into 'unsigned long'

The arduino 32bit floats have about six digits of precision...
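
As a rough illustration of what "about six digits" means in practice, here is a minimal sketch (not from the thread; the helper name floatSpacingNear is made up for this example) that estimates the gap between adjacent 32-bit float values near a given magnitude. It assumes IEEE 754 single precision with a 24-bit significand, and it only handles normal, nonzero inputs.

```cpp
#include <math.h>

// Approximate spacing between adjacent 32-bit floats near x.
// Assumption: 24-bit significand (IEEE 754 single precision).
// frexp() reports x = m * 2^e with 0.5 <= |m| < 1, so the spacing is 2^(e - 24).
// Valid for normal, nonzero x only.
float floatSpacingNear(float x) {
  int e;
  (void)frexp(x, &e);                 // only the exponent is needed
  return (float)ldexp(1.0, e - 24);   // 2^(e - 24)
}
```

Under those assumptions the spacing works out to 0.125 near 1234567, 1.0 near 12345678 and 8.0 near 123456789, which lines up with the printed values in the question. Near a Julian date of roughly 2.4 million it is 0.25.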

I read that for the Arduino, there is no difference between float and double precision. Does this mean that double and float occupy the same number of bytes?

Yes, it does. As could have been determined by looking at the documentation.

If not, and there is no other significant difference between them, it seems the advice should always be to use flot. Is that a reasonable conclusion?

No. Sometimes you want code to compile. In those cases, you would use float.

Even the number "12345.12345" has problems in the 4th dp.

See westfw's comment. That counts all the digits, before and after the decimal place.
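
The "all the digits" point also explains the second table in the question, where the sum stops changing at iteration 7. A small sketch of the effect, assuming 32-bit IEEE 754 floats (isAbsorbed is a hypothetical helper name, not something from the thread):

```cpp
// Returns true when adding `small` to `big` in 32-bit float arithmetic
// leaves `big` unchanged, i.e. `small` is less than half the spacing
// between adjacent floats at the magnitude of `big`.
bool isAbsorbed(float big, float small) {
  volatile float sum = big + small;  // volatile: force the result to be stored as a 32-bit float
  return sum == big;
}
```

With the values from the question, isAbsorbed(7200.0f, 0.00372678f) is false (the sum still moves, though only in coarse steps), while isAbsorbed(50400.0f, 0.00140859f) is true.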

Is there any way of increasing the precision?

No.

Steen:
start with changing all the 'double' into 'unsigned long'

How will that give me greater precision in the decimal places?

.. you may use double (64-bit, about 15 significant decimal digits), e.g. IAR AVR does support it, or use decimal floating point (e.g. the decNumber library, version 3.68, with an arbitrary number of decimal places). I ported decNumber to an STM32F100 and it works fine; you can measure the size of our Universe with Planck length resolution.. :)

you can measure the size of our Universe with Planck length resolution.

Can I borrow your magic tape measure?

..I've lent it to the LHC guys..

pito:
.. you may use double (64-bit, about 15 significant decimal digits), e.g. IAR AVR does support it, or use decimal floating point (e.g. the decNumber library, version 3.68, with an arbitrary number of decimal places). I ported decNumber to an STM32F100 and it works fine; you can measure the size of our Universe with Planck length resolution.. :)

Ummm, no. The AVR gcc compiler explicitly overrides the default and float, double, and long double are all 32-bits. Given the fact that the AVR processes things 8-bits at a time, and has limited memory, this is probably reasonable. It may be that if they enabled double to be 64-bits, the emulation code would fill most of the memory.

If you are doing a lot of floating point arithmetic and/or need more than 6 digits of precision, you should be looking at changing processors. The AVR just is not up to it.

Nope, the IAR EW compiler for AVR (the 8-bitters as well) does it: it provides 64-bit double and long double when the "--64bit_doubles" option is used. The code size is roughly the same, and the math functions run at almost exactly half the speed of the 32-bit float versions. The 64-bit code is not much bigger because the main changes against 32-bit fp are the number of bytes to be read (one or two parameters from the stack) and the iteration counts of the loops within the fp routines, neither of which makes the code bigger. That's it..

The limiting factor is that the AVR-GCC doesn't have long floats. If you want to switch compilers, then go for it. But that's not Arduino. As long as the requirement is "must have double float" then AVR-GCC + AVR-LIBC and therefore Arduino are not going to work.

If the O/P can give more specifics about what calculations are needed, with what precision and performance, then someone might be able to offer a workaround.

ricm:
Is there any way of increasing the precision? An alternative library?

Big Numbers!

Although there is a saying "after the 3rd decimal place no-one gives a damn". So I don't know if you really need that much precision.

Hello everyone,

I have unexpectedly been dragged away from all of this for the past few days for personal reasons but, that aside, I'd like to thank you all for your recent input.

Perhaps a context might help. I am new to the Arduino and to C++, either of which ought to deter me. But I delight in learning new things and almost all the concepts are not unfamiliar, having worked with other programming languages. For those of you old enough to recall, my first exposure to microprocessors was the SC/MP, built on breadboard. It didn't even come with a bootloader. I had to write the assembler just to use assembly language; up till that point it was pure machine code. Primitive - but fun.

To the present... I have been interested in timepieces for some time and my aim is to build a solar clock. To do that I want to explore the conceptual, design and practical parameters for building a prototype. The known position of the sun is a keystone in determining accuracy of such a clock.

Starting simply, with no correction for nutation, atmospheric refraction, etc., and building step by step a decent set of algorithms to achieve with determinate accuracy ("determinate" as opposed to unrealistically achievable) the position of the sun I can then, and only then, begin to optimise a cost-effective but accurate sensing head to track the sun.

The equations for determining solar position at time-of-day are complex, even if reduced to their simplest by ignoring obvious artefacts. The problem for me is that the significant changes to the very first calculations - that of Julian Day (JD) - are at the 3rd and 4th (and more) d.p. in a number > 2M. So... I have a very large number where the significant changes are in the thousandths.
I don't need help with the maths but my problem is two-fold:

  • the problem of 8-bit arithmetic
  • the current quantisation of precision in calculated values such as x.00, x.25, x.50, x.75 where x is >2M

This is the issue. The accurate calculation of JD is the start of the process. Lack of precision there results in horrendous errors further down the line. Errors accumulate.
It occurs to me the precision achievable is library-dependent - hence the question about available/alternative libraries. With greater precision comes a processing overhead or penalty, at least an expected overhead, but this is inconsequential provided memory bounds are reasonable: I only need to calculate position once every minute or so, so I do have the time for the complex maths to execute.
Having just returned home I haven't had the time to look in detail at the responses - that I shall do first thing tomorrow, so please do not think I have ignored suggestions.

Thanks for the support so far
Kind regards,
Ric

Instead of trying to implement a double precision library within the arduino, consider purchasing a coprocessor like this; Micromega: uM-FPU64 that would let you off load all of the math to a second chip designed to perform that math with the precision you need... That would greatly simplify the infrastructure code you need for your project.

ricm:
the problem of 8-bit arithmetic

I am uncertain about the actual difficulty of most of what you mention, except this point. Although Arduino is an 8-bit platform, AVR-GCC and the Arduino have perfectly usable support for 16, 32 and 64-bit signed and unsigned integer arithmetic. This should be fine for ANY integer computation needs, such as the number of nanoseconds since the fall of the Roman republic or anything of that nature. Although the float support is not terrific, 32 bit floats on Arduino can be relied upon for accurate computation to one part in 10^5 or so.
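
To make the integer point concrete, here is a hedged sketch that keeps a timestamp as a 64-bit count of milliseconds instead of a float. The function name jdToMillis and the choice of milliseconds as the unit are assumptions made for this example only; nothing in the thread prescribes them.

```cpp
#include <stdint.h>

// Hypothetical helper: express a Julian day number plus a time of day
// as one 64-bit integer count of milliseconds.
//   day       : whole Julian day number (about 2.45 million for current dates)
//   msIntoDay : milliseconds elapsed within that day, 0 .. 86399999
int64_t jdToMillis(int32_t day, int32_t msIntoDay) {
  const int64_t MS_PER_DAY = 86400000LL;   // 24 * 60 * 60 * 1000
  return (int64_t)day * MS_PER_DAY + msIntoDay;
}
```

Differences between two such values are exact, so a one-minute step at a day number above 2 million is still exactly 60000, where a 32-bit float would only resolve quarter days.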

There are quite a few algorithms in the old Sky and Telescope archives from the early to mid 80s (http://www.skyandtelescope.com/resources/software/3304911.html) that implement calculations something like what you describe, and seem to do so using regular floats or even integer arithmetic. I know I have a C utility I picked up somewhere in the late 80s to compute sunrise and sunset at a given date/lat/long that doesn't need anything more than a 32-bit float to work and actually does a bunch of computations using dates and times coded into integers in a way that I have never fathomed.

I doubt that Arduino will be wholly impractical for you.

From wikipedia: "for example, January 1, 2000 at 12:00:00 corresponds to JD = 2451545.0", and
JD = JDN + ... + min/1440
So a full Julian date doesn't fit in a 32bit float.

But it looks to me like it would fit ok in a 32-bit integer (for the day) and a 32bit float for the fraction of day.
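
A minimal sketch of that split, assuming Gregorian calendar dates and UT times. The struct and function names are invented for this example, and the day-number expression is the commonly quoted integer formula (usually attributed to Fliegel and Van Flandern), not something given in the thread. All intermediate values are 32-bit because a plain int is only 16 bits on the AVR.

```cpp
#include <stdint.h>

// Hypothetical split representation of a Julian date.
struct SplitJD {
  int32_t day;   // whole Julian days (the Julian date rounded down)
  float   frac;  // fraction of a day since noon, 0.0 <= frac < 1.0
};

// Julian day number for a Gregorian calendar date, integer arithmetic only.
// Relies on integer division truncating toward zero.
int32_t julianDayNumber(int32_t y, int32_t m, int32_t d) {
  int32_t a = (m - 14) / 12;   // -1 for January and February, 0 otherwise
  return (1461 * (y + 4800 + a)) / 4
       + (367 * (m - 2 - 12 * a)) / 12
       - (3 * ((y + 4900 + a) / 100)) / 4
       + d - 32075;
}

// Combine date and time of day. Julian days start at noon, so times
// before 12:00 belong to the previous Julian day.
SplitJD makeSplitJD(int32_t y, int32_t m, int32_t d,
                    int32_t hh, int32_t mm, int32_t ss) {
  SplitJD jd;
  jd.day = julianDayNumber(y, m, d);
  int32_t secs = (hh - 12) * 3600 + mm * 60 + ss;   // seconds since noon
  if (secs < 0) {
    secs += 86400;
    jd.day -= 1;
  }
  jd.frac = secs / 86400.0f;
  return jd;
}
```

For the Wikipedia example, makeSplitJD(2000, 1, 1, 12, 0, 0) gives day 2451545 and fraction 0.0. The fraction never has to share its digits with the 7-digit day number, so it keeps the full float precision.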

Which gets back to an earlier answer that someone was asking about:

start with changing all the 'double' into 'unsigned long'

A 32bit long has about 8 digits of precision, compared to the 6 digits of a 32bit float. That's essentially because a float has to give up about two digits' worth of bits to hold the exponent.
People tend to think of "floating point" as the way to handle numbers with fractional parts, but the ACTUAL strength of floating point is its ability to handle very large or very small numbers; letting the decimal point "float" WAY off to the left or right. Want to know how many molecules are in the lethal dose of Sarin nerve gas? Floating point will easily give you a reasonable result for the calculation:

(172 * 10^-6 g (LD50) / 140 g/mol) * 6.022 * 10^23 molecules/mol

None of those numbers has more than 4 digits of precision, but a 32-bit float can't handle 1000.0001, because that needs 8 digits of precision.
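
One way to read the "unsigned long" suggestion is as fixed-point arithmetic: store the value as a whole number of ten-thousandths in a 32-bit integer. The sketch below is only an illustration under that assumption. The Fixed4 type and its helpers are hypothetical names, and it handles non-negative values up to about 214748 only.

```cpp
#include <stdint.h>

// Hypothetical fixed-point type: the value is stored as a count of
// ten-thousandths, so 1000.0001 is held as the integer 10000001.
typedef int32_t Fixed4;
const int32_t FIXED4_SCALE = 10000;

// Build a fixed-point value from its whole part and its four decimal digits.
Fixed4 fixed4From(int32_t whole, int32_t tenThousandths) {
  return whole * FIXED4_SCALE + tenThousandths;
}

// Recover the whole part and the four decimal digits.
int32_t fixed4Whole(Fixed4 v) { return v / FIXED4_SCALE; }
int32_t fixed4Frac(Fixed4 v)  { return v % FIXED4_SCALE; }
```

Here fixed4From(1000, 1) holds 1000.0001 exactly, and adding one more ten-thousandth is just an integer increment. A 32-bit float given the same value has to round it to a neighbouring representable number.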

wanderson:
Instead of trying to implement a double precision library within the arduino, consider purchasing a coprocessor like this; Micromega: uM-FPU64 that would let you off load all of the math to a second chip designed to perform that math with the precision you need... That would greatly simplify the infrastructure code you need for your project.

The solution looks promising. Have you successfully interfaced this chip to the Arduino and, if so, how?

Ric

This thread has some useful info and code: Arduino Forum

Pete

I think there is no specific reason for not having 64-bit fp in Arduino. The execution speed is about half that of 32-bit fp (acceptable) and the code might be 1.5x bigger than 32-bit fp (acceptable). Especially with the 1284P and the Megas it creates new opportunities, e.g. for astronomy fans. So the software experts should consider that and include the functionality in upcoming revisions.. It is funny to read about PIC24/PIC32MX math coprocessors being needed when one wants to do a few astronomy calcs with an Arduino :)
PS: instead of using the PIC math coprocessor I'd take a PIC32MX chip in DIL28 and the free Microchip C32 compiler (with tons of various libraries), and your life is much simpler then 8) Or, better, go with the uC32..