Manipulating IEEE754 32-bit floats (incl. mapping 64-bit double)

After a discussion about exporting a 64-bit double from an Arduino to a PC application here - 4 byte float (double) to 8 byte - Programming Questions - Arduino Forum - I transformed the prototype code I posted into something more manageable and created a playground page for it.

Today I created an initial version of the page here - Arduino Playground - HomePage - and it will be 'completed' asap.

As always remarks, additions and comments are welcome.

update1: fixed code in .h file
update2: added Arduino sender + Python receiver
updateN: added Arduino echo application + Python send/receive

I think Arduino needs a 64-bit double to be fully implemented. A long time back I did this with IAR for AVR, which supports double; the double math was (only) 2x slower than the 32-bit math, which would be fully acceptable.

I think arduino needs a 64bit double to be fully implemented.

It is often asked for, but I have seen no code yet...
There are not that many sensors that deliver more significant digits than a float can handle. Note that 6 significant digits means a dynamic range of a factor 100000.

Why do you need doubles?

For the next version of the IEEE754tools I am creating faster tests for isnan() and isinf().

bool IEEE_NAN(float number)
{
    // compare only the high 16 bits: sign 0, exponent all 1's, top mantissa bit set
    // (matches the default quiet NaN; assumes a little-endian layout as on AVR)
    return (* ((uint16_t*) &number + 1) ) == 0x7FC0;
}

bool IEEE_INF(float number)
{
    // sign 0, exponent all 1's, top mantissa bits 0 ==> +INF
    return (* ((uint16_t*) &number + 1) ) == 0x7F80;
}

bool IEEE_NegINF(float number)
{
    // sign 1, exponent all 1's, top mantissa bits 0 ==> -INF
    return (* ((uint16_t*) &number + 1) ) == 0xFF80;
}

IEEE_NAN() is ~1.9x faster than isnan().
IEEE_INF() is ~4.0x faster than isinf(), but the interface is different. // factor updated after a new measurement

to be continued...

int IEEE_INF(float number)
{
    // look at the high 16 bits only (sign + exponent + top mantissa bits)
    uint16_t* x = ((uint16_t*) &number + 1);
    // if ((*x & 0x7F80) != 0x7F80) return 0;  // optional early out for normal numbers
    if (*x == 0x7F80) return 1;    // +INF
    if (*x == 0xFF80) return -1;   // -INF
    return 0;
}

Averaged over the 3 code paths this is almost ~3.3x faster than isinf() (normal numbers 3.1x).
If the commented-out extra line is added, the average is ~3.0x faster (normal numbers 3.3x). As most numbers are normal, this might be the preferred version.
So these are slower than the dedicated versions.
==> just add them all, so the user has the choice, (linker will optimize anyway).

Created a "byte"-based comparison:

int IEEE_INF2(float number)
{
    uint8_t* x = ((uint8_t*) &number);
    if (*(x+2) != 0x80) return 0;    // exponent LSB set, top 7 mantissa bits 0
    if (*(x+3) == 0x7F) return 1;    // sign 0, upper exponent bits all 1 ==> +INF
    if (*(x+3) == 0xFF) return -1;   // sign 1 ==> -INF
    return 0;
}

On average ~3.4x faster than isinf(), and only 2 bytes larger; this will be the one.

robtillaart:

I think arduino needs a 64bit double to be fully implemented.

It is often asked for, but I have seen no code yet...
There are not that many sensors that deliver more significant digits than a float can handle. Note that 6 significant digits means a dynamic range of a factor 100000.

Why do you need doubles?

Well, the simple thing is to switch to the Due (or Teensy 3.0 or DigiX), which uses an Arm processor and supports 64-bit doubles.

You are likely never to get 64-bit double support in AVR processors, since in order to get them, you need the compiler to first enable 64-bit doubles (right now, it is hardwired to 32-bit). Then you have to have multilib support in the library to support existing people with 32-bit doubles and with 64-bit doubles. Then the IDE would need to move up to the newest compiler (the current one is several years old) and offer a choice between 32-bit and 64-bit doubles.

This is all doable, but it needs cooperation between several different groups. Due to release timing, even if everybody decided it was needed, it likely wouldn't get to users until late 2014, if not 2015. Will AVR still be attractive for new designs in 2015, or will the world have switched to Arm?

Also if you are using a small memory AVR like an ATtiny (digispark, trinket, etc.) going to 64-bit floating point may cause some programs not to load, since the floating point emulator for 64-bits is likely bigger than the 32-bit emulator.

So people that need to transport raw binary 64-bit floating point values will probably have to continue to use functions like those in this library.

yet another IEEE754 function

float IEEE_POW2(float number, int n)
{
    _FLOATCONV fl;    // union from IEEE754tools mapping a float onto sign/exponent/mantissa fields
    fl.f = number;
    fl.p.e += n;      // add n to the biased exponent ==> multiply by 2^n
    return fl.f;
}

As the exponent is a power of 2, we can easily multiply the float by a power of 2 by adding the power to the exponent.
Please note the above code does not check for overflow/underflow of the exponent yet. The performance gain is ~3.5x.

added overflow detection to see effect.

float IEEE_POW2x(float number, int n)
{
    _FLOATCONV fl;
    fl.f = number;
    int e = fl.p.e + n;
    if (e >= 0 && e < 256)
    {
        fl.p.e = e;
        return fl.f;
    }
    return fl.p.s * INFINITY;
}

Still a performance gain: ~2.7x faster.

robtillaart:

    return fl.p.s * INFINITY;

Wouldn't this return NaN (0 × INFINITY) or +INFINITY in the case of overflow, instead of a correctly signed infinity? Perhaps:

  return (fl.p.s) ? -INFINITY : INFINITY;

Another way to do the infinity sign is to use an array:

float IEEE_POW2x(float number, int n)
{
    static float float_infinity[2] = { INFINITY, -INFINITY };
    _FLOATCONV fl;
    fl.f = number;
    int e = fl.p.e + n;
    if (e >= 0 && e < 256)
    {
        fl.p.e = e;
        return fl.f;
    }
    return float_infinity[fl.p.s];
}

MichaelMeissner:
...Perhaps:

  return (fl.p.s) ? -INFINITY : INFINITY;

Thanks Michael, definitely better code, no ambiguity

the array solution takes 8 bytes extra I guess (not tested); but would become interesting if we get more types of infinity :wink:

robtillaart:
Thanks Michael, definitely better code, no ambiguity

the array solution takes 8 bytes extra I guess (not tested); but would become interesting if we get more types of infinity :wink:

You won't get multiple types of infinity (other than +/-). Multiple types of NaN are possible due to the encoding, but so far computer makers have only done quiet and signalling NaNs (with nothing producing signalling NaNs). Given Arduinos run without OS support, I don't know what you would do with a trapping NaN.

On some machines the ? and : operator will win out, and on others the array will win out. From the assembly code it looks like the ? and : operator wins on the AVR, since the compiler does not do a load for the constants and only builds the negative or positive top byte via the ? and : operator. On the Arm, the array access uses a smaller instruction size; note that the ? and : operator needs to load the negative constant from memory, but it can load the positive one inline, so you have another 32-bit chunk for the constant in flash memory. On the PowerPC and x86, both constants need to be loaded from memory, so the array access is fewer instructions, and it doesn't have a comparison/jump which can cause a pipeline bubble if the jump is not predicted correctly.

You won't get multiple types of infinity (other than +/-).

I have learned that besides + and -, there is 'countable' infinity (the numbers in N, Z or Q) and 'uncountable' infinity (the numbers in R or C).

For countable infinity you can devise a way to enumerate all the items in the infinite set. This is trivial for N; for Z you alternate between pos/neg (Q is a bit more complex).
For uncountable sets like R there is always an infinite set between any 2 values.

Back to the original subject, for IEEE754 I can imagine 2 (more) types of infinity:

FLOAT_INF: a value does not fit in a float (32bit)
DOUBLE_INF: a value does not fit in a double (64bit)
DOUBLE_DOUBLE_INF: idem 128 bit.
...
INFINITY = will not fit in any size float
(so there will be an infinite number of definitions of infinity :wink: )

maybe name them F32_INF, F64_INF, F128_INF, etc.
F32_inf would be { 0, 255, 32 } for float32 // sign, exp, mantissa
F64_inf == { 0, 255, 64 } for float; and { 0, 2047, 64 } for double;
F128_inf == { 0, 255, 128 } or {0,2047, 128}
in general: { s=sign, e=MAXEXP, m=infinite size } // MAXEXP + next bit 0, indicates infinity

maybe a better form of infinity would be: { s, MAXEXP, m = #bits needed to represent the number, 0 == unknown }
as the 32-bit float has a mantissa of 23 bits, it could indicate that a number needs up to 4 million bits, or about 1 million digits.

{ 0, MAXEXP, 1600 } is an infinite number that would require 1600 bits to represent (or about 400+ digits).

This could make OrderOfMagnitude math possible, even in infinite. e.g. { 0, MAXEXP, 300 } / { 0, MAXEXP, 250 } ==> { 0, MAXEXP, 50}

A simple way might be to fill the mantissa with the exponent before truncating it to MAXEXP.
Example:
1E30 * 1E30 ==> overflow, as 1E60 does not fit in a float ==> { s, MAXEXP, 60 } // exponent is truncated to MAXEXP to indicate infinity

Some calculators show the mantissa and an E for Error. At least that gives you an indication of what the number looks like (starts with)


A number not seen in IEEE754 which would be nice to have is EPSILON:
the number that cannot be represented but is slightly more than zero (and of course -EPSILON).
That would be more interesting than rounding to ZERO.

</thinking out loud>

While there might be arguable different types of infinity, in IEEE 754 (now 754R), there are only 2 encodings of infinity (sign 0/1, exponent all 1's, fraction all 0's).

NaNs offer many more possibilities: sign unspecified, exponent all 1's, non-zero fraction, with the most significant fraction bit indicating whether the NaN is quiet (bit set) or signalling (bit clear).

While there might be arguable different types of infinity, in IEEE 754 (now 754R), there are only 2 encodings of infinity (sign 0/1, exponent all 1's, fraction all 0's).

true, we need a new spec :wink: