
Topic: Understanding 32-Bit Floating Point Number Representation (binary32 format)

GolamMostafa

Let us start with the following floating point number:
Code: [Select]
float y = 75.12345678;
A floating point number has two parts: integer part (75) and fractional part (.12345678).

1.  In programming, the keyword "float" is used to declare a floating point number. Here, the "data type" is float.
2.  After the declaration of a floating point number (for example: float y = 75.12345678;), a 32-bit binary number (0x42963F36) is automatically saved into four consecutive memory locations, as shown in the following memory map diagram (Fig-1).

Figure-1: Memory map diagram for the storage of 75.12345678

(1)  The 32-bit value (0x42963F36) for the float number 75.12345678 can be found by:
(a)  Manual calculation in Step-4.
(b)  Executing the following code:
Code: [Select]
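float y = 75.12345678;
uint32_t bits;                     //32-bit variable to receive the binary32 pattern
memcpy(&bits, &y, sizeof(bits));   //copy the 4 bytes of y; safer than casting &y to (long*)
Serial.println(bits, HEX);         //shows: 42963F36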

(2)  The float value 75.12345678 for the binary32 value 0x42963F36 can be obtained by:
(a)  Manual calculation in Step-5.
(b)  Executing the following code:
Code: [Select]
float y;                          //request is placed for four byte wide memory space
uint32_t bits = 0x42963F36;       //the 32-bit binary32 pattern to be loaded
memcpy(&y, &bits, sizeof(y));     //32-bit data goes into the 4-byte wide space requested for float y
Serial.print(y, 8);               //shows: 75.12345886 (agrees with 75.12345678 up to 75.12345)

Or
Code: [Select]
float y = 75.12345678;
Serial.print(y, 8);  //shows: 75.12345886 ; only the first 7 significant digits (75.12345) are accurate

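(3)  If the float number 75.12345678 is stored into the memory locations of Fig-1, only 75.12345 is stored correctly; the value actually held is 75.1234588623046875 (see Step-5). The remaining digits are lost because the 23-bit fraction field cannot hold all the binary bits of the number, so the nearest representable value is stored instead. A binary32 float has a 24-bit significand (23 stored fraction bits plus one implied leading 1), which corresponds to about 7 significant decimal digits in total (accuracy); for 75.12345678 this means 2 digits before and 5 digits after the decimal point. The 23 of the format is the number of binary fraction bits (precision), not a count of decimal digits.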
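The accuracy limit can also be seen from the spacing between neighbouring float values. The following is only a minimal sketch and not part of the original post: the helper names bitsFromFloat, floatFromBits, and gapAbove are made up for illustration, and it assumes a positive, finite float below the largest representable value.

```cpp
#include <stdint.h>
#include <string.h>

// Copy the 4 bytes of a float into a 32-bit integer (no pointer cast needed).
uint32_t bitsFromFloat(float value) {
  uint32_t bits;
  memcpy(&bits, &value, sizeof(bits));
  return bits;
}

// Copy a 32-bit pattern back into a float.
float floatFromBits(uint32_t bits) {
  float value;
  memcpy(&value, &bits, sizeof(value));
  return value;
}

// Distance from a positive, finite float to the next larger float.
// Adding 1 to the bit pattern moves to the next representable value.
float gapAbove(float value) {
  return floatFromBits(bitsFromFloat(value) + 1) - value;
}
```

For 75.12345678 the gap is 2^-17 (about 0.0000076), so digits beyond the 5th decimal place cannot be relied on. In an Arduino sketch, Serial.print(gapAbove(y), 10) could be used to display the gap.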

3.  The 32-bit value for any floating point number is created as per the IEEE-754/binary32 standard of Fig-2. The value is also known as the binary32 formatted value. There are three parts in this format: signBit (1-bit), exponent (8-bit), and fraction (23-bit). The range of magnitudes of the normalized floating point numbers that can be created is: 1.2E-38 to 3.4E+38 (1.2×10^-38 to 3.4×10^38).

Figure-2: IEEE-754/binary32 Template to create 32-bit value for float number 75.12345678
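The three fields of the template can be pulled out of a float with shifts and masks. This is a minimal sketch under my own naming (the struct Binary32Fields and the function splitBinary32 are hypothetical, not from any Arduino library).

```cpp
#include <stdint.h>
#include <string.h>

// The three parts of the binary32 template.
struct Binary32Fields {
  uint8_t  sign;      // b31
  uint8_t  exponent;  // b30 - b23, biased by 127
  uint32_t fraction;  // b22 - b0, the leading 1 is implied and not stored
};

// Split a float into its sign, biased exponent, and fraction fields.
Binary32Fields splitBinary32(float value) {
  uint32_t bits;
  memcpy(&bits, &value, sizeof(bits));   // raw 32-bit pattern of the float

  Binary32Fields fields;
  fields.sign     = (bits >> 31) & 0x1;
  fields.exponent = (bits >> 23) & 0xFF;
  fields.fraction = bits & 0x7FFFFF;
  return fields;
}
```

For 75.12345678 (0x42963F36) this should give sign = 0, exponent = 133 (0x85), and fraction = 0x163F36.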

4.  Manual Calculation to get 32-bit binary32 value (0x42963F36) for the float number 75.12345678. In this calculation, the Template of Fig-2 has been used.
(1)  Calculating binary bits for 75: 1001011
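(2)  Calculating binary bits for 0.12345678 (Keep multiplying the fractional part by 2 and recording the integer digit of each result as the next bit, until the fractional part becomes 0 or 23 fractional bits are accumulated. Read the table below left to right, row by row.)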
Code: [Select]
0.12345678x2 = 0.24691356       0.24691356x2 = 0.49382712    
0.49382712x2 = 0.98765424       0.98765424x2 = 1.97530848  
0.97530848x2 = 1.95061696       0.95061696x2 = 1.90123392
0.90123392x2 = 1.80246784       0.80246784x2 = 1.60493568
0.60493568x2 = 1.20987136       0.20987136x2 = 0.41974272    
0.41974272x2 = 0.83948544       0.83948544x2 = 1.67897088
0.67897088x2 = 1.35794176       0.35794176x2 = 0.71588352
0.71588352x2 = 1.43176704       0.43176704x2 = 0.86353408
0.86353408x2 = 1.72706816       0.72706816x2 = 1.45413632
0.45413632x2 = 0.90827264       0.90827264x2 = 1.81654528
0.81654528x2 = 1.63309056       0.63309056x2 = 1.26618112
0.26618112x2 = 0.53236224

(3)  (75.12345678)10 = (1001011.00 01 11 11 10 01 10 10 11 01 11 0)2 (approximately)
(4)  (1001011.00 01 11 11 10 01 10 10 11 01 11 0)2  
==> (1.001011 00 01 11 11 10 01 10 10 11 01 11 0)2 × 2^6   //binary point moved 6 places to the left
(5)  Let us get binary32 value 0x42963F36 for the float number 75.12345678.
(a)  Sign bit (1-bit: b31) : 0        //Because 75.12345678 is a positive number
(b)  Biased exponent (8-bit: b30 - b23) : 6 (from Step-4(4)) + 127 (fixed bias) = 133 = 0x85 = 1000 0101
(c)  Binary fraction digits (23-bit: b22 - b0): 001011 00 01 11 11 10 01 10 10 1   //the first 23 bits after the binary point in Step-4(4); the leading 1 is implied and not stored
(d) Putting together:
Code: [Select]
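Sign bit   Biased exponent   Binary fraction digits
0             10000101           001011 00 01 11 11 10 01 10 10 1
==> arranging as nibbles, we get: 0100 0010 1001 0110 0011 1111 0011 0101
==> arranging as hex digits, we get: 0x42963F35 (fraction simply truncated to 23 bits)
==> the discarded bits (1 01 11 0) are more than half of the last stored bit, so rounding to nearest adds 1 to the fraction: 0x42963F35 + 1 = 0x42963F36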
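The manual steps of Step-4 (normalize, multiply the fraction by 2 repeatedly, then truncate or round) can be sketched in code. This is only an illustration: encodeBinary32 is a made-up helper, it handles only positive values inside the normalized float range, and it assumes that double is 64-bit (true on a PC or the DUE, but not on UNO/NANO/MEGA, where the input would already be rounded to a float).

```cpp
#include <stdint.h>

// Hypothetical helper: builds the binary32 pattern of a positive number by hand.
// Assumptions: value > 0, value within the normalized float range, 64-bit double.
uint32_t encodeBinary32(double value, bool roundToNearest) {
  // Normalize to 1.xxxx * 2^e by moving the binary point.
  int e = 0;
  while (value >= 2.0) { value /= 2.0; e++; }
  while (value < 1.0)  { value *= 2.0; e--; }

  double rest = value - 1.0;   // drop the leading 1; it is implied, not stored
  uint32_t fraction = 0;
  for (int i = 0; i < 23; i++) {   // repeated multiplication by 2
    rest *= 2.0;
    fraction <<= 1;
    if (rest >= 1.0) { fraction |= 1; rest -= 1.0; }
  }

  // rest now holds the discarded bits as a fraction of the last stored bit.
  // Round to nearest; an exact tie goes to the even fraction.
  if (roundToNearest && (rest > 0.5 || (rest == 0.5 && (fraction & 1)))) {
    fraction++;   // a carry out of the fraction bumps the exponent in the sum below
  }
  return ((uint32_t)(e + 127) << 23) + fraction;   // sign bit is 0
}
```

With truncation, encodeBinary32(75.12345678, false) should give 0x42963F35; with rounding, encodeBinary32(75.12345678, true) should give 0x42963F36, which is the value the compiler stores.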

5.  Manual Calculation to obtain float number 75.12345678 from binary32 formatted number 0x42963F36.
(1)  In this computation, the following formula of Fig-3 will be used.

Figure-3: Formula to reconstruct float number from the binary32 value

(2)  Given binary32 number is: 0x42963F36
(3)  Let us arrange the hex value of Step-5(2) to comply with the Template of Fig-2; as a result, we get:
Code: [Select]
==> 0  10000101  00101100011111100110110
(4)  Let us evaluate the equation of Fig-3 with the bit values of Step-5(3) to get the float number 75.12345678.
==> Real value = (-1)^signBit × (1 + b22×2^-1 + b21×2^-2 + ... + b0×2^-23) × 2^(e-127)   //e = 133 from Step-4(5)(b) (Exponent = 1000 0101 = 0x85 = 133)

==> Real value = 1 × (1 + 0×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + 1×2^-6 + 0×2^-7 + 0×2^-8 + 0×2^-9 + 1×2^-10 + 1×2^-11 + 1×2^-12 + 1×2^-13 + 1×2^-14 + 1×2^-15 + 0×2^-16 + 0×2^-17 + 1×2^-18 + 1×2^-19 + 0×2^-20 + 1×2^-21 + 1×2^-22 + 0×2^-23) × 2^(133-127)

==> Real value = (1 + 0 + 0 + 0.125 + 0 + 0.03125 + 0.015625 + 0 + 0 + 0 + 0.0009765625 + 0.00048828125 + 0.000244140625 + 0.0001220703125 + 0.00006103515625 + 0.000030517578125 + 0 + 0 + 0.000003814697265625 + 0.0000019073486328125 + 0 + 0.000000476837158203125 + 0.0000002384185791015625 + 0) × 2^6

==> Real value = (1 + 0.17380404472351074218750) × 64
==> Real value = 75.1234588623046875 (agrees with 75.12345678 up to 75.12345, i.e. 5 digits after the decimal point, and matches the output of Step-2(2)(b))
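The formula of Step-5 can be evaluated bit by bit in code. The sketch below is an illustration only: decodeBinary32 is a hypothetical helper, and it assumes a normalized number (exponent field from 1 to 254), so zero, subnormals, infinity, and NaN are not handled.

```cpp
#include <stdint.h>
#include <math.h>

// Hypothetical helper: evaluates
// (-1)^signBit * (1 + b22*2^-1 + ... + b0*2^-23) * 2^(e-127)
// for a normalized binary32 pattern.
double decodeBinary32(uint32_t bits) {
  int signBit       = (bits >> 31) & 0x1;
  int e             = (bits >> 23) & 0xFF;
  uint32_t fraction = bits & 0x7FFFFF;

  double significand = 1.0;   // the implied leading 1
  double weight = 0.5;        // 2^-1, the weight of b22
  for (int i = 22; i >= 0; i--) {
    if ((fraction >> i) & 1) significand += weight;
    weight /= 2.0;            // next lower bit has half the weight
  }

  double value = ldexp(significand, e - 127);   // significand * 2^(e-127)
  return signBit ? -value : value;
}
```

For 0x42963F36 this gives exactly 75.1234588623046875, the same value as the hand calculation.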

6.  Manual Calculation to obtain float number 75.12345678 from the natural (fixed-point) binary number of Step-4(3), not from the binary32 format
(1)  The given binary number (7 integer bits and 23 fractional bits, 30 bits in total) is:
(1001011.00 01 11 11 10 01 10 10 11 01 11 0)2 from Step-4(3).
==> 75 + 0×2^-1 + 0×2^-2 + 0×2^-3 + 1×2^-4 + 1×2^-5 + 1×2^-6 + 1×2^-7 + 1×2^-8 + 1×2^-9 + 0×2^-10 + 0×2^-11
+ 1×2^-12 + 1×2^-13 + 0×2^-14 + 1×2^-15 + 0×2^-16 + 1×2^-17 + 1×2^-18 + 0×2^-19 + 1×2^-20 + 1×2^-21 + 1×2^-22 + 0×2^-23

Code: [Select]
= 75 + 0 + 0 + 0 + 0.0625 + 0.03125 + 0.015625 + 0.0078125 + 0.00390625 +0.001953125
+ 0 + 0 + 0.000244140625 + 0.0001220703125 + 0 + 0.000030517578125 + 0
+ 0.00000762939453125 + 0.000003814697265625 + 0 + 0.00000095367431640625
+ 0.000000476837158203125 + 0.0000002384185791015625 + 0

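= 75 + 0.1234567165374755859375  (here, there are 22 digits after the decimal point because the last of the 23 bits is 0)
= 75.1234567165374755859375 (7 digits after the decimal point are accurate; all 23 fractional bits are spent on the fraction here, whereas binary32 has only 17 of its 23 fraction bits left for it after the 6 integer bits that follow the leading 1)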
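The natural binary number of Step-6 is a fixed-point value: an integer that is understood to be divided by 2^23. A minimal sketch of that idea follows; the names Q23_SCALE, toFixedQ23, and fromFixedQ23 are made up, the input is assumed to be non-negative and smaller than 512 so that it fits in 32 bits, and double is assumed to be 64-bit (not the case on UNO/NANO/MEGA).

```cpp
#include <stdint.h>

const double Q23_SCALE = 8388608.0;   // 2^23: number of steps per unit

// Keep 23 fractional bits of a non-negative value; extra bits are truncated.
uint32_t toFixedQ23(double value) {
  return (uint32_t)(value * Q23_SCALE);
}

// Recover the real value from the scaled integer.
double fromFixedQ23(uint32_t scaled) {
  return scaled / Q23_SCALE;
}
```

For 75.12345678 the scaled integer is 630181230 (binary 1001011 followed by the 23 fraction bits of Step-4(3)), and converting it back gives 75.1234567165374755859375.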

7.  To achieve about 15-digit accuracy (15 to 16 significant decimal digits), the IEEE-754/binary64 standard exists, where the floating point number is represented using 64 bits (1 sign bit, 11 exponent bits, and 52 fraction bits) as per the format of Fig-4. Currently, this double precision number is supported by the Arduino DUE and not by the standard Arduinos (UNO/NANO/MEGA), where double is stored as a 32-bit float. The data declaration in Arduino DUE is:
Code: [Select]
double y = 1.23456;

Figure-4: IEEE-754/binary64 Template for float number
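The binary64 fields can be extracted in the same way as the binary32 fields. This is a sketch under stated assumptions: Binary64Fields and splitBinary64 are hypothetical names, and the static_assert deliberately stops compilation on boards where double is not 64-bit (UNO/NANO/MEGA).

```cpp
#include <stdint.h>
#include <string.h>

// The three parts of the binary64 template.
struct Binary64Fields {
  uint8_t  sign;      // b63
  uint16_t exponent;  // b62 - b52, biased by 1023
  uint64_t fraction;  // b51 - b0, the leading 1 is implied and not stored
};

// Split a double into its sign, biased exponent, and fraction fields.
Binary64Fields splitBinary64(double value) {
  static_assert(sizeof(double) == 8, "this sketch needs a 64-bit double");
  uint64_t bits;
  memcpy(&bits, &value, sizeof(bits));   // raw 64-bit pattern of the double

  Binary64Fields fields;
  fields.sign     = (bits >> 63) & 0x1;
  fields.exponent = (bits >> 52) & 0x7FF;
  fields.fraction = bits & 0xFFFFFFFFFFFFFULL;
  return fields;
}
```

For 75.12345678 the exponent field should be 6 + 1023 = 1029, and the top 23 bits of the fraction (fraction >> 29) should equal 0x163F35, the truncated fraction of Step-4; the remaining 29 bits carry the extra accuracy.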
