Computational error with Uno R3

Using IDE 2.1.0 on Linux Mint 21.1 with an Arduino Uno R3.

The result of the calculation is incorrect for some, but not all, values. Here is sample code:

void setup() {
  // put your setup code here, to run once:
  Serial.begin(9600);
  float value1 = 32.757;
  float value2 = 32.758;
  float value3 = 32.759;
  float value4 = 32.760;
  float value5 = 32.761;
  float value6 = 32.762;

  long result1 = (long)(value1 * 2000.0);
  long result2 = (long)(value2 * 2000.0);
  long result3 = (long)(value3 * 2000.0);
  long result4 = (long)(value4 * 2000.0);
  long result5 = (long)(value5 * 2000.0);
  long result6 = (long)(value6 * 2000.0);

  Serial.print(result1);
  Serial.print("\t");
  Serial.print(result2);
  Serial.print("\t");
  Serial.print(result3);
  Serial.print("\t");
  Serial.print(result4);
  Serial.print("\t");
  Serial.print(result5);
  Serial.print("\t");
  Serial.print(result6);
  Serial.print("\t");
  Serial.println("");
}

void loop() {
  // put your main code here, to run repeatedly:

}

It prints the following results:
65514 65516 65517 65519 65522 65524
Clearly, the results 3 and 4 are wrong.
Unfortunately, I do not have another Uno to check whether the problem is caused by a bad ATmega chip or by bad software.
Your help would be much appreciated.
Best regards

That is perfectly typical float to integer truncation. Float values are not exact.

Try this code. It should make clear what is happening.

void setup() {
  Serial.begin(115200);
  while (!Serial);
  float x = 32.759;
  Serial.println(x, 6);
  Serial.println(x * 2000.0, 6);
}

void loop() {}

1. What are your expected numbers?

2. If you calculate 32.758 x 2000.0 with pencil and paper, you will get 65,516.0000.

3. To see the number from item 2 on the Serial Monitor, run the following sketch:

void setup() 
{
  Serial.begin(9600);
  float y = 32.758*2000.0;
  Serial.println(y, 4); // 4 digits after the decimal point
}

void loop() {}

4. Output:

65516.0000

5. In the above code, you told the MCU to compute 32.758 x 2000.0, and it did calculate 65516.0000. But you also told the compiler to discard the fractional part by putting (long) in front of the expression on the right-hand side (this is called casting); as a result, you got 65516, which is what you asked for.

To see the problem with a bit more clarity, change
long result1 = (long)(value1 * 2000.0);
etc. to
float result1 = (float)(value1 * 2000.0);
What does this reveal? Remember, a float has approximately six (IIRC) decimal digits of precision, due to the limits of the 4-byte implementation of a single-precision floating-point number.
When you see the results, then google rounding vs truncation, and see what you can infer.
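
One way to make the "six or so digits" concrete is to look at the gap between adjacent float values. Below is a minimal sketch in plain C, assuming float is a 32-bit IEEE-754 value and the input is positive and finite. The helper name float_spacing is made up for illustration; it is not an Arduino or libc function.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from x to the next larger float.
   Assumes float is IEEE-754 binary32 and x is positive and finite. */
float float_spacing(float x)
{
    uint32_t bits;
    float next;

    memcpy(&bits, &x, sizeof bits);    /* reinterpret the float as raw bits */
    bits += 1;                         /* next representable value upward */
    memcpy(&next, &bits, sizeof next); /* back to a float */
    return next - x;
}
```

Near 65518 the gap is 1/256 (about 0.0039), so a product whose exact value is close to 65517.998 has to land on either 65517.996 or 65518.0, and which of the two it lands on decides what the (long) cast returns.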

seems like some rounding is in order..
try..

long result3 = round(value3 * 2000.0);
long result4 = round(value4 * 2000.0);

good luck.. ~q
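
To compare truncation and rounding side by side, here is a minimal sketch in plain C. The helper names scale_truncate and scale_round are made up for illustration, and the manual "+ 0.5" rounding assumes non-negative inputs; it is only meant to mirror what the round() suggestion above achieves.

```c
/* Hypothetical helpers (names are illustrative, not from the thread). */

/* Scale by 2000 and truncate, as in the original sketch. */
long scale_truncate(float value)
{
    float product = value * 2000.0f;  /* kept in single precision */
    return (long)product;             /* the cast drops the fractional part */
}

/* Scale by 2000 and round to the nearest integer; assumes value >= 0. */
long scale_round(float value)
{
    float product = value * 2000.0f;
    return (long)(product + 0.5f);    /* push past the next integer, then truncate */
}
```

With 32.759 and 32.760 as inputs, the truncating version reproduces the 65517 and 65519 from the original post, while the rounding version gives 65518 and 65520.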

Thank you all for the quick replies. What confused me was that 'value' looks like a scaled 15-bit number and the 'result' is a 16-bit number. Of course, the scaling factor 1/1000 is not a power of two...
Best regards.

P.S. How do I mark this topic as 'solved'?

Type "long" is a 32 bit signed integer on typical Arduinos.

With reference to the above variable (value1), can you please explain what you meant by "scaled 15-bit number"?

When a float is declared, its value (32.757) is saved in memory as a 32-bit number occupying four consecutive memory locations in binary32 (IEEE-754) format (Fig-1).
Figure-1: floatmem32-757 (image: 32.757 stored in binary32 format)
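
The bit pattern can also be inspected in code. This is a minimal sketch in plain C, assuming float is IEEE-754 binary32; the helper names (float_bits, float_exponent_field, float_fraction_field) are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 32-bit pattern of a float.
   Assumes float is IEEE-754 binary32. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* copy the bytes without converting */
    return bits;
}

/* Biased 8-bit exponent field (bits 30..23). */
unsigned float_exponent_field(float x)
{
    return (unsigned)((float_bits(x) >> 23) & 0xFFu);
}

/* 23-bit fraction field (bits 22..0). */
uint32_t float_fraction_field(float x)
{
    return float_bits(x) & 0x7FFFFFu;
}
```

For 32.757f this gives 0x4203072B: sign 0, exponent field 132 (132 - 127 = 5, so a scale of 2^5 = 32), and fraction 0x03072B.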

I totally agree with the precision analysis, but want to point out that if you believe a computation is wrong on the Arduino, you can always check it by running the same calculation on your computer and comparing the results. I compiled and ran the following on my Intel-based Mac. Note that I explicitly forced the sizes to be completely compatible:

#include <stdio.h>
#include <stdint.h>  /* for int32_t */
int main(void) {
    float value1 = 32.757F;
    float value2 = 32.758F;
    float value3 = 32.759F;
    float value4 = 32.760F;
    float value5 = 32.761F;
    float value6 = 32.762F;

    int32_t result1 = (int32_t)(value1 * 2000.0F);
    int32_t result2 = (int32_t)(value2 * 2000.0F);
    int32_t result3 = (int32_t)(value3 * 2000.0F);
    int32_t result4 = (int32_t)(value4 * 2000.0F);
    int32_t result5 = (int32_t)(value5 * 2000.0F);
    int32_t result6 = (int32_t)(value6 * 2000.0F);

    printf("%d\t%d\t%d\t%d\t%d\t%d\n", result1, result2, result3, result4, result5, result6);
}

The program printed:

65514	65516	65517	65519	65522	65524

exactly the same as the Arduino.

32.757 is already a float number and yet you have appended F -- why?

On a desktop compiler, 32.757 is a double-precision (64-bit) floating-point constant. The "F" makes it single precision (32-bit). Since the compiler used for AVR-based Arduinos such as the Uno treats double as 32-bit single precision, 32.757 is effectively a single-precision constant there.

So I wanted to make sure the comparison was as close as possible. If I leave off all the "F"s I get a different result, but then the calculations aren't the same as in the Arduino:

65513	65515	65517	65519	65522	65524
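
The difference between the two outputs comes from the precision of the multiplication, not from the stored values. Here is a minimal desktop sketch in plain C, assuming a compiler where double is 64 bits; the helper names scale_in_float and scale_in_double are made up for illustration.

```c
/* Hypothetical helpers (names are illustrative, not from the thread). */

/* Multiply in single precision: the constant carries the f suffix. */
long scale_in_float(float value)
{
    float product = value * 2000.0f;  /* result rounded to float */
    return (long)product;
}

/* Multiply in double precision: value is promoted to double first. */
long scale_in_double(float value)
{
    double product = value * 2000.0;  /* keeps the tiny shortfall below .0 */
    return (long)product;
}
```

For 32.757f the float product rounds up to exactly 65514.0, while the double product stays just below it (about 65513.99994) and truncates to 65513, which matches the two outputs shown above.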

On most desktops, 32.757 is a double. In this case it doesn't matter, but you can/will get different code for:

    float result, input;
    result = input * 1000.0;  // input, constant, and calc will be promoted to "double."
    // vs
    result = input * 1000.0F;  // everything stays as "float."

(which can be somewhat important on ARM chips with single-precision floating point hardware. Being careful to use float constants instead of the default double constants can result in much smaller and faster code.)

1. That means that the declaration float input = 32.757; or double input = 32.757; in 32-bit architecture like ESP8266 will store the number as 64-bit number (0x404060E560418937). The data type float will be promoted to double.

2. If we declare like float input = 32.757F in 32-bit arch, then the number 32.757 will be stored as 0x4203072B and the data type will remain float.

3. In 8-bit architecture (ATmega328P), the declaration float input = 32.757; or float input = 32.757F; will always store the number as 0x4203072B.

Theoretically, the right hand side is calculated as a double, but then it is silently converted to a float (32bit) when stored in the float variable (which is also 32bits.) In reality, the compiler will be smart enough NOT to do this for constants (and even for some expressions.)

But consider the common example of adjusting an analogRead to a floating point value:

  float sensorValue = (analogRead(sensorPin)*5.0)/1024.0;

On an Uno R4 (32bit ARM), this compiles to the following invocations of (expensive) double-precision math:

    411e:       f001 fcf7       bl      5b10 <analogRead>
    4122:       f005 f813       bl      914c <__aeabi_i2d>   ;; convert int to double.
    4126:       2200            movs    r2, #0
    4128:       4b18            ldr     r3, [pc, #96]   ; (418c <loop+0x7c>)
    412a:       f005 f879       bl      9220 <__aeabi_dmul>    ;; double multiply
    412e:       2200            movs    r2, #0
    4130:       4b17            ldr     r3, [pc, #92]   ; (4190 <loop+0x80>)
    4132:       f005 f875       bl      9220 <__aeabi_dmul>  ;; another double multiply (the divide?)
    4136:       f005 fb43       bl      97c0 <__aeabi_d2f>  ;; convert the double result back to float

But if I do:

  float sensorValue = (analogRead(sensorPin)*5.0f)/1024.0f;

I get this nice short sequence of the native ARM floating point operations:

    411e:       f001 fcf3       bl      5b08 <analogRead>
    4122:       ee07 0a90       vmov    s15, r0
    4126:       eeb8 8ae7       vcvt.f32.s32    s16, s15  ;; convert int to float.
    412a:       eef1 7a04       vmov.f32        s15, #20
    412e:       ee28 8a27       vmul.f32        s16, s16, s15  ;; float multiply instruction
    4132:       eddf 7a15       vldr    s15, [pc, #84]  ; 4188 <loop+0x78>
    4136:       7820            ldrb    r0, [r4, #0]
    4138:       ee28 8a27       vmul.f32        s16, s16, s15  ;; second float multiply

In 8-bit architecture (ATmega328P)...

Note that the use of 32bits for "double" on AVR is a compiler choice that may not apply to all 8bit CPUs. It would not be totally unexpected for some other compiler to do the intermediate operations with 64bit doubles as well.
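
If in doubt about a particular toolchain, the size of double can simply be checked. A minimal sketch in plain C; the helper name double_is_wider_than_float is made up for illustration.

```c
/* Hypothetical helper: returns 1 if this compiler's double is wider
   than its float, 0 if both have the same size. */
int double_is_wider_than_float(void)
{
    return sizeof(double) > sizeof(float);
}
```

On a typical desktop compiler this returns 1. On a toolchain that uses 32 bits for double, as described above for AVR, it would be expected to return 0.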

