IEEE 754 Floating-Point Standard: Understanding Computer Arithmetic
If you've spent any time programming, you've likely encountered a strange phenomenon: 0.1 + 0.2 does not equal 0.3. Instead, you get something like 0.30000000000000004. This isn't a bug in your language; it's a fundamental consequence of how computers represent real numbers using the IEEE 754 standard.
In this guide, we'll demystify the IEEE 754 standard, explain how floating-point numbers are stored, and provide tips for handling precision issues in your code.
What is IEEE 754?
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation. Established in 1985, it defines formats for representing real numbers in binary and the operations performed on them.
The most common formats are:
- Single Precision (32-bit): The float type in C/C++/Java.
- Double Precision (64-bit): The double type in C/C++/Java, the Number type in JavaScript, and the float type in Python.
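As a rough illustration of how the two formats differ, JavaScript's Math.fround rounds a double to the nearest single-precision value. This is only a sketch, assuming a runtime such as Node.js or a browser console; the variable name is illustrative.

```javascript
// Math.fround rounds a Number (a 64-bit double) to the nearest 32-bit float
// and returns that value widened back to a double.
const asSingle = Math.fround(0.1);

console.log(asSingle === 0.1);         // false: 24 significant bits vs 53
console.log(Math.fround(0.5) === 0.5); // true: 0.5 is a power of two, exact in both

// Typed arrays expose the storage size of each format in bytes.
console.log(Float32Array.BYTES_PER_ELEMENT); // 4
console.log(Float64Array.BYTES_PER_ELEMENT); // 8
```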
How it Works: The Anatomy of a Float
A floating-point number is represented in a way similar to scientific notation ($1.23 \times 10^4$), but in binary. It consists of three parts:
- Sign Bit (1 bit): 0 for positive, 1 for negative.
- Exponent: Determines the scale of the number.
- Mantissa (Significand): Represents the significant digits.
64-bit Double Precision Layout:
- Sign: 1 bit
- Exponent: 11 bits
- Mantissa: 52 bits
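The formula for normalized values is: $(-1)^{sign} \times (1.mantissa) \times 2^{exponent - bias}$, where the bias is 1023 for double precision (127 for single precision) and the leading 1 is implicit rather than stored.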
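The layout and formula can be checked by pulling a double apart and putting it back together. The sketch below is one possible way to do this in JavaScript with DataView; decodeDouble and rebuildNormal are hypothetical helper names chosen for this illustration, and the rebuild step only handles normalized values.

```javascript
// Split a 64-bit double into its three IEEE 754 fields.
// (decodeDouble is a hypothetical helper, not a built-in.)
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);             // big-endian by default: sign bit first
  const bits = view.getBigUint64(0); // all 64 bits as one unsigned BigInt

  const sign = Number(bits >> 63n);                      // top 1 bit
  const biasedExponent = Number((bits >> 52n) & 0x7ffn); // next 11 bits
  const mantissa = bits & 0xfffffffffffffn;              // low 52 bits
  return { sign, biasedExponent, mantissa };
}

// Rebuild the value from its fields (normalized numbers only; bias is 1023).
function rebuildNormal({ sign, biasedExponent, mantissa }) {
  const significand = 1 + Number(mantissa) / 2 ** 52; // implicit leading 1
  return (-1) ** sign * significand * 2 ** (biasedExponent - 1023);
}

const fields = decodeDouble(-6.5); // -6.5 is -1.625 x 2^2
console.log(fields.sign);           // 1
console.log(fields.biasedExponent); // 1025, i.e. 2 + 1023
console.log(rebuildNormal(fields)); // -6.5
```

BigInt is used for the bit manipulation because JavaScript's ordinary bitwise operators only work on 32 bits.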
Why $0.1 + 0.2 \neq 0.3$?
The root cause is that most decimal fractions cannot be represented exactly in binary.
- In base 10, a fraction in lowest terms has a finite representation only if its denominator has no prime factors other than 2 and 5 (the prime factors of 10).
- In base 2, a fraction in lowest terms has a finite representation only if its denominator is a power of 2.
$0.1$ is $1/10$. Since 10 has a factor of 5, it becomes an infinite repeating sequence in binary:
0.00011001100110011...
Computers must round this infinite sequence to fit into 32 or 64 bits, leading to the tiny errors we see.
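The rounding can be observed directly. The following is a small sketch, assuming any standard JavaScript runtime; binaryDigits is just an illustrative variable name.

```javascript
// The classic symptom: the rounded operands do not sum to the rounded 0.3.
console.log(0.1 + 0.2);         // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3); // false

// Asking for more digits exposes the value that is actually stored for 0.1.
console.log((0.1).toFixed(20)); // 0.10000000000000000555

// Fractions whose denominators are powers of two are stored exactly.
console.log(0.5 + 0.25 === 0.75); // true

// The stored binary expansion of 0.1: the 0011 pattern repeats until the
// significand runs out of bits and the last digits are rounded.
const binaryDigits = (0.1).toString(2);
console.log(binaryDigits);
```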
Special Values
IEEE 754 also defines several special values to handle edge cases:
- NaN (Not a Number): Result of undefined operations (e.g., 0/0).
- Infinity ($\infty$): Result of overflow or of dividing a nonzero number by zero (e.g., 1/0).
- Negative Zero (-0): Distinct from positive zero in some calculations.
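A short sketch of how these special values behave in JavaScript, where every Number is a double. Other languages may surface them differently (for example, by throwing on integer division by zero), so treat this as an illustration rather than universal behavior.

```javascript
// NaN: the result of an undefined operation. It is not equal to anything,
// including itself, so test for it with Number.isNaN.
const notANumber = 0 / 0;
console.log(Number.isNaN(notANumber));  // true
console.log(notANumber === notANumber); // false

// Infinity: from dividing a nonzero number by zero, or from overflow.
console.log(1 / 0);                // Infinity
console.log(Number.MAX_VALUE * 2); // Infinity

// Negative zero compares equal to zero, but it is a distinct value.
console.log(-0 === 0);         // true
console.log(Object.is(-0, 0)); // false
console.log(1 / -0);           // -Infinity
```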
Best Practices for Developers
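- Avoid == for computed floats: Check whether the difference is within a small tolerance (epsilon) instead, for example if (Math.abs(0.1 + 0.2 - 0.3) < Number.EPSILON) { ... }. Number.EPSILON is the gap between 1 and the next double, so on its own it only suits values near 1; for much larger or smaller values, scale the tolerance to the operands.
- Use Decimals for Money: For financial calculations, use specialized libraries (like decimal.js) or store values as integers (e.g., cents instead of dollars).
- Be Aware of Range: Double precision can represent very large numbers, but the gap between adjacent values grows with magnitude. Above $2^{53}$, not every integer is representable.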
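The three practices above might look like this in JavaScript. This is a minimal sketch: approxEqual and formatCents are hypothetical helpers, and the default tolerances (1e-9 relative, 1e-12 absolute) are assumptions that should be tuned to the application.

```javascript
// 1. Compare with a tolerance scaled to the size of the operands.
function approxEqual(a, b, relTol = 1e-9, absTol = 1e-12) {
  if (a === b) return true; // exact match, including equal infinities
  if (!Number.isFinite(a) || !Number.isFinite(b)) return false; // NaN, mixed infinities
  const scale = Math.max(Math.abs(a), Math.abs(b));
  return Math.abs(a - b) <= Math.max(absTol, relTol * scale);
}
console.log(approxEqual(0.1 + 0.2, 0.3)); // true
console.log(approxEqual(1, 1.001));       // false

// 2. Keep money as integer cents; integers in this range add exactly.
const totalCents = 10 + 20; // $0.10 + $0.20
function formatCents(cents) {
  // Assumes a non-negative integer number of cents.
  const dollars = Math.floor(cents / 100);
  const remainder = String(cents % 100).padStart(2, "0");
  return `${dollars}.${remainder}`;
}
console.log(formatCents(totalCents)); // 0.30

// 3. Past Number.MAX_SAFE_INTEGER (2^53 - 1), neighboring integers collide.
console.log(Number.MAX_SAFE_INTEGER + 1 === Number.MAX_SAFE_INTEGER + 2); // true
```

The absolute tolerance is there because a purely relative check becomes too strict when both operands are close to zero.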
FAQ
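Q: Is floating-point non-deterministic? A: Generally, no. IEEE 754 requires the basic operations (addition, subtraction, multiplication, division, square root) to be correctly rounded, so the same inputs and the same rounding mode produce the same result. However, compilers may reorder operations or fuse them into instructions such as FMA, and math library functions can differ between platforms, so results may vary slightly from one build or machine to another.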
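One reason reordering matters is that floating-point addition is not associative: each intermediate sum is rounded, so grouping changes the result. A small sketch (the variable names are illustrative):

```javascript
// The same three numbers, added in two different groupings.
const leftFirst = (0.1 + 0.2) + 0.3;
const rightFirst = 0.1 + (0.2 + 0.3);

console.log(leftFirst);                // 0.6000000000000001
console.log(rightFirst);               // 0.6
console.log(leftFirst === rightFirst); // false
```

Each expression is fully deterministic on its own; the variation only appears when a compiler or library evaluates the operations in a different order.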
Q: What is "BigInt"? A: BigInt (in JS) handles arbitrary-precision integers. It does not handle fractions. For fractions, you need a Decimal library or a Rational type.
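A brief sketch of the difference, assuming a runtime with BigInt support (ES2020 or later); the variable names are illustrative.

```javascript
// BigInt keeps every integer digit, even past Number.MAX_SAFE_INTEGER.
const exactInteger = BigInt(Number.MAX_SAFE_INTEGER) + 2n;
console.log(exactInteger); // 9007199254740993n

// The same arithmetic with Number rounds to the nearest double.
const roundedInteger = Number.MAX_SAFE_INTEGER + 2;
console.log(roundedInteger); // 9007199254740992

// BigInt has no fractional part: division truncates toward zero.
console.log(7n / 2n); // 3n
```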
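Q: How many decimal digits of precision does a double have? A: A 64-bit double has 53 bits of precision, which works out to about 15 to 17 significant decimal digits. Any decimal with up to 15 significant digits survives a round trip through a double, and 17 significant digits are always enough to identify a double uniquely.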
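Both ends of that range can be seen with toPrecision. This is a sketch; roundTrips is a hypothetical helper name and the sample values are arbitrary.

```javascript
// 17 significant digits are enough to recover the exact same double.
function roundTrips(x) {
  return Number(x.toPrecision(17)) === x;
}
console.log(roundTrips(0.1 + 0.2)); // true
console.log(roundTrips(Math.PI));   // true
console.log(roundTrips(1 / 3));     // true

// At 15 significant digits the rounding error in 0.1 + 0.2 disappears,
// so the shortened text parses back to the same double as 0.3.
const fifteenDigits = (0.1 + 0.2).toPrecision(15);
console.log(fifteenDigits);                // 0.300000000000000
console.log(Number(fifteenDigits) === 0.3); // true
```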