
Numerical Precision and Floating Point Considerations

Why Numerical Precision Matters in MATLAB

MATLAB uses binary floating point arithmetic for almost all numeric work. This allows it to handle very large and very small numbers efficiently, but it also introduces small rounding errors and limits to precision. In this chapter you focus on what this means in practice when you write and run MATLAB code.

When you understand how floating point numbers behave, you can avoid many confusing results, such as sums that are slightly off, comparisons that fail even though numbers “look” the same, or algorithms that behave unstably for some inputs.

The double and single Precision Model

By default, MATLAB stores numeric values as double, a 64-bit floating point format that follows the IEEE 754 standard. The format is not explained in detail here, but there are two key consequences that you need to know.

First, double has about 15 to 16 decimal digits of precision. This means that while you can represent many different magnitudes, you only get a fixed number of significant digits that are reliable.

Second, the spacing between adjacent representable numbers depends on the magnitude of the number. Around 1.0, the gap between two consecutive doubles is about $2.22 \times 10^{-16}$. Around $10^{10}$ the gap is much larger.

You can query this spacing in MATLAB with eps. If you call eps with no input, you get the distance from 1.0 to the next larger representable double:

```matlab
format long
eps
```

If you call eps(x), you get the distance between x and the next larger representable number:

```matlab
eps(1)
eps(1e10)
eps(0.1)
```

The larger the magnitude of x, the larger eps(x) becomes. This is a direct expression of how precision behaves in floating point arithmetic.

MATLAB can also use single precision, which uses 32 bits and provides about 7 decimal digits of precision. You can create single precision numbers with single(3.14) and you can check classes with class(x). Single precision uses less memory but has lower accuracy and a smaller range of representable values.
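A quick sketch of the two classes side by side:

```matlab
x = single(3.14);    % 32-bit single, about 7 significant decimal digits
y = 3.14;            % default 64-bit double, about 15 to 16 digits
class(x)             % 'single'
class(y)             % 'double'
```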

Rounding Errors and Representation Limits

Not all decimal numbers can be represented exactly as finite binary fractions, just like $1/3$ cannot be written as a finite decimal. For example, 0.1 in decimal has no exact binary representation, so the closest representable double is slightly different from $0.1$.

You can see this if you display 0.1 with high precision:

```matlab
format long
0.1
```

MATLAB prints a long decimal expansion of the actual stored double. This stored value is very close to $0.1$, but not exactly equal.

Because of this, seemingly simple expressions can produce surprising results:

```matlab
format long
a = 0.1 + 0.2;
b = 0.3;
a - b
```

The result is a small nonzero number, not exactly zero. The arithmetic is still correct in floating point, but the representation cannot capture the decimal values exactly.

This type of tiny discrepancy is normal. When you write numerical code in MATLAB, you should expect results that are correct only up to a certain tolerance and not rely on exact decimal equality.

Equality Comparisons and Tolerances

Due to rounding and representation limits, direct equality comparisons with == can be misleading for floating point values. If two computations of the “same” quantity follow different arithmetic paths, they may differ slightly. Using == will often give false even when the values match within reasonable numerical error.

Consider this example:

```matlab
x = 0.1 + 0.2;
y = 0.3;
x == y        % often false
```

A better pattern is to compare floating point numbers using a tolerance and check if the difference is “small enough”:

```matlab
tol = 1e-12;
abs(x - y) < tol
```

The choice of tol depends on the scale of your problem and the magnitude of your numbers. You can also scale the tolerance by the magnitude of the numbers:

```matlab
tol = 1e-12;
abs(x - y) < tol * max(1, max(abs([x y])))
```

For vectors or matrices, you can use functions such as norm to measure the size of a difference, then compare it to a tolerance:

```matlab
A = rand(100,1);
B = A * 1.000000000000001;     % slightly scaled
diffNorm = norm(A - B);
tol = 1e-10;
diffNorm < tol
```

Avoid code that depends on exact equality of floating point results, for example in if statements or while loop stopping conditions. Instead, introduce a tolerance and check whether a difference has become small enough.
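As a sketch of this pattern, an iterative loop can stop when successive iterates agree within a tolerance instead of testing exact equality. The fixed-point iteration below (Newton's method for the square root of 2) is purely illustrative:

```matlab
% Illustrative: iterate x = (x + 2/x)/2, which converges to sqrt(2)
x = 1;
tol = 1e-12;
maxIter = 100;
for k = 1:maxIter
    xNew = (x + 2/x) / 2;
    if abs(xNew - x) < tol * max(1, abs(xNew))   % tolerance check, not ==
        x = xNew;
        break
    end
    x = xNew;
end
x
```

Testing `xNew == x` instead could loop forever or stop at the wrong moment, depending on how the last rounding step falls.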

Accumulation of Errors in Loops and Sums

Each floating point operation can introduce a tiny rounding error. When you perform many operations in sequence, such as summing a long list of numbers, these errors can accumulate. The final result can deviate from the mathematically exact value by more than a single eps.

You can see this effect in long loops:

```matlab
format long
s = 0;
for k = 1:1000000
    s = s + 0.1;
end
s
```

The result is not exactly $100000$, because each addition of 0.1 carries a small rounding error.

The order in which you sum numbers also matters. Adding many small numbers to a large number can cause the small ones to have little effect, because the gap between representable numbers near the large value is bigger than the small increment.
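A minimal illustration: near $10^{16}$ the gap between adjacent doubles is 2, so adding 1 to a number of that magnitude has no effect at all:

```matlab
format long
eps(1e16)             % spacing near 1e16 is 2
(1e16 + 1) - 1e16     % 0: the increment 1 falls below the local spacing
```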

For sums in MATLAB, the built-in sum function is usually more efficient and sometimes more accurate than a manual loop. However, you should still be aware that any floating point sum will have some rounding error.

If accuracy is critical for large sums, you can consider numerically more stable approaches, for example summing from smallest magnitude to largest, or using compensated summation algorithms. MATLAB provides sum as a general tool, but specific numerical methods are a topic in more advanced work.
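As a sketch of compensated summation, the classic Kahan algorithm keeps a running correction term that recovers much of the low-order information lost in each addition. The loop below is illustrative, not a built-in MATLAB function:

```matlab
% Kahan compensated summation of one million copies of 0.1
n = 1e6;
s = 0;               % running sum
c = 0;               % running compensation for lost low-order bits
for k = 1:n
    y = 0.1 - c;     % correct the next addend by the stored compensation
    t = s + y;       % add; low-order bits of y may be lost here
    c = (t - s) - y; % algebraically zero, numerically the lost part
    s = t;
end
format long
s                    % rounding error far smaller than in the naive loop
```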

Catastrophic Cancellation

Catastrophic cancellation occurs when you subtract two nearly equal numbers. The true result should be small, but the subtraction causes many leading digits to cancel and the remaining digits are dominated by rounding errors. The relative error in the result can be very large.

Consider the expression
$$
f(x) = \sqrt{x + 1} - \sqrt{x}.
$$

For large $x$, these two square roots are very close, so their difference is small. A direct computation can lose precision:

```matlab
x = 1e12;
format long
f = sqrt(x + 1) - sqrt(x)
```

A mathematically equivalent but numerically more stable form is
$$
f(x) = \frac{1}{\sqrt{x + 1} + \sqrt{x}}.
$$

This avoids subtracting nearly equal numbers:

```matlab
g = 1 ./ (sqrt(x + 1) + sqrt(x))
```

For large x, f and g differ, and g is usually more accurate. This example illustrates that algebraic rearrangement can significantly change numerical behavior, even when expressions are mathematically identical.

When you see patterns such as “a large expression minus a nearly equal large expression,” consider whether you can rewrite the code to avoid that subtraction.

Overflow, Underflow, Inf, and NaN

Floating point formats have limits on the largest and smallest magnitudes they can represent. When a calculation exceeds these limits or produces undefined results, MATLAB uses special values.

Overflow occurs when a result is too large in magnitude to represent. MATLAB then returns Inf or -Inf:

```matlab
format long
x = realmax;      % largest representable double
y = x * 2
```

Here, realmax is the largest finite double. Multiplying by 2 produces Inf. You can inspect these limits with functions such as realmax and realmin.

Underflow occurs when a result is closer to zero than the smallest normal magnitude. In double precision this often produces either 0 or very tiny subnormal numbers. You can see the smallest positive normalized double with realmin.

MATLAB also uses NaN (Not a Number) for undefined or invalid arithmetic results:

```matlab
0 / 0
Inf - Inf
0 * Inf
```

These all produce NaN. Note that sqrt(-1) does not produce NaN in MATLAB: it returns the complex value 0 + 1i, because MATLAB switches to complex arithmetic automatically. Once a value is NaN, it tends to spread through further calculations, since most operations involving NaN result in NaN.
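A quick illustration of this propagation, and of NaN's unusual comparison behavior:

```matlab
x = 0 / 0;           % NaN
y = x + 1            % NaN: arithmetic involving NaN stays NaN
x == x               % false: NaN compares unequal to everything, even itself
```

The last line is another reason to use isnan rather than == when checking for NaN.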

You can test for these special values with isinf, isnan, and isfinite:

```matlab
isinf(Inf)
isnan(0/0)
isfinite(5)
```

Handling NaN and Inf correctly is especially important in data analysis, where missing data or invalid operations can appear in large arrays.
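For example, many reduction functions accept an 'omitnan' flag (available in recent MATLAB versions) so that missing values do not contaminate the result:

```matlab
v = [1 2 NaN 4];
sum(v)               % NaN: the NaN propagates through the sum
sum(v, 'omitnan')    % 7: NaN entries are skipped
mean(v, 'omitnan')   % mean over the three valid entries
v(isnan(v)) = 0;     % or replace NaN entries explicitly
```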

Scaling, Normalization, and Condition

The accuracy of floating point computations often depends on the scale of the numbers you work with and on the condition of the problem. Poorly scaled variables can amplify rounding errors. For example, if one variable has magnitude around $10^9$ and another around $10^{-3}$, adding them directly may lose the influence of the smaller term, because it falls below the local eps near the large value.

A simple example is:

```matlab
format long
a = 1e9;
b = 1e-3;
a + b == a
```

The result of a + b can be exactly equal to a, because the increment b is smaller than the difference between adjacent representable numbers near a.

A common way to reduce these effects is to scale or normalize your data to a more uniform range before computation, for example subtracting the mean, dividing by a standard deviation, or rescaling to a fixed interval such as [0, 1]. This does not change the mathematical relationships, but can make the numerical behavior more stable.
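A minimal sketch of z-score normalization (the variable names are illustrative):

```matlab
data = [1e9 + 1; 1e9 + 2; 1e9 + 3];    % large, poorly scaled values
z = (data - mean(data)) / std(data);   % subtract mean, divide by std
z                                      % -1, 0, 1
```

The normalized values carry the same relative information as the originals, but at a scale where floating point spacing is no longer an issue.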

The condition of a problem refers to how sensitive its solution is to small changes in input. An ill-conditioned problem can produce large output changes from tiny input perturbations, even with perfect arithmetic. Floating point rounding then interacts with this inherent sensitivity and can produce large errors. Linear algebra tools in MATLAB often provide condition numbers, for example via cond, to help detect such issues in matrix computations.
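For instance, the Hilbert matrix is a classic ill-conditioned example:

```matlab
A = hilb(10);        % 10x10 Hilbert matrix, notoriously ill-conditioned
cond(A)              % roughly 1e13: expect to lose about 13 digits
b = ones(10, 1);
x = A \ b;           % a solution is computed, but it is very sensitive
                     % to rounding in A and b
```

As a rule of thumb, a condition number around $10^k$ means roughly $k$ of the 16 available decimal digits can be lost in the solution.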

Formatting and Interpreting Numeric Output

MATLAB’s format setting controls how numbers are displayed, but it does not change how they are stored in memory. This can affect how you interpret numerical results.

For example:

```matlab
x = 0.1 + 0.2;
format short
x
format long
x
```

With format short, you may see 0.3000, which suggests exact equality with 0.3. With format long, you see more digits and the slight difference becomes visible.

When you inspect numerical results and suspect precision issues, switch to format long or format long e to see more detail. Remember that what you see is still a rounded decimal representation of the stored binary value.

Single vs Double Precision Considerations

Although MATLAB defaults to double, you may choose single precision for memory or performance reasons, especially with very large arrays or GPU computations. Single precision has fewer bits of precision, so it is more affected by rounding and representation errors.

To see the difference in spacing:

```matlab
format long
eps(double(1))
eps(single(1))
```

The eps for single at 1 is much larger than for double, which means increments smaller than that are lost.

If you mix single and double in an expression, MATLAB converts the result to single, so the lower precision wins. You can check the resulting type with class. If you intentionally use single, be consistent and aware of the reduced accuracy.
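A one-line check makes this conversion rule visible:

```matlab
x = single(1.5) + 2.5;   % mixing single and double operands
class(x)                 % 'single': the result takes the lower precision
```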

Practical Habits for Reliable Numerical Results

There are a few simple habits that can significantly improve the robustness of your numerical MATLAB code.

First, avoid exact equality comparisons for floating point values. Use tolerances and magnitude scaled checks instead. This applies to if conditions, while loop termination, and logical indexing.

Second, be careful with operations that subtract nearly equal quantities. If you can, use mathematically equivalent forms that avoid such subtraction.

Third, be aware of the range of your data and intermediate results. Extremely large or extremely small values increase the risk of overflow, underflow, and loss of significance. Normalizing inputs can help.

Fourth, inspect results with sufficient precision when diagnosing issues. Use format long or format long e to see small discrepancies and to distinguish between 0, tiny numbers, and NaN or Inf.

Finally, when you rely on library functions in MATLAB, understand that they are generally written to handle floating point issues reasonably well, but they still inherit the same limitations. Always interpret answers as approximations within a certain accuracy, not as exact mathematical values.

Key points to remember:
Floating point numbers in MATLAB have limited precision, so many decimal values cannot be represented exactly.
Do not rely on exact equality with == for floating point results. Compare using a tolerance that matches your problem’s scale.
Rounding errors accumulate in long computations, and the order of operations can affect the final result.
Subtracting nearly equal numbers can cause catastrophic cancellation and large relative errors.
Watch for Inf, -Inf, and NaN in results. Use isinf, isnan, and isfinite to detect and handle them.
Formatting with format changes only how numbers are displayed, not how they are stored.
Scaling and normalization can greatly improve numerical stability, especially for large or poorly scaled values.
