Product Support
1108 - Troubleshooting Common Floating-Point Arithmetic Problems
This technical note provides guidance to help identify and avoid phenomena that may occur when computing with floating-point numbers. Because MATLAB operations are performed in double-precision arithmetic conforming to the IEEE standard 754, a brief discussion of this standard and tips to mitigate the effects of floating point arithmetic in the computations performed by MATLAB are offered.
Basic Problems with Floating-Point Arithmetic
Basic Problems with Floating-Point Arithmetic
Section 1: Representation of Floating-Point Numbers
To recognize the floating-point arithmetic phenomena, it is essential to first understand how floating point numbers are represented. In the rare use of MATLAB that notices these occurrences, it is helpful to understand the properties and limitations of floating point numbers. This section briefly introduces the representation of floating-point numbers and some keywords. The native data type in MATLAB has traditionally been a double precision number. Double precision numbers are represented by the formula
(-1)^s 2^(e-1023) (1 + f)where s is a single bit representing the sign of the number, e is an eleven bit binary number representing the exponent, and f is a fifty-two bit binary number called the mantissa representing the precision of the number. In double precision, e spans between 1 and 2046. However, e may assume two other values, 0 (all bits are zero) and 2047 (all bits are one), which indicate special conditions which will be discussed later. The 1023 appearing in the formula above is the bias for the exponent, and it indicates that all of the numbers representable may have exponents in the range of -1022 to 1023. In double precision, f is such that 2^52 f is an integer which is greater than or equal to zero, but strictly less than 2^52. The leading 1 in the (1+f) portion of the formula above gives fifty-three bits of precision, and all of the numbers whose exponent is not one of the special values are referred to as normalized numbers. A value of 2047 for e indicates a non-finite value, such as Infinity or NaN (Not a Number). A value of 0 for e indicates a denormal, or subnormal, number. That is, when e is zero, the (1+f) term in the formula above is simply replaced with f. Thus, a zero in double precision is represented by s = 0, e = 0, and f = 0.
As a result of the finiteness of the representation of floating-point numbers, they show two limitations not experienced by the number systems which they model (namely the real numbers). These limitations are:
- There are gaps between two adjacent double precision numbers which depend on the magnitude of the numbers.
- Almost everywhere, the real numbers are not exactly representable as double precision numbers.
Gaps Between Doubles
The gaps are a result of having only fifty-three bits of precision. In particular, the next larger double precision number than 2^52 is 2^52 + 1. The hexadecimal representations of these numbers are 4330000000000000 and 4330000000000001 respectively. Use the command
format hexin MATLAB to see the hexadecimal representation of any number. For double precision floating-point numbers, the first three hexadecimal digits correspond to the sign bit followed by the eleven exponent bits. The mantissa is represented by the next thirteen hexadecimal digits. As the numbers grow larger, the gaps grow as well. For larger numbers, such as 2^100 (represented by 4630000000000000) the next largest number is 4630000000000001 which is 2^100 + 2^48! That is, between 2^100 and 2^48 there are no other double precision floating-point numbers. As a result, the following strange behavior occurs in MATLAB:
a = 2^100; b = a + 2^47; b == aans =
1
What to do instead: MATLAB Alternatives
The MATLAB command eps gives the value of the difference between its argument and the next largest floating-point number. eps called with no arguments returns the distance between 1 and the next largest double precision number. To obtain the single precision eps you may call eps('single'). Continuing with the example above, note thatformat long g; eps(2^100)ans =
281474976710656
which is equal to
2^48ans =
281474976710656
The effects this loss of precision has in numerical algorithms is discussed below.
Inexact Representations: Round-Off
Almost all real numbers are not representable as double precision floating-point numbers. Consider, for example, the number 0.1. This number is represented in double-precision by 3fb999999999999a which is larger than 0.1 by exactly (1/10)*(1/(2^(54))). As a result, successive additions of this number to itself have the following startling consequences:
format hex; (0.1 + 0.1) - 0.2ans =
0000000000000000
(0.1 + 0.1 + 0.1) - 0.3ans =
3c90000000000000
format long e; (0.1 + 0.1 + 0.1) - 0.3ans =
5.551115123125783e-017
Note that the last answer is not the expected value of zero. This is due to the representation errors in 0.2 and 0.1 compounding to give a result which is larger than the floating point representation of 0.3.
Section 2: Order of Operations, Swamping, and Cancellation
Due to the effects of round-off errors, as seen above, the influence of the order of operations becomes readily apparent. Consider the following function:f = @(n) (1 - n.*((n+1)./n - 1))./eps;In exact arithmetic, this function is equivalent to zero for all positive real numbers. In floating point arithmetic, however, this is not the case. Take into account the following plot:
plot(f(1:256),'.');
Almost none of the values of f are zero. In fact, the only arguments admit a value of zero are those corresponding to powers of two.
On the other hand, consider the function
g = @(n) (1 - n.*((n+1)./n) + n)./eps;which is exactly the same as f, except with the -n distributed. Plotting g on the same axes as f shows
plot(g(1:256),'r*');
Unlike the plot of f which shows how the error in the subtraction propagates, this plot shows how the multiplication accentuates the small round-off error made in the division. Also unlike f, which seems to assume all integers between -128 and 128 (although it does not), g only assumes values which are powers of two.
Another example of how the order of operations matters is quite simple:
format long e; eps/2 + 1 - eps/2ans =
9.999999999999999e-001
format hex; eps/2 + 1 - eps/2ans =
3fefffffffffffff
Note that the solution here is not one. Instead, it is the next smallest double precision number. The addition results in exactly one, since the one swamps the other addend:
format long e; eps/2 + 1ans =
1
format hex; eps/2 + 1ans =
3ff0000000000000
The subtraction is then performed with a so-called guard digit and results in
format long e; 1 - eps/2ans =
9.999999999999999e-001
format hex; 1 - eps/2ans =
3fefffffffffffff
What to do instead: MATLAB Alternatives
In cases like the one described above, it is best to re-order the computations to perform the operations with the numbers of smallest magnitude first in order to avoid swamping.format long e; eps/2 - eps/2 + 1ans =
1
format hex; eps/2 - eps/2 + 1ans =
3ff0000000000000
Note that it is not always practical to do this.
In some cases, it may be practical to change the computation to gain more accuracy in certain intervals. For example, consider the function
h = @(x) sqrt(x + 1) - 1;For all nonegative values of x less than eps, h will evaluate to zero, as the one will swamp x. Multiplying by sqrt(x + 1) + 1 and dividing by the same, h may be rewritten as the function
k = @(x) x./(sqrt(x + 1) + 1);A plot gives more clarity:
y = 0:(1/16):16; x = eps*y;
plot(x, h(x));
hold on;
plot(x, k(x), 'r');
legend('h', 'k', 'Location', 'NorthWest');
MATLAB has two functions which deal with this kind of phenomenon explictly. These functions are EXPM1 and LOG1P, and they are designed to compute exp(z)-1 and log(z + 1) accurately for values of z which are close to zero.
More Resources
- Moler, Cleve, "Floating Points," MATLAB News and Notes, Fall, 1996.
- Moler, Cleve, Numerical Computing with MATLAB, SIAM.
- Avoiding Common Problems with Floating Point Arithmetic in MATLAB Programming documentation
- Technical Support Solution 1-16FOQ: How do I tell if the error in my answer is the result of round-off error or a bug?
Store