
Rounding (roundoff) error is a phenomenon of digital computing resulting from the computer's inability to represent some numbers exactly. Specifically, a computer can represent exactly only integers in a certain range, depending on the word size used for integers. Certain floating-point numbers may also be represented exactly, depending on the representation scheme in use on the computer in question and the word size used for floating-point numbers. Other floating-point numbers cannot be represented exactly, regardless of the word size used.
Errors due to rounding have long been the bane of analysts trying to solve equations and systems. Such errors may be introduced in many ways, as the following sections illustrate.
Integer Representation
A digital computer can represent exactly integers in the range 0..2^{n}-1, where n is the number of bits used to represent an integer (typically 16 or 32; 2^{16}-1 = 65,535, 2^{32}-1 = 4,294,967,295). Usually, half these integers are used to represent negative numbers, so the effective range is -2^{n-1}..2^{n-1}-1 (-32,768..32,767 for 16 bits, -2,147,483,648..2,147,483,647 for 32 bits). To accomplish this, "two's complement" representation is typically used, so that a negative number k is represented by adding a "bias term" of 2^{n} to get k+2^{n}. For instance, if n=4 bits is used (a word size too small to be likely in practice, but useful for illustrative purposes) the numbers -8..7 may be represented, by adding a bias term of 2^{4}=16 so that the negative numbers -8..-1 are represented as 8..15 (see below).
Four bits:

 0  0000      4  0100      8  1000     12  1100
 1  0001      5  0101      9  1001     13  1101
 2  0010      6  0110     10  1010     14  1110
 3  0011      7  0111     11  1011     15  1111
Two's complement:

 0  0000      4  0100     -8  1000     -4  1100
 1  0001      5  0101     -7  1001     -3  1101
 2  0010      6  0110     -6  1010     -2  1110
 3  0011      7  0111     -5  1011     -1  1111
Overflow
Overflow occurs when an arithmetic calculation results in an integer too large for the word size. For instance, with n=4 bits, the result of adding 6+7 is 13, which exceeds the maximum positive integer (7). The result of this calculation is 1101, which is interpreted as -3. In a more realistic example, adding 20000+20000 produces a result too large for 16-bit integers (the result is interpreted as -25536).
Floating-Point Representation
The exact way floating-point numbers are represented varies between computing platforms, although the same basic ideas apply in general.
The general representation scheme used for floating-point numbers is based on the notion of scientific notation (e.g., the number 257.25 may be expressed as .25725 x 10^{3}; this number is said to have a mantissa of .25725 and exponent 3). Using this idea, floating-point numbers are represented in binary in the form
s eee....e mmm....m
The first bit (signified as 's') is a "sign bit": 0 for positive numbers, 1 for negative numbers. The next group of bits ('eee....e') is used to represent a positive or negative integer exponent. The number of bits used to represent the exponent is not standard, although it must be large enough to allow a reasonable range of values. Because the number is stored in binary form, its exponent uses a base of 2, not 10.
The remaining bits ('mmm....m') represent the mantissa. It is in the mantissa that accuracy may be lost. Clearly, the more bits used to represent the mantissa, the greater the precision of the number.
Overflow and Underflow in Floating-Point Calculations
Because the mantissa and exponent fields hold integers, it is possible to experience overflow when performing calculations that produce results exceeding the field size of the component parts. Overflow of the mantissa results in a loss of accuracy, but the loss is in the least significant bits of the number. Overflow of the exponent means the number has become too large to be represented using the given representation for floating-point numbers.
A calculation resulting in a number so small that the negative number needed for the exponent is beyond the range of the exponent field is a related condition, often called underflow. This means the number has become too close to zero to be represented using the given representation for floating-point numbers.
Example 1: Floating-Point Representation of Whole Numbers
A typical 64-bit binary representation of the floating-point numbers 1.0, 2.0, …, 16.0 is shown below:
1.0 : 0 01111111111 0000 … 0
2.0 : 0 10000000000 0000 … 0
3.0 : 0 10000000000 1000 … 0
4.0 : 0 10000000001 0000 … 0
5.0 : 0 10000000001 0100 … 0
6.0 : 0 10000000001 1000 … 0
7.0 : 0 10000000001 1100 … 0
8.0 : 0 10000000010 0000 … 0
9.0 : 0 10000000010 0010 … 0
10.0 : 0 10000000010 0100 … 0
11.0 : 0 10000000010 0110 … 0
12.0 : 0 10000000010 1000 … 0
13.0 : 0 10000000010 1010 … 0
14.0 : 0 10000000010 1100 … 0
15.0 : 0 10000000010 1110 … 0
16.0 : 0 10000000011 0000 … 0
The first bit is the sign bit.
The next 11 bits represent the exponent plus a bias term of (2^{10}-1), meaning:
the number 0 in the exponent field is interpreted as -(2^{10}-1);
the number (2^{11}-1) in the exponent field is interpreted as (2^{11}-1)-(2^{10}-1) = 2^{11}-2^{10} = 2^{10};
the number (2^{10}-1) in the exponent field is interpreted as 0; and in general,
the number (2^{10}+k) in the exponent field is interpreted as k+1.
The mantissa field stores a binary fraction f (0 <= f < 1), so the mantissa represents the value 1+f. All floating-point numbers can be normalized into the form (1+f)*2^{e}, where e = E-(2^{10}-1) and E is the integer stored in the exponent field.
Thus,
1.0 = (1+0) * 2^{0},
2.0 = (1+0) * 2^{1},
3.0 = (1+0.5) * 2^{1},
4.0 = (1+0) * 2^{2},
5.0 = (1+.25) * 2^{2},
6.0 = (1+.5) * 2^{2},
7.0 = (1+.75) * 2^{2},
8.0 = (1+0) * 2^{3},
9.0 = (1+.125)* 2^{3},
10.0 = (1+.25) * 2^{3},
11.0 = (1+.375)* 2^{3},
12.0 = (1+.5) * 2^{3},
13.0 = (1+.625)* 2^{3},
14.0 = (1+.75) * 2^{3},
15.0 = (1+.875)* 2^{3},
16.0 = (1+0) * 2^{4},
etc.
To convert these numbers to the representation form described earlier, add 2^{10}-1 to the exponent e (thus, e = 3 is represented as 2^{10}-1+3 = 2^{10}+2 = 2^{10}+2^{1} = 10000000010) and use the fractional portion f for the mantissa (so 1.125 yields the mantissa bits 0010 0…0). Then
9.0 is 1.125*2^{3} = 0 10000000010 0010 … 0.
Notice -9.0 is -1.125*2^{3} = 1 10000000010 0010 … 0; only the sign bit differs.
Example 2: Floating-Point Representation of Numbers with Fractional Parts
The number 15/128 = .1171875 is
(1+.875)*2^{-4}
The exponent field is 2^{10}-1-4 = 2^{10}-5 = 2^{10}-2^{2}-2^{0}
= 2^{9} + 2^{8} + 2^{7} + 2^{6} + 2^{5} + 2^{4} + 2^{3} + 2^{1} + 2^{0}
= 01111111011 in binary.
The mantissa is 1110 in binary.
.1171875 = 0 01111111011 1110 … 0
Notice then that the mantissa for 15/128 is the same as for 15.0; only the exponent differs. This is an important observation, as it shows why, in general, the mantissas for the integers 0..2^{k}-1 are the same as those for the fractions 0/2^{k} .. (2^{k}-1)/2^{k}. The mantissas in this set are expressible in k bits, and so may be represented exactly if the mantissa field uses at least k bits. This also shows why only numbers of the form 0/2^{k} .. (2^{k}-1)/2^{k} may be expressed exactly with k bits, which is of particular interest when k is the total number of bits in the mantissa.
Arithmetic Operations
Arithmetic operations can result in loss of accuracy.
A seemingly innocuous operation like addition can greatly increase the amount of precision needed to represent the resulting number. The reason for this is that two numbers represented in scientific notation cannot be added unless they have the same exponent. Consider the following illustration of the computation 192 + 3 = 195 :
The binary representation of 192 is 1.5*2^{7} = 0 10000110 100 … 0
The binary representation of 3 is 1.5*2^{1} = 0 10000000 100 … 0
Note that both these numbers have very simple mantissas requiring only 1 bit of precision. Their sum, however, requires 7 bits to represent the mantissa:
The binary representation of 195 is 1.5234375*2^{7} = 0 10000110 1000011 … 0
This occurs because the number 3 is shifted from 1.5*2^{1} to 0.0234375*2^{7} first, so that the addition can be accomplished:
3 : 0.0234375*2^{7} = 0 10000110 0000011 … 0
192 : 1.5 *2^{7} = 0 10000110 1000000 … 0
195 : 1.5234375*2^{7} = 0 10000110 1000011 … 0
In general, the greater the discrepancy in size between the exponents of two numbers, the greater the precision needed to represent their sum. Should the discrepancy be too large, the smaller number will be lost entirely and the calculated sum will simply equal the larger number.
Subtraction is essentially the same as addition.
Multiplication of two numbers in scientific notation is accomplished by multiplying their mantissas and adding their exponents. That is, if x=(1+f)*2^{m} and y=(1+g)*2^{n} then xy=(1+f+g+fg)*2^{m+n}. Overflow or underflow of the exponent is possible, but the most likely reason for loss of accuracy is overflow of the mantissa.
Division is accomplished by dividing the mantissas and subtracting the exponents. Dividing mantissas can create an infinite-length repeating mantissa in the result; for instance, dividing 1 by 10, as shown below.
Other operations (particularly trigonometric functions) generally produce irrational numbers, whose mantissas are therefore stored inexactly.
In summary, the number of bits of precision required tends to increase when two numbers are combined in an operation, and some bits will be lost when the required precision exceeds the number of bits in the mantissa. Care should be exercised whenever performing an operation.
Inexact Numbers
Some numbers cannot be represented exactly. One reason for this is that some rational numbers (those that can be expressed as the ratio of two integers) may have repeating, infinite-length mantissas when expressed in binary. For instance, the number 1/3 = .333333… has such a repeating expansion in base 10 (the digit 3 repeats infinitely). Expressed in binary, 1/3 = .010101…, with the block 01 repeating. Thus, in either base, representing 1/3 in scientific notation with a fixed number of digits results in some loss of accuracy. The number 1/10 = .000110011… in binary (the block 0011 repeats) and 1/20 = .0000110011…, whereas in decimal 1/10 = .1 and 1/20 = .05 exactly.
In general, suppose m bits are used to represent a mantissa and e bits are used to represent the exponent in binary. Floating-point numbers that can be expressed with mantissas k/2^{m} (-2^{m} <= k < 2^{m}) and exponents in the range -2^{e-1} .. 2^{e-1} may be represented exactly in this system, whereas others are subject to rounding error.
Computing systems may use various methods of accounting for lost bits, in particular "truncation" or "rounding". For instance, with rounding in a 32-bit float, the lost bits in the representation of 1/10 are rounded up, but the lost bits in the representation of 7/10 are rounded down. Either case results in a loss of accuracy. See below:
0.1 : 0 01111011 10011001100110011001101
0.7 : 0 01111110 01100110011001100110011
The 32^{nd} bit in the representation of 0.1 should be 0, but the bits that follow and are lost are 1100, so it gets rounded up. The 32^{nd} bit in the representation of 0.7 is properly 1, but the bits that follow and are lost are 0011, so it gets rounded down.
Numbers that cannot be represented as the ratio of two integers are irrational. This includes numbers such as √2 and π. These numbers cannot be written as repeating decimal (or binary) numbers; if they could be, they could be represented as the ratio of two integers. Irrational numbers cannot be represented exactly on a digital computer using the floating-point representations discussed earlier, and therefore are stored inexactly.
Usage of an inexact number x in a calculation means that formulas involving x actually use x+e for some small number e. This number e may be insignificant, but may also become critical, depending on the calculation. For instance, repeatedly adding a step size of 0.7 to a number will result in an accumulated loss of e each time an addition is performed. Doing this 1000 times results in a loss of 1000e.
Observe the following table that shows the effects of repeated addition of the value 0.7. The comparison is between 0.7 added i times and i*0.7 (recognizing that the calculation of i*0.7 is not exact, either). Results are reported for powers of 2 and 10 between 1 and 10000.
    i      sum         i*d         diff
    1      0.69999999  0.69999999  0
    2      1.4         1.4         0
    4      2.8         2.8         0
    8      5.5999994   5.5999999   4.7683716e-07
    10     6.999999    7           8.3446503e-07
    16     11.199998   11.2        1.9073486e-06
    32     22.400003   22.4        3.8146973e-06
    64     44.800026   44.799999   2.6702881e-05
    100    70.000023   70          2.4080276e-05
    128    89.599937   89.599998   6.1035156e-05
    256    179.19955   179.2       0.00044250488
    512    358.401     358.39999   0.0010070801
    1000   700.00696   700         0.0069699287
    1024   716.80725   716.79999   0.0072631836
    2048   1433.584    1433.6      0.015991211
    4096   2867.084    2867.2      0.1159668
    8192   5734.6553   5734.3999   0.25537109
    10000  7000.6084   7000        0.60851765
Clearly, results of calculations that propagate accumulated error can become too inaccurate.
Coping with Rounding Error
There are various approaches to dealing with rounding error. The particular choice depends on the severity of the problem and on how critical the accuracy of the answers must be:
Using larger floating-point words
Certainly the simplest approach is to increase the word size used for floating-point variables. In a C/C++ program, for instance, changing variable declarations from float to double requires no other modifications to the program. The loss in accuracy from inexact numbers is reduced considerably. For instance, using a 32-bit float, 0.1 is represented as
13421773/134217728,
for an error on the order of 10^{-9}. Using a 64-bit double, 0.1 is represented as
3602879701896397/36028797018963968,
for an error on the order of 10^{-17}.
Note that because larger words use more storage space, total storage can become scarce more quickly when using large arrays of floating-point numbers. It may be necessary to use the larger words for only those floating-point numbers used in key calculations.
Whether using rational or irrational numbers, it may be necessary to have higher numerical precision than is afforded by builtin data types. This can be achieved by defining one’s own data types with mantissas of arbitrary length, using array structures to represent integers, for instance. Of course, it is also necessary to define the arithmetic operations that operate on any such defined type.
Using Exact Numbers
As pointed out previously, the only floating-point numbers that can be represented exactly have the form 0/2^{k} .. (2^{k}-1)/2^{k} (with a positive or negative sign bit). Use of any other number results in one of these numbers, plus an error term e. The error term is 0 if the number is exact. Doing arithmetic wherever possible with exact numbers should result in fewer propagated errors.
Representing numbers as rational numbers with separate integer numerators and denominators can also increase precision. For example, rather than store the number d=7/10 as a floating-point number with lower-order bits truncated as shown earlier, d can be stored as a record with separate numerator and denominator fields, (d.num=7, d.denom=10). Then, repeated addition of d to a sum variable (also represented as a rational) produces (sum.num=14, sum.denom=10), (sum.num=21, sum.denom=10), etc. Likewise, the arithmetic operations of addition, subtraction, multiplication, and division of two rational numbers represented in this way continue to produce rationals with separate integer numerators and denominators. The potential for overflow is a persistent threat, however: at some point, the integer fields exceed their range and precision is lost.
Representation of irrational numbers is more problematic. However, it may be possible to use such numbers in symbolic form; for instance, rather than calculate √2 directly and incur rounding error, it may be possible to factor it out and perform calculations with it in symbolic form, to yield a final result of the form a + b√2.
Revising Calculations to Reduce Effects of Rounding
In some cases, adverse effects of rounding can be reduced by revising the order of calculations. For instance, suppose a, b, c, and d are stored exactly, and we need to compute Y = a/d + b/d + c/d. Computing this as written may introduce rounding error in each of the computations a/d, b/d, and c/d. However, computing the exact sum (a + b + c) and then dividing by d has less potential for error, since it employs 1 division operation, not 3.
In other cases, the algorithms used to calculate values may be ineffective due to rounding errors, and alternative algorithms may need to be developed. For instance, a linear system of equations of the form Ax=b for a square matrix A may be solved by determining A^{-1} (if it exists) and then computing x=A^{-1}b. Yet the computed value of A^{-1} may be highly inaccurate due to compounded error in computing the individual terms, and further degradation may occur in computing A^{-1}b. Where applicable, a more accurate approach is often to solve the equivalent system x = b-(A-I)x iteratively, starting with an initial guess x^{0}, then computing x^{k+1} = b-(A-I)x^{k} repeatedly until the values converge to a solution with the desired precision. The reason this approach works is that the initial guess is assumed to contain error, that is, x^{0}=x+e. Subsequent iterations are designed to reduce e: the rounding error is treated as part of the total error, and so iteration terminates when the total error has been reduced to a tolerable level.
Error Estimation and Analysis
Once a number is known to contain an inaccuracy, an error term e has been introduced. Each calculation that follows propagates the error. Examination of the algorithm in question can yield an estimate of actual error and/or bounds on total error. If the maximum total error has an upper bound within a tolerable range, the algorithm can be used with confidence. If not, the algorithm must be revised to reduce the scope of rounding error.
Summary
Rounding error is a natural consequence of the representation scheme used for integers and floating-point numbers in digital computers. Rounding can produce highly inaccurate results as errors get propagated through repeated operations using inaccurate numbers. Proper handling of rounding error may involve a combination of approaches, such as use of high-precision data types and revised calculations and algorithms. Mathematical analysis can be used to estimate the actual error in calculations.