Rounding Error
Jeffrey L. Popyack, June 2000

Rounding (roundoff) error is a phenomenon of digital computing resulting from the computer's inability to represent some numbers exactly. Specifically, a computer is able to represent exactly only integers in a certain range, depending on the word size used for integers. Certain floating-point numbers may also be represented exactly, depending on the representation scheme in use on the computer in question and the word size used for floating-point numbers. Certain floating-point numbers cannot be represented exactly, regardless of the word size used.

Errors due to rounding have long been the bane of analysts trying to solve equations and systems. Such errors may be introduced in many ways, for instance:

Integer Representation

A digital computer can represent exactly integers in the range 0..2n -1, where n is the number of bits used to represent an integer (typically 16 or 32; 216-1=65535, 232-1=4,294,967,295). Usually, half these integers are used to represent negative numbers, so the effective range is -2n-1..2n-1 (-32,768..32,767 for 16 bits, -2,147,483,648.. 2,147,483,647 for 32 bits). To accomplish this, "two's complement" representation is typically used so that a negative number k is represented by adding a "bias term" of 2n to get k+2n. For instance, if n=4 bits is used (a number too small to be likely, but useful for illustrative purposes) the numbers -8..7 may be represented, by adding a bias term of 24=16 so that the negative numbers -8..-1 are represented as 8..15 (see below).

Four bits:

0

0000

4

0100

8

1000

12

1100

1

0001

5

0101

9

1001

13

1101

2

0010

6

0110

10

1000

14

1110

3

0011

7

0111

11

1011

15

1111

Two's complement:

0

0000

4

0100

-8

1000

-4

1100

1

0001

5

0101

-7

1001

-3

1101

2

0010

6

0110

-6

1000

-2

1110

3

0011

7

0111

-5

1011

-1

1111

Overflow

Overflow occurs when an arithmetic calculation results in an integer too large for the word size. For instance, with n=4 bits, the result of adding 6+7 is 13, which exceeds the maximum positive integer (7). The result of this calculation is 1101, which is interpreted as -3. In a more likely example, the result of adding 20000+20000 results in an integer too large for 16-bit integers (with the result interpreted as -25536).

Floating-Point Representation

The exact way floating-point numbers are represented varies between computing platforms, although the same basic ideas apply in general.

The general representation scheme used for floating-point numbers is based on the notion of 'scientific notation form' (e.g., the number 257.25 may be expressed as .25725 x 103. This number is said to have a mantissa of .25725 and exponent 3). Using this idea, floating-point numbers are represented in binary in the form

s eee....e mmm....m

The first bit (signified as 's') is a "sign bit": 0 for positive numbers, 1 for negative numbers. The next group of bits ('eee....e') is used to represent a positive or negative integer exponent. The number of bits used to represent the exponent is not standard, although it must be large enough to allow a reasonable range of values. Because the number is stored in binary form, its exponent uses a base of 2, not 10.

The remaining bits ('mmm....m') represent the mantissa. It is in the mantissa that accuracy may be lost. Clearly, the more bits used to represent the mantissa, the greater the precision of the number.

Overflow and Underflow in Floating-Point Calculations

Because the mantissa and exponents are integers, it is possible to experience overflow when performing calculations that produce results exceeding the field size of the component parts. Overflow of the mantissa results in a loss of accuracy but the loss is in the least significant bits of the number. Overflow of the exponent means the number has become too large to be represented using the given representation for floating-point numbers.

A calculation resulting in a number so small that the negative number used for the exponent is beyond the number of bits used for exponents is another type of overflow, often called underflow. This means the number has become too large to be represented using the given representation for floating-point numbers.

Example 1: Floating-Point Representation of Whole Numbers

A typical 64-bit binary representation of the floating-point numbers 1.0, 2.0, …, 16.0 is shown below:

 1.0 : 0 01111111111 0000 … 0
 2.0 : 0 10000000000 0000 … 0
 3.0 : 0 10000000000 1000 … 0
 4.0 : 0 10000000001 0000 … 0
 5.0 : 0 10000000001 0100 … 0
 6.0 : 0 10000000001 1000 … 0
 7.0 : 0 10000000001 1100 … 0
 8.0 : 0 10000000010 0000 … 0
 9.0 : 0 10000000010 0010 … 0
10.0 : 0 10000000010 0100 … 0
11.0 : 0 10000000010 0110 … 0
12.0 : 0 10000000010 1000 … 0
13.0 : 0 10000000010 1010 … 0
14.0 : 0 10000000010 1100 … 0
15.0 : 0 10000000010 1110 … 0
16.0 : 0 10000000011 0000 … 0

The first bit is the sign bit.

The next 11 bits represent the exponent, plus a bias term of (210-1), meaning:

the number 0 in the exponent field is interpreted as -(210-1);

the number (211-1) in the exponent field is interpreted as (211-1)-(210-1) = 211 - 210 = 210;

the number (210-1) in the exponent field is interpreted as 0; and in general,

the number (210+k) in the exponent field is interpreted as k+1.

The mantissa field stores a binary fraction f (0<=f<1), so the mantissa represents the value 1+f. All floating-point numbers can be normalized into the form (1+f)*2e (where e = E-(210-1)).

Thus,

 1.0 = (1+0) * 20,
 2.0 = (1+0) * 21,
 3.0 = (1+0.5) * 21,
 4.0 = (1+0) * 22,
 5.0 = (1+.25) * 22,
 6.0 = (1+.5) * 22,
 7.0 = (1+.75) * 22,
 8.0 = (1+0) * 23,
 9.0 = (1+.125)* 23,
10.0 = (1+.25) * 23,
11.0 = (1+.375)* 23,
12.0 = (1+.5) * 23,
13.0 = (1+.625)* 23,
14.0 = (1+.75) * 23,
15.0 = (1+.875)* 23,
16.0 = (1+0) * 24,
etc.

To convert these numbers to the representation form described earlier, add 210-1 to the exponent e (thus, 3 is represented as 210-1 + 3 = 210+2 = 210 + 21 = 10000000010) and use the fractional portion f for the mantissa (so 1.125 is represented as 001000...0) and then

9.0 is 1.125*23 = 0 10000000010 0010 … 0.

Notice -9.0 is -1.125*23 = 1 10000000010 0010 … 0.

Example 2: Floating-Point Representation of Numbers with Fractional Parts

The number 15/128 = .1171875 is

(1+.875)*2-4

The exponent is 210-1 - 4 = 210-5 = 210-22 - 20

= 29 + 28 + 27 + 26 + 25 + 24 + 23 + 21 + 20

= 01111111011 in binary.

The mantissa is 1110 in binary.

.1171875 = 0 01111111011 1110 … 0

Notice then that the mantissa for 15/128 is the same as for 15.0 - only the exponent differs. This is an important observation, as it shows why in general, the mantissas for the integers 0..2k-1 are the same as those for the fractions 0/2k .. (2k-1)/2k. The mantissas in this set are expressible in k bits - and so may be represented exactly if the word size uses at least k bits. This also shows why only numbers of the form 0/2k .. (2k-1)/2k may be expressed exactly with k bits, which is of particular interest when k is the total number of bits in the mantissa.

Arithmetic Operations

Arithmetic operations can result in loss of accuracy.

A seemingly innocuous operation like addition can greatly increase the amount of precision needed to represent the resulting number. The reason for this is that two numbers represented in scientific notation cannot be added unless they have the same exponent. Consider the following illustration of the computation 192 + 3 = 195 :

The binary representation of 192 is 1.5*27 = 0 10000110 100 … 0

The binary representation of 3 is 1.5*21 = 0 10000000 100 … 0

Note that both these numbers have very simple mantissas requiring only 1 bit of precision. Their sum, however requires 7 bits to represent the mantissa:

The binary representation of 195 is 1.5234375*27 = 0 10000110 1000011 … 0

This occurs because the number 3 is shifted from 1.5*21 to 0.0234375*27 first, so that the addition can be accomplished:

3 : 0.0234375*27 = 0 10000110 0000011 … 0
192 : 1.5 *27 = 0 10000110 1000000 … 0
195 : 1.5234375*27 = 0 10000110 1000011 … 0

In general, the greater the discrepancy in size between the exponents of two numbers, the greater the precision needed to represent their sum. Should the discrepancy be too large, the smaller number will be lost entirely and the calculated sum will simply equal the larger number.

Subtraction is essentially the same as addition.

Multiplication of two numbers in scientific notation is accomplished by multiplying their mantissas and adding their exponents. That is, if x=(1+f)*2m and y=(1+g)*2n then xy=(1+f+g+fg)*2m+n. Overflow or underflow of the exponent is possible, but the most likely reason for loss of accuracy is overflow of the mantissa.

Division is accomplished by dividing their mantissas and subtracting their exponents. Dividing mantissas can create an infinite length repeating mantissa in the result, for instance, dividing 1 by 10 as shown below.

Other operations(particularly trigonometric functions) generally produce irrational numbers whose mantissas are therefore stored inexactly.

In summary, the number of bits of precision increases when two numbers are used in an operation and some bits will be lost when the precision exceeds the number of bits in the mantissa. Care should be exercised whenever performing an operation.

Inexact Numbers

Some numbers cannot be represented exactly. One reason for this is that some rational numbers - those that can be expressed as the ratio of two integers - may have repeating, infinite length mantissas when expressed in binary. For instance, the number 1/3 = .333333… has such a mantissa in base 10. (We denote this .3 to show that the digit 3 repeats infinitely.) Expressed in binary, 1/3 = .010101… = .01 . Thus, in either system, representation of 1/3 in scientific notation with a fixed number of bits results in some loss of accuracy. The number 1/10 = .0011 in binary and 1/20 = .00011, whereas 1/10 = .1, 1/20 = .05 in decimal.

In general, suppose m bits are used to represent a mantissa and e bits are used to represent the exponent in binary. Floating-point numbers that can be expressed with mantissas k/2m (-2m <= k < 2m) and exponents in the range -2e .. 2e may be represented exactly in this system, whereas others are subject to rounding error.

Computing systems may use various methods for accounting for lost bits - in particular "truncation" or "rounding". For instance, with rounding, the lost bits in the representation of 1/10 are rounded up, but the lost bits in the representation of 7/10 are rounded down. Either case results in a loss of accuracy. See below:

0.1 : 0 01111011100 11001100110011001101
0.7 : 0 01111110011 00110011001100110011

The 32nd bit in the representation of 0.1 should be 0, but the bits that follow and are lost are 1100 and get rounded up. The 32nd bit in the representation of 0.7 is properly 1, but the bits that follow and are lost are 0011 and get rounded down.

Numbers that cannot be represented as the ratio of two integers are irrational. This includes numbers such as and p . These numbers cannot be written as repeating decimal (or binary) numbers - if so, they could be represented as the ratio of two integers. Irrational numbers cannot be represented exactly on a digital computer using the floating-point representations discussed earlier, and therefore are stored inexactly.

Usage of an inexact number x in a calculation means that formulas involving x actually are using x+e for some small number e . This number e may be insignificant but may also become very critical depending on the calculation. For instance, repeatedly adding a step size 0.7 to a number will result in accumulated loss of e each time an addition is performed. Doing this 1000 times results in a loss of 1000e .

Observe the following table that shows the effects of repeated addition of the value 0.7. The comparison is between 0.7 added i times and i*0.7 (recognizing that the calculation of i*0.7 is not exact, either). Results are reported for powers of 2 and 10 between 1 and 10000.

i

sum

i*d

diff

1

0.69999999

0.69999999

0

2

1.4

1.4

0

4

2.8

2.8

0

8

5.5999994

5.5999999

4.7683716e-07

10

6.999999

7

8.3446503e-07

16

11.199998

11.2

1.9073486e-06

32

22.400003

22.4

-3.8146973e-06

64

44.800026

44.799999

-2.6702881e-05

100

70.000023

70

-2.4080276e-05

128

89.599937

89.599998

6.1035156e-05

256

179.19955

179.2

0.00044250488

512

358.401

358.39999

-0.0010070801

1000

700.00696

700

-0.0069699287

1024

716.80725

716.79999

-0.0072631836

2048

1433.584

1433.6

0.015991211

4096

2867.084

2867.2

0.1159668

8192

5734.6553

5734.3999

-0.25537109

10000

7000.6084

7000

-0.60851765

Clearly, results of calculations that propagate accumulated error can become too inaccurate.

Coping with Rounding Error

There are various approaches to dealing with rounding error. The particular choices depend on the severity of the problem and how critical the accuracy of the answers:

Using larger floating-point words

Certainly the simplest approach is to increase the word size used for floating-point variables. In a C/C++ program for instance, changing variable declarations from float to double requires no other modifications to the program. The loss in accuracy from inexact numbers is reduced considerably. For instance, using a 32-bit float, 0.1 is represented as

13421773/134217727,

for an error on the order of 10-9. Using a 64-bit double, 0.1 is represented as

57646075230342400/576460752303423000,

for an error on the order of 10-15.

Note that because larger words use more storage space, total storage can become scarce more quickly when using large arrays of floating-point numbers. It may be necessary to use the larger words for only some floating-point numbers used in key calculations.

Whether using rational or irrational numbers, it may be necessary to have higher numerical precision than is afforded by built-in data types. This can be achieved by defining one’s own data types with mantissas of arbitrary length, using array structures to represent integers, for instance. Of course, it is also necessary to define the arithmetic operations that operate on any such defined type.

Using Exact Numbers

As pointed out previously, the only floating-point numbers that can be represented exactly have the form 0/2k .. (2k-1)/2k (with a positive or negative sign bit). Use of any other numbers results in one of these numbers, plus error term e . The error term is 0 if the number is exact. Doing arithmetic wherever possible with exact numbers should result in fewer propagated errors.

Representing numbers as rational numbers with separate integer numerators and denominators can also increase precision. For example, rather than store the number d=7/10 as a floating-point number with lower order bits truncated as shown earlier, d can be stored as a record with separate numerator and denominator fields, (d.num=7, d.denom=10). Then, repeated addition of d to a sum variable (also represented as a rational) produces (sum.num=14, sum.denom=10); (sum.num=21, sum.denom=10), etc. Likewise, arithmetic operations of addition, subtraction, multiplication, or division of two rational numbers represented in this way continue to produce rationals with separate integer numerators and denominators. The potential of overflow is a persistent threat, however: at some point, precision is lost.

Representation of irrational numbers is more problematic. However, it may be possible to use such numbers in symbolic form - for instance, rather than calculate directly and incur rounding error, it may be possible to factor it out and perform calculations with it in symbolic form, to yield a final result of the form a + b.

Revising Calculations to Reduce Effects of Rounding

In some cases, adverse effects of rounding can be reduced by revising the order of calculations. For instance, suppose a, b, c, and d are stored exactly, and we need to compute Y= a/d + b/d + c/d. Computing this as written may introduce rounding error in each of the computations a/d, b/d, c/d. However, computation of the exact sum (a + b + c) then dividing by d has less potential for error since it employs 1 division operation, not 3.

In other cases, the algorithms used to calculate values may be ineffective due to rounding errors, and alternative algorithms may need to be developed. For instance, solution of a linear system of equations of the form Ax=b for a square matrix A may be solved by determining A-1 (if it exists) then computing x=A-1b. Yet the computed value of A-1 may be highly inaccurate due to compounded error in computing the individual terms, and further degradation may occur in computing A-1b. Where applicable, a more accurate approach is often to solve the equivalent system x = b-(A-I)x iteratively, starting with an initial guess x0, then computing xk+1 = b-(A-I)xk repeatedly until the values converge to a solution with desired precision. The reason this approach works is that the initial guess is assumed to contain error, that is, x0=x+e . Subsequent iterations are designed to reduce e : the rounding error is treated as part of the total error, and so iteration terminates when total error has been reduced to a tolerable level.

Error Estimation and Analysis

Once a number is known to contain an inaccuracy, an error term e has been introduced. Each calculation that follows propagates the error. Examination of the algorithm in question can yield an estimate of actual error and/or bounds on total error. If the maximum total error has an upper bound within a tolerable range, the algorithm can be used with confidence. If not, the algorithm must be revised to reduce the scope of rounding error.

Summary

Rounding error is a natural consequence of the representation scheme used for integers and floating-point numbers in digital computers. Rounding can produce highly inaccurate results as errors get propagated through repeated operations using inaccurate numbers. Proper handling of rounding error may involve a combination of approaches such as use of high-precision data types and revised calculations and algorithms. Mathematical analysis can be used to estimate the actual error in calculations.