
Representing and Manipulating Information (Part III)

2.4 Floating Point

A floating-point representation encodes rational numbers of the form $V = x \times 2^y$.

It is useful for performing computations involving very large numbers, numbers very close to 0, and more generally as an approximation to real arithmetic.

Up until the 1980s, every computer manufacturer devised its own conventions
for how floating-point numbers were represented and the details of the operations
performed on them.

All of this changed around 1985 with the advent of IEEE Standard 754.

2.4.1 Fractional Binary Numbers

Like decimal numbers, binary numbers can have a binary point.

binary point

The value of a fractional binary number $b_m b_{m-1} \ldots b_1 b_0 . b_{-1} b_{-2} \ldots b_{-n}$ is
$$
b = \sum_{i=-n}^m (b_i \times 2^i)
$$

  • shifting the binary point one position to the left has the effect of dividing the number by 2.
  • shifting the binary point one position to the right has the effect of multiplying the number by 2.

Note that numbers of the form $(0.11 . . . 1)_2$ represent numbers just below 1. For
example, $(0.111111)_2$ represents $\frac{63}{64}$. We will use the shorthand notation $1.0 − \epsilon$ to
represent such values.

Assuming we consider only finite-length encodings, decimal notation cannot
represent numbers such as $\frac{1}{3}$ and $\frac{5}{7}$ exactly. Similarly, fractional binary notation
can only represent numbers that can be written $x \times 2^y$. Other values can only be
approximated.

Practice Problem 2.45

Fill in the missing information in the following table:

| Fractional value | Binary representation | Decimal representation |
| --- | --- | --- |
| $\frac{1}{8}$ | 0.001 | 0.125 |
| $\frac{3}{4}$ | | |
| $\frac{5}{16}$ | | |
| | 10.1011 | |
| | 1.001 | |
| | | 5.875 |
| | | 3.1875 |

My solution: :white_check_mark:

| Fractional value | Binary representation | Decimal representation |
| --- | --- | --- |
| $\frac{1}{8}$ | 0.001 | 0.125 |
| $\frac{3}{4}$ | 0.11 | 0.75 |
| $\frac{5}{16}$ | 0.0101 | 0.3125 |
| $2\frac{11}{16}$ | 10.1011 | 2.6875 |
| $1\frac{1}{8}$ | 1.001 | 1.125 |
| $5\frac{7}{8}$ | 101.111 | 5.875 |
| $3\frac{3}{16}$ | 11.0011 | 3.1875 |

Practice Problem 2.46

The imprecision of floating-point arithmetic can have disastrous effects.

On February 25, 1991, during the first Gulf War, an American Patriot Missile battery in Dharan, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soldiers. The US General Accounting Office (GAO) conducted a detailed analysis of the failure and determined that the underlying cause was an imprecision in a numeric calculation. In this exercise, you will reproduce part of the GAO’s analysis.

The Patriot system contains an internal clock, implemented as a counter
that is incremented every 0.1 seconds. To determine the time in seconds, the
program would multiply the value of this counter by a 24-bit quantity that was
a fractional binary approximation to $\frac{1}{10}$. In particular, the binary representation of $\frac{1}{10}$ is the nonterminating sequence $(0.000110011[0011] \ldots )_2$, where the portion in brackets is repeated indefinitely. The program approximated 0.1 as a value x by considering just the first 23 bits of the sequence to the right of the binary point: x = 0.00011001100110011001100

A. What is the binary representation of 0.1 − x?

B. What is the approximate decimal value of 0.1 − x?

C. The clock starts at 0 when the system is first powered up and keeps counting up from there. In this case, the system had been running for around 100 hours. What was the difference between the actual time and the time computed by the software?

D. The system predicts where an incoming missile will appear based on its
velocity and the time of the last radar detection. Given that a Scud travels
at around 2,000 meters per second, how far off was its prediction?

Normally, a slight error in the absolute time reported by a clock reading would
not affect a tracking computation. Instead, it should depend on the relative time between two successive readings. The problem was that the Patriot software had been upgraded to use a more accurate function for reading time, but not all of the function calls had been replaced by the new code. As a result, the tracking software used the accurate time for one reading and the inaccurate time for the other.

My solution: :x:

A ✓
$$
0.1 - x = (0.000110011[0011]\ldots)_2 - (0.00011001100110011001100)_2 = (0.00000000000000000000000\,11[0011]\ldots)_2
$$
B✗
$$
0.1 - x \approx 2^{-25}+2^{-26} = 0.000000044703483581542968750000
$$
C✗

100 hours = 360,000 seconds = 3,600,000 × 0.1 seconds
$$
3600000 \times (0.1 - x) \approx 0.1609325408935547
$$
$$
D✗
$$
0.1609325408935547 \times 2000 = 321.8650817871094 \approx 322
$$

Solution on the book:

B. Comparing this difference, $0.1 - (0.00011001100110011001100)_2$, to the binary representation of $\frac{1}{10}$, we can see that it is simply $2^{-20} \times \frac{1}{10}$, which is around $9.54 \times 10^{-8}$.

C. $9.54 \times 10^{-8} \times 100 \times 60 \times 60 \times 10 \approx 0.343$ seconds.

D. $0.343 \times 2{,}000 \approx 687$ meters.


2.4.2 IEEE Floating-Point Representation

Positional notation such as considered in the previous section would not be efficient for representing very large numbers. For example, the representation of
$5 × 2^{100}$ would consist of the bit pattern 101 followed by 100 zeros. Instead, we
would like to represent numbers in a form $x × 2^y$ by giving the values of x and y.

float and double

  • In the single-precision floating-point format (a float in C), fields s, exp, and frac are 1, k = 8, and n = 23 bits each, yielding a 32-bit representation.
  • In the double-precision floating-point format (a double in C), fields s, exp, and frac are 1, k = 11, and n = 52 bits each, yielding a 64-bit representation.

The value encoded by a given bit representation can be divided into three
different cases (the latter having two variants), depending on the value of exp.

cases

Case 1: Normalized Values

This is the most common case.

  • It occurs when the bit pattern of exp is neither all zeros nor all ones.
  • The exponent field is interpreted as representing a signed integer in biased form. That is, the exponent value is E = e − Bias, where e is the unsigned number having bit representation $e_{k−1} \ldots e_1e_0$ and Bias is a bias value equal to $2^{k−1} − 1$.
  • The fraction field frac is interpreted as representing the fractional value f. The significand is defined to be M = 1 + f.
  • We can always adjust the exponent E so that the significand M is in the range 1 ≤ M < 2 (assuming there is no overflow), so we do not need to represent the leading bit explicitly, since it always equals 1.
Case 2: Denormalized Values

When the exponent field is all zeros, the represented number is in denormalized
form.

  • In this case, the exponent value is E = 1 − Bias, and the significand value is
    M = f , that is, the value of the fraction field without an implied leading 1.
  • Denormalized numbers serve two purposes:
    • First, they provide a way to represent numeric value 0, since with a normalized number we must always have M ≥ 1, and hence we cannot represent 0.
      • 0000…0000 for +0.0
      • 1000…0000 for −0.0
      • With IEEE floating-point format, the values −0.0 and +0.0 are considered different in some ways and the same in others.
    • A second function of denormalized numbers is to represent numbers that are very close to 0.0.
Case 3: Special Values

A final category of values occurs when the exponent field is all ones.

  • s = 0 for $+\infty$
  • s = 1 for $-\infty$
  • Infinity can represent results that overflow, as when we multiply two very large numbers, or when we divide by zero.
  • When the fraction field is nonzero, the resulting value is called a NaN, short for “not a number.”
    • Such values are returned as the result of an operation where the result cannot be given as a real number or as infinity, as when computing $\sqrt{-1}$ or $\infty − \infty$
    • They can also be useful in some applications for representing uninitialized data.

2.4.3 Example Numbers

example

8 bits

Practice Problem 2.47

Consider a 5-bit floating-point representation based on the IEEE floating-point
format, with one sign bit, two exponent bits (k = 2), and two fraction bits (n = 2). The exponent bias is $2^{2−1} − 1 = 1$.
The table that follows enumerates the entire nonnegative range for this 5-bit
floating-point representation. Fill in the blank table entries using the following
directions:

e: The value represented by considering the exponent field to be an unsigned
integer

E: The value of the exponent after biasing

$2^E$: The numeric weight of the exponent

f : The value of the fraction

M: The value of the significand

$2^E × M$: The (unreduced) fractional value of the number

V : The reduced fractional value of the number

Decimal: The decimal representation of the number

Express the values of $2^E$, f, M, $2^E \times M$, and V either as integers (when possible) or as fractions of the form x/y, where y is a power of 2. You need not fill in entries marked X.

| Bits | e | E | $2^E$ | f | M | $2^E \times M$ | V | Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 00 00 | | | | | | | | |
| 0 00 01 | | | | | | | | |
| 0 00 10 | | | | | | | | |
| 0 00 11 | | | | | | | | |
| 0 01 00 | | | | | | | | |
| 0 01 01 | 1 | 0 | 1 | $\frac{1}{4}$ | $\frac{5}{4}$ | $\frac{5}{4}$ | $\frac{5}{4}$ | 1.25 |
| 0 01 10 | | | | | | | | |
| 0 01 11 | | | | | | | | |
| 0 10 00 | | | | | | | | |
| 0 10 01 | | | | | | | | |
| 0 10 10 | | | | | | | | |
| 0 10 11 | | | | | | | | |
| 0 11 00 | X | X | X | X | X | X | | X |
| 0 11 01 | X | X | X | X | X | X | | X |
| 0 11 10 | X | X | X | X | X | X | | X |
| 0 11 11 | X | X | X | X | X | X | | X |

My solution : :warning:

| Bits | e | E | $2^E$ | f | M | $2^E \times M$ | V | Decimal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 00 00 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0 00 01 | 0 | 0 | 1 | $\frac{1}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ | $\frac{1}{4}$ | 0.25 |
| 0 00 10 | 0 | 0 | 1 | $\frac{1}{2}$ | $\frac{1}{2}$ | $\frac{1}{2}$ | $\frac{1}{2}$ | 0.5 |
| 0 00 11 | 0 | 0 | 1 | $\frac{3}{4}$ | $\frac{3}{4}$ | $\frac{3}{4}$ | $\frac{3}{4}$ | 0.75 |
| 0 01 00 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 0 01 01 | 1 | 0 | 1 | $\frac{1}{4}$ | $\frac{5}{4}$ | $\frac{5}{4}$ | $\frac{5}{4}$ | 1.25 |
| 0 01 10 | 1 | 0 | 1 | $\frac{1}{2}$ | $\frac{3}{2}$ | $\frac{3}{2}$ | $\frac{3}{2}$ | 1.5 |
| 0 01 11 | 1 | 0 | 1 | $\frac{3}{4}$ | $\frac{7}{4}$ | $\frac{7}{4}$ | $\frac{7}{4}$ | 1.75 |
| 0 10 00 | 2 | 1 | 2 | 0 | 1 | 2 | 2 | 2 |
| 0 10 01 | 2 | 1 | 2 | $\frac{1}{4}$ | $\frac{5}{4}$ | $\frac{5}{2}$ | $\frac{5}{2}$ | 2.5 |
| 0 10 10 | 2 | 1 | 2 | $\frac{1}{2}$ | $\frac{3}{2}$ | 3 | 3 | 3 |
| 0 10 11 | 2 | 1 | 2 | $\frac{3}{4}$ | $\frac{7}{4}$ | $\frac{7}{2}$ | $\frac{7}{2}$ | 3.5 |
| 0 11 00 | X | X | X | X | X | X | $+\infty$ | X |
| 0 11 01 | X | X | X | X | X | X | NaN | X |
| 0 11 10 | X | X | X | X | X | X | NaN | X |
| 0 11 11 | X | X | X | X | X | X | NaN | X |

$2^E\times M$ shouldn’t be reduced.

Beware of exp==1111…111

Practice Problem 2.48

As mentioned in Problem 2.6, the integer 3,510,593 has hexadecimal representation 0x00359141, while the single-precision floating-point number 3,510,593.0 has hexadecimal representation 0x4A564504. Derive this floating-point representation and explain the correlation between the bits of the integer and floating-point representations.

My solution:

convert 3510593 to IEEE float point
$$
(3510593)_{10} = (1101011001000101000001)_2
$$
Normalize it:
$$
(1101011001000101000001)_2 = 1.101011001000101000001 \times 2 ^{21}
$$
We can now get the mantissa, $10101100100010100000100$, by padding to 23 bits with zeros.
$$
21 + (2^7-1)=(10010100)_2
$$
Now we have the exponent field: $10010100$.

Since the value is positive, the sign bit is 0.

So 3510593 in IEEE floating-point representation is $0\ 10010100\ 10101100100010100000100$, which is 0x4A564504 in hexadecimal.

The correlation is easiest to see by lining up the two bit patterns:

```
00000000001101011001000101000001   <- int 3510593
           *********************
01001010010101100100010100000100   <- float 3510593.0
```

The 21 marked bits (the integer's bits below its leading 1) reappear as the high-order bits of the float's fraction field, shifted two positions to the left and padded with zeros.

Practice Problem 2.49

A. For a floating-point format with an n-bit fraction, give a formula for the
smallest positive integer that cannot be represented exactly (because it
would require an (n + 1)-bit fraction to be exact).

Assume the exponent field size k is large enough that the range of representable exponents does not provide a limitation for this problem.

B. What is the numeric value of this integer for single-precision format (n =
23)?

My solution : :x:

A

Think about the smallest positive integer which can be represented, 1: we know the f part will be $000\ldots0$, giving $M = 1$.

So we can see that the largest positive integer which can be represented will have f part equal to $111\ldots1$, giving $f = \frac{2^n-1}{2^{n}}$, $M = \frac{2^{n+1}-1}{2^{n}}$.

Therefore the smallest positive integer which cannot be represented is $(2^{n+1}-1) + 1 = 2^{n+1}$.

B
$$
2^{24} = 16777216
$$
But this value can be represented by adjusting the exponent!

Solution on the book:

A. The number has binary representation 1, followed by n zeros, followed by 1,
giving value $2^{n+1} + 1$.
B. When n = 23, the value is $2^{24} + 1 = 16,777,217$.


2.4.4 Rounding

Floating-point arithmetic can only approximate real arithmetic, since the representation has limited range and precision. Thus, for a value x, we generally want a systematic method of finding the “closest” matching value x′ that can be represented in the desired floating-point format.

The IEEE floating-point format defines four different rounding modes. The default method finds a closest match (round-to-even), while the other three can be used for computing upper and lower bounds.

round modes

  • Round-to-even mode adopts the convention that it rounds the number either upward or downward such that the least significant digit of the result is even.
  • Round-toward-zero mode rounds positive numbers downward and negative numbers upward.
  • Round-down mode rounds both positive and negative numbers downward.
  • Round-up mode rounds both positive and negative numbers upward.
Round to even

Round-to-even at first seems like it has a rather arbitrary goal—why is there
any reason to prefer even numbers?

  • Imagine scenarios in which rounding a set of data values would introduce a statistical bias into the computation of an average of the values. Rounding to even will round upward about 50% of the time and downward about 50% of the time, keeping the bias to a minimum.
  • Round-to-even rounding can be applied even when we are not rounding to
    a whole number. For example, when rounding decimal numbers to the nearest
    hundredth, we would round both 1.2350000 and 1.2450000 to 1.24, since 4 is even.
  • Similarly, round-to-even rounding can be applied to binary fractional numbers. For example, we would round 10.11100 up to 11.00 and 10.10100 down to 10.10, since these values are halfway between two possible results, and we prefer to have the least significant bit equal to zero.
  • Note that only when the value to be rounded is exactly halfway between two possible results should we prefer the even one; otherwise, we round to the nearest representable value.

Practice Problem 2.50

Show how the following binary fractional values would be rounded to the nearest half (1 bit to the right of the binary point), according to the round-to-even rule.
In each case, show the numeric values, both before and after rounding.

A. 10.111
B. 11.010
C. 11.000
D. 10.110

My solution: :white_check_mark:

| Binary | Binary after rounding | Decimal before rounding | Decimal after rounding |
| --- | --- | --- | --- |
| 10.111 | 11.0 | 2.875 | 3.0 |
| 11.010 | 11.0 | 3.25 | 3.0 |
| 11.000 | 11.0 | 3.0 | 3.0 |
| 10.110 | 11.0 | 2.75 | 3.0 |

Practice Problem 2.51

We saw in Problem 2.46 that the Patriot missile software approximated 0.1 as x = 0.00011001100110011001100. Suppose instead that they had used IEEE round-to-even mode to determine an approximation x′ to 0.1 with 23 bits to the right of the binary point.

A. What is the binary representation of x′?

B. What is the approximate decimal value of x′ − 0.1?

C. How far off would the computed clock have been after 100 hours of opera-
tion?

D. How far off would the program’s prediction of the position of the Scud
missile have been?

My solution: :white_check_mark:

A
$$
x' = 0.00011001100110011001101
$$
B
$$
x' - 0.1 \approx 0.00000002384185791
$$
C
$$
0.085830688476 \text{ seconds}
$$
D
$$
171.661376952 \text{ meters}
$$

Practice Problem 2.52

Consider the following two 7-bit floating-point representations based on the IEEE floating-point format. Neither has a sign bit—they can only represent nonnegative numbers.

  1. Format A
    There are k = 3 exponent bits. The exponent bias is 3.
    There are n = 4 fraction bits.
  2. Format B
    There are k = 4 exponent bits. The exponent bias is 7.
    There are n = 3 fraction bits.

Below, you are given some bit patterns in format A, and your task is to convert
them to the closest value in format B. If necessary, you should apply the round-to-even rounding rule. In addition, give the values of numbers given by the format A and format B bit patterns. Give these as whole numbers (e.g., 17) or as fractions (e.g., 17/64).

Format A

| Bits | Value |
| --- | --- |
| 011 0000 | 1 |
| 101 1110 | |
| 010 1001 | |
| 110 1111 | |
| 000 0001 | |

Format B

| Bits | Value |
| --- | --- |
| 0111 000 | 1 |

My solution : :white_check_mark:

Format A

| Bits | Value |
| --- | --- |
| 011 0000 | 1 |
| 101 1110 | $\frac{15}{2}$ |
| 010 1001 | $\frac{25}{32}$ |
| 110 1111 | $\frac{31}{2}$ |
| 000 0001 | $\frac{1}{64}$ |

Format B

| Bits | Value |
| --- | --- |
| 0111 000 | 1 |
| 1001 111 | $\frac{15}{2}$ |
| 0110 100 | $\frac{3}{4}$ |
| 1011 000 | 16 |
| 0001 000 | $\frac{1}{64}$ |

2.4.5 Floating-Point Operations

One strength of the IEEE standard’s method of specifying the behavior of floating-point operations is that it is independent of any particular hardware or software realization.

  • Addition over real numbers also forms an abelian group, but we must consider what effect rounding has on these properties.

    We define $x +^f y = \text{round-to-even}(x + y)$

    • The operation is commutative, but NOT associative.

      For example, (3.14 + 1e10) - 1e10 yields 0.0 (3.14 is lost during rounding), while 3.14 + (1e10 - 1e10) yields 3.14.

      As a result, compilers tend to be very conservative, avoiding any optimizations that could have even the slightest effect on functionality.

    • On the other hand, floating-point addition satisfies the following monotonicity property:

      if a ≥ b, then $x +^f a ≥ x +^f b$ for any values of a, b, and x other than NaN.

  • Floating-point multiplication also obeys many of the properties one normally associates with multiplication

    • it is commutative, and it has 1.0 as a multiplicative identity

    • it is not associative, due to the possibility of overflow or the loss of precision due to rounding.

    • floating-point multiplication does not distribute over addition.

      For example, with single-precision floating point, the expression 1e20*(1e20-1e20) evaluates to 0.0, while 1e20*1e20-1e20*1e20 evaluates to NaN.

    • On the other hand, floating-point multiplication satisfies the following monotonicity properties for any values of a, b, and c other than NaN:

      $a \ge b$ and $c \ge 0 \Rightarrow a *^f c \ge b *^f c$

      $a \ge b$ and $c \le 0 \Rightarrow a *^f c \le b *^f c$

    • In addition, we are also guaranteed that $a *^f a \ge 0$, as long as $a \ne \mathit{NaN}$.

  • As we saw earlier, none of these monotonicity properties hold for unsigned or two's-complement multiplication.

  • This lack of associativity and distributivity is of serious concern to scientific programmers and to compiler writers. Even such a seemingly simple task as writing code to determine whether two lines intersect in three-dimensional space can be a major challenge.

2.4.6 Floating Point in C

All versions of C provide two different floating-point data types: float and double.

  • On machines that support IEEE floating point, these data types correspond to single- and double-precision floating point. In addition, the machines use the round-to-even rounding mode.

  • Unfortunately, since the C standards do not require the machine to use IEEE floating point, there are no standard methods to change the rounding mode or to get special values such as −0, +∞, −∞, or NaN.

    Most systems provide a combination of include (.h) files and procedure libraries to provide access to these features, but the details vary from one system to another.

Practice Problem 2.53

Fill in the following macro definitions to generate the double-precision values +∞, −∞, and −0:

```c
#define POS_INFINITY
#define NEG_INFINITY
#define NEG_ZERO
```

You cannot use any include files (such as math.h), but you can make use of the fact that the largest finite number that can be represented with double precision is around $1.8 × 10^{308}$.

My solution :

Solution on the book :

In general, it is better to use a library macro rather than inventing your own code. This code seems to work on a variety of machines, however. We assume that the value 1e400 overflows to infinity.

```c
#define POS_INFINITY 1e400
#define NEG_INFINITY (-POS_INFINITY)
#define NEG_ZERO (-1.0/POS_INFINITY)
```

When casting values between int, float, and double formats, the program changes the numeric values and the bit representations as follows (assuming data type int is 32 bits):

casting rules

Practice Problem 2.54

Assume variables x, f, and d are of type int, float, and double, respectively. Their values are arbitrary, except that neither f nor d equals +∞, −∞, or NaN.

For each of the following C expressions, either argue that it will always be true (i.e., evaluate to 1) or give a value for the variables such that it is not true (i.e., evaluates to 0).

```c
A. x == (int)(double) x
B. x == (int)(float) x
C. d == (double)(float) d
D. f == (float)(double) f
E. f == -(-f)
F. 1.0/2 == 1/2.0
G. d*d >= 0.0
H. (f+d)-f == d
```

My solution : :white_check_mark:

```
A: True
B: False
C: False
D: True
E: True
F: True
G: True
H: False
```

Solution on the book:

A. x == (int)(double) x
Yes, since double has greater precision and range than int.

B. x == (int)(float) x
No. For example, when x is TMax.

C. d == (double)(float) d
No. For example, when d is 1e40, we will get +∞ on the right.

D. f == (float)(double) f
Yes, since double has greater precision and range than float.

E. f == -(-f)
Yes, since a floating-point number is negated by simply inverting its sign bit.

F. 1.0/2 == 1/2.0
Yes; the numerators and denominators will both be converted to floating-point representations before the division is performed.

G. d*d >= 0.0
Yes, although it may overflow to +∞.

H. (f+d)-f == d
No. For example, when f is 1.0e20 and d is 1.0, the expression f+d will be rounded to 1.0e20, and so the expression on the left-hand side will evaluate to 0.0, while the right-hand side will be 1.0.