WS02: Floating point number systems#

These exercises are indented to give you practice at using the material on numerical approximation and are intended to reinforce the material that was covered in lectures.

Please attempt the worksheet before your tutorial. Support is available in your tutorial or in the Class Team.

Partial solutions are available to check your understanding. Please create Issues and Pull requests with your solutions or share your answers for peer feedback in the class team Worksheet solutions channel

Part a (pen and paper warm up)#

1. Number systems#

Consider the number system given by \((\beta, t, L, U) = (10, 3, -3, 3)\) which gives

\[ x = \pm .b_1 b_2 b_3 \times 10^e \text{ where } -3 \le e \le 3. \]
  • How many numbers can be represented by this normalised system?

  • What are the two largest positive numbers in this system?

  • What are the two smallest positive numbers?

  • What is the smallest possible difference between two numbers in this system?

  • What is the smallest possible difference in this system, \(x\) and \(y\), for which \(x < 100 < y\)?

2. Do floating point operations commute?#

  • Let

    \[ x = .85 \times 10^0, \quad y = .3 \times 10^{-2}, \quad z = .6 \times 10^{-2}, \]

    in the system \((\beta, t, L, U) = (10, 2, -3, 3)\). Evaluate the following expression in this number system.

    \[ x+(y+y), \quad (x+y)+y, \quad x+(z+z), \quad (x+z) +z. \]

    (Also note the benefits of adding the smallest terms first!)

  • Given the number system \((\beta, t, L, U) = (10, 3, -3, 3)\) and \(x = .100\times 10^3\), find nonzero numbers \(y\) and \(z\) from this system for which \(fl(x+y) = x\) and \(fl(x+z) > x\).

Part b (floating point numbers in python)#

3. Floating point types in numpy#

numpy has information about it’s floating point types built in. Run this code block and interpret the answers.

import numpy as np

print(help(np.finfo))

for dtype in [float, np.double, np.single, np.half]:
    print(dtype.__name__, np.finfo(dtype))

4. Equality of floating point representations#

When working with floating point numbers you cannot simply test for equality (see also this stackoverflow question).

We want to test if we have computed the square root of two accurately.

a = np.sqrt(2.0)
print(f"a={a}, type(a)={type(a)}")

b = a * a
print(f"b={b}")

print(b == 2.0)
print(b - 2.0)

Disaster! One option is to use numpy.isclose, which roughly equivalent to this code:

def my_isclose(x, y, tol=1.0e-9):
    return abs(x - y) < tol


print(my_isclose(a * a, b))
print(np.isclose(a * a, b))

5. What’s the best way to write a function#

Show that two functions

\[ f(x) = x ( \sqrt{x+1} - \sqrt{x}) \qquad \mbox{ and } \qquad g(x) = \frac{x}{\sqrt{x+1} + \sqrt{x}}, \]

are equivalent. Evaluate \(f(500)\) and \(g(500)\) using double precision (np.float64), single precision (np.float32) and half precision (np.float16). You should use numpy.sqrt to compute square roots. Explain why these answers are different and comment on which is the more accurate and why.

import numpy as np

def f(x):
    # TODO implement me

def g(x):
    # TODO implement me
# test f(x) and g(x)

6. Ordering of floating point calculations in practice#

For this question you are required to write a function pySquared. This function should make use of the formula

\[ \frac{1}{6} \pi^2 = \sum_{k=1}^\infty \frac{1}{k^2} \]

to estimate the value of \(\pi^2\) by computing the partial sum

\[ \pi^2 \approx 6 \sum_{k=1}^n \frac{1}{k^2} \]

for large values of \(n\).

a. Using double precision and the predefined constant pi, produce a table that gives the absolute error in the approximation to \(\pi^2\) that you obtain using your function piSquared with \(n = \{10^6, 10^7, 10^8, 10^9\}\). What is the difference between your answers when \(n=10^8\) and \(10^9\)? Can you explain what is happening?

def piSquared(n):
    # TODO implement me

# TODO make table

b. Now modify the function piSquared to compute the same sum of the \(n\) terms but to add them up in the opposite order (i.e., start with the smallest terms first - which correspond to the largest values of \(k\) first). Call your new script piSquared_v2.

def piSquared_v2(n):
    # TODO implement me

Produce a table that gives the absolute error in the approximation to \(\pi^2\) that you obtain using your modified piSquared_v2 with \(n = \{10^6, 10^7, 10^8, 10^9\}\). What difference do you see when using the modified function? What is the explanation for this function being superior to the original version?

# TODO make table

Part c: Extension#

7. Check the largest number#

The IEEE single precision standard is \((\beta, t, L, U)=(2, 23, -127, 128)\), which is available via numpy.single. First, use the following Python command to print the largest positive number in this system; then, write a python code (or use a calculator) to compute the largest positive number and check whether it is the same as the output using the Python command.

print(np.finfo(np.float32))

For the largest positive number, \(0.11..1 \times 2^{128}\), you may need to think what does \(0.11..1\) mean for a binary (base is 2) number system.We are used to decimals (base is 10) in our everyday life, for example, \(0.72 = 7\times 10^{-1} + 2\times 10^{-2}\). However, we only have 0 and 1 in a binary number system… Actually, we can convert a binary to decimal as follows: \((0.101)_{t=2} = (1 \times 2^{-1} + 0 \times 2^{-2}+ 1 \times 2^{-3})_{t=10} = (1/2 + 0 + 1/8 )_{t=10}=(0.625)_{t=10}\).

def largest(b, t, l, u):
    np.float64(0)
    b = np.float64(b)
    t = int(t)
    l = np.float64(l)
    u = np.float64(u)


"""
TODO: compute y=0.11..1 x 2^u
"""
print(f"The largest number is {largest(2,24,-126,128)}.")