Florida Institute of Technology Department of Computer Sciences

Number Systems

Number systems are used for counting. We can count ``whole things'' with integers:

\begin{displaymath}\mathbf{Z}=\{\ldots -3,\,-2,\,-1,\,0,\,1,\,2,\,3,\ldots\},\end{displaymath}

and we can count ``fractional things'' with rationals:

\begin{displaymath}\mathbf{Q}=\{p/q\::\:p,\,q \in \mathbf{Z},\ q \neq 0\}.\end{displaymath}

Mathematics extends counting concepts with additional number systems such as real and complex numbers: These are beyond the scope of these notes.

Decimal Number System

The decimal, denary, base 10, or radix 10 number system is a positional system, meaning that numbers are represented as

\begin{displaymath}\cdots d_3d_2d_1d_0.d_{-1}d_{-2}d_{-3}\cdots\end{displaymath}

where the digits

$d_3,\, d_2,\, d_1,\, d_0,\, d_{-1},\, d_{-2},\, d_{-3}$

are symbols from the set

\begin{displaymath}\{0,1,2,3,4,5,6,7,8,9\}\end{displaymath}

and their position from the decimal point ``.'' determines the particular power of 10 by which they are scaled. For example,

\begin{displaymath}314.1592 = 3\cdot 10^{2}+1\cdot 10^{1}+4\cdot 10^{0}+
1\cdot 10^{-1}+5\cdot 10^{-2}+9\cdot 10^{-3}+2\cdot 10^{-4}\end{displaymath}

To represent signed numbers two additional symbols, ``+'' and ``-'', are included, but the ``+'' symbol is not always explicitly used:

\begin{displaymath}+314 = 314\ \mbox{(positive integer)}, \quad -314\ \mbox{(negative integer)}\end{displaymath}

You must already know how to perform arithmetic (addition, subtraction, multiplication, and division) in the decimal system.

The decimal system does not serve well for computer arithmetic because it is difficult to build cheap devices that stably represent the 10 digits and the 2 signs.

Let's look at another interesting system that is also not useful in computing, although it has a long history and still no doubt serves many people very well.

Unary Number System

The unary or tally number system uses one symbol ``|'' (called a unit) to count positive whole things: the number n is written as a string of n units.

The unary number system is unlike any of the other systems used to count.

1. The unary system is not a positional system.
2. We can count nothing (zero) only by the lack of any unit (symbol).
3. Counting fractional things is unintuitive at best (an off-line discussion perhaps).
4. A sign symbol ``-'' is needed to represent negative quantities.
5. It is not efficient: the width needed to represent a number n is n. For example, to count 10 in unary we use |||||||||| (ten units), which has width w=10, and to count 1,000,000 requires a string of one million units (width $w=10^6$).

To study efficiency more let's consider another unusual system.

Milliary Number System

The milliary number system uses r=1,000 as a base or radix for counting and one thousand millits to denote counts. The millit symbols are

\begin{displaymath}\{0,\,1,\,2,\ldots,9,\,A,\,B,\dots,\spadesuit\}\end{displaymath}

where $\spadesuit$ represents the decimal number 999. For example,

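\begin{eqnarray*}2A\spadesuit.\spadesuit4B & = & 2\cdot 1000^{2} + A\cdot 1000^{1} + \spadesuit\cdot 1000^{0} + \spadesuit\cdot 1000^{-1} + 4\cdot 1000^{-2} + B\cdot 1000^{-3}\\
& = & 2,000,000 + 10,000 + 999 + \frac{999}{1,000} + \frac{4}{1,000,000}+\frac{11}{1,000,000,000}\\
& = & 2\ 010\ 999.999004011_{10}\\
\end{eqnarray*}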


Notice the use of subscripts to denote the base (radix) whenever confusion can arise: The default base is 10 -- if no subscript is used the number is interpreted as decimal.

The width needed to count large numbers is small: For example, we can count $999_{10}$ with one symbol $\spadesuit$ (width w=1); we can count 1,000 with 2 symbols $10_{1000}$ (width w=2); and we can count $999,999_{10}$ with 2 symbols $\spadesuit\spadesuit$ (width w=2).

But the milliary system is still not very efficient: the radix is very large, so the number of distinct symbols (millits) needed to represent numbers is large too.

The accepted measure of a number system's efficiency is the product of the radix and the width, rw; a smaller product means a more efficient system. Representing all numbers between 0 and 999,999 in milliary has efficiency $rw=1000\times 2=2,000$.

Milliary is not a good computing system either, because building efficient, stable devices that can represent 1000 different symbols (states) is difficult. The binary system, by contrast, is efficient, and it is not hard to build stable 2-state devices.

Binary Number System

The binary system uses two symbols ``0'' and ``1'' called bits to count. It is vastly superior to unary and milliary for efficient counting. We can count 10 using 4 bits, $1010_2$, and we can count $1,000,000_{10}$ using 20 bits:

\begin{eqnarray*}1111\ 0100\ 0010\ 0100\ 0000_{2} & = & 2^{19}+2^{18}+2^{17}+2^{16}+2^{14}+2^{9}+2^{6} \\
& = & 524288+262144+131072+65536+16384+512+64 \\
& = & 1,000,000_{10} \\
\end{eqnarray*}


Binary numbers have an efficiency of $rw=2\times 20 =40$ when representing counts between 0 and one million.

Here's another example:

\begin{eqnarray*}1101.1101_2 &=& 2^{3}+2^{2}+2^{0}+2^{-1}+2^{-2}+2^{-4} \\
&=& 8+4+1+\frac{1}{2}+\frac{1}{4}+\frac{1}{16} \\
&=& 13 + \frac{13}{16}\\
&=& 13.8125 \\
\end{eqnarray*}


We'll consider arithmetic over the binary system below. Some authors use a ``b'' suffix to denote that a number is binary, e.g.,

\begin{displaymath}11101.1111_2 = 11101.1111\mbox{b}\end{displaymath}

Ternary Number System

The ternary system uses ``trits'' 0, 1, and 2 to count. For example:

\begin{eqnarray*}201.1202_3 &=& 2\times3^{2}+3^{0}+3^{-1}+2\times 3^{-2}+2\times 3^{-4} \\
&=& 18+1+\frac{27}{81}+\frac{18}{81}+\frac{2}{81} \\
&=& 19 + \frac{47}{81}\\
&=& 19.580246\cdots \\
\end{eqnarray*}


Ternary is more efficient than binary: It has an efficiency of $rw=3\times13 =39$ when representing counts between 0 and one million:

\begin{eqnarray*}1,000,000_{10} & = & 3^{12} + 2\cdot3^{11} +3^{10} + 2\cdot3^{9} + 2\cdot3^{8} + 3^{7} + 2\cdot3^{5} +2\cdot3^{3} +3^{0}\\
& = & 1\ 212\ 210\ 202\ 001_{3}\\
\end{eqnarray*}
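
The efficiency comparison can be checked mechanically. The following is a minimal sketch in Java, not part of the original notes; the class name RadixEconomy and its methods width and cost are illustrative assumptions. It counts the digits needed for the largest value in a range and multiplies by the radix.

```java
final class RadixEconomy {
    /** Number of base-r digits needed to write every integer from 0 to max. */
    static int width(long max, int r) {
        int w = 1;
        while (max >= r) {   // strip one digit per iteration
            max /= r;
            w++;
        }
        return w;
    }

    /** The product r*w used in these notes as the efficiency measure. */
    static long cost(long max, int r) {
        return (long) r * width(max, r);
    }

    public static void main(String[] args) {
        // Compare the radices discussed so far on the range 0..999,999.
        for (int r : new int[] {2, 3, 1000}) {
            System.out.println("r=" + r + "  w=" + width(999_999, r)
                    + "  rw=" + cost(999_999, r));
        }
    }
}
```

Under these assumptions the three radices give rw = 40, 39, and 2000, matching the figures quoted above.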


Octal Number System

Octal numbers have a base b=8, and use the octits 0,1,2,3,4,5,6,7 to denote magnitudes. For example,

\begin{eqnarray*}5\ 672.1_8 & = & 5\cdot8^{3}+6\cdot8^{2}+7\cdot8^{1}+2\cdot8^{0}+1\cdot8^{-1}\\
& = & 2560+384+56+2+\frac{1}{8}\\
& = & 3002.125\\
\end{eqnarray*}


Some authors use a ``o'' or ``q'' suffix to denote that a number is octal, e.g.,

\begin{displaymath}21012.1201_8 = 21012.1201\mbox{o} = 21012.1201\mbox{q}.\end{displaymath}

In the C programming language octal character constants are represented by

'\ooo'

where ooo is one to three octits, e.g., '\013' is decimal 11 and ASCII for vertical tab.

In the Java programming language octal integer constants are represented by a leading zero followed by octits,

0nnn

e.g., 0377 is decimal 255.

Hexadecimal Number System

Hexadecimal numbers have a base b=16. The letters


\begin{displaymath}A=10, \, B=11,\, C=12,\,D=13, E = 14,\,F=15\end{displaymath}

are used as hexits above 9.

\begin{eqnarray*}C7A.D_{16} & = & 12\cdot16^{2}+7\cdot16^{1}+10\cdot16^{0}+13\cdot16^{-1}\\
& = & 3072+112+10+\frac{13}{16}\\
& = & 3194.8125\\
\end{eqnarray*}


Some authors use a ``h'' or ``H'' suffix to denote that a number is hexadecimal, e.g.,

\begin{displaymath}32AE1.1B_{16} = 32AE1.1B\mbox{h} = 32AE1.1B\mbox{H}.\end{displaymath}

In the C programming language hexadecimal character constants are represented by

'\xh'

where h is one or more hexits, e.g., '\xb' is decimal 11 and ASCII for vertical tab.

In the Java programming language hexadecimal integer constants are represented by a leading 0x or 0X followed by hexits,

0xhhh

e.g., 0Xff is decimal 255.

Sexagesimal Number System

Sexagesimal numbers have a base b=60 and use 60 symbols called sexits. Babylonian astronomers used this system, and remnants of it still exist today in trigonometric and time units of ``degrees, minutes, and seconds.''

General Case: Positional Numbers Base b

Given an integer base b > 1, for example b=10, a number x can be written as

\begin{displaymath}x = c_mb^m + c_{m-1}b^{m-1} + \cdots + c_2b^2+c_1b + c_0b^0+c_{-1}b^{-1}+c_{-2}b^{-2}+\cdots\end{displaymath}

for appropriate coefficients $c_m,\,c_{m-1},\ldots, c_2,\,c_1,\,c_0,\,c_{-1},\,c_{-2},\ldots$, all of which are greater than or equal to 0 and less than or equal to b-1.

Using n positions to the left of the point you can represent any integer between 0 and

\begin{displaymath}\sum_{i=0}^{n-1} (b-1)\times b^i = b^n-1\end{displaymath}

For example, with b=10 and n=3 the largest integer is $10^3-1=999$, and with b=2 and n=4 it is $2^4-1=15$.

Conversions Between Positional Number Systems

Conversions From Decimal to Other Systems

Repeated division with accumulation of remainders can be used to convert integers from decimal to an arbitrary base:


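public class Base {
    // Convert a non-negative decimal integer to base b (2 <= b <= 36)
    // by repeated division, collecting the remainders.
    public static String convert(int num, int b) {
        if (num < 0 || b < Character.MIN_RADIX || b > Character.MAX_RADIX) {
            throw new IllegalArgumentException("need num >= 0 and 2 <= b <= 36");
        }
        StringBuilder result = new StringBuilder();
        while (num >= b) {
            // forDigit maps 10..35 to a..z so each remainder is one symbol
            result.append(Character.forDigit(num % b, b));
            num /= b;
        }
        result.append(Character.forDigit(num, b));
        // remainders are produced least significant first
        return result.reverse().toString();
    }
    public static void main(String[] args) {
        int num = Integer.parseInt(args[0]);
        int b = Integer.parseInt(args[1]);
        System.out.println("Decimal Number = " + num);
        System.out.println("Base = " + b);
        System.out.println("Conversion = " + convert(num, b));
    }
}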

Example 1: given base b=9 and decimal number n=342 we can find the base 9 representation of 342 by repeated division:

\begin{displaymath}342 \div 9 = 38\ R\ 0 \quad 38 \div 9 = 4\ R\ 2 \quad 4 \div 9 = 0\ R \ 4\end{displaymath}

So

\begin{displaymath}342 = 4\times 9^2 + 2 \times 9 + 0 \times 9^0 = 420_9\end{displaymath}

Example 2: Let b=3 and write decimal number n=1,000,000 in ternary.

Division Quotient Remainder
$1,000,000 \div 3$ 333,333 1
$333,333 \div 3$ 111,111 0
$111,111 \div 3$ 37,037 0
$37,037 \div 3$ 12,345 2
$12,345 \div 3$ 4,115 0
$ 4,115 \div 3$ 1,371 2
$ 1,371 \div 3$ 457 0
$ 457 \div 3$ 152 1
$ 152 \div 3$ 50 2
$ 50 \div 3$ 16 2
$ 16 \div 3$ 5 1
$ 5 \div 3$ 1 2
$ 1 \div 3$ 0 1
$1,000,000_{10} = 1\ 212\ 210\ 202\ 001_3$

Fractions can be converted by repeated multiplication and accumulation of integer excesses.

Example 3: Let b=2 and write decimal fraction n=0.1 in binary.

Multiplication Product Integer Fraction
$0.1 \times 2$ 0.2 0 0.2
$0.2 \times 2$ 0.4 0 0.4
$0.4 \times 2$ 0.8 0 0.8
$0.8 \times 2$ 1.6 1 0.6
$0.6 \times 2$ 1.2 1 0.2
$0.2 \times 2$ 0.4 0 0.4
$0.4 \times 2$ 0.8 0 0.8
$0.8 \times 2$ 1.6 1 0.6
$0.6 \times 2$ 1.2 1 0.2


\begin{eqnarray*}0.1 & = & 0.000110011001100\cdots_2 \\
& = & \frac{1}{16}+\frac{1}{32}+\frac{1}{256}+\frac{1}{512}+\cdots \\
& = & \frac{1}{16}\frac{16}{15}+\frac{1}{32}\frac{16}{15} \\
& = & \frac{1}{10} \\
\end{eqnarray*}


Example 4: Let b=3 and write decimal fraction n=0.01 in ternary.

Multiplication Product Integer Fraction
$0.01 \times 3$ 0.03 0 0.03
$0.03 \times 3$ 0.09 0 0.09
$0.09 \times 3$ 0.27 0 0.27
$0.27 \times 3$ 0.81 0 0.81
$0.81 \times 3$ 2.43 2 0.43
$0.43 \times 3$ 1.29 1 0.29
$0.29 \times 3$ 0.87 0 0.87
$0.87 \times 3$ 2.61 2 0.61

$0.01 = 0.00002102\cdots_3$
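
The repeated multiplication in Examples 3 and 4 can be sketched in Java as follows. This is an illustration rather than code from the notes: the class name FractionDigits and the choice to pass the fraction as an integer numerator and denominator (so that the arithmetic is exact) are assumptions made here.

```java
final class FractionDigits {
    /**
     * First `count` base-b digits of the fraction num/den (0 <= num < den),
     * found by repeated multiplication using exact integer arithmetic.
     */
    static String digits(long num, long den, int b, int count) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < count; i++) {
            num *= b;                       // multiply the fraction by the base
            long digit = num / den;         // the integer excess is the next digit
            out.append(Character.forDigit((int) digit, b));
            num %= den;                     // keep only the fractional part
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("0.1  in base 2: 0." + digits(1, 10, 2, 12));
        System.out.println("0.01 in base 3: 0." + digits(1, 100, 3, 8));
    }
}
```

Working with num/den instead of a double avoids the very rounding problem the example illustrates: 0.1 itself has no exact binary representation.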

Conversions From Other Systems to Decimal

Converting from some base to decimal can be done by repeated multiplication (the opposite of the steps above). The conversion can be organized using a fool-proof simple tabular format, called Horner's algorithm.

Example 5: Convert $1\ 0001\ 0111_2$ to decimal:

1 0 0 0 1 0 1 1 1  
  2 4 8 16 34 68 138 278  
1 2 4 8 17 34 69 139 279  

$2^8+2^4+2^2+2^1+2^0=256+16+4+2+1=279$

1. Bring down the left (most significant) bit.
2. Multiply it by 2.
3. Add the result to the next most significant bit.
4. Multiply the sum by 2.
5. Repeat steps 3 and 4 until all bits are used, ending with an addition.

The decimal equivalent of the binary number is the last sum.

Example 6: Convert $1\ 2201\ 0121_3$ to decimal: Here the multiplier is 3.

1 2 2 0 1 0 1 2 1  
  3 15 51 153 462 1386 4161 12489  
1 5 17 51 154 462 1387 4163 12490  

\begin{displaymath}3^{8}+2\cdot 3^{7}+2\cdot 3^{6}+1\cdot 3^{4}+3^{2}+2\cdot 3^{1}+ 3^{0} = 12490\end{displaymath}
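
The tabular method of Examples 5 and 6 amounts to a short loop. Here is a minimal Java sketch; the class name Horner and the method toDecimal are hypothetical names chosen for this illustration.

```java
final class Horner {
    /** Value of a digit string in the given base (2..36), by Horner's rule. */
    static long toDecimal(String digits, int base) {
        long sum = 0;
        for (char c : digits.toCharArray()) {
            int d = Character.digit(c, base);   // -1 if c is not a digit in this base
            if (d < 0) {
                throw new IllegalArgumentException("bad digit: " + c);
            }
            sum = sum * base + d;               // multiply the running sum, add the next digit
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("100010111", 2));   // Example 5
        System.out.println(toDecimal("122010121", 3));   // Example 6
    }
}
```

Each pass through the loop corresponds to one column of the table: the middle row is sum * base and the bottom row is the new sum.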

Conversion Between Non-Decimal Bases

To convert a binary number to hexadecimal, group the bits 4 at a time starting from the right (least significant bit):


\begin{displaymath}10\ 0010\ 1110_2 = 22E_{16}=558\end{displaymath}


\begin{displaymath}1010\ 1101\ 0110_2 = AD6_{16}\end{displaymath}

To express a hexadecimal number in binary, simply expand each hexit:


\begin{displaymath}CAD6_{16}=1100\ 1010\ 1101\ 0110_2\end{displaymath}

To convert a binary number to octal, group the bits 3 at a time starting from the right (least significant bit):


\begin{displaymath}1\ 000\ 101\ 110_2 = 1056_8=558\end{displaymath}


\begin{displaymath}101\ 011\ 010\ 111_2 = 5327_{8}\end{displaymath}

To express an octal number in binary, simply expand each octit:


\begin{displaymath}756_{8}=111\ 101\ 110_2\end{displaymath}
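
Grouping works because 8 and 16 are powers of 2, so each group of k bits is exactly one digit in base $2^k$. A small Java sketch of the grouping step follows; the class BitGrouping and the method regroup are illustrative names, not from the notes.

```java
final class BitGrouping {
    /** Regroup a binary string into digits of base 2^k (k = 3 for octal, k = 4 for hex). */
    static String regroup(String bits, int k) {
        // Pad with zeros on the left so the length is a multiple of k;
        // this is the same as grouping from the least significant bit.
        StringBuilder padded = new StringBuilder();
        int pad = (k - bits.length() % k) % k;
        for (int i = 0; i < pad; i++) {
            padded.append('0');
        }
        padded.append(bits);

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < padded.length(); i += k) {
            int value = Integer.parseInt(padded.substring(i, i + k), 2);
            out.append(Character.toUpperCase(Character.forDigit(value, 1 << k)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(regroup("1000101110", 4));   // hexadecimal digits
        System.out.println(regroup("1000101110", 3));   // octal digits
    }
}
```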

Signed Integers

There are multiple representations for signed integers:

Signed magnitude

Use a + or - sign, but encode it in the leftmost bit: 0 for + and 1 for -.

Binary Number System
One Bit     2 Bits      3 Bits       4 Bits
0 = +0      0 = +0      0 = +0       0 = +0
1 = -0      1 = +1      1 = +1       1 = +1
            10 = -0     10 = +2      10 = +2
            11 = -1     11 = +3      11 = +3
                        100 = -0     100 = +4
                        101 = -1     101 = +5
                        110 = -2     110 = +6
                        111 = -3     111 = +7
                                     1000 = -0
                                     1001 = -1
                                     1010 = -2
                                     1011 = -3
                                     1100 = -4
                                     1101 = -5
                                     1110 = -6
                                     1111 = -7

Two's complement

The one's complement of a number is formed by flipping each bit, for example

\begin{displaymath}n= 0\ 1110\ 1101 \longrightarrow \bar{n} = \cdots 1111\ 1111\ 0001\ 0010\end{displaymath}

A number n plus its one's complement $\bar{n}$ always equals $\cdots 1111\ 1111$.

\begin{displaymath}n+\bar{n} = \cdots 1111\ 1111\end{displaymath}

Notice the following:

1.
Adding 1 to $\cdots 1111\ 1111$ gives $\cdots 0000\ 0000$ (carry out = carry in = 1)
2.
Subtracting 1 from $\cdots 0000\ 0000$ gives $\cdots 1111\ 1111$.

The two's complement of a number n is its one's complement plus 1.

\begin{displaymath}\mbox{Two's complement of $n$} = \bar{\bar{n}} = \bar{n} +1\end{displaymath}

For example:

Decimal   Binary      One's Complement   Two's Complement   Decimal
0         0           1111 1111          0000 0000          0
1         1           1111 1110          1111 1111          -1
2         10          1111 1101          1111 1110          -2
3         11          1111 1100          1111 1101          -3
4         100         1111 1011          1111 1100          -4
32        10 0000     1101 1111          1110 0000          -32
127       111 1111    1000 0000          1000 0001          -127

Notice that a number plus its two's complement is always a string of zeros.

Two algorithms for finding the two's complement:

1. Complement each bit and add 1.
2. (Shortcut) Starting from the rightmost (least significant) bit, copy bits through and including the first 1; complement the remaining bits.
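
Both algorithms can be written out for 8-bit words. This Java sketch is illustrative only; the class and method names are assumptions, and words are held in the low 8 bits of an int.

```java
final class TwosComplement {
    /** Algorithm 1: complement each bit and add 1 (result kept to 8 bits). */
    static int complementAndAdd(int n) {
        return ((~n) + 1) & 0xFF;
    }

    /** Algorithm 2: copy bits from the right through the first 1, then flip the rest. */
    static int copyThenFlip(int n) {
        n &= 0xFF;
        int result = 0;
        boolean seenOne = false;
        for (int i = 0; i < 8; i++) {
            int bit = (n >> i) & 1;
            if (seenOne) {
                bit ^= 1;            // past the first 1: complement the bit
            } else if (bit == 1) {
                seenOne = true;      // the first 1 itself is copied unchanged
            }
            result |= bit << i;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 2, 3, 4, 32, 127}) {
            System.out.println(n + " -> " + Integer.toBinaryString(copyThenFlip(n)));
        }
    }
}
```

The two methods agree on every 8-bit input, which is a quick way to convince yourself that the shortcut is correct.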

Memory Organization

A computer's memory is divided into chunks. By default assume that a computer's logic and arithmetic is based on binary notation. The smallest chunk of memory is a bit: a 0 or a 1. Larger chunks are:

The word length of a processor is the number of bits in its fundamental unit. Usually the word length is:

Some architectures also define ``bytes,'' ``halfwords,'' ``double words,'' etc.

Processor Word Length Year
Intel 4004 4 bits 1971
Intel 8080 8 bits 1974
Intel 8086 16 bits 1978
Intel 80386 32 bits 1985
Intel Pentium 32 bits 1993
Intel Itanium 64 bits 2001
Motorola 68000 16 bits 1980
VAX 11/780 32 bits 1977
IBM S/360 32 bits mid 1960's
UNIVAC 12 characters early to mid 1950's
ENIAC 10 digits mid 1940's

  
Arithmetic

Binary Addition (Half-Adder)
Input Output
Addend Addend Sum Carry (Out)
0 0 0 0
1 0 1 0
0 1 1 0
1 1 0 1

Binary Addition (Full-Adder)  
Input Output
Addend Addend Carry (In) Sum Carry (Out)
0 0 0 0 0
1 0 0 1 0
0 1 0 1 0
1 1 0 0 1
0 0 1 1 0
1 0 1 0 1
0 1 1 0 1
1 1 1 1 1
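
The full-adder table is all that is needed to add words of any length: chain one full adder per bit position and pass each carry out to the next position as its carry in. The Java sketch below does this for 8-bit words; the class name RippleAdder and its methods are illustrative assumptions.

```java
final class RippleAdder {
    /** Sum bit of a full adder: 1 when an odd number of the inputs are 1. */
    static int sum(int a, int b, int carryIn) {
        return a ^ b ^ carryIn;
    }

    /** Carry-out bit of a full adder: 1 when at least two of the inputs are 1. */
    static int carryOut(int a, int b, int carryIn) {
        return (a & b) | (a & carryIn) | (b & carryIn);
    }

    /** Add two 8-bit words bit by bit; bit 8 of the result holds the final carry. */
    static int add8(int x, int y) {
        int result = 0;
        int carry = 0;
        for (int i = 0; i < 8; i++) {
            int a = (x >> i) & 1;
            int b = (y >> i) & 1;
            result |= sum(a, b, carry) << i;
            carry = carryOut(a, b, carry);
        }
        return result | (carry << 8);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(add8(13, 29)));
        System.out.println(Integer.toBinaryString(add8(255, 1)));
    }
}
```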

Multiplication      
Multiplicand Multiplier Product      
0 0 0      
1 0 0      
0 1 0      
1 1 1      

Unsigned Overflow

With 8 bits it is possible to represent unsigned integers from 0 to 255:

Decimal Value Unsigned Binary Representation
0 0000 0000
1 0000 0001
2 0000 0010
... ...
254 1111 1110
255 1111 1111

Overflow occurs when the result of an operation does not fit in the word length of the processor (assume an 8 bit word length),

\begin{displaymath}255+1 = 1111\ 1111 + 0000\ 0001 = 1\ 0000\ 0000\end{displaymath}


\begin{displaymath}128\times 2 = 1000\ 0000 \times 0000\ 0010 = 1\ 0000\ 0000\end{displaymath}

and is detected when there is a carry out of 1 from the most significant (leftmost) bit.

Overflow does not occur in pure mathematics.

Two's Complement Overflow

Overflow is detected when:

Carry into the most significant bit is not the same as the carry out of the most significant bit


\begin{displaymath}127+1 = 0111\ 1111 + 0000\ 0001 = 1000\ 0000\quad c_{\mbox{in}}=1,\,c_{\mbox{out}}=0\end{displaymath}


\begin{displaymath}-128+(-128) = 1000\ 0000 + 1000\ 0000 = 1\ 0000\ 0000\quad c_{\mbox{in}}=0,\,c_{\mbox{out}}=1\end{displaymath}
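
The two overflow rules for 8-bit addition can be compared side by side. In this Java sketch (hypothetical class and method names, operands given as 8-bit patterns held in an int), the carry into the most significant bit is computed as the carry out of the bit below it.

```java
final class Overflow8 {
    /** Carry out of bit position pos when adding the low (pos+1) bits of x and y. */
    static int carryOutOf(int x, int y, int pos) {
        int mask = (1 << (pos + 1)) - 1;
        return (((x & mask) + (y & mask)) >> (pos + 1)) & 1;
    }

    /** Unsigned overflow: a carry out of the most significant bit (bit 7). */
    static boolean unsignedOverflow(int x, int y) {
        return carryOutOf(x, y, 7) == 1;
    }

    /** Two's complement overflow: carry into bit 7 differs from carry out of bit 7. */
    static boolean signedOverflow(int x, int y) {
        int carryIn = carryOutOf(x, y, 6);    // carry out of bit 6 is the carry into bit 7
        int carryOut = carryOutOf(x, y, 7);
        return carryIn != carryOut;
    }

    public static void main(String[] args) {
        System.out.println("255 + 1 unsigned overflow: " + unsignedOverflow(0xFF, 0x01));
        System.out.println("127 + 1 signed overflow:   " + signedOverflow(0x7F, 0x01));
        System.out.println(" -1 + 1 signed overflow:   " + signedOverflow(0xFF, 0x01));
    }
}
```

Note that the same bit patterns, 1111 1111 + 0000 0001, overflow as unsigned numbers (255+1) but not as two's complement numbers (-1+1).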

Floating Point Numbers

A very good article on floating point numbers is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It contains more information than the basics covered here.

Computers can represent all integers within a specific range and as long as operations on integers in this range do not produce integers outside the range the results are exact.

Computers cannot represent all rationals within any (but trivial) range. Therefore operations on rationals almost always produce errors.

The method used to represent rational numbers is called floating point and is characterized by:

1. the base b, which is almost always 2 in modern computers;
2. the precision p, which is the number of symbols (bits) in the significand, counting the leading 1;
3. the exponent range from L to U, which allows the value to scale (or float) between some small and large value.

An (IEEE) single precision floating point number x has the form

\begin{displaymath}x = (-1)^{s} \times 2^{e-127}\times 1.f\end{displaymath}

where

\begin{displaymath}f=\frac{d_1}{2}+\frac{d_2}{2^2}+\cdots+\frac{d_{p-1}}{2^{p-1}}\end{displaymath}

where the $d_1,\,d_2,\ldots, d_{p-1}$ are bits 0 or 1, and $L \leq e-127 \leq U$.

Example Machines
Machine b p L U
Univac 2 27 -128 127
IBM 360 16 14 -64 63
IEEE (single) 2 24 -126 127
IEEE (double) 2 53 -1022 1023

The IEEE 754 floating point standard requires that floating point numbers be normalized, that is, the significand starts with binary 1.

Example 7:

Suppose base b=2; precision p=4 (1 leading bit plus 3 fraction bits); and the exponent range is from -1 to +3. Then the positive numbers are:

Value        Scale $2^{-1}$        Scale $2^{0}$       Scale $2^{1}$       Scale $2^{2}$       Scale $2^{3}$
$1.000_2$    $0.1_2$ = 1/2         $1_2$ = 1           $10_2$ = 2          $100_2$ = 4         $1000_2$ = 8
$1.001_2$    $0.1001_2$ = 9/16     $1.001_2$ = 9/8     $10.01_2$ = 9/4     $100.1_2$ = 9/2     $1001_2$ = 9
$1.010_2$    $0.1010_2$ = 5/8      $1.010_2$ = 5/4     $10.10_2$ = 5/2     $101.0_2$ = 5       $1010_2$ = 10
$1.011_2$    $0.1011_2$ = 11/16    $1.011_2$ = 11/8    $10.11_2$ = 11/4    $101.1_2$ = 11/2    $1011_2$ = 11
$1.100_2$    $0.1100_2$ = 3/4      $1.100_2$ = 3/2     $11.00_2$ = 3       $110.0_2$ = 6       $1100_2$ = 12
$1.101_2$    $0.1101_2$ = 13/16    $1.101_2$ = 13/8    $11.01_2$ = 13/4    $110.1_2$ = 13/2    $1101_2$ = 13
$1.110_2$    $0.1110_2$ = 7/8      $1.110_2$ = 7/4     $11.10_2$ = 7/2     $111.0_2$ = 7       $1110_2$ = 14
$1.111_2$    $0.1111_2$ = 15/16    $1.111_2$ = 15/8    $11.11_2$ = 15/4    $111.1_2$ = 15/2    $1111_2$ = 15


[Figure: number line from 0 to 15.]

Note that errors must occur when performing arithmetic with this set of numbers. For example,

\begin{displaymath}6+\frac{9}{16}=\frac{105}{16} \approx \frac{13}{2}.\end{displaymath}

This rounding error is inevitable.

IEEE single precision uses 32 bit words: 1 sign bit, 8 exponent bits, and 23 fraction bits, plus 1 hidden bit (the implicit leading 1).

Typical layout of a word
s e[7:0] f[22:0]

Fields                                                   Value                                     Kind
0 < e < 255                                              $(-1)^{s} \times 2^{e-127} \times 1.f$    normal numbers
e = 0; $f \neq 0$ (at least one bit in f is nonzero)     $(-1)^s \times 2^{-126} \times 0.f$       subnormal numbers
e = 0; f = 0 (all bits in f are zero)                    $(-1)^s \times 0$                         signed zero
s = 0; e = 255; f = 0                                    +INF                                      positive infinity
s = 1; e = 255; f = 0                                    -INF                                      negative infinity
s = 0 or 1; e = 255; $f \neq 0$                          NaN                                       Not-a-Number
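
The three fields can be pulled out of a Java float with Float.floatToIntBits, which returns the 32-bit pattern as an int. The sketch below is an illustration; the class FloatFields and its method names are assumptions made here.

```java
final class FloatFields {
    /** Sign bit s (bit 31). */
    static int sign(float x) {
        return Float.floatToIntBits(x) >>> 31;
    }

    /** Biased exponent e (bits 30..23). */
    static int exponent(float x) {
        return (Float.floatToIntBits(x) >>> 23) & 0xFF;
    }

    /** Fraction f as a 23-bit integer (bits 22..0). */
    static int fraction(float x) {
        return Float.floatToIntBits(x) & 0x7FFFFF;
    }

    /** Classify a value by the cases in the table above. */
    static String classify(float x) {
        int e = exponent(x);
        int f = fraction(x);
        if (e == 255) {
            return f == 0 ? "infinity" : "NaN";
        }
        if (e == 0) {
            return f == 0 ? "zero" : "subnormal";
        }
        return "normal";
    }

    public static void main(String[] args) {
        float x = -0.15625f;    // -1.01 (binary) times 2^-3
        System.out.println("s=" + sign(x) + " e=" + exponent(x)
                + " f=" + Integer.toBinaryString(fraction(x)) + " " + classify(x));
    }
}
```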

Relative Error and ULPS

There are several ways to measure the errors made in floating point approximations to non-representable rational or real numbers. One is absolute error: Let x denote the real number and let $\tilde{x}$ denote the floating point value. The absolute error is the absolute value of their difference:

\begin{displaymath}\vert x-\tilde{x}\vert.\end{displaymath}

For example, the absolute error committed when approximating x=3.14159 by $\tilde{x}=3.14$ is 0.00159.

Absolute error is not used often: it does not always produce a good measure of error. Two very large numbers might approximate each other well but have a large absolute error because they are large:

\begin{displaymath}\vert 1.073\ 741\ 824\times 10^{9}-1.073\ 741\ 825\times 10^{9}\vert = 1.0\end{displaymath}

Two very small numbers might not be good approximations of one another but have a small absolute error because they are small:

\begin{displaymath}\vert 9.31322574615478515625\times10^{-10}-9.31322574615478515000\times10^{-10}\vert=6.25\times10^{-28}\end{displaymath}

And absolute error does not take into account the base and the precision of the machine.

A related, but better way to measure the difference between a floating point number and a real number is units in the last place (ulps). Ulps take into account the precision of the computer. For example, assuming the base is 10, the precision is p=7, $\tilde{x}=1.073\ 741\times 10^{9}$ and $x=1.073\ 741\ 824\times 10^{9}$, we compute:

\begin{displaymath}\vert 1.073\ 741\ 824 -1.073\ 741\ 000\vert \times 10^{7-1} = 0.824\ \mbox{ulps}.\end{displaymath}

Another way to measure the difference between floating point numbers and real numbers is relative error, their difference divided by the real number:

\begin{displaymath}\frac{x-\tilde{x}}{x}.\end{displaymath}

For example, the relative error committed when approximating x=3.14159 by $\tilde{x}=3.14$ is

\begin{displaymath}\frac{3.14159-3.14}{3.14159}\approx 0.000506.\end{displaymath}

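To see the difference between absolute error, ulps, and relative error, consider the real number x=12.35 and floating point number $\tilde{x}=1.24\times 10^1$, where the base is 10 and the precision is 3. The absolute error is 0.05, which is 0.5 ulps, and the relative error is $0.05/12.35\approx 0.004$.

Now consider 8x=98.8 and $8\tilde{x}=9.92\times 10^1$. The absolute error grows to 0.4, which is 4 ulps, but the relative error $0.4/98.8\approx 0.004$ is unchanged.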
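
The three error measures are easy to compute side by side. The Java sketch below is an illustration under stated assumptions: base 10, with the precision p and the decimal exponent of the approximation passed in explicitly, and with class and method names invented here. The inputs are themselves stored as binary doubles, so the printed values are only approximately the decimal ones.

```java
final class ErrorMeasures {
    static double absoluteError(double x, double approx) {
        return Math.abs(x - approx);
    }

    static double relativeError(double x, double approx) {
        return Math.abs(x - approx) / Math.abs(x);
    }

    /**
     * Error in units in the last place for a base-10 approximation
     * d.dd... x 10^exponent carrying p significant digits.
     */
    static double ulps(double x, double approx, int p, int exponent) {
        double ulp = Math.pow(10, exponent - (p - 1));   // weight of the last digit
        return Math.abs(x - approx) / ulp;
    }

    public static void main(String[] args) {
        // x = 12.35 approximated by 1.24 x 10^1 with three significant digits
        System.out.println(absoluteError(12.35, 12.4) + "  "
                + ulps(12.35, 12.4, 3, 1) + "  " + relativeError(12.35, 12.4));
        // both values multiplied by 8: 98.8 approximated by 9.92 x 10^1
        System.out.println(absoluteError(98.8, 99.2) + "  "
                + ulps(98.8, 99.2, 3, 1) + "  " + relativeError(98.8, 99.2));
    }
}
```

Scaling both numbers by 8 multiplies the absolute error and the ulps count by 8 but leaves the relative error where it was.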

Machine Epsilon

A machine's (single precision) epsilon $\epsilon$ is the smallest positive (single precision) floating point number such that

\begin{displaymath}1\oplus \epsilon > 1\end{displaymath}

where $\oplus$ denotes floating point addition. All machines that implement the IEEE floating point standard have machine epsilon

Single Precision Double Precision
1.192092895507812500e-07 = $2^{-23}$ 2.220446049250313081e-16 = $2^{-52}$

Here's a code fragment that can be used to calculate a machine's epsilon:


#include <stdio.h>

int main(void) {
    float feps = 1.0f;
    double eps = 1.0;
    int exp = 0;

    /* calculate the machine epsilon: halve until 1 + eps rounds to 1 */
    while (1.0f < (float) (1.0f + feps)) { feps /= 2.0f; exp++; }
    /* the loop overshoots by one halving, so report 2*feps and exp-1 */
    printf("(Single precision) Machine epsilon %20.18e = 2^-%d \n \n", 2*feps, exp - 1);

    exp = 0;
    while (1.0 < (double) (1.0 + eps)) { eps /= 2.0; exp++; }
    printf("(Double precision) Machine epsilon %20.18e = 2^-%d \n \n", 2*eps, exp - 1);
    return 0;
}
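
For comparison, the same halving search can be sketched in Java, where Math.ulp(1.0f) and Math.ulp(1.0) give the spacing of floating point numbers just above 1 and serve as a cross-check. The class name MachineEpsilon is an assumption made for this illustration.

```java
final class MachineEpsilon {
    /** Halve eps while 1 + eps/2 is still distinguishable from 1 in single precision. */
    static float singleEpsilon() {
        float eps = 1.0f;
        while (1.0f + eps / 2.0f > 1.0f) {
            eps /= 2.0f;
        }
        return eps;
    }

    /** The same search in double precision. */
    static double doubleEpsilon() {
        double eps = 1.0;
        while (1.0 + eps / 2.0 > 1.0) {
            eps /= 2.0;
        }
        return eps;
    }

    public static void main(String[] args) {
        System.out.println(singleEpsilon() + "  " + Math.ulp(1.0f));
        System.out.println(doubleEpsilon() + "  " + Math.ulp(1.0));
    }
}
```

Testing eps/2 inside the loop condition means the loop stops at the last value that still changes the sum, so no correction by a factor of 2 is needed afterwards.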

Avoiding Floating Point Errors

Careful attention to programming can help avoid floating point errors.

Add small quantities first

Adding a small quantity to a large quantity can cause the small quantity to be lost, because there are not enough bits in a word to hold the sum. For example, adding a positive number much smaller than machine epsilon to 1 results in a sum of 1.

Many functions are approximated by a truncated infinite series, for example,

The order in which floating point addition is performed can affect the sum: Add small quantities first.

Consider the harmonic numbers

\begin{displaymath}H_1=1,\,H_2=1+\frac{1}{2},\,H_3=1+\frac{1}{2}+\frac{1}{3},\ldots,H_n=1+\frac{1}{2}+\frac{1}{3}+\cdots+\frac{1}{n}\end{displaymath}

(The kth harmonic produced by a violin string is the fundamental tone produced by a string that is 1/k times as long.)

It can be shown that harmonic numbers are approximated by the formula

\begin{displaymath}H_n =
\ln(n)+\gamma+\frac{1}{2n}-\frac{1}{12n^2}+\frac{\epsilon_n}{120n^4},\quad 0 < \epsilon_n < 1\end{displaymath}

where $\gamma = 0.5772156649\cdots$ is known as Euler's constant.

We'll consider three ways to compute $H_n$ in floating point arithmetic.

1. Using the asymptotic formula, where the error is $O(1/n^4)$
2. Using single and double precision, adding large numbers first
3. Using single and double precision, adding small numbers first

The C code:


#include <math.h>
#include <stdio.h>

int main(void) {
    int n = 1000000;
    double gamma = 0.57721566490153286061;
    int i = 1;
    double h = 0.0;   /* double precision sum */
    float H = 0.0f;   /* single precision sum */
    double har = 0.0; /* asymptotic estimate */

    /* square n as a double: n*n overflows a 32-bit int */
    har = log(n)+gamma+1/(2.0*(double)n)-1/(12.0*(double)n*(double)n);
    printf("Harmonic number by asymptotic formula H_1000000 = \t%f \n \n", har);

    /* forward sum: large terms first (i <= n so the term 1/n is included) */
    for (i = 1; i <= n; i++) {
        h += 1/((double) i);
        H += 1/((float) i);
    }
    printf("Double precision forward sum H_1000000 = \t%f,", h);
    printf("\t relative error = \t%e \n", (har-h)/har);
    printf("Single precision forward sum H_1000000 = \t%f,", H);
    printf("\t relative error = \t%e \n \n", (har-H)/har);

    /* backward sum: small terms first */
    h = 0.0; H = 0.0f;
    for (i = n; i > 0; i--) {
        h += 1/((double) i);
        H += 1/((float) i);
    }
    printf("Double precision backward sum H_1000000 = \t%f,", h);
    printf("\t relative error = \t%e \n", (har-h)/har);
    printf("Single precision backward sum H_1000000 = \t%f,", H);
    printf("\t relative error = \t%e \n \n", (har-H)/har);
    return 0;
}

produces these results:

Using the displayed results and realizing that the asymptotic approximation is the most accurate, we see

Avoid catastrophic cancellation

Another concern in floating point arithmetic is catastrophic cancellation where two nearly identical numbers are subtracted causing loss of almost all significant digits in the result.

A classic example is the evaluation of the quadratic formula


\begin{displaymath}r_{1,2} = \frac{-b\pm\sqrt{b^2-4ac}}{2a}\end{displaymath}

for the roots of the quadratic equation

$ax^2+bx+c=0$.

Smart floating point evaluation flies in the face of the high school dictum: rationalize the denominator!

Consider the equation

\begin{displaymath}x^2-2x+\epsilon = 0 \quad\mbox{as $\epsilon\rightarrow0$}\end{displaymath}

which has roots

\begin{displaymath}r_{1,2} = 1\pm\sqrt{1-\epsilon}.\end{displaymath}

When $\epsilon$ is small the square root will be close to 1 and evaluation of the root $r_2=1-\sqrt{1-\epsilon}$ will result in catastrophic cancellation.

However, by unrationalizing the denominator, the root can be accurately calculated:

\begin{displaymath}r_2=\frac{\epsilon}{1+\sqrt{1-\epsilon}}.\end{displaymath}

An asymptotic formula will again be used to estimate the answer. Since by the binomial theorem

\begin{eqnarray*}(1-x)^{n} & = & {n\choose 0} - {n \choose 1} x+ {n \choose 2} x^2 - {n \choose 3} x^3 + {n \choose 4} x^4 - \cdots\\
& = & 1 - nx + \frac{n(n-1)}{2!}x^2-\frac{n(n-1)(n-2)}{3!}x^3+\frac{n(n-1)(n-2)(n-3)}{4!}x^4-\cdots\\
\end{eqnarray*}


we find

\begin{displaymath}(1-\epsilon)^{\frac{1}{2}} = \sqrt{1-\epsilon} = 1 - \frac{1}{2}\epsilon - \frac{1}{8}\epsilon^2-\frac{1}{16}\epsilon^3-\cdots\end{displaymath}

and

\begin{displaymath}r_2=1-\sqrt{1-\epsilon}=\frac{1}{2}\epsilon + \frac{1}{8}\epsilon^2+\frac{1}{16}\epsilon^3+\cdots\end{displaymath}


#include <math.h>
#include <stdio.h>

int main(void) {
    double eps = 1.0;
    double rwc = 0.0;  /* root with cancellation */
    double rwoc = 0.0; /* root without cancellation */
    double r = 0.0;    /* root by asymptotic formula */
    while (1.0/1000000.0 < eps) {
        rwc = 1.0 - sqrt(1.0-eps);
        rwoc = eps/(1.0 + sqrt(1.0-eps));
        r = eps/2.0 + (eps*eps)/8.0 + (eps*eps*eps)/16.0;
        printf("root with cancellation \t%20.18e \n", rwc);
        printf("root without cancellation \t%20.18e \n", rwoc);
        printf("root by asymptotic formula \t%20.18e \n \n", r);
        eps /= 2.0;
    }
    return 0;
}

The last few lines of output yield

$\epsilon = 7.62939453125\times 10^{-6}$ root with cancellation 3.814704541582614183e-06
  root without cancellation 3.814704541610369759e-06
  root by asymptotic formula 3.814704541610369759e-06
$\epsilon = 3.814697265625\times 10^{-6}$ root with cancellation 1.907350451801903546e-06
  root without cancellation 1.907350451805372993e-06
  root by asymptotic formula 1.907350451805372993e-06
$\epsilon = 1.9073486328125\times 10^{-6}$ root with cancellation 9.536747711536008865e-07
  root without cancellation 9.536747711540345673e-07
  root by asymptotic formula 9.536747711540345673e-07

Next, consider the evaluation of $e^{x}=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots$ at x=-5.0 and compare with the more stable evaluation of $1/e^{x}$ at x=5. Taking the first 20 terms we find (for the unstable summation)


\begin{displaymath}e^{-5} = 1-5+12.5-20.8333 +26.0417-\cdots \approx 6.706341\times10^{-3}=0.006706341\end{displaymath}

Absolute error ULPS (7 place precision) Relative Error
$3.1605999\times 10^{-5}$ 31,606 $4.69075\times 10^{-3}$

Now compare this with the stable summation

\begin{displaymath}e^{5} = 1+5+12.5+20.8333 +26.0417+\cdots \approx 1.484131\times10^{2}\end{displaymath}

and

\begin{displaymath}1/e^{5} = 6.737949\times 10^{-3}=0.006737949\end{displaymath}

which compares with an accurate result of

$e^{-5}=0.00673794699908546709$

Absolute error ULPS (7 place precision) Relative Error
$2.0009145\times 10^{-9}$ 2.0 $2.96962047\times 10^{-7}$
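
The two ways of evaluating $e^{-5}$ can be compared with a few lines of Java. This is a sketch under assumptions: the class name ExpSeries is invented here, the sums are carried in double precision rather than 7 decimal places, and Math.exp is taken as the reference value, so the printed errors will not match the tables above digit for digit.

```java
final class ExpSeries {
    /** Sum of the first `terms` terms of the Taylor series for e^x. */
    static double series(double x, int terms) {
        double term = 1.0;    // x^0 / 0!
        double sum = 0.0;
        for (int k = 0; k < terms; k++) {
            sum += term;
            term *= x / (k + 1);    // next term: x^(k+1) / (k+1)!
        }
        return sum;
    }

    public static void main(String[] args) {
        double exact = Math.exp(-5.0);
        double direct = series(-5.0, 20);           // alternating terms
        double reciprocal = 1.0 / series(5.0, 20);  // all terms positive
        System.out.println("direct     " + direct + "  error " + Math.abs(direct - exact));
        System.out.println("reciprocal " + reciprocal + "  error " + Math.abs(reciprocal - exact));
    }
}
```

With the same 20 terms, the reciprocal form lands several orders of magnitude closer to the reference value than the direct alternating sum.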

Questions

1. Evaluate the geometric sum $\sum_{i=0}^{8} 2^{i}$.
2. What is the largest integer one can represent in 8 bits?
3. Evaluate the geometric sum $\sum_{i=0}^{8} 3^{i}$.
4. What is the largest integer one can represent in 8 trits?
5. Evaluate the geometric sum $\sum_{i=0}^{8} 10^{i}$.
6. What is the largest integer one can represent in 8 digits?
7. What is the efficiency of octal and hexadecimal in representing integers between 0 and 1,000,000?
8. Maximize efficiency (that is, minimize $rw$) subject to the constraint $r^w=1,000,000$.
9. Express the decimal fraction 0.15 in binary notation.
10. Express the binary fraction $0.11_2$ in (a) octal, (b) hexadecimal, (c) decimal notation.
11. Provide a definition of nine's complement for decimal notation.
12. Provide a definition of ten's complement for decimal notation.
13. What numbers can be represented in 4-bit negative binary notation? ($n = d_3(-2)^3+d_2(-2)^2+d_1(-2)^1+d_0(-2)^0$)
14. What happens if you execute the code below? Why does it happen?

exp = 0;
while (1.0 <= (1.0 + eps)) eps /= 2.0; exp++;

15. Using floating point arithmetic, evaluate $(1-\cos(x))/x^2$ at $x = 1.2\times 10^{-5}$ as accurately as you can.


William D. Shoaff
2002-01-22