Number systems are used for counting.
We can count ``whole things'' with integers:
Mathematics extends counting concepts with additional number systems such as real and complex numbers: These are beyond the scope of these notes.
The decimal, denary, base 10, or radix 10
number system is a
positional system, meaning that numbers are represented as
To represent negative numbers two additional symbols are included ``+'' and
``-,'' but the ``+'' symbol is not always explicitly used:
You must already know how to perform arithmetic (addition, subtraction, multiplication, and division) in the decimal system.
The decimal system does not serve well for computer arithmetic because it is difficult to build cheap devices that stably represent the 10 digits and the 2 signs.
Let's look at another interesting system that is also not useful in computing, although it has a long history and still no doubt serves many people very well.
The unary or tally number system uses one symbol ``|'' (called a unit) to count positive whole things;
The unary number system is unlike any of the other systems used to count.
To study efficiency further, let's consider another unusual system.
The millinary number system uses r=1,000 as a base
or radix for counting and one thousand symbols, called millits, to denote counts.
The millit symbols are
The width needed to count large numbers is small:
For example, we can count $999_{10}$ with one symbol
(width w=1);
we can count $1{,}000_{10}$ with 2 symbols, 10 (width w=2); and
we can count
$999{,}999_{10}$ with 2 symbols
(width w=2).
But the millinary system is still not very efficient: the radix is very large, so the number of symbols (millits) needed to represent numbers is large too.
The accepted measure of a number system's efficiency is the product of
the radix and the width, rw; the smaller the product, the more efficient the system.
Representing all numbers between 0 and 999,999 in millinary
has efficiency $rw = 1000 \times 2 = 2000$.
Millinary is also not a good computing system, because building efficient, stable devices that can represent 1,000 different symbols (states) is difficult. The binary system, by contrast, is efficient, and it is not hard to build stable two-state devices.
The binary system uses two symbols ``0'' and ``1'' called bits to count.
It is vastly superior to the unary and millinary in efficient counting.
We can count $10_{10}$ using 4 bits, $1010_2$, and we can count
$1{,}000{,}000_{10}$ using 20
bits: $11110100001001000000_2$.
Here's another example:
We'll consider arithmetic over the binary system below.
Some authors use a ``b'' suffix to denote that a number is binary, e.g.,
The ternary system uses ``trits'' 0, 1, and 2 to count.
For example:
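Ternary is slightly more efficient than binary.
Counts between 0 and one million need a width of w=13 trits, so ternary has an efficiency of
$rw = 3 \times 13 = 39$,
whereas binary needs w=20 bits, for an efficiency of $rw = 2 \times 20 = 40$.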
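The efficiency figures quoted for the different radices can be checked with a short calculation. The following is a minimal sketch in C; the function names `width` and `efficiency` are illustrative and do not come from these notes.

```c
/* Width w: the number of base-r digits needed to write max.
   Assumes r >= 2 (unary is not a positional system). */
unsigned width(unsigned long max, unsigned long r) {
    unsigned w = 1;
    while (max >= r) {   /* strip one digit per division */
        max /= r;
        w++;
    }
    return w;
}

/* Efficiency measure: radix times width, r*w. */
unsigned long efficiency(unsigned long max, unsigned long r) {
    return r * width(max, r);
}
```

Under these assumptions, `efficiency(1000000, 2)` is 40, `efficiency(1000000, 3)` is 39, and `efficiency(999999, 1000)` is 2000.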
Octal numbers have a base b=8, and use the octits
0,1,2,3,4,5,6,7 to denote magnitudes.
For example,
Some authors use a ``o'' or ``q'' suffix to denote that a number is octal, e.g.,
In the C programming language, octal character constants are written as
'\ooo',
where ooo is one to three octits, e.g., '\013' is decimal 11 and ASCII for
vertical tab.
In the Java programming language, octal constants are represented
by a leading zero followed by octits, e.g., 013 is decimal 11.
Hexadecimal numbers have a base b=16. The letters A, B, C, D, E, and F (or a through f)
are used as hexits above 9, with values 10 through 15.
Some authors use a ``h'' or ``H'' suffix to denote that a number is hexadecimal, e.g.,
In the C programming language hexadecimal constants are represented by
'\xh'
where h is one or more hexits, e.g., '\xb' is decimal 11 and ASCII for
vertical tab.
In the Java programming language, hexadecimal constants are represented
by a leading 0x or 0X followed by hexits, e.g., 0xB is decimal 11.
Sexagesimal numbers have a base b=60 and use 60 symbols called sexits. Babylonian astronomers used this system, and remnants of it still exist today in the trigonometric and time units of ``degrees, minutes, and seconds.''
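Given an integer base b > 1, for example b=10, a number x can be written as
$$x = \pm\left(d_{n-1} b^{n-1} + \cdots + d_1 b + d_0 + d_{-1} b^{-1} + d_{-2} b^{-2} + \cdots\right), \qquad 0 \le d_i < b.$$
Using n positions to the left of the point you can represent any
integer between 0 and $b^n - 1$.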
Repeated division with accumulation of remainders can be used to convert integers from decimal to an arbitrary base:
% b);
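Example 1: given base b=9 and decimal number n=342 we can find the base 9 representation of 342 by repeated division: 342 ÷ 9 = 38 remainder 0; 38 ÷ 9 = 4 remainder 2; 4 ÷ 9 = 0 remainder 4. Reading the remainders from last to first gives $342_{10} = 420_9$.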
Example 2: Let b=3 and write decimal number n=1,000,000 in ternary.
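| Division | Quotient | Remainder |
|---|---|---|
| 1,000,000 ÷ 3 | 333,333 | 1 |
| 333,333 ÷ 3 | 111,111 | 0 |
| 111,111 ÷ 3 | 37,037 | 0 |
| 37,037 ÷ 3 | 12,345 | 2 |
| 12,345 ÷ 3 | 4,115 | 0 |
| 4,115 ÷ 3 | 1,371 | 2 |
| 1,371 ÷ 3 | 457 | 0 |
| 457 ÷ 3 | 152 | 1 |
| 152 ÷ 3 | 50 | 2 |
| 50 ÷ 3 | 16 | 2 |
| 16 ÷ 3 | 5 | 1 |
| 5 ÷ 3 | 1 | 2 |
| 1 ÷ 3 | 0 | 1 |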
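The repeated-division procedure used in Examples 1 and 2 can be sketched in C as follows. This is an illustration, not the code from these notes: the function name `to_base` and the caller-supplied buffer are assumptions, and bases are limited to 2 through 36 so that each digit fits in one character.

```c
/* Converts n to base b (2 <= b <= 36) by repeated division.
   Remainders come out least-significant digit first, so they are
   collected in tmp and then reversed into out. Returns out.
   out must hold at least 65 characters (64 digits plus the terminator). */
char *to_base(unsigned long n, unsigned b, char *out) {
    static const char symbols[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char tmp[64];
    int len = 0;
    do {
        tmp[len++] = symbols[n % b];  /* remainder is the next digit */
        n /= b;                       /* quotient feeds the next step */
    } while (n > 0);
    for (int i = 0; i < len; i++)
        out[i] = tmp[len - 1 - i];    /* read remainders last to first */
    out[len] = '\0';
    return out;
}
```

With a 65-character buffer `buf`, `to_base(342, 9, buf)` yields `"420"` and `to_base(1000000, 3, buf)` yields `"1212210202001"`, which is the remainder column of Example 2 read from bottom to top.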
Fractions can be converted by repeated multiplication and accumulation of integer excesses.
Example 3: Let b=2 and write decimal fraction n=0.1 in binary.
| Multiplication | Product | Integer | Fraction |
|---|---|---|---|
| 0.1 × 2 | 0.2 | 0 | 0.2 |
| 0.2 × 2 | 0.4 | 0 | 0.4 |
| 0.4 × 2 | 0.8 | 0 | 0.8 |
| 0.8 × 2 | 1.6 | 1 | 0.6 |
| 0.6 × 2 | 1.2 | 1 | 0.2 |
| 0.2 × 2 | 0.4 | 0 | 0.4 |
| 0.4 × 2 | 0.8 | 0 | 0.8 |
| 0.8 × 2 | 1.6 | 1 | 0.6 |
| 0.6 × 2 | 1.2 | 1 | 0.2 |
Example 4: Let b=3 and write decimal fraction n=0.01 in ternary.
| Multiplication | Product | Integer | Fraction |
|---|---|---|---|
| 0.01 × 3 | 0.03 | 0 | 0.03 |
| 0.03 × 3 | 0.09 | 0 | 0.09 |
| 0.09 × 3 | 0.27 | 0 | 0.27 |
| 0.27 × 3 | 0.81 | 0 | 0.81 |
| 0.81 × 3 | 2.43 | 2 | 0.43 |
| 0.43 × 3 | 1.29 | 1 | 0.29 |
| 0.29 × 3 | 0.87 | 0 | 0.87 |
| 0.87 × 3 | 2.61 | 2 | 0.61 |
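The repeated-multiplication procedure of Examples 3 and 4 can be sketched as below. To keep the digits exact, this sketch assumes the decimal fraction is supplied as a ratio of integers (0.1 as 1/10, 0.01 as 1/100) rather than as a floating point value; the function name `fraction_digits` is illustrative.

```c
#include <stddef.h>

/* Writes the first `count` base-b digits of the fraction num/den
   (0 <= num < den, 2 <= b <= 10) into out and returns out.
   out must hold at least count + 1 characters. */
char *fraction_digits(unsigned long num, unsigned long den, unsigned b,
                      size_t count, char *out) {
    for (size_t i = 0; i < count; i++) {
        num *= b;                          /* multiply the fraction by the base */
        out[i] = (char)('0' + num / den);  /* the integer excess is the next digit */
        num %= den;                        /* keep only the fractional part */
    }
    out[count] = '\0';
    return out;
}
```

Asking for 9 binary digits of 1/10 gives `"000110011"`, and 8 ternary digits of 1/100 gives `"00002102"`, matching the Integer columns of the two tables.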
Converting from some base to decimal can be done by repeated multiplication (the opposite of the steps above). The conversion can be organized using a fool-proof simple tabular format, called Horner's algorithm.
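Example 5: Convert
$100010111_2$
to decimal. Here the multiplier is 2, and the result is $279_{10}$.

| Digit | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Previous sum × 2 | | 2 | 4 | 8 | 16 | 34 | 68 | 138 | 278 |
| Running sum | 1 | 2 | 4 | 8 | 17 | 34 | 69 | 139 | 279 |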
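Example 6: Convert
$122010121_3$
to decimal. Here the multiplier is 3, and the result is $12490_{10}$.

| Digit | 1 | 2 | 2 | 0 | 1 | 0 | 1 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Previous sum × 3 | | 3 | 15 | 51 | 153 | 462 | 1386 | 4161 | 12489 |
| Running sum | 1 | 5 | 17 | 51 | 154 | 462 | 1387 | 4163 | 12490 |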
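The tabular scheme of Examples 5 and 6 amounts to a single loop: multiply the running sum by the base, then add the next digit. A minimal sketch in C, assuming the digits arrive as a string of characters '0' through '9' (the function name `horner` is illustrative):

```c
/* Evaluates a digit string in base b (2 <= b <= 10) by Horner's rule. */
long horner(const char *digits, int b) {
    long value = 0;
    for (const char *p = digits; *p != '\0'; p++)
        value = value * b + (*p - '0');  /* previous sum times b, plus next digit */
    return value;
}
```

`horner("100010111", 2)` returns 279 and `horner("122010121", 3)` returns 12490, the final running sums in the two tables.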
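To convert a binary number to hexadecimal, group the bits 4 at a time starting from the right (least significant bit), for example $10\,1101\,0111_2 = 2D7_{16}$.
To express a hexadecimal number in binary, simply expand each hexit into 4 bits, for example $2D7_{16} = 0010\,1101\,0111_2$.
To convert a binary number to octal, group the bits 3 at a time starting from the right (least significant bit), for example $1\,011\,010\,111_2 = 1327_8$.
To express an octal number in binary, simply expand each octit into 3 bits, for example $1327_8 = 001\,011\,010\,111_2$.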
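The grouping rule can be sketched in C as follows. The function name `regroup` and its parameters are assumptions made for this illustration; `k` is the group size (4 for hexadecimal, 3 for octal).

```c
#include <string.h>

/* Converts a string of bits to base 2^k (1 <= k <= 4) by grouping k bits
   at a time from the right. The leftmost group may be shorter than k.
   out must hold at least strlen(bits)/k + 2 characters. Returns out. */
char *regroup(const char *bits, size_t k, char *out) {
    size_t n = strlen(bits);
    size_t groups = (n + k - 1) / k;
    size_t first = n - (groups ? groups - 1 : 0) * k;  /* size of leftmost group */
    size_t pos = 0;
    for (size_t g = 0; g < groups; g++) {
        size_t len = (g == 0) ? first : k;
        unsigned v = 0;
        for (size_t i = 0; i < len; i++)
            v = v * 2 + (unsigned)(bits[pos++] - '0');  /* value of this group */
        out[g] = "0123456789ABCDEF"[v];
    }
    out[groups] = '\0';
    return out;
}
```

For the bit string `"1011010111"`, a group size of 4 gives `"2D7"` and a group size of 3 gives `"1327"`.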
There are multiple representations for signed integers:
Sign and magnitude: use a + or - sign, but encode it as a leading 0 or 1 bit.
Binary Number System

| One Bit | 2 Bits | 3 Bits | 4 Bits |
|---|---|---|---|
| 0 = +0 | 0 = +0 | 0 = +0 | 0 = +0 |
| 1 = -0 | 1 = +1 | 1 = +1 | 1 = +1 |
| | 10 = -0 | 10 = +2 | 10 = +2 |
| | 11 = -1 | 11 = +3 | 11 = +3 |
| | | 100 = -0 | 100 = +4 |
| | | 101 = -1 | 101 = +5 |
| | | 110 = -2 | 110 = +6 |
| | | 111 = -3 | 111 = +7 |
| | | | 1000 = -0 |
| | | | 1001 = -1 |
| | | | 1010 = -2 |
| | | | 1011 = -3 |
| | | | 1100 = -4 |
| | | | 1101 = -5 |
| | | | 1110 = -6 |
| | | | 1111 = -7 |
The one's complement of a number is formed by flipping each bit, for example the one's complement of 0000 0011 is 1111 1100.
The two's complement of a number n is its one's complement plus 1.
For example:
| Decimal | Binary | One's Complement | Two's Complement | Decimal |
|---|---|---|---|---|
| 0 | 0 | 1111 1111 | 0000 0000 | 0 |
| 1 | 1 | 1111 1110 | 1111 1111 | -1 |
| 2 | 10 | 1111 1101 | 1111 1110 | -2 |
| 3 | 11 | 1111 1100 | 1111 1101 | -3 |
| 4 | 100 | 1111 1011 | 1111 1100 | -4 |
| 32 | 10 0000 | 1101 1111 | 1110 0000 | -32 |
| 127 | 111 1111 | 1000 0000 | 1000 0001 | -127 |
Two Algorithms for finding two's complement:
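As an illustration, here are two commonly taught ways to form an 8-bit two's complement, sketched in C. They are assumptions about what the list here contained, not a reconstruction of it: the first flips every bit and adds 1; the second scans from the least significant bit, copies bits up to and including the first 1, and flips the rest.

```c
#include <stdint.h>

/* Method 1: one's complement plus 1 (result wraps to 8 bits). */
uint8_t twos_complement_flip_add(uint8_t n) {
    return (uint8_t)(~n + 1u);
}

/* Method 2: copy bits from the right through the first 1, flip the rest. */
uint8_t twos_complement_scan(uint8_t n) {
    uint8_t result = 0;
    int seen_one = 0;
    for (int bit = 0; bit < 8; bit++) {
        uint8_t mask = (uint8_t)(1u << bit);
        uint8_t b = (uint8_t)(n & mask);
        if (seen_one)
            b ^= mask;          /* past the first 1: flip this bit */
        else if (b)
            seen_one = 1;       /* the first 1 itself is copied unchanged */
        result |= b;
    }
    return result;
}
```

Both functions map 1 to 1111 1111, 32 to 1110 0000, and 127 to 1000 0001, as in the table above.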
A computer's memory is divided into chunks. By default, assume that a computer's logic and arithmetic are based on binary notation. The smallest chunk of memory is a bit: a 0 or a 1. Larger chunks are:
The word length of a processor is the number of bits in its fundamental unit of data. Usually the word length is:
| Processor | Word Length | Year |
|---|---|---|
| Intel 4004 | 4 bits | 1971 |
| Intel 8080 | 8 bits | 1974 |
| Intel 8086 | 16 bits | 1978 |
| Intel 80386 | 32 bits | 1985 |
| Intel Pentium | 32 bits | 1993 |
| Intel Itanium | 64 bits | 2001 |
| Motorola 68000 | 16 bits | 1980 |
| VAX 11/780 | 16 bits | mid 1970's |
| IBM S/360 | 32 bits | mid 1960's |
| UNIVAC | 12 characters | early to mid 1950's |
| ENIAC | 10 digits | mid 1940's |
Binary Addition (Half-Adder)

| Input: Addend | Input: Addend | Output: Sum | Output: Carry (Out) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 |
Binary Addition (Full-Adder)

| Input: Addend | Input: Addend | Input: Carry (In) | Output: Sum | Output: Carry (Out) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 |
| 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| Multiplication | |||||
| Multiplicand | Multiplicand | Product | |||
| 0 | 0 | 0 | |||
| 1 | 0 | 0 | |||
| 0 | 1 | 0 | |||
| 1 | 1 | 1 | |||
With 8 bits it is possible to represent unsigned integers from 0 to 255:
| Decimal Value | Unsigned Binary Representation |
|---|---|
| 0 | 0000 0000 |
| 1 | 0000 0001 |
| 2 | 0000 0010 |
| 254 | 1111 1110 |
| 255 | 1111 1111 |
Overflow occurs when the result of an operation needs more bits than the word length
of the processor provides (assume an 8-bit word length).
Overflow does not occur in pure mathematics.
In signed (two's complement) addition, overflow is detected when
the carry into the most significant bit is not the same as the carry out of the most significant bit.
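The carry-in versus carry-out test can be sketched for 8-bit two's complement addition as follows. This is an illustration under the assumption of an 8-bit word; the function name `add8_overflows` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Adds two 8-bit words, stores the 8-bit result in *sum, and reports
   signed overflow by comparing the carry into bit 7 with the carry out of it. */
bool add8_overflows(uint8_t a, uint8_t b, uint8_t *sum) {
    unsigned low = (a & 0x7Fu) + (b & 0x7Fu);   /* add the low 7 bits only */
    unsigned carry_in = (low >> 7) & 1u;        /* carry into the most significant bit */
    unsigned full = (unsigned)a + (unsigned)b;  /* full 9-bit sum */
    unsigned carry_out = (full >> 8) & 1u;      /* carry out of the most significant bit */
    *sum = (uint8_t)full;
    return carry_in != carry_out;
}
```

For example, 0111 1111 + 0000 0001 (127 + 1) has a carry into the top bit but none out of it, so overflow is reported; 1111 1111 + 0000 0001 (-1 + 1) has both carries, so the result 0000 0000 is valid.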
A very good article on floating point numbers is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It contains more information than the basics covered here.
Computers can represent all integers within a specific range and as long as operations on integers in this range do not produce integers outside the range the results are exact.
Computers cannot represent all rationals within any (but trivial) range. Therefore operations on rationals almost always produce errors.
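The method used to represent rational numbers is called floating point and is characterized by a base b, a precision p (the number of base-b digits in the significand), and an exponent range from L to U: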
Example Machines

| Machine | b | p | L | U |
|---|---|---|---|---|
| Univac | 2 | 27 | -128 | 127 |
| IBM 360 | 16 | 14 | -64 | 63 |
| IEEE (single) | 2 | 24 | -126 | 127 |
| IEEE (double) | 2 | 53 | -1022 | 1023 |
| Value | Scale $2^{-1}$ | Scale $2^0$ | Scale $2^1$ | Scale $2^2$ | Scale $2^3$ |
|---|---|---|---|---|---|
| $1.000_2$ | $0.1_2 = 1/2$ | $1_2 = 1$ | $10_2 = 2$ | $100_2 = 4$ | $1000_2 = 8$ |
| $1.001_2$ | $0.1001_2 = 9/16$ | $1.001_2 = 9/8$ | $10.01_2 = 9/4$ | $100.1_2 = 9/2$ | $1001_2 = 9$ |
| $1.010_2$ | $0.1010_2 = 5/8$ | $1.010_2 = 5/4$ | $10.10_2 = 5/2$ | $101.0_2 = 5$ | $1010_2 = 10$ |
| $1.011_2$ | $0.1011_2 = 11/16$ | $1.011_2 = 11/8$ | $10.11_2 = 11/4$ | $101.1_2 = 11/2$ | $1011_2 = 11$ |
| $1.100_2$ | $0.1100_2 = 3/4$ | $1.100_2 = 3/2$ | $11.00_2 = 3$ | $110.0_2 = 6$ | $1100_2 = 12$ |
| $1.101_2$ | $0.1101_2 = 13/16$ | $1.101_2 = 13/8$ | $11.01_2 = 13/4$ | $110.1_2 = 13/2$ | $1101_2 = 13$ |
| $1.110_2$ | $0.1110_2 = 7/8$ | $1.110_2 = 7/4$ | $11.10_2 = 7/2$ | $111.0_2 = 7$ | $1110_2 = 14$ |
| $1.111_2$ | $0.1111_2 = 15/16$ | $1.111_2 = 15/8$ | $11.11_2 = 15/4$ | $111.1_2 = 15/2$ | $1111_2 = 15$ |
Note that errors must occur when performing arithmetic with this set of numbers.
For example,
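IEEE single precision uses 32-bit words with 1 guard digit. Each word holds a sign bit s, an 8-bit biased exponent e, and a 23-bit fraction f:

| s | e[7:0] | f[22:0] |
|---|---|---|

| Bit pattern | Value |
|---|---|
| 0 < e < 255 | $(-1)^s \times 2^{e-127} \times 1.f$ (normal numbers) |
| e = 0; f ≠ 0 | $(-1)^s \times 2^{-126} \times 0.f$ (subnormal numbers) |
| e = 0; f = 0 | $(-1)^s \times 0$ (signed zero) |
| s = 0; e = 255; f = 0 | +INF (positive infinity) |
| s = 1; e = 255; f = 0 | -INF (negative infinity) |
| e = 255; f ≠ 0 | NaN (Not-a-Number) |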
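The three fields of a single precision word can be inspected directly. The sketch below assumes that the C type `float` is stored as an IEEE single precision value, which is true on most current machines but is not guaranteed by the C language; the struct and function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

struct ieee_single {
    unsigned s;   /* sign bit */
    unsigned e;   /* 8-bit biased exponent, e[7:0] */
    uint32_t f;   /* 23-bit fraction, f[22:0] */
};

/* Splits a float into its sign, exponent, and fraction fields.
   Assumes float is a 32-bit IEEE single precision value. */
struct ieee_single decode_single(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy the raw bit pattern */
    struct ieee_single r = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return r;
}
```

Under that assumption, `decode_single(1.0f)` gives s = 0, e = 127, f = 0, and `decode_single(-2.0f)` gives s = 1, e = 128, f = 0.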
There are several ways to measure the errors made in floating point
approximations to non-representable rational or real numbers.
One is absolute error: Let x denote the real number and let
$\hat{x}$ denote the floating point value. The absolute error is the absolute value of
their difference: $|x - \hat{x}|$.
Absolute error is not used often: it does not always produce a good measure
of error. Two very large numbers might approximate each other well but have a large absolute
error because they are large:
A related, but better way to measure the difference between floating point numbers and a
real number is units in the last place (ulps).
ULPS takes into account the precision of the computer.
For example, assuming
the base is 10, the precision is p=7,
and
,
we compute:
Another way to measure the difference between floating point numbers and
real numbers is relative error, their difference divided by the real number: $|x - \hat{x}| / |x|$, where x is the real number and $\hat{x}$ is the floating point value.
To see the difference between absolute error, ulps, and relative error,
consider the real number x=12.35 and floating point number
,
where the base is 10 and the precision is 2.
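The three error measures can be compared with a small sketch. The functions below are illustrative, and two points are assumptions rather than statements from these notes: one unit in the last place is taken to be $b^{e-p+1}$, where e is the exponent of the floating point value, and the approximation used in the usage note, 12, is a hypothetical choice for demonstration.

```c
static double magnitude(double v) { return v < 0 ? -v : v; }

double absolute_error(double x, double xhat) {
    return magnitude(x - xhat);
}

double relative_error(double x, double xhat) {
    return magnitude(x - xhat) / magnitude(x);
}

/* Size of one unit in the last place of xhat in base b with precision p.
   Assumes xhat != 0 and b >= 2. */
double ulp_size(double xhat, int b, int p) {
    double a = magnitude(xhat);
    double scale = 1.0;                      /* becomes b^e with b^e <= a < b^(e+1) */
    while (a >= scale * b) scale *= b;
    while (a < scale) scale /= b;
    for (int i = 1; i < p; i++) scale /= b;  /* b^(e - p + 1) */
    return scale;
}

double ulps_error(double x, double xhat, int b, int p) {
    return absolute_error(x, xhat) / ulp_size(xhat, b, p);
}
```

With x = 12.35 and the assumed approximation 12 in base 10 with precision 2, the absolute error is 0.35, the error in ulps is 0.35 (one ulp is 1), and the relative error is about 0.028.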
A machine's (single precision) epsilon $\epsilon$
is the smallest
positive (single precision) floating point number such that $1 + \epsilon > 1$ in floating point arithmetic.
| Single Precision | Double Precision |
|---|---|
| 1.192092895507812500e-07 = $2^{-23}$ | 2.220446049250313081e-16 = $2^{-52}$ |
Here's a code fragment that can be used to calculate a machine's epsilon:
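    double eps = 1.0;   /* trial value, halved until 1.0 + eps rounds to 1.0 */
    int exp = 0;        /* number of halvings performed */
    while (1.0 < (1.0 + eps)) {
        eps /= 2.0;
        exp++;
    }
    /* The loop overshoots by one halving, so report 2*eps = 2^-(exp-1). */
    printf("(Double precision) Machine epsilon %20.18e = 2^-%d\n\n", 2*eps, exp - 1);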
Careful attention to programming can help avoid floating point errors.
Adding a small quantity to a large quantity can cause the small quantity to be lost, because there are not enough bits in a word to hold the sum. For example, adding a number smaller than machine epsilon to 1 results in a sum of 1.
Many functions are approximated by a truncated infinite series, for example,
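Consider the harmonic numbers $H_n = \sum_{k=1}^{n} \frac{1}{k}$.
It can be shown that harmonic numbers are approximated by the formula $H_n \approx \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2}$, where $\gamma \approx 0.5772$ is Euler's constant.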
We'll consider three ways to compute Hn in floating point arithmetic.
The C code:
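    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        /* Euler's constant; the original declaration is not shown, so this value is supplied here. */
        const double gamma = 0.57721566490153286;
        const double dn = (double) n;
        double har;        /* asymptotic approximation of H_n */
        double h = 0.0;    /* double precision running sum */
        float H = 0.0f;    /* single precision running sum */
        int i;

        /* Method 1: asymptotic formula (dn*dn avoids integer overflow in n*n). */
        har = log(dn) + gamma + 1/(2.0*dn) - 1/(12.0*dn*dn);
        printf("Harmonic number by asymptotic formula H_1000000 = \t%f\n\n", har);

        /* Method 2: forward sum, largest terms first. */
        for (i = 1; i <= n; i++) {
            h += 1/((double) i);
            H += 1/((float) i);
        }
        printf("Double precision forward sum H_1000000 = \t%f,", h);
        printf("\t relative error = \t%e\n", (har-h)/har);
        printf("Single precision forward sum H_1000000 = \t%f,", H);
        printf("\t relative error = \t%e\n\n", (har-H)/har);

        /* Method 3: backward sum, smallest terms first. */
        h = 0.0; H = 0.0f;
        for (i = n; i > 0; i--) {
            h += 1/((double) i);
            H += 1/((float) i);
        }
        printf("Double precision backward sum H_1000000 = \t%f,", h);
        printf("\t relative error = \t%e\n", (har-h)/har);
        printf("Single precision backward sum H_1000000 = \t%f,", H);
        printf("\t relative error = \t%e\n\n", (har-H)/har);
        return 0;
    }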
Another concern in floating point arithmetic is catastrophic cancellation, where two nearly identical numbers are subtracted, causing loss of almost all significant digits in the result.
A classic example is the evaluation of the quadratic formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$
for the roots of the quadratic equation $ax^2 + bx + c = 0$.
Consider the equation
However, by unrationalizing the denominator, the root can be
accurately calculated: $x = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}}$.
The last few lines of output yield
| Method | Computed root |
|---|---|
| root with cancellation | 3.814704541582614183e-06 |
| root without cancellation | 3.814704541610369759e-06 |
| root by asymptotic formula | 3.814704541610369759e-06 |
| root with cancellation | 1.907350451801903546e-06 |
| root without cancellation | 1.907350451805372993e-06 |
| root by asymptotic formula | 1.907350451805372993e-06 |
| root with cancellation | 9.536747711536008865e-07 |
| root without cancellation | 9.536747711540345673e-07 |
| root by asymptotic formula | 9.536747711540345673e-07 |
Next, consider the evaluation of the series $e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}$
at x=-5.0 and compare with the more stable evaluation
of $1/e^x$ at x=5. Taking the first 20 terms we find (for
the unstable summation)
| Absolute error | ULPS (7 place precision) | Relative Error |
|---|---|---|
| | 31.6 | |
Now compare this with the stable summation
| Absolute error | ULPS (7 place precision) | Relative Error |
|---|---|---|
| | 0.0020 | |
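The two summations can be compared with a short sketch. It assumes single precision (`float`) arithmetic and a 20-term series; the function names are illustrative, and the numbers it produces are not claimed to reproduce the tables above. With only 20 terms, truncation as well as cancellation contributes to the error of the direct sum at x = -5.

```c
/* Sums the first `terms` terms of the series for e^x in single precision. */
float exp_series(float x, int terms) {
    float term = 1.0f;   /* x^k / k!, starting at k = 0 */
    float sum = 0.0f;
    for (int k = 0; k < terms; k++) {
        sum += term;
        term *= x / (float)(k + 1);
    }
    return sum;
}

/* Unstable: alternating terms as large as about 26 nearly cancel. */
float exp_neg5_unstable(void) { return exp_series(-5.0f, 20); }

/* Stable: all terms are positive, then take the reciprocal. */
float exp_neg5_stable(void) { return 1.0f / exp_series(5.0f, 20); }
```

Comparing both results against $e^{-5} \approx 0.006737947$ shows the reciprocal form to be far closer.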
Florida Institute of Technology