by William Shoaff with lots of help
Searching solves the problem of locating data that has been stored with a given identification. As with sorting, we have a file of records where each record includes a key used to identify the record. These notes study algorithms for solving the problem of searching through a file of records for one with a particular key k. When the file of records is small, it is often called a table. Note that we must be able to determine whether or not two keys are equal, and some algorithms that solve this problem also require a linear order on the keys.
Records may also contain secondary keys, attributes, or satellite data. Searching, or data retrieval, over secondary keys is a subject within the field of databases and will not be covered in these notes.
Searches are successful when a record with key k is found in the file or table, and unsuccessful when a record with the given key k is not found in the file. When searches are unsuccessful, we often want to insert a new record with key k in the file.
Searching algorithms can be classified as internal or external depending on whether or not the file can be stored in primary memory. They may also be classified as dynamic or static depending on whether or not the file can change during the search.
Search algorithms can also be classified based on how records are located.
One method would be by comparing keys, similar to comparison-based sorting.
Another method would be by digital properties of the keys.
Before jumping into the subject let's consider two special problems that will lead to useful ideas and analysis techniques related to searching.
The subset problem asks if one set A is a subset of another set B, specifically:
Given two sets $A = \{a_0, a_1, \ldots, a_{n-1}\}$ and $B = \{b_0, b_1, \ldots, b_{m-1}\}$, determine whether or not $A \subseteq B$.
Here are three ways to attack the problem.
The first (brute force) solution will have running time that is roughly $c_1 nm = O(nm)$. A good hashing method will provide solution two with a running time that is $c_2 m + d_2 n = O(n + m)$. The third solution will have running time $c_3(m \lg m + n \lg n) = O(m \lg m + n \lg n)$. Which of these three methods will be best in a given application depends on the sizes of m and n, and the constants $c_1$, $c_2$, $d_2$, $c_3$. Bottom line: most often there are several methods to solve a problem. Which method is best depends on analysis.
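The running times quoted above are consistent with a brute force scan, a hash-table lookup, and a sort followed by a merge-like scan. The following is a minimal sketch of those three approaches, assuming the set elements are ints; the class and method names are illustrative, not part of the notes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SubsetDemo {
    // Solution one: compare every element of A with every element of B, O(nm).
    static boolean subsetBruteForce(int[] a, int[] b) {
        for (int x : a) {
            boolean found = false;
            for (int y : b) {
                if (x == y) { found = true; break; }
            }
            if (!found) return false;
        }
        return true;
    }

    // Solution two: hash the elements of B, then look up each element of A.
    // Expected O(n + m) with a good hash function.
    static boolean subsetHashing(int[] a, int[] b) {
        Set<Integer> table = new HashSet<>();
        for (int y : b) table.add(y);
        for (int x : a) {
            if (!table.contains(x)) return false;
        }
        return true;
    }

    // Solution three: sort copies of both sets, then make one merge-like pass,
    // O(m lg m + n lg n).
    static boolean subsetSorting(int[] a, int[] b) {
        int[] sa = a.clone();
        int[] sb = b.clone();
        Arrays.sort(sa);
        Arrays.sort(sb);
        int j = 0;
        for (int x : sa) {
            while (j < sb.length && sb[j] < x) ++j;   // skip smaller elements of B
            if (j == sb.length || sb[j] != x) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4};
        int[] b = {1, 2, 3, 4, 5};
        System.out.println(subsetBruteForce(a, b));
        System.out.println(subsetHashing(a, b));
        System.out.println(subsetSorting(b, a));
    }
}
```

All three methods return the same answer; only their costs differ, which is the point of the comparison.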
Consider the problem of finding the maximum value in a set of keys. Specifically,
Given n keys $k_0, k_1, \ldots, k_{n-1}$, find $M$ and $j$ such that $M = k_j = \max\{k_0, k_1, \ldots, k_{n-1}\}$.
Here's one implementation of a brute force algorithm that solves the problem. Note the algorithm does not do anything with the values M and j that it finds, but that is no matter here.
[Find the maximum]=
// Assumes Key implements Comparable<Key>.
public void findMax(Record[] record) {
    Key M = record[0].key();
    int n = record.length;
    int j = 0;
    int i = 1;
    while (i < n) {
        if (record[i].key().compareTo(M) > 0) {   // a new maximum is found
            j = i;
            M = record[i].key();
        }
        ++i;
    }
}
The algorithm is easy to analyze, yet instructive when one looks at the details. The initializations before the while loop occur once and have constant O(1) running time. The test i < n occurs n times, once for each i = 1, 2,..., n. Both the if statement and the incrementing operation ++i occur n - 1 times. So, it is clear that the running time of the algorithm is O(n), but let's consider the number of times that a new maximum is found, that is, the number of times the condition in the if statement evaluates to true. Let's call this value $C_n$. It is clear that $0 \le C_n \le n - 1$: the minimum occurs when $k_0$ is the maximum, and the maximum occurs when the keys are in increasing order.
The average value An of Cn is between 0 and n - 1, but what is it exactly and what is its variance and standard deviation? To answer these questions, one must make assumptions about the permutations of the keys. There are n! permutations and we will assume each permutation is equally likely to occur.
The type of the keys is not really important to us (unless it affects the complexity of comparisons or transfer of key values), so we will assume that they are the natural numbers 0, 1,..., n - 1. For particular values of n, say n = 3, we can explicitly calculate the average number of times the condition key[i] > M is true.
| Permutation | Value of C3 | Permutation | Value of C3 |
|---|---|---|---|
| 0, 1, 2 | 2 | 0, 2, 1 | 1 |
| 1, 0, 2 | 1 | 1, 2, 0 | 1 |
| 2, 0, 1 | 0 | 2, 1, 0 | 0 |
Thus, the average number of times the maximum is updated is $A_3 = (2 + 1 + 1 + 1 + 0 + 0)/6 = 5/6$.
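We'd like to analyze the general case. Consider two events that partition what could happen: either the last key $k_{n-1}$ is the maximum of $k_0, k_1, \ldots, k_{n-1}$, or it is not.
In the first case, there is one more maximum update for $k_0, k_1, \ldots, k_{n-1}$ than for $k_0, k_1, \ldots, k_{n-2}$. In the second case, the numbers of maximum updates for $k_0, k_1, \ldots, k_{n-1}$ and $k_0, k_1, \ldots, k_{n-2}$ are identical. Since the probability of the first case is $1/n$ and the probability of the second case is $(n-1)/n$, the average satisfies
$$A_n = \frac{1}{n}\,(A_{n-1} + 1) + \frac{n-1}{n}\,A_{n-1} = A_{n-1} + \frac{1}{n}, \qquad A_1 = 0,$$
so that $A_n = \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} = H_n - 1$, where $H_n$ is the $n$th harmonic number.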
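The average number of maximum updates can be checked by brute force for small n. The sketch below enumerates all n! permutations of the keys 0, 1,..., n - 1, counts the updates for each, and prints the average next to the known closed form $H_n - 1$. The class and method names are illustrative assumptions.

```java
class MaxUpdates {
    // Number of times a new maximum is found while scanning left to right.
    static int countUpdates(int[] keys) {
        int max = keys[0];
        int count = 0;
        for (int i = 1; i < keys.length; ++i) {
            if (keys[i] > max) { max = keys[i]; ++count; }
        }
        return count;
    }

    // Sum of countUpdates over every permutation of keys[k..] (swap-based recursion).
    static long totalUpdates(int[] keys, int k) {
        if (k == keys.length) return countUpdates(keys);
        long total = 0;
        for (int i = k; i < keys.length; ++i) {
            swap(keys, k, i);
            total += totalUpdates(keys, k + 1);
            swap(keys, k, i);                 // restore before the next choice
        }
        return total;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    static double harmonic(int n) {
        double h = 0.0;
        for (int i = 1; i <= n; ++i) h += 1.0 / i;
        return h;
    }

    public static void main(String[] args) {
        long factorial = 1;
        for (int n = 1; n <= 8; ++n) {
            factorial *= n;
            int[] keys = new int[n];
            for (int i = 0; i < n; ++i) keys[i] = i;
            double average = (double) totalUpdates(keys, 0) / factorial;
            System.out.printf("n=%d  average=%.6f  H_n-1=%.6f%n", n, average, harmonic(n) - 1);
        }
    }
}
```

For n = 3 the total over the six permutations is 5, matching the table of explicit values.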
Sequential search is the simplest search technique. It starts at the beginning of a file and keeps looking until the key is found or the end of the file is reached. This is a brute force approach that may be applicable when the files are small or they are only searched a few times.
[Sequential search]=
public void sequentialSearch(Record[] record, Key key) {
    int n = record.length;
    int i = 0;
    // Stop at the first matching key or at the end of the file.
    while (i < n && !key.equals(record[i].key())) ++i;
    if (i < n) System.out.println("successful");
    else System.out.println("unsuccessful");
}
The analysis of sequentialSearch() is straightforward, but it does depend on whether the search was successful or not. If the search is successful, with the key found at index i, we have $T(n) = i + 1$. But the running time is always $T(n) = n = \Theta(n)$ for an unsuccessful search. If every key occurs with probability $1/n$, then the average value of a successful search is
$$T_{\mathrm{average}}(n) = \frac{1}{n}\sum_{i=0}^{n-1}(i + 1) = \frac{1}{n}\cdot\frac{n(n+1)}{2} = \frac{n+1}{2}.$$
We can improve on the efficiency of sequential search, but not its asymptotic running time, by inserting a sentinel record with key k at the end of the file.
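[Better sequential search]=
// Assumes the records occupy record[0..n-1] and record[n] is a spare slot
// reserved for the sentinel; also assumes a Record(Key) constructor.
public void betterSequentialSearch(Record[] record, Key key) {
    int n = record.length - 1;
    int i = 0;
    record[n] = new Record(key);   // the sentinel guarantees the loop stops
    while (!key.equals(record[i].key())) ++i;
    if (i < n) System.out.println("successful");
    else System.out.println("unsuccessful");
}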
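To make the sentinel idea concrete, here is a small self-contained sketch that works on int keys. It copies the keys into an array with one spare slot so the example stands alone; a real table would keep the spare slot permanently, since copying costs more than it saves. The class name and the convention of returning -1 are assumptions for illustration.

```java
import java.util.Arrays;

class SentinelSearch {
    // Returns the index of key in keys, or -1 if it is absent.
    static int search(int[] keys, int key) {
        int n = keys.length;
        int[] table = Arrays.copyOf(keys, n + 1);  // one extra slot at the end
        table[n] = key;                            // the sentinel
        int i = 0;
        while (table[i] != key) ++i;               // one test per iteration, no bounds check
        return (i < n) ? i : -1;                   // stopping at the sentinel means "not found"
    }

    public static void main(String[] args) {
        int[] keys = {5, 3, 8};
        System.out.println(search(keys, 8));
        System.out.println(search(keys, 7));
    }
}
```

The loop performs a single comparison per record instead of two, which is the constant-factor saving described above.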
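Another way to make sequential search faster is to look in a table where the data is in sorted order. That is, assume the records have been sorted so their keys are in ascending order, $k_0 < k_1 < \cdots < k_{n-1}$. An unsuccessful search can then stop as soon as a key larger than the search key is seen.
[Sequential search in an ordered table]=
public void orderedSequentialSearch(Record[] record, Key key) {
    int n = record.length;
    int i = 0;
    // Keys are ascending: skip records whose keys are smaller than the search key.
    while (i < n && key.compareTo(record[i].key()) > 0) ++i;
    if (i < n && key.equals(record[i].key())) System.out.println("successful");
    else System.out.println("unsuccessful");
}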
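We have been assuming each key occurs with identical probability 1/n. This may not be true in general applications. Let $p_i$ denote the probability of occurrence of key $k_i$. The average number of comparisons $A_n$ for a successful search has the value
$$A_n = p_0 + 2p_1 + \cdots + n\,p_{n-1} = \sum_{i=0}^{n-1} (i + 1)\,p_i.$$
Knowing the probabilities allows us to build better search tables. If $p_0 \ge p_1 \ge \cdots \ge p_{n-1}$, that is, the most frequently requested keys come first, then $A_n$ is as small as possible.
Zipf's law is one probability distribution that seems to occur in practice. The probabilities in Zipf's law are $p_i = c/(i + 1)$ for $i = 0, 1, \ldots, n-1$, where $c = 1/H_n$ and $H_n$ is the $n$th harmonic number. In this case
$$A_n = \sum_{i=0}^{n-1} (i + 1)\,\frac{c}{i + 1} = nc = \frac{n}{H_n}.$$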
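A short numerical sketch of the weighted average: it evaluates the sum of $(i + 1)\,p_i$ for Zipf probabilities $p_i = c/(i + 1)$ with $c = 1/H_n$ and compares the result with $n/H_n$ and with the uniform-case value $(n + 1)/2$. The class and method names are illustrative assumptions.

```java
class ZipfSearch {
    static double harmonic(int n) {
        double h = 0.0;
        for (int i = 1; i <= n; ++i) h += 1.0 / i;
        return h;
    }

    // Average comparisons for a successful sequential search: sum of (i + 1) * p[i].
    static double averageComparisons(double[] p) {
        double sum = 0.0;
        for (int i = 0; i < p.length; ++i) sum += (i + 1) * p[i];
        return sum;
    }

    // Zipf probabilities p[i] = c / (i + 1), normalized with c = 1 / H_n.
    static double[] zipf(int n) {
        double c = 1.0 / harmonic(n);
        double[] p = new double[n];
        for (int i = 0; i < n; ++i) p[i] = c / (i + 1);
        return p;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.printf("Zipf average:    %.3f%n", averageComparisons(zipf(n)));
        System.out.printf("n / H_n:         %.3f%n", n / harmonic(n));
        System.out.printf("uniform average: %.3f%n", (n + 1) / 2.0);
    }
}
```

Under Zipf's law the average grows like n / ln n rather than n / 2, so ordering the table by frequency pays off noticeably for large n.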
Binary search is a classic method where about half of the keys are discarded on each unsuccessful compare leading to at most about lg n comparisons.
Given a table of records $r_0, \ldots, r_{n-1}$ with ordered keys $k_0 < k_1 < \cdots < k_{n-1}$, binary search compares the search key k with the middle key and continues in the lower or upper half of the table.
We will find that the running time of binary search satisfies $T(n) = O(\lg n)$.
Although a common algorithm, it can be non-trivial to implement correctly. Knuth [1] states that although John Mauchly (of ENIAC fame) discussed binary search in a 1946 publication, it was not until 1962 that an implementation of binary search that worked correctly for all values of n was published. Here is an iterative version of the algorithm.
[Binary search]=
// Assumes Key implements Comparable<Key>; returns null when the key is absent.
public Record binarySearch(Record[] record, Key key) {
    int lower = 0;
    int upper = record.length - 1;
    while (lower <= upper) {
        int middle = (lower + upper) / 2;
        int cmp = key.compareTo(record[middle].key());
        if (cmp == 0) return record[middle];
        if (cmp < 0) {
            upper = middle - 1;
        }
        else {
            lower = middle + 1;
        }
    }
    return null;   // unsuccessful search
}
A binary decision tree can be used to understand binary search. For example when n = 8, a binary decision tree that represents the possible comparisons when executing the algorithm can be drawn as follows. Notice it is an extended tree with interior nodes denoted by circles and exterior nodes denoted by rectangles. Also, note that the tree is right-complete (not all binary search trees are).
Initially the key k is compared against $k_3$ since the middle index is $m = \lfloor (0 + 7)/2 \rfloor = 3$. If $k < k_3$ the left branch is taken and the upper index becomes $u = m - 1 = 2$. The new middle index is $m = \lfloor (0 + 2)/2 \rfloor = 1$ and we compare k and $k_1$. If $k > k_1$ the right branch is taken and the lower index becomes $l = m + 1 = 2$. The middle index is now $m = \lfloor (2 + 2)/2 \rfloor = 2$ and we compare k and $k_2$. The two external nodes below $k_2$ represent unsuccessful searches where $k_1 < k < k_2$ and $k_2 < k < k_3$.
From the tree representation we can determine the following about the expected running time of binary search:
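Now internal and external path lengths are related by the formula $E = I + 2n$, where $I$ is the internal path length, $E$ is the external path length, and $n$ is the number of internal nodes.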
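The relation $E = I + 2n$ between the external path length E and the internal path length I of an extended binary tree with n internal nodes can be checked on the decision tree of binary search. The sketch below walks the same index ranges as the iterative algorithm and sums node depths (root at depth 0); the class and method names are illustrative assumptions.

```java
class PathLengths {
    // Sum of the depths of the internal nodes of the decision tree
    // for the index range lower..upper, whose root sits at the given depth.
    static long internal(int lower, int upper, int depth) {
        if (lower > upper) return 0;               // empty range: an external node
        int middle = (lower + upper) / 2;
        return depth
             + internal(lower, middle - 1, depth + 1)
             + internal(middle + 1, upper, depth + 1);
    }

    // Sum of the depths of the external nodes of the same tree.
    static long external(int lower, int upper, int depth) {
        if (lower > upper) return depth;           // one external node at this depth
        int middle = (lower + upper) / 2;
        return external(lower, middle - 1, depth + 1)
             + external(middle + 1, upper, depth + 1);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 16; ++n) {
            long i = internal(0, n - 1, 0);
            long e = external(0, n - 1, 0);
            System.out.printf("n=%d  I=%d  E=%d  I+2n=%d  avg successful=%.3f%n",
                              n, i, e, i + 2L * n, 1.0 + (double) i / n);
        }
    }
}
```

A successful search for the key at depth d uses d + 1 three-way comparisons, so the average over all n keys is 1 + I/n, which is the last column printed.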
Binary search also lends itself readily to a recursive implementation using an explicit tree structure. For example, suppose we have a (partial) binary tree data structure defined as:
[Tree node data structure]=
public class Node {
    private Node left;
    private Node right;
    private Record record;
    protected Node getLeft() { return left; }
    protected Node getRight() { return right; }
    protected Key getKey() { return record.key(); }
    // Setters used when a record is inserted.
    protected void setLeft(Node node) { left = node; }
    protected void setRight(Node node) { right = node; }
    protected void setRecord(Record r) { record = r; }
}
public class BinaryTree {
    private Node root;
    private int size;
    public int size() { return size; }
    public boolean isEmpty() { return (size == 0); }
    public Node leftChild(Node node) { return node.getLeft(); }
    public Node rightChild(Node node) { return node.getRight(); }
    public boolean isInternal(Node node) {
        return (node.getLeft() != null || node.getRight() != null);
    }
    public boolean isExternal(Node node) {
        return (node.getLeft() == null && node.getRight() == null);
    }
    // public Node binarySearch(Node node, Key key) is given below.
}
A recursive binary search algorithm can be implemented as follows:
[Recursive binary search]=
// Assumes Key implements Comparable<Key>.
public Node binarySearch(Node node, Key key) {
    if (isExternal(node)) return node;          // unsuccessful: reached an external node
    int cmp = key.compareTo(node.getKey());
    if (cmp == 0) return node;                  // successful
    if (cmp < 0) {
        return binarySearch(leftChild(node), key);
    }
    else {
        return binarySearch(rightChild(node), key);
    }
}
A main advantage of this linked representation of a binary tree is that it allows for the implementation of a dynamic table, one in which we can efficiently insert and delete records (nodes). Dynamic sets can grow, shrink, or otherwise change over time. A dynamic set that supports the operations of insert, delete, and search (for set membership) is called a dictionary.
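[Search and insert]=
// Assumes external nodes are empty Node objects (no record, no children)
// and that Node provides setRecord, setLeft, and setRight.
public void insert(Node node, Record record) {
    if (isExternal(node)) {
        // The external node becomes internal and gains two new external children.
        node.setRecord(record);
        node.setLeft(new Node());
        node.setRight(new Node());
    }
    else if (record.key().compareTo(node.getKey()) < 0) {
        insert(leftChild(node), record);
    }
    else if (record.key().compareTo(node.getKey()) > 0) {
        insert(rightChild(node), record);
    }
    // else record.key() equals node.getKey(), so do nothing
}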
However, a potential problem with it is that the binary tree may not remain balanced as records are inserted and deleted. When the tree is fairly balanced, the search time will be O(lg n), but in the degenerate case, where the tree reduces to essentially a linked list, the search time will grow to O(n). It can be shown that the average search time will be $O(\sqrt{n})$ if each n-node binary tree is equally likely.
Let's consider what happens when n nodes are inserted into a binary tree in a random order. Consider 3 common English words: OF, THE, and TO. They can be inserted in 3! = 6 different orders, and each order produces a tree.
Consider the number of compares when searching for each word in the six trees.
| Word | Tree 0 | Tree 1 | Tree 2 | Tree 3 | Tree 4 | Tree 5 | Total |
|---|---|---|---|---|---|---|---|
| OF | 1 | 1 | 2 | 2 | 2 | 3 | 11 |
| THE | 2 | 3 | 1 | 1 | 3 | 2 | 12 |
| TO | 3 | 2 | 2 | 2 | 1 | 1 | 11 |
| Total | 6 | 6 | 5 | 5 | 6 | 6 | 34 |
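Let $A_n$ be the average number of compares for a successful search in a tree with n internal nodes. Let $A'_n$ be the average number of compares for an unsuccessful search in a tree with n internal nodes. We'd like to see if we can derive a general formula for $A_n$. Notice that for the three words above $A_3 = 34/18 = 17/9$. In general, the number of compares needed to find a key is one more than the number of compares made when that key was inserted, which was an unsuccessful search in a smaller tree, so
$$A_n = 1 + \frac{A'_0 + A'_1 + \cdots + A'_{n-1}}{n}.$$
The relation $E = I + 2n$ between external and internal path lengths, together with $A_n = 1 + I/n$ and $A'_n = E/(n+1)$, gives
$$A_n = \left(1 + \frac{1}{n}\right) A'_n - 1.$$
Solving these two equations yields $A'_n = 2H_{n+1} - 2$ and
$$A_n = 2\left(1 + \frac{1}{n}\right) H_n - 3 = O(\lg n).$$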
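The average successful search cost in randomly built trees can be verified by brute force for small n. The sketch below inserts the keys 0, 1,..., n - 1 in every possible order, totals the compares needed to find each key in each resulting tree, and prints the average next to the known closed form $2(1 + 1/n)H_n - 3$. The array-based tree and all names are assumptions made to keep the example short.

```java
import java.util.Arrays;

class RandomBstAverage {
    // Total compares needed to find every key in the binary search tree
    // built by inserting the keys 0..n-1 in the given order.
    static int totalCompares(int[] order) {
        int n = order.length;
        int[] left = new int[n];
        int[] right = new int[n];
        Arrays.fill(left, -1);                 // -1 marks a missing child
        Arrays.fill(right, -1);
        int root = order[0];
        int total = 1;                         // the root is found with one compare
        for (int t = 1; t < n; ++t) {
            int key = order[t];
            int node = root;
            int compares = 1;                  // compares to reach the current node
            while (true) {
                ++compares;                    // the new key lands one level deeper
                int child = (key < node) ? left[node] : right[node];
                if (child == -1) {
                    if (key < node) left[node] = key; else right[node] = key;
                    break;
                }
                node = child;
            }
            total += compares;
        }
        return total;
    }

    // Sum of totalCompares over every permutation of order[k..].
    static long sumOverPermutations(int[] order, int k) {
        if (k == order.length) return totalCompares(order);
        long sum = 0;
        for (int i = k; i < order.length; ++i) {
            swap(order, k, i);
            sum += sumOverPermutations(order, k + 1);
            swap(order, k, i);
        }
        return sum;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    static double formula(int n) {
        double h = 0.0;
        for (int i = 1; i <= n; ++i) h += 1.0 / i;
        return 2.0 * (1.0 + 1.0 / n) * h - 3.0;
    }

    public static void main(String[] args) {
        long factorial = 1;
        for (int n = 1; n <= 8; ++n) {
            factorial *= n;
            int[] order = new int[n];
            for (int i = 0; i < n; ++i) order[i] = i;
            double average = (double) sumOverPermutations(order, 0) / (factorial * n);
            System.out.printf("n=%d  average=%.6f  formula=%.6f%n", n, average, formula(n));
        }
    }
}
```

For n = 3 the total is 34 compares over 18 searches, the same figures as in the table for OF, THE, and TO.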
A hash table is an effective data structure for implementing dictionaries. A hash table is a generalization of an ordinary array; the hash table index is computed from the key stored in the table. Under reasonable assumptions, insert, delete and search can be performed in O(1) (constant) time. First, we'll consider direct-address (associative) tables, a formal abstract data type based on arrays. Then we'll look at hash tables and what makes a good hash function. Finally, we'll consider open hashing.
Pretend a dynamic set draws keys from the universal set $U = \{0, 1, \ldots, m-1\}$, where m is not too large. A direct-address table stores the record with key k in slot k of an array T[0..m - 1].
[Direct-address table]=
public class DirectAddressTable {
    Record[] T;   // T[k] holds the record with key k, or null if there is none
    public Record directAddressSearch(int key) {
        return T[key];
    }
    public void directAddressInsert(Record record) {
        T[record.key()] = record;
    }
    public void directAddressDelete(Record record) {
        T[record.key()] = null;
    }
}
Direct addressing has O(1) worst-case time complexity for each of its operations. Direct addressing is impractical when the size |U| is large. Direct addressing is space-inefficient when the set of keys K actually stored is small relative to the size |U|. The implementation assumes that Records know how to compute a unique integer key within the range 0 to m - 1.
We will extend direct-address tables to hash tables where a record is found by computing a hash value using a hash function on the record's key. The hash value provides an index into the hash table where the record can be found (or inserted or deleted).
Consider 10 common English words:
Most hash functions assume the keys come from the set of natural numbers N = {0, 1, 2,...}. When the keys are not natural numbers they must be converted. For example, character strings can be represented via their ASCII codes in radix 128 notation. For example, ON could be represented as $79 \cdot 128 + 78 = 10190$, since the ASCII codes of O and N are 79 and 78.
The division method for constructing a hash function maps a key k into one of m slots using the hash function $h(k) = k \bmod m$.
Certain values of m should not be used.
Good values of m are primes not too close to exact powers of 2.
As an example, pretend there are about n = 4000 character strings to be held in a hash table, and we don't mind up to about 3 strings hashing into the same slot. Since $4000/3 \approx 1333$ we could set m = 1381, a prime not too close to 1024. The hash function could be selected as $h(k) = k \bmod 1381$.
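As a sketch of how the radix-128 conversion and the division method fit together, the following hashes an ASCII string into a table with m = 1381 slots. It applies Horner's rule and reduces modulo m at every step so that intermediate values stay small; that step-by-step reduction, the class name, and the method name are choices made for this illustration.

```java
class DivisionHash {
    // Interprets s as a radix-128 number (ASCII codes as digits) and
    // returns that number modulo m, the division-method hash value.
    static int hash(String s, int m) {
        long h = 0;
        for (int i = 0; i < s.length(); ++i) {
            h = (h * 128 + s.charAt(i)) % m;   // Horner's rule, reduced at each step
        }
        return (int) h;
    }

    public static void main(String[] args) {
        // "ON" is 79 * 128 + 78 = 10190 in radix 128.
        System.out.println(hash("ON", 1381));
        System.out.println(hash("THE", 1381));
    }
}
```

Reducing at each step gives the same result as converting the whole string first and then taking the remainder, but it avoids overflow for long strings.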
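There are two steps in the multiplication method: first, multiply the key k by a constant A in the range 0 < A < 1 and extract the fractional part of kA; second, multiply this value by m and take the floor of the result. That is, $h(k) = \lfloor m\,(kA \bmod 1) \rfloor$. For example, with $k = 123456$, $m = 10000$, and $A = (\sqrt{5} - 1)/2 \approx 0.6180339887$,
$$\begin{aligned}
h(k) &= \lfloor 10000 \cdot (123456 \cdot 0.6180339887\ldots \bmod 1) \rfloor \\
     &= \lfloor 10000 \cdot 0.0041151\ldots \rfloor \\
     &= \lfloor 41.151\ldots \rfloor \\
     &= 41.
\end{aligned}$$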
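A minimal sketch of the multiplication method, $h(k) = \lfloor m\,(kA \bmod 1) \rfloor$, using double-precision arithmetic and the commonly suggested constant $A = (\sqrt{5} - 1)/2$. The choice of A, the use of floating point (rather than fixed-point word arithmetic), and the names are assumptions for illustration; very large keys would lose precision in this form.

```java
class MultiplicationHash {
    // A constant in (0, 1); the golden-ratio conjugate is a common choice.
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0;

    static int hash(long k, int m) {
        double product = k * A;
        double fraction = product - Math.floor(product);  // kA mod 1
        return (int) Math.floor(m * fraction);            // scale to 0..m-1
    }

    public static void main(String[] args) {
        System.out.println(hash(123456L, 10000));
        System.out.println(hash(123457L, 10000));
    }
}
```

Unlike the division method, the value of m is not critical here, so a power of 2 is often convenient.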
A hash table requires less space than a direct-address table and still (usually) has efficient insert, delete, and search operations. Storage requirements can be reduced to O(|K|) and O(1) average running times for operations can still be attained. A record r with key k is stored in slot h(k), where h is a hash function. The hash function maps the universe U of keys into the slots of a hash table T[0..m - 1], that is, $h : U \to \{0, 1, \ldots, m-1\}$.
Chaining places all records that collide (hash to the same slot) into a linked list.
We want to analyze the running times of these operations. It is clear that we can insert a record at the head of a linked list in $\Theta(1)$ time. Notice that this allows duplicate records in the hash table. To avoid this we would need to search the list at each slot and insert the new record at the end of the list upon finding that it is not there.
Define the load factor $\alpha$ of a hash table T to be n/m, where m is the number of slots and n records are stored in the table. The load factor $\alpha$ estimates the average number of records stored in a chain, under the assumption of simple uniform hashing, that is, records are assigned to slots following a uniform distribution.
It is clear that the worst case running time for searching a hash table with chaining is O(n). This occurs in the (highly) non-uniform case where all keys hash to the same slot. But, with the simple uniform hashing assumption, the probability is 1/m that any key hashes into any of the m slots. An average chain then holds n/m records, so a search takes $\Theta(1 + n/m)$ time on average.
Deletions require essentially the same time as a search, since we must find the node preceding the one to be deleted in order to set its next address to the node after the deleted one.
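The chaining operations discussed above can be sketched as follows, assuming int keys, the division-method hash k mod m, and `java.util.LinkedList` for the chains. The class and method names are illustrative, and insert places the key at the head of its chain without checking for duplicates, as in the text.

```java
import java.util.LinkedList;

class ChainedHashTable {
    private final LinkedList<Integer>[] table;   // one chain per slot
    private int n = 0;                           // number of stored keys

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; ++i) table[i] = new LinkedList<>();
    }

    private int h(int key) { return Math.floorMod(key, table.length); }

    // Constant time: add at the head of the chain (duplicates are allowed).
    void insert(int key) {
        table[h(key)].addFirst(key);
        ++n;
    }

    // Time proportional to the length of one chain.
    boolean search(int key) {
        return table[h(key)].contains(key);
    }

    // Finds the key in its chain and unlinks it; same cost as a search.
    boolean delete(int key) {
        boolean removed = table[h(key)].remove(Integer.valueOf(key));
        if (removed) --n;
        return removed;
    }

    double loadFactor() { return (double) n / table.length; }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(7);
        t.insert(10);
        t.insert(17);                            // collides with 10 (both are 3 mod 7)
        System.out.println(t.search(17));
        System.out.println(t.delete(10));
        System.out.println(t.loadFactor());
    }
}
```

`Math.floorMod` is used instead of `%` so that negative keys still map to a slot in 0..m - 1.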
Open-address hashing resolves collisions by storing all records in the hash table itself. This is most useful when the number of records to be stored can be estimated in advance. Then enough room of contiguous memory can be allocated (with some room to spare) to hold all the records, and we can reduce the space requirements needed for collision resolution by chaining.
We define more general hash functions for open-address hashing. That is, our hash functions will now be functions of two arguments: the key k and the number of probes p, where $0 \le p \le m - 1$. The probe parameter bounds the number of probes made, which guarantees that the algorithms terminate.
Linear probing is the simplest open-addressing scheme. We'll first discuss it in terms of searching a hash table. When a collision occurs we simply probe the next slot in the table. There are three possibilities that may occur:
Insertion is simple as well. Usually, one searches for the record to be inserted, and upon determining that it is not there, inserts it into the table at the empty slot where the search terminated.
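For linear probing, the hash function of key k and probe p is $h(k, p) = (h'(k) + p) \bmod m$, where $h'$ is an ordinary hash function.
Quadratic probing uses a hash function of the form $h(k, p) = (h'(k) + c_1 p + c_2 p^2) \bmod m$, where $c_1$ and $c_2 \ne 0$ are constants.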
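Search and insertion with linear probing can be sketched as follows, assuming int keys, $h'(k) = k \bmod m$, and the probe sequence $(h'(k) + p) \bmod m$ for p = 0, 1,..., m - 1. Deletion is omitted because it needs extra care in open addressing; the class and method names are illustrative.

```java
class LinearProbingTable {
    private final Integer[] slot;               // null marks an empty slot

    LinearProbingTable(int m) { slot = new Integer[m]; }

    // h(k, p) = (h'(k) + p) mod m with h'(k) = k mod m.
    private int h(int key, int p) {
        return (Math.floorMod(key, slot.length) + p) % slot.length;
    }

    // Returns the index of the slot holding key, or -1 if the key is absent.
    int search(int key) {
        for (int p = 0; p < slot.length; ++p) {
            int j = h(key, p);
            if (slot[j] == null) return -1;     // empty slot: unsuccessful search
            if (slot[j] == key) return j;       // found
        }
        return -1;                              // every slot probed: table is full
    }

    // Inserts key at the first empty slot on its probe sequence.
    // Returns false only when the table is full.
    boolean insert(int key) {
        for (int p = 0; p < slot.length; ++p) {
            int j = h(key, p);
            if (slot[j] == null) { slot[j] = key; return true; }
            if (slot[j] == key) return true;    // already present
        }
        return false;
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(7);
        t.insert(10);                           // slot 3
        t.insert(17);                           // collides, goes to slot 4
        t.insert(24);                           // collides twice, goes to slot 5
        System.out.println(t.search(24));
        System.out.println(t.search(31));
    }
}
```

The three keys in the example all hash to slot 3, so they form a run of occupied slots; such clusters are what make linear probing slow at high load factors.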
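Knuth [1] shows that linear probing computes on the order of $\frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right)$ probes for a successful search and $\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)$ probes for an unsuccessful search, where $\alpha = n/m$ is the load factor. The following table evaluates these expressions.
| Load factor $\alpha$ | 0.50 | 0.60 | 0.70 | 0.80 | 0.90 | 0.95 | 0.99 |
|---|---|---|---|---|---|---|---|
| Successful | 1.500 | 1.750 | 2.166 | 3.000 | 5.500 | 10.500 | 50.500 |
| Unsuccessful | 2.500 | 3.625 | 6.055 | 13.000 | 50.500 | 200.500 | 5000.500 |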
Clearly linear probing becomes inefficient as the load factor approaches 1.0 and one should strive to keep it below about 2/3 or so.
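The growth in probes can be tabulated directly from the standard estimates for linear probing, $\frac{1}{2}(1 + 1/(1 - \alpha))$ for a successful search and $\frac{1}{2}(1 + 1/(1 - \alpha)^2)$ for an unsuccessful one. The short sketch below evaluates them for a few load factors; the class and method names are illustrative assumptions.

```java
class ProbeEstimates {
    // Expected probes for a successful search at load factor a.
    static double successful(double a) {
        return 0.5 * (1.0 + 1.0 / (1.0 - a));
    }

    // Expected probes for an unsuccessful search at load factor a.
    static double unsuccessful(double a) {
        return 0.5 * (1.0 + 1.0 / ((1.0 - a) * (1.0 - a)));
    }

    public static void main(String[] args) {
        double[] loads = {0.50, 0.60, 2.0 / 3.0, 0.70, 0.80, 0.90, 0.95, 0.99};
        for (double a : loads) {
            System.out.printf("alpha=%.3f  successful=%.3f  unsuccessful=%.3f%n",
                              a, successful(a), unsuccessful(a));
        }
    }
}
```

At a load factor of 2/3 the estimates are 2 probes for a successful search and 5 for an unsuccessful one, which is why that value is a reasonable ceiling.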