Searching -- Looking for Answers

by William Shoaff with lots of help


You can download a postscript version of this file (which is prettier) at

http://www.cs.fit.edu/%7Ewds/classes/cse5081/Search/search.ps

  
Searching

Searching solves the problem of locating data that has been stored with a given identification. As with sorting, we have a file of records where each record includes a key used to identify the record. These notes study algorithms for solving the problem of searching through a file of records for one with a particular key $k$. When the file of records is small, it is often called a table. Note that we must be able to determine whether or not two keys are equal, and some algorithms that solve this problem also require a linear order on the keys.

Records may also contain secondary keys, attributes, or satellite data. Searching, or data retrieval, over secondary keys is a subject within the field of databases and will not be covered in these notes.

Searches are successful when a record with key k is found in the file or table, and unsuccessful when a record with the given key k is not found in the file. When searches are unsuccessful, we often want to insert a new record with key k in the file.

Searching algorithms can be classified as internal or external depending on whether or not the file can be stored in primary memory. They may be classified as dynamic or static depending on whether or not the file can change during the search. Search algorithms can also be classified based on how records are located. One method is to compare keys, similar to comparison-based sorting. Another method is to use digital properties of the keys, similar to distribution-based sorting.

Before jumping into the subject let's consider two special problems that will lead to useful ideas and analysis techniques related to searching.

  
The subset problem

The subset problem asks if one set A is a subset of another set B, specifically:

Given two sets $A = \{a_0, a_1, \ldots, a_{n-1}\}$ and $B = \{b_0, b_1, \ldots, b_{m-1}\}$, determine whether or not $A \subseteq B$.

Here are three ways to attack the problem.

1. Compare each $a_i$ sequentially with each $b_j$ until finding a match.
2. Enter each $b_j$ into a table and then search the table for each $a_i$.
3. Sort the elements of $A$ and $B$ and then make one pass through both files.

The first (brute force) solution will have a running time that is roughly $c_1 nm = O(nm)$. A good hashing method will provide solution two with a running time that is $c_2 m + d_2 n = O(n + m)$. The third solution will have running time $c_3(m \lg m + n \lg n)$. Which of these three methods is best in a given application depends on the sizes of $m$ and $n$ and on the constants $c_1$, $c_2$, $d_2$, and $c_3$. Bottom line: most often there are several methods to solve a problem, and which method is best depends on analysis.
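The three approaches can be sketched side by side. This is only an illustration under some assumptions that are not in the notes: the sets are held in int arrays, the class name SubsetDemo and its method names are made up for the example, and java.util.HashSet stands in for "a good hashing method".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SubsetDemo {
  // Method 1: brute force, roughly n*m comparisons.
  static boolean bruteForce(int[] a, int[] b) {
    for (int x : a) {
      boolean found = false;
      for (int y : b) {
        if (x == y) { found = true; break; }
      }
      if (!found) return false;
    }
    return true;
  }

  // Method 2: enter every element of B into a hash table, then look up each element of A.
  static boolean byHashing(int[] a, int[] b) {
    Set<Integer> table = new HashSet<>();
    for (int y : b) table.add(y);
    for (int x : a) {
      if (!table.contains(x)) return false;
    }
    return true;
  }

  // Method 3: sort copies of both arrays, then make one merge-like pass.
  static boolean bySorting(int[] a, int[] b) {
    int[] sa = a.clone();
    int[] sb = b.clone();
    Arrays.sort(sa);
    Arrays.sort(sb);
    int j = 0;
    for (int x : sa) {
      while (j < sb.length && sb[j] < x) ++j;   // skip elements of B smaller than x
      if (j == sb.length || sb[j] != x) return false;
    }
    return true;
  }
}
```

All three methods return the same answer; they differ only in how the work grows with $n$ and $m$.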

  
Searching for the maximum

Consider the problem of finding the maximum value in a set of keys. Specifically,

Given $n$ keys $k_0, k_1, \ldots, k_{n-1}$, find $M$ and $j$ such that

$$M = k_j = \max\{k_0, k_1, \ldots, k_{n-1}\}.$$

Here's one implementation of a brute force algorithm that solves the problem. Note the algorithm does not do anything with the values M and j that it finds, but that is no matter here.

[Find the maximum]=
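public void findMax(Record[] record) {
  // Assumes Key implements Comparable<Key> and record.length >= 1.
  Key M = record[0].key();
  int n = record.length;
  int j = 0;
  int i = 1;
  while (i < n) {
    if (record[i].key().compareTo(M) > 0) {   // a new maximum is found
      j = i;
      M = record[i].key();
    }
    ++i;
  }
}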

The algorithm is easy to analyze, yet instructive when one looks at the details. The initializations before the while loop occur once and have constant $O(1)$ running time. The test i < n occurs $n$ times, once for each $i = 1, 2, \ldots, n$. Both the if statement and the incrementing operation ++i occur $n - 1$ times. So, it is clear that the running time of the algorithm is $O(n)$, but let's consider the number of times that a new maximum is found, that is, the number of times the condition in the if statement evaluates to true. Let's call this value $C_n$. It is clear that:

1. The minimum value is $C_n = 0$ and occurs when $M = k_0 = \max\{k_0, k_1, \ldots, k_{n-1}\}$.
2. The maximum value is $C_n = n - 1$ and occurs when $k_0 < k_1 < \cdots < k_{n-1}$.

The average value $A_n$ of $C_n$ is between $0$ and $n - 1$, but what is it exactly and what is its variance and standard deviation? To answer these questions, one must make assumptions about the permutations of the keys. There are $n!$ permutations and we will assume each permutation is equally likely to occur.

The type of the keys is not really important to us (unless it affects the complexity of comparisons or transfer of key values), so we will assume that they are the natural numbers $0, 1, \ldots, n - 1$. For particular values of $n$, say $n = 3$, we can explicitly calculate the average number of times the condition in the if statement is true.

Permutation   Value of $C_3$     Permutation   Value of $C_3$
0, 1, 2       2                  0, 2, 1       1
1, 0, 2       1                  1, 2, 0       1
2, 0, 1       0                  2, 1, 0       0

Thus, the average number of times the maximum is updated is:

$$A_3 = \frac{2 + 1 + 0 + 1 + 1 + 0}{6} = \frac{5}{6}.$$

We'd like to analyze the general case. Consider two events that partition what could happen:

1. $k_{n-1}$ is the maximum value;
2. $k_{n-1}$ is not the maximum value.

In the first case, there is one more maximum update for $k_0, k_1, \ldots, k_{n-1}$ than for $k_0, k_1, \ldots, k_{n-2}$. In the second case, the numbers of maximum updates for $k_0, k_1, \ldots, k_{n-1}$ and $k_0, k_1, \ldots, k_{n-2}$ are identical. Since the probability of the first case is

$$P(M = k_{n-1}) = \frac{(n-1)!}{n!} = \frac{1}{n}$$

and the probability of the second case is

$$P(M \neq k_{n-1}) = 1 - P(M = k_{n-1}) = \frac{n-1}{n},$$

we have

$$A_n = \frac{1}{n}(A_{n-1} + 1) + \frac{n-1}{n}A_{n-1} = A_{n-1} + \frac{1}{n},$$

from which you can deduce that

$$A_n = A_1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}.$$

And since $A_1 = 0$, we find the average number of maximum assignments is one less than the harmonic number $H_n$, that is,

$$A_n = H_n - 1 = O(\lg n).$$

Note that for $n = 3$ this agrees with our explicit observations above:

$$A_3 = H_3 - 1 = \left(1 + \frac{1}{2} + \frac{1}{3}\right) - 1 = \frac{5}{6}.$$
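The closed form can also be checked by brute force for small $n$. The sketch below is only an illustration: it uses int keys instead of Records, and the class name MaxUpdates and its methods are invented for the example. It enumerates all $n!$ permutations, counts the maximum updates in each, and compares the average with $H_n - 1$.

```java
final class MaxUpdates {
  // Counts how often the running maximum is replaced during a left-to-right scan.
  static int updates(int[] keys) {
    int count = 0;
    int max = keys[0];
    for (int i = 1; i < keys.length; ++i) {
      if (keys[i] > max) { max = keys[i]; ++count; }
    }
    return count;
  }

  // Sums updates() over every permutation of keys, generated by recursive swapping.
  static long totalUpdates(int[] keys, int start) {
    if (start == keys.length) return updates(keys);
    long total = 0;
    for (int i = start; i < keys.length; ++i) {
      swap(keys, start, i);
      total += totalUpdates(keys, start + 1);
      swap(keys, start, i);   // restore the array before the next choice
    }
    return total;
  }

  static void swap(int[] a, int i, int j) {
    int t = a[i]; a[i] = a[j]; a[j] = t;
  }

  // Average number of updates over all n! permutations of 0, 1, ..., n-1.
  static double average(int n) {
    int[] keys = new int[n];
    long factorial = 1;
    for (int i = 0; i < n; ++i) { keys[i] = i; factorial *= (i + 1); }
    return (double) totalUpdates(keys, 0) / factorial;
  }

  // The closed form H_n - 1.
  static double harmonicMinusOne(int n) {
    double h = 0.0;
    for (int i = 1; i <= n; ++i) h += 1.0 / i;
    return h - 1.0;
  }
}
```

For $n = 3$ the total over the six permutations is 5, matching the table, and for other small $n$ average(n) and harmonicMinusOne(n) agree up to rounding. The enumeration takes $n!$ steps, so it is only practical for small $n$.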

  
Sequential search

Sequential search is the simplest search technique. It starts at the beginning of a file and keeps looking until the key is found or the end of the file is reached. This is a brute force approach that may be applicable when the files are small or are searched only a few times.

[Sequential search]=
public void sequentialSearch(Record[] record, Key key) {
  int n = record.length;
  int i = 0;
  // Compare keys with equals(); != would compare object references.
  while (i < n && !key.equals(record[i].key())) ++i;
  if (i < n) System.out.println("successful");
  else System.out.println("unsuccessful");
}

The analysis of sequentialSearch() is straightforward, but it does depend on whether the search was successful or not. If the search is successful and the key is found at index $i$, we have $T(n) = i + 1$. But the running time is always $T(n) = n = \Theta(n)$ for an unsuccessful search. If every key is sought with probability $1/n$, then the average number of comparisons in a successful search is

$$T_{\text{average}}(n) = 1 \cdot \frac{1}{n} + 2 \cdot \frac{1}{n} + \cdots + n \cdot \frac{1}{n} = \frac{n+1}{2}.$$

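We can improve on the efficiency of sequential search, but not its asymptotic running time, by inserting a sentinel record with key $k$ at the end of the file. The search then always finds a match, so the loop no longer needs the bounds test i < n.

[Better sequential search]=
public void betterSequentialSearch(Record[] record, Key key) {
  // Assumes the last slot, record[n], already holds a sentinel record
  // whose key equals the search key.
  int n = record.length - 1;   // number of real records
  int i = 0;
  while (!key.equals(record[i].key())) ++i;
  if (i < n) System.out.println("successful");
  else System.out.println("unsuccessful");
}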

Another way to make sequential search faster is to look in a table where the data is in sorted order. That is, assume the records have been sorted so their keys are in ascending order

$$k_0 < k_1 < \cdots < k_{n-2}$$

and there is an ending sentinel record with key $k_{n-1} = \infty$.

[Sequential search in an ordered table]=
public void orderedSequentialSearch(Record[] record, Key key) {
  int i = 0;
  // The sentinel key (infinity) in the last slot stops the scan.
  while (key.compareTo(record[i].key()) > 0) ++i;
  if (key.equals(record[i].key())) System.out.println("successful");
  else System.out.println("unsuccessful");
}

Frequency of keys

We have been assuming each key occurs with identical probability $1/n$. This may not be true in general applications. Let $p_i$ denote the probability of occurrence of key $k_i$. The average number of comparisons $A_n$ for a successful search has the value

$$A_n = p_0 + 2p_1 + 3p_2 + \cdots + n p_{n-1},$$

where $p_0 + p_1 + \cdots + p_{n-1} = 1$.

Knowing the probabilities allows us to build better search tables. If the records are arranged so that

$$p_0 \geq p_1 \geq \cdots \geq p_{n-1},$$

then $A_n$ will be minimized.

Zipf's law is one probability distribution that seems to occur in practice. The probabilities in Zipf's law are

$$p_0 = \frac{c}{1},\ p_1 = \frac{c}{2},\ \ldots,\ p_{n-1} = \frac{c}{n},$$

where

$$c = \frac{1}{H_n}.$$

Using the Zipf probabilities we have

$$A_n = \frac{n}{H_n} \approx \frac{n}{\ln n}.$$
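The weighted average is easy to evaluate numerically. The following sketch (the class name ZipfSearchCost and its methods are invented for illustration) computes $A_n$ for any probability vector and builds the Zipf probabilities, so the result $n/H_n$ can be compared with the uniform case $(n+1)/2$.

```java
final class ZipfSearchCost {
  // Harmonic number H_n = 1 + 1/2 + ... + 1/n.
  static double harmonic(int n) {
    double h = 0.0;
    for (int i = 1; i <= n; ++i) h += 1.0 / i;
    return h;
  }

  // A_n = p_0 + 2 p_1 + ... + n p_{n-1}: expected comparisons in a successful sequential search.
  static double expectedComparisons(double[] p) {
    double sum = 0.0;
    for (int i = 0; i < p.length; ++i) sum += (i + 1) * p[i];
    return sum;
  }

  // Zipf probabilities p_i = c / (i + 1) with c = 1 / H_n.
  static double[] zipf(int n) {
    double c = 1.0 / harmonic(n);
    double[] p = new double[n];
    for (int i = 0; i < n; ++i) p[i] = c / (i + 1);
    return p;
  }
}
```

For example, with $n = 1000$ the uniform distribution gives 500.5 expected comparisons, while expectedComparisons(zipf(1000)) equals $1000/H_{1000}$, which is far smaller.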

  
Binary search

Binary search is a classic method where about half of the keys are discarded on each unsuccessful compare leading to at most about lg n comparisons.

Given a table of records $r_0, \ldots, r_{n-1}$ with ordered keys

$$k_0 < k_1 < \cdots < k_{n-1},$$

we are to find the position $j$ of record $r_j$ with key $k$ or determine that no record with key $k$ is in the table.

We will find that the running time of binary search satisfies

$$T(n) = \Omega(1) \quad\text{and}\quad T(n) = O(\lg n).$$

Binary search is often called a simplification algorithm, rather than a divide and conquer algorithm, in that only one subproblem of a smaller size is solved.

Although a common algorithm, it can be non-trivial to implement correctly. Knuth [1] states that although John Mauchly (of ENIAC fame) discussed binary search in a 1946 publication, it was not until 1962 that an implementation of binary search that worked correctly for all values of $n$ was published. Here is an iterative version of the algorithm.

[Binary search]=
public Record binarySearch(Record[] record, Key key) {
  int lower = 0;
  int upper = record.length - 1;
  while (lower <= upper) {
    int middle = lower + (upper - lower)/2;   // avoids overflow of lower + upper
    int c = key.compareTo(record[middle].key());
    if (c == 0) return record[middle];
    if (c < 0) {
       upper = middle - 1;
    }
    else { 
       lower = middle + 1;
    }
  }
  return null;   // unsuccessful search
}

Analysis of binary search

A binary decision tree can be used to understand binary search. For example when n = 8, a binary decision tree that represents the possible comparisons when executing the algorithm can be drawn as follows. Notice it is an extended tree with interior nodes denoted by circles and exterior nodes denoted by rectangles. Also, note that the tree is right-complete (not all binary search trees are).


[Figure: binary decision tree for binary search with n = 8. The root is the comparison k : k_3; internal nodes are drawn as circles and the external nodes as boxes numbered 0 through 8.]

Initially the key $k$ is compared against $k_3$ since the middle index is $m = 3 = \lfloor (0 + 7)/2 \rfloor$. If $k < k_3$ the left branch is taken and the upper index becomes $u = m - 1 = 2$. The new middle index is $m = 1 = \lfloor (0 + 2)/2 \rfloor$ and we compare $k$ and $k_1$. If $k > k_1$ the right branch is taken and the lower index becomes $l = m + 1 = 2$. The middle index is now $m = 2 = \lfloor (2 + 2)/2 \rfloor$ and we compare $k$ and $k_2$. The external nodes labeled 2 and 3 represent unsuccessful searches where $k_1 < k < k_2$ and $k_2 < k < k_3$.

Theorem 1   If $2^{m-1} \leq n < 2^m$, a successful search requires as few as 1 and at most $m$ comparisons. An unsuccessful search requires either $m - 1$ or $m$ comparisons. The worst case running time of binary search is

$$T(n) = O(\lg n).$$
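The bounds in Theorem 1 can be observed directly by counting comparisons. The sketch below is an illustration only: it searches an int array rather than Records, the class name BinarySearchCount is made up, and each loop iteration is counted as one three-way comparison (one node of the decision tree).

```java
final class BinarySearchCount {
  // Returns the number of three-way key comparisons made while searching
  // the sorted array keys for key; the search itself is ordinary binary search.
  static int comparisons(int[] keys, int key) {
    int lower = 0;
    int upper = keys.length - 1;
    int count = 0;
    while (lower <= upper) {
      int middle = lower + (upper - lower) / 2;
      ++count;                                   // one visit to a decision-tree node
      if (key == keys[middle]) return count;     // successful search
      if (key < keys[middle]) upper = middle - 1;
      else lower = middle + 1;
    }
    return count;                                // unsuccessful search
  }
}
```

With the eight keys 1, 3, 5, ..., 15 (so $n = 8$ and $m = 4$), searching for 7 takes 1 comparison, searching for 15 takes 4, and every unsuccessful search (for example the even numbers 0 through 16) takes 3 or 4.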

From the tree representation we can determine the following about the expected running time of binary search:

1. Let $C_n$ be the average number of compares for a successful search.
2. Let $\hat{C}_n$ be the average number of compares for an unsuccessful search.
3. Assume each of the $n$ keys is equally likely to be sought in a successful search.
4. Assume each of the $n + 1$ intervals between keys is equally likely to be reached in an unsuccessful search.
5. Let $E$ be the external path length in the binary decision tree, that is, the sum of the path lengths to the external boxed nodes:

   $$E = \sum_{j=0}^{n} e(j),$$

   where $e(j)$ is the path length to external node $j$, $j = 0, \ldots, n$.
6. Let $I$ be the internal path length in the binary decision tree, that is, the sum of the path lengths to each of the internal round nodes:

   $$I = \sum_{j=0}^{n-1} i(j),$$

   where $i(j)$ is the path length to internal node $j$, $j = 0, \ldots, n - 1$.

Then, for a successful search we have

$$C_n = \sum_{j=0}^{n-1} \frac{1}{n}(1 + i(j)) = 1 + \frac{I}{n},$$

since any of the $n$ internal nodes can be reached with equal probability and $1 + i(j)$ counts the number of compares to reach internal node $j$. For an unsuccessful search we have

$$\hat{C}_n = \sum_{j=0}^{n} \frac{1}{n+1} e(j) = \frac{E}{n+1},$$

since any of the $n + 1$ external nodes can be reached with equal probability and $e(j)$ counts the number of compares to reach external node $j$.

Now internal and external path lengths are related by the formula

$$E = I + 2n,$$

so we find

$$C_n = 1 + \frac{I}{n} = 1 + \frac{E - 2n}{n} = \frac{E}{n} - 1 = \frac{n+1}{n}\hat{C}_n - 1.$$

Finally, it can be shown that the binary decision tree with minimal external path length has

$$E = (n + 1)(\lfloor \lg n \rfloor + 2) - 2^{\lfloor \lg n \rfloor + 1},$$

from which we can conclude that

$$C_n = \lg n - 1 + \epsilon + \frac{\lfloor \lg n \rfloor + 2}{n}$$

and

$$\hat{C}_n = \lg(n + 1) + \epsilon',$$

where $\epsilon$ and $\epsilon'$ are small correction terms.
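As a check on these formulas, the $n = 8$ decision tree described earlier can be worked by hand. Its internal nodes lie at depth 0 (one node, $k_3$), depth 1 (two nodes), depth 2 (four nodes) and depth 3 (one node, $k_7$); seven of its external nodes are at depth 3 and two are at depth 4. The depths are read off the binary search code, so this is a worked illustration rather than part of the original derivation.

```latex
\begin{align*}
I &= 1\cdot 0 + 2\cdot 1 + 4\cdot 2 + 1\cdot 3 = 13, \\
E &= 7\cdot 3 + 2\cdot 4 = 29 = I + 2n, \\
C_8 &= 1 + \frac{I}{n} = 1 + \frac{13}{8} = \frac{21}{8}, \\
\hat{C}_8 &= \frac{E}{n+1} = \frac{29}{9}, \\
\frac{n+1}{n}\,\hat{C}_8 - 1 &= \frac{9}{8}\cdot\frac{29}{9} - 1 = \frac{21}{8} = C_8, \\
(n+1)(\lfloor \lg n \rfloor + 2) - 2^{\lfloor \lg n \rfloor + 1} &= 9\cdot 5 - 16 = 29 = E.
\end{align*}
```

So a successful search among 8 keys averages 2.625 compares and an unsuccessful one about 3.22, both close to $\lg 8 = 3$.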

Binary search trees

Binary search also lends itself readily to a recursive implementation using an explicit tree structure. For example, suppose we have a (partial) binary tree data structure defined as:

[Tree node data structure]=
public class Node {
  private Node left; 
  private Node right;
  private Record record;   // null in an external node
  protected Node getLeft() { return left; }
  protected Node getRight() { return right; }
  protected Key getKey() { return record.key(); }
  protected void setLeft(Node node) { left = node; }
  protected void setRight(Node node) { right = node; }
  protected void setRecord(Record r) { record = r; }
}

public class BinaryTree {
  private Node root;
  private int size;
  public int size() { return size; }
  public boolean isEmpty() { return (size == 0); }
  public Node leftChild (Node node) { return node.getLeft(); }
  public Node rightChild (Node node) { return node.getRight(); }
  public boolean isInternal(Node node) {
    return (node.getLeft() != null || node.getRight() != null);
  }
  public boolean isExternal(Node node) {
    return (node.getLeft() == null && node.getRight() == null);
  }
  // The bodies of binarySearch and insert are given in the chunks that follow.
}

A recursive binary search algorithm can be implemented as follows:

[Recursive binary search]=
public Node binarySearch(Node node, Key key) {
  // Reaching an external node means the search was unsuccessful.
  if (isExternal(node) || key.equals(node.getKey())) return node;
  if (key.compareTo(node.getKey()) < 0) {
     return binarySearch(leftChild(node), key);
  }
  else {
     return binarySearch(rightChild(node), key);
  }
}

A main advantage of this linked representation of a binary tree is that it allows for the implementation of a dynamic table, one in which we can efficiently insert and delete records (nodes). Dynamic sets can grow, shrink, or otherwise change over time. A dynamic set that supports the operations of insert, delete, and search (for set membership) is called a dictionary.

[Search and insert]=
public void insert(Node node,  Record record) {
  if (isExternal(node)) {
     // Fill the external node and give it two new external children,
     // so that it becomes an internal node.
     node.setRecord(record);
     node.setLeft(new Node());
     node.setRight(new Node());
     ++size;
  }
  else if (record.key().compareTo(node.getKey()) < 0) {
    insert(leftChild(node), record);
  }
  else if (record.key().compareTo(node.getKey()) > 0) {
    insert(rightChild(node), record);
  }
  // else the keys are equal, so do nothing
}

A potential problem with the linked representation is that the binary tree may not remain balanced as records are inserted and deleted. When the tree is fairly balanced, the search time will be $O(\lg n)$, but in the degenerate case, where the tree reduces to essentially a linked list, the search time will grow to $O(n)$. It can be shown that the average search time will be $O(\sqrt{n})$ if each $n$-node binary tree is equally likely.

Let's consider what happens when $n$ nodes are inserted into a binary tree in a random order. Consider 3 common English words:

OF, THE, TO

which we insert into a binary tree in each of the 6 possible orders in which the words can occur. Only 5 distinct trees result: the orders THE, OF, TO and THE, TO, OF produce identical trees.


[Figure: the five distinct trees. Tree 0 and Tree 1 have root OF, Tree 2 = Tree 3 has root THE, and Tree 4 and Tree 5 have root TO; the external nodes are boxes numbered 0 through 3.]

Consider the number of compares when searching for each word in the five trees.

        Compares in Tree
        0   1   2   3   4   5   Total
OF      1   1   2   2   2   3   11
THE     2   3   1   1   3   2   12
TO      3   2   2   2   1   1   11
Total   6   6   5   5   6   6   34

Thus the average number of compares averaged over the 3! = 6 permutations and 3 words is

$$A_3 = \frac{1}{3} \cdot \frac{1}{6} \cdot 34 = \frac{17}{9}.$$

Let $A_n$ be the average number of compares for a successful search in a tree with $n$ internal nodes. Let $\hat{A}_n$ be the average number of compares for an unsuccessful search in a tree with $n$ internal nodes. We'd like to see if we can derive a general formula for

$$A_n = \frac{1}{n} \cdot \frac{1}{n!} \cdot \sum (\text{compares until the key is found}),$$

where the sum runs over all $n!$ insertion orders and all $n$ keys.

Notice that:

1. There are $n!$ possible orderings for the insertions.
2. If it takes $C$ compares to insert a record into the tree, it will take $C + 1$ compares to find it.
3. For a successful search, the record can lie in any of the $n$ nodes with probability $1/n$.
4. For a successful search, we first unsuccessfully search through a tree with $k$ nodes ($k = 0, 1, \ldots, n - 1$) and then find the record on the next compare.
5. Therefore

   $$A_n = \sum_{k=0}^{n-1} \frac{1}{n}\left(1 + \hat{A}_k\right).$$

Rewriting the last formula we have

$$A_n = 1 + \frac{\hat{A}_0 + \hat{A}_1 + \cdots + \hat{A}_{n-1}}{n}.$$

But we also know the relation between external/internal path lengths and successful/unsuccessful compare counts:

$$A_n = 1 + \frac{I}{n} = 1 + \frac{E - 2n}{n} = \frac{n+1}{n}\hat{A}_n - 1.$$

Combining the two equations gives

$$(n + 1)\hat{A}_n = 2n + \hat{A}_0 + \hat{A}_1 + \cdots + \hat{A}_{n-1}.$$

We know sums in recurrences can be eliminated by subtracting two successive versions of the recurrence; that is, the difference between

$$(n + 1)\hat{A}_n = 2n + \hat{A}_0 + \hat{A}_1 + \cdots + \hat{A}_{n-1}$$

and

$$n\hat{A}_{n-1} = 2(n - 1) + \hat{A}_0 + \hat{A}_1 + \cdots + \hat{A}_{n-2}$$

is

$$(n + 1)\hat{A}_n - n\hat{A}_{n-1} = 2 + \hat{A}_{n-1},$$

or

$$\hat{A}_n - \hat{A}_{n-1} = \frac{2}{n+1}.$$

From this, together with $\hat{A}_0 = 0$, we can conclude that

$$\hat{A}_n = 2H_{n+1} - 2,$$

and

$$A_n = 2\left(1 + \frac{1}{n}\right)H_n - 3.$$

Thus, we expect on average that order lg n compares will occur when searching a random tree; well-balanced trees are common, degenerate ones are rare.
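The formula for $A_n$ can be checked experimentally for small $n$ by building a tree for every insertion order. The sketch below is an illustration under assumptions not in the notes: keys are ints, an external node is represented by null rather than by an empty Node, and the class RandomBstCost and its methods are made up for the example.

```java
final class RandomBstCost {
  // Minimal unbalanced binary search tree node; null plays the role of an external node.
  static final class Node {
    final int key;
    Node left, right;
    Node(int key) { this.key = key; }
  }

  static Node insert(Node node, int key) {
    if (node == null) return new Node(key);
    if (key < node.key) node.left = insert(node.left, key);
    else if (key > node.key) node.right = insert(node.right, key);
    return node;
  }

  // Number of compares needed to find key, which must be present in the tree.
  static int compares(Node node, int key) {
    int count = 1;
    while (node.key != key) {
      node = (key < node.key) ? node.left : node.right;
      ++count;
    }
    return count;
  }

  // Sum of successful-search compares over every insertion order of keys and every key.
  static long totalCompares(int[] keys, int start) {
    if (start == keys.length) {
      Node root = null;
      for (int k : keys) root = insert(root, k);
      long total = 0;
      for (int k : keys) total += compares(root, k);
      return total;
    }
    long total = 0;
    for (int i = start; i < keys.length; ++i) {
      swap(keys, start, i);
      total += totalCompares(keys, start + 1);
      swap(keys, start, i);
    }
    return total;
  }

  static void swap(int[] a, int i, int j) {
    int t = a[i]; a[i] = a[j]; a[j] = t;
  }

  // Closed form A_n = 2 (1 + 1/n) H_n - 3.
  static double formula(int n) {
    double h = 0.0;
    for (int i = 1; i <= n; ++i) h += 1.0 / i;
    return 2.0 * (1.0 + 1.0 / n) * h - 3.0;
  }
}
```

For three keys the total is 34, the same as in the OF, THE, TO table, and dividing by $n \cdot n!$ gives $17/9$, which is also the value of formula(3).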

  
Hashing

A hash table is an effective data structure for implementing dictionaries. A hash table is a generalization of an ordinary array; the hash table index is computed from the key stored in the table. Under reasonable assumptions, insert, delete, and search can be performed in $O(1)$ (constant) average time. First, we'll consider direct-address (associative) tables, a formal abstract data type based on arrays. Then we'll look at hash tables and what makes a good hash function. Finally, we'll consider open hashing.

Direct-address (associative) tables

Pretend a dynamic set draws keys from the universal set

U = {0, 1, 2,..., m - 1}

where m is not too large. A direct-address table is an abstract data structure that can be represented as an array T[0..m - 1] of Records with three operations:

[Direct-address table]=
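public class DirectAddressTable {
   Record[] T;   // T[k] holds the record with key k, or null if there is none
   // Assumes record.key() returns an int in the range 0 to m - 1.
   public Record directAddressSearch (int key) {
     return T[key];
   }
   public void directAddressInsert (Record record) {
     T[record.key()] = record;
   }
   public void directAddressDelete (Record record) {
     T[record.key()] = null;
   }
}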

Direct addressing has $O(1)$ worst-case time complexity for each of its operations. Direct addressing is impractical when the size $|U|$ is large, and it is space-inefficient when the set of keys $K$ actually stored is small relative to $|U|$. The implementation assumes that Records know how to compute a unique integer key within the range 0 to $m - 1$.

We will extend direct-address tables to hash tables where a record is found by computing a hash value using a hash function on the record's key. The hash value provides an index into the hash table where the record can be found (or inserted or deleted).

Hash functions

Consider 10 common English words:

THE, OF, TO, AND, THAT, THIS, WITH, A, IN, ON
and pretend we had a function $h$ that mapped these words onto the integers 0 to 9. There are $10^{10}$ functions from a 10-element set to a 10-element set, but only $10! = 3{,}628{,}800$ of these are onto functions, so we have only about a 1 in 2755 chance of finding an onto function if we select one at random; and the odds become much steeper as the number of elements goes up. Functions which avoid duplicates are rare! However, we can find functions with only a few duplicates fairly easily. There are two primary methods for doing so: the division method and the multiplication method. But first, we want to know what makes a good hash function.

Most hash functions assume the keys come from the set of natural numbers $N = \{0, 1, 2, \ldots\}$. When the keys are not natural numbers they must be converted. For example, character strings can be represented via their ASCII codes in radix 128 notation. With the ASCII codes O = 79, N = 78, and I = 73, ON could be represented as

$$\text{ON} = 79 \cdot 128 + 78 = 10190,$$

while IN could be represented as

$$\text{IN} = 73 \cdot 128 + 78 = 9422.$$
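The radix 128 conversion can be sketched in a few lines. The class and method names below are made up for illustration, and the sketch assumes plain ASCII input short enough (at most 9 characters) that the value fits in a long.

```java
final class Radix128 {
  // Interprets an ASCII string as a number written in radix 128,
  // most significant character first (Horner's rule).
  static long toNumber(String s) {
    long value = 0;
    for (int i = 0; i < s.length(); ++i) {
      value = value * 128 + s.charAt(i);
    }
    return value;
  }
}
```

With the ASCII codes O = 79, N = 78 and I = 73, toNumber("ON") is $79 \cdot 128 + 78 = 10190$ and toNumber("IN") is $73 \cdot 128 + 78 = 9422$. Longer strings would need arbitrary-precision arithmetic or reduction modulo the table size at each step.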

The division method

The division method for constructing a hash function maps a key k into one of m slots using the hash function

$$h(k) = k \bmod m.$$

For example, if $m = 15$ and $k = 123$, then $h(123) = 123 \bmod 15 = 3$ (since $15 \cdot 8 + 3 = 123$). The division method is very fast since only a single division is needed.

Certain values of $m$ should not be used. For example, if $m = 2^p$ then $h(k)$ is just the $p$ low-order bits of $k$, so the rest of the key is ignored.

Good values of $m$ are primes not too close to exact powers of 2. As an example, pretend there are about $n = 4000$ character strings to be held in a hash table, and we don't mind up to about 3 strings hashing into the same slot. Since $4000/3 \approx 1333$ we could set $m = 1381$, a prime not too close to 1024. The hash function could be selected as

$$h(k) = k \bmod 1381.$$
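A minimal sketch of the division method follows; the class name DivisionHash is invented for illustration and the key is assumed to be nonnegative.

```java
final class DivisionHash {
  // h(k) = k mod m, for a nonnegative key k and a table with m > 0 slots.
  static int hash(long k, int m) {
    return (int) (k % m);
  }
}
```

The choice of $m$ matters. Taking the radix 128 values 10190 for ON and 9422 for IN, a table size of $m = 128$ sends both keys to slot 78, because only the last character survives the division. With $m = 1381$ they land in different slots, 523 and 1136.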

The multiplication method

There are two steps in the multiplication method:

1. The key $k$ is multiplied by a constant $A$ in the range $0 < A < 1$ and the fractional part of $kA$ is extracted.
2. This fractional part is multiplied by $m$ and the floor taken.

Thus, the hash function is

$$h(k) = \lfloor m(kA - \lfloor kA \rfloor) \rfloor.$$

In the multiplication method the value of $m$ is not critical and $m = 2^p$ for some $p$ is typically chosen. Knuth [1] suggests that one over the golden mean is generally a good value for $A$:

$$A = \frac{1}{\frac{1+\sqrt{5}}{2}} = \frac{\sqrt{5}-1}{2} \approx 0.6180339887\ldots$$

As an example, if $k = 123456$, $m = 10000$ and $A = 0.6180339887$, then

$$k \cdot A = 123456 \cdot 0.6180339887 = 76300.0041089472$$

and

$$\begin{aligned}
h(k) &= \lfloor 10000 \cdot (76300.0041089472 - \lfloor 76300.0041089472 \rfloor) \rfloor \\
     &= \lfloor 10000 \cdot 0.0041089472 \rfloor \\
     &= \lfloor 41.089472 \rfloor \\
     &= 41.
\end{aligned}$$
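The two steps translate directly into floating-point code. This is a sketch only: the class name MultiplicationHash is invented, double arithmetic is assumed to be accurate enough for the key sizes used, and production implementations usually work with integer arithmetic instead.

```java
final class MultiplicationHash {
  // One over the golden mean, (sqrt(5) - 1) / 2.
  static final double GOLDEN_A = (Math.sqrt(5.0) - 1.0) / 2.0;

  // h(k) = floor(m * (k*a - floor(k*a))) for a nonnegative key k and 0 < a < 1.
  static int hash(long k, int m, double a) {
    double product = k * a;
    double fraction = product - Math.floor(product);   // fractional part of k*a
    return (int) Math.floor(m * fraction);
  }
}
```

Calling hash(123456, 10000, 0.6180339887) reproduces the worked example and returns 41. Using the more precise constant GOLDEN_A changes the fractional part slightly but gives the same slot for this key.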

Hash tables

A hash table requires less space than a direct-address table and still (usually) has efficient insert, delete, and search operations. Storage requirements can be reduced to $O(|K|)$ and $O(1)$ average running times for operations can still be attained. A record $r$ with key $k$ is stored in slot $h(k)$ where $h$ is a hash function. The hash function maps the universe $U$ of keys into slots of a hash table $T[0..m-1]$:

$$h : U \rightarrow \{0, 1, \ldots, m - 1\}.$$

The size $|U|$ of $U$ is very much bigger than $m$. Key $k$ hashes to slot $h(k)$; $h(k)$ is the hash value of $k$. Since the size $m$ of the hash table $T$ is much smaller than the size $|U|$ of the universe of keys, there may be collisions, where two keys hash to the same slot.

Collision resolution by chaining

Chaining places all records that collide (hash to the same slot) into a linked list.

We want to analyze the running times of the insert, search, and delete operations on a chained hash table.

Chained hashing analysis

It is clear that we can insert a record at the head of a linked list in $\Theta(1)$ time. Notice that this allows duplicate records in the hash table. To avoid this we would need to search the list at each slot and insert the new record at the end of the list upon finding that it is not there.

Define the load factor $\alpha$ of a hash table $T$ to be $n/m$ where $m$ is the number of slots and $n$ records are stored in the table. The load factor $\alpha$ estimates the average number of records stored in a chain, under the assumption of simple uniform hashing, that is, records are assigned to slots following a uniform distribution.

It is clear that the worst case running time for searching a hash table with chaining is $O(n)$. This occurs in the (highly) non-uniform case where all keys hash to the same slot. But, with the simple uniform hashing assumption, the probability is $1/m$ that any key hashes into any of the $m$ slots.

Theorem 2   An unsuccessful search has average running time $O(1 + \alpha)$ under the assumption of simple uniform hashing when collisions are resolved by chaining.

Proof 1   The time for computing $h(k)$ is $O(1)$. The expected size of any list is $\alpha$ and it takes this order of steps to search it. Thus the expected running time for an unsuccessful search is $O(1 + \alpha)$.

Theorem 3   A successful search takes time $O(1 + \alpha/2)$ on average under the assumption of simple uniform hashing when collisions are resolved by chaining.

Proof 2   This proof is similar to the analysis given for the expected running time for successful searches in a binary tree. Any of the $n$ items in the hash table can be sought. The expected length of a list is the average number of compares when inserting an item at the end of the list. The number of compares for a successful search is 1 more than the expected length of the list when the record was inserted. When inserting the $k$th record, the expected length of the list is $(k - 1)/m$. Therefore, letting $A_n$ be the average number of compares in a successful search and reindexing the sum to start at $k = 0$, we have

$$A_n = \sum_{k=0}^{n-1} \frac{1}{n}\left(1 + \frac{k}{m}\right) = 1 + \frac{1}{nm}\sum_{k=0}^{n-1} k = 1 + \frac{n-1}{2m} = 1 + \frac{\alpha}{2} - \frac{1}{2m}.$$

Deletion requires essentially the same time as search, since we must find the node preceding the one to be deleted in order to set its next address to the node after the deleted one.
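The three operations on a chained table can be sketched as follows. The sketch makes several assumptions that are not in the notes: keys are ints, the division method is used as the hash function, the class name ChainedHashTable is made up, and insert searches the chain first (to avoid duplicates) and then links the new node at the head rather than the end.

```java
final class ChainedHashTable {
  // One node of a singly linked chain.
  private static final class Node {
    final int key;
    Node next;
    Node(int key, Node next) { this.key = key; this.next = next; }
  }

  private final Node[] slots;
  private int size;

  ChainedHashTable(int m) { slots = new Node[m]; }

  private int hash(int key) { return Math.floorMod(key, slots.length); }

  // Walks the chain at slot h(key); expected cost is proportional to 1 + alpha.
  boolean search(int key) {
    for (Node p = slots[hash(key)]; p != null; p = p.next) {
      if (p.key == key) return true;
    }
    return false;
  }

  // Inserts key unless it is already present; returns true if the table changed.
  boolean insert(int key) {
    if (search(key)) return false;
    int h = hash(key);
    slots[h] = new Node(key, slots[h]);
    ++size;
    return true;
  }

  // Unlinks key from its chain by tracking the preceding node.
  boolean delete(int key) {
    int h = hash(key);
    Node prev = null;
    for (Node p = slots[h]; p != null; prev = p, p = p.next) {
      if (p.key == key) {
        if (prev == null) slots[h] = p.next;
        else prev.next = p.next;
        --size;
        return true;
      }
    }
    return false;
  }

  // Load factor alpha = n / m.
  double loadFactor() { return (double) size / slots.length; }
}
```

With $m = 5$ slots, the keys 3, 8 and 13 all collide in slot 3 and end up on one chain; deleting 8 leaves 3 and 13 reachable.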

Open-address hashing

Open-address hashing resolves collisions by storing all records in the hash table itself. This is most useful when the number of records to be stored can be estimated in advance. Then enough contiguous memory can be allocated (with some room to spare) to hold all the records, and we avoid the extra space that collision resolution by chaining needs for its links.

We define more general hash functions for open-address hashing. That is, our hash functions will now be functions of two arguments: the key $k$ and the probe number $p$, where $0 \leq p \leq m - 1$. The probe parameter bounds the number of probes made, which guarantees that the algorithms terminate.

Linear probing

Linear probing is the simplest open-addressing scheme. We'll first discuss it in terms of searching a hash table. When a collision occurs we simply probe the next slot in the table. There are three possibilities that may occur:

1. The record is found in the next slot and the search terminates successfully.
2. The next slot is empty, so the record is not found and the search ends unsuccessfully.
3. The next slot is occupied, but the keys do not match, so the following slot is probed.

Insertion is simple as well. Usually, one searches for the record to be inserted, and upon determining that it is not there, inserts it into the table at the empty slot where the search terminated.

For linear probing, the hash function of key k and probe p is

$$\hat{h}(k, p) = (h(k) + p) \bmod m,$$

which simply performs a cyclic scan of the table starting from the initial slot h(k). When the end of the table T[m - 1] is reached, the next slot is T[0].
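Search and insertion with linear probing can be sketched as below. The assumptions are not from the notes: keys are ints, $h(k) = k \bmod m$ is the initial hash, null marks an empty slot, the class name LinearProbingTable is invented, and deletion is left out because it needs extra care (simply emptying a slot could cut a probe sequence short).

```java
final class LinearProbingTable {
  private final Integer[] slots;   // null marks an empty slot

  LinearProbingTable(int m) { slots = new Integer[m]; }

  // The probe sequence (h(k) + p) mod m with h(k) = k mod m.
  private int probe(int key, int p) {
    int m = slots.length;
    return (Math.floorMod(key, m) + p) % m;
  }

  // Returns the slot holding key, or -1 if the search is unsuccessful.
  int search(int key) {
    for (int p = 0; p < slots.length; ++p) {
      int i = probe(key, p);
      if (slots[i] == null) return -1;   // empty slot: key is not in the table
      if (slots[i] == key) return i;     // found
    }
    return -1;                           // every slot was probed
  }

  // Inserts key at the first empty slot of its probe sequence.
  // Returns false if the key is already present or the table is full.
  boolean insert(int key) {
    for (int p = 0; p < slots.length; ++p) {
      int i = probe(key, p);
      if (slots[i] == null) { slots[i] = key; return true; }
      if (slots[i] == key) return false;
    }
    return false;
  }
}
```

With $m = 7$, inserting 10, 17 and 24 (which all hash to slot 3) fills slots 3, 4 and 5. Inserting 6 and then 13 (both hash to slot 6) puts 13 in slot 0, showing the wraparound from $T[m - 1]$ to $T[0]$.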

Quadratic probing

Quadratic probing uses a hash function of the form

$$\hat{h}(k, p) = (h(k) + c_1 p + c_2 p^2) \bmod m.$$

Of course, the values of $c_1$, $c_2$, and $m$ determine whether or not the entire table will be used.

Open-address hashing analysis

Knuth [1] shows that linear probing makes on the order of

$$\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right) \quad \text{probes in a successful search}$$

and

$$\frac{1}{2}\left(1 + \left(\frac{1}{1-\alpha}\right)^2\right) \quad \text{probes in an unsuccessful search,}$$

where $\alpha = n/m$ is the load factor. It is useful to look at these values for various values of the load factor $\alpha$.

$\alpha$       0.50    0.60    0.70    0.80     0.90     0.95      0.99
Successful     1.500   1.750   2.167   3.000    5.500    10.500    50.500
Unsuccessful   2.500   3.625   6.056   13.000   50.500   200.500   5000.500

The increase in probes (comparisons) as $\alpha$ grows is due to a problem called primary clustering, in which long runs of occupied slots build up. Clearly linear probing becomes inefficient as the load factor approaches 1.0 and one should strive to keep it below about 2/3 or so.
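The two probe-count estimates are simple enough to evaluate for any load factor. A small sketch (the class name LinearProbingCost is made up for illustration):

```java
final class LinearProbingCost {
  // Estimated probes in a successful search: (1/2) (1 + 1/(1 - alpha)).
  static double successful(double alpha) {
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
  }

  // Estimated probes in an unsuccessful search: (1/2) (1 + (1/(1 - alpha))^2).
  static double unsuccessful(double alpha) {
    double r = 1.0 / (1.0 - alpha);
    return 0.5 * (1.0 + r * r);
  }
}
```

At the suggested limit $\alpha = 2/3$ the estimates are 2 probes for a successful search and 5 for an unsuccessful one; at $\alpha = 0.9$ they are 5.5 and 50.5.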

  
Problems

Problem 1:

What is the average running time for a successful and unsuccessful sequential search in an ordered table?

Problem 2:

Design an algorithm that deletes a node from a binary tree provided the node is in the tree and does nothing otherwise.

Problem 3:

How many binary trees with n nodes are there? This is not a trivial problem. Look up Catalan numbers.

Problem 4:

What are the odds of selecting an onto function when choosing at random among the functions that map $n$ elements into $n$ elements?

Bibliography

[1] D. E. Knuth, The Art of Computer Programming: Sorting and Searching, vol. 3, Addison-Wesley, third ed., 1998.
 



William D. Shoaff
2000-06-22