

Pattern Matching in Text

by William Shoaff with lots of help



You can download a postscript version of this file (which is prettier) at

http://www.cs.fit.edu/%7Ewds/classes/algorithms/Text/text.ps

Pattern matching

The most basic text problem is PATTERN MATCHING (PMP for short) where we are given a pattern[0..m-1] of length m and text[0..n-1] of length n, where m is generally very much smaller than n, and we wish to know if pattern occurs in text.

We will study several algorithms for PMP. We are interested in their implementation as well as their time and space complexities. Some auxiliary functions will be introduced as needed.

We will write programs using an object-oriented approach with the Java language, thus we will assume a class Text that knows how to find patterns within the text property of the class.

At the moment, I can't think of any class properties, that is, variables for which there is only one copy associated with the class (declared using the static keyword). But just in case we need to define some later, we've made a place for them in the chunk called Class Properties.

The instance properties of the class include, at a minimum, a String of text. Additional instance properties will be defined as required.

Brute-force pattern matching with left-to-right scan

The problem to solve is PMP: Does pattern[0..m-1] occur in text[0..n-1] or not? Our first few algorithms will use a left-to-right scan for the pattern in the text. In contrast, the Boyer-Moore algorithm will scan right-to-left. One thing known from the problem statement is the length of the text and the length of the pattern.

The brute-force (naive) pattern matching algorithm compares pattern[j] with text[i+j] for each i=0..n-m and j=0..m-1 until the pattern is found or the end of the text is reached.

Let i denote the index into text where matching starts; it is bounded below by 0 (is this obvious or not?)

The text index i is bounded above by n-m; otherwise the pattern would slide off the right end of the text (consider Figure 1). If i is the rightmost legal position in text, then pattern[0..m-1] aligns with text[i..n-1], which implies i+m-1=n-1 or i=n-m.


  
Figure 1: Legal alignments of pattern and text.

The index j will serve two purposes. Not only is it a character index in pattern, it will serve as a count of the number of characters matched. Thus j is bounded below by 0 and above by m; it can take on a value 1 larger than a legal index (m-1) but only to serve as a flag that m characters have been matched.

Now, let's define a predicate for testing if the pattern index is legal.

The matching starts at some index i and as long as the pattern matches the text we simply increment the j index.

All this is neatly handled in a while statement.

There are two exits for this (inner) while loop. One is when m characters of pattern have matched text and in this case we simply return true as the answer to the decision problem PMP.

Of course when the pattern is not found we return false.

The other exit of the inner while loop occurs when a mismatch is found at index j in pattern and index i+j in text. The brute-force approach is to shift the pattern one position forward in text and reset the pattern index j to 0.

Putting all these pieces together produces the brute-force left-to-right pattern matching algorithm.
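The pieces described above can be sketched in Java as follows. This is a minimal sketch rather than the notes' own Text class: the class name BruteForceSketch and the choice to pass text and pattern as plain String arguments are assumptions made for illustration.

```java
// Sketch of the brute-force left-to-right matcher.
// BruteForceSketch is a hypothetical name, not the Text class of the notes.
class BruteForceSketch {

    /** Returns true if pattern occurs somewhere in text. */
    static boolean bruteForce(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;                    // start of the current alignment in text
        while (i <= n - m) {          // the pattern must not slide off the right end
            int j = 0;                // pattern index and count of matched characters
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;                  // characters agree, extend the match
            }
            if (j == m) {
                return true;          // all m characters matched
            }
            i++;                      // mismatch: shift one position, j restarts at 0
        }
        return false;                 // every legal alignment failed
    }

    public static void main(String[] args) {
        System.out.println(bruteForce("abaabacabaab", "aabac"));
        System.out.println(bruteForce("abaabacabaab", "abaababa"));
    }
}
```

Note that j reaching m is used only as a flag that a complete match was found; it is never used as an index into pattern.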

Analysis of the brute-force pattern matcher

There are several things to notice about the brute-force pattern matcher. First, it has optimal space complexity: it solves PMP in constant space, needing only a few registers for the indices i and j and the lengths m and n. (In general, space complexity ignores the space required for input and output.) Second, it has worst-case quadratic (O(nm)) time complexity, and this is not very good! Of course, the average-case time complexity is not nearly so bad.
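To see the quadratic behavior concretely, one can count character comparisons on an adversarial input. The sketch below assumes a text consisting only of a's and a pattern of a's ending in b, so every alignment matches m-1 characters before failing. The class name and the repeat helper are hypothetical, introduced only for this illustration.

```java
// Counts the comparisons made by a brute-force scan on a worst-case input.
// BruteForceCountSketch and repeat() are illustrative names only.
class BruteForceCountSketch {

    /** Number of character comparisons the brute-force scan performs. */
    static long countCompares(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        long compares = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                compares++;           // one comparison of pattern[j] with text[i+j]
                if (pattern.charAt(j) != text.charAt(i + j)) {
                    break;            // mismatch ends this alignment
                }
                j++;
            }
            if (j == m) {
                return compares;      // pattern found, the scan stops
            }
        }
        return compares;
    }

    /** Builds a string of count copies of c. */
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = repeat('a', 20);            // n = 20
        String pattern = repeat('a', 4) + "b";    // m = 5
        // Each of the n-m+1 alignments costs m comparisons on this input.
        System.out.println(countCompares(text, pattern));
    }
}
```

On this input the count is m(n-m+1), which is the product form behind the O(nm) bound.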

The Morris-Pratt pattern matcher

The brute-force pattern matcher does not use the information gathered once a mismatch occurs; it throws away the knowledge that pattern[0..j-1] == text[i..i+j-1] and pattern[j] != text[i+j] and simply restarts by comparing pattern[0] with text[i+1].

We will see how to pre-process the pattern so that this information is not wasted. The cost will be an increase in the space needed in pattern matching, but this will reduce the worst case time complexity for PMP. The preprocessing is accomplished by exploiting the invariant

pattern[0..j-1] = text[i..i+j-1].
Later we will make use of the mismatch information to improve the process even more.

The algorithm looks exactly like the brute-force algorithm with the exception of how shifts are made once a mismatch is found.

Consider what we know when pattern[0..j-1] matches text[i..i+j-1]. Figure 2 shows an alignment of pattern with text at index i, with matching characters at positions i..i+j-1. Suppose pattern is shifted from position i in text to position i+s. There are three conditions such a shift s should satisfy:


  
Figure 2: Alignment with match between pattern[0..j-1] and text[i..i+j-1].

1.
A shift of pattern s positions right should be safe,
2.
A shift of pattern s positions right should be feasible,
3.
The shift s right should be at least one position.

A shift from i to i+s is safe if pattern can not occur at any position in between. A safe shift is feasible if a match could occur at i+s (based on our current knowledge).

Since we are assuming a mismatch between pattern[j] and text[i+j], a shift of s=1 is safe; this is the brute-force (or conservative) approach. Often such a small shift is infeasible, so we can safely make larger shifts until a feasible shift is found.

Figure 3 extends figure 2 to show the configuration when a shift occurs. For the shift to be feasible, Figure 3 shows that prefix pattern[0..j-s-1] must be a proper suffix of pattern[0..j-1]. To also be safe, s must be the smallest such feasible shift.


  
Figure 3: A safe feasible shift of pattern.

Word borders and periods

Before going farther let's introduce some terms and ideas that are helpful in text algorithms.

The border of a word w is any word that is both a prefix and suffix of w. For example, the word

w=abaabaaabaaba
has proper borders
abaaba, aba, a, and $\epsilon$
where $\epsilon$ stands for the empty word. Of course, the complete word itself is both a prefix and a suffix of itself, but this is not a proper border. (The word proper is used in a similar context with respect to sets and subsets: a proper subset is a subset, but not the set itself.)

Define w.border() to be the longest proper border of a non-empty word w. We can iterate this function obtaining a list of all borders of w, for the sample word w=abaabaaabaaba

w.border()=abaaba, abaaba.border()=aba, aba.border()=a, a.border()=$\epsilon$

Borders have a dual notion called periods, which are integers p such that $0 < \texttt{p} \leq \texttt{w.length()}$ and prefix w[0..k-1] equals subword w[p..p+k-1], where k=w.length()-p. The periods of w=abaabaaabaaba are 7, 10, 12, and 13 since


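\begin{tabular}{lccccccccccccc}
index & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\
w & a & b & a & a & b & a & a & a & b & a & a & b & a \\
p=7 & & & & & & & & a & b & a & a & b & a \\
p=10 & & & & & & & & & & & a & b & a \\
p=12 & & & & & & & & & & & & & a \\
p=13 & & & & & & & & & & & & & \\
\end{tabular}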
where the last period 13 corresponds to the empty border. Notice that the period plus the length of the border equals the word's length \begin{displaymath}\texttt{p + w.border().length() = w.length()}.\end{displaymath}
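The relation between borders and periods can be checked mechanically. The following is a deliberately naive sketch (quadratic time) that tests every candidate border length directly against the definition; the class and method names are assumptions made for this illustration and are not part of the notes' Text class.

```java
import java.util.ArrayList;
import java.util.List;

// Naive enumeration of proper borders and periods, straight from the definitions.
// BorderPeriodSketch is a hypothetical name used only for this example.
class BorderPeriodSketch {

    /** Lengths of all proper borders of w, longest first (0 is the empty border). */
    static List<Integer> properBorderLengths(String w) {
        List<Integer> lengths = new ArrayList<>();
        for (int len = w.length() - 1; len >= 0; len--) {
            String suffix = w.substring(w.length() - len);   // last len characters
            if (w.startsWith(suffix)) {                      // also a prefix?
                lengths.add(len);
            }
        }
        return lengths;
    }

    /** All periods of w, smallest first: each is w.length() minus a border length. */
    static List<Integer> periods(String w) {
        List<Integer> result = new ArrayList<>();
        for (int len : properBorderLengths(w)) {
            result.add(w.length() - len);
        }
        return result;
    }

    public static void main(String[] args) {
        String w = "abaabaaabaaba";
        System.out.println(properBorderLengths(w));
        System.out.println(periods(w));
    }
}
```

For the sample word the border lengths come out as 6, 3, 1, 0 and the periods as 7, 10, 12, 13, matching the lists given in the text.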

Shifting a string by its shortest period aligns its borders. Such shifts are feasible. As we will see below (in the section on the computation of borders), one can readily compute the length of the border of a string. So suppose we compute the length |pattern[0..j-1].border()| for each index j=1..m and store this integer in the array entry border[j] (we'll see that border[0] should be set to -1). When a mismatch occurs at pattern index j, the matched prefix pattern[0..j-1] is shifted by its period j-border[j]. Here's an example:

\begin{displaymath}\begin{array}{lcccccccccc}
\mbox{\texttt{pattern} index \text...
...:} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & \ldots \\
\end{array}\end{displaymath}
The mismatch occurs at j=3 once we've matched aba, which has border a of length 1. The period of aba is 2. Shifting pattern 2=3-1 positions produces an alignment of the border of aba.

\begin{displaymath}\begin{array}{lcccccccccc}
\mbox{\texttt{pattern} index \text...
...:} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & \ldots \\
\end{array}\end{displaymath}
We can restart the matching at j=1=border[j]. We don't need to recheck the match of pattern[0] with text[3] (of course we get an immediate mismatch at the next position -- which we'll handle later).

A shift by the period is both safe and feasible. From the above analysis we know that the shift satisfies

s = j - border[j]
where border[j] is the length of the border of pattern[0..j-1].

Now let's consider the boundary case where j=0 when the mismatch occurs. That is pattern[0] is not the same as text[i]. Clearly a shift of s=1 is warranted, and

s = 1 = 0 - border[0]
implies border[0]=-1. Because of this we must restart the matching process from pattern index j=max(0, border[j]).
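The shift rule s = j - border[j] and the restart rule j = max(0, border[j]) can be put together as in the sketch below. To keep the example self-contained without assuming how the table is computed, the border array is passed in as a parameter; the values hard-coded in main for the pattern abaababa were worked out by hand from the definition of a border. The class and method names are hypothetical.

```java
// Sketch of the Morris-Pratt scan, given a precomputed border table.
// border[j] is the length of the longest proper border of pattern[0..j-1],
// with border[0] = -1. MorrisPrattSketch is an illustrative name only.
class MorrisPrattSketch {

    static boolean morrisPratt(String text, String pattern, int[] border) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;                        // alignment of pattern in text
        int j = 0;                        // number of characters known to match
        while (i <= n - m) {
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;
            }
            if (j == m) {
                return true;              // complete match
            }
            i += j - border[j];           // shift by the period of pattern[0..j-1]
            j = Math.max(0, border[j]);   // the aligned border need not be rechecked
        }
        return false;
    }

    public static void main(String[] args) {
        // Border lengths of the prefixes of "abaababa", computed by hand.
        int[] border = {-1, 0, 0, 1, 1, 2, 3, 2, 3};
        System.out.println(morrisPratt("abaabaabaababa", "abaababa", border));
        System.out.println(morrisPratt("abaabacabaab", "abaababa", border));
    }
}
```

Because the sum i+j is unchanged by a shift with border[j] >= 0, the scan never moves backward in text.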

  
Computation of borders

We want to compute border[0..m] for pattern, which has length m; recall that border[k] is the length of the border of the prefix pattern[0..k-1], and a shift by k-border[k] aligns the border at the start of pattern[0..k-1] with the border at the end of pattern[0..k-1]. The call pattern.computeBorders() is made before the call text.MorrisPratt(pattern).

Remember that the Text variable pattern has an internal representation that uses a String named text of length n. The variables pattern and m will be used when discussing the algorithm, but in the code text and n will be used when we need to refer to the instance variables.

We know border[0] = -1 and so we can set this initial value. Also, we'll store the length of the current border in a local variable k.

Now we want to compute border[1],...,border[m], where m is the length of pattern. This will be done in a for loop that loops over each non-empty prefix of pattern.

Let's pretend we've computed k=border[j-1] of pattern[0..j-2] for some j >= 1, and we want to compute border[j]. Figure 4 shows that when pattern[k] == pattern[j-1] then border[j] = k+1. That is, when the character after the prefix matches the character after the right border, the border can be extended by 1.


  
Figure 4: Extending the border when characters after prefix and right border match. Border of pattern[0..j-2] has length k and pattern[k]=pattern[j-1].

Now let's consider what happens when there is a mismatch, that is pattern[k] != pattern[j-1]. This case is a little more complex to figure out, but figure 5 shows that while a mismatch occurs, the current border k needs to be reduced to border[k] until there is a match or no border is found (k = -1).


  
Figure 5: Reducing the border after a mismatch.

Example 1:

Consider the pattern abaababa, of length 8. We want to fill out the array border[0..8] of border lengths.


\begin{tabular}{cllrcc}
\textbf {j} & \textbf {Prefix} & \textbf {Border String}...
... & \texttt{abaababa} & \textbf{aba} & 3 & $\epsilon$\space & a \\
\end{tabular}

Using the above code chunks we can put together the computeBorders() function easily.
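One way the chunks described above might fit together is sketched here. It follows the text's conventions (border[0] = -1, a local k holding the current border length, extension when pattern[k] == pattern[j-1], reduction to border[k] on a mismatch), but it is written as a static method taking the pattern as a String argument, which is an assumption; the notes make computeBorders() an instance method of Text.

```java
import java.util.Arrays;

// Sketch of the border-table computation for the Morris-Pratt matcher.
// ComputeBordersSketch is a hypothetical name used only for this example.
class ComputeBordersSketch {

    /** border[j] = length of the longest proper border of pattern[0..j-1]. */
    static int[] computeBorders(String pattern) {
        int m = pattern.length();
        int[] border = new int[m + 1];
        border[0] = -1;                   // boundary value derived in the text
        int k = -1;                       // length of the current border
        for (int j = 1; j <= m; j++) {
            // Here k == border[j-1]. Reduce k until the character after the
            // prefix border matches pattern[j-1], or no border is left (k == -1).
            while (k >= 0 && pattern.charAt(k) != pattern.charAt(j - 1)) {
                k = border[k];
            }
            k++;                          // extend the border by one character
            border[j] = k;
        }
        return border;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeBorders("abaababa")));
    }
}
```

For the pattern abaababa of Example 1 this yields the border lengths -1, 0, 0, 1, 1, 2, 3, 2, 3 for j = 0..8.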

One final act, so the Java compiler does not complain: we declare the integer array border[] used in the Morris-Pratt algorithm.

Analysis of the Morris-Pratt pattern matcher

The need for the array border[0..m] increases the space complexity from the constant space requirement for the brute-force algorithm. The benefit is a linear worst case time complexity.

Theorem 1: The maximal number of symbol compares in the Morris-Pratt algorithm is 2n-m.

Proof: There is at most one unsuccessful comparison for each index i, which ranges from 0 to n-m. We can give an upper bound for the number of successful compares by considering the sum i+j. The least value for i+j is 0 and the greatest value is n-1. Each time a successful compare is made i+j increases by 1 and it never decreases, thus there are at most n successful compares. Finally, the successful and unsuccessful compares cannot both attain their maximum, thus there are at most n + (n-m+1) - 1 = 2n-m compares in the Morris-Pratt algorithm.

Knuth-Morris-Pratt pattern matching

The Morris-Pratt algorithm can be improved by using additional information known at the time a mismatch occurs. In particular, the complete invariant is

pattern[0..j-1]==text[i..i+j-1] and pattern[j] != text[i+j].

The Knuth-Morris-Pratt (KMP) algorithm [2] makes use of this additional one-bit of mismatch information to allow longer shifts of pattern in text. Otherwise, the algorithm is the same as the previous ones.

One might call the idea used in the KMP shift ``strict borders.'' The lengths of these strict borders are stored in an array strictBorder[], and the shift is computed just as in the Morris-Pratt case using this new array.

The strictBorder[] array used in the Knuth-Morris-Pratt algorithm is declared, and its elements initialized to -1. To fill out the strictBorder[] array, consider what we know:

pattern[0..j-1]==text[i..i+j-1] and pattern[j] != text[i+j].

So if pattern[0..k] is the border of pattern[0..j-1] and pattern[k+1] = pattern[j] != text[i+j], then there will be an immediate mismatch when we shift to align borders. In this case we can safely shift farther, aligning the border of pattern[0..k] with the tail of pattern[0..j-1]. On the other hand, if pattern[k+1] != pattern[j], then perhaps pattern[k+1] = text[i+j], and we can only safely shift by the period of pattern[0..j-1], which aligns its border.
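The reasoning above suggests one possible way to derive strict borders from an existing border table; the sketch below is an assumption about how this could be coded, not the notes' own implementation. Here k denotes a border length (so the character following the prefix border is pattern[k]), and the table is filled only for 0 <= j < m, the positions where a mismatch can occur. The border array is passed in, with hand-computed values for abaababa in main.

```java
import java.util.Arrays;

// Sketch: strict borders computed from ordinary border lengths.
// StrictBorderSketch is a hypothetical name used only for this example.
class StrictBorderSketch {

    /**
     * strictBorder[j] is the length of the longest border of pattern[0..j-1]
     * whose following character differs from pattern[j], or -1 if none exists.
     */
    static int[] computeStrictBorders(String pattern, int[] border) {
        int m = pattern.length();
        int[] strictBorder = new int[m];
        for (int j = 0; j < m; j++) {
            int k = border[j];
            if (k >= 0 && pattern.charAt(k) == pattern.charAt(j)) {
                // Aligning this border would repeat the mismatch, so fall back
                // to the strict border already computed for the shorter prefix.
                strictBorder[j] = strictBorder[k];
            } else {
                strictBorder[j] = k;      // the ordinary border is already strict
            }
        }
        return strictBorder;
    }

    public static void main(String[] args) {
        int[] border = {-1, 0, 0, 1, 1, 2, 3, 2, 3};   // for "abaababa"
        System.out.println(Arrays.toString(computeStrictBorders("abaababa", border)));
    }
}
```

The fallback is valid because every shorter border of pattern[0..j-1] is a border of its longest border, and strictBorder[k] was computed earlier in the loop since k < j.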

Example 2:

Consider the pattern abaababa, of length 8. We want to fill out the array border[0..8] of border lengths along with the corresponding strict borders. The index j in the left column of the table below denotes the number of characters that have been matched. Note that the Morris-Pratt (MP) shift is computed by j - border[j] and the matching restarts at j = border[j] (or j = 0, if border[j] is negative). For the Knuth-Morris-Pratt (KMP) algorithm, the shift is j - strictBorder[j] and the matching restarts at j = strictBorder[j] (or j = 0, if strictBorder[j] is negative).


\begin{tabular}{lllrrrrrr}
\textbf {j} & \textbf {Prefix} & \textbf {Border Stri...
...tt{abaababa} & \textbf{aba} & 3 & 5 & $\epsilon$\space & a & 8 \\
\end{tabular}

Example 3:

Here's an example of the shift that can occur on a mismatch.


\begin{tabular}{lllllllllllll}
index: & 0 & 1 & 2 & 3 & 4 & 5 & 6 \\
pattern: &...
... & a \\
text: & a & b & a & a & b & a & c & a & b & a & a & b \\
\end{tabular}

pattern[6] != text[0+6] and the border of abaaba is aba, which has length 3, implying a shift of 6-3=3. A shift of 3 leads to an immediate mismatch since pattern[3]=a will be compared with text[6]=c. Thus, we can consider the border of the border of abaaba, that is the border of aba, or a, which has length 1 and shift by 6-1=5.

Analysis of the Knuth-Morris-Pratt pattern matcher

The KMP pattern matcher has space complexity S(n+m)=O(m). This reflects the storage for the array strictBorder[0..m] and the constant space required for indices and lengths. The time complexity T(n+m) = O(n+m) has the same order as that of the Morris-Pratt algorithm; KMP generally makes fewer comparisons, but on some inputs it makes no fewer.

Right-to-left scanning for pattern matching

Now we want to develop a brute-force algorithm that will match pattern against text using a right-to-left scan of pattern. The previous left-to-right algorithm is modified in these ways:

1.
pattern index j starts at the end of pattern.
2.
The inner scan decrements j.
3.
When j runs off the left-end (j=-1) a complete match has occurred.
4.
The shift moves the pattern right one place and resets j to the end of pattern.

The right-to-left scan starts at the end of pattern.

And decrements pattern index j so long as text matches pattern.

When a mismatch occurs, slide pattern one position right (i = i+1) and reset j to point to the end of pattern.
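The four modifications listed above lead to a scan like the following. As with the earlier brute-force matcher, this is a minimal sketch that takes text and pattern as String arguments; the class name is hypothetical.

```java
// Sketch of the brute-force matcher with a right-to-left scan of pattern.
// RightToLeftSketch is an illustrative name only.
class RightToLeftSketch {

    static boolean bruteForceRightToLeft(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;                        // alignment of pattern in text
        while (i <= n - m) {
            int j = m - 1;                // start at the end of pattern
            while (j >= 0 && pattern.charAt(j) == text.charAt(i + j)) {
                j--;                      // scan leftward while characters match
            }
            if (j == -1) {
                return true;              // ran off the left end: complete match
            }
            i++;                          // slide right one place; j is reset above
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(bruteForceRightToLeft("abaabacabaab", "bacab"));
        System.out.println(bruteForceRightToLeft("abaabacabaab", "abaababa"));
    }
}
```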

The Boyer-Moore algorithm

The Boyer-Moore algorithm [1] uses the knowledge gained by the above brute-force algorithm to leverage an improved pattern matcher. What do we know? When a mismatch occurs, we know the invariant

pattern[j] != text[i+j]
and
pattern[j+1..m-1] = text[i+j+1..i+m-1]
This invariant is shown in figure 6.


  
Figure 6: Alignment with match using right-to-left scan.

Let's pretend, after a mismatch, that we shift pattern s positions to the right where 1 <= s <= j. This aligns pattern[0..m-1-s] and pattern[s..m-1] as shown in figure 7. In particular, pattern[j-s] aligns with pattern[j] and text[i+j]. Thus, we require that the shift s satisfy:

1.
  pattern[j-s] != pattern[j]. If they were equal, a mismatch between pattern[j-s] and text[i+j] would occur; such a shift is not feasible.
2.
  A border of the reversed prefix pattern[m-1..j-s+1] has length m-1-j, that is, pattern[m-1..j+1] = pattern[m-1-s..j+1-s].
Notice condition 2 is a statement about some border of a prefix of the reversed string; this border may not be the border.


  
Figure 7: A Boyer-Moore shift between 1 and j positions.

Now let's pretend that we shift j+1 <= s < m characters. Such a shift aligns pattern[0..m-1-s] and pattern[s..m-1] with text[i+s..i+m-1], see figure 8. Such a shift s satisfies the condition:

1.
  pattern has some border pattern[0..m-1-s] of length m-j-1 or less.


  
Figure 8: A Boyer-Moore shift between j+1 and m positions.

When no shift s < m satisfies either the first two conditions or the third condition, we can safely shift pattern the maximal amount m. Given the pattern, we can compute whether there are shifts satisfying the conditions above. The smallest such shift is both safe and feasible. For each j this shift is stored in a look-up table which is used when a mismatch occurs. The algorithm is identical to the brute-force right-to-left scan, except for this look-up of the shift.

We'll call the table of shifts goodSuffix[]. Then when a mismatch occurs on pattern index j, the shift goodSuffix[j] is added to i and pattern index j is reset to the end of pattern.

<<Boyer-Moore shift of pattern>>= i += pattern.goodSuffix[j]; j = m-1;


\begin{tabular}{lllcccccc}
\textbf {j} & \textbf {Suffix} & \textbf {Border} & \...
...\texttt{abaababa} & \textbf{aba} & 3 & $\epsilon$\space & a & 2\\
\end{tabular}
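As a cross-check on the conditions above, the shift table can be built by testing each candidate shift directly. The sketch below is not the notes' computeGoodSuffix(): it is a slower (cubic time in the worst case) stand-in that, for each mismatch position j, takes the smallest s in 1..m for which the characters of pattern that would land under the matched tail agree with it and the character landing under the mismatch differs from pattern[j]. Positions that fall off the left end of pattern impose no constraint, which covers the long-shift case. All names here are assumptions.

```java
// Sketch: Boyer-Moore scan with a good-suffix table built naively
// from the shift conditions. BoyerMooreSketch is an illustrative name only.
class BoyerMooreSketch {

    /** True if shifting by s is compatible with a mismatch at pattern index j. */
    static boolean shiftIsFeasible(String pattern, int j, int s) {
        int m = pattern.length();
        // Condition 1: the character moved under the mismatch must differ.
        if (j - s >= 0 && pattern.charAt(j - s) == pattern.charAt(j)) {
            return false;
        }
        // Conditions 2 and 3: whatever lands under the matched tail must agree.
        for (int k = j + 1; k < m; k++) {
            if (k - s >= 0 && pattern.charAt(k - s) != pattern.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    /** goodSuffix[j] = smallest feasible shift for a mismatch at j (at most m). */
    static int[] computeGoodSuffixNaive(String pattern) {
        int m = pattern.length();
        int[] goodSuffix = new int[m];
        for (int j = 0; j < m; j++) {
            int s = 1;
            while (s < m && !shiftIsFeasible(pattern, j, s)) {
                s++;
            }
            goodSuffix[j] = s;            // s == m when no shorter shift works
        }
        return goodSuffix;
    }

    static boolean boyerMoore(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return true;                  // the empty pattern occurs trivially
        }
        int[] goodSuffix = computeGoodSuffixNaive(pattern);
        int i = 0;
        while (i <= n - m) {
            int j = m - 1;
            while (j >= 0 && pattern.charAt(j) == text.charAt(i + j)) {
                j--;
            }
            if (j == -1) {
                return true;
            }
            i += goodSuffix[j];           // look up the shift; j is reset above
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(boyerMoore("abaabaabaababa", "abaababa"));
        System.out.println(boyerMoore("abaabacabaab", "abaababa"));
    }
}
```

For j = 0 the smallest feasible shift found this way is the smallest period of pattern, which is m - border[m].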

Computing the goodSuffix[] array

We start by declaring an instance of the goodSuffix[] array. Its length will be set to m when pattern is created and each element will be initialized to zero, the Java default. (Remember that the Text variable pattern has an internal representation with a String named text of length n. The variable m will be used when discussing the algorithm, but n will be used in the code.)

The code for goodSuffix[] is abstruse. Essentially, we want to test the conditions discussed above. Our implementation is driven more by a need for clarity than efficiency in time and space. Here is the complete algorithm for computeGoodSuffix().

Let's start with an auxiliary routine that reverses the characters in a string. This will be useful in testing condition 2.
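A reversal helper of this kind could be as small as the sketch below, which relies on the standard StringBuilder.reverse() method. The class and method names are hypothetical, and the sketch assumes the pattern contains no surrogate pairs (StringBuilder.reverse() keeps such pairs in order rather than reversing them char by char).

```java
// Sketch of a string-reversal helper. ReverseSketch is an illustrative name only.
class ReverseSketch {

    /** Returns the characters of s in reverse order. */
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("abaababa"));
    }
}
```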

Borders for both the pattern and its reverse are used, and a variable called s will denote the shift.

We'll start by being optimistic and set, for each j, goodSuffix[j] = m, the largest possible shift. As we find that shorter shifts are safe and feasible we'll reset goodSuffix[j] to these smaller values.

Boundary conditions

To develop the shift table for the Boyer-Moore algorithm, we'll consider boundary cases first.

First compare is a mismatch.

When there is an immediate mismatch between pattern[m-1] and text[i+m-1], a shift of 1 is appropriate, but so is a shift by the smallest value s such that pattern[m-1-s] != pattern[m-1]. This is condition 1 for the case j=m-1. In terms of the reversed pattern, the requirement is that s be the smallest value satisfying reverse[s] != reverse[0].

No compare is a mismatch.

Here it must be that j, our pattern position index, has fallen off the left end of pattern, that is j == -1. Our decision algorithm simply returns true when this occurs.

Mismatch on the last compare.

Now let's consider the case that pattern[0] != text[i]. That is, we've matched all characters except the first. I hope it is obvious that a shift by the period of pattern, that is m-border[m], is both safe and feasible. The border is a good suffix where a shift by the period will produce a potential pattern-text match; no shorter shift can.

Non-boundary cases

We restrict our attention to the case where a mismatch occurs at pattern[j] and $0 < j < m-1$. This scenario is shown in figure 6, and there are two cases to consider. These are illustrated in figures 7 and 8. In figure 7, the proposed shift s is no more than j. In figure 8, s is larger than j, but less than m.

A shift 1 <= s <= j.

Let's pretend that the situation of figure 7 occurs to determine how to build the code that enforces the situation. Thus s is between 1 and j where a mismatch occurs at pattern[j]. Figure 7 illustrates that two conditions must hold:

\begin{equation}
\texttt{pattern[j-s] != pattern[j]}
\end{equation}
and
\begin{equation}
\texttt{pattern[j+1-s..m-1-s] = pattern[j+1..m-1]}.
\end{equation}
Condition 1 is the Knuth-Morris-Pratt strict border condition and condition 2 is the Morris-Pratt border condition for the reversed pattern. With reverse being the reversal of pattern, condition 2 says that the prefix reverse[0..m-j+s-2] has a border of length m-j-1.

For a fixed j with 0 < j < n-1 (in the code, n is the length of pattern), we'll start the shift s at 1 and increment s until both conditions hold or s exceeds j. So, for each proposed shift s, we'll test whether the strict border condition 1 holds, and when it does we'll determine whether the border condition 2 holds on the reverse pattern; variable k is the length of a border. The strict border condition is:

The prefix of reverse whose borders we want to test is reverse[0..n-j+s-2], which has length n-j-1+s. We'll start by setting variable k to the length of the border of this prefix; k will be reduced while it is larger than the length of the tail of pattern we want to match, that is, n-1-j.

The length of the tail pattern[j+1..n-1] that has been matched when a mismatch occurs at j is (n-1)-(j+1)+1 = n-1-j.

The next smaller border is found by looking at the border of the border.

At this point a shift s that satisfies conditions 1 and 2 has been found. We can arrange to exit the while (s <= j) loop by setting s = j; it will then be incremented, forcing an exit of the loop.

Putting all of these pieces together gives the code below.

A shift j < s.

Now let's develop the code when no shift between 1 and j can be found, but perhaps a shift greater than j exists. Figure 8 depicts the situation that is represented by the equation
\begin{equation}
\texttt{pattern[0..m-1-s] = pattern[s..m-1]}.
\end{equation}
Here's the outline of what we need to do.

The shortest shift of the type under consideration is determined by the period of pattern. We'll initialize k to the length of pattern's border and let k become successive (shorter) border lengths as we search for longer shifts.

<<Set the border length>>= int k = border[n];

The search continues as long as there is a non-empty border. After each search for a shift with one border, we reset the border length k to the length of the next border.

With a border of length k the period to shift aligning pattern borders is n-k.

A placeholder start will be used to control the search over j. The first time through pattern index j starts at 0.

Once we've searched over a range start <= j < s, the next search can be over a range that begins with start = s.

And that is the code which enforces condition 3.

Concerns about the derivation of computeGoodSuffix()

The above derivation of computeGoodSuffix() is not very efficient, but it may be more clear than other developments of the code.

The Last Occurrence Function

The classical Boyer-Moore algorithm uses what is known as the last occurrence or bad character heuristic. It says: when a mismatch occurs between pattern[j] and text[i+j], find the right-most (last) occurrence of text[i+j] in pattern and shift to align these; see figure 9, which shows this shift.


  
Figure 9: Character (a) at the pattern's end does not match the bad character (b) in text. The last occurrence of b in pattern is at j. No shift less than m-1-j could produce a pattern-text match.

When the last occurrence of b in pattern is at index k, that is lastOccurrence[b] = k, the shift on a mismatch at j is j - lastOccurrence[b] = j-k. Notice that when k > j this is a negative (leftward) shift! Also, when b does not occur in pattern a shift of j+1 characters is appropriate, thus we'll define lastOccurrence[b] = -1 when b does not occur in pattern. To create a lastOccurrence[] table requires |A| space (A is the alphabet and |A| is its cardinality).

Some authors eschew the use of a lastOccurrence[] table, others extol it. It does require space that is dependent on the alphabet, something we've not seen before. Its utility depends on the alphabet size and the distribution of characters in pattern.
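A last-occurrence table might be built as in the sketch below. Two choices here are assumptions rather than anything the notes prescribe: the table is a HashMap keyed by character instead of an array of size |A| (so absent characters are handled by a default of -1), and the helper badCharacterShift clamps the result to at least 1 so that a negative shift is never applied.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the last-occurrence (bad character) table.
// LastOccurrenceSketch and its method names are illustrative only.
class LastOccurrenceSketch {

    /** Maps each character of pattern to the index of its right-most occurrence. */
    static Map<Character, Integer> computeLastOccurrence(String pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int k = 0; k < pattern.length(); k++) {
            last.put(pattern.charAt(k), k);   // later occurrences overwrite earlier ones
        }
        return last;
    }

    /** Shift for a mismatch at pattern index j against text character b. */
    static int badCharacterShift(Map<Character, Integer> last, int j, char b) {
        int k = last.getOrDefault(b, -1);     // -1 when b does not occur in pattern
        return Math.max(1, j - k);            // assumption: never shift leftward
    }

    public static void main(String[] args) {
        Map<Character, Integer> last = computeLastOccurrence("abaababa");
        System.out.println(last);
        System.out.println(badCharacterShift(last, 6, 'c'));
    }
}
```

In a full matcher this shift would typically be combined with the goodSuffix[] shift by taking the larger of the two.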

Analysis of the Boyer-Moore pattern matcher

Establishing a tight upper bound on the number of comparisons is beyond the scope of these notes. A bound of 4n is fairly simple to prove, although 3n is a better approximation. When pattern is relatively long and the alphabet is large, Boyer-Moore is likely to be the most efficient pattern matcher. Empirically, in the average case, the number of compares is often sub-linear, that is cn where c < 1.

Finishing up

Testing the Algorithms

Problems

Problem 1:

An alternative to terminating the search in the brute-force sequential search algorithm is to continue looking for a second or further occurrences of the pattern. An on-line algorithm which continually accepts input until an end-of-input marker is found would usually do this. Rewrite the code to handle a continuous stream of characters. It will output a stream of 0's and 1's indicating that the pattern was not or was found.

Problem 2:

Show that in the worst case, bruteForce's inner while loop can execute m times and the outer while loop can execute n-m+1 times. Show that the maximum number of comparisons is (n+1)^2/4 and give example strings for pattern and text where this worst case is realized. Hint: maximize the quadratic expression nm-m^2+m as a function of m.

Problem 3:

Turn the brute-force algorithm given in these notes into a working program with input and output. Test the average case behavior of the code. Use the words in a dictionary (for example /usr/dict/words on a Unix system) as patterns for which to search. Find a large text document on the World Wide Web (for example www.gutenberg.net has a collection of great books that can serve as text files).

Problem 4:

Define Fibonacci words over the alphabet $\{a, b\}$ by
$F_0 = \epsilon,\, F_1 = b,\, F_2 = a,\,\mbox{and}\, F_n=F_{n-1}F_{n-2}\,\mbox{for}\, n > 2$
Determine the length of $F_n$. Find the periods and borders of $F_n$.

Problem 5:

Develop an algorithm that computes the strict borders of a pattern. You may find it useful to know that strictBorder[j] = border[j] if pattern[border[j]] != pattern[j]; when this inequality does not hold, replace the candidate length k = border[j] by border[k] until pattern[k] != pattern[j] holds or k becomes negative. Show that your algorithm is correct and estimate its time complexity.

Problem 6:

Provide a time and space complexity analysis of the presented code for computeGoodSuffix().

Problem 7:

Develop an alternative, more efficient (in time and space) algorithm for computeGoodSuffix(). Some things to consider: declaring the reverse of pattern requires significant extra space, which can be eliminated. The time spent computing goodSuffix[n-1] is large; this computation can be folded into the computation when goodSuffix[j] <= j.

Problem 8:

Write a program computeLastOccurrence(), which when given an alphabet A and a pattern determines the last (rightmost) occurrence of each character of A in pattern. Use this algorithm to improve the Boyer-Moore algorithm. Empirically compare the time and space complexity of Boyer-Moore with and without this improvement by using a large text and multiple patterns.

Bibliography

1
R. S. BOYER AND J. S. MOORE, A fast string searching algorithm, Communications of the ACM, 20 (1977), pp. 762-772.

2
D. E. KNUTH, J. H. MORRIS, AND V. R. PRATT, Fast pattern matching in strings, SIAM Journal on Computing, 6 (1977), pp. 323-350.


William Shoaff
2000-11-13