
Pattern Matching in Text

William D. Shoaff
Florida Institute of Technology

Pattern matching

The most basic text problem is PATTERN MATCHING (PMP for short) where we are given a pattern[0..m-1] of length m and text[0..n-1] of length n, where m is generally very much smaller than n, and we wish to know if pattern occurs in text.

We will study several algorithms for PMP. We are interested in their implementation as well as their time and space complexities. Some auxiliary functions will be introduced as needed.

We take an object-oriented approach using the Java language; thus we will assume a class Text that knows how to find patterns within the text property of the class.

<Text.java>=
public class Text {
  <Class properties>
  <Instance properties>
  <Brute-force pattern matching with left-to-right scan>
  <Morris-Pratt pattern matching>
  <Knuth-Morris-Pratt pattern matching>
  <Brute-force pattern matching with right-to-left scan>
  <Boyer-Moore pattern matching>
  <Auxiliary functions>
  <Test the pattern matching algorithms>
}

At the moment, I can't think of any class properties, that is, variables of which there is only one copy associated with the class (declared using the static keyword). But just in case we need to define some later, we'll make a place for them.

<Class properties>= (<-U)

The instance properties of the class include, at a minimum, a String of text.

Additional instance properties will be defined as required.

<Instance properties>= (<-U) [D->]
  String text;

Brute-force pattern matching with left-to-right scan

The problem to solve is PMP: Does pattern[0..m-1] occur in text[0..n-1] or not? Our first few algorithms will use a left-to-right scan for the pattern in the text. In contrast, the Boyer-Moore algorithm will scan right-to-left. One thing known from the problem statement is the length of the text.

<initialize text length>= (U-> U-> U-> U-> U-> U-> U-> U->)
  int n = text.length();

And the length of pattern

<initialize pattern length>= (U-> U-> U-> U-> U->)
  int m = pattern.length();

The brute-force (naive) pattern matching algorithm compares pattern[j] with text[i+j] for each i=0..n-m and j=0..m-1 until the pattern is found or the end of the text is reached.

Let i denote the index into text where matching starts; it is bounded below by 0 (is this obvious or not?)

<set text index to start of text>= (U-> U-> U-> U-> U->)
  int i = 0; 

The text index i is bounded above by n-m; otherwise pattern would slide off the right end of text (see Figure [->]). If i is the rightmost legal position in text, then pattern[0..m-1] aligns with text[i..n-1], which implies i+m-1=n-1, or i=n-m.


Legal alignments of pattern and text.[*]

<text index is legal>= (U-> U-> U-> U-> U->)
  (i <= n-m) 

The index j will serve two purposes. Not only is it a character index in pattern, it will serve as a count of the number of characters matched. Thus j is bounded below by 0 and above by m; it can take on a value 1 larger than a legal index (m-1) but only to serve as a flag that m characters have matched.

<set pattern index to start of pattern>= (U-> U-> U->)
  int j =0;

Define a predicate for testing if the pattern index is legal.

<pattern index is legal>= (U->)
  (j<m)

The matching starts at some index i and as long as the pattern matches the text we simply increment the j index

<pattern[j] matches text[i+j]>= (U-> U->)
  (pattern.charAt(j) == text.charAt(i+j))

<increment pattern index>= (U->)
  j++;

All this is neatly handled in a while statement.

<left-to-right scan>= (U-> U-> U->)
   while (<pattern index is legal> && <pattern[j] matches text[i+j]>) {
      <increment pattern index>
   }

There are two exits for the inner while loop. One is when m characters of pattern have matched text and in this case we simply return true as the answer to the decision problem PMP.

Problem 1: An alternate is to continue looking for a second or more occurrences of the pattern. An on-line algorithm which continually accepts input until an end of input marker is found would usually do this. Re-write the code to handle a continuous stream of characters. It will output a stream of 0's and 1's indicating the pattern was not or was found.

<pattern found in text>= (U-> U-> U->)
  (j==m)

Of course when the pattern is not found we return false.

<pattern not found in text>= (U-> U-> U-> U-> U->)
  return false;

The other exit of the inner while loop occurs when a mismatch is found at index j in pattern and index i+j in text. The brute-force approach is to shift the pattern one position forward in text and reset the pattern index to 0.

<brute-force shift of pattern>= (U->)
  ++i; 
  j=0;

Putting all these pieces together produces the brute-force left-to-right pattern matching algorithm.

<Brute-force pattern matching with left-to-right scan>= (<-U)
public boolean patternMatcher (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>   
  <set pattern index to start of pattern>     
  while <text index is legal> {
     <left-to-right scan>
     if <pattern found in text> return true;
     <brute-force shift of pattern> 
  }
  <pattern not found in text>
}
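
The chunks above expand into a single method. As a rough illustration, here is a minimal standalone sketch that works directly on String arguments rather than on the Text class; the class name BruteForceSketch and the method name occurs are hypothetical and are not part of Text.java.

```java
// Hypothetical standalone version of the brute-force matcher.
// The names BruteForceSketch and occurs are illustrative only.
class BruteForceSketch {
    /** Returns true if pattern occurs somewhere in text (left-to-right scan). */
    static boolean occurs(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;                       // alignment of pattern[0] in text
        int j = 0;                       // number of characters matched so far
        while (i <= n - m) {             // text index is legal
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;                     // extend the match by one character
            }
            if (j == m) return true;     // all m characters matched
            ++i;                         // brute-force shift by one position
            j = 0;                       // restart at the start of pattern
        }
        return false;                    // pattern not found in text
    }

    public static void main(String[] args) {
        System.out.println(occurs("aabacabaa", "abaa"));
        System.out.println(occurs("aabacabaa", "abab"));
    }
}
```

Note that with this formulation an empty pattern is reported as occurring in every text, since j == m holds immediately.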

Analysis of the brute-force pattern matcher

There are several things to notice about the brute-force pattern matcher. First, it has optimal space complexity; it solves PMP in constant space, needing only a few registers for indices i, j, and lengths m and n. (In general, space complexity ignores the space required for input and output.) Second, it has worst-case quadratic time complexity, and this is not very good!
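
To get a feel for the quadratic behavior, one can instrument the matcher to count character comparisons. The sketch below is only an illustration: the class BruteForceCount, its methods, and the choice of test strings (a text of n copies of a, and a pattern of m-1 copies of a followed by b) are assumptions made here, not part of Text.java.

```java
// Hypothetical instrumented matcher; all names here are illustrative.
class BruteForceCount {
    /** Counts the character comparisons made by the brute-force left-to-right scan. */
    static long comparisons(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        long count = 0;
        for (int i = 0; i <= n - m; i++) {        // each legal alignment
            int j = 0;
            while (j < m) {
                count++;                          // one comparison of pattern[j] with text[i+j]
                if (pattern.charAt(j) != text.charAt(i + j)) break;
                j++;
            }
            if (j == m) return count;             // pattern found: stop counting
        }
        return count;
    }

    /** Builds a string consisting of count copies of c. */
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < count; k++) sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 1000, m = 500;
        String text = repeat('a', n);
        String pattern = repeat('a', m - 1) + "b";
        System.out.println("comparisons = " + comparisons(text, pattern));
        System.out.println("m(n-m+1)    = " + (long) m * (n - m + 1));
    }
}
```

On these inputs every alignment matches m-1 characters before failing on the last one, so each of the n-m+1 alignments costs m comparisons.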

Problem 2: Show that in the worst case patternMatcher's inner while loop can execute m times and the outer while loop can execute n-m+1 times. Show that the maximum number of comparisons is (n+1)^2/4 and give example strings for pattern and text where this worst case is realized. Hint: maximize the quadratic expression nm-m^2+m as a function of m.

Of course the average case time complexity is not nearly so bad.

Problem 3: Turn the brute-force algorithm given in these notes into a working program with input and output. Test the average case behavior of the code. Use the words in a dictionary (for example /usr/dict/words on a Unix system) as patterns for which to search. Find a large text document on the World Wide Web (for example www.gutenberg.net has a collection of great books that can serve as text files).

The Morris-Pratt pattern matcher

The brute-force pattern matcher does not use the information gathered once a mismatch occurs; it throws away the knowledge that pattern[0..j-1] == text[i..i+j-1] and pattern[j] != text[i+j], and simply restarts by comparing pattern[0] with text[i+1].

We will see how to pre-process the pattern so that this information is not wasted. The cost will be an increase in the space needed in pattern matching, but this will reduce the worst case time complexity for PMP. The preprocessing is accomplished by exploiting the invariant

pattern[0..j-1] = text[i..i+j-1].
Later we will make use of the mismatch information to improve the process even more.

The algorithm looks exactly like the brute-force algorithm with the exception of how shifts are made once a mismatch is found.

<Morris-Pratt pattern matching>= (<-U)
public boolean MorrisPratt (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>   
  <set pattern index to start of pattern>     
  while <text index is legal> {
     <left-to-right scan>
     if <pattern found in text> return true;
     <Morris-Pratt shift> 
  }
  <pattern not found in text>
}

Consider what we know when pattern[0..j-1] matches text[i..i+j-1]. Figure [->] shows an alignment of pattern with text at index i, with matching characters at positions i..i+j-1. Suppose pattern is shifted from position i in text to position i+s. There are three conditions such a shift s should satisfy:


Alignment with match between pattern[0..j-1] and text[i..i+j-1]. [*]

  1. s should be safe,
  2. s should be feasible,
  3. s should be at least one.

A shift from i to i+s is safe if pattern can not occur at any position in between.

A safe shift is feasible if a match could occur at i+s (based on our current knowledge).

Since we are assuming a mismatch between pattern[j] and text[i+j], a shift of s=1 is safe; this is the brute-force (or conservative) approach. We'd like to conclude that such a small shift is infeasible, so we can safely make larger safe shifts until a feasible shift is found.

Figure [->] extends figure [<-] to show the configuration when a shift occurs. For the shift to be feasible, Figure [->] shows that prefix pattern[0..j-s-1] must be a proper suffix of pattern[0..j-1]. To also be safe, s must be the smallest such feasible shift.


A safe feasible shift of pattern. [*]

Word borders and periods

Before going farther let's introduce some terms and ideas that are helpful in text algorithms.

A border of a word w is any word that is both a prefix and a suffix of w. For example, the word

w=abaabaaabaaba
has borders
abaaba, aba, a, and e
where e stands for the empty word.

Define w.border() to be the longest proper border of a non-empty word w. We can iterate this function to obtain a list of all borders of w. For the sample word w=abaabaaabaaba:

w.border()=abaaba, abaaba.border()=aba, aba.border()=a
a.border()=e

Borders have a dual notion called a period, which is an integer p such that 0 < p <= |w| and w[0..k-1] = w[p..p+k-1] where k=|w|-p. The periods of w=abaabaaabaaba are 7, 10, 12 and 13. The shortest period plus the length of the longest proper border equals the word's length:

p + |w.border()| = |w|.
Shifting a string by its shortest period aligns its borders. Such shifts are feasible.
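
The definitions of border and period can be checked directly by brute force. The following is a minimal sketch, assuming a hypothetical helper class BorderSketch (not part of Text.java) that lists the lengths of all proper borders of a word, longest first, and the matching periods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for experimenting with borders and periods.
class BorderSketch {
    /** Lengths of all proper borders of w (including the empty border), longest first. */
    static List<Integer> borderLengths(String w) {
        List<Integer> lengths = new ArrayList<>();
        for (int k = w.length() - 1; k >= 0; k--) {
            // The prefix of length k must equal the suffix of length k.
            if (w.substring(0, k).equals(w.substring(w.length() - k))) {
                lengths.add(k);
            }
        }
        return lengths;
    }

    /** Periods of w, shortest first: each period is |w| minus a border length. */
    static List<Integer> periods(String w) {
        List<Integer> ps = new ArrayList<>();
        for (int k : borderLengths(w)) {
            ps.add(w.length() - k);
        }
        return ps;
    }

    public static void main(String[] args) {
        String w = "abaabaaabaaba";
        System.out.println("border lengths: " + borderLengths(w));
        System.out.println("periods:        " + periods(w));
    }
}
```

This test takes quadratic time per word and is meant only for checking small examples; the linear-time computation of borders is developed later in these notes.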

Problem 4: Define Fibonacci words over the alphabet {a, b} by

F_0 = e, F_1 = b, F_2 = a, and F_n = F_{n-1} F_{n-2} for n >= 3
Determine the length of F_n. Find the periods and borders of F_n.

As we will see below [->], one can readily compute the length of the border of a string. So suppose we compute the length |pattern[0..j-1].border()| for each index j=1..m and store this integer in an array border[j] (we'll see border[0] should be set to -1). When a mismatch occurs the pattern is shifted by its period j-border[j]. Here's an example:

pattern index j:   0 1 2 3 4 5
pattern:           a b a a b a
text:            a a b a c a b a a ...
text index i:    0 1 2 3 4 5 6 7 8 ...
The mismatch occurs at j=3 once we've matched aba, which has border a of length 1. The period of aba is 2. Shifting pattern 2=3-1 positions produces an alignment of the border of aba.

pattern index j:       0 1 2 3 4 5
pattern:               a b a a b a
text:            a a b a c a b a a ...
text index i:    0 1 2 3 4 5 6 7 8 ...
We can restart the matching at j=1=border[3]. We don't need to recheck the match of pattern[0] with text[3] (of course we get an immediate mismatch at the next position, which we'll handle later).

A shift by the period is both safe and feasible. From the above analysis we know that the shift satisfies

s = j - border[j]
where border[j] is the length of the border of pattern[0..j-1].

Now let's consider the boundary case where j=0 when the mismatch occurs. That is pattern[0] is not the same as text[i]. Clearly a shift of s=1 is warranted, and

s = 1 = 0 - border[0]
implies border[0]=-1. Because of this we must restart the matching process from pattern index j=max(0, border[j]).

<Morris-Pratt shift>= (U->)
  i += j-pattern.border[j]; 
  j = (0 < pattern.border[j]) ? pattern.border[j] : 0;

Computation of borders[*]

We want to compute border[0..m] for pattern which has length m; recall border[k] is the length of the border of the prefix pattern[0..k-1] and a shift by k-border[k] aligns the border at the start of pattern[0..k-1] with the border at the end of pattern[0..k-1]. The pattern calls computeBorders() (i.e., pattern.computeBorders()) before the call text.MorrisPratt(pattern).

Remember that the Text variable pattern has an internal representation that uses a String named text of length n. The variables pattern and m will be used when discussing the algorithm, but the program variables text and n will be used in the code.

We know border[0] = -1 and so we can set this initial value. Also, we'll store the length of the current border in a local variable k.

<Define border[0]>= (U->)
  border[0] = -1; 
  int k = -1; 

Now we want to compute border[1],...,border[m], where m is the length of pattern. This will be done in a for loop that loops over each non-empty prefix of pattern.

<each non-empty prefix>= (U->)
  (int j = 1; j <= n; j++) 

Let's pretend we've computed k=border[j-1] of pattern[0..j-2] for some j >= 1, and we want to compute border[j]. Figure [->] shows that when pattern[k] == pattern[j-1] then border[j] = k+1.


Extending the border when characters after border and prefix match. [*]

That is, when the character after the border matches the character after the prefix, the border is extended by 1.

<Extend the border by 1 when next characaters match>= (U->)
  ++k;
  border[j] = k;
  

Now let's consider what happens when there is a mismatch, that is pattern[k] != pattern[j-1]. This case is a little more complex to figure out, but figure [->] shows that while a mismatch occurs, the current border k needs to be reduced to border[k] until there is a match or no border is found (k = -1).


Reducing the border after a mismatch.[*]

<Reduce the border until a match is found or no border exists>= (U->)
  while ((k >= 0) && (text.charAt(k) != text.charAt(j-1))) { 
    k = border[k]; 
  }

Using the above code chunks we can put together the computeBorders() function easily.

<Auxiliary functions>= (<-U) [D->]
public void computeBorders() {
  <initialize text length>
  <Define border[0]>
  for <each non-empty prefix> {
    <Reduce the border until a match is found or no border exists>
    <Extend the border by 1 when next characaters match>
  }
}
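
For experimentation outside the Text class, the same computation can be written against a plain String. This is a minimal sketch; the class name BorderTableSketch is hypothetical, and the method returns the table instead of storing it in an instance property.

```java
import java.util.Arrays;

// Hypothetical standalone version of computeBorders().
class BorderTableSketch {
    /**
     * Returns border[0..m], where border[j] is the length of the longest
     * proper border of p[0..j-1] and border[0] = -1.
     */
    static int[] computeBorders(String p) {
        int m = p.length();
        int[] border = new int[m + 1];
        border[0] = -1;
        int k = -1;                           // length of the current border
        for (int j = 1; j <= m; j++) {        // each non-empty prefix p[0..j-1]
            // Reduce the border until the next characters match or no border exists.
            while (k >= 0 && p.charAt(k) != p.charAt(j - 1)) {
                k = border[k];
            }
            border[j] = ++k;                  // extend the border by one
        }
        return border;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(computeBorders("abaaba")));
    }
}
```

Each iteration of the for loop increases k by exactly one and each iteration of the while loop decreases it, so the total work is linear in the length of the pattern.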

One final act, so the Java compiler does not complain: we declare the integer array border[] used in the Morris-Pratt algorithm.

<Instance properties>+= (<-U) [<-D->]
  public int[] border;

Analysis of the Morris-Pratt pattern matcher

The need for the array border[0..m] increases the space complexity from the constant space requirement for the brute-force algorithm. The benefit is a linear worst case time complexity.

Theorem: The maximal number of symbol compares in the Morris-Pratt algorithm is 2n-m.

Proof
There is at most one unsuccessful comparison for each value of the text index i, which ranges from 0 to n-m, so there are at most n-m+1 unsuccessful compares. We can give an upper bound for the number of successful compares by considering the sum i+j. The least value for i+j is 0 and the greatest value is n-1. Each time a successful compare is made i+j increases by 1, and a shift never decreases it, thus there are at most n successful compares. Finally, the successful and unsuccessful compares cannot both attain their maximum, thus there are at most n + (n-m+1) - 1 = 2n-m compares in the Morris-Pratt algorithm.
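
The bound can also be observed experimentally. The sketch below is a hypothetical standalone Morris-Pratt matcher on String arguments that counts its comparisons; the class name MorrisPrattSketch, the static counter compares, and the sample inputs in main are assumptions made for illustration.

```java
// Hypothetical instrumented Morris-Pratt matcher; names are illustrative.
class MorrisPrattSketch {
    static long compares;   // comparisons made by the most recent call to occurs()

    /** border[j] = length of the longest proper border of p[0..j-1]; border[0] = -1. */
    static int[] borders(String p) {
        int[] border = new int[p.length() + 1];
        border[0] = -1;
        int k = -1;
        for (int j = 1; j <= p.length(); j++) {
            while (k >= 0 && p.charAt(k) != p.charAt(j - 1)) k = border[k];
            border[j] = ++k;
        }
        return border;
    }

    static boolean occurs(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        int[] border = borders(pattern);
        compares = 0;
        int i = 0, j = 0;
        while (i <= n - m) {
            while (j < m) {
                compares++;                           // one symbol comparison
                if (pattern.charAt(j) != text.charAt(i + j)) break;
                j++;
            }
            if (j == m) return true;
            i += j - border[j];                       // Morris-Pratt shift
            j = Math.max(0, border[j]);               // keep the aligned border
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "aaaaaaaaaa", pattern = "aaab";
        boolean found = occurs(text, pattern);
        System.out.println("found = " + found + ", compares = " + compares
                + ", 2n-m = " + (2 * text.length() - pattern.length()));
    }
}
```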

Knuth-Morris-Pratt pattern matching

The Morris-Pratt algorithm can be improved by using additional information known at the time a mismatch occurs. In particular, the complete invariant is

pattern[0..j-1]==text[i..i+j-1] and pattern[j] != text[i+j].

The Knuth-Morris-Pratt (KMP) algorithm [cite knmp:77] makes use of this additional one bit of mismatch information to allow longer shifts of pattern in text. Otherwise, the algorithm is the same as the previous ones.

<Knuth-Morris-Pratt pattern matching>= (<-U)
public boolean KMP (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to start of pattern>
  while <text index is legal> {
     <left-to-right scan>
     if <pattern found in text> return true;
     <Knuth-Morris-Pratt shift> 
  }
  <pattern not found in text>
}

One might call the idea used in the KMP shift ``strict borders.'' The lengths of these strict borders are stored in an array strictBorder[], and the shift is computed just as in the Morris-Pratt case using this new array.

<Knuth-Morris-Pratt shift>= (U->)
  i += j-pattern.strictBorder[j]; 
  j = (0 < pattern.strictBorder[j]) ? pattern.strictBorder[j] : 0;

The strictBorder[] array used in the Knuth-Morris-Pratt algorithm is declared.

<Instance properties>+= (<-U) [<-D->]
  public int[] strictBorder;

And its elements initialized to -1.

<initialize border>=
  for (int j = 0; j < strictBorder.length; j++) strictBorder[j] = -1;
 

To fill out the strictBorder[] array, consider what we know:

pattern[0..j-1]==text[i..i+j-1] and pattern[j] != text[i+j].

So if pattern[0..k] is the border of pattern[0..j-1] and pattern[k+1] = pattern[j] != text[i+j], then there will be an immediate mismatch when we shift to align borders. In this case we can safely shift farther, aligning the border of pattern[0..k] with the tail of pattern[0..j-1]. On the other hand, if pattern[k+1] != pattern[j] then perhaps pattern[k+1] = text[i+j], and we can only safely shift far enough to align the border of pattern[0..j-1].

Here's an example.

index: 0123456
pattern: abaabaa
text: abaabacabaab

pattern[6] != text[0+6] and the border of abaaba is aba, which has length 3, implying a shift of 6-3=3. A shift of 3 leads to an immediate mismatch since pattern[3]=a will be compared with text[6]=c. Thus, we can consider the border of the border of abaaba, that is the border of aba, or a, which has length 1 and shift by 6-1=5.
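
One possible way to fill the table is sketched below. It is an illustration under stated assumptions rather than the code used in these notes: the class StrictBorderSketch and its methods are hypothetical, and the rule applied is that the border of length k = border[j] is kept when pattern[k] != pattern[j], and otherwise replaced by the strict border already computed for index k.

```java
import java.util.Arrays;

// Hypothetical sketch of a strict-border computation; names are illustrative.
class StrictBorderSketch {
    /** border[j] = length of the longest proper border of p[0..j-1]; border[0] = -1. */
    static int[] borders(String p) {
        int[] border = new int[p.length() + 1];
        border[0] = -1;
        int k = -1;
        for (int j = 1; j <= p.length(); j++) {
            while (k >= 0 && p.charAt(k) != p.charAt(j - 1)) k = border[k];
            border[j] = ++k;
        }
        return border;
    }

    /**
     * strict[j] = length of the longest border of p[0..j-1] whose following
     * character differs from p[j], or -1 if there is none.
     */
    static int[] strictBorders(String p) {
        int m = p.length();
        int[] border = borders(p);
        int[] strict = new int[m + 1];
        strict[0] = -1;
        for (int j = 1; j < m; j++) {
            int k = border[j];
            // If p[k] == p[j], aligning this border repeats the mismatch at once,
            // so fall back to the strict border of the shorter prefix.
            strict[j] = (p.charAt(k) != p.charAt(j)) ? k : strict[k];
        }
        if (m > 0) strict[m] = border[m];    // no mismatch character exists at index m
        return strict;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(strictBorders("abaabaa")));
    }
}
```

For the pattern abaabaa this gives strictBorder[6] = 1, which agrees with the shift of 6-1=5 found in the example above.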

Problem 5: Develop an algorithm that computes the strict borders of a pattern. You may find it useful to know that strictBorder[j] = border[j] if pattern[border[j]] != pattern[j]; when this inequality does not hold, replace the candidate k = border[j] by border[k] until pattern[k] != pattern[j] or k becomes negative. Show that your algorithm is correct and estimate its time complexity.

Analysis of the Knuth-Morris-Pratt pattern matcher

The KMP pattern matcher has space complexity S(n+m)=O(m). This reflects the storage for array strictBorder[0..m] and the constant space required for indices and lengths.

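The worst-case time complexity T(n+m) = O(n+m) is of the same order as that of the Morris-Pratt algorithm. Because a strict border is never longer than the corresponding border, KMP shifts are never shorter than Morris-Pratt shifts; KMP generally makes fewer comparisons, but on some inputs it does no better.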

Right-to-left scanning for pattern matching

Now we want to develop a brute-force algorithm that will match pattern against text using a right-to-left scan of pattern. The previous left-to-right algorithm is modified in these ways:

  1. pattern index j starts at the end of pattern.
  2. The inner scan decrements j.
  3. When j runs off the left-end (j=-1) a complete match has occurred.
  4. The shift moves the pattern right one place and resets j to the end of pattern.

<Brute-force pattern matching with right-to-left scan>= (<-U)
public boolean patternMatcher2 (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to end of pattern>
  while <text index is legal> {
     <right-to-left scan>
     if (j == -1) return true;
     <right-to-left brute-force shift of pattern> 
  }
  <pattern not found in text>
}

The right-to-left scan starts at the end of pattern.

<set pattern index to end of pattern>= (U-> U->)
  int j = m-1;

And decrements pattern index j so long as text matches pattern.

<right-to-left scan>= (U-> U->)
  while ((j > -1) && <pattern[j] matches text[i+j]>) { --j; }

When a mismatch occurs, slide pattern one position right (i = i+1) and reset j to point to the end of pattern.

<right-to-left brute-force shift of pattern>= (U->)
  ++i; 
  j = m-1;

The Boyer-Moore algorithm

The Boyer-Moore algorithm [cite bomo:77] uses the knowledge gained by the above brute-force algorithm to leverage an improved pattern matcher. What do we know? When a mismatch occurs, we know the invariant

pattern[j] != text[i+j]
and
pattern[j+1..m-1] = text[i+j+1..i+m-1]
This invariant is shown in figure [->].


Alignment with match using right-to-left scan.[*]

Let's pretend, after a mismatch, that we shift pattern s positions to the right where 1 <= s <= j. This aligns pattern[0..m-1-s] and pattern[s..m-1] as shown in figure [->]. In particular, pattern[j-s] aligns with pattern[j] and text[i+j]. Thus, we require that the shift s satisfy:

  1. [*] pattern[j-s] != pattern[j]. If they were equal a mismatch between pattern[j-s] and text[i+j] would occur; such a shift is not feasible.
  2. [*] A border of the reversed prefix pattern[m-1..j-s+1] has length m-1-j, that is, pattern[m-1..j+1] = pattern[m-1-s..j+1-s].
Notice condition [<-] is a statement about some border of a prefix of the reversed string; this border may not be the border.


A Boyer-Moore shift between 1 and j positions.[*]

Now let's pretend that we shift j+1 <= s < m characters. Such a shift aligns pattern[0..m-1-s] and pattern[s..m-1] with text[i+s..i+m-1], see figure [->]. Such a shift s satisfies the condition:

  1. [*] pattern has some border pattern[0..m-1-s] of length m-j-1 or less.


A Boyer-Moore shift between j+1 and m positions.[*]

When either the first two conditions or the third condition fail to hold we can safely shift pattern the maximal amount m. Given the pattern we can compute if there are shifts satisfying the conditions above. Such shifts are safe and feasible. For each j the longest safe and feasible shift is stored in a look-up table which is used when a mismatch occurs. The algorithm is identical to the brute-force right-to-left scan, except for this look-up of the shift.

<Boyer-Moore pattern matching>= (<-U)
public boolean BoyerMoore (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to end of pattern>
  while <text index is legal> {
     <right-to-left scan>
     if (j == -1) return true;
     <Boyer-Moore shift of pattern> 
  }
  <pattern not found in text>
}

We'll call the table of shifts goodSuffix[]. Then when a mismatch occurs on pattern index j, the shift goodSuffix[j] is added to i and pattern index j is reset to the end of pattern.

<Boyer-Moore shift of pattern>= (<-U)
  i += pattern.goodSuffix[j];
  j = m-1;

Computing the goodSuffix[] array

We start by declaring an instance of the goodSuffix[] array. Its length will be set to m when pattern is created and each element will be initialized to zero, the Java default. (Remember that the Text variable pattern has an internal representation with a String named text of length n. The variable m will be used when discussing the algorithm, but n will be used in the code.)

<Instance properties>+= (<-U) [<-D]
  public int[] goodSuffix;

The code for goodSuffix[] is abstruse. Essentially, we want to test the conditions discussed above. Our implementation is driven more by a need for clarity than efficiency in time and space. Here is the complete algorithm for computeGoodSuffix().

<Auxiliary functions>+= (<-U) [<-D->]
public void computeGoodSuffix() {
  <Declarations and initializations>
  <Set each shift to the maximal value>
  <Reset shifts when 1 <= goodSuffix[j] <= j>
  <Reset shifts when goodSuffix[j] > j>
} 

Let's start with an auxiliary routine that reverses the characters in a string. This will be useful in testing condition [<-].

<Auxiliary functions>+= (<-U) [<-D->]
public Text reverse() {
  <initialize text length>
  StringBuffer reverse = new StringBuffer();
  for (int i = n-1; i > -1; i--) {
     reverse.append(text.charAt(i));
  }
  return new Text(reverse.toString());
}

Borders for both the pattern and its reverse are used, and a variable called s will denote the shift.

<Declarations and initializations>= (<-U)
  <initialize text length>
  Text reverse = reverse();
  computeBorders();
  reverse.computeBorders();
  int s; 

We'll start by being opportunistic and set, for each j, goodSuffix[j] = m the largest possible shift. As we find that shorter shifts are safe and feasible we'll reset goodSuffix[j] to these smaller values.

<Set each shift to the maximal value>= (<-U)
  for (int j = 0; j < n; j++) {
    goodSuffix[j] = n;
  }

Boundary conditions

To develop the shift table for the Boyer-Moore algorithm, we'll consider boundary cases first.

First compare is a mismatch.
When there is an immediate mismatch between pattern[m-1] and text[i+m-1], a shift of 1 is appropriate, but so is a shift by the smallest value s such that pattern[m-1-s] != pattern[m-1]. This is condition [<-] for the case j=m-1. The requirement is that s be the smallest value satisfying reverse[0..s+1] = 0.

No compare is a mismatch.
Here it must be that j, our pattern position index, has fallen off the left end of pattern, that is j == -1. Our decision algorithm simply returns true when this occurs.

Mismatch on the last compare.
Now let's consider the case that pattern[0] != text[i]. That is, we've matched all characters except the first. I hope it is obvious that a shift by the period of pattern, that is m-border[m], is both safe and feasible. The border is a good suffix where a shift by the period will produce a potential pattern-text match; no shorter shift can.

Non-boundary cases

We restrict our attention to the case where a mismatch occurs at pattern[j] and 0 < j < m-1. This scenario is shown in figure [<-], and there are two cases to consider. These are illustrated in figures [<-] and [<-]. In figure [<-], the proposed shift s is no more than j. In figure [<-], s is larger than j, but less than m.

A shift 1 <= s <= j.
Let's pretend that the situation of figure [<-] occurs to determine how to build the code that enforces the situation. Thus s is between 1 and j where a mismatch occurs at pattern[j]. Figure [<-] illustrates that two conditions must hold:

[*] pattern[j-s] != pattern[j] and [*] pattern[j+1-s..m-1-s] = pattern[j+1..m-1]. Condition [<-] is the Knuth-Morris-Pratt strict border condition and condition [<-] is the Morris-Pratt border condition for the reversed pattern. With reverse being the reversal of pattern, condition [<-] says that prefix reverse[0..m-j+s-2] has a border of length m-j-1.

For a fixed j between 1 and n-1, we'll start the shift s at 1 and increment s until both conditions hold or s exceeds j. So, for each proposed shift s, we'll test if the strict border condition [->] holds and, when it does, we'll determine if the border condition [<-] holds on the reverse pattern; variable k is the length of a border.

<Search for a  safe shift between 1 and j>= (U->)
  s = 1;
  while (s <= j) {
     if <Strict border condition> { 
       <Initialize border of reverse[0..(n-j-1+s)]>
       while <Border greater than tail to match> {
         <Reset to smaller border>
       }
       if (k == n-j-1) { // border condition satisfied
         <Set goodSuffix[j] and exit while loop> 
       }
     }
     ++s;
  }

The strict border condition is:

<Strict border condition>= (<-U)
  (text.charAt(j-s) != text.charAt(j))

The prefix of reverse whose borders we want to test has length n-j-1+s, so the length of its border is stored in reverse.border[n-j-1+s]. We'll start by setting variable k to the border of this prefix; k will be reduced while it is larger than the length of the tail of pattern we want to match, that is, n-1-j.

<Initialize border of reverse[0..(n-j-1+s)]>= (<-U)
  int k = reverse.border[n-j-1+s];

The length of the tail pattern[j+1..n-1] that has been matched when a mismatch occurs at j is (n-1)-(j+1)+1 = n-1-j.

<Border greater than tail to match>= (<-U)
  (k > n-j-1)

The next smaller border is found by looking at the border of the border.

<Reset to smaller border>= (<-U U->)
  k = reverse.border[k];  

At this point a shift s that satisfies conditions [->] and [<-] has been found. We can arrange to exit the while (s <= j) loop by setting s = j; it will then be incremented, forcing an exit of the loop.

<Set goodSuffix[j] and exit while loop>= (<-U)
  goodSuffix[j] = s; 
  s = j;

Putting all of these pieces together gives the code below.

<Reset shifts when 1 <= goodSuffix[j] <= j>= (<-U)
  for (int j=1; j < n; j++) {
    <Search for a  safe shift between 1 and j>
  }

A shift j < s.
Now let's develop the code when no shift between 1 and j can be found, but perhaps a shift greater than j exists. Figure [<-] depicts the situation that is represented by the equation [*] pattern[0..m-1-s] = pattern[s..m-1]. Here's the outline of what we need to do.

<Reset shifts when goodSuffix[j] > j>= (<-U)
  <Set the border length>
  <Initialize the index where the search starts>
  while (<There is a non-empty border>) {
    <Set shift to the period of the current border>
    for (int j = start; j < s; j++) {
      goodSuffix[j] = (s < goodSuffix[j]) ? s : goodSuffix[j];
    }

    <Reset the start index>
    <Reset to smaller border>
  }

The shortest shift of the type under consideration is determined by the period of pattern. We'll initialize k to the length of pattern's border and let k become successive (shorter) border lengths as we search for longer shifts.

<Set the border length>= (<-U)
  int k = border[n];

The search continues as long as there is a non-empty border. After each search for a shift with one border, we reset the border length k to the length of the next border.

<There is a non-empty border>= (<-U)
  (k > 0)

With a border of length k the period to shift aligning pattern borders is n-k.

<Set shift to the period of the current border>= (<-U)
  s = n - k;

A placeholder start will be used to control the search over j. The first time through pattern index j starts at 0.

<Initialize the index where the search starts>= (<-U)
  int start = 0;

Once we've searched over a range start <= j < s, the next search can be over a range that begins with start = s.

<Reset the start index>= (<-U)
  start = s;

And that is the code which enforces condition [<-].

Concerns about the derivation of computeGoodSuffix()

The above derivation of computeGoodSuffix() is not very efficient, but it may be more clear than other developments of the code.

Problem 6: Provide a time and space complexity analysis of the presented code for computeGoodSuffix().

Problem 7: Develop an alternative, more efficient (in time and space) algorithm for computeGoodSuffix(). Some things to consider: declaring the reverse of pattern requires significant extra space; it can be eliminated. The time spent computing goodSuffix[n-1] is large; this computation can be folded into the computation when goodSuffix[j] <= j.

The Last Occurrence Function

The classical Boyer-Moore algorithm uses what is known as the last occurrence or bad character heuristic. It says: when a mismatch occurs between pattern[j] and text[i+j], find the right-most (last) occurrence of text[i+j] in pattern and shift to align these; see figure [->], which shows this shift.


Character (a) at the pattern's end does not match the bad character (b) in text. The last occurrence of b in pattern is at j. No shift less than m-1-j could produce a pattern-text match.[*]

When the last occurrence of b in pattern is at index k, the last-occurrence shift on a mismatch at j is j - lastOccurrence[b] = j-k. Notice that when k > j this is a negative (leftward) shift! Also, when b does not occur in pattern a shift of j+1 characters is appropriate, thus we'll define lastOccurrence[b] = -1 when b does not occur in pattern. Creating a lastOccurrence[] table requires |A| space (A is the alphabet and |A| is its cardinality).
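
As a rough sketch of how such a table might be built, the code below assumes an alphabet of 256 character codes (an assumption made here; the notes leave the alphabet A unspecified). The class LastOccurrenceSketch and its methods are hypothetical, and clamping the shift to at least 1 is one simple way to avoid the leftward shift noted above.

```java
import java.util.Arrays;

// Hypothetical last-occurrence table; assumes all characters have codes below 256.
class LastOccurrenceSketch {
    static final int ALPHABET_SIZE = 256;   // assumption: 8-bit character codes

    /** last[b] = index of the rightmost occurrence of b in p, or -1 if b is absent. */
    static int[] lastOccurrence(String p) {
        int[] last = new int[ALPHABET_SIZE];
        Arrays.fill(last, -1);
        for (int k = 0; k < p.length(); k++) {
            last[p.charAt(k)] = k;          // later occurrences overwrite earlier ones
        }
        return last;
    }

    /** Shift for a mismatch at pattern index j against text character b, never below 1. */
    static int shift(int[] last, int j, char b) {
        return Math.max(1, j - last[b]);
    }

    public static void main(String[] args) {
        int[] last = lastOccurrence("abaaba");
        System.out.println("last['a'] = " + last['a'] + ", last['b'] = " + last['b']
                + ", last['c'] = " + last['c']);
        System.out.println("shift at j=5 on 'c': " + shift(last, 5, 'c'));
    }
}
```

In a combined matcher one would typically take the larger of this shift and the goodSuffix[] shift, since both are safe.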

Some authors eschew the use of a lastOccurrence[] table, others extol it. It does require space that depends on the alphabet, something we've not seen before. Its utility depends on the alphabet size and the distribution of characters in pattern.

Problem 8: Write a program computeLastOccurrence(), which when given an alphabet A and a pattern determines the last (rightmost) occurrence of each character in A in pattern. Use this algorithm to improve the Boyer-Moore algorithm. Empirically compare the time and space complexity of Boyer-Moore with and without this improvement by using a large text and multiple patterns.

Analysis of the Boyer-Moore pattern matcher

Establishing a tight upper bound on the number of comparisons is beyond the scope of these notes. A bound of 4n is fairly simple to prove, although 3n is a better approximation. When pattern is relatively long and the alphabet is large, Boyer-Moore is likely to be the most efficient pattern matcher. Empirically, in the average case, the number of compares is often sub-linear, that is cn where c < 1.

Finishing up

To complete the class we'll define a constructor. It has one String argument, and this is set to the text. It will also initialize the tables (arrays) used to look up shifts.

<Auxiliary functions>+= (<-U) [<-D->]
  public Text(String t) {
    text = t;
    border = new int[text.length()+1];
    strictBorder = new int[text.length()+1];
    goodSuffix = new int[text.length()];
  }

Another useful method returns the length of the text string.

<Auxiliary functions>+= (<-U) [<-D->]
  public int length() {
    return text.length();
  }

And another useful method returns the character at a position k in the text string.

<Auxiliary functions>+= (<-U) [<-D]
  public char charAt(int k) {
    return text.charAt(k);
  }

The Source

This document is written using Norman Ramsey's noweb tools for literate programming. The source for the document is a file named Text.nw, which can be translated into an HTML file for online distribution, a LaTeX file for printing, or a Java file.

You can retrieve the Java file, the LaTeX file, or its PostScript translation by anonymous ftp from tsunami.cs.fit.edu. They are in the pub/algo directory.

Testing the Algorithms

Now we'll do one last, but important, thing. We'll write some test cases that help us to believe that no defects occur in our code.

The main routine will read two strings from the command line and then perform various tests to see that our algorithms work correctly (at least on the test cases). The first string is the text and the second is the pattern.

<Test the pattern matching algorithms>= (<-U)
public static void main(String[] args) {
  Text text = new Text(args[0]);
  Text pattern = new Text(args[1]);
  <Test the left-to-right scan brute-force patternMatcher()>
  <Test computeBorders()>
  <Test MorrisPratt()>
  <Test computeStrictBorders()>
  <Test KnuthMorrisPratt()>
  <Test the right-to-left scan brute-force patternMatcher2()>
  <Test computeGoodSuffix()>
  <Test BoyerMoore()>
}

 


The first test will be of the brute-force left-to-right scan pattern matcher.

<Test the left-to-right scan brute-force patternMatcher()>= (<-U)
  System.out.println(text.patternMatcher(pattern));

One thing to test is that the border[] array is correctly computed.

<Test computeBorders()>= (<-U)
  pattern.computeBorders();
  for (int j = 0; j <= pattern.length(); j++) {
    System.out.println("border[" + j + "] = " + pattern.border[j]);
  }

Now let's test that our implementation of the Morris-Pratt algorithm works correctly.

<Test MorrisPratt()>= (<-U)
  System.out.println(text.MorrisPratt(pattern));

We cannot test the KMP algorithm since we've left its completion as an exercise, so the next two tests are commented out.

<Test computeStrictBorders()>= (<-U)
//  pattern.computeStrictBorders();
//  for (int j = 0; j <= pattern.length(); j++) {
//    System.out.println("strictBorder[" + j + "] = " + pattern.strictBorder[j]);
//  }

<Test KnuthMorrisPratt()>= (<-U)
//  System.out.println(text.KnuthMorrisPratt(pattern));

<Test the right-to-left scan brute-force patternMatcher2()>= (<-U)
  System.out.println(text.patternMatcher2(pattern));

Before testing Boyer-Moore we see if goodSuffix[] is calculated properly.

<Test computeGoodSuffix()>= (<-U)
  pattern.computeGoodSuffix();
  for (int j = 0; j < pattern.length(); j++) {
    System.out.println("goodSuffix[" + j + "] = " + pattern.goodSuffix[j]);
  }

And now our test of BoyerMoore().

<Test BoyerMoore()>= (<-U)
  System.out.println(text.BoyerMoore(pattern));
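Printing results for one text and one pattern only goes so far. One way to gain more confidence is to compare a matcher against a trusted oracle on many small inputs. The sketch below does this with String.indexOf from the Java library as the oracle. The class CrossCheckSketch and its bruteForce() stand-in are hypothetical and assume the matcher under test answers the yes/no question PMP; to check one of the methods in these notes, substitute it for bruteForce().

```java
// Hypothetical test harness; bruteForce() is a stand-in for the matcher under test.
class CrossCheckSketch {
  // Left-to-right brute-force matcher: does pattern occur in text?
  static boolean bruteForce(String text, String pattern) {
    int n = text.length();
    int m = pattern.length();
    for (int i = 0; i + m <= n; i++) {
      int j = 0;
      while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
        j++;
      }
      if (j == m) return true;
    }
    return false;
  }

  // Builds the string over {a, b} whose k-th character is 'b' when bit k of bits is set.
  static String fromBits(int bits, int len) {
    StringBuilder sb = new StringBuilder();
    for (int k = 0; k < len; k++) {
      sb.append(((bits >> k) & 1) == 0 ? 'a' : 'b');
    }
    return sb.toString();
  }

  // Counts inputs on which the matcher disagrees with String.indexOf, over every
  // text of length <= maxTextLen and every pattern of length <= maxPatternLen.
  static int countDisagreements(int maxTextLen, int maxPatternLen) {
    int disagreements = 0;
    for (int tLen = 0; tLen <= maxTextLen; tLen++) {
      for (int tBits = 0; tBits < (1 << tLen); tBits++) {
        String text = fromBits(tBits, tLen);
        for (int pLen = 0; pLen <= maxPatternLen; pLen++) {
          for (int pBits = 0; pBits < (1 << pLen); pBits++) {
            String pattern = fromBits(pBits, pLen);
            boolean expected = text.indexOf(pattern) >= 0;
            if (bruteForce(text, pattern) != expected) disagreements++;
          }
        }
      }
    }
    return disagreements;
  }

  public static void main(String[] args) {
    System.out.println("disagreements: " + countDisagreements(6, 3));
  }
}
```

A two-letter alphabet is a deliberate choice: it makes repeated prefixes and suffixes common, which is where the border[] and goodSuffix[] tables are most likely to hide defects.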

Additional Problems

These notes suffer from empty figures. Devise a good diagram for the concept each figure is meant to illustrate, and provide .gif and .ps files that can be incorporated into the notes.

References

[1] R. S. Boyer and J. S. Moore, A fast string searching algorithm, Communications of the ACM, 20 (1977), pp. 762--772.

[2] D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing, 6 (1977), pp. 323--350.