by William Shoaff with lots of help
You can download a postscript version of this file (which is prettier) at
The most basic text problem is PATTERN MATCHING (PMP for short) where we are given a pattern[0..m-1] of length m and text[0..n-1] of length n, where m is generally very much smaller than n, and we wish to know if pattern occurs in text.
We will study several algorithms for PMP. We are interested in their implementation as well as their time and space complexities. Some auxiliary functions will be introduced as needed.
We will write programs using an object-oriented approach with the Java language; thus we will assume a class Text that knows how to find patterns within the text property of the class.
At the moment, I can't think of any class properties, that is, variables with only one copy associated with the class (declared using the static keyword). But just in case we need to define some later, we've made a place for them in the chunk called Class Properties above.
The instance properties of the class include, at a minimum, a String of text. Additional instance properties will be defined as required.
The problem to solve is PMP: Does pattern[0..m-1] occur in text[0..n-1] or not? Our first few algorithms will use a left-to-right scan for the pattern in the text. In contrast, the Boyer-Moore algorithm will scan right-to-left. One thing known from the problem statement is the length of the text and the length of the pattern.
The brute-force (naive) pattern matching algorithm compares pattern[j] with text[i+j] for each i=0..n-m and j=0..m-1 until the pattern is found or the end of the text is reached.
Let i denote the index into text where matching starts; it is bounded below by 0 (is this obvious or not?)
The text index i is bounded above by n-m; otherwise the pattern would slide off the right end of the text (consider Figure 1). If i is the rightmost legal position in text, then pattern[0..m-1] aligns with text[i..n-1], which implies i+m-1=n-1, or i=n-m.
The index j will serve two purposes. Not only is it a character index in pattern, it will serve as a count of the number of characters matched. Thus j is bounded below by 0 and above by m; it can take on a value 1 larger than a legal index (m-1) but only to serve as a flag that m characters have been matched.
Now, let's define a predicate for testing if the pattern index is legal.
The matching starts at some index i and as long as the pattern matches the text we simply increment the j index.
All this is neatly handled in a while statement.
There are two exits for this (inner) while loop. One is when m characters of pattern have matched text and in this case we simply return true as the answer to the decision problem PMP.
Of course when the pattern is not found we return false.
The other exit of the inner while loop occurs when a mismatch is found at index j in pattern and index i+j in text. The brute-force approach is to shift the pattern one position forward in text and reset the pattern index j to 0.
Putting all these pieces together produces the brute-force left-to-right pattern matching algorithm.
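The pieces described above might be assembled as in the following sketch. It is a free-standing static method rather than a method of the Text class, and the names BruteForceMatcher and occurs are illustrative assumptions, not the original code chunks.

```java
// A minimal sketch; the class and method names are illustrative only.
final class BruteForceMatcher {

    /** Decides PMP: returns true if pattern occurs anywhere in text. */
    static boolean occurs(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0; // text index where the current attempt starts, 0 <= i <= n-m
        int j = 0; // pattern index, doubling as the number of characters matched

        while (i <= n - m) {
            // Advance j while the pattern index is legal and the characters agree.
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;
            }
            if (j == m) {
                return true; // all m characters matched
            }
            i++;   // mismatch: slide the pattern one position right
            j = 0; // and restart at the front of the pattern
        }
        return false; // the pattern would slide off the right end of the text
    }
}
```

With this sketch, occurs("abaabaaabaaba", "aaab") is true and occurs("abaabaaabaaba", "abab") is false. An empty pattern is treated as occurring in every text.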
There are several things to notice about the brute-force pattern matcher. First, it has optimal space complexity; it solves PMP in constant space, needing only a few registers for the indices i and j and the lengths m and n. (In general, space complexity ignores the space required for input and output.) Second, it has worst-case quadratic (O(nm)) time complexity, and this is not very good! Of course the average-case time complexity is not nearly so bad.
The brute-force pattern matcher does not use the information gathered once a mismatch occurs; it throws away the knowledge that pattern[0..j-1] == text[i..i+j-1] and pattern[j] != text[i+j] and simply restarts by comparing pattern[0] with text[i+1].
We will see how to pre-process the pattern so that this information is not wasted. The cost will be an increase in the space needed in pattern matching, but this will reduce the worst case time complexity for PMP. The preprocessing is accomplished by exploiting the invariant pattern[0..j-1] == text[i..i+j-1], which holds whenever a mismatch occurs at pattern index j.
The algorithm looks exactly like the brute-force algorithm with the exception of how shifts are made once a mismatch is found.
Consider what we know when pattern[0..j-1] matches text[i..i+j-1]. Figure 2 shows an alignment of the pattern with the text at index i, with matching characters at positions i..i+j-1. Suppose pattern is shifted from position i in text to position i+s. There are three conditions such a shift s should satisfy:
A shift from i to i+s is safe if pattern cannot occur at any position in between. A safe shift is feasible if a match could occur at i+s (based on our current knowledge).
Since we are assuming a mismatch between pattern[j] and text[i+j], a shift of s=1 is safe; this is the brute-force (or conservative) approach. Often such a small shift is infeasible, so we can make larger safe shifts until a feasible shift is found.
Figure 3 extends figure 2 to show the configuration when a shift occurs. For the shift to be feasible, Figure 3 shows that prefix pattern[0..j-s-1] must be a proper suffix of pattern[0..j-1]. To also be safe, s must be the smallest such feasible shift.
Before going farther let's introduce some terms and ideas that are helpful in text algorithms.
A border of a word w is any word that is both a prefix and a suffix of w. For example, the word abaabaaabaaba has the border aba.
Define w.border() to be the longest proper border of a non-empty word w. We can iterate this function to obtain a list of all borders of w; for the sample word w=abaabaaabaaba the list is abaaba, aba, a, and the empty word.
Borders have a dual notion called periods, which are integers p such that 0 < p <= w.length() and the prefix w[0..k-1] equals the subword w[p..p+k-1], where k=w.length()-p.
The periods of w=abaabaaabaaba are 7, 10, 12, and 13, since its borders abaaba, aba, a, and the empty word have lengths 6, 3, 1, and 0; the last period, 13, corresponds to the empty border.
Notice that the period plus the length of the corresponding border equals the word's length.
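To make the duality between borders and periods concrete, here is a small sketch that lists both for a word. It checks every candidate length directly, which takes quadratic time and is not the border-table method developed later; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a direct (quadratic) check, not the border table.
final class BorderDemo {

    /** Lengths of all borders of w shorter than w itself, longest first. */
    static List<Integer> properBorderLengths(String w) {
        List<Integer> lengths = new ArrayList<>();
        for (int len = w.length() - 1; len >= 0; len--) {
            // The suffix of length len is a border if w also starts with it.
            String suffix = w.substring(w.length() - len);
            if (w.startsWith(suffix)) {
                lengths.add(len);
            }
        }
        return lengths;
    }

    /** Periods of w, shortest first: each is w.length() minus a border length. */
    static List<Integer> periods(String w) {
        List<Integer> result = new ArrayList<>();
        for (int len : properBorderLengths(w)) {
            result.add(w.length() - len);
        }
        return result;
    }
}
```

For w=abaabaaabaaba this yields border lengths 6, 3, 1, 0 and periods 7, 10, 12, 13.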
Shifting a string by its shortest period aligns its borders. Such shifts are feasible. As we will see below (section 3.2), one can readily compute the length of the border of a string. So suppose we compute the length |pattern[0..j-1].border()| for each index j=1..m and store this integer in an array border[j] (we'll see border[0] should be set to -1). When a mismatch occurs the pattern is shifted by j-border[j], the period of the matched prefix pattern[0..j-1]. Here's an example:
The mismatch occurs at j=3 once we've matched aba, which has border a of length 1. The period of aba is 2. Shifting pattern 2=3-1 positions produces an alignment of the border of aba. We can restart the matching at j=1=border[3]. We don't need to recheck the match of pattern[0] with text[3] (of course we get an immediate mismatch at the next position, which we'll handle later).
A shift by the period is both safe and feasible, as the above analysis shows.
Now let's consider the boundary case where j=0 when the mismatch occurs. That is, pattern[0] is not the same as text[i]. Clearly a shift of s=1 is warranted, and setting border[0] = -1 makes the general rule produce it: j - border[j] = 0 - (-1) = 1.
We want to compute border[0..m] for pattern, which has length m; recall that border[k] is the length of the border of the prefix pattern[0..k-1], and a shift by k-border[k] aligns the border at the start of pattern[0..k-1] with the border at the end of pattern[0..k-1]. The pattern calls computeBorders() (i.e., the call pattern.computeBorders() is made) before the call text.MorrisPratt(pattern).
Remember that the Text variable pattern has an internal representation that uses a String named text of length n. The variables pattern and m will be used when discussing the algorithm, but text and n will be used in the code when we need to refer to the instance variables.
We know border[0] = -1 and so we can set this initial value. Also, we'll store the length of the current border in a local variable k.
Now we want to compute border[1],...,border[m], where m is the length of pattern. This will be done in a for loop that loops over each non-empty prefix of pattern.
Let's pretend we've computed k=border[j-1] of pattern[0..j-2] for some j >= 1, and we want to compute border[j]. Figure 4 shows that when pattern[k] == pattern[j-1] then border[j] = k+1. That is, when the character after the prefix matches the character after the right border, the border can be extended by 1.
Using the above code chunks we can put together the computeBorders() function easily.
One final act, so the Java compiler does not complain: we declare the integer array border[] used in the Morris-Pratt algorithm.
The need for the array border[0..m] increases the space complexity from the constant space requirement for the brute-force algorithm. The benefit is a linear worst case time complexity.
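The border table and the Morris-Pratt shift described in this section might be sketched as follows. Static methods on a hypothetical MorrisPrattSketch class stand in for the Text instance methods computeBorders() and MorrisPratt(), so the names and structure here are assumptions; only the conventions border[0] = -1 and the shift j - border[j] are taken from the discussion above.

```java
// A sketch under stated assumptions, not the original code chunks.
final class MorrisPrattSketch {

    /**
     * border[j] is the length of the longest proper border of pattern[0..j-1]
     * for j = 1..m, and border[0] = -1.
     */
    static int[] computeBorders(String pattern) {
        int m = pattern.length();
        int[] border = new int[m + 1];
        border[0] = -1;
        int k = -1; // border length of the prefix handled so far
        for (int j = 1; j <= m; j++) {
            // Fall back to the border of the border until pattern[j-1] extends it.
            while (k >= 0 && pattern.charAt(k) != pattern.charAt(j - 1)) {
                k = border[k];
            }
            k++;
            border[j] = k;
        }
        return border;
    }

    /** Decides PMP with Morris-Pratt shifts. */
    static boolean occurs(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] border = computeBorders(pattern);
        int i = 0;
        int j = 0;
        while (i <= n - m) {
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;
            }
            if (j == m) {
                return true;
            }
            i += j - border[j];         // shift by the period of pattern[0..j-1]
            j = Math.max(0, border[j]); // the aligned border is already matched
        }
        return false;
    }
}
```

For the pattern abaaba, computeBorders returns -1, 0, 0, 1, 1, 2, 3. Because i + j never decreases, each text character is examined a bounded number of times, which is where the linear worst case comes from.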
The Morris-Pratt algorithm can be improved by using additional information known at the time a mismatch occurs. In particular, the complete invariant is pattern[0..j-1] == text[i..i+j-1] and pattern[j] != text[i+j].
The Knuth-Morris-Pratt (KMP) algorithm [2] makes use of this additional one bit of mismatch information to allow longer shifts of pattern in text. Otherwise, the algorithm is the same as the previous ones.
One might call the idea used in the KMP shift "strict borders." The lengths of these strict borders are stored in an array strictBorder[], and the shift is computed just as in the Morris-Pratt case using this new array.
The strictBorder[] array used in the Knuth-Morris-Pratt algorithm is declared, and its elements initialized to -1. To fill out the strictBorder[] array, consider what we know:
So if pattern[0..k] is the border of pattern[0..j-1] and pattern[k+1] = pattern[j] != text[i+j] then there will be an immediate mismatch when we shift to align borders. In this case we can safely shift farther, aligning the border of pattern[0..k] with the tail of pattern[0..j-1]. On the other hand, if pattern[k+1] != pattern[j] then perhaps pattern[k+1] = text[i+j], and we can only safely shift far enough to align the border of pattern[0..j-1].
pattern[6] != text[0+6] and the border of abaaba is aba, which has length 3, implying a shift of 6-3=3. A shift of 3 leads to an immediate mismatch since pattern[3]=a will be compared with text[6]=c. Thus, we can consider the border of the border of abaaba, that is the border of aba, or a, which has length 1 and shift by 6-1=5.
The KMP pattern matcher has space complexity S(n+m)=O(m). This reflects the storage for array strictBorder[0..m] and the constant space required for indices and lengths. The worst-case time complexity T(n+m) = O(n+m) is of the same order as for the Morris-Pratt algorithm; KMP generally makes fewer comparisons because of its longer shifts, but it may do no better.
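One way to fill a strict border table is sketched below. It assumes the indexing used for border[] earlier (entry j describes the prefix pattern[0..j-1], with -1 meaning "no usable border") and derives each strict entry from the ordinary border table; the class and method names are hypothetical.

```java
// A sketch under stated assumptions, not the original code chunks.
final class KnuthMorrisPrattSketch {

    /** Ordinary border table: border[0] = -1, border[j] for prefix pattern[0..j-1]. */
    static int[] borders(String pattern) {
        int m = pattern.length();
        int[] border = new int[m + 1];
        border[0] = -1;
        int k = -1;
        for (int j = 1; j <= m; j++) {
            while (k >= 0 && pattern.charAt(k) != pattern.charAt(j - 1)) {
                k = border[k];
            }
            border[j] = ++k;
        }
        return border;
    }

    /**
     * strict[j] is the length k of the longest border of pattern[0..j-1] with
     * pattern[k] != pattern[j], or -1 if there is none. strict[m] = border[m].
     */
    static int[] strictBorders(String pattern) {
        int m = pattern.length();
        int[] border = borders(pattern);
        int[] strict = new int[m + 1];
        strict[0] = -1;
        for (int j = 1; j <= m; j++) {
            int k = border[j];
            if (j == m || pattern.charAt(k) != pattern.charAt(j)) {
                strict[j] = k;         // the border is already strict
            } else {
                strict[j] = strict[k]; // same next character: use a shorter border
            }
        }
        return strict;
    }

    /** Decides PMP; identical to Morris-Pratt except for the table consulted. */
    static boolean occurs(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] strict = strictBorders(pattern);
        int i = 0;
        int j = 0;
        while (i <= n - m) {
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;
            }
            if (j == m) {
                return true;
            }
            i += j - strict[j];
            j = Math.max(0, strict[j]);
        }
        return false;
    }
}
```

For the pattern abaabaa the entry at j=6 is 1 rather than the ordinary border length 3, so a mismatch there shifts by 6-1=5, in line with the example above.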
Now we want to develop a brute-force algorithm that will match pattern against text using a right-to-left scan of pattern. The previous left-to-right algorithm is modified in these ways:
The right-to-left scan starts at the end of pattern.
It decrements the pattern index j so long as text matches pattern.
When a mismatch occurs, it slides pattern one position right (i = i+1) and resets j to point to the end of pattern.
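The three modifications can be sketched as below. The class name RightToLeftBruteForce and the static-method form are assumptions made for illustration.

```java
// A minimal sketch; the class and method names are illustrative only.
final class RightToLeftBruteForce {

    /** Decides PMP, scanning the pattern from its last character to its first. */
    static boolean occurs(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;     // alignment of pattern[0] in text
        int j = m - 1; // the scan starts at the end of pattern
        while (i <= n - m) {
            // Decrement j so long as text matches pattern.
            while (j >= 0 && pattern.charAt(j) == text.charAt(i + j)) {
                j--;
            }
            if (j < 0) {
                return true; // every pattern character matched
            }
            i = i + 1; // slide pattern one position right
            j = m - 1; // and point j back at the end of pattern
        }
        return false;
    }
}
```

Here j running below 0 plays the role that j reaching m played in the left-to-right version.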
The Boyer-Moore algorithm [1] uses the knowledge gained by the above brute-force algorithm to leverage an improved pattern matcher. What do we know? When a mismatch occurs, we know the invariant pattern[j+1..m-1] == text[i+j+1..i+m-1] and pattern[j] != text[i+j].
Let's pretend, after a mismatch, that we shift pattern s positions to the right where 1 <= s <= j. This aligns pattern[0..m-1-s] and pattern[s..m-1] as shown in figure 7. In particular, pattern[j-s] aligns with pattern[j] and text[i+j]. Thus, we require that the shift s satisfy:
Condition 1: pattern[j-s] != pattern[j], and
Condition 2: pattern[j+1-s..m-1-s] == pattern[j+1..m-1].
Now let's pretend that we shift j+1 <= s < m characters. Such a shift aligns pattern[0..m-1-s] and pattern[s..m-1] with text[i+s..i+m-1], see figure 8. Such a shift s satisfies the condition:
Condition 3: pattern[0..m-1-s] == pattern[s..m-1].
When either the first two conditions or the third condition fail to hold we can safely shift pattern the maximal amount m. Given the pattern we can compute if there are shifts satisfying the conditions above. Such shifts are safe and feasible. For each j the longest safe and feasible shift is stored in a look-up table which is used when a mismatch occurs. The algorithm is identical to the brute-force right-to-left scan, except for this look-up of the shift.
We'll call the table of shifts goodSuffix[]. Then when a mismatch occurs on pattern index j, the shift goodSuffix[j] is added to i and pattern index j is reset to the end of pattern.
<<Boyer-Moore shift of pattern>>=
i += pattern.goodSuffix[j];
j = m-1;
We start by declaring an instance of the goodSuffix[] array. Its length will be set to m when pattern is created and each element will be initialized to zero, the Java default. (Remember that the Text variable pattern has an internal representation with a String named text of length n. The variable m will be used when discussing the algorithm, but n will be used in the code.)
The code for goodSuffix[] is abstruse. Essentially, we want to test the conditions discussed above. Our implementation is driven more by a need for clarity than efficiency in time and space. Here is the complete algorithm for computegoodSuffix().
Let's start with an auxiliary routine that reverses the characters in a string. This will be useful in testing condition 2.
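Such a routine might look like the sketch below. It reverses char by char so that index k of the result is index length-1-k of the input; the class name StringReversal is a hypothetical stand-in.

```java
// A minimal sketch; the class name is illustrative only.
final class StringReversal {

    /** Returns the characters of s in reverse order. */
    static String reverse(String s) {
        char[] chars = s.toCharArray();
        // Swap the outermost pair and work inward.
        for (int lo = 0, hi = chars.length - 1; lo < hi; lo++, hi--) {
            char tmp = chars[lo];
            chars[lo] = chars[hi];
            chars[hi] = tmp;
        }
        return new String(chars);
    }
}
```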
Borders for both the pattern and its reverse are used, and a variable called s will denote the shift.
We'll start by being opportunistic and set, for each j, goodSuffix[j] = m, the largest possible shift. As we find that shorter shifts are safe and feasible we'll reset goodSuffix[j] to these smaller values.
To develop the shift table for the Boyer-Moore algorithm, we'll consider boundary cases first.
We restrict our attention to the case where a mismatch occurs at pattern[j] and text[i+j]. This scenario is shown in figure 6, and there are two cases to consider. These are illustrated in figures 7 and 8. In figure 7, the proposed shift s is no more than j. In figure 8, s is larger than j, but less than m.
Condition 1 is the Knuth-Morris-Pratt strict border condition and condition 2 is the Morris-Pratt border condition for the reversed pattern. With reverse being the reversal of pattern, condition 2 says that the prefix reverse[0..m-j+s-2] has a border of length m-j-1.
For a fixed j between 1 and n-1, we'll start the shift s at 1 and increment s until both conditions hold or s exceeds j. So, for each proposed shift s, we'll test whether the strict border condition 1 holds, and when it does we'll determine whether the border condition 2 holds on the reverse pattern; variable k is the length of a border. The strict border condition is:
The prefix of reverse, whose borders we want to test, is reverse[0..n-j+s-2], which has length n-j-1+s. We'll start by setting variable k to the length of the border of this prefix; k will be decremented while it is larger than the length of the tail of pattern we want to match, that is, n-1-j.
The length of the tail pattern[j+1..n-1] that has been matched when a mismatch occurs at j is (n-1)-(j+1)+1 = n-1-j.
The next smaller border is found by looking at the border of the border.
At this point a shift s that satisfies conditions 1 and 2 has been found. We can arrange to exit the while (s <= j) loop by setting s = j; it will then be incremented, forcing an exit of the loop.
Putting all of these pieces together gives the code below.
The shortest shift of the type under consideration is determined by the period of pattern. We'll initialize k to the length of pattern's border and let k become successive (shorter) border lengths as we search for longer shifts.
<<Set the border length>>=
int k = border[n];
The search continues as long as there is a non-empty border. After each search for a shift with one border, we reset the border length k to the length of the next border.
With a border of length k the period to shift aligning pattern borders is n-k.
A placeholder start will be used to control the search over j. The first time through pattern index j starts at 0.
Once we've searched over a range start <= j < s, the next search can be over a range that begins with start = s.
And that is the code which enforces condition 3.
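For comparison, here is a sketch that builds a good-suffix table by testing candidate shifts directly instead of using border tables of the pattern and its reverse. This is a swapped-in, deliberately naive construction (cubic time in the pattern length). It assumes the three conditions take their standard form: for s <= j, pattern[j-s] != pattern[j] and pattern[j+1-s..m-1-s] == pattern[j+1..m-1]; for j < s < m, pattern[0..m-1-s] == pattern[s..m-1]. All names are hypothetical.

```java
// A sketch under stated assumptions; naive table construction for clarity.
final class BoyerMooreSketch {

    /** True if shifting by s after a mismatch at pattern index j could still match. */
    static boolean feasible(String p, int j, int s) {
        int m = p.length();
        // Condition 1: the character moved under the mismatch must differ.
        if (s <= j && p.charAt(j - s) == p.charAt(j)) {
            return false;
        }
        // Conditions 2 and 3: the shifted pattern must agree with the matched
        // tail p[j+1..m-1] wherever the two overlap.
        for (int t = Math.max(j + 1, s); t < m; t++) {
            if (p.charAt(t - s) != p.charAt(t)) {
                return false;
            }
        }
        return true;
    }

    /** goodSuffix[j] is the smallest feasible shift for a mismatch at j, or m. */
    static int[] goodSuffix(String p) {
        int m = p.length();
        int[] shift = new int[m];
        for (int j = 0; j < m; j++) {
            shift[j] = m; // opportunistic default: the maximal shift
            for (int s = 1; s < m; s++) {
                if (feasible(p, j, s)) {
                    shift[j] = s;
                    break;
                }
            }
        }
        return shift;
    }

    /** Right-to-left scan that looks up the shift on each mismatch. */
    static boolean occurs(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] shift = goodSuffix(pattern);
        int i = 0;
        int j = m - 1;
        while (i <= n - m) {
            while (j >= 0 && pattern.charAt(j) == text.charAt(i + j)) {
                j--;
            }
            if (j < 0) {
                return true;
            }
            i += shift[j];
            j = m - 1;
        }
        return false;
    }
}
```

For the pattern abab this gives the table 2, 2, 4, 1. Taking the smallest feasible shift keeps the shift safe, since no smaller shift could produce a match.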
The classical Boyer-Moore algorithm uses what is known as the last occurrence or bad character heuristic. It says: when a mismatch occurs between pattern[j] and text[i+j], find the right-most (last) occurrence of text[i+j] in pattern and shift to align these; see figure 9, which shows this shift.
When the last occurrence of b = text[i+j] in pattern is at index k, the last-occurrence shift on a mismatch at j is j-k. Notice that when k > j this is a negative (leftward) shift! Also, when b does not occur in pattern a shift of j+1 characters is appropriate; thus we'll define lastOccurance[b] = -1 when b does not occur in pattern. Creating a lastOccurance[] table requires |A| space (A is the alphabet and |A| is its cardinality).
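A last-occurrence table could be sketched as follows. As an assumption for illustration it uses a HashMap keyed by character instead of an array of size |A|, so its space grows with the number of distinct characters in pattern; absent characters are treated as having index -1. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch; a map replaces the |A|-sized array described in the text.
final class LastOccurrenceSketch {

    /** Maps each character of pattern to the index of its right-most occurrence. */
    static Map<Character, Integer> table(String pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int k = 0; k < pattern.length(); k++) {
            last.put(pattern.charAt(k), k); // later indices overwrite earlier ones
        }
        return last;
    }

    /** Shift j - k for text character b mismatching at pattern index j. */
    static int shift(Map<Character, Integer> last, char b, int j) {
        int k = last.getOrDefault(b, -1); // -1 when b does not occur in pattern
        return j - k;
    }
}
```

For the pattern abaab, a mismatch at j=2 against the character c gives a shift of 3, while a mismatch at j=2 against b gives -2. A matcher must not apply such a negative shift as it stands; one common remedy is to take the larger of this shift and the good-suffix shift.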
Some authors eschew the use of a lastOccurance[] table; others extol it. It does require space that is dependent on the alphabet, something we've not seen before. Its utility depends on the alphabet size and the distribution of characters in pattern.
Establishing a tight upper bound on the number of comparisons is beyond the scope of these notes. A bound of 4n is fairly simple to prove, although 3n is a better approximation. When pattern is relatively long and the alphabet is large, Boyer-Moore is likely to be the most efficient pattern matcher. Empirically, in the average case, the number of compares is often sub-linear, that is, cn where c < 1.