The most basic text problem is PATTERN MATCHING (PMP for short), where we are given a pattern[0..m-1] of length m and text[0..n-1] of length n, where m is generally very much smaller than n, and we wish to know if pattern occurs in text.
We will study several algorithms for PMP. We are interested in their implementation as well as their time and space complexities. Some auxiliary functions will be introduced as needed.
We are programming in an object-oriented style using the Java language, so we will assume a class Text that knows how to find patterns within the text property of the class.
<Text.java>=
public class Text {
  <Class properties>
  <Instance properties>
  <Brute-force pattern matching with left-to-right scan>
  <Morris-Pratt pattern matching>
  <Knuth-Morris-Pratt pattern matching>
  <Brute-force pattern matching with right-to-left scan>
  <Boyer-Moore pattern matching>
  <Auxiliary functions>
  <Test the pattern matching algorithms>
}
At the moment, I can't think of any class properties, that is, variables of which there is only one copy associated with the class (declared using the static keyword). But just in case we need to define some later, we'll make a place for them.
<Class properties>= (<-U)
The instance properties of the class include, at a minimum, a String named text.
Additional instance properties will be defined as required.
<Instance properties>= (<-U) [D->] String text;
The problem to solve is PMP: Does pattern[0..m-1] occur in text[0..n-1] or not? Our first few algorithms will use a left-to-right scan for the pattern in the text. In contrast, the Boyer-Moore algorithm will scan right-to-left.
One thing known from the problem statement is the length of the text.
<initialize text length>= (U-> U-> U-> U-> U-> U-> U-> U->) int n = text.length();
And the length of pattern
<initialize pattern length>= (U-> U-> U-> U-> U->) int m = pattern.length();
The brute-force (naive) pattern matching algorithm compares pattern[j] with text[i+j] for each i=0..n-m and j=0..m-1 until the pattern is found or the end of the text is reached. Let i denote the index into text where matching starts; it is bounded below by 0 (is this obvious or not?).
<set text index to start of text>= (U-> U-> U-> U-> U->) int i = 0;
The text index i is bounded above by n-m; otherwise pattern would slide off the right end of text (consider Figure [->]). If i is the rightmost legal position in text, then pattern[0..m-1] aligns with text[i..n-1], which implies i+m-1=n-1, or i=n-m.
[Figure: pattern and text.]
<text index is legal>= (U-> U-> U-> U-> U->) (i <= n-m)
The index j will serve two purposes. Not only is it a character index in pattern, it will also serve as a count of the number of characters matched. Thus j is bounded below by 0 and above by m; it can take on a value 1 larger than a legal index (m-1), but only to serve as a flag that m characters have matched.
<set pattern index to start of pattern>= (U-> U-> U->) int j =0;
Define a predicate for testing if the pattern
index is legal.
<pattern index is legal>= (U->) (j<m)
The matching starts at some index i, and as long as the pattern matches the text we simply increment the j index.
<pattern[j] matches text[i+j]>= (U-> U->) (pattern.charAt(j) == text.charAt(i+j))
<increment pattern index>= (U->) j++;
All this is neatly handled in a while statement.
<left-to-right scan>= (U-> U-> U->) while (<pattern index is legal> && <pattern[j] matches text[i+j]>) { <increment pattern index> }
There are two exits for the inner while loop. One is when m characters of pattern have matched text, and in this case we simply return true as the answer to the decision problem PMP.
Problem 1: An alternative is to continue looking for a second or further occurrence of the pattern. An on-line algorithm, which continually accepts input until an end-of-input marker is found, would usually do this. Rewrite the code to handle a continuous stream of characters. It should output a stream of 0's and 1's indicating whether the pattern was not found or was found.
<pattern found in text>= (U-> U-> U->) (j==m)
Of course when the pattern
is not found we return false
.
<pattern not found in text>= (U-> U-> U-> U-> U->) return false;
The other exit of the inner while loop occurs when a mismatch is found at index j in pattern and index i+j in text. The brute-force approach is to shift the pattern one position forward in text and reset the pattern index to 0.
<brute-force shift of pattern>= (U->) ++i; j=0;
Putting all these pieces together produces the brute-force left-to-right pattern matching algorithm.
<Brute-force pattern matching with left-to-right scan>= (<-U)
public boolean patternMatcher (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to start of pattern>
  while <text index is legal> {
    <left-to-right scan>
    if <pattern found in text> return true;
    <brute-force shift of pattern>
  }
  <pattern not found in text>
}
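To see the assembled chunks as ordinary Java, here is a minimal sketch that works directly on String values rather than on the Text class. The class name BruteForceSketch and the method name contains are hypothetical names chosen for this illustration; they are not part of Text.java.

```java
final class BruteForceSketch {
    /** Returns true if pattern occurs in text, using a left-to-right brute-force scan. */
    static boolean contains(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;  // position in text where pattern[0] is aligned
        int j = 0;  // pattern index, and count of characters matched so far
        while (i <= n - m) {                 // text index is legal
            while (j < m && pattern.charAt(j) == text.charAt(i + j)) {
                j++;                         // extend the match
            }
            if (j == m) {
                return true;                 // all m characters matched
            }
            ++i;                             // brute-force shift by one
            j = 0;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains("aabacabaa", "aba"));
        System.out.println(contains("aabacabaa", "abc"));
    }
}
```

Note that an empty pattern is reported as found, and a pattern longer than the text never enters the outer loop.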
There are several things to notice about the brute-force pattern matcher. First, it has optimal space complexity; it solves PMP in constant space, needing only a few registers for indices i, j, and lengths m and n. (In general, space complexity ignores the space required for input and output.) Second, it has worst case quadratic time complexity, and this is not very good!
Problem 2: Show that in the worst case patternMatcher's inner while loop can execute m times and the outer while loop can execute n-m+1 times. Show that the maximum number of comparisons is (n+1)^2/4 and give example strings for pattern and text where this worst case is realized. Hint: maximize the quadratic expression nm-m^2+m as a function of m.
Of course the average case time complexity is not nearly so bad.
Problem 3: Turn the brute-force algorithm given in these notes into a working program with input and output. Test the average case behavior of the code. Use the words in a dictionary (for example /usr/dict/words on a Unix system) as patterns for which to search. Find a large text document on the World Wide Web (for example www.gutenberg.net has a collection of great books that can serve as text files).
The brute-force pattern matcher does not use the information gathered once a mismatch occurs; it throws away the knowledge that pattern[0..j-1] == text[i..i+j-1] and pattern[j] != text[i+j] and simply restarts by comparing pattern[0] with text[i+1]. We will see how to pre-process the pattern so that this information is not wasted. The cost will be an increase in the space needed in pattern matching, but this will reduce the worst case time complexity for PMP. The preprocessing is accomplished by exploiting the invariant pattern[0..j-1] = text[i..i+j-1]. Later we will make use of the mismatch information to improve the process even more.
The algorithm looks exactly like the brute-force algorithm with the exception of how shifts are made once a mismatch is found.
<Morris-Pratt pattern matching>= (<-U)
public boolean MorrisPratt (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to start of pattern>
  while <text index is legal> {
    <left-to-right scan>
    if <pattern found in text> return true;
    <Morris-Pratt shift>
  }
  <pattern not found in text>
}
Consider what we know when pattern[0..j-1] matches text[i..i+j-1]. Figure [->] shows an alignment of pattern with text at index i, with matching characters at positions i..i+j-1. Suppose pattern is shifted from position i in text to position i+s. There are three conditions such a shift s should satisfy:
- s should be safe,
- s should be feasible,
- s should be at least one.
[Figure: pattern[0..j-1] and text[i..i+j-1].]
A shift from i to i+s is safe if pattern cannot occur at any position in between. A safe shift is feasible if a match could occur at i+s (based on our current knowledge). Since we are assuming a mismatch between pattern[j] and text[i+j], a shift of s=1 is safe; this is the brute-force (or conservative) approach. We'd like to conclude that such a small shift is infeasible, so we can safely make larger safe shifts until a feasible shift is found. Figure [->] extends figure [<-] to show the configuration when a shift occurs. For the shift to be feasible, Figure [->] shows that the prefix pattern[0..j-s-1] must be a proper suffix of pattern[0..j-1]. To also be safe, s must be the smallest such feasible shift.
[Figure: pattern.]
Before going farther let's introduce some terms and ideas that are helpful in text algorithms.
The border of a word w is any word that is both a prefix and suffix of w. For example, the word w=abaabaaabaaba has proper borders abaaba, aba, a, and e, where e stands for the empty word. Define w.border() to be the longest proper border of a non-empty word w. We can iterate this function, obtaining a list of all borders of w; for the sample word w=abaabaaabaaba we get w.border()=abaaba, abaaba.border()=aba, aba.border()=a, and a.border()=e.
Borders have a dual notion called a period, which is an integer p such that 0 < p <= |w| and the prefix w[0..k] = w[p..p+k], where k=|w|-p-1. The periods of w=abaabaaabaaba are 7, 10, 12 and 13. The shortest period plus the length of the longest proper border equals the word's length: p + |w.border()| = |w|. Shifting a string by its shortest period aligns its borders. Such shifts are feasible.
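The duality between borders and periods can be checked by brute force. The following is only a sketch straight from the definitions, not an efficient method; the class name BorderSketch and its two method names are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class BorderSketch {
    /** Lengths of all proper borders of w, longest first (0 stands for the empty border). */
    static List<Integer> borderLengths(String w) {
        List<Integer> lengths = new ArrayList<>();
        for (int len = w.length() - 1; len >= 0; len--) {
            // a border of length len is a suffix of w that is also a prefix of w
            if (w.startsWith(w.substring(w.length() - len))) {
                lengths.add(len);
            }
        }
        return lengths;
    }

    /** All periods of w, smallest first: p is a period exactly when |w| - p is a border length. */
    static List<Integer> periods(String w) {
        List<Integer> result = new ArrayList<>();
        for (int len : borderLengths(w)) {
            result.add(w.length() - len);
        }
        return result;
    }

    public static void main(String[] args) {
        String w = "abaabaaabaaba";
        System.out.println(borderLengths(w)); // prints [6, 3, 1, 0]
        System.out.println(periods(w));       // prints [7, 10, 12, 13]
    }
}
```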
Problem 4: Define Fibonacci words over the alphabet {a, b} by F_0 = e, F_1 = b, F_2 = a, and F_n = F_{n-1}F_{n-2} for n >= 3. Determine the length of F_n. Find the periods and borders of F_n.
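For experimenting with Problem 4, a small generator can help. This sketch assumes the recurrence starts at n = 3 (since F_2 = a is given as a base case); the class and method names are hypothetical.

```java
final class FibonacciWordSketch {
    /** Returns F_n, with F_0 the empty word, F_1 = "b", F_2 = "a", and F_n = F_{n-1} F_{n-2} for n >= 3. */
    static String fibonacciWord(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        if (n == 0) {
            return "";
        }
        if (n == 1) {
            return "b";
        }
        String previous = "b";  // F_{k-1}
        String current = "a";   // F_k, starting with k = 2
        for (int k = 3; k <= n; k++) {
            String next = current + previous;  // F_k = F_{k-1} F_{k-2}
            previous = current;
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 7; n++) {
            System.out.println("F_" + n + " = " + fibonacciWord(n));
        }
    }
}
```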
As we will see below [->], one can readily compute the length of the border of a string. So suppose we compute the length |pattern[0..j-1].border()| for each index j=1..m and store this integer in an array border[j] (we'll see border[0] should be set to -1). When a mismatch occurs the pattern is shifted by its period j-border[j]. Here's an example:
pattern index j:    0 1 2 3 4 5
pattern:            a b a a b a
text:             a a b a c a b a a ...
text index i:     0 1 2 3 4 5 6 7 8 ...
The mismatch occurs at j=3 once we've matched aba, which has border a of length 1. The period of aba is 2. Shifting pattern 2=3-1 positions produces an alignment of the border of aba.
pattern index j:        0 1 2 3 4 5
pattern:                a b a a b a
text:             a a b a c a b a a ...
text index i:     0 1 2 3 4 5 6 7 8 ...
We can restart the matching at j=1=border[3]. We don't need to recheck the match of pattern[0] with text[3] (of course we get an immediate mismatch at the next position, which we'll handle later).
A shift by the period is both safe and feasible. From the above analysis we know that the shift satisfies s = j - border[j], where border[j] is the length of the border of pattern[0..j-1]. Now let's consider the boundary case where j=0 when the mismatch occurs. That is, pattern[0] is not the same as text[i]. Clearly a shift of s=1 is warranted, and s = 1 = 0 - border[0] implies border[0]=-1. Because of this we must restart the matching process from pattern index j=max(0, border[j]).
<Morris-Pratt shift>= (U->) i += j-pattern.border[j]; j = (0 < pattern.border[j]) ? pattern.border[j] : 0;
We want to compute border[0..m] for pattern, which has length m; recall border[k] is the length of the border of the prefix pattern[0..k-1], and a shift by k-border[k] aligns the border at the start of pattern[0..k-1] with the border at the end of pattern[0..k-1]. The pattern calls computeBorders() (i.e., pattern.computeBorders()) before the call text.MorrisPratt(pattern). Remember that the Text variable pattern has an internal representation that uses a String named text of length n. The variables pattern and m will be used when discussing the algorithm, but text and n will be used in the code.
We know border[0] = -1
and so we can set this initial value.
Also, we'll store the length of the current border in a local variable k
.
<Define border[0]>= (U->) border[0] = -1; int k = -1;
Now we want to compute border[1],...,border[m]
, where
m
is the length of pattern
. This will be done in a
for
loop that loops over each non-empty prefix of pattern
.
<each non-empty prefix>= (U->) (int j = 1; j <= n; j++)
Let's pretend we've computed k=border[j-1], the length of the border of pattern[0..j-2], for some j >= 1, and we want to compute border[j]. Figure [->] shows that when pattern[k] == pattern[j-1] then border[j] = k+1. That is, when the character after the border matches the character after the prefix, the border is extended by 1.
<Extend the border by 1 when next characaters match>= (U->) ++k; border[j] = k;
Now let's consider what happens when there is a mismatch, that is, pattern[k] != pattern[j-1]. This case is a little more complex to figure out, but figure [->] shows that while a mismatch occurs, the current border k needs to be reduced to border[k] until there is a match or no border is found (k = -1).
<Reduce the border until a match is found or no border exists>= (U->) while ((k >= 0) && (text.charAt(k) != text.charAt(j-1))) { k = border[k]; }
Using the above code chunks we can put together the computeBorders() function easily.
<Auxiliary functions>= (<-U) [D->]
public void computeBorders() {
  <initialize text length>
  <Define border[0]>
  for <each non-empty prefix> {
    <Reduce the border until a match is found or no border exists>
    <Extend the border by 1 when next characaters match>
  }
}
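Expanded into plain Java, the border computation might look like the sketch below. It returns the table instead of storing it in an instance property, and the class name BorderTableSketch is a hypothetical name used only for this illustration.

```java
import java.util.Arrays;

final class BorderTableSketch {
    /** Returns border[0..m], where border[j] is the length of the longest proper border of pattern[0..j-1]. */
    static int[] computeBorders(String pattern) {
        int m = pattern.length();
        int[] border = new int[m + 1];
        border[0] = -1;                 // boundary value that forces a shift of 1 when j = 0
        int k = -1;                     // length of the current border
        for (int j = 1; j <= m; j++) {  // each non-empty prefix pattern[0..j-1]
            // reduce the border until its next character matches or no border exists
            while (k >= 0 && pattern.charAt(k) != pattern.charAt(j - 1)) {
                k = border[k];
            }
            ++k;                        // extend the border by 1
            border[j] = k;
        }
        return border;
    }

    public static void main(String[] args) {
        // prints [-1, 0, 0, 1, 1, 2, 3]
        System.out.println(Arrays.toString(computeBorders("abaaba")));
    }
}
```

For the pattern abaaba used in the earlier example, border[3] = 1, which is the restart index seen after the mismatch at j=3.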
One final act, so the Java compiler does not complain: we declare the integer array border[] used in the Morris-Pratt algorithm.
<Instance properties>+= (<-U) [<-D->] public int[] border;
The need for the array border[0..m]
increases the space complexity
from the constant space requirement for the brute-force algorithm.
The benefit is a linear worst case time complexity.
Theorem: The maximal number of symbol compares in the Morris-Pratt algorithm is 2n-m.
Proof: Each unsuccessful compare is followed by a shift that increases i, which ranges from 0 to n-m, so there are at most n-m+1 unsuccessful compares. We can give an upper bound for the number of successful compares by considering the sum i+j. The least value for i+j is 0 and the greatest value is n-1. Each time a successful compare is made i+j increases by 1 and it never decreases, thus there are at most n successful compares. Finally, successful and unsuccessful compares cannot both attain their maximum, thus there are at most n + (n-m+1) - 1 = 2n-m compares in the Morris-Pratt algorithm.
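The bound can be explored empirically with a matcher that counts symbol compares. The sketch below is a String-based rendering of the Morris-Pratt chunks; the class name, the method names, and the static counter compares are assumptions added for this experiment and are not part of Text.java.

```java
final class MorrisPrattSketch {
    /** Symbol compares made by the most recent call to morrisPratt (for experiments only). */
    static int compares;

    /** border[j] is the length of the longest proper border of pattern[0..j-1]; border[0] = -1. */
    static int[] computeBorders(String pattern) {
        int m = pattern.length();
        int[] border = new int[m + 1];
        border[0] = -1;
        int k = -1;
        for (int j = 1; j <= m; j++) {
            while (k >= 0 && pattern.charAt(k) != pattern.charAt(j - 1)) {
                k = border[k];
            }
            border[j] = ++k;
        }
        return border;
    }

    /** Returns true if pattern occurs in text; shifts by the period j - border[j] on a mismatch. */
    static boolean morrisPratt(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] border = computeBorders(pattern);
        compares = 0;
        int i = 0;
        int j = 0;
        while (i <= n - m) {
            while (j < m) {
                compares++;                          // one symbol compare
                if (pattern.charAt(j) != text.charAt(i + j)) {
                    break;                           // mismatch at pattern[j]
                }
                j++;
            }
            if (j == m) {
                return true;
            }
            i += j - border[j];                      // Morris-Pratt shift
            j = Math.max(0, border[j]);              // matched border need not be rechecked
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "aabacabaa";
        String pattern = "abaaba";
        boolean found = morrisPratt(text, pattern);
        int bound = 2 * text.length() - pattern.length();
        System.out.println(found + " with " + compares + " compares; bound 2n-m = " + bound);
    }
}
```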
The Morris-Pratt algorithm can be improved by using additional information known at the time a mismatch occurs. In particular, the complete invariant is pattern[0..j-1]==text[i..i+j-1] and pattern[j] != text[i+j]. The Knuth-Morris-Pratt (KMP) algorithm [2] makes use of this additional one bit of mismatch information to allow longer shifts of pattern in text. Otherwise, the algorithm is the same as the previous ones.
<Knuth-Morris-Pratt pattern matching>= (<-U)
public boolean KMP (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to start of pattern>
  while <text index is legal> {
    <left-to-right scan>
    if <pattern found in text> return true;
    <Knuth-Morris-Pratt shift>
  }
  <pattern not found in text>
}
One might call the idea used in the KMP shift ``strict borders.'' The lengths of these strict borders are stored in an array strictBorder[], and the shift is computed just as in the Morris-Pratt case using this new array.
<Knuth-Morris-Pratt shift>= (U->) i += j-pattern.strictBorder[j]; j = (0 < pattern.strictBorder[j]) ? pattern.strictBorder[j] : 0;
The strictBorder[]
array used in the Knuth-Morris-Pratt
algorithm is declared.
<Instance properties>+= (<-U) [<-D->] public int[] strictBorder;
And its elements are initialized to -1.
<initialize border>= for (int j = 0; j <= text.length(); j++) strictBorder[j]=-1;
To fill out the strictBorder[] array, consider what we know: pattern[0..j-1]==text[i..i+j-1] and pattern[j] != text[i+j]. So if pattern[0..k] is the border of pattern[0..j-1] and pattern[k+1] = pattern[j] != text[i+j], then there will be an immediate mismatch when we shift to align borders. In this case we can safely shift farther, aligning the border of pattern[0..k] with the tail of pattern[0..j-1]. On the other hand, if pattern[k+1] != pattern[j], then perhaps pattern[k+1] = text[i+j], and we can only safely shift far enough to align the border of pattern[0..j-1].
Here's an example.
index:   | 0123456 |
pattern: | abaabaa |
text:    | abaabacabaab |
Here pattern[6] != text[0+6], and the border of abaaba is aba, which has length 3, implying a shift of 6-3=3. A shift of 3 leads to an immediate mismatch since pattern[3]=a will be compared with text[6]=c. Thus, we can consider the border of the border of abaaba, that is, the border of aba, or a, which has length 1, and shift by 6-1=5.
Problem 5: Develop an algorithm that computes the strict border of a pattern. You may find it useful to know that strictBorder[j] = border[j] if pattern[border[j]] != pattern[j], while when this inequality does not hold we set k = border[k], starting from k = border[j], until pattern[k] != pattern[j] or k becomes negative. Show that your algorithm is correct and estimate its time complexity.
The KMP pattern matcher has space complexity S(n+m)=O(m). This reflects the storage for array strictBorder[0..m] and the constant space required for indices and lengths. The time complexity T(n+m) = O(n+m) is of the same order as that of the Morris-Pratt algorithm; KMP generally makes fewer compares, but may do no better.
Now we want to develop a brute-force algorithm that will match pattern against text using a right-to-left scan of pattern. The previous left-to-right algorithm is modified in these ways:
- The pattern index j starts at the end of pattern.
- The scan decrements j.
- When j runs off the left end (j=-1), a complete match has occurred.
- A mismatch shifts pattern right one place and resets j to the end of pattern.
<Brute-force pattern matching with right-to-left scan>= (<-U)
public boolean patternMatcher2 (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to end of pattern>
  while <text index is legal> {
    <right-to-left scan>
    if (j == -1) return true;
    <right-to-left brute-force shift of pattern>
  }
  <pattern not found in text>
}
The right-to-left scan starts at the end of pattern
.
<set pattern index to end of pattern>= (U-> U->) int j = m-1;
And decrements pattern
index j
so long as text
matches pattern
.
<right-to-left scan>= (U-> U->) while ((j > -1) && <pattern[j] matches text[i+j]>) { --j; }
When a mismatch occurs, slide pattern
one position right (i = i+1
)
and reset j
to point to the end of pattern
.
<right-to-left brute-force shift of pattern>= (U->) ++i; j = m-1;
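Put together as ordinary Java on String values, the right-to-left brute-force scan might look like this sketch. The names RightToLeftSketch and contains are hypothetical and serve only this illustration.

```java
final class RightToLeftSketch {
    /** Returns true if pattern occurs in text, scanning each alignment from right to left. */
    static boolean contains(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int i = 0;        // position in text where pattern[0] is aligned
        int j = m - 1;    // pattern index, starting at the end of pattern
        while (i <= n - m) {
            while (j > -1 && pattern.charAt(j) == text.charAt(i + j)) {
                --j;      // keep scanning leftward while characters match
            }
            if (j == -1) {
                return true;  // ran off the left end: complete match
            }
            ++i;          // slide pattern one position right
            j = m - 1;    // and restart at the end of pattern
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains("aabacabaa", "caba"));
        System.out.println(contains("aabacabaa", "abab"));
    }
}
```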
The Boyer-Moore algorithm [1] uses the knowledge gained by the above brute-force algorithm to build an improved pattern matcher. What do we know? When a mismatch occurs, we know the invariant pattern[j] != text[i+j] and pattern[j+1..m-1] = text[i+j+1..i+m-1]. This invariant is shown in figure [->].
Let's pretend, after a mismatch, that we shift pattern s positions to the right, where 1 <= s <= j. This aligns pattern[0..m-1-s] with pattern[s..m-1], as shown in figure [->]. In particular, pattern[j-s] aligns with pattern[j] and text[i+j]. Thus, we require that the shift s satisfy:
1. pattern[j-s] != pattern[j]. If they were equal, a mismatch between pattern[j-s] and text[i+j] would occur; such a shift is not feasible.
2. The reversed segment pattern[m-1..j-s+1] has a border of length m-1-j, that is, pattern[m-1..j+1] = pattern[m-1-s..j+1-s].
[Figure: a shift of between 1 and j positions.]
Now let's pretend that we shift s characters, where j+1 <= s < m. Such a shift aligns pattern[0..m-1-s] with pattern[s..m-1] and with text[i+s..i+m-1]; see figure [->]. Such a shift s satisfies the condition:
3. pattern has some border pattern[0..m-1-s] of length m-j-1 or less.
[Figure: a shift of between j+1 and m positions.]
When neither the first two conditions nor the third condition can be satisfied, we can safely shift pattern the maximal amount m. Given the pattern we can compute whether there are shifts satisfying the conditions above. Such shifts are safe and feasible. For each j the longest safe shift, which is the smallest feasible one, is stored in a look-up table which is used when a mismatch occurs. The algorithm is identical to the brute-force right-to-left scan, except for this look-up of the shift.
<Boyer-Moore pattern matching>= (<-U)
public boolean BoyerMoore (Text pattern) {
  <initialize text length>
  <initialize pattern length>
  <set text index to start of text>
  <set pattern index to end of pattern>
  while <text index is legal> {
    <right-to-left scan>
    if (j == -1) return true;
    <Boyer-Moore shift of pattern>
  }
  <pattern not found in text>
}
We'll call the table of shifts goodSuffix[]
.
Then when a mismatch occurs on pattern index j
,
the shift goodSuffix[j]
is added
to i
and pattern
index j
is reset to the end of pattern
.
<Boyer-Moore shift of pattern>= (<-U) i += pattern.goodSuffix[j]; j = m-1;
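The shift conditions can be tested directly, without borders, to obtain a reference table for small patterns. The sketch below is a naive check of the three conditions stated earlier, not the computeGoodSuffix() developed in these notes, and is far less efficient; the class name GoodSuffixSketch and its method names are assumptions made for this illustration.

```java
import java.util.Arrays;

final class GoodSuffixSketch {
    /** True if shift s passes the conditions for a mismatch at pattern index j. */
    static boolean isFeasible(String p, int j, int s) {
        int m = p.length();
        if (s <= j) {
            // conditions 1 and 2: a different character lands on the mismatch position,
            // and the matched suffix pattern[j+1..m-1] reoccurs s positions to the left
            return p.charAt(j - s) != p.charAt(j)
                    && p.regionMatches(j + 1 - s, p, j + 1, m - 1 - j);
        }
        // condition 3: pattern[0..m-1-s] equals pattern[s..m-1]
        return p.regionMatches(0, p, s, m - s);
    }

    /** shift[j] is the smallest feasible shift for a mismatch at j, or m if there is none. */
    static int[] goodSuffix(String p) {
        int m = p.length();
        int[] shift = new int[m];
        for (int j = 0; j < m; j++) {
            shift[j] = m;                       // maximal shift unless a smaller one qualifies
            for (int s = 1; s < m; s++) {
                if (isFeasible(p, j, s)) {
                    shift[j] = s;
                    break;
                }
            }
        }
        return shift;
    }

    /** Right-to-left scan that shifts by the table entry on a mismatch. */
    static boolean boyerMoore(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] shift = goodSuffix(pattern);
        int i = 0;
        int j = m - 1;
        while (i <= n - m) {
            while (j > -1 && pattern.charAt(j) == text.charAt(i + j)) {
                --j;
            }
            if (j == -1) {
                return true;
            }
            i += shift[j];
            j = m - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(goodSuffix("abab"))); // prints [2, 2, 4, 1]
        System.out.println(boyerMoore("xxababxabab", "abab"));
    }
}
```

A table produced this way can serve as a cross-check on the output of a faster implementation for short test patterns.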
goodSuffix[] array
We start by declaring an instance of the goodSuffix[] array. Its length will be set to m when pattern is created and each element will be initialized to zero, the Java default. (Remember that the Text variable pattern has an internal representation with a String named text of length n. The variable m will be used when discussing the algorithm, but n will be used in the code.)
<Instance properties>+= (<-U) [<-D] public int[] goodSuffix;
The code for goodSuffix[] is abstruse. Essentially, we want to test the conditions discussed above. Our implementation is driven more by a need for clarity than efficiency in time and space. Here is the complete algorithm for computeGoodSuffix().
<Auxiliary functions>+= (<-U) [<-D->]
public void computeGoodSuffix() {
  <Declarations and initializations>
  <Set each shift to the maximal value>
  <Reset shifts when 1 <= goodSuffix[j] <= j>
  <Reset shifts when goodSuffix[j] > j>
}
Let's start with an auxiliary routine that reverses the characters in a string. This will be useful in testing condition [<-].
<Auxiliary functions>+= (<-U) [<-D->] public Text reverse() { <initialize text length> StringBuffer reverse = new StringBuffer(); for (int i = n-1; i > -1; i--) { reverse.append(text.charAt(i)); } return new Text(reverse.toString()); }
Borders for both the pattern
and its reverse
are used,
and a variable called s
will denote the shift.
<Declarations and initializations>= (<-U) <initialize text length> Text reverse = reverse(); computeBorders(); reverse.computeBorders(); int s;
We'll start by being optimistic and set, for each j, goodSuffix[j] = m, the largest possible shift. As we find that shorter shifts are safe and feasible, we'll reset goodSuffix[j] to these smaller values.
<Set each shift to the maximal value>= (<-U) for (int j = 0; j < n; j++) { goodSuffix[j] = n; }
To develop the shift table for the Boyer-Moore algorithm, we'll consider boundary cases first.
- When a mismatch occurs between pattern[m-1] and text[i+m-1], a shift of 1 is appropriate, but so is a shift by the smallest value s such that pattern[m-1-s] != pattern[m-1]. This is condition [<-] for the case j=m-1. The requirement is that s be the smallest value satisfying reverse.border[s+1] = 0.
- When j, our pattern position index, has fallen off the left end of pattern, that is, j == -1, our decision algorithm simply returns true.
- When the mismatch is pattern[0] != text[i], we've matched all characters except the first. I hope it is obvious that a shift by the period of pattern, that is, m-border[m], is both safe and feasible. The border is a good suffix where a shift by the period will produce a potential pattern-text match; no shorter shift can.
We restrict our attention to the case where a mismatch occurs at
pattern[j]
and 0 < j < m-1
.
This scenario is shown in figure [<-],
and there are two cases to consider.
These are illustrated in figures [<-] and [<-].
In figure [<-], the proposed shift s
is no more than j
.
In figure [<-], s
is larger than j
, but less than m-1
.
Case: 1 <= s <= j. The shift s is between 1 and j, where a mismatch occurs at pattern[j]. Figure [<-] illustrates that two conditions must hold:
[*] pattern[j-s] != pattern[j]
and
[*] pattern[j+1-s..m-1-s] = pattern[j+1..m-1].
Condition [<-] is the Knuth-Morris-Pratt strict border condition and condition [<-] is the Morris-Pratt border condition for the reversed pattern. With reverse being the reversal of pattern, condition [<-] says that the prefix reverse[0..m-j+s-2] has a border of length m-j-1. For a fixed j between 1 and n-1, we'll start the shift s at 1 and increment s until both conditions hold or s exceeds j.
So, for each proposed shift s, we'll test if the strict border condition [->] holds, and when it does we'll determine if the border condition [<-] holds on the reverse pattern; variable k is the length of a border.
<Search for a safe shift between 1 and j>= (U->)
s = 1;
while (s <= j) {
  if <Strict border condition> {
    <Initialize border of reverse[0..(n-j-1+s)]>
    while <Border greater than tail to match> {
      <Reset to smaller border>
    }
    if (k == n-j-1) { // border condition satisfied
      <Set goodSuffix[j] and exit while loop>
    }
  }
  ++s;
}
The strict border condition is:
<Strict border condition>= (<-U) (text.charAt(j-s) != text.charAt(j))
The prefix of reverse
, whose borders we want to test,
is reverse[0..(n-j-1+s)]
. We'll start by setting variable
k
to the border of this prefix; k
will be
decremented while it is larger than the length
of the tail of pattern
we want to match, that is, n-1-j
.
<Initialize border of reverse[0..(n-j-1+s)]
>= (<-U)
int k = reverse.border[n-j-1+s];
The length of the tail that has been matched when a mismatch occurs at j is (n-1)-(j+1)+1 = n-1-j.
<Border greater than tail to match>= (<-U) (k > n-j-1)
The next smaller border is found by looking at the border of the border.
<Reset to smaller border>= (<-U U->) k = reverse.border[k];
At this point a shift s that satisfies conditions [->] and [<-] has been found. We can arrange to exit the while (s <= j) loop by setting s = j; it will then be incremented, forcing an exit of the loop.
<SetgoodSuffix[j]
and exitwhile
loop>= (<-U) goodSuffix[j] = s; s = j;
Putting all of these pieces together gives the code below.
<Reset shifts when1 <= goodSuffix[j] <= j
>= (<-U) for (int j=1; j < n; j++) { <Search for a safe shift between 1 andj
> }
Case: j < s. No shift of at most j can be found, but perhaps a shift greater than j exists. Figure [<-] depicts the situation that is represented by the equation
[*] pattern[0..m-1-s] = pattern[s..m-1].
Here's the outline of what we need to do.
<Reset shifts when goodSuffix[j] > j
>= (<-U)
<Set the border length>
<Initialize the index where the search starts>
while (<There is a non-empty border>) {
<Set shift to the period of the current border>
for (int j = start; j < s; j++) {
goodSuffix[j] = (s < goodSuffix[j]) ? s : goodSuffix[j];
System.out.println("**gs["+j+"]="+goodSuffix[j]);
}
<Reset the start index>
<Reset to smaller border>
}
The shortest shift of the type under consideration is determined by the
period of pattern
.
We'll initialize k
to the length of pattern
's border and
let k
become successive (shorter) border lengths
as we search for longer shifts.
<Set the border length>= (<-U) int k = border[n];
The search continues as long as there is a non-empty border.
After each search for a shift with one border, we reset the
border length k
to the length of the next border.
<There is a non-empty border>= (<-U) (k > 0)
With a border of length k
the period to shift aligning pattern
borders is n-k
.
<Set shift to the period of the current border>= (<-U) s = n - k;
A placeholder start
will be used to control the search over j
.
The first time through pattern
index j
starts at 0
.
<Initialize the index where the search starts>= (<-U) int start = 0;
Once we've searched over a range start <= j < s, the next search can be over a range that begins with start = s.
<Reset the start index>= (<-U) start = s;
And that is the code which enforces condition [<-].
computeGoodSuffix()
computeGoodSuffix() is not very efficient, but it may be clearer than other developments of the code.
Problem 6: Provide a time and space complexity analysis of the presented code for
computeGoodSuffix()
.
Problem 7: Develop an alternative, more efficient (in time and space) algorithm for computeGoodSuffix(). Some things to consider:
- Declaring the reverse of pattern requires significant extra space; it can be eliminated.
- The time spent computing goodSuffix[n-1] is large; this computation can be folded into the computation when goodSuffix[j] <= j.
The classical Boyer-Moore algorithm uses what is known as the last occurrence or bad character heuristic. It says, when a mismatch occurs between pattern[j] and text[i+j], find the right-most (last) occurrence of text[i+j] in pattern and shift to align these; see figure [->], which shows this shift.
[Figure: The character (a) at the pattern's end does not match the bad character (b) in text. The last occurrence of b in pattern is at j. No shift less than m-1-j could produce a pattern-text match.]
When the last occurrence of b in pattern is at index k, the lastOccurrence shift on a mismatch at j is j-k. Notice that when k > j this is a negative (leftward) shift! Also, when b does not occur in pattern a shift of j+1 characters is appropriate, thus we'll define lastOccurrence[b] = -1 when b does not occur in pattern. To create a lastOccurrence[] table requires |A| space (A is the alphabet and |A| is its cardinality). Some authors eschew the use of a lastOccurrence[] table, others extol it. It does require space that is dependent on the alphabet, something we've not seen before. Its utility depends on the alphabet size and the distribution of characters in pattern.
Problem 8: Write a program computeLastOccurrence(), which when given an alphabet A and a pattern determines the last (rightmost) occurrence of each character in A in pattern. Use this algorithm to improve the Boyer-Moore algorithm. Empirically compare the time and space complexity of Boyer-Moore with and without this improvement by using a large text and multiple patterns.
Establishing a tight upper bound on the number of comparisons is beyond the scope of these notes. A bound of 4n is fairly simple to prove, although 3n is a better approximation. When pattern is relatively long and the alphabet is large, Boyer-Moore is likely to be the most efficient pattern matcher. Empirically, in the average case, the number of compares is often sub-linear, that is, cn where c < 1.
To complete the class we'll define a constructor.
It has one String
argument, and this is set to the text
.
It will also initialize the tables (arrays) used to look up shifts.
<Auxiliary functions>+= (<-U) [<-D->] public Text(String t) { text = t; border = new int[text.length()+1]; strictBorder = new int[text.length()+1]; goodSuffix = new int[text.length()]; }
Another useful method returns the length of the text
string.
<Auxiliary functions>+= (<-U) [<-D->] public int length() { return text.length(); }
And another useful method returns the character at a position k
in the text
string.
<Auxiliary functions>+= (<-U) [<-D] public char charAt(int k) { return text.charAt(k); }
This document is written using Norman Ramsey's noweb tools for literate programming. The source for the document is a file named Text.nw, which can be translated into an HTML file for online distribution, a LaTeX file for printing, or a Java file.
You can ftp the Java file, the LaTeX file, or its PostScript translation by anonymous ftp from tsunami.cs.fit.edu. They are in the pub/algo directory.
Now we'll do one last, but important, thing. We'll write some test cases that help us believe that no defects occur in our code. The main routine will read two strings from the command line and then perform various tests to see that our algorithms work correctly (at least on the test cases). The first string is the text and the second is the pattern.
<Test the pattern matching algorithms>= (<-U)
public static void main(String[] args) {
  Text text = new Text(args[0]);
  Text pattern = new Text(args[1]);
  <Test the left-to-right scan brute-force patternMatcher()>
  <Test computeBorders()>
  <Test MorrisPratt()>
  <Test computeStrictBorders()>
  <Test KnuthMorrisPratt()>
  <Test the right-to-left scan brute-force patternMatcher2()>
  <Test computeGoodSuffix()>
  <Test BoyerMoore()>
}
The first test will be of the brute-force left-to-right scan pattern matcher.
<Test the left-to-right scan brute-force patternMatcher()
>= (<-U)
System.out.println(text.patternMatcher(pattern));
One thing to test is that the border[]
array is correctly computed.
<Test computeBorders()
>= (<-U)
pattern.computeBorders();
for (int j = 0; j <= pattern.length(); j++) {
System.out.println("border[" + j + "] = " + pattern.border[j]);
}
Now let's test that our implementation of the Morris-Pratt algorithm works correctly.
<Test MorrisPratt()
>= (<-U)
System.out.println(text.MorrisPratt(pattern));
We cannot test the KMP algorithm since we've left its completion as an exercise.
<Test computeStrictBorders()
>= (<-U)
// pattern.computeStrictBorders();
// for (int j = 0; j <= pattern.length(); j++) {
// System.out.println("strictBorder[" + j + "] = " + pattern.strictBorder[j]);
// }
<Test KnuthMorrisPratt()
>= (<-U)
// System.out.println(text.KMP(pattern));
<Test the right-to-left scan brute-force patternMatcher2()
>= (<-U)
System.out.println(text.patternMatcher2(pattern));
Before testing Boyer-Moore we see if goodSuffix[]
is calculated
properly.
<Test computeGoodSuffix()
>= (<-U)
pattern.computeGoodSuffix();
for (int j = 0; j < pattern.length(); j++) {
System.out.println("goodSuffix[" + j + "] = " + pattern.goodSuffix[j]);
}
And now our test of BoyerMoore()
.
<Test BoyerMoore()
>= (<-U)
System.out.println(text.BoyerMoore(pattern));
These notes suffer from empty figures. Devise good diagrams for the concept each figure is to illustrate and provide .gif and .ps files that can be incorporated into the notes.
[1] R. S. Boyer and J. S. Moore, A fast string searching algorithm, Communications of the ACM, 20 (1977), pp. 762--772.
[2] D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing, 6 (1977), pp. 323--350.