- Knuth–Morris–Pratt algorithm
-
The Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word"
W
within a main "text string"S
by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.The algorithm was conceived by Donald Knuth and Vaughan Pratt and independently by James H. Morris in 1977, but the three published it jointly.
NOTE: This article uses zero-based arrays to represent strings; thus the
'C'
inS={'A','B','C'}
is denotedS[2]
.Contents
KMP algorithm
Worked example of the search algorithm
To illustrate the algorithm's details, we work through a (relatively artificial) run of the algorithm. At any given time, the algorithm is in a state determined by two integers,
m
andi
, which denote respectively the position withinS
which is the beginning of a prospective match forW
, and the index inW
denoting the character currently under consideration. This is depicted, at the start of the run, like1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
We proceed by comparing successive characters of
W
to "parallel" characters ofS
, moving from one to the next if they match. However, in the fourth step, we getS[3]
is a space andW[3] = 'D'
, a mismatch. Rather than beginning to search again atS[1]
, we note that no'A'
occurs between positions 0 and 3 inS
except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, settingm = 4
andi = 0
. (m will first become 3 sincem + i - T[i] = 0 + 3 - 0 = 3
and then become 4 sinceT[0] = -1
)1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
We quickly obtain a nearly complete match
"ABCDAB"
when, atW[6]
(S[10]
), we again have a discrepancy. However, just prior to the end of the current partial match, we passed an"AB"
which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply resetm = 8
,i = 2
and continue matching the current character. Thus, not only do we omit previously matched characters ofS
, but also previously matched characters ofW
.1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of
W
and begin searching at the next character ofS
:m = 11
, reseti = 0
.1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
Once again we immediately hit upon a match
"ABCDAB"
but the next character,'C'
, does not match the final character'D'
of the wordW
. Reasoning as before, we setm = 15
, to start at the two-character string"AB"
leading up to the current position, seti = 2
, and continue matching from the current position.1 2 m: 01234567890123456789012 S: ABC ABCDAB ABCDABCDABDE W: ABCDABD i: 0123456
This time we are able to complete the match, whose first character is
S[15]
.Description of and pseudocode for the search algorithm
The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table
T
, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries ofT
are constructed so that if we have a match starting atS[m]
that fails when comparingS[m + i]
toW[i]
, then the next possible match will start at indexm + i - T[i]
inS
(that is,T[i]
is the amount of "backtracking" we need to do after a mismatch). This has two implications: first,T[0] = -1
, which indicates that ifW[0]
is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at indexm + i - T[i]
, as in the example above, we need not actually check any of theT[i]
characters after that, so that we continue searching fromW[T[i]]
. The following is a sample pseudocode implementation of the KMP search algorithm.algorithm kmp_search: input: an array of characters, S (the text to be searched) an array of characters, W (the word sought) output: an integer (the zero-based position in S at which W is found) define variables: an integer, m ← 0 (the beginning of the current match in S) an integer, i ← 0 (the position of the current character in W) an array of integers, T (the table, computed elsewhere) while m+i is less than the length of S, do: if W[i] = S[m + i], if i equals the (length of W)-1, return m let i ← i + 1 otherwise, let m ← m + i - T[i], if T[i] is greater than -1, let i ← T[i] else let i ← 0 (if we reach here, we have searched all of S unsuccessfully) return the length of S
Efficiency of the search algorithm
Assuming the prior existence of the table
T
, the search portion of the Knuth–Morris–Pratt algorithm has complexity O(k), wherek
is the length ofS
and theO
is big-O notation. As except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in thewhile
loop, we will calculate a bound on the number of iterations of this loop; in order to do this we first make a key observation about the nature ofT
. By definition it is constructed so that if a match which had begun atS[m]
fails while comparingS[m + i]
toW[i]
, then the next possible match must begin atS[m + (i - T[i])]
. In particular the next possible match must occur at a higher index thanm
, so thatT[i] < i
.Using this fact, we will show that the loop can execute at most
2k
times. For in each iteration, it executes one of the two branches in the loop. The first branch invariably increasesi
and does not changem
, so that the indexm + i
of the currently scrutinized character ofS
is increased. The second branch addsi - T[i]
tom
, and as we have seen, this is always a positive number. Thus the locationm
of the beginning of the current potential match is increased. Now, the loop ends ifm + i = k
; therefore each branch of the loop can be reached at mostk
times, since they respectively increase eitherm + i
orm
, andm ≤ m + i
: ifm = k
, then certainlym + i ≥ k
, so that since it increases by unit increments at most, we must have hadm + i = k
at some point in the past, and therefore either way we would be done.Thus the loop executes at most
2k
times, showing that the time complexity of the search algorithm isO(k)
.Here is another way to think about the runtime: Let us say we begin to match W and S at position i and p, if W exists as a substring of S at p, then W[0 through m] == S[p through p+m]. Upon success, that is, the word and the text matched at the positions(W[i] == S[p+i]), we increase i by 1 (i++). Upon failure, that is, the word and the text does not match at the positions(W[i] != S[p+i]), the text pointer is kept still, while the word pointer roll-back a certain amount(i = T[i], where T is the jump table) And we attempt to match W[T[i]] with S[p+i]. The maximum number of roll-back of i is bounded by i, that is to say, for any failure, we can only roll-back as much as we have progressed up to the failure. Then it is clear the runtime is 2k.
"Partial match" table (also known as "failure function")
The goal of the table is to allow the algorithm not to match any character of
S
more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.We want to be able to look up, for each position in
W
, the length of the longest possible initial segment ofW
leading up to (but not including) that position, other than the full segment starting atW[0]
that just failed to match; this is how far we have to backtrack in finding the next match. HenceT[i]
is exactly the length of the longest possible proper initial segment ofW
which is also a segment of the substring ending atW[i - 1]
. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we setT[0] = -1
, as discussed above.Worked example of the table-building algorithm
We consider the example of
W = "ABCDABD"
first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We setT[0] = -1
. To findT[1]
, we must discover a proper suffix of"A"
which is also a prefix ofW
. But there are no proper suffixes of"A"
, so we setT[1] = 0
. Likewise,T[2] = 0
.Continuing to
T[3]
, we note that there is a shortcut to checking all suffixes: let us say that we discovered a proper suffix which is a proper prefix and ending atW[2]
with length 2 (the maximum possible); then its first character is a proper prefix of a proper prefix ofW
, hence a proper prefix itself, and it ends atW[1]
, which we already determined cannot occur in case T[2]. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (e.g. T[x]=m).Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so
T[3] = 0
.We pass to the subsequent
W[4]
,'A'
. The same logic shows that the longest substring we need consider has length 1, and although in this case'A'
does work, recall that we are looking for segments ending before the current character; henceT[4] = 0
as well.Considering now the next character,
W[5]
, which is'B'
, we exercise the following logic: if we were to find a subpattern beginning before the previous characterW[4]
, yet continuing to the current oneW[5]
, then in particular it would itself have a proper initial segment ending atW[4]
yet beginning before it, which contradicts the fact that we already found that'A'
itself is the earliest occurrence of a proper segment ending atW[4]
. Therefore we need not look beforeW[4]
to find a terminal string forW[5]
. ThereforeT[5] = 1
.Finally, we see that the next character in the ongoing segment starting at
W[4] = 'A'
would be'B'
, and indeed this is alsoW[5]
. Furthermore, the same argument as above shows that we need not look beforeW[4]
to find a segment forW[6]
, so that this is it, and we takeT[6] = 2
.Therefore we compile the following table:
i
0 1 2 3 4 5 6 W[i]
A B C D A B D T[i]
-1 0 0 0 0 1 2 Other example more interesting and complex:
i
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 W[i]
P A R T I C I P A T E I N P A R A C H U T E T[i]
-1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 1 2 3 0 0 0 0 0 Description of and pseudocode for the table-building algorithm
The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.
algorithm kmp_table: input: an array of characters, W (the word to be analyzed) an array of integers, T (the table to be filled) output: nothing (but during operation, it populates the table) define variables: an integer, pos ← 2 (the current position we are computing in T) an integer, cnd ← 0 (the zero-based index in W of the next
character of the current candidate substring) (the first few values are fixed but different from what the algorithm
might suggest) let T[0] ← -1, T[1] ← 0 while pos is less than the length of W, do: (first case: the substring continues) if W[pos - 1] = W[cnd],
let cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1 (second case: it doesn't, but we can fall back) otherwise, if cnd > 0, let cnd ← T[cnd] (third case: we have run out of candidates. Note cnd = 0) otherwise, let T[pos] ← 0, pos ← pos + 1Efficiency of the table-building algorithm
The complexity of the table algorithm is
O(n)
, wheren
is the length ofW
. As except for some initialization all the work is done in thewhile
loop, it is sufficient to show that this loop executes inO(n)
time, which will be done by simultaneously examining the quantitiespos
andpos - cnd
. In the first branch,pos - cnd
is preserved, as bothpos
andcnd
are incremented simultaneously, but naturally,pos
is increased. In the second branch,cnd
is replaced byT[cnd]
, which we saw above is always strictly less thancnd
, thus increasingpos - cnd
. In the third branch,pos
is incremented andcnd
is not, so bothpos
andpos - cnd
increase. Sincepos ≥ pos - cnd
, this means that at each stage eitherpos
or a lower bound forpos
increases; therefore since the algorithm terminates oncepos = n
, it must terminate after at most2n
iterations of the loop, sincepos - cnd
begins at1
. Therefore the complexity of the table algorithm isO(n)
.Efficiency of the KMP algorithm
Since the two portions of the algorithm have, respectively, complexities of
O(k)
andO(n)
, the complexity of the overall algorithm isO(n + k)
.These complexities are the same, no matter how many repetitive patterns are in
W
orS
.See also
- Boyer–Moore string search algorithm
- Rabin–Karp string search algorithm
References
- Donald Knuth; James H. Morris, Jr, Vaughan Pratt (1977). "Fast pattern matching in strings". SIAM Journal on Computing 6 (2): 323–350. doi:10.1137/0206024. http://citeseer.ist.psu.edu/context/23820/0.
- Thomas H. Cormen; Charles E. Leiserson, Ronald L. Rivest, Clifford Stein (2001). "Section 32.4: The Knuth-Morris-Pratt algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 978-0-262-03293-3.
External links
- String Searching Applet animation
- An explanation of the algorithm and sample C++ code by David Eppstein
- Knuth-Morris-Pratt algorithm description and C code by Christian Charras and Thierry Lecroq
- Explanation of the algorithm from scratch by FH Flensburg.
- Breaking down steps of running KMP by Chu-Cheng Hsieh.
- [1] NPTELHRD YouTube lecture video
Donald Knuth Publications The Art of Computer Programming • "The Complexity of Songs" • Computers and Typesetting • Concrete Mathematics • Surreal Numbers • Things a Computer Scientist Rarely Talks About • Selected papers seriesSoftware Fonts Literate programming Algorithms Knuth's Algorithm X • Knuth–Bendix completion algorithm • Knuth–Morris–Pratt algorithm • Knuth shuffle • Robinson–Schensted–Knuth correspondence • Trabb Pardo–Knuth algorithmOther Categories:- Algorithms on strings
- Search algorithms
- Donald Knuth
Wikimedia Foundation. 2010.