- Aho–Corasick string matching algorithm
-
The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously. The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary =
a
,aa
,aaa
,aaaa
and input string isaaaa
).Informally, the algorithm constructs a finite state machine that resembles a trie with additional links between the various internal nodes. These extra internal links allow fast transitions between failed pattern matches (e.g. a search for
cat
in a trie that does not containcat
, but containscart
, and thus would fail at the node prefixed byca
), to other branches of the trie that share a common prefix (e.g., in the previous case, a branch forattribute
might be the best lateral transition). This allows the automaton to transition between pattern matches without the need for backtracking.When the pattern dictionary is known in advance (e.g. a computer virus database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.
The Aho–Corasick string matching algorithm formed the basis of the original Unix command fgrep.
The following is the Aho–Corasick data structure constructed from the specified dictionary, with each row in the table representing a node in the trie, with the column path indicating the (unique) sequence of characters from the root to the node.
Dictionary {a, ab, bc, bca, c, caa} Path In Dictionary Suffix Link Dict Suffix Link () - (a) + () (ab) + (b) (b) - () (bc) + (c) (c) (bca) + (ca) (a) (c) + () (ca) - (a) (a) (caa) + (a) (a) At each step, the current node is extended by finding its child, and if that doesn't exist, finding its suffix's child, and if that doesn't work, finding its suffix's suffix's child, finally ending in the root node if nothing's seen before.
Execution on input string
abccab
yields the following steps:Analysis of input string abccab
Node Remaining String Output:End Position Transition Output () abccab start at root (a) bccab a:1 () to child (a) Current node (ab) ccab ab:2 (a) to child (ab) Current node (bc) cab bc:3, c:3 (ab) to suffix (b) to child (bc) Current Node, Dict suffix node (c) ab c:4 (bc) to suffix (c) to suffix () to child (c) Current node (ca) b a:5 (c) to child (ca) Dict suffix node (ab) ab:6 (ca) to suffix (a) to child (ab) Current node In general, more than one dictionary suffix link may need to be followed, as more than one dictionary entry may end at a given character in the input.
See also
- fgrep
References
- Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333–340. doi:10.1145/360825.360855. (Access to the full text may be restricted.)
External links
- Set Matching and Aho–Corasick Algorithm, lecture slides by Pekka Kilpeläinen
- Aho–Corasick entry in NIST's Dictionary of Algorithms and Data Structures
- Aho–Corasick string matching in C#, a CodeProject tutorial by Tomáš Petříček
- Aho-Corasick string matching in PHP, optimized for terms-filtering
- Aho-Corasick C implementation
- Aho-Corasick Python implementation
- Aho-Corasick Java implementation
Categories:- Algorithms on strings
- Search algorithms
Wikimedia Foundation. 2010.