Aho–Corasick string matching algorithm

Aho–Corasick string matching algorithm: The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously. The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).

Informally, the algorithm constructs a finite state machine that resembles a trie with additional links between the various internal nodes. These extra internal links allow fast transitions between failed pattern matches (e.g. a search for cat in a trie that does not contain cat, but contains cart, and thus would fail at the node prefixed by ca), to other branches of the trie that share a common prefix (e.g., in the previous case, a branch for attribute might be the best lateral transition). This allows the automaton to transition between pattern matches without the need for backtracking.

When the pattern dictionary is known in advance (e.g. a computer virus database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.

The Aho–Corasick string matching algorithm formed the basis of the original Unix command fgrep.

The following is the Aho–Corasick data structure constructed from the specified dictionary, with each row in the table representing a node in the trie, with the column path indicating the (unique) sequence of characters from the root to the node.

Dictionary {a, ab, bc, bca, c, caa}
Path In Dictionary Suffix Link Dict Suffix Link

() -

(a) + ()

(ab) + (b)

(b) - ()

(bc) + (c) (c)

(bca) + (ca) (a)

(c) + ()

(ca) - (a) (a)

(caa) + (a) (a)

At each step, the current node is extended by finding its child, and if that doesn't exist, finding its suffix's child, and if that doesn't work, finding its suffix's suffix's child, finally ending in the root node if nothing's seen before.

Execution on input string abccab yields the following steps:

Analysis of input string abccab
Node Remaining String Output:End Position Transition Output

() abccab start at root

(a) bccab a:1 () to child (a) Current node

(ab) ccab ab:2 (a) to child (ab) Current node

(bc) cab bc:3, c:3 (ab) to suffix (b) to child (bc) Current Node, Dict suffix node

(c) ab c:4 (bc) to suffix (c) to suffix () to child (c) Current node

(ca) b a:5 (c) to child (ca) Dict suffix node

(ab) ab:6 (ca) to suffix (a) to child (ab) Current node

In general, more than one dictionary suffix link may need to be followed, as more than one dictionary entry may end at a given character in the input.

See also

fgrep

References

Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333–340. doi:10.1145/360825.360855. (Access to the full text may be restricted.)

External links

Set Matching and Aho–Corasick Algorithm, lecture slides by Pekka Kilpeläinen

Aho–Corasick entry in NIST's Dictionary of Algorithms and Data Structures

Aho–Corasick string matching in C#, a CodeProject tutorial by Tomáš Petříček

Aho-Corasick string matching in PHP, optimized for terms-filtering

Aho-Corasick C implementation

Aho-Corasick Python implementation

Aho-Corasick Java implementation

Categories:
Algorithms on strings
Search algorithms

Dictionary {a, ab, bc, bca, c, caa}
Path	In Dictionary	Suffix Link	Dict Suffix Link
()	-
(a)	+	()
(ab)	+	(b)
(b)	-	()
(bc)	+	(c)	(c)
(bca)	+	(ca)	(a)
(c)	+	()
(ca)	-	(a)	(a)
(caa)	+	(a)	(a)

Analysis of input string `abccab`
Node	Remaining String	Output:End Position	Transition	Output
()	abccab		start at root
(a)	bccab	a:1	() to child (a)	Current node
(ab)	ccab	ab:2	(a) to child (ab)	Current node
(bc)	cab	bc:3, c:3	(ab) to suffix (b) to child (bc)	Current Node, Dict suffix node
(c)	ab	c:4	(bc) to suffix (c) to suffix () to child (c)	Current node
(ca)	b	a:5	(c) to child (ca)	Dict suffix node
(ab)		ab:6	(ca) to suffix (a) to child (ab)	Current node

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

String searching algorithm — String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. Let Σ be… … Wikipedia
Aho-Corasick algorithm — The Aho Corasick algorithm is a string searching algorithm created by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary matching algorithm that locates elements of a finite set of strings (the dictionary ) within an input text.… … Wikipedia
Algorithme d'Aho-Corasick — L algorithme d Aho Corasick est un algorithme de recherche de chaîne de caractère (ou motif) dans un texte dû à Alfred Aho et Margaret Corasick et publié en 1975. L algorithme consiste à avancer dans une structure de données abstraite appelée… … Wikipédia en Français
Commentz–Walter algorithm — The Commentz Walter algorithm is a string searching algorithm invented by Beate Commentz Walter. Like the Aho–Corasick string matching algorithm, it can search for multiple patterns at once. It combines ideas from Aho Corasick with the fast… … Wikipedia
Compressed pattern matching — In computer science Compressed Pattern Matching or CPM is the process of searching for pattern in compressed data with little or no decompression. Searching in a compressed string is faster than searching an uncompressed string and requires less… … Wikipedia
List of algorithms — The following is a list of the algorithms described in Wikipedia. See also the list of data structures, list of algorithm general topics and list of terms relating to algorithms and data structures.If you intend to describe a new algorithm,… … Wikipedia
Algorithme de recherche de sous-chaîne — Un algorithme de recherche de sous chaine est un type d algorithme de recherche qui a pour objectif de trouver une chaîne de caractères à l intérieur d une autre. Un tel algorithme fournit la position du premier caractère de la sous chaîne… … Wikipédia en Français
List of combinatorics topics — This is a list of combinatorics topics.A few decades ago it might have been said that combinatorics is little more than a way to classify poorly understood problems, and some standard remedies. Great progress has been made since 1960.This page is … Wikipedia
Grep — is a command line text search utility originally written for Unix. The program s name derives from the Unix ed command, g/re/p which performs a similar operation.cite web|url=http://www.catb.org/ esr/jargon/html/G/grep.html |title=grep… … Wikipedia
Algorithme de Knuth-Morris-Pratt — L algorithme de Knuth Morris Pratt (souvent abrégé par algorithme KMP) est un algorithme de recherche de sous chaîne, permettant de trouver les occurrences d une chaîne P dans un texte S. Sa particularité réside en un pré traitement de la chaîne … Wikipédia en Français

Academic Dictionaries and Encyclopedias

Aho–Corasick string matching algorithm

See also

References

External links

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Aho–Corasick string matching algorithm

See also

References

External links

Look at other dictionaries:

Share the article and excerpts

Direct link