Aho-Corasick algorithm

Aho-Corasick algorithm

The Aho-Corasick algorithm is a string searching algorithm created by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).

Informally, the algorithm constructs a trie with suffix tree-like set of links from each node representing a string (e.g. abc) to the node corresponding to the longest proper suffix (e.g. bc if it exists, else c if that exists, else the root). It also contains links from each node to the longest suffix node that corresponds to a dictionary entry; thus all of the matches may be enumerated by following the resulting linked list. It then uses the trie at runtime, moving along the input and keeping the longest match, using the suffix links to make sure that computation is linear. For every node that is in the dictionary and every link along the dictionary suffix linked list, an output is generated.

When the pattern dictionary is known in advance (e.g. a computer virus database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.

The Aho-Corasick algorithm formed the basis of the original Unix command fgrep.

The following is the Aho-Corasick data structure constructedfrom the specified dictionary, with each row in the table representing a node in the trie, with the column path indicating the (unique) sequence of characters from the root to the node.

At each step, the current node is extended by finding its daughter,and if that doesn't exist, finding its suffix's daughter, and ifthat doesn't work, finding its suffix's suffix's daughter, finallyending in the root node if nothing's seen before.

Execution on input string abccab yields the following steps:

In general, more than one dictionary suffix link may need tobe followed, as more than one dictionary entry may end at agiven character in the input.

ources

*cite journal
first = Alfred V.
last = Aho
authorlink = Alfred Aho
coauthors = Margaret J. Corasick
year = 1975
month = June
title = Efficient string matching: An aid to bibliographic search
journal = Communications of the ACM
volume = 18
issue = 6
pages = 333–340
doi = 10.1145/360825.360855
url =
(Access to the full text may be restricted.)

External links

* [http://www-sr.informatik.uni-tuebingen.de/~buehler/AC/AC.html Animation of the Aho/Corasick Pattern Matching Automaton]
* [http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf Set Matching and Aho-Corasick Algorithm] by Pekka Kilpeläinen
* [http://www.codeproject.com/cs/algorithms/ahocorasick.asp Aho-Corasick string matching in C#] by Tomáš Petříček ( [http://www.eeeksoft.net/articles/ahocorasick.aspx mirror] )
* [http://www.nist.gov/dads/HTML/ahoCorasick.html Aho-Corasick entry] in NIST's [http://www.nist.gov/dads/ Dictionary of Algorithms and Data Structures]
* [http://www.dankirsh.com/2008/07/05/aho-corasick-algorithm/ PHP/Javascript Implementation of Aho/Corasick Algorithm]
* [http://search.cpan.org/search%3fmodule=Algorithm::AhoCorasick Perl Implementation of the Aho-Corasick Algorithm] by Vaclav Barta
* [http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ A Python implementation] licensed under GPLv2 or any later version


Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

  • Aho–Corasick string matching algorithm — The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary matching algorithm that locates elements of a finite set of strings (the dictionary ) within …   Wikipedia

  • Algorithme d'Aho-Corasick — L algorithme d Aho Corasick est un algorithme de recherche de chaîne de caractère (ou motif) dans un texte dû à Alfred Aho et Margaret Corasick et publié en 1975. L algorithme consiste à avancer dans une structure de données abstraite appelée… …   Wikipédia en Français

  • Margaret J. Corasick — discovered the Aho Corasick algorithm with Alfred V. Aho …   Wikipedia

  • Alfred Aho — Infobox Scientist name = Alfred Vaino Aho caption = birth date = birth place = death date = death place = residence = nationality = field = Computer Science work institution = Columbia University alma mater = doctoral advisor =Alfred Vaino Aho… …   Wikipedia

  • Commentz–Walter algorithm — The Commentz Walter algorithm is a string searching algorithm invented by Beate Commentz Walter. Like the Aho–Corasick string matching algorithm, it can search for multiple patterns at once. It combines ideas from Aho Corasick with the fast… …   Wikipedia

  • String searching algorithm — String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. Let Σ be… …   Wikipedia

  • Timeline of algorithms — The following timeline outlines the development of algorithms (mainly mathematical recipes ) since their inception.Before Modern Era* Before Writing about recipes (on cooking, rituals, agriculture and other themes) * c. 1600 BC Babylonians… …   Wikipedia

  • List of algorithms — The following is a list of the algorithms described in Wikipedia. See also the list of data structures, list of algorithm general topics and list of terms relating to algorithms and data structures.If you intend to describe a new algorithm,… …   Wikipedia

  • List of combinatorics topics — This is a list of combinatorics topics.A few decades ago it might have been said that combinatorics is little more than a way to classify poorly understood problems, and some standard remedies. Great progress has been made since 1960.This page is …   Wikipedia

  • Grep — is a command line text search utility originally written for Unix. The program s name derives from the Unix ed command, g/re/p which performs a similar operation.cite web|url=http://www.catb.org/ esr/jargon/html/G/grep.html |title=grep… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”