Fuzzy string searching

Fuzzy string searching

Approximate string search is the name that is used for a category of techniques for
finding strings that approximately match some given pattern string. It may also be known as approximate or inexact matching.

Approximate string searching has two different flavors::finding one or more matching substrings of a text, and :finding similar strings in a string set, often referred to as a dictionary. Approximate string searching has many application areas including
information retrieval, spellchecking and computational biology ref|Gus97.

imilarity functions

The cornerstone of any approximate searching method is a similarity function or string metric. Among the most commonly used similarity functions are Levenshtein distance (a type of edit distance) and n-gram distance. The latter is based on counting the number of common n-grams, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein distance that counts transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen ref|Ukk85 described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.

On-line vs. off-line

Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot.In other words, on-line techniques do searching without indexation. Earlyalgorithms for on-line approximate matching were suggested by Wagner and Fisherref|WF74 and by Sellers. ref|Sel80 Both algorithms are based on
dynamic programming but solve different problems. Sellers' algorithmsearches approximately for a substring in a text while the algorithm of Wagnerand Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.

On-line searching techniques were repeatedly improved. Perhaps the most famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. The Bitap algorithm is the heart of the Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro.ref|Nav01

Although very fast on-line techniques exist, theirperformance on large data is unacceptable. Text preprocessing or indexing makes searching dramatically faster.Today, a variety of indexing algorithms are presented. Among them are suffix treesref|Gus97, metric treesref|NB98 and n-gram methods.ref|NBST01ref|Zob95 For a detailed list of indexing techniques see the paper of Navarro "et al."ref|NBST01

ee also

*Soundex
*Agrep
*Spellchecker
*String searching algorithm
*Wildcard character
*Levenshtein distance
*Computer-assisted translation

References

* R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1--23, Irvine, CA, Jun 1996.
* D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
* [http://www.dcc.uchile.cl/~gnavarro/ps/spire98.2.ps.gz R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary.Proc. SPIRE'98. IEEE CS Press, pages 14-22.]
* G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming, Journal of the ACM (JACM) 46 (3), May 1999, 395 - 415.
* [http://www.egeen.ee/u/vilo/edu/2002-03/Tekstialgoritmid_I/Articles/Approximate/Navarro_Review_on_Approximate_Matching_p31-navarro.pdf G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.]
* [http://www.dcc.uchile.cl/~gnavarro/ps/deb01.ps.gz G. Navarro, Ricardo Baeza-Yates, E. Sutinen and J. Tarhio. Indexing Methods for Approximate String Matching.IEEE Data Engineering Bulletin 24(4):19-27, 2001.]
* P.H. Sellers. The Theory and Computation of Evolutionary Distances: Pattern Recognition. Journal of Algorithms, 1:359-373, 1980.
* E. Ukkonen, Algorithms for approximate string matching. Information and Control 64, 100-118. 1985.
* [http://portal.acm.org/citation.cfm?id=321811 R. Wagner and M. Fischer, The string-to-string correction problem, Journal of the association for computing machinery, vol. 21, pp. 168 173, 1974.]
* [http://www.cs.ubc.ca/rr/proceedings/spe91-95/spe/vol25/issue3/spe948jz.pdf J. Zobel, P. Dart. Finding approximate matches in large lexicons. Software-Practice & Experience 25(3), pp 331-345, 1995.]

External links

* [http://laurikari.net/tre/ Efficient POSIX compliant regexp matching library with support for approximate matching]
* [http://itman.narod.ru/english/ir/index.html Site devoted to fuzzy searching and information retrieval]
* [http://www.levenshtein.net/ The description of Levenshtein algorithm]
* [http://www.logic.at/people/bruno/Programs/BWPGazetteer/ BWPGazetteer] : [http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf an implementation of an approximate gazetteer based on Levenshtein Distance] in Java within the GATE (General Architecture for Text Engineering) framework for Natural Language Processing and Information Extraction.
* [http://tomcat-dmaweb1.jrc.it/fuzzyg/query/ The Fuzzy Gazetteer: A fuzzy string search engine for place names worldwide]
* [http://www.heise.de/ct/english/97/04/386/ Source code for n-gram based matching]
* [http://siderite.blogspot.com/2007/01/super-fast-string-distance-algorithm.html Siderite's Sift2: An empiric, but fairly accurate and very fast edit-distance algorithm]
* [http://pgrdoc.bioversity.cgiar.org/taxcheck/grin/index.html Taxonomic nomenclature checker]


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • String searching algorithm — String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. Let Σ be… …   Wikipedia

  • String metric — String metrics (also known as similarity metrics) are a class of textual based metrics resulting in a similarity or dissimilarity (distance) score between two pairs of text strings for approximate matching or comparison and in fuzzy string… …   Wikipedia

  • Approximate string matching — In computing, approximate string matching is the technique of finding approximate matches to a pattern in a string. The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact… …   Wikipedia

  • Fuzz — may refer to: *Vellus, a type of short, fine body hair on an animal *Tomentum, a filamentous hairlike growth on a plant *Focus (optics), a blur effect *Fuzzbox, an electric guitar distortion effect *A derogatory slang term for the police… …   Wikipedia

  • List of combinatorics topics — This is a list of combinatorics topics.A few decades ago it might have been said that combinatorics is little more than a way to classify poorly understood problems, and some standard remedies. Great progress has been made since 1960.This page is …   Wikipedia

  • Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… …   Wikipedia

  • Bitap algorithm — The bitap algorithm (also known as the shift or, shift and or Baeza Yates Gonnet algorithm) is a fuzzy string searching algorithm developed by Udi Manber and Sun Wu in 1991ref|Manber91ref|Manber92 based on work done by Ricardo Baeza Yates and… …   Wikipedia

  • Agrep — (approximate grep) is a fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.It selects the best suited algorithm for the… …   Wikipedia

  • Ricardo Baeza-Yates — (born March 21, 1961) is a Chilean computer scientist and director of the [http://research.yahoo.com/ Yahoo! Research] labs at Barcelona, Spain and Santiago, Chile. His Ph.D. from the University of Waterloo was entitled Efficient Text Searching …   Wikipedia

  • Pattern search — may refer to: * Pattern recognition * Pattern mining * String searching algorithm * Fuzzy string searching * Bitap algorithm * K optimal pattern discovery * Nearest neighbor search (Nearest neighbor) * Eyeball search …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”