- Approximate string matching
In
computing , approximate string matching is the technique of finding approximate matches to apattern in a string.The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. The usual primitive operations are:Fact|date=February 2007
* "insertion" (e.g., changing "cot" to "coat"),
* "deletion" (e.g. changing "coat" to "cot"), and
* "substitution" (e.g. changing "coat" to "cost").Some approximate matchers also treat "transposition", in which the positions of two letters in the string are swapped, to be a primitive operation.Fact|date=February 2007 Changing "cost" to "cots" is an example of a transposition.
Different approximate matchers impose different constraints. Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern.Fact|date=February 2007 For example, if the pattern is "coil", "foil" differs by one substitution, "coils" by one insertion, "oil" by one deletion, and "foal" by two substitutions. If all operations count as a single unit of cost and the limit is set to one, "foil", "coils", and "oil" will count as matches while "foal" will not.
Other matchers specify the number of operations of each type separately,Fact|date=February 2007 while still others set a total cost but allow different weights to be assigned to different operations.Fact|date=February 2007 Some matchers allow separate assignments of limits and weights to individual groups in the pattern.
Most approximate matchers used for text processing are
regular expression matchers.Fact|date=February 2007 The distance between a candidate and the pattern is therefore computed as the minimum distance between the candidate and a fixed string matching the regular expression. Thus, if the pattern is "co.l", using thePOSIX notation in which a dot matches any single character, both "coal" and "coil" are exact matches, while "soil" differs by one substitution.The most common application of approximate matchers until recently has been
spell checking .Fact|date=February 2007 With the availability of large amounts of DNA data, matching ofnucleotide sequences has become an important application.Fact|date=February 2007 Approximate matching is also used to identify pieces of music from small snatches and inspam filtering .Fact|date=February 2007References
* "Pattern Matching Algorithms", Alberto Apostolico & Zvi Galil, Oxford University Press, UK, 1997.
See also
*
Fuzzy string searching
*Levenshtein distance
*Needleman-Wunsch algorithm
*Soundex
*Agrep
*Zsh
Wikimedia Foundation. 2010.