- String metric
String metrics (also known as similarity metrics) are a class of text-based metrics that yield a similarity or dissimilarity (distance) score between two text strings, for use in approximate matching or comparison and in fuzzy string searching. For example, the strings "Sam" and "Samuel" are not the same, but they can be considered similar to a degree. A string metric provides a number giving an algorithm-specific indication of similarity.

The most widely known (although rudimentary) string metric is Levenshtein distance (also known as edit distance), which operates on two input strings and returns a score equal to the number of single-character insertions, deletions, and substitutions needed to transform one input string into the other. Beyond simple string metrics such as Levenshtein distance, the field has expanded to include phonetic, token, grammatical, and character-based methods of statistical comparison.

A widespread application of string metrics is DNA sequence analysis and RNA analysis, which use optimised string metrics to identify matching sequences.

String metrics are used heavily in information integration and are currently used in fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database deduplication, data mining, Web interfaces (e.g. Ajax-style suggestions as you type), data integration, and semantic knowledge integration.

List of string metrics
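Before the list itself, the Levenshtein distance described above can be illustrated with a minimal sketch. It assumes the common unit-cost formulation (each insertion, deletion, or substitution costs 1) and uses the standard row-by-row dynamic programming approach. The function names `levenshtein` and `similarity`, and the normalisation by the longer string's length, are illustrative assumptions rather than part of any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn a into b (unit cost per edit)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the row short

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to a 0.0-1.0 score (assumed normalisation:
    divide by the length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("Sam", "Samuel"))  # 3
print(similarity("Sam", "Samuel"))   # 0.5
```

Under these assumptions "Sam" and "Samuel" are three insertions apart, which the hypothetical `similarity` helper reports as 0.5. Other metrics would score the same pair differently, which is what "algorithm-specific" means in practice.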
*Hamming distance
*Levenshtein distance and Damerau–Levenshtein distance
*Needleman-Wunsch distance or Sellers algorithm
*Smith-Waterman distance
**FASTA
**BLAST
*Gotoh distance or Smith-Waterman-Gotoh distance
**Monge-Elkan distance
*Block distance or L1 distance or City block distance
*Jaro distance metric
*Jaro-Winkler
*Soundex distance metric
*Matching coefficient
*Dice's coefficient
*Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
*Overlap coefficient
*Euclidean distance or L2 distance
*Cosine similarity
*Variational distance
*Hellinger distance or Bhattacharyya distance
*Information radius (Jensen-Shannon divergence)
*Harmonic mean
*Skew divergence
*Confusion probability
*Tau metric, an approximation of the Kullback-Leibler divergence
*Fellegi and Sunter metric (SFS)
*TFIDF or TF/IDF
*Maximal matches
*Monge-Elkan

See also
*String matching
*SimMetrics - an implementation
*[http://www.speech.cs.cmu.edu/ Carnegie Mellon University open source library]

External links
*[http://www.dcs.shef.ac.uk/~sam/stringmetrics.html A fairly complete overview]
Wikimedia Foundation. 2010.