Damerau–Levenshtein distance

Damerau–Levenshtein distance: In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters. In his seminal paper^[1], Damerau not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. The corresponding edit distance, i.e., dealing with multiple edit operations, known as the Levenshtein distance, was introduced by Levenshtein,^[2] but it did not include transpositions in the set of basic operations. The name Damerau–Levenshtein distance is used to refer to the edit distance that allows multiple edit operations including transpositions, although it is not clear whether the term Damerau–Levenshtein distance is sometimes used in some sources as to take into account non-adjacent transpositions or not.

While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau–Levenshtein distance has also seen uses in biology to measure the variation between DNA.

Contents

1 Algorithm

2 Applications

2.1 DNA

2.2 Fraud detection

3 See also

4 References

5 External links

Algorithm

Adding transpositions sounds simple, but in reality there is a serious complication. Presented here are two algorithms: the first^[3], simpler one, computes what is known as the optimal string alignment (sometimes called the restricted edit distance), while the second one^[4] computes the Damerau–Levenshtein distance with adjacent transpositions. The difference between the two algorithms consists in that the optimal string alignment algorithm computes the number of edit operations needed to make the strings equal under the condition that no substring is edited more than once, whereas the second one presents no such restriction.

Take for example the edit distance between CA and ABC. The Damerau-Levenshtein distance LD(CA,ABC) = 2 because CA -> AC -> ABC, but the optimal string alignment distance OSA(CA,ABC) = 3 because if the operation CA -> AC is used, it is not possible to use AC -> ABC because that would require the substring to be edited more than once, which is not allowed in OSA, and therefore the shortest sequence of operations is CA -> A -> AB -> ABC. Note that for the optimal string alignment distance, the triangle inequality does not hold: OSA(CA,AC) + OSA(AC,ABC) < OSA(CA,ABC), and so it is not a true metric.

Firstly, let us consider a direct extension of the formula used to calculate Levenshtein distance. Below is pseudocode for a function OptimalStringAlignmentDistance that takes two strings, str1 of length lenStr1, and str2 of length lenStr2, and computes the optimal string alignment distance between them:

int OptimalStringAlignmentDistance(char str1[1..lenStr1], char str2[1..lenStr2]) // d is a table with lenStr1+1 rows and lenStr2+1 columns declare int d[0..lenStr1, 0..lenStr2] // i and j are used to iterate over str1 and str2 declare int i, j, cost //for loop is inclusive, need table 1 row/column larger than string length. for i from 0 to lenStr1 d[i, 0] := i for j from 1 to lenStr2 d[0, j] := j //Pseudo-code assumes string indices start at 1, not 0. //If implemented, make sure to start comparing at 1st letter of strings. for i from 1 to lenStr1 for j from 1 to lenStr2 if str1[i] = str2[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j ] + 1, // deletion d[i , j-1] + 1, // insertion d[i-1, j-1] + cost // substitution ) if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then d[i, j] := minimum( d[i, j], d[i-2, j-2] + cost // transposition ) return d[lenStr1, lenStr2]

Basically this is the algorithm to compute Levenshtein distance with one additional recurrence:

if(i > 1 and j > 1 and str1[i] = str2[j-1] and str1[i-1] = str2[j]) then d[i, j] := minimum( d[i, j], d[i-2, j-2] + cost // transposition )

Here is the second algorithm that computes the true Damerau-Levenshtein distance with adjacent transpositions (ActionScript 3.0); this function requires as an additional parameter the size of the alphabet (C), so that all entries of the arrays are in 0..(C−1):

static public function damerauLevenshteinDistance(a:Array, b:Array, C:uint):uint { var INF:uint = a.length + b.length; var H:matrix = new matrix(a.length+2,b.length+2); H[0][0] = INF; for(var i:uint = 0; i<=a.length; ++i) {H[i+1][1] = i; H[i+1][0] = INF;} for(var j:uint = 0; j<=b.length; ++j) {H[1][j+1] = j; H[0][j+1] = INF;} var DA:Array = new Array(C); for(var d:uint = 0; d<C; ++d) DA[d]=0; for(var i:uint = 1; i<=a.length; ++i) { var DB:uint = 0; for(var j:uint = 1; j<=b.length; ++j) { var i1:uint = DA[b[j-1]]; var j1:uint = DB; var d:uint = ((a[i-1]==b[j-1])?0:1); if(d==0) DB = j; H[i+1][j+1] = Math.min(H[i][j]+d, H[i+1][j] + 1, H[i][j+1]+1, H[i1][j1] + (i-i1-1) + 1 + (j-j1-1)); } DA[a[i-1]] = i; } return H[a.length+1][b.length+1]; } public class matrix extends Array { public var rows:uint, cols:uint; public function matrix(nrows:int=0, ncols:int=-1) { super(nrows); if(ncols == -1) ncols = nrows; for(var i:uint = 0; i<nrows; ++i) { this[i] = new Array(ncols); } rows = nrows; cols = ncols; } }

using the C# language computes the true Damerau-Levenshtein distance with adjacent transpositions.

public static Int32 DamerauLevenshteinDistance(String source, String target) { if (String.IsNullOrEmpty(source)) { if (String.IsNullOrEmpty(target)) { return 0; } else { return target.Length; } } else if (String.IsNullOrEmpty(target)) { return source.Length; } Int32 m = source.Length; Int32 n = target.Length; Int32[,] H = new Int32[m + 2, n + 2]; Int32 INF = m + n; H[0, 0] = INF; for (Int32 i = 0; i <= m; i++) { H[i + 1, 1] = i; H[i + 1, 0] = INF; } for (Int32 j = 0; j <= n; j++) { H[1, j + 1] = j; H[0, j + 1] = INF; } SortedDictionary<Char, Int32> sd = new SortedDictionary<Char, Int32>(); foreach (Char Letter in (source + target)) { if (!sd.ContainsKey(Letter)) sd.Add(Letter, 0); } for (Int32 i = 1; i <= m; i++) { Int32 DB = 0; for (Int32 j = 1; j <= n; j++) { Int32 i1 = sd[target[j - 1]]; Int32 j1 = DB; if (source[i - 1] == target[j - 1]) { H[i + 1, j + 1] = H[i, j]; DB = j; } else { H[i + 1, j + 1] = Math.Min(H[i, j], Math.Min(H[i + 1, j], H[i, j + 1])) + 1; } H[i + 1, j + 1] = Math.Min(H[i + 1, j + 1], H[i1, j1] + (i - i1 - 1) + 1 + (j - j1 - 1)); } sd[source[i - 1]] = i; } return H[m + 1, n + 1]; }

(Note: the algorithm given in the paper uses alphabet 1..C rather than 0..C−1, and indexes arrays differently: H[−1..|A|,−1..|B|] rather than H[0..|A|+1,0..|B|+1], DA[1..C] rather than DA[0..C−1]; the paper seems to be missing the necessary line H[−1,−1] = INF)

To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: $O\left (M \cdot N \cdot \max(M, N) \right )$ , where M and N are string lengths. Using the ideas of Lowrance and Wagner,^[5] this naive algorithm can be improved to be $O\left (M \cdot N \right)$ in the worst case.

It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of^[6] for an example of such an adaptation.

Applications

The first algorithm computes only the restricted edit distance. Damerau–Levenshtein distance plays an important role in natural language processing. In natural languages, strings are short and the number of errors (misspellings) rarely exceeds 2. In such circumstances, restricted and real edit distance differ very rarely. That is why this limitation is not very important. However, one must remember that restricted edit distance does not always satisfy the triangle inequality and, thus, cannot be used with metric trees.

DNA

Since DNA frequently undergoes insertions, deletions, substitutions, and transpositions, and each of these operations occurs on approximately the same timescale, the Damerau–Levenshtein distance is an appropriate metric of the variation between two strands of DNA. More common in DNA, protein, and other bioinformatics related alignment tasks is the use of closely related algorithms such as Needleman-Wunsch or Smith-Waterman.

Fraud detection

The algorithm can be used with any set of words, like vendor names. Since entry is manual by nature there is a risk of entering false vendor. A fraudster employee may enter one real vendor such as "Rich Heir Estate Services" versus a false vendor "Rich Hier State Services". The fraudster would then create a false bank account and have the company route checks to the real vendor and false vendor. The Damerau–Levenshtein algorithm will detect the transposed and dropped letter and bring attention of the items to a fraud examiner.

See also

Approximate string matching

Levenshtein automata

Typosquatting

References

^ A technique for computer detection and correction of spelling errors, Communications of the ACM, 1964.

^ Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.

^ B. J. Oommen and R. K. S. Loke, Pattern recognition of strings with substitutions, insertions, deletions and generalized transpositions

^ Roy Lowrance, Robert A. Wagner. An Extension of the String-to-String Correction Problem, JACM, 1975.

^ Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive, 33(1), pp. 31-88, 2001.

External links

Python implementation of optimal string alignment distance

FREJ - an open source java library which implements approximate substring search using Damerau-Levenshtein distance.

Ruby translation of above Python implementation.

MySQL stored function implementation of the Levenshtein distance, extended to optimal string alignment distance (source code) <-- This link seems broken, but here's a backup.

MySQL implementation as a compiled UDF (User Defined Function)

MySQL implementation, compiled as a Windows DLL

C# implementation

^ Site devoted to fuzzy searching and information retrieval

Categories:
Algorithms on strings
String similarity measures

Игры ⚽ Поможем написать курсовую

Look at other dictionaries:

Levenshtein distance — In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance. The… … Wikipedia
Distance de Damerau-Levenshtein — En informatique théorique et sciences informatiques, la distance de Damerau–Levenshtein est une distance entre deux chaînes de caractères. On calcule le nombre minimum d opérations nécessaires pour transformer une chaîne de caractères en une… … Wikipédia en Français
Distance De Levenshtein — La distance de Levenshtein mesure la similarité entre deux chaînes de caractères. Elle est égale au nombre minimal de caractères qu il faut supprimer, insérer ou remplacer pour passer d’une chaîne à l’autre. Son nom provient de Vladimir… … Wikipédia en Français
Distance de levenshtein — La distance de Levenshtein mesure la similarité entre deux chaînes de caractères. Elle est égale au nombre minimal de caractères qu il faut supprimer, insérer ou remplacer pour passer d’une chaîne à l’autre. Son nom provient de Vladimir… … Wikipédia en Français
Distance de Levenshtein — La distance de Levenshtein mesure la similarité entre deux chaînes de caractères. Elle est égale au nombre minimal de caractères qu il faut supprimer, insérer ou remplacer pour passer d’une chaîne à l’autre. Son nom provient de Vladimir… … Wikipédia en Français
Levenshtein-Distanz — Die Levenshtein Distanz zwischen zwei Zeichenketten ist die minimale Anzahl von Einfüge , Lösch und Ersetz Operationen, um die erste Zeichenkette in die zweite umzuwandeln. Benannt ist die Distanz nach dem russischen Wissenschaftler Wladimir… … Deutsch Wikipedia
Algorithme de Levenshtein — Distance de Levenshtein La distance de Levenshtein mesure la similarité entre deux chaînes de caractères. Elle est égale au nombre minimal de caractères qu il faut supprimer, insérer ou remplacer pour passer d’une chaîne à l’autre. Son nom… … Wikipédia en Français
Hamming distance — 3 bit binary cube for finding Hamming distance … Wikipedia
Edit distance — In information theory and computer science, the edit distance between two strings of characters is the number of operations required to transform one of them into the other. There are several different algorithms to define or calculate this… … Wikipedia
Distancia de Levenshtein — Saltar a navegación, búsqueda En Teoría de la información y Ciencias de la Computación se llama Distancia de Levenshtein, distancia de edición, o distancia entre palabras, al número mínimo de operaciones requeridas para transformar una cadena de… … Wikipedia Español

Academic Dictionaries and Encyclopedias

Damerau–Levenshtein distance

Contents

Algorithm

Applications

DNA

Fraud detection

See also

References

External links

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Damerau–Levenshtein distance

Contents

Algorithm

Applications

DNA

Fraud detection

See also

References

External links

Look at other dictionaries:

Share the article and excerpts

Direct link