- Jaro-Winkler distance
The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and is mainly used in record linkage (duplicate detection). The higher the Jaro-Winkler distance for two strings, the more similar the strings are. The metric is designed for, and best suited to, short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
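The Jaro distance metric states that given two strings s_1 and s_2, their distance d_j is:
:d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right)
where m is the number of matching characters and t is the number of transpositions. Two characters from s_1 and s_2 are considered matching only if they are equal and no more than \left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor - 1 positions apart (the match window). The number of transpositions t is half the number of matching characters that appear in a different order in the two strings.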
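The formula can be evaluated directly once the counts are known. The following is a minimal sketch; the function name `jaro_from_counts` and the convention of returning 0 when there are no matching characters are assumptions made for illustration, not part of the original text.

```python
def jaro_from_counts(m: int, len1: int, len2: int, t: float) -> float:
    """Evaluate the Jaro formula from precomputed counts.

    m is the number of matching characters, len1 and len2 are the string
    lengths |s_1| and |s_2|, and t is the number of transpositions.
    """
    if m == 0:
        # Assumed convention: no matching characters means no similarity.
        # This also avoids dividing by zero in the (m - t) / m term.
        return 0.0
    return (m / len1 + m / len2 + (m - t) / m) / 3


# Two identical three-character strings: every character matches, in order.
print(jaro_from_counts(3, 3, 3, 0))  # 1.0
```

Separating the arithmetic from the character matching keeps the edge case m = 0 visible in one place.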
For example, given two strings of lengths 5 and 8:
* m = 4. Note that the two "X"s are not considered matches because they are outside the match window of 3.
* |s_1| = 5
* |s_2| = 8
* t = 0
We find a Jaro score of:
:d_j = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{8} + \frac{4-0}{4}\right) = 0.767
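The counting behind this score can be sketched in Python. This is an illustrative implementation under stated assumptions, not the reference C code: the function name `jaro` and the handling of empty strings are choices made here. The pair DIXON and DICKSONX is likewise an assumed example, chosen because it has lengths 5 and 8 and yields m = 4 and t = 0.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity of two strings, in the range [0, 1]."""
    if not s1 or not s2:
        # Assumed convention: two empty strings are identical.
        return 1.0 if s1 == s2 else 0.0

    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)

    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                break

    m = sum(matched1)
    if m == 0:
        return 0.0

    # Walk the matched characters in order; each out-of-order pair counts half.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


# Assumed example pair with lengths 5 and 8; the "X"s fall outside the window.
print(round(jaro("DIXON", "DICKSONX"), 3))  # 0.767
```

Each character of the second string can be matched at most once, which is why the sketch tracks a flag per position rather than simply testing membership.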
To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find the length \ell of the prefix common to both strings:
* \ell = 2
Thus:
:d_w = 0.767 + (2 * 0.1 (1 - 0.767)) = 0.813
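The prefix adjustment can be sketched as a small function applied to an existing Jaro score. The name `jaro_winkler_from_jaro`, the parameter names, and the cap of 4 on the prefix length are assumptions for this illustration. The cap is the value commonly used with a weight of 0.1 so that the score cannot exceed 1. The strings DIXON and DICKSONX are an assumed pair that shares a two-character prefix.

```python
def jaro_winkler_from_jaro(d_j: float, s1: str, s2: str,
                           weight: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix adjustment to a Jaro score d_j."""
    # Length of the common prefix, capped at max_prefix (assumed cap).
    ell = 0
    for a, b in zip(s1, s2):
        if a != b or ell == max_prefix:
            break
        ell += 1
    return d_j + ell * weight * (1 - d_j)


# Unrounded Jaro score for m = 4, |s_1| = 5, |s_2| = 8, t = 0.
d_j = (4 / 5 + 4 / 8 + (4 - 0) / 4) / 3
print(round(jaro_winkler_from_jaro(d_j, "DIXON", "DICKSONX"), 3))  # 0.813
```

Starting from the unrounded Jaro score rather than 0.767 avoids a rounding difference in the third decimal place.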
References
See also
* Record linkage
* Census
External links
* [http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#jaro Open Source implementation in Java and .NET]
* [http://www.census.gov/geo/msb/stand/strcmp.c Original C Implementation by the author of the algorithm]
* [http://diotalevi.isa-geek.net/~josh/Jaro-Winkler/winkler's.pl Perl bindings for the original C implementation]
* [http://diotalevi.isa-geek.net/~josh/Jaro-Winkler/jjore's.pl Clean Perl reimplementation of the algorithm]
Wikimedia Foundation. 2010.