- Jaccard index
The Jaccard index, also known as the Jaccard similarity coefficient (originally coined "coefficient de communauté" by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets.
The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
:<math>J(A,B) = \frac{|A \cap B|}{|A \cup B|}</math>
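The definition above (intersection size over union size) can be sketched in a few lines of Python. The function name `jaccard_index` is illustrative. Returning 1.0 for two empty sets is an assumption made here, since the ratio is undefined in that case.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by size of the union of two sample sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (0/0 is undefined).
        return 1.0
    return len(a & b) / len(union)


# {2, 3} is shared, {1, 2, 3, 4} is the union, so the index is 2 / 4.
print(jaccard_index({1, 2, 3}, {2, 3, 4}))
```

Identical sets give 1.0 and disjoint sets give 0.0, which matches the reading of the coefficient as a similarity score.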
A related measure for attribute vectors is cosine similarity, which compares two vectors A and B by the angle θ between them:
:<math>\theta = \arccos \frac{A \cdot B}{\|A\| \, \|B\|}</math>
For text matching, the attribute vectors A and B are usually the tf-idf vectors of the documents. Since the angle θ lies in the range [0, π], a value of π means exactly opposite, π/2 means independent, and 0 means exactly the same, with in-between values indicating intermediate similarities or dissimilarities.
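As a rough illustration of the angle-based reading, the sketch below computes the angle between two attribute vectors from their dot product and magnitudes. The name `angle_between` and the use of plain Python lists as stand-ins for tf-idf vectors are assumptions for the example.

```python
import math


def angle_between(a, b):
    """Angle in radians, in [0, pi], between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)


print(angle_between([3, 4], [6, 8]))   # same direction
print(angle_between([1, 0], [0, 1]))   # orthogonal (independent)
print(angle_between([1, 0], [-2, 0]))  # opposite direction
```

The three calls correspond to the three reference points in the text: an angle of 0, π/2, and π respectively.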
This cosine similarity metric may be extended such that it yields the Jaccard coefficient in the case of binary attributes. This is the Tanimoto coefficient, T(A, B), represented as:
:<math>T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}</math>
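The following sketch assumes the commonly cited vector form of the Tanimoto coefficient, A·B / (|A|² + |B|² − A·B), and checks on one example that it agrees with the set-based Jaccard coefficient for binary attributes. The function name `tanimoto` and the sample vectors are illustrative.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length attribute vectors.

    Assumed form: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)).
    """
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denom = sq_a + sq_b - dot
    if denom == 0:
        raise ValueError("the coefficient is undefined for two zero vectors")
    return dot / denom


# Binary attribute vectors: position i is 1 when attribute i is present.
a = [1, 1, 0, 1]
b = [1, 0, 1, 1]

# The same data viewed as sets of present attribute positions.
set_a = {i for i, x in enumerate(a) if x}
set_b = {i for i, x in enumerate(b) if x}
jaccard = len(set_a & set_b) / len(set_a | set_b)

print(tanimoto(a, b), jaccard)
```

For binary vectors the dot product counts shared attributes and each squared magnitude counts the attributes present in one vector, so the denominator equals the size of the union.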
See also
* Sørensen's quotient of similarity
*Mountford's index of similarity
*Hamming distance
*Correlation
*Dice's coefficient
References
* Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7
*Paul Jaccard (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547-579.
* Tanimoto, T.T. (1957) IBM Internal Report 17th Nov. 1957.
External links
* [http://www.cals.ncsu.edu/course/ent591k/gcextend.html#diversity Jaccard's index and species diversity]
* [http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html Example of Jaccard's coefficient]
* [http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap2_data.pdf Introduction to Data Mining lecture notes from Tan, Steinbach, Kumar]
* [http://sourceforge.net/projects/simmetrics/ SimMetrics], a SourceForge implementation of this and many other similarity metrics
Wikimedia Foundation. 2010.