Jaccard index

The Jaccard index, also known as the Jaccard similarity coefficient (originally coined "coefficient de communauté" by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets.

The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

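: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$

Cosine similarity is a related measure that compares two attribute vectors A and B by the angle $\theta$ between them, $\cos\theta = \frac{A \cdot B}{\|A\| \|B\|}$. For text matching, the attribute vectors A and B are usually the tf-idf vectors of the documents.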

Since the angle $\theta$ lies in the range $[0, \pi]$, an angle of $\pi$ means the two vectors are exactly opposite, $\pi/2$ means they are independent, and 0 means they are exactly the same, with in-between values indicating intermediate similarity or dissimilarity.
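The three reference angles can be checked numerically. The following is a minimal sketch, assuming the usual definition of cosine similarity as the dot product of the two vectors divided by the product of their Euclidean norms; the function names `cosine_similarity` and `angle_between` and the sample vectors are illustrative and do not come from the sources cited here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is not defined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def angle_between(a, b):
    """Angle theta between two vectors, in radians, within [0, pi]."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(c)

# Illustrative vectors (assumed data, chosen so the norms are exact).
same = angle_between([3.0, 4.0], [6.0, 8.0])        # same direction
orthogonal = angle_between([1.0, 0.0], [0.0, 3.0])  # no shared component
opposite = angle_between([3.0, 4.0], [-3.0, -4.0])  # reversed direction
```

With these inputs the angles come out as 0, $\pi/2$ and $\pi$ respectively. Note that tf-idf vectors have no negative components, so for text matching the angle stays within $[0, \pi/2]$.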

This cosine similarity metric may be extended such that it yields the Jaccard coefficient in the case of binary attributes. This is the Tanimoto coefficient, T(A,B), represented as:

: $T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$
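The agreement between the two coefficients on binary attributes can be illustrated with a small sketch. It computes the Jaccard coefficient directly on two sets and the Tanimoto coefficient on the corresponding 0/1 indicator vectors. The function names, the `universe` list and the sample sets are hypothetical, and raising an error for empty input is a choice made here (some conventions define the value as 1 instead).

```python
def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: treat the coefficient of two empty sets as undefined.
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(a & b) / len(union)

def tanimoto(a, b):
    """Tanimoto coefficient: A.B / (|A|^2 + |B|^2 - A.B) for two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:
        # The denominator is zero only when both vectors are all zeros.
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return dot / denom

# Hypothetical example data: every attribute that can be present.
universe = ["apple", "banana", "cherry", "date", "elderberry"]
set_a = {"apple", "banana", "cherry"}
set_b = {"banana", "cherry", "date"}

# Binary attribute vectors: 1 if the item is in the set, else 0.
vec_a = [1 if item in set_a else 0 for item in universe]
vec_b = [1 if item in set_b else 0 for item in universe]

j = jaccard(set_a, set_b)   # 2 shared items out of 4 distinct items
t = tanimoto(vec_a, vec_b)  # 2 / (3 + 3 - 2)
```

Both `j` and `t` evaluate to 0.5. For 0/1 vectors the dot product counts the shared attributes and each squared norm counts the attributes of one set, so the Tanimoto denominator equals the size of the union.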

See also

* Sørensen's quotient of similarity
* Mountford's index of similarity
* Hamming distance
* Correlation
* Dice's coefficient


References

* Tan, Pang-Ning; Steinbach, Michael; Kumar, Vipin (2005). Introduction to Data Mining. ISBN 0-321-32136-7.
* Jaccard, Paul (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547-579.
* Tanimoto, T. T. (1957). IBM Internal Report, 17 November 1957.

External links

* [http://www.cals.ncsu.edu/course/ent591k/gcextend.html#diversity Jaccard's index and species diversity]
* [http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html Example of Jaccard's coefficient]
* [http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap2_data.pdf Introduction to Data Mining lecture notes from Tan, Steinbach, Kumar]
* [http://sourceforge.net/projects/simmetrics/ SimMetrics, a SourceForge implementation of this and many other similarity metrics]

Wikimedia Foundation. 2010.


