- W-shingling
A w-shingling is a set of unique "
shingle s"—contiguoussubsequence s of tokens in adocument —that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.The document, "a rose is a rose is a rose" can be tokenized as follows:
:(a,rose,is,a,rose,is,a,rose)
The set of all contiguous sequences of 4 tokens (
N-gram s, here: 4-grams) is:{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }
By removing duplicate elements from this set, a 4-shingling is obtained:
:{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
Resemblance
For a given shingle size, the degree to which two documents "A" and "B" resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or
:
where |A| is the size of set A. The resemblance is a number in the range [0,1] , where 1 indicates that two documents are identical. This definition is identical with the
Jaccard coefficient describing similarity and diversity of sample sets.See also
*
Concept Mining offers an alternative method for document similarity calculation with more computational complexity, but where the measure more closely models a human's perception of document similarity.References
*(Manber 1993) "Finding Similar Files in a Large File System". Does not yet use the term "shingling". Available [http://webglimpse.net/pubs/TR93-33.pdf as PDF]
*(Broder, Glassman, Manasse, and Zweig 1997) "SyntacticClustering of the Web". SRC Technical Note #1997-015. Available [http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/ as HTML]
Wikimedia Foundation. 2010.