Tf–idf
The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
Mathematical details
The "term frequency" in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term frequency regardless of the actual importance of that term in the document) to give a measure of the importance of the term within the particular document.
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}
where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and the denominator is the number of occurrences of all terms in document d_j.
The "inverse document frequency" is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient):
\mathrm{idf}_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}
with
* |D|: total number of documents in the corpus
* |\{d_j : t_i \in d_j\}|: number of documents where the term t_i appears (that is, n_{i,j} \neq 0).
Then
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i
A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
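The definitions above can be sketched in a few lines of Python. This is a minimal illustration with an invented toy corpus, not a production implementation (real systems add smoothing and use an inverted index rather than scanning the corpus):

```python
import math

def tf(term, doc):
    # term frequency: occurrences of `term` divided by total terms in `doc`
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency: log of (total docs / docs containing term);
    # assumes `term` occurs in at least one document
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# toy corpus: each document is a list of tokens
corpus = [
    ["the", "cow", "jumped", "over", "the", "moon"],
    ["the", "dog", "barked"],
    ["the", "moon", "is", "bright"],
]
print(tfidf("cow", corpus[0], corpus))  # rare term: positive weight
print(tfidf("the", corpus[0], corpus))  # appears in every document: weight 0
```

Note how "the", despite being the most frequent term in the first document, gets a weight of zero because it occurs in every document (idf = log 1 = 0), illustrating how common terms are filtered out.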
Example
Consider a document containing 100 words wherein the word "cow" appears 3 times. Following the previously defined formulas, the term frequency (tf) for "cow" is then 0.03 (3 / 100). Now, assume we have 10 million documents and "cow" appears in one thousand of these. Then the inverse document frequency is calculated as ln(10 000 000 / 1 000) = 9.21. The tf–idf score is the product of these quantities: 0.03 × 9.21 = 0.28.
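The arithmetic in this example can be checked directly (using the natural logarithm, as above):

```python
import math

tf = 3 / 100                         # "cow" appears 3 times in a 100-word document
idf = math.log(10_000_000 / 1_000)   # 10 million documents, 1,000 contain "cow"
print(round(tf, 2), round(idf, 2), round(tf * idf, 2))  # 0.03 9.21 0.28
```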
Applications in the vector space model
The tf–idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
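This combination can be sketched as follows. The corpus, vocabulary, and documents here are invented for the example; real systems use sparse vectors and precomputed idf values:

```python
import math

def tfidf_vector(doc, corpus, vocab):
    # represent `doc` as a vector of tf * idf weights over the vocabulary
    n = len(corpus)
    vec = []
    for term in vocab:
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        idf = math.log(n / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: dot product over norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

corpus = [
    ["cows", "eat", "grass"],
    ["cows", "drink", "water"],
    ["stocks", "rise", "sharply"],
]
vocab = sorted({t for d in corpus for t in d})
vecs = [tfidf_vector(d, corpus, vocab) for d in corpus]

# the two cow documents are more similar to each other than to the finance one
print(cosine_similarity(vecs[0], vecs[1]) > cosine_similarity(vecs[0], vecs[2]))
```

The first two documents share the term "cows" (which has a positive idf, since it appears in only two of three documents), so their cosine similarity is positive, while documents with no terms in common have similarity zero.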
See also
* Noun phrase
* Word count
* Kullback–Leibler divergence
* Transinformation
* Latent semantic analysis
External links
* [http://portal.acm.org/citation.cfm?id=866292 Term Weighting Approaches in Automatic Text Retrieval]
* [http://bscit.berkeley.edu/cgi-bin/pl_dochome?query_src=&format=html&collection=Wilensky_papers&id=3&show_doc=yes Robust Hyperlinking]: an application of tf–idf for stable document addressability.