Tf–idf

The tf–idf weight (term frequency–inverse document frequency) is a statistical measure, widely used in information retrieval and text mining, of how important a word is to a document in a collection or corpus. The importance increases in proportion to the number of times the word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance to a user query.

Mathematical details

The "term frequency" of a term t_{i} in a document d_{j} starts from the number of times the term appears in that document. This count is usually normalized to prevent a bias towards longer documents, which may contain a term more often regardless of its actual importance. The normalized value measures the importance of the term t_{i} within the particular document d_{j}:

: \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where n_{i,j} is the number of occurrences of the considered term in document d_{j}, and the denominator is the number of occurrences of all terms in document d_{j}.
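The normalized term frequency can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequencies` and the whitespace tokenization are assumptions made here, not part of the definition above.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: tf} for one document given as a list of tokens."""
    counts = Counter(tokens)        # n_{i,j} for every term i in the document
    total = sum(counts.values())    # sum over k of n_{k,j}
    return {term: n / total for term, n in counts.items()}

# Hypothetical six-word document, split on whitespace.
doc = "the cow jumped over the moon".split()
tf = term_frequencies(doc)
```

Here "the" occurs twice among six tokens, so its term frequency is 2/6, and the frequencies of all terms in the document sum to 1.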

The "inverse document frequency" is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient).

: \mathrm{idf}_{i} = \log \frac{|D|}{|\{d_{j} : t_{i} \in d_{j}\}|}

with

* |D| : total number of documents in the corpus
* |\{d_{j} : t_{i} \in d_{j}\}| : number of documents where the term t_{i} appears (that is, n_{i,j} \neq 0).
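A minimal sketch of the inverse document frequency follows, assuming each document is represented as a set of terms and using the natural logarithm. The function name and the choice to raise an error for a term that occurs in no document (where the quotient is undefined) are assumptions of this sketch.

```python
import math

def inverse_document_frequency(term, documents):
    """idf = log(|D| / number of documents that contain the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # Assumption of this sketch: the formula is undefined here, so fail loudly.
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(len(documents) / containing)

# Hypothetical corpus of four documents, each given as a set of terms.
corpus = [
    {"the", "cow", "jumped"},
    {"the", "moon"},
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
]
idf_the = inverse_document_frequency("the", corpus)  # log(4 / 4)
idf_cow = inverse_document_frequency("cow", corpus)  # log(4 / 1)
```

A term that appears in every document ("the") gets an idf of zero, while a term confined to one document ("cow") gets the largest value available in this corpus.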

Then

: \mathrm{tf\text{-}idf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
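The filtering effect on common terms can be seen in a small sketch that combines both quantities over a whole corpus. The function name `tf_idf`, the toy corpus, and the use of the natural logarithm are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in corpus."""
    n_docs = len(corpus)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the cow eats the grass".split(),
    "the cat chases the mouse".split(),
    "the dog sleeps".split(),
]
weights = tf_idf(corpus)
```

In the first document "the" has the highest term frequency, yet its weight is zero because it occurs in every document. "cow" occurs only once but receives a positive weight, since no other document contains it.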

Example

Consider a document containing 100 words in which the word "cow" appears 3 times. Following the formulas defined above, the term frequency (tf) for "cow" is 3 / 100 = 0.03. Now assume there are 10 million documents and "cow" appears in one thousand of them. The inverse document frequency (idf) is then ln(10 000 000 / 1 000) = 9.21. The tf–idf score is the product of these quantities: 0.03 * 9.21 = 0.28.
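The arithmetic of this example can be checked directly. The snippet below only reproduces the numbers given in the text, using the natural logarithm as the example does.

```python
import math

tf = 3 / 100                        # "cow" appears 3 times in 100 words
idf = math.log(10_000_000 / 1_000)  # natural logarithm of 10,000
score = tf * idf

print(f"tf = {tf:.2f}, idf = {idf:.2f}, tf-idf = {score:.2f}")
# prints: tf = 0.03, idf = 9.21, tf-idf = 0.28
```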

Applications in the vector space model

The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
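As a rough sketch of this use, each document can be represented as a sparse vector of tf–idf weights and compared with cosine similarity. The dictionary representation, the function name, and the weights below are hypothetical values chosen for illustration, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf vectors for three documents.
doc1 = {"cow": 0.28, "grass": 0.10}
doc2 = {"cow": 0.14, "grass": 0.05}  # same direction as doc1, half the length
doc3 = {"moon": 0.30}                # shares no terms with doc1

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)
```

Because cosine similarity depends only on the direction of the vectors, `doc1` and `doc2` score close to 1 despite their different magnitudes, while `doc1` and `doc3`, which share no terms, score 0.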

See also

* Noun phrase
* Word count
* Kullback-Leibler divergence
* Transinformation
* Latent semantic analysis

References

* Spärck Jones, Karen (1972). "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation 28 (1): 11–21. doi:10.1108/eb026526. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf
* Salton, G.; McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill. ISBN 0070544840.
* Salton, Gerard; Fox, Edward A.; Wu, Harry (November 1983). "Extended Boolean information retrieval". Communications of the ACM 26 (11): 1022–1036. doi:10.1145/182.358466. http://portal.acm.org/citation.cfm?id=358466
* Salton, Gerard; Buckley, C. (1988). "Term-weighting approaches in automatic text retrieval". Information Processing & Management 24 (5): 513–523. doi:10.1016/0306-4573(88)90021-0.

External links

* [http://portal.acm.org/citation.cfm?id=866292 Term Weighting Approaches in Automatic Text Retrieval]
* [http://bscit.berkeley.edu/cgi-bin/pl_dochome?query_src=&format=html&collection=Wilensky_papers&id=3&show_doc=yes Robust Hyperlinking] : An application of tf–idf for stable document addressability.


Wikimedia Foundation. 2010.
