Document-term matrix

Document-term matrix

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. They are useful in the field of natural language processing.

Contents

General Concept

When creating a database of terms that appear in a set of documents the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance if one has the following two (short) documents:

  • D1 = "I like databases"
  • D2 = "I hate hate databases",

then the document-term matrix would be:

I like hate databases
D1 1 1 0 1
D2 1 0 2 1

which shows which documents contain which terms and how many times they appear.

Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.

Choice of Terms

A point of view on the matrix is that each row represents a document. In the vectorial semantic model, which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the more significant categories, and that words from those categories should be kept as terms. Adding collocation as terms improves the quality of the vectors, especially when computing similarities between documents.

Applications

Improving search results

Latent semantic analysis (LSA, performing eigenvalue decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.

Finding topics

Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used, and more recently probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.

See also

Implementations

  • Gensim: Open source Python framework for Vector Space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text plus common transformations (tf-idf, LSA, LDA).



Wikimedia Foundation. 2010.

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

  • Matrix (mathematics) — Specific elements of a matrix are often denoted by a variable with two subscripts. For instance, a2,1 represents the element at the second row and first column of a matrix A. In mathematics, a matrix (plural matrices, or less commonly matrixes)… …   Wikipedia

  • Term Discrimination — is a way to rank keywords in how useful they are for Information Retrieval. Overview This is a method similar to tf idf but it deals with finding keywords suitable for information retrieval and ones that are not. Please refer to Vector Space… …   Wikipedia

  • Non-negative matrix factorization — NMF redirects here. For the bridge convention, see new minor forcing. Non negative matrix factorization (NMF) is a group of algorithms in multivariate analysis and linear algebra where a matrix, , is factorized into (usually) two matrices, and… …   Wikipedia

  • Portable Document Format — PDF redirects here. For other uses, see PDF (disambiguation). Portable Document Format Adobe Reader icon Filename extension .pdf Internet media type application/pdf application/x pdf application/x bzpdf application/x gzpdf …   Wikipedia

  • Index (search engine) — Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and… …   Wikipedia

  • Bag of words model — The bag of words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar… …   Wikipedia

  • Linear classifier — In the field of machine learning, the goal of classification is to group items that have similar feature values, into groups. A linear classifier achieves this by making a classification decision based on the value of the linear combination of… …   Wikipedia

  • Personal computer hardware — Hardware of a modern personal computer 1. Monitor 2. Motherboard 3. CPU 4. RAM 5. Expansion cards 6. Power supply 7. Optical disc drive 8. Hard disk drive …   Wikipedia

  • Latent semantic analysis — (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA was …   Wikipedia

  • Business and Industry Review — ▪ 1999 Introduction Overview        Annual Average Rates of Growth of Manufacturing Output, 1980 97, Table Pattern of Output, 1994 97, Table Index Numbers of Production, Employment, and Productivity in Manufacturing Industries, Table (For Annual… …   Universalium

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”