Document clustering

Document clustering (also called text clustering) is closely related to data clustering: it applies clustering specifically to text documents. It is used for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering.

A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse the results or to identify relevant information. Clustering methods can automatically group the retrieved documents into a list of meaningful categories, as is done by enterprise search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, and open source software such as Carrot2.
Example:
FirstGov.gov, the official Web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits “immigration”, they will see, next to the list of results, categories such as “Immigration Reform”, “Citizenship and Immigration Services”, “Employment”, and “Department of Homeland Security”. Probabilistic Latent Semantic Analysis (PLSA) can also be used to perform document clustering.

Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents of a cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
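As an illustration of descriptor extraction, the following is a minimal sketch rather than a definitive method. It assumes the clusters are already known and scores each word by how often it occurs in a cluster, weighted down when the word also occurs in other clusters. The function name extract_descriptors, the scoring formula and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter

def extract_descriptors(clusters, top_n=2):
    """Return the top_n highest-scoring words for each cluster.

    clusters is a list of clusters, each a list of document strings.
    Score = (count of the word in the cluster)
            * log(number of clusters / number of clusters containing the word),
    so a word that appears in every cluster scores zero.
    """
    # Word counts per cluster, using naive whitespace tokenisation.
    term_counts = [
        Counter(word for doc in docs for word in doc.lower().split())
        for docs in clusters
    ]
    # In how many clusters does each word appear at least once?
    cluster_freq = Counter(term for counts in term_counts for term in counts)

    descriptors = []
    for counts in term_counts:
        scores = {
            term: count * math.log(len(clusters) / cluster_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        descriptors.append(ranked[:top_n])
    return descriptors

# Hypothetical, hand-made clusters for the query "immigration".
clusters = [
    ["immigration reform bill", "immigration reform debate"],
    ["citizenship and immigration services", "citizenship services office"],
]
print(extract_descriptors(clusters))
# [['reform', 'bill'], ['citizenship', 'services']]
```

Because "immigration" occurs in both clusters it receives a score of zero and is not chosen as a descriptor, which matches the intuition that a descriptor should distinguish one cluster from the others.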

Applications of document clustering fall into two types, online and offline. Online applications are usually more constrained by efficiency requirements than offline applications.

In general, there are two common families of algorithms. The first is hierarchical clustering, which includes single linkage, complete linkage, group average and Ward's method. By aggregating or dividing clusters, documents can be organized into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The second family is based on the K-means algorithm and its variants. These algorithms are usually more efficient, but less accurate than hierarchical algorithms.
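The two families described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reference implementation: documents are represented as length-normalised term-frequency vectors, similarity is the cosine, the K-means variant uses cosine similarity with a deterministic farthest-first seeding, and Ward's method is omitted. All function names and the sample documents are invented for the example.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Turn each non-empty document into a unit-length term-frequency dict."""
    vectors = []
    for text in docs:
        counts = Counter(text.lower().split())
        norm = math.sqrt(sum(c * c for c in counts.values()))
        vectors.append({term: c / norm for term, c in counts.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

def agglomerative(vectors, k, linkage="average"):
    """Bottom-up clustering: merge the two most similar clusters until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def score(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":      # closest pair of members
            return max(sims)
        if linkage == "complete":    # farthest pair of members
            return min(sims)
        return sum(sims) / len(sims)  # group average

    while len(clusters) > k:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: score(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

def kmeans(vectors, k, iterations=10):
    """K-means with cosine similarity; returns one cluster label per document."""
    # Farthest-first seeding (an assumption, chosen to keep the run deterministic).
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(
            cosine(vectors[i], vectors[s]) for s in seeds)))
    centroids = [dict(vectors[s]) for s in seeds]

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the normalised sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for term, weight in v.items():
                        total[term] += weight
            if total:  # keep the old centroid if the cluster is empty
                norm = math.sqrt(sum(w * w for w in total.values()))
                centroids[c] = {term: w / norm for term, w in total.items()}
    return labels

docs = [
    "visa application immigration office",
    "immigration visa reform law",
    "football match goal score",
    "football league match result",
]
vecs = tf_vectors(docs)
print(agglomerative(vecs, 2))  # [[0, 1], [2, 3]]
print(kmeans(vecs, 2))         # [0, 0, 1, 1]
```

The efficiency difference mentioned above is visible in the structure of the code: the agglomerative loop recomputes similarities between all pairs of clusters at every merge, while each K-means iteration only compares every document with k centroids.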

Other algorithms include graph-based clustering, ontology-supported clustering and order-sensitive clustering.
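Graph-based clustering covers many different methods. The sketch below shows only one simple reading of the idea, chosen for illustration: documents are nodes, an edge joins two documents whose word sets have a Jaccard similarity at or above a threshold, and each connected component of the graph is treated as a cluster. The function names, the similarity measure and the threshold value are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two documents."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def graph_clusters(docs, threshold=0.2):
    """Cluster documents as connected components of a similarity graph."""
    n = len(docs)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(docs[i], docs[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects every document reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components

documents = [
    "visa application immigration office",
    "immigration visa reform law",
    "football match goal score",
    "football league match result",
    "weather forecast rain",
]
print(graph_clusters(documents))  # [[0, 1], [2, 3], [4]]
```

Unlike K-means, this approach does not need the number of clusters in advance; the threshold determines how many clusters appear, and a document with no sufficiently similar neighbour forms a cluster of its own.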


Further reading

Publications:

  • Nicholas O. Andrews and Edward A. Fox. Recent Developments in Document Clustering. October 16, 2007
  • Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, Dawid Weiss. A survey of Web clustering engines. ACM Computing Surveys (CSUR), Volume 41, Issue 3 (July 2009), Article No. 17, ISSN:0360-0300


Wikimedia Foundation. 2010.
