Document classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems overlap, however, and there is therefore also interdisciplinary research on document classification.

The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.

Documents may be classified according to their subjects or according to other attributes (such as document type, author, or year of printing). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content based approach and the request based approach.

Contents

1 "Content based" versus "request based" classification

2 Classification versus indexing

3 Automatic document classification

4 Techniques

5 Applications

6 See also

7 Further reading

"Content based" versus "request based" classification

Content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification the weight could be the number of times given words appear in a document.
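As a minimal sketch of the word-count idea, the following Python snippet assigns a text to the class whose keywords account for the largest share of its tokens, and returns no class when that share falls below a threshold. The keyword lists, the function name classify_by_term_counts, and the reuse of the 20% figure as a default threshold are illustrative assumptions, not an established algorithm or a library rule.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify_by_term_counts(text, class_keywords, min_share=0.2):
    """Return (best_class, share), or (None, share) if no class reaches min_share.

    class_keywords maps a class name to a set of keywords (hypothetical data).
    share is the fraction of tokens in the text that are keywords of the class.
    """
    tokens = tokenize(text)
    if not tokens:
        return None, 0.0
    counts = Counter(tokens)
    shares = {
        label: sum(counts[word] for word in keywords) / len(tokens)
        for label, keywords in class_keywords.items()
    }
    best = max(shares, key=shares.get)
    if shares[best] < min_share:
        return None, shares[best]
    return best, shares[best]


# Hypothetical keyword lists, chosen only for illustration.
KEYWORDS = {
    "music": {"symphony", "orchestra", "composer"},
    "chemistry": {"molecule", "reaction", "acid"},
}
label, share = classify_by_term_counts(
    "The composer wrote a symphony for a large orchestra", KEYWORDS
)
print(label, round(share, 2))
```

Raw counts favour long documents and frequent words, which is one reason practical systems usually weight or normalize the counts in some way.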

Request oriented classification (or indexing) is classification in which the anticipated requests from users influence how documents are classified. The classifier asks, “Under which descriptors should this entity be found?” and must “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230^[1]).

Request oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify or index documents differently from a historical library. It is probably better, however, to understand request oriented classification as policy based classification: the classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request oriented classification be regarded as a user-based approach.

Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. “These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21^[2]). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986,^[3] 2004;^[4] Broughton, 2008;^[5] Riesthuis & Bliedung, 1991^[6]). Labeling a document (say, by assigning it a term from a controlled vocabulary) therefore at the same time assigns that document to the class of documents indexed by that term: all documents indexed or classified as X belong to the same class of documents.

Automatic document classification

Automatic document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where only some of the documents are labeled by the external mechanism.
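The difference between the supervised and unsupervised settings can be sketched in a few lines of Python. In this toy illustration, documents are reduced to word sets and compared by Jaccard similarity. The supervised function copies the label of the most similar labeled example, while the unsupervised function groups documents by similarity alone. The function names, the sample texts, and the 0.3 threshold are assumptions made for the example, not a standard method.

```python
import re


def word_set(text):
    """Represent a document as the set of its lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_supervised(text, labeled_docs):
    """Supervised: copy the label of the most similar labeled document."""
    words = word_set(text)
    _, best_label = max(
        labeled_docs, key=lambda pair: jaccard(words, word_set(pair[0]))
    )
    return best_label


def cluster_unsupervised(texts, threshold=0.3):
    """Unsupervised: group documents using similarity alone, without labels.

    Single-pass clustering: each document joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster.
    """
    clusters = []
    for text in texts:
        words = word_set(text)
        for cluster in clusters:
            if jaccard(words, word_set(cluster[0])) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


# Hypothetical sample data.
LABELED = [
    ("the orchestra played a symphony", "music"),
    ("the acid caused a chemical reaction", "chemistry"),
]
predicted = classify_supervised("a symphony for orchestra", LABELED)

UNLABELED = [
    "the orchestra played a symphony",
    "a symphony for orchestra",
    "the acid caused a chemical reaction",
]
clusters = cluster_unsupervised(UNLABELED)
print(predicted, len(clusters))
```

The supervised function can name the class because the labels came from outside. The clustering function can only report which documents belong together, so a person or a further step must interpret each group.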

Techniques

Automatic document classification techniques include:

Expectation maximization (EM)

Naive Bayes classifier

Tf-idf

Latent semantic indexing

Support vector machines (SVM)

Artificial neural network

K-nearest neighbour algorithms

Decision trees such as ID3 or C4.5

Concept mining

Rough set based classifier

Soft set based classifier

Natural language processing approaches
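Two of the listed techniques are often combined: tf-idf weighting to turn texts into vectors, and a k-nearest neighbour vote over those vectors. The following is a minimal sketch in plain Python, assuming the common idf variant log(N / df) and cosine similarity. The function names and the training sentences are hypothetical, and real systems would typically rely on a tested library rather than code like this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(token_lists):
    """Inverse document frequency, log(N / df), for each training term."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms unseen during fitting are ignored."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(text, train_texts, train_labels, k=3):
    """Majority vote among the k training documents most similar to text."""
    token_lists = [tokenize(t) for t in train_texts]
    idf = fit_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    query = tfidf_vector(tokenize(text), idf)
    ranked = sorted(
        zip(vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data.
TRAIN_TEXTS = [
    "the orchestra played a symphony",
    "the composer conducted the orchestra",
    "the acid caused a reaction",
    "the molecule reacts with acid",
]
TRAIN_LABELS = ["music", "music", "chemistry", "chemistry"]

result = knn_classify("a new symphony by the composer", TRAIN_TEXTS, TRAIN_LABELS)
print(result)
```

Because "the" occurs in every training document, its idf is log(4/4) = 0 and it contributes nothing to the similarity, which is the intended effect of the weighting.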

Applications

Classification techniques have been applied to

spam filtering, a process which tries to distinguish email spam messages from legitimate emails

topic spotting, automatically determining the topic of a text

email routing, forwarding an email sent to a general address to a specific address or mailbox depending on its topic^[7]

language guessing, automatically determining the language of a text

genre classification, automatically determining the genre of a text^[8]
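As an illustration of the first application, spam filtering can be sketched with a Naive Bayes classifier, one of the techniques listed earlier. The snippet below is a minimal multinomial Naive Bayes with add-one smoothing written in plain Python. The class name, the tiny training set, and the "spam"/"ham" labels are assumptions for the example, and a production filter would need far more data and features.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log(
                        (counts[word] + 1) / (total_words + vocab_size)
                    )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training messages.
TRAIN_TEXTS = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "please review the project report",
]
TRAIN_LABELS = ["spam", "spam", "ham", "ham"]

model = NaiveBayesTextClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(model.predict("claim your free prize"))
print(model.predict("project meeting on monday"))
```

The same class could serve topic spotting or email routing by replacing the two labels with topic or mailbox names, since nothing in the code is specific to spam.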

See also

Categorization

Classification (disambiguation)

Compound term processing

Content-based image retrieval

Document

Supervised learning, unsupervised learning

Document retrieval

Document clustering

Information retrieval

Knowledge organization

Knowledge Organization System

Library classification

Machine learning

String metrics

Subject (documents)

Subject indexing

Text mining, web mining, concept mining

RapidMiner - open source text mining software used for document classification, e-mail spam detection, e-mail routing, text sentiment analysis, and other text classification tasks.

Further reading

Publications:

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.

Introduction to document classification

Bibliography on Automated Text Categorization

Bibliography on Query Classification

Text Classification analysis page

Learning to Classify Text - Chap. 6 of the book Natural Language Processing with Python (available online)

References:

^ Soergel, Dagobert (1985). Organizing information: Principles of data base and retrieval systems. Orlando, FL: Academic Press.

^ Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.

^ Aitchison, J. (1986). “A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure.” Journal of Documentation, Vol. 42 No. 3, pp. 160-181.

^ Aitchison, J. (2004). “Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule.” Bliss Classification Bulletin, Vol. 46, pp. 20-26.

^ Broughton, V. (2008). “A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification (2nd Ed.).” Axiomathes, Vol. 18 No. 2, pp. 193-210.

^ Riesthuis, G. J. A., & Bliedung, St. (1991). “Thesaurification of the UDC.” Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.

^ Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). Message classification in the call center. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158-165, ACL.

^ Santini, Marina; Rosso, Mark (2008), Testing a Genre-Enabled Application: A Preliminary Assessment, BCS IRSG Symposium: Future Directions in Information Access, London, UK, http://www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf

Data sets:

TechTC - Technion Repository of Text Categorization Datasets

David D. Lewis's Datasets

