Classification in machine learning
See also: Pattern recognition
In machine learning and pattern recognition, classification refers to an algorithmic procedure for assigning a given piece of input data to one of a given number of categories. Examples include assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a patient based on the patient's observed characteristics (gender, blood pressure, presence or absence of certain symptoms, etc.). An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.
The piece of input data is formally termed an instance, and the categories are termed classes. The instance is formally described by a vector of features, which together constitute a description of all known characteristics of the instance. Typically, features are either categorical (also known as nominal, i.e. consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting of one of a set of ordered items, e.g. "large", "medium" or "small"), integer-valued (e.g. a count of the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Often, categorical and ordinal data are grouped together; likewise for integer-valued and real-valued data. Furthermore, many algorithms work only in terms of categorical data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
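The discretization step mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name, the bin labels and the choice to place the boundary values 5 and 10 in the middle group are assumptions, since the text does not specify them.

```python
def discretize(value, low=5, high=10):
    """Map a numeric feature value to one of three categorical bins.

    Assumption: the boundary values themselves (5 and 10 by default)
    fall into the middle bin.
    """
    if value < low:
        return "below"
    if value <= high:
        return "between"
    return "above"


# Hypothetical integer-valued feature: occurrences of a word in five emails.
word_counts = [0, 5, 7, 10, 23]
bins = [discretize(count) for count in word_counts]
print(bins)  # ['below', 'between', 'between', 'between', 'above']
```

After this step, an algorithm that handles only categorical data can treat each bin label as one value of a nominal (or ordinal) feature.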
Classification normally refers to a supervised procedure, i.e. a procedure that learns to classify new instances from a training set of instances that have been labeled by hand with the correct classes. The corresponding unsupervised procedure is known as clustering, and involves grouping data into classes based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). Note that the terminology differs in some fields: for example, in community ecology, the term "classification" is synonymous with what is commonly known in machine learning as "clustering".
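As a rough illustration of the supervised setting, the following Python sketch classifies a new instance by copying the label of the closest instance in a hand-labeled training set. The nearest-neighbor rule is chosen here only because it is short; it is one of many supervised methods, and the feature values and labels are invented for the example.

```python
import math


def nearest_neighbor_classify(training_set, instance):
    """Return the class of the training instance closest to `instance`.

    `training_set` is a list of (feature_vector, class_label) pairs.
    Distance is Euclidean; ties go to the earliest training instance.
    """
    _, nearest_label = min(
        training_set,
        key=lambda pair: math.dist(pair[0], instance),
    )
    return nearest_label


# Hypothetical training set. Each feature vector is
# (count of the word "free", count of exclamation marks).
training_set = [
    ((0, 0), "non-spam"),
    ((1, 0), "non-spam"),
    ((4, 5), "spam"),
    ((6, 3), "spam"),
]

print(nearest_neighbor_classify(training_set, (5, 4)))  # spam
print(nearest_neighbor_classify(training_set, (0, 1)))  # non-spam
```

A clustering procedure would receive the same feature vectors without the labels and would have to form the groups itself from the distances alone.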
Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of some sort of output value to a given input value. Other examples include regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part-of-speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
A common subclass of classification is probabilistic classification. Algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability. Such an algorithm has several advantages over non-probabilistic classifiers:
- It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier).
- Correspondingly, it can abstain when its confidence in every possible output is too low.
- Because they output probabilities, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.
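The selection and abstention behavior described in the list above can be sketched as follows. The function name, the probability values and the threshold of 0.6 are illustrative assumptions; how the probabilities are estimated is outside the scope of the sketch.

```python
def choose_class(probabilities, threshold=0.6):
    """Select the most probable class, or abstain if confidence is too low.

    `probabilities` maps each class to its estimated probability.
    Returns (class, confidence); the class is None when abstaining.
    The threshold value is a hypothetical choice, not a standard one.
    """
    best_class = max(probabilities, key=probabilities.get)
    confidence = probabilities[best_class]
    if confidence < threshold:
        return None, confidence
    return best_class, confidence


print(choose_class({"spam": 0.92, "non-spam": 0.08}))  # ('spam', 0.92)
print(choose_class({"spam": 0.55, "non-spam": 0.45}))  # (None, 0.55)
```

Returning the confidence alongside the class lets a larger system weigh the decision rather than treat it as certain, which is the property that helps limit error propagation.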
Note that the term statistical classification is often encountered but is used inconsistently in the technical literature. For some writers (especially within the field of machine learning), "statistical classification" and "probabilistic classification" are synonymous. For others, "statistical classification" encompasses any classifier that makes soft decisions using weights, whether or not there is an associated statistical model or probabilistic output. For yet others, "statistical classification" is even wider, encompassing practically all of the classification algorithms commonly used in machine learning, including algorithms such as decision trees that make hard decisions using if-then rules, much like older hand-coded classifiers.
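One way to read the contrast between hard if-then decisions and soft, weight-based decisions is sketched below. This is an interpretation for illustration only: the rules, weights, bias and feature names are all invented, and neither function is taken from a particular algorithm.

```python
def rule_based_classify(free_count, exclamation_count):
    """Hard decision: nested if-then rules of the kind a decision tree applies."""
    if free_count > 2:
        if exclamation_count > 1:
            return "spam"
        return "non-spam"
    return "non-spam"


def weighted_score(free_count, exclamation_count, weights=(0.8, 0.5), bias=-3.0):
    """Soft decision: a weighted sum of the features.

    The sign of the score selects the class and its magnitude indicates
    how strongly the features favor it. No probabilistic model is implied.
    """
    return weights[0] * free_count + weights[1] * exclamation_count + bias


def weighted_classify(free_count, exclamation_count):
    """Turn the weighted score into a class by thresholding at zero."""
    score = weighted_score(free_count, exclamation_count)
    return "spam" if score > 0 else "non-spam"


print(rule_based_classify(4, 5))  # spam
print(weighted_classify(4, 5))    # spam
print(weighted_classify(0, 0))    # non-spam
```

The rule-based function yields only a class, whereas the weighted version also exposes a graded score that can be inspected before the final decision is made.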
Formal problem statement
See the article on pattern recognition for a formal statement of the problem of classification and related labeling tasks, including a rigorous mathematical treatment.
Application domains
Classification problems arise in many data mining applications.
- Computer vision
- Medical imaging and medical image analysis
- Optical character recognition
- Video tracking
- Drug discovery and development
- Geostatistics
- Speech recognition
- Handwriting recognition
- Biometric identification
- Natural language processing
- Document classification
- Internet search engines
- Credit scoring
- Pattern recognition
See also
- Class membership probabilities
- Compound term processing
- Data mining
- Fuzzy logic
- Data warehouse
- Information retrieval
- Artificial intelligence
- Machine learning
- Pattern recognition
External links
- Classifier showdown: a practical comparison of classification algorithms.
- Statistical Pattern Recognition Toolbox for Matlab.
- TOOLDIAG: pattern recognition toolbox.
- Library of variable kernel density estimation routines written in C++.
- PAL Classification suite written in Java.
Wikimedia Foundation. 2010.