Supervised learning

Supervised learning is a machine learning technique for learning a function from training data. The training data consist of pairs of input objects (typically vectors), and desired outputs. The output of the functioncan be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a "reasonable" way (see inductive bias).(Compare with unsupervised learning.) The parallel task in human and animal psychology is often referred to as concept learning.

Overview

Supervised learning can generate models of two types. Most commonly, supervised learning generates a global model that maps input objects to desired outputs. In some cases, however, the map is implemented as a set of local models (such as in case-based reasoning or the nearest neighbor algorithm).

In order to solve a given problem of supervised learning (e.g. learning to recognize handwriting) one has to consider various steps:
# Determine the type of training examples. Before doing anything else, the engineer should decide what kind of data is to be used as an example. For instance, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
# Gathering a training set. The training set needs to be characteristic of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
# Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should be large enough to accurately predict the output.
# Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use artificial neural networks or decision trees.
# Complete the design. The engineer then runs the learning algorithm on the gathered training set. Parameters of the learning algorithm may be adjusted by optimizing performance on a subset (called a "validation" set) of the training set, or via cross-validation. After parameter adjustment and learning, the performance of the algorithm may be measured on a test set that is separate from the training set.

Another term for supervised learning is classification. A wide range of classifiers are available, each with its strengths and weaknesses. Classifier performance depend greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems, this is also referred to as the 'No free lunch theorem'. Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is however still more an art than a science.

The most widely used classifiers are the Neural Network (Multi-layer Perceptron), Support Vector Machines, k-Nearest Neighbors, Gaussian Mixture Model, Gaussian, Naive Bayes, Decision Tree and RBF classifiers.

Empirical risk minimization

The goal of supervised learning of a global model is to find a function "g", given a set of points of the form ("x", "g"("x")).

It is assumed that the set of points for which the behavior of "g" is known is an independent and identically-distributed random variables sample drawn according to an unknown probability distribution "p" of a larger, possibly infinite, population. Furthermore, one assumes the existence of a task-specific loss function "L" of type

: $L: Y imes Y o Bbb{R}^+$

where "Y" is the codomain of "g" and "L" maps into the nonnegative real numbers (further restrictions may be placed on "L"). The quantity "L"("z", "y") is the loss incurred by predicting "z" as the value of "g" at a given point when the true value is "y".

The "risk" associated with a function "f" is then defined as the expectation of the loss function, as follows:

: $R(f) = sum_i L(f(x_i), g(x_i)) ; p(x_i)$

if the probability distribution "p" is discrete (the analogous continuous case employs a definite integral and a probability density function).

The goal is now to find a function "f"^* among a fixed subclass of functions for which the risk "R"("f"^*) is minimal.

However, since the behavior of "g" is generally only known for a finite set of points ("x"₁, "y"₁), ..., ("x"_n, "y"_n), one can only approximate the true risk, for example with the "empirical risk":

: $ilde{R}_n(f) = frac{1}{n} sum_{i=1}^n L(f(x_i), y_i)$

Selecting the function "f"^* that minimizes the empirical risk is known as the principle of "empirical risk minimization". Statistical learning theory investigates under what conditions empirical risk minimization is admissible and how good the approximations can be expected to be.

Active Learning

There are situations in which unlabeled data is abundant but labeling data is expensive. In such a scenario the learning algorithm can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning. With this approach there is a risk that the algorithm might focus on unimportant or even invalid examples.

Active learning can be especially useful in biological research problems such as Protein engineering where a few proteins have been discovered with a certain interesting function and one wishes to determine which of many possible mutants to make next that will have a similar functionDanziger, S.A., Swamidass, S.J., Zeng, J., Dearth, L.R., Lu, Q., Chen, J.H., Cheng, J., Hoang, V.P., Saigo, H., Luo, R., Baldi, P., Brachmann, R.K. and Lathrop, R.H. Functional census of mutation sequence spaces: the example of p53 cancer rescue mutants, (2006) "IEEE/ACM transactions on computational biology and bioinformatics", 3, 114-125. ] .

yntax

Let $T$ be the total set of all data under consideration. For example, in a protein engineering problem, $T$ would include all proteins that are known to have a certain interesting activity and all additional proteins that one might want to test for that activity.

During each iteration, $i$ , $T$ is broken up in to three subsets:
# $mathbf{T}_{K,i}$ : Data points where the label is known.
# $mathbf{T}_{U,i}$ : Data points where the label is unknown.
# $mathbf{T}_{C,i}$ : A subset of $T_{U,i}$ that is chosen to be labeled.

Most of the current research in active learning involves the best method to chose the data points for $T_{C,i}$ .

Minimum Marginal Hyperplane

Most active learning algorithms are built upon Support vector machines (SVMs) and exploit the structure of the SVM to determine which data points to label. Such methods usually calculate the margin, $W$ , of each unlabeled datum in $T_{U,i}$ and treat $W$ as a n-dimensional distance from that datum to separating hyperplane.

Minimum Marginal Hyperplane methods assume that the datum with the smallest $W$ are those that the SVM is most uncertain about and therefore should be placed in $T_{C,i}$ to be labeled. Other similar methods such as Maximum Marginal Hyperplane chose datum with the largest $W$ or Tradeoff methods chose a mix of the smallest and largest $W$ s

Maximum Curiosity

Another active learning method, that typically learns a data set with fewer examples than Minimum Marginal Hyperplane but is more computationally intensive and only works for discrete classifiers is Maximum CuriosityDanziger, S.A., Zeng, J., Wang, Y., Brachmann, R.K. and Lathrop, R.H. Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants,(2007) "Bioinformatics", 23(13), 104-114.] .

Maximum curiosity takes each unlabeled datum in $T_{U,i}$ and assumes all possible labels that that datum might have. This datum with each assumed class is added to $T_{K,i}$ and then the new $T_{K,i}$ is cross-validated. It is assumed that the when the datum is paired up with its correct label, the cross-validated accuracy (or correlation coefficient) of $T_{K,i}$ will most improve. The datum with the most improved accuracy are placed in $T_{C,i}$ to be labeled

Approaches and algorithms

* Analytical learning
* Artificial neural network
* Backpropagation
* Boosting
* Bayesian statistics
* Case-based reasoning
* Decision tree learning
* Inductive logic programming
* Gaussian process regression
* Learning Automata
* Minimum message length (decision trees, decision graphs, etc.)
* Naive bayes classifier
* Nearest Neighbor Algorithm
* Probably approximately correct learning (PAC) learning
* Ripple down rules, a knowledge acquisition methodology
* Symbolic machine learning algorithms
* Subsymbolic machine learning algorithms
* Support vector machines
* Random Forests
* Ensembles of Classifiers
* Ordinal Classification
* Data Pre-processing
* Handling imbalanced datasets

Applications

* Bioinformatics
* Cheminformatics
**Quantitative structure-activity relationship
* Handwriting recognition
* Information retrieval
* Object recognition in computer vision
* Optical character recognition
* Spam detection
* Pattern recognition
* Speech recognition
* Forecasting Fraudulent Financial Statements

General issues

* Computational learning theory
* Inductive bias
* Overfitting (machine learning)
* Version spaces

References

*S. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica Journal 31 (2007) 249-268 (http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf).

External links

* [http://sumo.intec.ugent.be/?q=SUMO_toolbox Matlab SUrrogate MOdeling Toolbox - SUMO Toolbox] - Matlab code for Active Learning + Model Selection + Supervised Learning (Surrogate Modeling)

Wikimedia Foundation. 2010.

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

Semi-supervised learning — In computer science, semi supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled data for training typically a small amount of labeled data with a large amount of unlabeled data. Semi supervised… … Wikipedia
supervised study — learning with supervision, studying with guidance … English contemporary dictionary
Learning for Life — Infobox WorldScouting owner = Boy Scouts of America type = organization name = Learning for Life headquarters = Irving, Texas country = United States f date = 1992 members = 1,750,767 youth 61,041 adults (2006) [cite web… … Wikipedia
Supervised agricultural experience — A supervised agriculture experience, or SAE, is required before obtaining a Chapter FFA Degree for the National FFA Organization. An SAE can be anything from raising livestock at your school farm, or a plant project. According to the 13th… … Wikipedia
Machine learning — is a subfield of artificial intelligence that is concerned with the design and development of algorithms and techniques that allow computers to learn . In general, there are two types of learning: inductive, and deductive. Inductive machine… … Wikipedia
Concept learning — Concept learning, also known as category learning, concept attainment, and concept formation, is largely based on the works of the cognitive psychologist Jerome Bruner. Bruner, Goodnow, Austin (1967) defined concept attainment (or concept… … Wikipedia
Computational learning theory — In theoretical computer science, computational learning theory is a mathematical field related to the analysis of machine learning algorithms. Contents 1 Overview 2 See also 3 References 3.1 Surveys … Wikipedia
Transduction (machine learning) — In logic, statistical inference, and supervised learning,transduction or transductive inference is reasoning fromobserved, specific (training) cases to specific (test) cases. In contrast, induction is reasoning from observed training casesto… … Wikipedia
Reinforcement learning — Inspired by related psychological theory, in computer science, reinforcement learning is a sub area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long term reward .… … Wikipedia
Unsupervised learning — In machine learning, unsupervised learning is a class of problems in which one seeks to determine how the data are organised. It is distinguished from supervised learning (and reinforcement learning) in that there are only inputs, and no… … Wikipedia

Academic Dictionaries and Encyclopedias

Supervised learning

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Supervised learning

Look at other dictionaries:

Share the article and excerpts

Direct link