One-shot learning

One-shot learning is an object categorization problem of current research interest in computer vision. Whereas most machine learning based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training images.

The primary focus of this article will be on the solution to this problem presented by L. Fei-Fei, R. Fergus and P. Perona in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol28(4), 2006, which uses a generative object category model and variational Bayesian framework for representation and learning of visual object categories from a handful of training examples. Another paper, presented at the International Conference on Computer Vision and Pattern Recognition (CVPR) 2000 by Erik Miller, Nicholas Matsakis, and Paul Viola will also be discussed.

Motivation

The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans,^[1]^[2] and it is estimated that a child has learned almost of all the 10 ~ 30 thousand object categories in the world by the age of six.^[3] Yet this achievement of the human mind is due not only to its computational power, but also to its ability to synthesize and learn new object classes from existing information about different, previously learned classes. The images below illustrate the idea that given two examples from two different object classes: one, an unknown object composed of familiar shapes, the second, an unknown, amorphous shape; it is much easier for humans to recognize the former than the latter, suggesting that humans make use of this existing knowledge of previously learned classes when learning new ones. Thus the key motivation and intuition for this one-shot learning technique in the artificial, computational world is that systems, like humans, can use prior information of object categories to learn and classify new objects.^[4]^[5]

Background

As with most classification schemes, one-shot learning involves three main challenges: "

Representation: How should we model objects and categories?

Learning: How may we acquire such models?

Recognition: Given a new image, how do we detect the presence of a known object/category amongst clutter, and despite occlusion, viewpoint, and lighting changes?"^[6]

However, one-shot learning differs greatly from single object recognition and even standard category recognition algorithms is in its emphasis on the principle of knowledge transfer, which encapsulates prior knowledge of learnt categories and allows for learning on minimal training examples.

Knowledge transfer by model parameters: One set of algorithms for one-shot learning achieves knowledge transfer through the reuse of model parameters, often exploiting the similarity between previously learned classes and the new object classes to be learned. Classes of objects are first learned on numerous training examples (i.e. not in a one-shot fashion), then new object classes are learned using transformations of model parameters from the previously learnt classes or selection relevant parameters for a classifier as in M. Fink, 2004.^[7]
Knowledge transfer by sharing features: Another class of algorithms achieves knowledge transfer by sharing parts or features of objects across classes. In a paper presented at CVPR 2005 by Bart and Ullman, an algorithm extracts "diagnostic information" in patches from already learnt classes by maximizing the patches' mutual information, and then applies these features to the learning of a new class. A dog class, for example, may be learned in one shot from previous knowledge of horse and cow classes, because dog objects may contain similar distinguishing patches.^[8]
Knowledge transfer by contextual information: Whereas the previous two groups of knowledge transfer work in one-shot learning relied on the similarity between new object classes and the previously learned classes on which they were based, transfer by contextual information instead appeals to global knowledge of the scene in which the object is placed. A paper presented at NIPS 2004 by K. Murphy et al. uses such global information as frequency distributions in a conditional random field framework to recognize objects.^[9] Another algorithm by D. Hoiem et al. makes use of contextual information in the form of camera height and scene geometry to prune object detection.^[10] Algorithms of this type have two advantages. First, they should be able to learn object classes which are relatively dissimilar in visual appearance; and second, they should perform well precisely in situations where an image has not been hand-cropped and carefully aligned, but rather which naturally occur.^[11]

Theory

The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized by a mixture of constellation models.^[12] During the learning phase, the parameters of these models are learned using a conjugate density parameter posterior and Variational Bayesian Expectation-Maximization (VBEM).^[13] It is in this stage that the object classes learned previously outside of the one-shot framework inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of p(object | test, train) to p(background clutter | test, train).^[14]

Bayesian framework

Given the task of finding a particular object in a query image, the overall objective of the Bayesian One-Shot Learning algorithm is to compare the probability that that object is present in the image and the probability that only background clutter is present in the image. If the former probability is higher, the algorithm reports the object's presence in the image, and if the latter probability is higher, the algorithm reports the absence of that object in the image. In order to compute these probabilities, the object class must be modelled from a set of (1 ~ 5) training images containing examples of that object.

To formalize these ideas, let $I$ be the query image, which contains either an example of the foreground category $O f g$ or only background clutter of a generic background category $O b g$ . Also let $I t$ be the set of training images used as the foreground category. The decision of whether $I$ contains an object from the foreground category, or only clutter from the background category is:

$R = \frac{p(O_{fg}|I,I_t)}{p(O_{bg}|I, I_t)} = \frac{p(I|I_t, O_{fg})p(O_{fg})}{p(I|I_t, O_{bg})p(O_{bg})},$

where the class posteriors $p (O f g | I, I t)$ and $p (O b g | I, I t)$ have been expanded by Bayes' Theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image $I$ contains an object from the foreground class iff $R$ exceeds a certain threshold $T$ . We next introduce parametric models for the foreground and background classes with parameters $θ$ and $θ b g$ respectively. This foreground parametric model is learned during the learning stage from training images $I t$ , as well as prior information of learnt classes. The background model we assume to be uniform across images. Omitting the constant ratio of category priors, $\frac{p(O_{fg})}{p(O_{bg})}$ , and parametrizing over $θ$ and $θ b g$ yields:

$R \propto \frac{\int{p(I | \theta, O_{fg}) p(\theta | I_t, O_{fg})} d\theta}{\int{p(I | \theta_{bg}, O_{bg}) p(\theta_{bg} | I_t, O_{bg})} d\theta_{bg}} = \frac{\int{p(I | \theta) p(\theta | I_t, O_{fg})} d\theta}{\int{p(I | \theta_{bg}) p(\theta_{bg} | I_t, O_{bg})} d\theta_{bg}}$ , having simplified

p (I | θ, O f g)

and

p (I | θ, O b g)

p (I | θ f g)

and

p (I | θ b g).

The posterior distribution of model parameters given the training images, $p (θ | I t, O f g)$ is estimated in the learning phase of the algorithm. In this estimation, one-shot learning deviates sharply from more traditional Bayesian estimation models which approximate the integral as $δ(θ M L)$ , in favor of a variational approach which makes use of prior information from previously learnt categories. For the background model, however, as well as the categories learned in advance through numerous training examples, this traditional maximum likelihood estimation of the model parameters is used.^[15]

Object category model

For each query image $I$ and training images $I t$ , a constellation model is used for representation.^[16]^[17]^[18] To obtain this model for a given image $I$ , first a set of N interesting regions is detected in the image using the Kadir brady saliency detector.^[19] Each region selected is represented by a location in the image, $X i$ and a description of its appearance, $A i$ . Letting $X = \sum_{i = 1}^N X_i, A = \sum_{i = 1}^N A_i$ and $X t$ and $A t$ the analogous representations for training images, the expression for R becomes:

$R \propto \frac{\int{p(X,A | \theta, O_{fg}) p(\theta | X_t, A_t, O_{fg})} d\theta}{\int{p(X,A | \theta_{bg}, O_{bg}) p(\theta_{bg} | X_t, A_t, O_{bg})} d\theta_{bg}} = \frac{\int{p(X,A | \theta) p(\theta | X_t, A_t, O_{fg})} d\theta}{\int{p(X,A | \theta_{bg}) p(\theta_{bg} | X_t, A_t, O_{bg})} \,d\theta_{bg}}$

The likelihoods $p (X, A | θ)$ and $p (X, A | θ b g)$ are represented as mixtures of constellation models. A typical constellation model has P(3 ~ 7) parts, but there are N(~100) interest regions. Thus a P-dimensional vector h assigns one region of interest (out of N regions) to each model part (for P parts). Thus h denotes a hypothesis (an assignment of interest regions to model parts) for the model and a full constellation model is represented by summing over all possible hypotheses h in the hypothesis space $H$ . Finally the likelihood is written

$p(X,A|\theta) = \sum_{\omega=1}^{\Omega} \sum_{\textbf{h} \in H} p(X,A,\textbf{h}, \omega | \theta).$

The different $ω$ 's represent different configurations of parts, whereas the different hypotheses h represent different assignations of regions to parts, given a part model $ω$ . The assumption that the shape of the model (as represented by $X$ , the collection of part locations) and appearance are independent allows one to consider the likelihood expression $p(X,A,\textbf{h}, \omega | \theta)$ as two separate likelihoods of appearance and shape.^[20]

Appearance

The appearance of each feature is represented by a point in appearance space (discussed below in implementation). "Each part $p$ in the constellation model has a Gaussian density within this space with mean and precision parameters $\theta_{p,\omega}^{A} = { \mu_{p,\omega}^{A}, \Gamma_{p,\omega}^{A} }$ ." From these the appearance likelihood described above is computed as a product of Gaussians over the model parts for a give hypothesis h and mixture component $ω$ .^[21]

Shape

The shape of the model for a given mixture component $ω$ and hypothesis h is represented as a joint Gaussian density of the locations of features. These features are transformed into a scale and translation-invariant space before modelling the relative location of the parts by a 2(P - 1)-dimensional Gaussian. From this, we obtain the shape likelihood, completing our representation of $p(X,A, \textbf{h}, \omega | \theta)$ . In order to reduce the number of hypotheses in the hypothesis space $H$ , only those hypotheses which satisfy the ordering constraint that the x-coordinate of each part is monotonically increasing are considered. This eliminates $P!$ hypotheses from $H$ .^[22]

Conjugate densities

In order to compute $R$ , the integral $\int{p(X,A | \theta) p(\theta | X_t, A_t, O_{fg})} d\theta$ must be evaluated, but is analytically intractable. The object category model above gives information about $p (X, A | θ)$ , so what remains is to examine $p (θ | X t, A t, O)$ , the posterior of $θ$ , and find a sufficient approximation to render the integral tractable. Previous work approximates the posterior by a $δ$ function centered at $θ *$ , collapsing the integral in question into $p (X, A | θ *)$ . This $θ *$ is normally estimated using a Maximum Likelihood ( $θ * = θ M L$ ) or Maximum A Posteriori ( $θ * = θ M A P$ ) procedure. However, because in one-shot learning, few training examples are used, the distribution will not be well-peaked, as is assumed in a $δ$ function approximation. Thus instead of this traditional approximation, the Bayesian one-shot learning algorithm seeks to "find a parametric form of $p (θ)$ such that the learning of $p (θ | X t, A t, O f g)$ is feasible." The algorithm employs a Normal-Wishart distribution as the conjugate prior of $p (θ | X t, A t, O f g)$ , and in the learning phase, variational Bayesian methods with the same computational complexity as maximum likelihood methods are used to learn the hyperparameters of the distribution. Then, since $p (X, A | θ)$ is a product of Gaussians, as chosen in the object category model, the integral reduces to a multivariate Student's T distribution, which can be evaluated.^[23]

Implementation

Feature detection and representation

To detect features in an image so that it can be represented by a constellation model, the Kadir Brady feature detector is used on grey-scale images, finding salient regions of the image. These regions are then clustered, yielding a number of features (the clusters) and the shape parameter $X$ , composed of the cluster centers. The Kadir Brady detector was chosen because it produces fewer, more salient regions, as opposed to feature detectors like multiscale Harris, which produces numerous, less significant regions. Feature detection is illustrated to the right.

The regions are then taken from the image and rescaled to a small patch of 11 by 11 pixels, allowing each patch to be represented in 121-dimensional space. This dimensionality is reduced using principal component analysis, and $A$ , the appearance parameter, is then formed from the first 10 principal components of each patch.^[24]

Learning

To obtain shape and appearance priors, three categories(spotted cats, faces, and airplanes) are learned using maximum likelihood estimation. These object category model parameters are then used to estimate the hyper-parameters of the desired priors.

Given a set of training examples, the algorithm runs the feature detector on these images, and determines model parameters from the salient regions. The hypothesis index h assigning features to parts prevents a closed-form solution of the linear model, so the posterior $p (θ | X t, A t, O f g)$ is estimated by variational Bayesian expectation-maximization, which is run until parameter convergence after ~ 100 iterations. Learning a category in this fashion takes under a minute on a 2.8 GHz machine with a 4-part model and < 10 training images.^[25]

Experimental results

Motorbike example

To learn the motorbike category:

Six training images are selected from the motorbike category of the Caltech 4 Data Set and the Kadir Brady detector is applied, giving $X t$ and through PCA, $A t$ . Examples are shown below.

Next, the prior model parameters are computed from 30 models $θ t$ , 10 from each of the three learnt categories: spotted cats, faces, and airplanes. This prior encodes the knowledge that "models lacking visual consistency [ie background clutter] occupy a different part of the parameter space [from] coherent models."

In learning, which is performed next, the prior biases the posterior $p (θ | X t, A t, O f g)$ towards parts of the parameter space corresponding to coherent models. Only one mixture component is used, letting $Ω = 1$ . The estimation of the posterior is shown below.

Finally, the figures below show the learned motorbike model with shape and appearance of parts, and the corresponding features.

For recognition tests, the model above is applied to 50 images which contain motorbikes and 50 which do not. The image below shows an ROC curve, measuring the probability of detection over the probability of false detection, as well as some recognized examples.

Comparison with maximum likelihood and MAP methods

As shown in the figure to the right, the Bayesian One-Shot Learning algorithm significantly outperforms a maximum likelihood procedure on a small number of training images.

However, the authors believe that more dramatic improvement could be achieved with more than three initial training categories or a stronger model. Such a model might include 6 or 7 parts, several mixture components, representations for curve contours, or ability to handle occlusions. They determined, however, that a large strength of the model lies in the choice of prior. In all, the algorithm performs with accuracy from 70-95 percent. In addition, a large advantage of this algorithm is that the categories used to set the priors (here, spotted cats, faces, and airplanes) do not need to be similar to the categories to be learned from few training examples, as demonstrated by their success on learning categories from the Caltech101 dataset.^[26]

Learning from one example through shared densities on transforms

An alternative to the Bayesian One-Shot Learning algorithm, the algorithm presented by Erik Miller, Nicholas Matsakis, and Paul Viola at ICCV 2000 uses knowledge transfer by model parameters to learn a new object category which is similar in appearance to previously learnt categories. In their paper, an image is represented as either a texture and shape, or as a latent image which has been transformed, denoted by $I = T (I L)$ .

Congealing

Whereas the term vectorization denotes the process of bring one image into correspondence with another, the authors of this paper have coined the term congealing to be "the simultaneous vectorization of each of a set of images to each other." For a set of training images of a certain category, congealing iteratively transforms each image to minimize the images' joint pixelwise entropies E, where

$E = \sum_{p=1}^{P}H(\nu(p)),$

"where $ν(p)$ is the binary random variable defined by the values of a particular pixel p across all of the images, $H ()$ is the discrete entropy function of that variable, and $1\leq p \leq P$ is the set of pixel indices for the image."

The congealing algorithm begins with a set of images $I i$ and a corresponding transform matrix $U i$ , which at the end of the algorithm will represent the transformation of $I i$ into its latent image $I_{L_i}$ . These latent images $I_{L_i}$ minimize the joint pixel-wise entropies. Thus the task of the congealing algorithm is to estimate the transformations $U i$ .

Sketch of algorithm:

Initialize $U I$ 's to the identity.

Compute the joint pixelwise entropies of the current set of images.

For each image $I i$ , iterate through all possible affine transformations $A$ (rotation, x-translation, y-translation, x-scale, y-scale, x-shear, y-shear) and test if $A U i$ decreases the joint pixelwise entropies. If so, set $U i = A U i$ .

Repeat previous step until convergence.

At the end of the algorithm, $U_i(I) = I_{L_i}$ , and $T = U_i^{-1}$ transforms the latent image back into the originally observed image. Congealing applied to a set of 0's and a set of 2's is shown on the right.^[27]

Classification

To use this model for classification, we must estimate the model with the maximum posterior probability given an observed image $I$ . An application of Bayes' rule to $P (c j | I)$ and parametrization by the transformation $T$ gives a difficult integral which the authors approximate, and then seek the best transform $T$ . That is, the transformation which maps the test image to its latent image. Once this transformation is found, the test image can be transformed into its latent image, and a nearest neighbor classifier based on Hausdorff distance between images is used to classify the latent image (and thus the test image) as belonging to a particular class $c j$ .

To find this optimal $T$ , the authors propose to insert the test image I into the training ensemble for the congealing process. Since we assume that the test image is drawn from one of the classes $c j$ , congealing will provide a corresponding $T_{\text{test}} = U_{\text{test}}^{-1}$ which maps I to its latent image. The latent image can now be classified.^[28]

Single-example classification

Given a set of transformations $B i$ obtained from congealing many images of a certain category, the authors extend their classifier to the case where only one training $I t$ example of a new category $c$ is allowed. Applying all the transformations $B i$ sequentially to $I t$ , we create an artificial data training set for $c$ . This artificial data set can be made larger by borrowing transformations from not only one, but many already known categories. Once this data set is obtained, $I$ , a test instance of $c$ , can be classified as in the normal classification procedure. The key assumption here is that categories are similar enough that the transforms from one can be applied to another.^[29]

Citations

^ F.F. Li et al., 2002
^ S. Thorpe et al., 1996
^ Biederman et al., 1987.
^ L. Fei Fei et al., 2006, Section 1
^ L. Fei-Fei, Knowledge transfer, 2006, Section 1
^ L. Fei-Fei et al., 2006, Section 2
^ M. Fink, 2004
^ Bart and Ullman, 2005
^ K. Murphy et al., 2004
^ D. Hoiem et al., 2005
^ Knowledge Transfer, Section 2
^ Burl et al., 1996.
^ Attias, 1999.
^ L. Fei-Fei et al., 2006
^ L. Fei-Fei et al., 2006, Section 3.1
^ Burl et al., 1996
^ M. Weber et al., 2000
^ R. Fergus et al., 2005
^ T. Kadir and M. Brady, 2001
^ L. Fei-Fei et al., 2006, Section 3.2
^ L. Fei-Fei et al., 2006, Section 3.2.1
^ L. Fei-Fei et al., 2006, Section 3.2.1
^ L. Fei-Fei et al., 2006, Section 3.4.3
^ L. Fei-Fei et al., 2006, Section 5.1
^ L. Fei-Fei et al., 2006, Section 4, Section 5.2
^ L. Fei-Fei et al., 2006, Section 3, Section 6
^ Miller et al., 2000, Section 3
^ Miller et al., 2000, Section 4
^ Miller et al., 2000, Section 7

References

L. Fei-Fei, "Knowledge transfer in learning to recognize visual object classes." International Conference on Development and Learning (ICDL). 2006. PDF
L. Fei-Fei, R. Fergus and P. Perona, "One-Shot learning of object categories". IEEE Trans. Pattern Analysis and Machine Intelligence, Vol28(4), 594 - 611, 2006.PDF
Miller, Matsakis, and Viola, "Learning from One Example through Shared Densities on Transforms". Proc. Computer Vision and Pattern Recognition, 2000.PDF
F.F. Li, R. VanRullen, C.Coch, and P. Perona, "Rapid natural scene categorization in the near absence of attention". PNAS, 99(14):9596-9601, 2002.
S. Thorpe, D. Fize, and C. Marlot, "Speed of processing in the human visual system". Nature, 381:520-522, 1996.
I. Biederman. "Recognition-by-Components: a theory of human understanding". Psychological Review, 94:115-147, 1987.
M. Fink, "Object classification from a single example utilizing class relevance pseudo-metrics". NIPS, 2004.
Bart and Ullman "Cross-generalization: learning novel classes from a single example by feature replacement". CVPR, 2005.
K. Murphy, A. Torralba, W.T. Freeman, "Using the forest to see the trees: a graphical model relating features, objects, and scenes". NIPS, 2004.
D. Hoiem, A.A. Efros, and M. Herbert, "Geometric context from a single image". ICCV, 2005.
H. Attias, "Inferring Parameters and Structure of Latent Variable Models by Variational Bayes". Proc. of the 15th Conf. in Uncertainty in Artificial Intelligence, pp. 21-30, 1999.
M. Burl, M. Weber, and P. Perona, "A Probabilistic Approach to Object Recognition Using Local Photometry and Global Geometry". Proc. European Conf. Computer Vision, pp. 628-641, 1996.
R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning". Proc. Computer Vision and Pattern Recognition, pp. 264-271, 2003.
M. Weber, M. Welling, and P. Perona, "Unsupervised Learning of Models for Recognition". Proc. European Conf. Computer Vision, pp. 101-108, 2000.
T. Kadir and M. Brady, "Scale, Saliency, and Image Description". Int'l J. of Computer Vision, vol. 45, no. 2, pp. 83-105, 2001.

Academic Dictionaries and Encyclopedias

One-shot learning

Contents

Motivation

Background