Expectation-maximization algorithm
An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated.

History
The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin in the Journal of the Royal Statistical Society (Arthur Dempster, Nan Laird, and Donald Rubin. "[http://links.jstor.org/sici?sici=0035-9246%281977%2939%3A1%3C1%3AMLFIDV%3E2.0.CO%3B2-Z Maximum likelihood from incomplete data via the EM algorithm]". Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977). They pointed out that the method had been "proposed many times in special circumstances" by other authors, but the 1977 paper generalized the method and developed the theory behind it.

Applications
EM is frequently used for data clustering in machine learning and computer vision. In natural language processing, two prominent instances of the algorithm are the Baum-Welch algorithm (also known as "forward-backward") and the inside-outside algorithm for unsupervised induction of probabilistic context-free grammars.
In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response theory models. With its ability to handle missing data and unobserved latent variables, EM is also becoming a useful tool for pricing and managing the risk of a portfolio.
The EM algorithm (and its faster variant, ordered subset expectation maximization, OS-EM) is also widely used in medical image reconstruction, especially in Positron Emission Tomography and Single Photon Emission Computed Tomography. See below for other faster variants of EM.

Demonstrations and activities
Various 1D, 2D and 3D [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture demonstrations of EM together with Mixture Modeling] are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.

A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such as those utilising conjugate gradient and modified Newton-Raphson techniques. Additionally, EM can be used with constrained estimation techniques.

EM in a nutshell
EM is an iterative technique for estimating the value of some unknown quantity, given the values of some correlated, known quantity. The approach is to assume that the unknown quantity is described by a parameterized probability distribution (a popular application is a mixture of Gaussians, hence the example below). The EM procedure is then:
#Initialize the distribution parameters
#Repeat until "convergence":
##E-Step: estimate the [E]xpected value of the unknown variables, given the current parameter estimate
##M-Step: re-estimate the distribution parameters to [M]aximize the likelihood of the data, given the expected estimates of the unknown variables

Steps 1 and 2 are loaded: what they involve depends on which distribution you choose, how many parameters there are, and how complicated the missing values are. Two other caveats: first, the choice of initial parameters technically does not matter, but in practice a poor choice can lead to a bad estimate. Second, convergence, though guaranteed, may take a long time to reach. In practice, if the values of the missing variables and parameters do not change significantly between two iterations, the algorithm terminates; a generic sketch of this loop is given below.
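The loop itself can be written down generically. The following Python sketch is only an illustration of the recipe above, not a library routine: `e_step`, `m_step`, and the initial `params` are hypothetical placeholders that a concrete model would have to supply, and the convergence test is the practical "parameters stopped changing" criterion just described.

```python
import numpy as np

# A minimal sketch of the generic EM loop; `e_step`, `m_step`, and the initial
# `params` are hypothetical, model-specific callbacks, not library functions.
def expectation_maximization(data, params, e_step, m_step, tol=1e-6, max_iter=100):
    for _ in range(max_iter):
        expectations = e_step(data, params)       # E-step: expected values of the unknowns
        new_params = m_step(data, expectations)   # M-step: maximize the expected likelihood
        if np.allclose(new_params, params, atol=tol):
            break                                 # parameters stopped changing: "convergence"
        params = new_params
    return params
```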
A simple application is filling in missing values in a column of a database. Assume you know about 50% of the values in a column, but the remaining values are corrupt or missing. For simplicity, assume the data are normally distributed with unit variance, so the only parameter that must be estimated is the mean. Conveniently, the E-step is then simple: the expected value of each missing entry is the current mean. But filling in those values changes the overall mean of the data, so the estimate can be improved, and this continues for a few iterations. An example:
Initialization: Data: [4, 10, ?, ?]; initial mean: 0; filled data: [4, 10, 0, 0]
Step 1: new mean: 3.5; new data: [4, 10, 3.5, 3.5]
Step 2: new mean: 5.25; new data: [4, 10, 5.25, 5.25]
Step 3: new mean: 6.125; new data: [4, 10, 6.125, 6.125]
Step 4: new mean: 6.5625; new data: [4, 10, 6.5625, 6.5625]
Step 5: new mean: 6.78125; new data: [4, 10, 6.78125, 6.78125]
Result: new mean: 6.890625
From here you can see the value is slowly converging toward 7. For this simple model (a univariate normal distribution with unit variance), it can be seen that substituting the average of the known values is the best answer. For more complex models, there is no easy way to find the best answer, and the EM algorithm is a very popular approach for estimating the answer.
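For this toy model the two steps are one line each. The following Python sketch simply re-runs the iteration shown above; the observed values 4 and 10, the two missing entries, and the initial mean of 0 are taken from the example, everything else is incidental.

```python
import numpy as np

# Re-runs the toy iteration above: a unit-variance normal with observed values
# 4 and 10, two missing values, and an initial mean of 0.
observed = np.array([4.0, 10.0])
mean = 0.0
for step in range(1, 7):
    filled = np.concatenate([observed, [mean, mean]])  # E-step: missing values <- current mean
    mean = filled.mean()                               # M-step: re-estimate the mean
    print(f"Step {step}: new mean = {mean}")
# Prints 3.5, 5.25, 6.125, 6.5625, 6.78125, 6.890625, approaching 7 (the mean of 4 and 10).
```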
In the mixture of Gaussians demonstration below, we have a collection of multi-dimensional objects. We assume that each individual data object was generated from one of K Gaussians. If we knew which Gaussian each object came from, then estimating the parameters would be easy (using maximum likelihood estimation techniques). Instead, we must first compute, for each object, the expected probability that it came from each Gaussian (E-step), and then re-estimate the parameters given these expected assignments (M-step).

Specification of the EM procedure
Let y denote the incomplete data, consisting of values of observable variables, and let z denote the missing data. Together, y and z form the complete data. z can be either actual missing measurements or a hidden variable that would make the problem easier if its value were known. For instance, in mixture models, the likelihood formula would be much more convenient if the mixture components that "generated" the samples were known (see the example below).

Let p(y, z | θ) be the joint probability density function (continuous case) or probability mass function (discrete case) of the complete data, with parameters given by the vector θ. This function can also be considered as the complete-data likelihood, that is, it can be thought of as a function of θ. Further, note that the conditional distribution of the missing data given the observed data can be expressed as

: p(z \mid y, \theta) = \frac{p(y, z \mid \theta)}{p(y \mid \theta)} = \frac{p(y \mid z, \theta)\, p(z \mid \theta)}{\int p(y \mid \hat{z}, \theta)\, p(\hat{z} \mid \theta)\, d\hat{z}}

by using Bayes' rule and the law of total probability. (This formulation only requires knowledge of the observation likelihood given the unobservable data, p(y | z, θ), and the probability of the unobservable data, p(z | θ).)

An EM algorithm iteratively improves an initial estimate θ_0 by constructing new estimates θ_1, θ_2, and so on. An individual re-estimation step that derives θ_{n+1} from θ_n takes the following form:

: \theta_{n+1} = \arg\max_{\theta} Q(\theta)

where Q(θ) is the expected value of the log-likelihood. In other words, we do not know the complete data, so we cannot say what the exact value of the likelihood is, but given the data that we do know (the y's), we can find a posteriori estimates of the probabilities for the various values of the unknown z's. For each set of z's there is a likelihood value for θ, and we can thus calculate an expected value of the likelihood with the given values of the y's (which depends on the previously assumed value θ_n, because this influences the probabilities of the z's). Q is given by

: Q(\theta) = \sum_{z} p(z \mid y, \theta_n) \log L(\theta; y, z)

or more generally

: Q(\theta) = \mathrm{E}_{z}\!\left[ \log L(\theta; y, z) \mid y \right]

where it is understood that this denotes the conditional expectation of log L(θ; y, z), taken with the θ used in the conditional distribution of z fixed at θ_n. (The log of the likelihood is often used instead of the true likelihood because it leads to easier formulas, but it still attains its maximum at the same point as the likelihood.)

In other words, θ_{n+1} is the value that maximizes (M) the conditional expectation (E) of the complete-data log-likelihood given the observed variables under the previous parameter value. The expectation in the continuous case would be given by

: Q(\theta) = \int_{-\infty}^{\infty} p(z \mid y, \theta_n) \log L(\theta; y, z) \, dz
Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function "Q". Once the parameters of "Q" are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
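As a concrete illustration of the two steps just described, here is a Python sketch of a single re-estimation step for a model with one discrete latent variable z ∈ {0, …, K−1} and a single observation y. The callbacks log_p_complete (returning log p(y, z | θ)) and posterior_z (returning the vector of probabilities p(z | y, θ_n)) are hypothetical stand-ins for a concrete model, not library functions; for a data set, the same weighted sum would simply be accumulated over observations.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of one re-estimation step theta_{n+1} = argmax_theta Q(theta).
# `log_p_complete(y, z, theta)` and `posterior_z(y, theta_n)` are hypothetical,
# model-specific callbacks.
def em_step(y, theta_n, log_p_complete, posterior_z, K):
    w = posterior_z(y, theta_n)        # the fixed, data-dependent weights of Q
    def neg_Q(theta):                  # Q(theta) = sum_z p(z | y, theta_n) log p(y, z | theta)
        return -sum(w[z] * log_p_complete(y, z, theta) for z in range(K))
    result = minimize(neg_Q, x0=np.atleast_1d(theta_n))   # M-step: maximize Q over theta
    return result.x
```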
The origin of the name comes from the fact that in the paper by Dempster, Laird and Rubin, they first discuss a less general problem in which the probability distribution is of the exponential family, and in that case the so-called E step consists of finding the expected values of certain sufficient statistics of the complete data.

Properties
Part of the reason for the popularity of EM algorithms is that, as can be shown, an EM iteration does not decrease the observed-data likelihood function. However, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm will converge to a local maximum (or saddle point) of the observed-data likelihood function, depending on the starting values. There are a variety of heuristic approaches for escaping such a local maximum, such as trying several different random initial estimates θ_0, or applying simulated annealing.
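The simplest of these heuristics can be sketched in a few lines of Python. Here run_em, log_likelihood and init_sampler are hypothetical stand-ins for a concrete EM implementation and model (for example, the Gaussian-mixture sketches later in this article); the idea is just to keep the run whose observed-data log-likelihood is highest.

```python
import numpy as np

# Run EM from several random starting points and keep the best fit.
# `run_em`, `log_likelihood`, and `init_sampler` are hypothetical callbacks.
def best_of_random_restarts(data, run_em, log_likelihood, init_sampler,
                            n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_restarts):
        params = run_em(data, init_sampler(rng))   # EM from a random initial estimate
        ll = log_likelihood(data, params)          # observed-data log-likelihood
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params
```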
EM is particularly useful when maximum likelihood estimation of a complete-data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.

"Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm which is used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models. Another example is the EM algorithm for fitting a mixture density model.

An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step rather than maximum likelihood estimation.

There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

Incremental versions
The classic EM procedure is to replace both "Q" and "θ" with their optimal possible (argmax) values at each iteration. However, it can be shown (see Neal & Hinton, 1999) that simply finding "Q" and "θ" to give "some" improvement over their current values will also ensure successful convergence.
For example, to improve "Q", we could restrict the space of possible functions to a computationally simple family such as the factorial distributions

: Q = \prod_i Q_i
Thus at each E step we compute the variational approximation of "Q".
To improve "θ", we could use any hill-climbing method, and not worry about finding the optimal "θ", just some improvement. This method is also known as "Generalized EM" (GEM).
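The "θ" half of this idea can be sketched in Python as a single hill-climbing step. The grad_Q callback, returning the gradient of Q(θ | θ_n) with respect to θ, is a hypothetical model-specific function; the point is only that one gradient-ascent step improves Q rather than maximizing it.

```python
import numpy as np

# Sketch of a Generalized EM (GEM) update: instead of solving argmax_theta Q,
# take one gradient-ascent step that merely *improves* Q.
# `grad_Q(theta, theta_n)` is a hypothetical callback returning d Q(theta|theta_n) / d theta.
def gem_step(theta_n, grad_Q, step_size=1e-2):
    theta = np.asarray(theta_n, dtype=float)
    return theta + step_size * grad_Q(theta, theta_n)   # "some" improvement is enough
```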
Relation to variational Bayes methods
EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a
probability distribution over the latent variables (in the Bayesian style) together with a point estimate for "θ" (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over "θ" as well as the latent variables. In fact the Bayesian approach to inference is simply to treat "θ" as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including "θ") and optimize them one at a time. There are now "k" steps per iteration, where "k" is the number of latent variables. For graphical models this is easy to do as each variable's new "Q" depends only on its Markov blanket, so local message passing can be used for efficient inference.

Example: Gaussian Mixture
Assume that a sample of m vectors (or scalars) x_1, …, x_m, where each x_i ∈ R^d, is drawn from a mixture of n Gaussian distributions. Let z_i ∈ {1, …, n} denote which Gaussian x_i is from, and let π_j = P(z = j) denote the probability of drawing from the j-th Gaussian. The density of a point x drawn from the j-th d-dimensional Gaussian, with mean μ_j and covariance Σ_j, is

: p(x \mid z = j, \theta) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \right)

Our task is to estimate the unknown parameters θ = {μ_j, Σ_j, π_j}, that is, the mean and covariance of each Gaussian and the probability of each Gaussian being the one a given point is drawn from. (Actually it is not clear that we should allow the covariances to take arbitrary values, because the maximum likelihood may then be unbounded: centering a particular Gaussian on a particular data point and shrinking its covariance toward zero drives the likelihood toward infinity!)
E-step
Estimation of the unobserved z's (which Gaussian each point came from), conditioned on the observations and using the parameter values θ_t from the last maximization step:

: w_{i,j} \equiv P(z_i = j \mid x_i, \theta_t) = \frac{\pi_j\, p(x_i \mid z_i = j, \theta_t)}{\sum_{k=1}^{n} \pi_k\, p(x_i \mid z_i = k, \theta_t)}
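Assuming the data are stored as an (m, d) NumPy array X and the current parameters as means mus, covariances Sigmas and mixing probabilities pis (illustrative names, not from the text), this E-step can be sketched as:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the E-step above: w[i, j] = P(z_i = j | x_i, theta_t).
def e_step(X, mus, Sigmas, pis):
    m, n = X.shape[0], len(pis)
    w = np.empty((m, n))
    for j in range(n):
        # numerator: pi_j * p(x_i | z_i = j, theta_t) for every row of X
        w[:, j] = pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
    w /= w.sum(axis=1, keepdims=True)   # normalize over components (Bayes' rule)
    return w
```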
M-step
We now want to maximize the expected log-likelihood of the joint event:

: Q(\theta) = \mathrm{E}_{z \mid x, \theta_t}\!\left[ \log \prod_{i=1}^{m} p(x_i, z_i \mid \theta) \right] = \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j}\, \log p(x_i, z_i = j \mid \theta)

If we expand the probability of the joint event, we get

: Q(\theta) = \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} \left[ \log \pi_j + \log p(x_i \mid z_i = j, \theta) \right]

We have the constraint

: \sum_{j=1}^{n} \pi_j = 1

If we add a Lagrange multiplier λ for this constraint and expand the Gaussian pdf, we get

: \mathcal{L}(\theta) = \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} \left[ \log \pi_j - \tfrac{1}{2} \log |\Sigma_j| - \tfrac{1}{2} (x_i - \mu_j)^\top \Sigma_j^{-1} (x_i - \mu_j) - \tfrac{d}{2} \log(2\pi) \right] - \lambda \left( \sum_{j=1}^{n} \pi_j - 1 \right)
To find the new estimate θ_{t+1}, we set the partial derivatives of the Lagrangian with respect to the parameters to zero.
New estimate for the mean (using some differentiation rules from matrix calculus):

: \mu_j^{(t+1)} = \frac{\sum_{i=1}^{m} w_{i,j}\, x_i}{\sum_{i=1}^{m} w_{i,j}}

New estimate for the covariance:

: \Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{m} w_{i,j}\, \bigl(x_i - \mu_j^{(t+1)}\bigr)\bigl(x_i - \mu_j^{(t+1)}\bigr)^{\!\top}}{\sum_{i=1}^{m} w_{i,j}}

New estimate for the class probability (setting the derivative with respect to π_j to zero):

: \pi_j = \frac{1}{\lambda} \sum_{i=1}^{m} w_{i,j}

Inserting this into the constraint:

: \sum_{j=1}^{n} \frac{1}{\lambda} \sum_{i=1}^{m} w_{i,j} = 1 \quad\Longrightarrow\quad \lambda = \sum_{i=1}^{m} \sum_{j=1}^{n} w_{i,j} = m

Inserting λ = m into our estimate:

: \pi_j^{(t+1)} = \frac{1}{m} \sum_{i=1}^{m} w_{i,j}
These estimates now become our θ_{t+1}, to be used in the next estimation (E) step.
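Given the responsibilities w[i, j] from the E-step sketch earlier, the closed-form updates derived above can be sketched as follows (again with illustrative variable names). Alternating the two sketches until the parameters stop changing gives a complete, if bare-bones, EM fit of the mixture.

```python
import numpy as np

# Sketch of the M-step updates derived above, given responsibilities w[i, j].
def m_step(X, w):
    m, d = X.shape
    Nj = w.sum(axis=0)                 # effective number of points per component
    pis = Nj / m                       # new class probabilities pi_j
    mus = (w.T @ X) / Nj[:, None]      # new means mu_j
    Sigmas = []
    for j in range(w.shape[1]):
        diff = X - mus[j]
        Sigmas.append((w[:, j, None] * diff).T @ diff / Nj[j])   # new covariances Sigma_j
    return mus, Sigmas, pis
```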
References
* Robert Hogg, Joseph McKean and Allen Craig. "Introduction to Mathematical Statistics". pp. 359-364. Upper Saddle River, NJ: Pearson Prentice Hall, 2005.
* Radford Neal, Geoffrey Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants". In Michael I. Jordan (editor), "Learning in Graphical Models", pp. 355–368. Cambridge, MA: MIT Press, 1999.
* [http://www.inference.phy.cam.ac.uk/mackay/itila/ The on-line textbook: Information Theory, Inference, and Learning Algorithms], by David J.C. MacKay, includes simple examples of the E-M algorithm such as clustering using the soft K-means algorithm, and emphasizes the variational view of the E-M algorithm.
* [http://citeseer.ist.psu.edu/bilmes98gentle.html A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models] , by [http://ssli.ee.washington.edu/people/bilmes/ Jeff Bilmes] includes a simplified derivation of the EM equations for Gaussian Mixtures and Gaussian Mixture Hidden Markov Models.
* [http://www.cse.buffalo.edu/faculty/mbeal/thesis/index.html Variational Algorithms for Approximate Bayesian Inference] , by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs.
* [http://www.cc.gatech.edu/~dellaert/em-paper.pdf The Expectation Maximization Algorithm] , by Frank Dellaert, gives an easier explanation of EM algorithm in terms of lowerbound maximization.
* [http://www.seanborman.com/publications/EM_algorithm.pdf The Expectation Maximization Algorithm: A short tutorial] , A self contained derivation of the EM Algorithm by Sean Borman.
* [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture SOCR demonstrations of EM and Mixture Modeling]
* Jamshidian, Mortaza and Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by Using Quasi-Newton Methods", Journal of the Royal Statistical Society, Ser. B, 59, 569–587.

External links
* [http://lcn.epfl.ch/tutorial/english/mixtureModel/index.html Java code example]
* [http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html Another Java code example]

See also
* Estimation theory
* Constrained clustering
* Data clustering
* K-means algorithm
* Imputation (statistics)
* SOCR
* Ordered subset expectation maximization
* Baum-Welch algorithm, a particular case
* Latent variable model
* Hidden Markov model
* Structural equation model
* Rubin causal model