Gibbs sampling

Gibbs sampling

In statistics and in statistical physics, Gibbs sampling or a Gibbs sampler is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. The purpose of such a sequence is to approximate the joint distribution; to approximate the marginal distribution of one of the variables, or some subset of the variables (for example, the unknown parameters or latent variables); or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled. Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers, and hence produces different results each time it is run), and is an alternative to deterministic algorithms for statistical inference such as variational Bayes or the expectation-maximization algorithm (EM).

Gibbs sampling is an example of a Markov chain Monte Carlo algorithm. The algorithm is named after the physicist J. W. Gibbs, in reference to an analogy between the sampling algorithm and statistical physics. The algorithm was described by brothers Stuart and Donald Geman in 1984, some eight decades after the passing of Gibbs.[1]

In its basic version, Gibbs sampling is a special case of the Metropolis–Hastings algorithm. However, in its extended versions (see below), it can be considered a general framework for sampling from a large set of variables by sampling each variable (or in some cases, each group of variables) in turn, and can incorporate the Metropolis–Hastings algorithm (or similar methods such as slice sampling) to implement one or more of the sampling steps.

Gibbs sampling is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and is easy (or at least, easier) to sample from. The Gibbs sampling algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables. It can be shown (see, for example, Gelman et al. 1995) that the sequence of samples constitutes a Markov chain, and the stationary distribution of that Markov chain is just the sought-after joint distribution.

Gibbs sampling is particularly well-adapted to sampling the posterior distribution of a Bayesian network, since Bayesian networks are typically specified as a collection of conditional distributions.

Contents

Implementation

Gibbs sampling, in its basic incarnation, is a special case of the Metropolis–Hastings algorithm. The point of Gibbs sampling is that given a multivariate distribution it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. Suppose we want to obtain \left.k\right. samples of \mathbf{X} = \{x_1, \dots, x_n\} from a joint distribution \left.p(x_1, \dots, x_n)\right.. Denote the ith sample by \mathbf{X}^{(i)} = \{x_1^{(i)}, \dots, x_n^{(i)}\}. We proceed as follows:

  1. We begin with some initial value \mathbf{X}^{(0)} for each variable.
  2. For each sample i = \{1 \dots k\}, sample each variable x_j^{(i)} from the conditional distribution p(x_j^{(i)}|x_1^{(i)},\dots,x_{j-1}^{(i)},x_{j+1}^{(i-1)},\dots,x_n^{(i-1)}). That is, sample each variable from the distribution of that variable conditioned on all other variables, making use of the most recent values and updating the variable with its new value as soon as it has been sampled.

The samples then approximate the joint distribution of all variables. Furthermore, the marginal distribution of any subset of variables can be approximated by simply examining the samples for that subset of variables, ignoring the rest. In addition, the expected value of any variable can be approximated by averaging over all the samples.

Observed data is incorporated into the sampling process by creating separate variables for each piece of observed data and fixing the variables in question to their observed values, rather than sampling from those variables. Supervised learning can be done using a similar process. It is also easy to incorporate data with missing values, or to do semi-supervised learning, by simply fixing the values of all variables whose values are known, and sampling from the remainder.

For observed data, there will be one variable for each observation — rather than, for example, one variable corresponding to the sample mean or sample variance of a set of observations. In fact, there generally will be no variables at all corresponding to concepts such as "sample mean" or "sample variance". Instead, in such a case there will be variables representing the unknown true mean and true variance, and the determination of sample values for these variables results automatically from the operation of the Gibbs sampler.

  • The initial values of the variables can be determined randomly or by some other algorithm such as expectation-maximization.
  • It is not actually necessary to determine an initial value for the first variable sampled.
  • It is common to ignore some number of samples at the beginning (the so-called burn-in period), and then consider only every nth sample when averaging values to compute an expectation. For example, the first 1,000 samples might be ignored, and then every 100th sample averaged, throwing away all the rest. The reason for this is that (1) successive samples are not independent of each other but form a Markov chain with some amount of correlation; (2) the stationary distribution of the Markov chain is the desired joint distribution over the variables, but it may take awhile for that stationary distribution to be reached. Sometimes, algorithms can be used to determine the amount of autocorrelation between samples and the value of n (the period between samples that are actually used) computed from this, but in practice there is a fair amount of "black magic" involved.
  • The process of simulated annealing is often used to reduce the "random walk" behavior in the early part of the sampling process (i.e. the tendency to move slowly around the sample space, with a high amount of autocorrelation between samples, rather than moving around quickly, as is desired). Other techniques that may reduce autocorrelation are collapsed Gibbs sampling, blocked Gibbs sampling, and ordered overrelaxation; see below.

Relation of conditional distribution and joint distribution

Furthermore, the conditional distribution of one variable given all others is proportional to the joint distribution:

p(x_j|x_1,\dots,x_{j-1},x_{j+1},\dots,x_n) = \frac{p(x_1,\dots,x_n)}{p(x_1,\dots,x_{j-1},x_{j+1},\dots,x_n)} \propto p(x_1,\dots,x_n)

"Proportional to" in this case means that the denominator is not a function of xj and thus is the same for all values of xj; it forms part of the normalization constant for the distribution over xj. In practice, to determine the nature of the conditional distribution of a factor xj, it is easiest to factor the joint distribution according to the individual conditional distributions defined by the graphical model over the variables, ignore all factors that are not functions of xj (all of which, together with the denominator above, constitute the normalization constant), and then reinstate the normalization constant at the end, as necessary. In practice, this means doing one of three things:

  1. If the distribution is discrete, the individual probabilities of all possible values of xj are computed, and then summed to find the normalization constant.
  2. If the distribution is continuous and of a known form, the normalization constant will also be known.
  3. In other cases, the normalization constant can usually be ignored, as most sampling methods do not require it.

Mathematical background

Suppose that a sample \left.X\right. is taken from a distribution depending on a parameter vector \theta \in \Theta \,\! of length \left.d\right., with prior distribution g(\theta_1, \ldots , \theta_d). It may be that \left.d\right. is very large and that numerical integration to find the marginal densities of the \left.\theta_i\right. would be computationally expensive. Then an alternative method of calculating the marginal densities is to create a Markov chain on the space \left.\Theta\right. by repeating these two steps:

  1. Pick a random index 1 \leq j \leq d
  2. Pick a new value for \left.\theta_j\right. according to g(\theta_1, \ldots , \theta_{j-1} , \, \cdot \, , \theta_{j+1} , \ldots , \theta_d )

These steps define a reversible Markov chain with the desired invariant distribution \left.g\right.. This can be proved as follows. Define xjy if \left.x_i = y_i\right. for all i \neq j and let \left.p_{xy}\right. denote the probability of a jump from x \in \Theta to y \in \Theta. Then, the transition probabilities are

p_{xy} = \begin{cases}
\frac{1}{d}\frac{g(y)}{\sum_{z \in \Theta: z \sim_j x} g(z) } & x \sim_j y \\
0 & \text{otherwise}
\end{cases}

So


g(x) p_{xy} = \frac{1}{d}\frac{ g(x) g(y)}{\sum_{z \in \Theta: z \sim_j x} g(z) }
= \frac{1}{d}\frac{ g(y) g(x)}{\sum_{z \in \Theta: z \sim_j y} g(z) }
= g(y) p_{yx}

since xjy is an equivalence relation. Thus the detailed balance equations are satisfied, implying the chain is reversible and it has invariant distribution \left.g\right..

In practice, the suffix \left.j\right. is not chosen at random, and the chain cycles through the suffixes in order. In general this gives a non-stationary Markov process, but each individual step will still be reversible, and the overall process will still have the desired stationary distribution (as long as the chain can access all states under the fixed ordering).

Variations and extensions

  • A blocked Gibbs sampler groups two or more variables together and samples from their joint distribution conditioned on all other variables, rather than sampling from each one individually. For example, in a hidden Markov model, a blocked Gibbs sampler might sample from all the latent variables making up the Markov chain in one go, using the forward-backward algorithm.
  • A collapsed Gibbs sampler integrates out (marginalizes over) one or more variables when sampling for some other variable. For example, imagine that a model consists of three variables A, B, and C. A simple Gibbs sampler would sample from p(A|B,C), then p(B|A,C), then p(C|A,B). A collapsed Gibbs sampler might replace the sampling step for A with a sample taken from the marginal distribution p(A|C), with variable B integrated out in this case. Alternatively, variable B could be collapsed out entirely, alternately sampling from p(A|C) and p(C|A) and not sampling over B at all. The distribution over a variable A that arises when collapsing a parent variable B is called a compound distribution; sampling from this distribution is generally tractable when B is the conjugate prior for A, particularly when A and B are members of the exponential family. For more information, see the article on compound distributions or Liu (1994).[2]
  • A Gibbs sampler with ordered overrelaxation samples a given odd number of candidate values for x_j^{(i)} at any given step and sorts them, along with the single value for x_j^{(i-1)} according to some well-defined ordering. If x_j^{(i-1)} is the sth smallest in the sorted list then the x_j^{(i)} is selected as the sth largest in the sorted list. For more information, see Neal (1995).[3]

The primary purpose of these variations is to reduce autocorrelation between samples (see above). Depending on the particular conditional distributions and the nature of the observed data, this may or may not help. Commonly, blocked or collapsed Gibbs sampling increases the complexity of the sampling process because it induces greater dependencies among variables, but in some cases, it may actually make things easier.

It is also possible to extend Gibbs sampling in various ways. For example, in the case of variables whose conditional distribution is not easy to sample from, a single iteration of slice sampling or the Metropolis-Hastings algorithm can be used to sample from the variables in question. It is also possible to incorporate variables that are not random variables, but whose value is deterministically computed from other variables. Generalized linear models, e.g. logistic regression (aka "maximum entropy models"), can be incorporated in this fashion. (BUGS, for example, allows this type of mixing of models.)

Failure modes

There are two ways that Gibbs sampling can fail. The first is when there are islands of high-probability states, with no paths between them. For example, consider a probability distribution over 2-bit vectors, where the vectors (0,0) and (1,1) each have probability ½, but the other two vectors (0,1) and (1,0) have probability zero. Gibbs sampling will become trapped in one of the two high-probability vectors, and will never reach the other one. More generally, for any distribution over high-dimensional, real-valued vectors, if two particular elements of the vector are perfectly correlated (or perfectly anti-correlated), those two elements will become stuck, and Gibbs sampling will never be able to change them.

The second problem can happen even when all states have nonzero probability and there is only a single island of high-probability states. For example, consider a probability distribution over 100-bit vectors, where the all-zeros vector occurs with probability ½, and all other vectors are equally probable, and so have a probability of \frac{1}{2(2^{100}-1)} each. If you want to estimate the probability of the zero vector, it would be sufficient to take 100 or 1000 samples from the true distribution. That would very likely give an answer very close to ½. But you would probably have to take more than 2100 samples from Gibbs sampling to get the same result. No computer could do this in a lifetime.

This problem occurs no matter how long the burn in period is. This is because in the true distribution, the zero vector occurs half the time, and those occurrences are randomly mixed in with the nonzero vectors. Even a small sample will see both zero and nonzero vectors. But Gibbs sampling will alternate between returning only the zero vector for long periods (about 299 in a row), then only nonzero vectors for long periods (about 299 in a row). Thus convergence to the true distribution is extremely slow, requiring much more than 299 steps; taking this many steps is not computationally feasible in a reasonable time period. The slow convergence here can be seen as a consequence of the curse of dimensionality.

Note that a problem like this can be solved by block sampling the entire 100-bit vector at once. (This assumes that the 100-bit vector is part of a larger set of variables. If this vector is the only thing being sampled, then block sampling is equivalent to not doing Gibbs sampling at all, which by hypothesis would be difficult.)

Software

The WinBUGS software (Bayesian inference Using Gibbs Sampling; the open source version is called OpenBUGS) does a Bayesian analysis of complex statistical models using Markov chain Monte Carlo.

JAGS (Just another Gibbs sampler) is a GPL program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo.

Notes

  1. ^ Geman, S.; Geman, D. (1984). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images". IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6): 721–741. doi:10.1109/TPAMI.1984.4767596. 
  2. ^ Liu, Jun S. (September 1994). "The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem". Journal of the American Statistical Association 89 (427): 958–966. doi:10.2307/2290921. JSTOR 2290921. 
  3. ^ Neal, Radford M. (1995). Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Overrelaxation (Technical report). University of Toronto, Department of Statistics. 9508. 

References

  • Casella, George; George, Edward I. (1992). "Explaining the Gibbs sampler". The American Statistician 46 (3): 167–174. doi:10.2307/2685208. JSTOR 2685208.  (Contains a basic summary and many references.)
  • Gelfand, Alan E.; Smith, Adrian F. M. (1990). "Sampling-Based Approaches to Calculating Marginal Densities". Journal of the American Statistical Association 85 (410): 398–409. doi:10.2307/2289776. JSTOR 2289776. MR1141740. 
  • Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. London: Chapman and Hall. First edition, 1995. (See Chapter 11.)
  • C.P. Robert and G. Casella. "Monte Carlo Statistical Methods" (second edition). New York: Springer-Verlag, 2004.
  • David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. "Markov Chains and Mixing Times". http://www.uoregon.edu/~dlevin/MARKOV/
  • Bolstad, William M. (2010) Understanding Computational Bayesian Statistics, John Wiley ISBN 0-470-04609-8
  • Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8. 

External links


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Gibbs-Sampling — ist ein Algorithmus, um eine Folge von Stichproben der gemeinsamen Wahrscheinlichkeitsverteilung zweier oder mehrerer Zufallsvariablen zu erzeugen. Das Ziel ist es dabei, die unbekannte gemeinsame Verteilung zu approximieren. Der Algorithmus ist… …   Deutsch Wikipedia

  • Gibbs — may refer to:People*Cecil Armstrong Gibbs, composer *Cory Gibbs, soccer player *Frederic A. Gibbs, neurologist *George Gibbs (mineralogist), (1776 1833) *George Gibbs (geologist), (1815 1873) *Herschelle Gibbs, South African cricketer *Humphrey… …   Wikipedia

  • Gibbs — ist der Familienname folgender Personen: Addison Crandall Gibbs (1825–1886), US amerikanischer Politiker Alan Gibbs, Gründer des britischen Fahrzeugunternehmens Gibbs Technologies Barry Gibbs (* 1948), kanadischer Eishockeyspieler Bob Gibbs (*… …   Deutsch Wikipedia

  • Gibbs measure — In mathematics, the Gibbs measure, named after Josiah Willard Gibbs, is a probability measure frequently seen in many problems of probability theory and statistical mechanics. It is the measure associated with the Boltzmann distribution, and… …   Wikipedia

  • Josiah Willard Gibbs — Infobox Scientist box width = 300px name = J. Willard Gibbs image size = 300px caption = Josiah Willard Gibbs birth date = birth date|1839|2|11|mf=y birth place = New Haven, Connecticut, USA death date = death date and… …   Wikipedia

  • Metropolis-Sampling — Der Metropolisalgorithmus ist eine Monte Carlo Methode zur Erzeugung von Zuständen eines Systems entsprechend der Boltzmann Verteilung. Inhaltsverzeichnis 1 Algorithmus 1.1 Verallgemeinerung 2 Anwendungen 2.1 Monte Carlo Simulation …   Deutsch Wikipedia

  • Charles Lawrence (mathematician) — Charles Lawrence Nationality American Fields Bioinformatics …   Wikipedia

  • Markov chain Monte Carlo — MCMC redirects here. For the organization, see Malaysian Communications and Multimedia Commission. Markov chain Monte Carlo (MCMC) methods (which include random walk Monte Carlo methods) are a class of algorithms for sampling from probability… …   Wikipedia

  • Monte Carlo method — Not to be confused with Monte Carlo algorithm. Computational physics …   Wikipedia

  • List of mathematics articles (G) — NOTOC G G₂ G delta space G networks Gδ set G structure G test G127 G2 manifold G2 structure Gabor atom Gabor filter Gabor transform Gabor Wigner transform Gabow s algorithm Gabriel graph Gabriel s Horn Gain graph Gain group Galerkin method… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”