Boltzmann machine

Boltzmann machine

A Boltzmann machine is the name given to a type of stochastic recurrent neural network by Geoffrey Hinton and Terry Sejnowski. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatoric problems. However, due to a number of issues discussed below, Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the learning can be made efficient enough to be useful for practical problems.

tructure

A Boltzmann machine, like a Hopfield Network, is a network of units with an "energy" defined for the network. It also has units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, E, in a Boltzmann machine is identical in form to that of a Hopfield network:

:E = -sum_{i

Where:
* w_{ij} is the connection strength between unit j and unit i.
* s_i is the state, s_i in {0,1}, of unit i.
* heta_i is the threshold of unit i.

The connections in a Boltzmann machine have two restrictions:
* w_{ii}=0 qquad forall i. (No unit has a connection with itself.)
* w_{ij}=w_{ji}qquad forall i,j. (All connections are symmetric.)

Thus, the difference in the global energy that results from a single unit i being 0 versus 1, written Delta E_i, is given by:

:Delta E_i = sum_j w_{ij} , s_j - heta_i

A Boltzmann machine is made up of stochastic units. The probability, p_i of the i-th unit being on is given by:

:p_i = frac{1}{1+exp (-frac{1}{T} Delta E_i)}

where the scalar T is referred to as the temperature of the system.

The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.

If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies. This is done by the following training procedure.

Training

The units in the Boltzmann Machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted P^{+}(V).

On the Boltzmann Machine side, as recalled, the distribution over the global states is converging as we reach a thermal equilibrium. We denote the converged distribution, after we marginalize it over the visible units V, as P^{-}(V).

Our goal is to approximate the "real" distribution P^{+}(V) using the P^{-}(V) which will be produced (eventually) by the machine. To measure how similar the two distributions are we use the Kullback-Leibler distance, G:

:G = sum_{v}{P^{+}(v)ln{frac{P^{+}(v)}{P^{-}(v)}

Where the sum is over all the possible states of V. G is a function of the weights, since they determine the energy of a state, and the energy determines P^{-}(v), as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over G, so a given weight, w_{ij} is changed by subtracting the partial derivative of G with respect to the weight.

There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to P^{+}). The other is the "negative" phase where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, w_{ij}, is given by the very simple equation (proved in Ackley et al.):

:frac{partial{G{partial{w_{ij} = frac{1}{T} [p_{ij}^{+}-p_{ij}^{-}]

Where:
* p_{ij}^{+} is the probability of units "i" and "j" both being on when the machine is at equilibrium on the positive phase.

* p_{ij}^{-} is the probability of units "i" and "j" both being on when the machine is at equilibrium on the negative phase.

This result follows from the fact that at the thermal equilibrium the probability P^{-}(s) of any global state s when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").

Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.

The training of a Boltzmann machines can also be viewed as an application of the EM algorithm, which is heavily used in machine learning. We are given an incomplete dataset (we only know the values of the visible units). At the positive phase, we estimate the completed data based on the current parameters of the system (the weights). Later, the weights are updated to maximize the probability of the network producing the completed data.

Training the biases is similar, but uses only single node activity:

:frac{partial{G{partial{ heta_{i} = -frac{1}{T} [p_{i}^{+}-p_{i}^{-}]

Problems

If it worked, the Boltzmann machine would be a rather general computational medium. For instance, if trained on photographs, the machine would model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the learning seems to stop working correctly when the machine is scaled up to anything larger than a trivial machine. This is due to a number of effects, the most important of which are:

* the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
* connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.

Restricted Boltzmann Machine

Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "Restricted Boltzmann Machine" or "RBM" which does not allow connections between hidden units. After learning one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBM's makes it possible to learn many layers of hidden units efficiently and as each new layer is added the overall generative model gets better.

History

The Boltzmann machine is a Monte Carlo version of the Hopfield network.

The idea of using annealed Ising models for inference is often thought to have originated from Geoff Hinton, in the following papers:

* Geoffrey E. Hinton and Terrence J. Sejnowski, Analyzing Cooperative Computation. In Proceedings of the 5th Annual Congress of the Cognitive Science Society, Rochester, NY, May 1983.

* Geoffrey E. Hinton and Terrence J. Sejnowski, Optimal Perceptual Inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pages 448-453, IEEE Computer Society, Washington DC, June 1983.

However the same idea of applying the Ising model with annealed Gibbs sampling is also present in Douglas Hofstadter's Copycat project, described in the following two papers:

* Hofstadter, Douglas R., The Copycat Project: An Experiment in Nondeterminism and Creative Analogies. MIT Artificial Intelligence Laboratory Memo No. 755, January 1984.

* Hofstadter, Douglas R., A Non-Deterministic Approach to Analogy, Involving the Ising Model of Ferromagnetism. In E. Caianiello, ed. The Physics of Cognitive Processes. Teaneck, NJ: World Scientific, 1987.

It seems probable that these works were conducted independently, reflecting general interest in the ideas by a wide community at the time. Hinton went on to build and obtain funding for a large community around the ideas, whereas Hofstadter is funded mostly as an individual to continue his version in the Copycat family of models. For these reasons the Hinton terminology has become standard for a larger group of researchers.

Ising models are now considered to be a special case of Markov random fields, which do find widespread application in various fields, including linguistics, robotics and AI.

References

* Ackley, D. H., Hinton, G. E., Sejnowski, T. J. (1985), "A Learning Algorithm for Boltzmann Machines", "Cognitive Science", 9: 147-169.

* Hinton, G. E., Sejnowski, T. J. (1986), "Learning and Relearning in Boltzmann Machines". In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, "Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations", pp 282-317. Cambridge: MIT Press.

* Hinton, G. E.(2002), "Training Products of Experts by Minimizing Contrastive Divergence", "Neural Computation", 14: 1771-1800.

* Hinton, G. E., Osindero, S., Teh, Y. (2006), "A fast learning algorithm for deep belief nets", "Neural Computation", 18: 1527-1554. [http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf]

External links

* [http://www.scholarpedia.org/article/Boltzmann_Machine Scholarpedia article by Hinton about Boltzmann machines]
* [http://youtube.com/watch?v=AyzOUbkUf3M Speech at Google by Geoffrey Hinton]


Wikimedia Foundation. 2010.

Игры ⚽ Нужно решить контрольную?

Look at other dictionaries:

  • Ludwig Boltzmann — Infobox Scientist name = Ludwig Boltzmann |225px image width = 225px caption = Ludwig Eduard Boltzmann (1844 1906) birth date = birth date|1844|2|20|mf=y birth place= Vienna, Austrian Empire death date = death date and age|1906|9|5|1844|2|20|mf=y …   Wikipedia

  • Máquina de Boltzmann — Una máquina de Boltzmann es un tipo de red neuronal recurrente estocástica. El nombre le fue dado por los investigadores Geoffrey Hinton y Terry Sejnowski. Las máquinas de Boltzmann pueden considerarse como la contrapartida estocástica y… …   Wikipedia Español

  • Artificial Neural Network — Réseau de neurones Pour les articles homonymes, voir Réseau. Vue simplifiée d un réseau artificiel de neurones Un réseau de neurones artificiel est un modèle de c …   Wikipédia en Français

  • Neuronal network — Réseau de neurones Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

  • Reseau de neurones — Réseau de neurones Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

  • Réseau de neurone — Réseau de neurones Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

  • Réseau de neurones — Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

  • Réseau neuronal — Réseau de neurones Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

  • Réseau neuronique — Réseau de neurones Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

  • Réseaux de neurones — Réseau de neurones Pour les articles homonymes, voir Réseau. Neurosciences …   Wikipédia en Français

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”