Where:
* is the connection strength between unit and unit .
* is the state, , of unit .
* is the threshold of unit .
The connections in a Boltzmann machine have two restrictions:
* . (No unit has a connection with itself.)
* . (All connections are symmetric.)
Thus, the difference in the global energy that results from a single unit being 0 versus 1, written , is given by:
:
A Boltzmann machine is made up of stochastic units. The probability, of the -th unit being on is given by:
:
where the scalar is referred to as the temperature of the system.
The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy, according to a Boltzmann distribution. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies. This is done by the following training procedure.
Training
The units in the Boltzmann Machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted .
On the Boltzmann Machine side, as recalled, the distribution over the global states is converging as we reach a thermal equilibrium. We denote the converged distribution, after we marginalize it over the visible units , as .
Our goal is to approximate the "real" distribution using the which will be produced (eventually) by the machine. To measure how similar the two distributions are we use the Kullback-Leibler distance, :
:
Where the sum is over all the possible states of . is a function of the weights, since they determine the energy of a state, and the energy determines , as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over , so a given weight, is changed by subtracting the partial derivative of with respect to the weight.
There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to ). The other is the "negative" phase where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, , is given by the very simple equation (proved in Ackley et al.):
:
Where:
* is the probability of units "i" and "j" both being on when the machine is at equilibrium on the positive phase.
* is the probability of units "i" and "j" both being on when the machine is at equilibrium on the negative phase.
This result follows from the fact that at the thermal equilibrium the probability of any global state when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machines can also be viewed as an application of the EM algorithm, which is heavily used in machine learning. We are given an incomplete dataset (we only know the values of the visible units). At the positive phase, we estimate the completed data based on the current parameters of the system (the weights). Later, the weights are updated to maximize the probability of the network producing the completed data.
Training the biases is similar, but uses only single node activity:
:
Problems
If it worked, the Boltzmann machine would be a rather general computational medium. For instance, if trained on photographs, the machine would model the distribution of photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the learning seems to stop working correctly when the machine is scaled up to anything larger than a trivial machine. This is due to a number of effects, the most important of which are:
* the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
* connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.
Restricted Boltzmann Machine
Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "Restricted Boltzmann Machine" or "RBM" which does not allow connections between hidden units. After learning one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBM's makes it possible to learn many layers of hidden units efficiently and as each new layer is added the overall generative model gets better.
History
The Boltzmann machine is a Monte Carlo version of the Hopfield network.
The idea of using annealed Ising models for inference is often thought to have originated from Geoff Hinton, in the following papers:
* Geoffrey E. Hinton and Terrence J. Sejnowski, Analyzing Cooperative Computation. In Proceedings of the 5th Annual Congress of the Cognitive Science Society, Rochester, NY, May 1983.
* Geoffrey E. Hinton and Terrence J. Sejnowski, Optimal Perceptual Inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pages 448-453, IEEE Computer Society, Washington DC, June 1983.
However the same idea of applying the Ising model with annealed Gibbs sampling is also present in Douglas Hofstadter's Copycat project, described in the following two papers:
* Hofstadter, Douglas R., The Copycat Project: An Experiment in Nondeterminism and Creative Analogies. MIT Artificial Intelligence Laboratory Memo No. 755, January 1984.
* Hofstadter, Douglas R., A Non-Deterministic Approach to Analogy, Involving the Ising Model of Ferromagnetism. In E. Caianiello, ed. The Physics of Cognitive Processes. Teaneck, NJ: World Scientific, 1987.
It seems probable that these works were conducted independently, reflecting general interest in the ideas by a wide community at the time. Hinton went on to build and obtain funding for a large community around the ideas, whereas Hofstadter is funded mostly as an individual to continue his version in the Copycat family of models. For these reasons the Hinton terminology has become standard for a larger group of researchers.
Ising models are now considered to be a special case of Markov random fields, which do find widespread application in various fields, including linguistics, robotics and AI.
References
* Ackley, D. H., Hinton, G. E., Sejnowski, T. J. (1985), "A Learning Algorithm for Boltzmann Machines", "Cognitive Science", 9: 147-169.
* Hinton, G. E., Sejnowski, T. J. (1986), "Learning and Relearning in Boltzmann Machines". In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, "Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations", pp 282-317. Cambridge: MIT Press.
* Hinton, G. E.(2002), "Training Products of Experts by Minimizing Contrastive Divergence", "Neural Computation", 14: 1771-1800.
* Hinton, G. E., Osindero, S., Teh, Y. (2006), "A fast learning algorithm for deep belief nets", "Neural Computation", 18: 1527-1554. [http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf]
External links
* [http://www.scholarpedia.org/article/Boltzmann_Machine Scholarpedia article by Hinton about Boltzmann machines]
* [http://youtube.com/watch?v=AyzOUbkUf3M Speech at Google by Geoffrey Hinton]