Information bottleneck method

The information bottleneck method is a technique introduced by Tishby et al. [1] for finding the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a random variable X, given a joint probability distribution between X and an observed relevant variable Y. Other applications include distributional clustering and dimension reduction. In a well-defined sense it generalizes the classical notion of minimal sufficient statistics from parametric statistics to arbitrary distributions, not necessarily of exponential form. It does so by relaxing the sufficiency condition to capture some fraction of the mutual information with the relevant variable Y.

The compressed variable is T, and the algorithm minimises the following quantity

\min_{p(t|x)} \; I(X;T) - \beta\, I(T;Y)

where I(X;T) and I(T;Y) are the mutual informations between X and T, and between T and Y, respectively, and β is the Lagrange multiplier that sets the tradeoff between compression and preserved relevant information.
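
As a concrete illustration, the objective can be evaluated directly for discrete distributions. The sketch below is a minimal, illustrative Python implementation; the array names p_xt and p_ty for the joint distributions are assumptions, not notation from [1].

```python
# Minimal illustrative sketch: evaluating the bottleneck functional
# I(X;T) - beta * I(T;Y) for discrete joint distributions held as NumPy arrays.
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats for a joint distribution given as a 2-D array p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)        # marginal over the second variable
    p_b = p_joint.sum(axis=0, keepdims=True)        # marginal over the first variable
    mask = p_joint > 0                              # skip zero-probability terms
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xt, p_ty, beta):
    """The quantity minimised over p(t|x): I(X;T) - beta * I(T;Y)."""
    return mutual_information(p_xt) - beta * mutual_information(p_ty)
```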

= Gaussian Information Bottleneck [2] =

A relatively simple application of the information bottleneck is to Gaussian variates, and this has some resemblance to a least-squares reduced-rank regression or canonical correlation analysis. Assume X and Y are jointly multivariate zero-mean normal vectors with covariances Σ_XX and Σ_YY, and T is a compressed version of X that must maintain a given value of mutual information with Y. It can be shown that the optimum T is a normal vector consisting of linear combinations of the elements of X, T = AX, where the matrix A has orthogonal rows. The projection matrix A in fact contains M rows selected from the weighted left eigenvectors of the singular value decomposition of the following (generally asymmetric) matrix:

\Omega = \Sigma_{X|Y} \Sigma_{XX}^{-1} = I - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^T \Sigma_{XX}^{-1}

Define the singular value decomposition

\Omega = U \Lambda V^T \quad \text{with} \quad \Lambda = \mathrm{Diag}\big( \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N \big)

and the critical values

\beta_i^C \underset{\lambda_i < 1}{=} (1 - \lambda_i)^{-1}.

Then the number M of active eigenvectors in the projection, or order of approximation, is given by

\beta_{M-1}^C < \beta \le \beta_M^C

and we finally obtain

A = \big[\, w_1 U_1, \dots, w_M U_M \,\big]^T

in which the weights are given by

w_i = \sqrt{\big( \beta (1 - \lambda_i) - 1 \big) \big/ \lambda_i r_i}

where r_i = U_i^T \Sigma_{XX} U_i.
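
A minimal computational sketch of this projection, assuming NumPy and the conventions above (left eigenvectors of Ω as rows of A, retained only while β exceeds the corresponding critical value); the function name and its interface are illustrative and not taken from [2].

```python
# Illustrative sketch of the Gaussian IB projection A = [w_1 U_1, ..., w_M U_M]^T.
import numpy as np

def gaussian_ib_projection(sigma_xx, sigma_xy, sigma_yy, beta):
    # Omega = I - Sigma_XY Sigma_YY^{-1} Sigma_XY^T Sigma_XX^{-1}
    omega = np.eye(sigma_xx.shape[0]) - sigma_xy @ np.linalg.solve(sigma_yy, sigma_xy.T) @ np.linalg.inv(sigma_xx)
    # Left eigenvectors of Omega are right eigenvectors of Omega^T
    eigvals, left_vecs = np.linalg.eig(omega.T)
    rows = []
    for lam, u in sorted(zip(eigvals.real, left_vecs.T.real), key=lambda p: p[0]):
        if 0 < lam < 1 and beta > 1.0 / (1.0 - lam):   # beta exceeds the critical value beta_i^C
            r = u @ sigma_xx @ u                        # r_i = U_i^T Sigma_XX U_i
            w = np.sqrt((beta * (1.0 - lam) - 1.0) / (lam * r))
            rows.append(w * u)
    return np.vstack(rows) if rows else np.zeros((0, sigma_xx.shape[0]))
```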

= Data Clustering using the Information Bottleneck =

This application of the bottleneck method to non-Gaussian sampled data is described in [3] by Tishby et al. The concept, as treated there, is not without complication, as there are two independent phases in the exercise: firstly, estimation of the unknown parent probability densities from which the data samples are drawn, and secondly, the use of these densities within the information-theoretic framework of the bottleneck.

Density Estimation

Since the bottleneck method is framed in probabilistic rather than statistical terms, we first need to estimate the underlying probability density at the sample points X = {x_i}. This is a well-known problem with a number of solutions, described by Silverman in [4]. In the present method, joint probabilities of the samples are found by use of a Markov transition matrix method, and this has some mathematical synergy with the bottleneck method itself.

Define an arbitrarily increasing distance metric f between all sample pairs, giving the distance matrix d_{i,j} = f(|x_i - x_j|). Then compute the transition probabilities between sample pairs, P_{i,j} = \exp(-\lambda d_{i,j}), for some λ > 0. Treating the samples as states and a normalised version of P as a Markov state-transition probability matrix, the vector of probabilities of the 'states' after t steps, conditioned on the initial state p(0), is p(t) = P^t p(0). We are here interested only in the equilibrium probability vector p(∞), given in the usual way by the dominant eigenvector of the matrix P, which is independent of the initialising vector p(0). This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probability densities there.
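
A minimal sketch of this density estimate, assuming NumPy, Euclidean sample spacings, and a default metric f(d) = d²; the function name and defaults are illustrative rather than taken from [3].

```python
# Illustrative sketch of the Markov-transition density estimate: pairwise
# distances, exponential transition weights, column normalisation, and the
# stationary distribution taken as the dominant eigenvector of P.
import numpy as np

def markov_density_estimate(samples, lam=1.0, f=lambda d: d**2):
    # samples: (n, dim) array; f: increasing metric applied to Euclidean spacings
    diffs = samples[:, None, :] - samples[None, :, :]
    d = f(np.linalg.norm(diffs, axis=-1))             # distance matrix d_{i,j}
    P = np.exp(-lam * d)                              # transition weights P_{i,j}
    P /= P.sum(axis=0, keepdims=True)                 # normalise columns so P maps probability vectors
    eigvals, eigvecs = np.linalg.eig(P)
    stationary = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    return stationary / stationary.sum()              # p(x_i), proportional to the density at x_i
```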

Other interpretations of the use of the eigenvalues of the distance matrix d are discussed in [6].

Clusters

In the following soft clustering example, the reference vector Y contains sample categories and the joint probability p(X,Y) is assumed known. A soft cluster c_k is defined by its probability distribution over the data samples x_i: p(c_k | x_i). In [1] Tishby et al. present the following iterative set of equations to determine the clusters, which is ultimately a generalization of the Blahut–Arimoto algorithm developed in rate distortion theory. The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of Gibbs distributions in deterministic annealing [7].

\begin{cases}
p(c|x) = K\, p(c) \exp\Big( -\beta\, D^{KL}\Big[\, p(y|x) \,\|\, p(y|c) \,\Big] \Big) \\
p(y|c) = \textstyle\sum_x p(y|x)\, p(c|x)\, p(x) \big/ p(c) \\
p(c) = \textstyle\sum_x p(c|x)\, p(x)
\end{cases}

The function of each line of the iteration is expanded as follows.

Line 1: This is a matrix-valued set of conditional probabilities

A_{i,j} = p(c_i | x_j) = K\, p(c_i) \exp\Big( -\beta\, D^{KL}\Big[\, p(y|x_j) \,\|\, p(y|c_i) \,\Big] \Big)

The Kullback–Leibler distance D^{KL} between the Y vectors generated by the sample data x and those generated by its reduced-information proxy c is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data Y, in accordance with the fundamental bottleneck equation. D^{KL}(a \,\|\, b) is the Kullback–Leibler distance between distributions a and b:

D^{KL}(a \,\|\, b) = \sum_i p(a_i) \log\Big( \frac{p(a_i)}{p(b_i)} \Big)

and K is a scalar normalization. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler distance is large; thus successful clusters grow in probability while unsuccessful ones decay.

Line 2: This is a second matrix-valued set of conditional probabilities

B_{i,j} = p(y_i | c_j) = \sum_k p(y_i | x_k)\, p(c_j | x_k)\, p(x_k) \big/ p(c_j)

The steps in deriving this are as follows. We have, by definition,

\begin{align}
p(y_i|c_k) & = \sum_j p(y_i|x_j)\, p(x_j|c_k) \\
& = \sum_j p(y_i|x_j)\, p(x_j, c_k) \big/ p(c_k) \\
& = \sum_j p(y_i|x_j)\, p(c_k | x_j)\, p(x_j) \big/ p(c_k)
\end{align}

where the Bayes identities p(a,b) = p(a|b)\, p(b) = p(b|a)\, p(a) are used.

Line 3: This line finds the marginal distribution of the clusters c:

p(c_i) = \sum_j p(c_i, x_j) = \sum_j p(c_i | x_j)\, p(x_j)

This is also a standard result.

Further inputs to the algorithm are the marginal sample distribution p(x), which has already been determined by the dominant eigenvector of P, and the matrix-valued Kullback–Leibler distance function

D_{i,j}^{KL} = D^{KL}\Big[\, p(y|x_j) \,\|\, p(y|c_i) \,\Big]

derived from the sample spacings and transition probabilities.

The matrix p(y_i | c_j) can be initialised randomly or as a reasonable guess, while the matrix p(c_i | x_j) needs no prior values. Although the algorithm converges, multiple minima may exist and need some action to resolve. Further details, including hard clustering methods, are found in [5].
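
A minimal sketch of the three-line iteration, assuming p(y|x) is supplied as a matrix and p(x) comes from the Markov density estimate above; the function name, the random initialisation of p(y|c), and the fixed iteration count are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch of the iterative soft-clustering update (lines 1-3 above).
import numpy as np

def ib_soft_clustering(p_y_x, p_x, n_clusters, beta, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    eps = 1e-12
    n_y, n_x = p_y_x.shape
    # p(y|c): random but normalised initial guess; p(c): uniform
    p_y_c = rng.random((n_y, n_clusters))
    p_y_c /= p_y_c.sum(axis=0, keepdims=True)
    p_c = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # Line 1: p(c|x) proportional to p(c) exp(-beta * KL[p(y|x) || p(y|c)])
        log_ratio = np.log((p_y_x[:, :, None] + eps) / (p_y_c[:, None, :] + eps))   # (y, x, c)
        kl = np.einsum('yx,yxc->cx', p_y_x, log_ratio)                              # kl[c, x]
        p_c_x = p_c[:, None] * np.exp(-beta * kl)
        p_c_x /= p_c_x.sum(axis=0, keepdims=True)        # K normalises each column over the clusters
        # Line 3: p(c) = sum_x p(c|x) p(x)
        p_c = p_c_x @ p_x
        # Line 2: p(y|c) = sum_x p(y|x) p(c|x) p(x) / p(c)
        p_y_c = (p_y_x * p_x) @ p_c_x.T / (p_c + eps)
    return p_c_x, p_y_c, p_c
```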

Defining Decision Contours

To categorize a new sample x' external to the training set X, apply the previous distance metric to find the transition probabilities between x' and all samples in X:

\tilde p(x_i) = p(x_i | x') = \kappa \exp\Big( -\lambda\, f\big( | x_i - x' | \big) \Big)

with κ a normalisation. Secondly, apply the last two lines of the 3-line algorithm to get cluster and conditional category probabilities.

\begin{align}
& \tilde p(c_i) = p(c_i | x') = \sum_j p(c_i | x_j)\, p(x_j | x') = \sum_j p(c_i | x_j)\, \tilde p(x_j) \\
& p(y_i | c_j) = \sum_k p(y_i | x_k)\, p(c_j | x_k)\, p(x_k | x') \big/ p(c_j | x') = \sum_k p(y_i | x_k)\, p(c_j | x_k)\, \tilde p(x_k) \big/ \tilde p(c_j)
\end{align}

Finally we have

p(y_i | x') = \sum_j p(y_i | c_j)\, p(c_j | x') = \sum_j p(y_i | c_j)\, \tilde p(c_j)
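
A minimal sketch of this classification step, reusing the arrays produced by the clustering sketch above; the helper name and the reuse of λ and f are assumptions.

```python
# Illustrative sketch: classify a new sample x' via the last two update lines.
import numpy as np

def classify_new_sample(x_new, samples, p_c_x, p_y_x, lam=1.0, f=lambda d: d**2):
    # tilde p(x_i) = p(x_i | x'): normalised transition probabilities to x'
    p_x_given_new = np.exp(-lam * f(np.linalg.norm(samples - x_new, axis=1)))
    p_x_given_new /= p_x_given_new.sum()
    # tilde p(c) = sum_j p(c|x_j) tilde p(x_j)
    p_c_new = p_c_x @ p_x_given_new
    # p(y|c) re-estimated with tilde p: sum_k p(y|x_k) p(c|x_k) tilde p(x_k) / tilde p(c)
    p_y_c = (p_y_x * p_x_given_new) @ p_c_x.T / p_c_new
    # p(y|x') = sum_j p(y|c_j) tilde p(c_j)
    return p_y_c @ p_c_new
```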

The parameter β must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space snap into focus at certain critical thresholds.

An Example

The following case examines clustering in a four-quadrant multiplier with random inputs u, v and two categories of output, ±1, generated by y = sign(uv). This function has the property that there are two spatially separated clusters for each category, and so it demonstrates that the method can handle such distributions.

Twenty samples are taken, uniformly distributed on the square [-1,1]². The number of clusters used beyond the number of categories (two in this case) has little effect on performance, and the results are shown for two clusters using the parameters λ = 3 and β = 2.5.

The distance function is d_{i,j} = | x_i - x_j |², where x_i = (u_i, v_i)^T, while the conditional distribution p(y|x) is a 2 × 20 indicator matrix with

\begin{align}
& \Pr(y_i = 1) = 1 \quad \text{if } \operatorname{sign}(u_i v_i) = 1, \\
& \Pr(y_i = -1) = 1 \quad \text{if } \operatorname{sign}(u_i v_i) = -1,
\end{align}

and zero elsewhere.

The summation in line 2 incorporates only two values, representing the training categories +1 and -1, but nevertheless seems to work quite well. Five iterations of the equations were used. The figure shows the locations of the twenty samples, with '0' representing Y = 1 and 'x' representing Y = -1. The contour at the unity likelihood-ratio level,

L = \Pr(1) \big/ \Pr(-1) = 1,

is shown as a new sample x' is scanned over the square. Theoretically the contour should align with the u = 0 and v = 0 coordinates, but for such small sample numbers it instead follows the spurious clusterings of the sample points.
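
A minimal sketch of how this example might be reproduced using the earlier sketches; the sample count, parameters, and indicator construction follow the text, while everything else (seed, scanned point, function reuse) is an illustrative assumption.

```python
# Illustrative usage: four-quadrant multiplier example with the sketches above.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(20, 2))        # x_i = (u_i, v_i) on [-1, 1]^2
labels = np.sign(samples[:, 0] * samples[:, 1])       # y = sign(u v)

# p(y|x): row 0 holds category +1, row 1 holds category -1 (the 2 x 20 indicator matrix)
p_y_x = np.vstack([(labels == 1).astype(float), (labels == -1).astype(float)])

p_x = markov_density_estimate(samples, lam=3.0)                        # density sketch above
p_c_x, p_y_c, p_c = ib_soft_clustering(p_y_x, p_x, n_clusters=2, beta=2.5, n_iter=5)

# Likelihood ratio L = Pr(+1)/Pr(-1) for one scanned point x'
p_y_new = classify_new_sample(np.array([0.3, -0.4]), samples, p_c_x, p_y_x, lam=3.0)
print(p_y_new[0] / max(p_y_new[1], 1e-12))
```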

Neural Network/Fuzzy Logic Analogies

There is some analogy between this algorithm and a neural network with a single hidden layer. The internal nodes are represented by the clusters c_j, and the first and second layers of network weights are the conditional probabilities p(c_j | x_i) and p(y_k | c_j), respectively. However, unlike a standard neural network, the present algorithm relies entirely on probabilities as inputs rather than the sample values themselves, while internal and output values are all conditional probability distributions. Nonlinear functions are encapsulated in the distance metric f(.) (or "influence functions"/radial basis functions) and transition probabilities instead of sigmoid functions. The Blahut–Arimoto three-line algorithm is seen to converge rapidly, often in tens of iterations, and by varying β, λ, f and the cardinality of the clusters, various levels of focus on data features can be achieved.
The statistical soft clustering definition p(c_i | x_j) has some overlap with the verbal fuzzy-membership concept of fuzzy logic.

Bibliography

[1] N. Tishby, F.C. Pereira, and W. Bialek: [http://www.cs.huji.ac.il/labs/learning/Papers/allerton.pdf “The Information Bottleneck method”. The 37th annual Allerton Conference on Communication, Control, and Computing, Sep 1999: pp. 368-377]

[2] G. Chechik, A. Globerson, N. Tishby and Y. Weiss: [http://www.jmlr.org/papers/volume6/chechik05a/chechik05a.pdf “Information Bottleneck for Gaussian Variables”. Journal of Machine Learning Research 6, Jan 2005, pp. 165-188]

[3] N. Tishby, N. Slonim: “Data clustering by Markovian Relaxation and the Information Bottleneck Method”, Neural Information Processing Systems (NIPS) 2000, pp. 640-646

[4] B.W. Silverman: “Density Estimation for Statistics and Data Analysis”, Chapman and Hall, 1986.

[5] N. Slonim, N. Tishby: "Document Clustering using Word Clusters via the Information Bottleneck Method", SIGIR 2000, pp. 208-215

[6] Y. Weiss: "Segmentation using eigenvectors: a unifying view", Proceedings IEEE International Conference on Computer Vision 1999, pp. 975-982

[7] D. J. Miller, A. V. Rao, K. Rose, A. Gersho: "An Information-theoretic Learning Algorithm for Neural Network Classification". NIPS 1995: pp. 591-597

[8] P. Harremoes and N. Tishby [http://www.cs.huji.ac.il/labs/learning/Papers/flaske2.pdf "The Information Bottleneck Revisited or How to Choose a Good Distortion Measure". In proceedings of the International Symposium on Information Theory (ISIT) 2007]

See also

* Information theory

External links

* [http://citeseer.ist.psu.edu/tishby99information.html Paper by N. Tishby et al.]

