Fisher information

In statistics and information theory, the Fisher information (denoted \mathcal{I}(\theta)) is the variance of the score. It is named in honor of its inventor, the statistician R. A. Fisher.


The Fisher information is a way of measuring the amount of information that an observable random variable "X" carries about an unknown parameter θ upon which the likelihood function of θ, L(\theta) = f(X;\theta), depends. The likelihood function is the joint probability of the data, the "X"s, conditional on the value of θ, viewed "as a function of θ". The score is the derivative of the log of the likelihood function with respect to θ. Since the expectation of the score is zero, its variance is simply its second moment. Hence the Fisher information can be written

:\mathcal{I}(\theta) = \mathrm{E}\left\{\left. \left[\frac{\partial}{\partial\theta} \ln f(X;\theta)\right]^2 \right| \theta \right\},

which implies \mathcal{I}(\theta) \geq 0; the value is finite whenever the score has a finite second moment. The Fisher information is thus the expectation of the squared score. High Fisher information means that the absolute value of the score is often large.

The Fisher information is not a function of a particular observation, as the random variable "X" has been averaged out. The concept of information is useful when comparing two methods of observing a given random process.
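As a minimal sketch of the definition, the following computes the mean and second moment of the score for a single Bernoulli trial by averaging over both outcomes. The choice of the Bernoulli model and the function names are assumptions made for illustration only.

```python
def bernoulli_pmf(x, theta):
    """Probability of outcome x (1 = success, 0 = failure)."""
    return theta if x == 1 else 1.0 - theta


def bernoulli_score(x, theta):
    """Derivative of ln f(x; theta) with respect to theta."""
    return x / theta - (1 - x) / (1.0 - theta)


def score_moments(theta):
    """Return (mean, second moment) of the score, averaging over x in {0, 1}."""
    mean = sum(bernoulli_pmf(x, theta) * bernoulli_score(x, theta) for x in (0, 1))
    second = sum(bernoulli_pmf(x, theta) * bernoulli_score(x, theta) ** 2 for x in (0, 1))
    return mean, second


mean, info = score_moments(0.3)
print("mean of score:", mean)
print("Fisher information:", info, "closed form:", 1.0 / (0.3 * 0.7))
```

The observation x no longer appears in the result: it has been averaged out, leaving a function of θ alone.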

If the following regularity condition is met:

:\int \frac{\partial^2}{\partial\theta^2} f(X;\theta) \, dx = 0,

then the Fisher information may also be written as:

:\mathcal{I}(\theta) = -\mathrm{E}\left[\left. \frac{\partial^2}{\partial\theta^2} \ln f(X;\theta) \right| \theta \right].

Thus Fisher information is the negative of the expectation of the second derivative of the log of "f" with respect to θ. Information may thus be seen to be a measure of the "sharpness" of the support curve (the log-likelihood) near the maximum likelihood estimate of θ. A "blunt" support curve (one with a shallow maximum) has an expected second derivative of small magnitude, and thus low information; a sharp one has an expected second derivative of large magnitude, and thus high information.
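The agreement of the two expressions can be checked numerically. The sketch below assumes a Poisson model with rate `lam` (not discussed in the text, chosen only because its score and second derivative are simple) and truncates the infinite sum at a hypothetical cutoff `k_max`.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def information_two_ways(lam, k_max=200):
    """Return (E[score^2], -E[second derivative]) for one Poisson(lam) draw."""
    squared_score = 0.0
    neg_second = 0.0
    for k in range(k_max + 1):
        p = poisson_pmf(k, lam)
        score = k / lam - 1.0        # d/d(lam) of ln f(k; lam)
        second = -k / lam ** 2       # d^2/d(lam)^2 of ln f(k; lam)
        squared_score += p * score ** 2
        neg_second -= p * second
    return squared_score, neg_second


a, b = information_two_ways(4.0)
print(a, b)  # both should be close to 1 / lam
```

For this model both sums approach 1/lam, so a larger rate gives a blunter log-likelihood and less information per observation.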

Information is additive, in that the information yielded by two independent experiments is the sum of the information from each experiment separately:

:\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta).

This result follows from the elementary fact that if random variables are independent, the variance of their sum is the sum of their variances: the score of two independent experiments is the sum of the individual scores. Hence the information in a random sample of size "n" is "n" times that in a sample of size 1 (if observations are independent).
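Additivity can be illustrated by brute force. The sketch below assumes "n" independent Bernoulli trials and enumerates all 2^n outcome sequences to compute the expected squared score of the joint sample; the function name is illustrative.

```python
from itertools import product


def joint_information(theta, n):
    """Expected squared score of n independent Bernoulli(theta) trials,
    computed by enumerating every outcome sequence."""
    total = 0.0
    for outcome in product((0, 1), repeat=n):
        successes = sum(outcome)
        failures = n - successes
        prob = theta ** successes * (1.0 - theta) ** failures
        # The joint score is the sum of the per-trial scores.
        score = successes / theta - failures / (1.0 - theta)
        total += prob * score ** 2
    return total


single = joint_information(0.3, 1)
for n in (1, 2, 5):
    print(n, joint_information(0.3, n), n * single)
```

Each printed row should show the enumerated value agreeing with "n" times the single-trial value, up to rounding.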

The information provided by a sufficient statistic is the same as that of the sample "X". This may be seen by using Neyman's factorization criterion for a sufficient statistic. If T(X) is sufficient for θ, then

:f(X;\theta) = g(T(X), \theta) \, h(X)

for some functions "g" and "h". See sufficient statistic for a more detailed explanation. The equality of information then follows from the following fact:

:\frac{\partial}{\partial\theta} \ln\left[f(X;\theta)\right] = \frac{\partial}{\partial\theta} \ln\left[g(T(X);\theta)\right]

which holds because h(X) does not depend on θ, so its logarithm vanishes on differentiation; the equality of information then follows from the definition of Fisher information. More generally, if T = t(X) is a statistic, then

:\mathcal{I}_T(\theta) \leq \mathcal{I}_X(\theta)

with equality if and only if "T" is a sufficient statistic.
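The inequality can be illustrated with two Bernoulli trials. In this sketch (the model, the helper names and the finite-difference step `eps` are all assumptions for illustration) the information of a statistic is computed from its probability mass function as the sum of (dp/dθ)²/p, which is the expected squared score of the statistic.

```python
from itertools import product


def statistic_pmf(stat, theta, n=2):
    """pmf of stat(X), where X is n independent Bernoulli(theta) trials."""
    pmf = {}
    for outcome in product((0, 1), repeat=n):
        k = sum(outcome)
        prob = theta ** k * (1.0 - theta) ** (n - k)
        t = stat(outcome)
        pmf[t] = pmf.get(t, 0.0) + prob
    return pmf


def information(stat, theta, n=2, eps=1e-5):
    """Fisher information of stat(X): sum over t of (dp/dtheta)^2 / p,
    with dp/dtheta approximated by a central finite difference."""
    p = statistic_pmf(stat, theta, n)
    hi = statistic_pmf(stat, theta + eps, n)
    lo = statistic_pmf(stat, theta - eps, n)
    return sum(((hi[t] - lo[t]) / (2 * eps)) ** 2 / p[t] for t in p)


theta = 0.3
full = information(lambda x: x, theta)       # the sample itself
total = information(sum, theta)              # number of successes (sufficient)
first = information(lambda x: x[0], theta)   # discards the second trial
print(full, total, first)
```

The number of successes retains all of the information in the sample, while keeping only the first trial halves it.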

The Cramér-Rao inequality states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of θ.

Informal derivation

Van Trees (1968) and Frieden (2004) provide the following method of deriving the Fisher information informally:

Consider an unbiased estimator \hat\theta(X). Mathematically, we write

:\mathrm{E}\left[\hat\theta(X) - \theta\right] = \int \left[\hat\theta(X) - \theta\right] \cdot f(X;\theta) \, dx = 0.

The likelihood function f(X;\theta) describes the probability that we observe a given sample x "given" a known value of θ. If f is sharply peaked, it is easy to intuit the "correct" value of θ given the data, and hence the data contain a lot of information about the parameter. If the likelihood f is flat and spread out, then it would take many samples of X to estimate the actual "true" value of θ. Therefore, we would intuit that the data contain much less information about the parameter.

Now, given the unbiasedness condition above, we differentiate it with respect to θ to get

:\frac{\partial}{\partial\theta} \int \left[\hat\theta(X) - \theta\right] \cdot f(X;\theta) \, dx = \int \left(\hat\theta - \theta\right) \frac{\partial f}{\partial\theta} \, dx - \int f \, dx = 0.

We now make use of two facts. The first is that the likelihood f is just the probability of the data given the parameter. Since it is a probability, it must be normalized, implying that

:\int f \, dx = 1.

Second, we know from basic calculus that

:\frac{\partial f}{\partial\theta} = f \, \frac{\partial \ln f}{\partial\theta}.

Using these two facts in the above lets us write

:\int \left(\hat\theta - \theta\right) f \, \frac{\partial \ln f}{\partial\theta} \, dx = 1.

Factoring the integrand gives

:\int \left(\left(\hat\theta - \theta\right) \sqrt{f}\right) \left(\sqrt{f} \, \frac{\partial \ln f}{\partial\theta}\right) \, dx = 1.

If we square the equation, the Cauchy-Schwarz inequality lets us write

:\left[\int \left(\hat\theta - \theta\right)^2 f \, dx\right] \cdot \left[\int \left(\frac{\partial \ln f}{\partial\theta}\right)^2 f \, dx\right] \geq 1.

The right-most factor is defined to be the Fisher Information

:\mathcal{I}\left(\theta\right) = \int \left(\frac{\partial \ln f}{\partial\theta}\right)^2 f \, dx.

The left-most factor is the mean-squared error of the estimator \hat\theta (which, for an unbiased estimator, equals its variance), since

:\mathrm{E}\left[\left(\hat\theta\left(X\right) - \theta\right)^2\right] = \int \left(\hat\theta - \theta\right)^2 f \, dx.

Notice that the inequality tells us that, fundamentally,

:\mathrm{Var}\left[\hat\theta\right] \, \geq \, \frac{1}{\mathcal{I}\left(\theta\right)}.

In other words, the precision to which we can estimate θ is fundamentally limited by the Fisher information of the likelihood function.
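The bound can be checked by simulation. The sketch below assumes "n" independent normal observations with unknown mean `mu` and known standard deviation `sigma`, for which the Fisher information about the mean is n/sigma^2; the sample mean is used as the unbiased estimator. The model, the seed and the number of repetitions are illustrative choices, not part of the derivation above.

```python
import random


def sample_mean_variance(mu, sigma, n, reps, seed=0):
    """Monte Carlo estimate of the variance of the sample mean of n normal draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        draws = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(draws) / n)
    grand_mean = sum(means) / reps
    return sum((m - grand_mean) ** 2 for m in means) / reps


mu, sigma, n = 1.0, 2.0, 10
fisher_information = n / sigma ** 2     # information about mu in n draws
bound = 1.0 / fisher_information        # lower bound on the variance
observed = sample_mean_variance(mu, sigma, n, reps=20000)
print("bound:", bound, "simulated variance:", observed)
```

In this model the sample mean attains the bound, so the simulated variance should scatter closely around it rather than sit clearly above it.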

Single-parameter Bernoulli experiment

A Bernoulli trial is a random variable with two possible outcomes, "success" and "failure", with "success" having a probability of θ. The outcome can be thought of as determined by a coin toss, with the probability of obtaining a "head" being θ and the probability of obtaining a "tail" being 1 − θ.

The Fisher information contained in "n" independent Bernoulli trials may be calculated as follows. In the following, "A" represents the number of successes, "B" the number of failures, and n = A + B is the total number of trials.

:\mathcal{I}(\theta) = -\mathrm{E}\left[\frac{\partial^2}{\partial\theta^2} \ln(f(A;\theta))\right] \qquad (1)

::= -\mathrm{E}\left[\frac{\partial^2}{\partial\theta^2} \ln\left[\theta^A (1-\theta)^B \frac{(A+B)!}{A!B!}\right]\right] \qquad (2)

::= -\mathrm{E}\left[\frac{\partial^2}{\partial\theta^2} \left[A \ln(\theta) + B \ln(1-\theta)\right]\right] \qquad (3)

::= -\mathrm{E}\left[\frac{\partial}{\partial\theta} \left[\frac{A}{\theta} - \frac{B}{1-\theta}\right]\right] (on differentiating ln "x", see logarithm) \qquad (4)

::= +\mathrm{E}\left[\frac{A}{\theta^2} + \frac{B}{(1-\theta)^2}\right] \qquad (5)

::= \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} (as the expected value of A = n\theta, etc.) \qquad (6)

::= \frac{n}{\theta(1-\theta)} \qquad (7)

(1) defines Fisher information. (2) invokes the fact that the information in a sufficient statistic is the same as that of the sample itself. (3) expands the log term and drops a constant. (4) and (5) differentiate with respect to θ. (6) replaces "A" and "B" with their expectations. (7) is algebra.

The end result, namely,

:\mathcal{I}(\theta) = \frac{n}{\theta(1-\theta)},

is the reciprocal of the variance of the mean number of successes in "n" Bernoulli trials, as expected (see last sentence of the preceding section).
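The calculation can be reproduced numerically. This sketch evaluates the expectation in step (5) exactly by summing over the binomial distribution of "A", and compares it with the closed form in step (7) and with the variance of the fraction of successes. The function names and the values n = 20, θ = 0.3 are illustrative.

```python
import math


def binomial_pmf(a, n, theta):
    """Probability of exactly a successes in n Bernoulli(theta) trials."""
    return math.comb(n, a) * theta ** a * (1.0 - theta) ** (n - a)


def information_from_second_derivative(n, theta):
    """Step (5): E[A / theta^2 + B / (1 - theta)^2] with B = n - A."""
    return sum(
        binomial_pmf(a, n, theta) * (a / theta ** 2 + (n - a) / (1.0 - theta) ** 2)
        for a in range(n + 1)
    )


def variance_of_success_fraction(n, theta):
    """Variance of A / n, computed directly from the binomial pmf."""
    mean = sum(binomial_pmf(a, n, theta) * a / n for a in range(n + 1))
    return sum(binomial_pmf(a, n, theta) * (a / n - mean) ** 2 for a in range(n + 1))


n, theta = 20, 0.3
info = information_from_second_derivative(n, theta)
closed_form = n / (theta * (1.0 - theta))
var_mean = variance_of_success_fraction(n, theta)
print(info, closed_form, info * var_mean)
```

The product of the information and the variance of the success fraction should come out as 1 up to rounding, which is the reciprocal relationship stated above.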

Matrix form

When there are "N" parameters, so that θ is an "N"×1 vector \theta = \begin{bmatrix} \theta_{1}, \theta_{2}, \cdots, \theta_{N} \end{bmatrix}^\top, then the Fisher information takes the form of an "N"×"N" matrix, the Fisher Information Matrix (FIM), with typical element:

:{\left(\mathcal{I}\left(\theta\right)\right)}_{i,j} = \mathrm{E}\left[\frac{\partial}{\partial\theta_i} \ln f(X;\theta) \, \frac{\partial}{\partial\theta_j} \ln f(X;\theta)\right].

The FIM is an "N"×"N" positive semidefinite symmetric matrix. When it is positive definite, it defines a metric on the "N"-dimensional parameter space. Exploring this topic requires differential geometry.

Orthogonal parameters

We say that two parameters \theta_{i} and \theta_{j} are orthogonal if the element in the i-th row and j-th column of the Fisher Information Matrix is zero. Orthogonal parameters are easy to deal with in the sense that their maximum likelihood estimates are asymptotically independent and can be calculated separately. In research problems, it is common to invest some time searching for an orthogonal parametrization of the densities involved.
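As a sketch of both the matrix form and orthogonality, the following approximates the FIM of a univariate normal distribution with parameters (mu, sigma) by numerically integrating the outer product of the score against the density. The model, the integration range `half_width` and the grid size `steps` are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def score(x, mu, sigma):
    """Gradient of ln f(x; mu, sigma) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3
    return (d_mu, d_sigma)


def fisher_matrix(mu, sigma, half_width=12.0, steps=20000):
    """E[score_i * score_j] by a midpoint sum over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    fim = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        s = score(x, mu, sigma)
        w = normal_pdf(x, mu, sigma) * dx
        for i in range(2):
            for j in range(2):
                fim[i][j] += w * s[i] * s[j]
    return fim


fim = fisher_matrix(mu=0.5, sigma=2.0)
for row in fim:
    print(row)
```

The off-diagonal entries should be numerically zero, so mu and sigma are orthogonal in this parametrization; the diagonal entries should be close to 1/sigma^2 and 2/sigma^2.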

Multivariate normal distribution

The FIM for an "N"-variate multivariate normal distribution has a special form. Let \mu(\theta) = \begin{bmatrix} \mu_{1}(\theta), \mu_{2}(\theta), \dots, \mu_{N}(\theta) \end{bmatrix}, and let \Sigma(\theta) be the covariance matrix. Then the typical element \mathcal{I}_{m,n}, 0 ≤ "m", "n" < "N", of the FIM for X \sim N(\mu(\theta), \Sigma(\theta)) is:

:\mathcal{I}_{m,n} = \frac{\partial\mu}{\partial\theta_m} \Sigma^{-1} \frac{\partial\mu^\top}{\partial\theta_n} + \frac{1}{2} \mathrm{tr}\left(\Sigma^{-1} \frac{\partial\Sigma}{\partial\theta_m} \Sigma^{-1} \frac{\partial\Sigma}{\partial\theta_n}\right),

where (..)^\top denotes the transpose of a vector, \mathrm{tr}(..) denotes the trace of a square matrix, and:

*\frac{\partial\mu}{\partial\theta_m} = \begin{bmatrix} \frac{\partial\mu_1}{\partial\theta_m} & \frac{\partial\mu_2}{\partial\theta_m} & \cdots & \frac{\partial\mu_N}{\partial\theta_m} \end{bmatrix};

*\frac{\partial\Sigma}{\partial\theta_m} = \begin{bmatrix} \frac{\partial\Sigma_{1,1}}{\partial\theta_m} & \frac{\partial\Sigma_{1,2}}{\partial\theta_m} & \cdots & \frac{\partial\Sigma_{1,N}}{\partial\theta_m} \\ \frac{\partial\Sigma_{2,1}}{\partial\theta_m} & \frac{\partial\Sigma_{2,2}}{\partial\theta_m} & \cdots & \frac{\partial\Sigma_{2,N}}{\partial\theta_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial\Sigma_{N,1}}{\partial\theta_m} & \frac{\partial\Sigma_{N,2}}{\partial\theta_m} & \cdots & \frac{\partial\Sigma_{N,N}}{\partial\theta_m} \end{bmatrix}.
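The formula can be evaluated mechanically once the derivatives of the mean and covariance are known. The sketch below is restricted to the bivariate case so that a hand-written 2×2 inverse suffices, and it uses a hypothetical model in which both components share the mean theta_0 and the covariance is theta_1 times the identity. All helper names are illustrative.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


def gaussian_fim(dmu, dsigma, sigma):
    """dmu[m]: row vector d mu / d theta_m; dsigma[m]: matrix d Sigma / d theta_m;
    sigma: the 2x2 covariance matrix, all evaluated at the same theta."""
    sigma_inv = inverse_2x2(sigma)
    p = len(dmu)
    fim = [[0.0] * p for _ in range(p)]
    for m in range(p):
        for n in range(p):
            column = [[v] for v in dmu[n]]   # transpose of the row vector
            mean_term = matmul(matmul([dmu[m]], sigma_inv), column)[0][0]
            cov_term = 0.5 * trace(matmul(matmul(sigma_inv, dsigma[m]),
                                          matmul(sigma_inv, dsigma[n])))
            fim[m][n] = mean_term + cov_term
    return fim


# Hypothetical model: mu(theta) = [theta_0, theta_0], Sigma(theta) = theta_1 * identity.
theta_1 = 4.0
sigma = [[theta_1, 0.0], [0.0, theta_1]]
dmu = [[1.0, 1.0], [0.0, 0.0]]
dsigma = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
fim = gaussian_fim(dmu, dsigma, sigma)
print(fim)
```

This model amounts to two independent draws with common mean theta_0 and variance theta_1, so the diagonal entries should equal 2/theta_1 and 1/theta_1^2, with zero off-diagonal entries.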


The Fisher information depends on the parametrization of the problem. If θ and η are two different parameterizations of a problem, such that \theta = h(\eta) and "h" is a differentiable function, then

:{\mathcal I}_\eta(\eta) = {\mathcal I}_\theta(h(\eta)) \left(h'(\eta)\right)^2

where {\mathcal I}_\eta and {\mathcal I}_\theta are the Fisher information measures of η and θ, respectively. [Lehmann and Casella, eq. (5.2.11).]
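A small check of the reparametrization rule, assuming a single Bernoulli trial reparametrized by its log-odds η, so that h is the logistic function (an illustrative choice, not one made in the text). The information about η is computed once directly from the score in η and once through the rule.

```python
import math


def sigmoid(eta):
    """h(eta): maps the log-odds eta to the success probability theta."""
    return 1.0 / (1.0 + math.exp(-eta))


def info_theta(theta):
    """Fisher information of one Bernoulli trial about theta."""
    return 1.0 / (theta * (1.0 - theta))


def info_eta_direct(eta):
    """Expected squared score with respect to eta, averaging over x in {0, 1}.
    With theta = sigmoid(eta), ln f = x * eta - ln(1 + exp(eta)),
    so the score with respect to eta is x - theta."""
    theta = sigmoid(eta)
    return theta * (1.0 - theta) ** 2 + (1.0 - theta) * (0.0 - theta) ** 2


def info_eta_via_rule(eta):
    """I_eta(eta) = I_theta(h(eta)) * h'(eta)^2."""
    theta = sigmoid(eta)
    h_prime = theta * (1.0 - theta)   # derivative of the logistic function
    return info_theta(theta) * h_prime ** 2


eta = 0.8
direct = info_eta_direct(eta)
via_rule = info_eta_via_rule(eta)
print(direct, via_rule)
```

Both routes should give θ(1 − θ), which shows that the same experiment carries a different amount of information about η than about θ.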

See also

*Formation matrix

Other measures employed in information theory:
*Kullback-Leibler divergence
*Shannon entropy



References

*Schervish, Mark J. (1995). Theory of Statistics. New York: Springer. Section 2.3.1. ISBN 0387945466.
*Van Trees, H. L. (1968). Detection, Estimation, and Modulation Theory, Part I. New York: Wiley. ISBN 0471095176.
*Frieden, B. Roy (2004). Science from Fisher Information. New York: Cambridge University Press. pp. 29-30. ISBN 0521009111.
*Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. ISBN 0-387-98502-6.

Further weblinks

* James Case: An Unexpected Union — Physics and Fisher Information, SIAM News, Volume 33, Number 6 (a review of the book "Physics from Fisher Information: A Unification" by B. Roy Frieden)
* D. A. Lavis and R. F. Streater: Physics from Fisher Information (a critical review of B. Roy Frieden's approach to deriving laws of physics from the Fisher information)
* Fisher4Cast: a Matlab, GUI-based Fisher information tool for research and teaching, primarily aimed at cosmological forecasting applications.

Wikimedia Foundation. 2010.
