Divergence (statistics)

In statistics and information geometry, a divergence or contrast function is a function that measures the “distance” of one probability distribution from another on a statistical manifold. A divergence is a weaker notion than a distance (metric) in mathematics: in particular, a divergence need not be symmetric (that is, in general the divergence from p to q is not equal to the divergence from q to p), and it need not satisfy the triangle inequality.

Definition

Suppose S is a space of all probability distributions with common support. Then a divergence on S is a function D(· || ·): S×S → R satisfying [1]

  1. D(p || q) ≥ 0 for all p, q ∈ S,
  2. D(p || q) = 0 if and only if p = q,
  3. The matrix g(D) (see definition in the “geometrical properties” section) is strictly positive-definite everywhere on S.[2]

The dual divergence D* is defined as


    D^*(p \parallel q) = D(q \parallel p).
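
For instance (a minimal sketch, not part of the original article, using discrete distributions so that the integrals appearing later become sums), the Kullback–Leibler divergence introduced in the Examples section satisfies properties 1 and 2, and its dual is obtained simply by swapping the arguments:

    # Minimal sketch: the Kullback-Leibler divergence on discrete distributions
    # with common support, and its dual D*(p || q) = D(q || p).
    import numpy as np

    def kl_divergence(p, q):
        """D_KL(p || q) = sum_x p(x) * (ln p(x) - ln q(x))."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(p * (np.log(p) - np.log(q))))

    def dual_divergence(p, q):
        """The dual divergence simply swaps the two arguments."""
        return kl_divergence(q, p)

    p = [0.2, 0.3, 0.5]
    q = [0.1, 0.4, 0.5]
    print(kl_divergence(p, q))    # > 0, since p != q (property 1)
    print(kl_divergence(p, p))    # == 0 exactly when the arguments coincide (property 2)
    print(dual_divergence(p, q))  # in general differs from kl_divergence(p, q)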

Geometrical properties

Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution pS we can write p = p(θ).

For a pair of points p, qS with coordinates θp and θq, denote the partial derivatives of D(p || q) as

\begin{align}
    D((\partial_i)_p \parallel q) \ \ &\stackrel\mathrm{def}=\ \ \tfrac{\partial}{\partial\theta^i_p} D(p \parallel q), \\
    D((\partial_i\partial_j)_p \parallel (\partial_k)_q) \ \ &\stackrel\mathrm{def}=\ \ \tfrac{\partial}{\partial\theta^i_p} \tfrac{\partial}{\partial\theta^j_p}\tfrac{\partial}{\partial\theta^k_q}D(p \parallel q), \ \ \mathrm{etc.}
  \end{align}

Now we restrict these functions to a diagonal p = q, and denote [3]

\begin{align}
    D[\partial_i\parallel\cdot]\ &:\ p \mapsto D((\partial_i)_p \parallel p), \\
    D[\partial_i\parallel\partial_j]\ &:\ p \mapsto D((\partial_i)_p \parallel (\partial_j)_p),\ \ \mathrm{etc.}
  \end{align}

By definition, the function D(p || q) is minimized at p = q, and therefore

\begin{align}
    & D[\partial_i\parallel\cdot] = D[\cdot\parallel\partial_i] = 0, \\
    & D[\partial_i\partial_j\parallel\cdot] = D[\cdot\parallel\partial_i\partial_j] = -D[\partial_i\parallel\partial_j] \ \equiv\ g_{ij}^{(D)},
  \end{align}

where the matrix g(D) is positive-definite and defines a unique Riemannian metric on the manifold S.
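
To illustrate this construction (a sketch under assumptions not made in the article: a one-parameter Bernoulli family, with D taken to be the Kullback–Leibler divergence), the mixed second derivative restricted to the diagonal recovers the Fisher information 1/(θ(1−θ)):

    # Sketch: for the Bernoulli family p(theta), with D = Kullback-Leibler divergence,
    # the metric g^(D) = -D[d_i || d_j] (one derivative in the coordinate of p, one in
    # the coordinate of q, evaluated at p = q) equals the Fisher information.
    import sympy as sp

    tp, tq = sp.symbols('theta_p theta_q', positive=True)

    # KL divergence between Bernoulli(theta_p) and Bernoulli(theta_q)
    D = tp*sp.log(tp/tq) + (1 - tp)*sp.log((1 - tp)/(1 - tq))

    g = -sp.diff(D, tp, tq)                       # g^(D) = -D[d_i || d_j]
    g_on_diagonal = sp.simplify(g.subs(tq, tp))   # restrict to the diagonal p = q

    # Prints a form equivalent to 1/(theta_p*(1 - theta_p)),
    # the Fisher information of the Bernoulli family.
    print(g_on_diagonal)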

Divergence D(· || ·) also defines a unique torsion-free affine connection ∇(D) with coefficients


    \Gamma_{ij,k}^{(D)} = -D[\partial_i\partial_j\parallel\partial_k],

and the dual to this connection ∇* is generated by the dual divergence D*.

Thus, a divergence D(· || ·) generates on a statistical manifold a unique dualistic structure (g(D), ∇(D), ∇(D*)). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).[4]

For example, when D is an f-divergence for some function f(·), then it generates the metric g(Df) = c·g and the connection ∇(Df) = ∇(α), where g is the canonical Fisher information metric, ∇(α) is the α-connection, c = f′′(1), and α = 3 + 2f′′′(1)/f′′(1).
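
As a small check of these formulas (a sketch, not from the article; sympy is used for the symbolic derivatives), c and α can be evaluated for a given generator f. In the convention D_f(p || q) = ∫ p(x) f(q(x)/p(x)) dx used in the f-divergences section below, f(u) = −ln u generates the Kullback–Leibler divergence and f(u) = (1 − u)²/2 generates Kagan's divergence:

    # Sketch: evaluate c = f''(1) and alpha = 3 + 2*f'''(1)/f''(1) for a generator f(u).
    import sympy as sp

    u = sp.symbols('u', positive=True)

    def metric_constant_and_alpha(f):
        c = sp.diff(f, u, 2).subs(u, 1)
        alpha = 3 + 2*sp.diff(f, u, 3).subs(u, 1)/c
        return sp.simplify(c), sp.simplify(alpha)

    print(metric_constant_and_alpha(-sp.log(u)))     # (1, -1)  -- Kullback-Leibler generator
    print(metric_constant_and_alpha((1 - u)**2/2))   # (1, 3)   -- Kagan (chi-squared) generator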

Examples

The largest and most frequently used class of divergences is the class of so-called f-divergences; however, other types of divergence functions are also encountered in the literature.

f-divergences

This family of divergences is generated by functions f(u), convex on u > 0 and such that f(1) = 0. An f-divergence is then defined as


    D_f(p\parallel q) = \int p(x)f\bigg(\frac{q(x)}{p(x)}\bigg) dx

Particular cases include:

  • Kullback–Leibler divergence:
    D_\mathrm{KL}(p \parallel q) = \int p(x)\big( \ln p(x) - \ln q(x) \big) dx
  • squared Hellinger distance:
    H^2(p,\, q) = 2 \int \Big( \sqrt{p(x)} - \sqrt{q(x)}\, \Big)^2 dx
  • Jeffreys divergence:
    D_J(p \parallel q) = \int (p(x) - q(x))\big( \ln p(x) - \ln q(x) \big) dx
  • Chernoff's α-divergence:
    D^{(\alpha)}(p \parallel q) = \frac{4}{1-\alpha^2}\bigg(1 - \int p(x)^\frac{1-\alpha}{2} q(x)^\frac{1+\alpha}{2} dx \bigg)
  • exponential divergence:
    D_e(p \parallel q) = \int p(x)\big( \ln p(x) - \ln q(x) \big)^2 dx
  • Kagan's divergence:
    D_{\chi^2}(p \parallel q) = \frac12 \int \frac{(p(x) - q(x))^2}{p(x)} dx
  • (α,β)-product divergence:
    D_{\alpha,\beta}(p \parallel q) = \frac{2}{(1-\alpha)(1-\beta)} \int 
        \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\alpha}{2}} \Big) 
        \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\beta}{2}} \Big) 
        p(x) dx
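
For discrete distributions the integrals above become sums, and each of the named divergences can be recovered from the general formula with an appropriate generator f. A minimal numerical sketch (the specific distributions are illustrative only):

    # Sketch: D_f(p || q) = sum_x p(x) * f(q(x)/p(x)) for discrete p, q, checked
    # against the direct formulas for three of the divergences listed above.
    import numpy as np

    def f_divergence(p, q, f):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(p * f(q / p)))

    p = np.array([0.2, 0.3, 0.5])
    q = np.array([0.1, 0.4, 0.5])

    # Kullback-Leibler divergence: f(u) = -ln u
    kl = f_divergence(p, q, lambda u: -np.log(u))
    print(np.isclose(kl, np.sum(p * (np.log(p) - np.log(q)))))        # True

    # Squared Hellinger distance: f(u) = 2*(1 - sqrt(u))**2
    h2 = f_divergence(p, q, lambda u: 2*(1 - np.sqrt(u))**2)
    print(np.isclose(h2, 2*np.sum((np.sqrt(p) - np.sqrt(q))**2)))     # True

    # Kagan's divergence: f(u) = (1 - u)**2 / 2
    chi2 = f_divergence(p, q, lambda u: (1 - u)**2/2)
    print(np.isclose(chi2, 0.5*np.sum((p - q)**2/p)))                 # True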

M-divergences

S-divergences

See also

References

