Divergence (statistics)
In statistics and information geometry, a divergence or contrast function is a function that establishes the “distance” from one probability distribution to another on a statistical manifold. A divergence is a weaker notion than a distance in mathematics: in particular, it need not be symmetric (in general, the divergence from p to q is not equal to the divergence from q to p), and it need not satisfy the triangle inequality.
Definition
Suppose S is the space of all probability distributions with a common support. Then a divergence on S is a function D(· || ·): S×S → R satisfying[1]
- D(p || q) ≥ 0 for all p, q ∈ S,
- D(p || q) = 0 if and only if p = q,
- The matrix g(D) (see definition in the “geometrical properties” section) is strictly positive-definite everywhere on S.[2]
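The first two conditions, and the lack of symmetry, can be checked numerically on a small example. The following is a minimal sketch, assuming discrete distributions with strictly positive probabilities and using the Kullback-Leibler divergence as the example; the helper name `kl_divergence` and the vectors `p` and `q` are illustrative choices, not part of the definition.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    Assumes p and q are sequences of strictly positive probabilities over
    the same outcomes (common support), each summing to 1.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two hypothetical distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)

print(d_pq >= 0 and d_qp >= 0)      # non-negativity
print(kl_divergence(p, p) == 0.0)   # zero when both arguments coincide
print(abs(d_pq - d_qp) > 1e-6)      # the two directions differ in general
```

The third condition concerns second derivatives and is not tested by this sketch.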
The dual divergence D* is defined as
\[ D^*(p \,\|\, q) = D(q \,\|\, p). \]
Geometrical properties
Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution p ∈ S we can write p = p(θ).
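For a pair of points p, q ∈ S with coordinates θp and θq, denote the partial derivatives of D(p || q) as
\[
D((\partial_i)_p \,\|\, q) = \frac{\partial}{\partial\theta^i_p} D(p \,\|\, q), \qquad
D((\partial_i\partial_j)_p \,\|\, (\partial_k)_q) = \frac{\partial}{\partial\theta^i_p} \frac{\partial}{\partial\theta^j_p} \frac{\partial}{\partial\theta^k_q} D(p \,\|\, q), \quad \text{etc.}
\]
Now we restrict these functions to the diagonal p = q, and denote [3]
\[
D[\partial_i \,\|\, \cdot] : p \mapsto D((\partial_i)_p \,\|\, p), \qquad
D[\partial_i \,\|\, \partial_j] : p \mapsto D((\partial_i)_p \,\|\, (\partial_j)_p), \quad \text{etc.}
\]
By definition, the function D(p || q) is minimized at p = q, and therefore
\[
D[\partial_i \,\|\, \cdot] = D[\cdot \,\|\, \partial_i] = 0, \qquad
D[\partial_i\partial_j \,\|\, \cdot] = D[\cdot \,\|\, \partial_i\partial_j] = -D[\partial_i \,\|\, \partial_j] \equiv g^{(D)}_{ij},
\]
where the matrix g(D) is positive semi-definite because of this minimality and strictly positive-definite by the third condition of the definition, so that it defines a unique Riemannian metric on the manifold S.
The divergence D(· || ·) also defines a unique torsion-free affine connection ∇(D) with coefficients
\[ \Gamma^{(D)}_{ij,k} = -D[\partial_i\partial_j \,\|\, \partial_k], \]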
and the dual to this connection ∇* is generated by the dual divergence D*.
Thus, a divergence D(· || ·) generates a unique dualistic structure (g(D), ∇(D), ∇(D*)) on a statistical manifold. The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced by some globally defined divergence function (which, however, need not be unique).[4]
For example, when D is an f-divergence for some function ƒ(·), then it generates the metric g(Df) = c·g and the connection ∇(Df) = ∇(α), where g is the canonical Fisher information metric, ∇(α) is the α-connection, c = ƒ′′(1), and α = 3 + 2ƒ′′′(1)/ƒ′′(1).
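The relation between an f-divergence and the Fisher metric can be checked numerically in a one-parameter case. The sketch below assumes the Bernoulli family with parameter θ and the Kullback-Leibler divergence, which corresponds to f(u) = −ln u (so c = ƒ′′(1) = 1) under the convention D_f(p || q) = ∫ p f(q/p). It estimates the induced metric as minus the mixed second derivative of D on the diagonal, using a central finite difference. The function names and the step size `h` are illustrative assumptions.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence D(p_a || p_b) between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def induced_metric(divergence, theta, h=1e-3):
    """Estimate g = -(d/d theta_p)(d/d theta_q) D(p || q) at p = q = theta.

    Uses a central finite difference for the mixed partial derivative;
    valid only for a one-dimensional parameter.
    """
    mixed = (divergence(theta + h, theta + h)
             - divergence(theta + h, theta - h)
             - divergence(theta - h, theta + h)
             + divergence(theta - h, theta - h)) / (4 * h * h)
    return -mixed

theta = 0.3
g_numeric = induced_metric(kl_bernoulli, theta)
fisher = 1.0 / (theta * (1.0 - theta))  # Fisher information of Bernoulli(theta)

print(g_numeric, fisher)  # the two values should agree closely
```

For this choice of f the formula for α gives α = 3 + 2·(−2)/1 = −1, since ƒ′′′(1) = −2.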
Examples
The largest and most frequently used class of divergences is the class of so-called f-divergences; however, other types of divergence functions are also encountered in the literature.
f-divergences
Main article: f-divergence
This family of divergences is generated by functions f(u) that are convex on u > 0 and satisfy f(1) = 0. An f-divergence is then defined as
\[ D_f(p \,\|\, q) = \int p(x)\, f\!\left( \frac{q(x)}{p(x)} \right) dx. \]
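For discrete distributions the integral becomes a sum, and the construction can be sketched in a few lines of Python. The sketch assumes the convention D_f(p || q) = Σ p_i f(q_i / p_i) (some texts swap the roles of p and q) and strictly positive probabilities. The generator functions shown are illustrative choices that reproduce the Kullback-Leibler divergence, the Jeffreys divergence, and the squared Hellinger distance up to a constant factor, whose normalization varies between authors.

```python
import math

def f_divergence(p, q, f):
    """Discrete f-divergence: sum_i p_i * f(q_i / p_i).

    Assumes strictly positive probabilities over a common support.
    """
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

# Illustrative generators: each is convex on u > 0 with f(1) = 0.
def f_kl(u):
    return -math.log(u)                # Kullback-Leibler divergence

def f_jeffreys(u):
    return (u - 1.0) * math.log(u)     # Jeffreys divergence

def f_hellinger(u):
    return (math.sqrt(u) - 1.0) ** 2   # squared Hellinger distance (unnormalized)

# Two hypothetical distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

# Direct formulas, for comparison with the generic construction.
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
jeffreys_direct = sum((pi - qi) * (math.log(pi) - math.log(qi))
                      for pi, qi in zip(p, q))
hellinger_direct = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                       for pi, qi in zip(p, q))

print(math.isclose(f_divergence(p, q, f_kl), kl_direct))
print(math.isclose(f_divergence(p, q, f_jeffreys), jeffreys_direct))
print(math.isclose(f_divergence(p, q, f_hellinger), hellinger_direct))
```

Passing the generator as a function keeps one implementation for the whole family; only f changes between members.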
Members of this family include:
- Kullback-Leibler divergence
- squared Hellinger distance
- Jeffreys divergence
- Chernoff’s α-divergence
- exponential divergence
- Kagan’s divergence
- (α,β)-product divergence
M-divergences
S-divergences
References
- [1] Eguchi (1985)
- [2] Amari & Nagaoka (2000, chapter 3.2)
- [3] Eguchi (1992)
- [4] Matumoto (1993)
- Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of information geometry. Oxford University Press. ISBN 0-8218-0531-2.
- Eguchi, Shinto (1985). "A differential geometric approach to statistical inference on the basis of contrast functionals". Hiroshima mathematical journal 15 (2): 341–391. http://projecteuclid.org/euclid.hmj/1206130775.
- Eguchi, Shinto (1992). "Geometry of minimum contrast". Hiroshima mathematical journal 22 (3): 631–647. http://projecteuclid.org/euclid.hmj/1206128508.
- Matumoto, Takao (1993). "Any statistical manifold has a contrast function — on the C³-functions taking the minimum at the diagonal of the product manifold". Hiroshima mathematical journal 23 (2): 327–332. http://projecteuclid.org/euclid.hmj/1206128255.
Wikimedia Foundation. 2010.