# Estimation of covariance matrices

Estimation of covariance matrices

In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis of a sample from the multivariate distribution. Simple cases, where observations are complete, can be dealt with by using the sample covariance matrix. The sample covariance matrix is an unbiased estimator of the covariance matrix. In addition, if the random variable has normal distribution, the sample covariance matrix has Wishart distribution and a slightly differently scaled version of it is the maximum likelihood estimate. Cases involving missing data require deeper considerations. Statistical analyses of multivariate data often involve exploratory studies of the way in which the variables change in relation to one another and this may be followed up by explicit statistical models involving the covariance matrix of the variables. Thus the estimation of covariance matrices directly from observational data plays two roles::* to provide initial estimates that can be used to study the inter-relationships;:* to provide sample estimates that can be used for model checking.

Estimates of covariance matrices are required at the intial stages of principal component analysis and factor analysis, and are also involved in versions of regression analysis that treat the dependent variables in a data-set, jointly with the independent variable as the outcome of a random sample.

Estimation in a general context

Given a sample consisting of independent observations "x"1,..., "x""n" of a random vector "X" &isin; R"p"&times;1 (a "p"&times;1 column), an unbiased estimator of the ("p"×"p") covariance matrix

:$operatorname\left\{cov\right\}\left(X\right) = operatorname\left\{E\right\}\left(\left(X-operatorname\left\{E\right\}\left(X\right)\right)\left(X-operatorname\left\{E\right\}\left(X\right)\right)^T\right)$

is the sample covariance matrix

:$\left\{1 over \left\{n-1sum_\left\{i=1\right\}^n \left(x_i-overline\left\{x\right\}\right)\left(x_i-overline\left\{x\right\}\right)^T,$

where

:$overline\left\{x\right\}=\left\{1 over \left\{nsum_\left\{i=1\right\}^n x_i$

is the sample mean.This is true regardless of the distribution of the random vector "X", provided of course that the theoretical means and covariances exist. The reason for the factor "n" − 1 rather than "n" is essentially the same as the reason for the same factor appearing in unbiased estimates of sample variances and covariances, which relates to the fact that the mean is not known and is replaced by the sample mean.

In cases where the distribution of the random variable "X" is known to be within a certain family of distributions, other estimate may be derived on the basis of that assumption. A well-known instance is when the random variable "X" is normally distributed: in this case the maximum likelihood estimator of the covariance matrix is slightly different from the unbiased estimate, and is given by

:$\left\{1 over n\right\}sum_\left\{i=1\right\}^n \left(x_i-overline\left\{x\right\}\right)\left(x_i-overline\left\{x\right\}\right)^T.$

A derivation of this result is given below. Clearly, the difference between the unbiased estimator and the maximum likelihood estimator diminishes for large "n".

In the general case, the unbiased estimate of the covariance matrix provides an acceptable estimate when the data vectors in the observed data set are all complete: that is they contain no missing elements. One approach to estimating the covariance matrix is to treat the estimation of each variance or pairwise covariance separately, and to use all the observations for which both variables have valid values. Assuming the missing data are missing at random this results in an estimate for the covariance matrix which is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one.

Maximum-likelihood estimation for the multivariate normal distribution

A random vector "X" &isin; R"p"&times;1 (a "p"&times;1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix "&Sigma;" precisely if "&Sigma;" &isin; R"p" &times; "p" is a positive-definite matrix and the probability density function of "X" is

:$f\left(x\right)= \left[mathrm\left\{constant\right\}\right] cdot det\left(Sigma\right)^\left\{-1/2\right\} expleft\left(-\left\{1 over 2\right\} \left(x-mu\right)^T Sigma^\left\{-1\right\} \left(x-mu\right) ight\right)$

where &mu; &isin; R"p"&times;1 is the expected value. The matrix "&Sigma;" is the higher-dimensional analog of what in one dimension would be the variance.

Suppose now that "X"1, ..., "X""n" are independent and identically distributed with the distribution above. Based on the observed values "x"1, ..., "x""n" of this sample, we wish to estimate "&Sigma;". (Note that we adhere to the convention of writing random variables as capital letters and data as lower-case letters).

First steps

It is fairly readily shown that the maximum-likelihood estimate of the mean vector &mu; is the "sample mean" vector:

:$overline\left\{x\right\}=\left(x_1+cdots+x_n\right)/n.$

See the section on estimation in the article on the normal distribution for details; the process here is similar.

Since the estimate of &mu; does not depend on "&Sigma;", we can just substitute it for &mu; in the likelihood function

:$L\left(mu,Sigma\right)= \left[mathrm\left\{constant\right\}\right] cdot prod_\left\{i=1\right\}^n det\left(Sigma\right)^\left\{-1/2\right\} expleft\left(-\left\{1 over 2\right\} \left(x_i-mu\right)^T Sigma^\left\{-1\right\} \left(x_i-mu\right) ight\right)$

:$propto det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} sum_\left\{i=1\right\}^n \left(x_i-mu\right)^T Sigma^\left\{-1\right\} \left(x_i-mu\right) ight\right)$

and then seek the value of "&Sigma;" that maximizes this (in practice it is easier to work with $log L$).

We have

:$L\left(overline\left\{x\right\},Sigma\right) propto det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} sum_\left\{i=1\right\}^n \left(x_i-overline\left\{x\right\}\right)^T Sigma^\left\{-1\right\} \left(x_i-overline\left\{x\right\}\right) ight\right).$

The trace of a 1 &times; 1 matrix

Now we come to the first surprising step.

Regard the scalar $\left(x_i-overline\left\{x\right\}\right)^T Sigma^\left\{-1\right\} \left(x_i-overline\left\{x\right\}\right)$ as the trace of a 1&times;1 matrix!

This makes it possible to use the identity tr("AB") = tr("BA") whenever "A" and "B" are matrices so shaped that both products exist.We get

:$L\left(mu,Sigma\right)=det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} sum_\left\{i=1\right\}^n operatorname\left\{tr\right\}\left(\left(x_i-mu\right)^T Sigma^\left\{-1\right\} \left(x_i-mu\right)\right) ight\right)$

:$=det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} sum_\left\{i=1\right\}^n operatorname\left\{tr\right\}\left(\left(x_i-mu\right) \left(x_i-mu\right)^T Sigma^\left\{-1\right\}\right) ight\right)$

(so now we are taking the trace of a "p"&times;"p" matrix!)

:$=det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} operatorname\left\{tr\right\} left\left( sum_\left\{i=1\right\}^n \left(x_i-mu\right) \left(x_i-mu\right)^T Sigma^\left\{-1\right\} ight\right) ight\right)$

:$=det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} operatorname\left\{tr\right\} left\left( S Sigma^\left\{-1\right\} ight\right) ight\right)$

where

:$S=sum_\left\{i=1\right\}^n \left(x_i-overline\left\{x\right\}\right) \left(x_i-overline\left\{x\right\}\right)^T in mathbf\left\{R\right\}^\left\{p imes p\right\}.$$S$ is sometimes called the scatter matrix.

Using the spectral theorem

It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix "S" has a unique positive-definite symmetric square root "S"1/2. We can again use the "cyclic property" of the trace to write

:$det\left(Sigma\right)^\left\{-n/2\right\} expleft\left(-\left\{1 over 2\right\} operatorname\left\{tr\right\} left\left( S^\left\{1/2\right\} Sigma^\left\{-1\right\} S^\left\{1/2\right\} ight\right) ight\right).$

Let "B" = "S"1/2 "&Sigma;"−1 "S"1/2. Then the expression above becomes

:$det\left(S\right)^\left\{-n/2\right\} det\left(B\right)^\left\{n/2\right\} expleft\left(-\left\{1 over 2\right\} operatorname\left\{tr\right\} \left(B\right) ight\right).$

The positive-definite matrix "B" can be diagonalized, and then the problem of finding the value of "B" that maximizes

:$det\left(B\right)^\left\{n/2\right\} expleft\left(-\left\{1 over 2\right\} operatorname\left\{tr\right\} \left(B\right) ight\right)$

reduces to the problem of finding the values of the diagonal entries &lambda;1, ..., &lambda;"p" that maximize

:$lambda_i^\left\{n/2\right\} exp\left(-lambda_i/2\right).$

This is just a calculus problem and we get &lambda;"i" = "n", so that "B" = "n Ip", i.e., "n" times the "p"&times;"p" identity matrix.

Concluding steps

Finally we get

:$Sigma=S^\left\{1/2\right\} B^\left\{-1\right\} S^\left\{1/2\right\}=S^\left\{1/2\right\}\left(\left(1/n\right)I_p\right)S^\left\{1/2\right\}=S/n,,$

i.e., the "p"&times;"p" "sample covariance matrix"

:$\left\{S over n\right\} = \left\{1 over n\right\}sum_\left\{i=1\right\}^n \left(X_i-overline\left\{X\right\}\right)\left(X_i-overline\left\{X\right\}\right)^T$

is the maximum-likelihood estimator of the "population covariance matrix" "&Sigma;". At this point we are using a capital "X" rather than a lower-case "x" because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. The random matrix "S" can be shown to have a Wishart distribution with "n" − 1 degrees of freedom. [
K.V. Mardia, J.T. Kent, and J.M. Bibby (1979) "Multivariate Analysis", Academic Press.
] That is:

:$sum_\left\{i=1\right\}^n \left(X_i-overline\left\{X\right\}\right)\left(X_i-overline\left\{X\right\}\right)^T sim W_p\left(Sigma,n-1\right)$.

Alternative derivation

An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Re-write the likelihood in the log form using the trace trick:

:$operatorname\left\{ln\right\} L\left(mu,Sigma\right) = operatorname\left\{const\right\} -\left\{n over 2\right\} ln det\left(Sigma\right) -\left\{1 over 2\right\} operatorname\left\{tr\right\} left \left[ Sigma^\left\{-1\right\} sum_\left\{i=1\right\}^n \left(x_i-mu\right) \left(x_i-mu\right)^T ight\right] .$

The differential of this log-likelihood is

:$d ln L\left(mu,Sigma\right) = -\left\{n over 2\right\} operatorname\left\{tr\right\} left \left[ Sigma^\left\{-1\right\} left\left\{ d Sigma ight\right\} ight\right] -\left\{1 over 2\right\} operatorname\left\{tr\right\} left \left[ - Sigma^\left\{-1\right\} \left\{ d Sigma \right\} Sigma^\left\{-1\right\} sum_\left\{i=1\right\}^n \left(x_i-mu\right)\left(x_i-mu\right)^T - 2 Sigma^\left\{-1\right\} sum_\left\{i=1\right\}^n \left(x_i - mu\right) \left\{ d mu \right\}^T ight\right] .$

It naturally breaks down into the part related to the estimation of the mean, and to the part related to the estimation of the variance. The first order condition for maximum, $d ln L\left(mu,Sigma\right)=0$, is satisfied when the terms multiplying $d mu$ and $d Sigma$ are identically zero. Assuming (the maximum likelihood estimate of) $Sigma$ is non-singular, the first order condition for the estimate of the mean vector is

:$sum_\left\{i=1\right\}^n \left(x_i - mu\right) = 0,$

which leads to the maximum likelihood estimator

:

This lets us simplify as defined above. Then the terms involving $d Sigma$ in $d ln L$ can be combined as

:$-\left\{1 over 2\right\} operatorname\left\{tr\right\} left\left( Sigma^\left\{-1\right\} left\left\{ d Sigma ight\right\} left \left[ nI_p - Sigma^\left\{-1\right\} S ight\right] ight\right).$

The first order condition $d ln L\left(mu,Sigma\right)=0$ will hold when the term in the square bracket is (matrix-valued) zero. Pre-multiplying the latter by $Sigma$ and dividing by $n$ gives

:$widehat Sigma = \left\{1 over n\right\} S,$

which of course coincides with the canonical derivation given earlier.

Dwyer cite journal| title=Some applications of matrix derivatives in multivariate analysis| journal=Journal of the American Statistical Association| month=June| year=1967| volume=62| number=318 |pages=607–625| doi=10.2307/2283988| author=Dwyer, Paul S.] points out that decomposition into two terms such as appears above is "unnecessary" and derives the estimator in two lines of working.

hrinkage estimation

If the sample size "n" is small and the number of considered variables "p" is large, the above empirical estimators of covariance and correlation are very unstable. Specifically, it is possible to furnish estimators that improve considerably upon the maximum likelihood estimate in terms of mean squared error. Moreover, for "n" < "p", the empirical estimate of the covariance matrix becomes singular, i.e. it cannot be inverted to compute the precision matrix.

As an alternative, many methods have been suggested to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage. This is implicit in Bayesian methods, in penalized maximum likelihood methods, and explicit in the Stein-type shrinkage approach.

A simple version of a shrinkage estimator of the covariance matrix is constructed as follows. One considers a convex combination of the empirical estimator with some suitable chosen target, e.g., the diagonal matrix. Subsequently, the mixing parameteris selected to maximize the expected accuracy of the shrunken estimator. This can be done by cross-validation, or by using an analytic estimate of the shrinkage intensity. The resulting regularized estimator can be shown to outperform the maximum likelihood estimator for small samples. For large samples, the shrinkage intensity will reduce to zero, hence inthis case the shrinkage estimator will be identical to the empirical estimator. Apart from increased efficiency the shrinkage estimate has the additional advantage that it is always positive definite and well conditioned.

A review on this topic is given, e.g., in Schäfer and Strimmer 2005. [J. Schäfer and K. Strimmer (2005) " [http://www.bepress.com/sagmb/vol4/iss1/art32 A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics] ", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 32.]

A covariance shrinkage estimator is implemented in the R package [http://cran.r-project.org/web/packages/corpcor/index.html "corpcor"] .

ee also

*Sample mean and sample covariance

References

Wikimedia Foundation. 2010.

### Look at other dictionaries:

• Covariance matrix — A bivariate Gaussian probability density function centered at (0,0), with covariance matrix [ 1.00, .50 ; .50, 1.00 ] …   Wikipedia

• Sample mean and sample covariance — are statistics computed from a collection of data, thought of as being random.ample mean and covarianceGiven a random sample extstyle mathbf{x} {1},ldots,mathbf{x} {N} from an extstyle n dimensional random variable extstyle mathbf{X} (i.e.,… …   Wikipedia

• Unbiased estimation of standard deviation — In statistics, the standard deviation is often estimated from a random sample drawn from the population. The most common measure used is the sample standard deviation , which is defined by:s = sqrt{frac{1}{n 1} sum {i=1}^n (x i… …   Wikipedia

• Covariance — Pour le principe physique, voir Principe de covariance générale.  Ne pas confondre avec la covariance d un tenseur en algèbre ou en géométrie différentielle, ou d un foncteur en théorie des catégories. En théorie des probabilités et en… …   Wikipédia en Français

• Minimum distance estimation — (MDE) is a statistical method for fitting a mathematical model to data, usually the empirical distribution. Contents 1 Definition 2 Statistics used in estimation 2.1 Chi square criterion …   Wikipedia

• Matrice De Variance-Covariance — Une matrice de variance covariance est une matrice carrée caractérisant les interactions (linéaires) entre p variables aléatoires . Sommaire 1 Définition 2 Propriétés …   Wikipédia en Français

• Matrice de covariance — Matrice de variance covariance Une matrice de variance covariance est une matrice carrée caractérisant les interactions (linéaires) entre p variables aléatoires . Sommaire 1 Définition 2 Propriétés …   Wikipédia en Français

• Matrice de variance-covariance — Une matrice de variance covariance est une matrice carrée caractérisant les interactions (linéaires) entre p variables aléatoires . Sommaire 1 Définition 2 Propriétés 3 …   Wikipédia en Français

• Multivariate normal distribution — MVN redirects here. For the airport with that IATA code, see Mount Vernon Airport. Probability density function Many samples from a multivariate (bivariate) Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the… …   Wikipedia

• Wishart distribution — Probability distribution name =Wishart type =density pdf cdf parameters = n > 0! deg. of freedom (real) mathbf{V} > 0, scale matrix ( pos. def) support =mathbf{W}! is positive definite pdf =frac{left|mathbf{W} ight|^frac{n p 1}{2… …   Wikipedia