Estimation of covariance matrices

In multivariate statistics, the importance of the Wishart distribution stems in part from the fact that it is the probability distribution of the maximum likelihood estimator of the covariance matrix of a multivariate normal distribution. Although no one is surprised that the estimator of the population covariance matrix is simply the sample covariance matrix, the mathematical derivation is perhaps not widely known and is surprisingly subtle and elegant.

The multivariate normal distribution

A random vector X ∈ R^{p×1} (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix V precisely if V ∈ R^{p×p} is a positive-definite matrix and the probability density function of X is

f(x)=[\mathrm{constant}]\cdot \det(V)^{-1/2} \exp\left({-1 \over 2} (x-\mu)^T V^{-1} (x-\mu)\right)

where μ ∈ R^{p×1} is the expected value. The matrix V is the higher-dimensional analog of what in one dimension would be the variance.
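
The density can be evaluated numerically along the following lines. This is a minimal sketch using NumPy; the function name mvn_density is hypothetical, and the value (2π)^{-p/2} used for the "[constant]" is the standard normalizing constant, which the formula above does not spell out.

```python
import numpy as np

def mvn_density(x, mu, V):
    """Density of N(mu, V) at x, for a positive-definite p x p matrix V."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    V = np.asarray(V, dtype=float)
    p = x.shape[0]
    d = x - mu
    # Solve V z = d instead of forming V^{-1} explicitly.
    quad = d @ np.linalg.solve(V, d)
    # Assumption: (2*pi)^(-p/2) is the usual value of the "[constant]".
    const = (2.0 * np.pi) ** (-p / 2.0)
    return const * np.linalg.det(V) ** -0.5 * np.exp(-0.5 * quad)

# With p = 1, mu = 0 and V = 1 this is the standard normal density at 0.
value = mvn_density([0.0], [0.0], [[1.0]])
```

For a diagonal V the result should factor into a product of one-dimensional normal densities, which gives a quick sanity check.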

Maximum-likelihood estimation

Suppose now that X_1, ..., X_n are independent and identically distributed with the distribution above. Based on the observed values x_1, ..., x_n of this sample, we wish to estimate V (we adhere to the convention of writing random variables as capital letters and data as lower-case letters).

First steps

It is fairly readily shown that the maximum-likelihood estimate of the expected value μ is the "sample mean"

\overline{x}=(x_1+\cdots+x_n)/n.

See the section on estimation in the article on the normal distribution for details; the process here is similar.

Since the estimate of μ does not depend on V, we can just substitute it for μ in the likelihood function

L(\mu,V)=[\mathrm{constant}]\cdot \prod_{i=1}^n \det(V)^{-1/2} \exp\left({-1 \over 2} (x_i-\mu)^T V^{-1} (x_i-\mu)\right)
\propto \det(V)^{-n/2} \exp\left({-1 \over 2} \sum_{i=1}^n (x_i-\mu)^T V^{-1} (x_i-\mu) \right)


and then seek the value of V that maximizes this.

We have

L(\overline{x},V) \propto \det(V)^{-n/2} \exp\left({-1 \over 2} \sum_{i=1}^n (x_i-\overline{x})^T V^{-1} (x_i-\overline{x})\right).
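
In practice it is usually more convenient to work with the logarithm of this likelihood. The following is a rough sketch of how log L(x̄, V) could be computed up to an additive constant; the function name and the small demo array are illustrative assumptions, and the rows of x are taken to be the observations x_1, ..., x_n.

```python
import numpy as np

def log_likelihood_at_mean(x, V):
    """log L(x_bar, V) up to an additive constant; rows of x are x_1..x_n."""
    x = np.asarray(x, dtype=float)
    V = np.asarray(V, dtype=float)
    n = x.shape[0]
    x_bar = x.mean(axis=0)              # maximum-likelihood estimate of mu
    d = x - x_bar                       # row i is (x_i - x_bar)^T
    # Sum over i of the quadratic forms (x_i - x_bar)^T V^{-1} (x_i - x_bar).
    quad_sum = np.sum(d * np.linalg.solve(V, d.T).T)
    _, logdet = np.linalg.slogdet(V)    # log det(V), stable for small det
    return -0.5 * n * logdet - 0.5 * quad_sum

# Hypothetical two-point sample in dimension p = 2.
x_demo = np.array([[0.0, 0.0], [2.0, 0.0]])
ll = log_likelihood_at_mean(x_demo, np.eye(2))
```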


The trace of a 1 × 1 matrix

Now we come to the first surprising step.

Regard the scalar (x_i-\overline{x})^T V^{-1} (x_i-\overline{x}) as the trace of a 1×1 matrix!

This makes it possible to use the identity tr(AB) = tr(BA) whenever A and B are matrices so shaped that both products exist. We get

\det(V)^{-n/2} \exp\left({-1 \over 2} \sum_{i=1}^n \operatorname{tr}((x_i-\overline{x})^T V^{-1} (x_i-\overline{x})) \right)
=\det(V)^{-n/2} \exp\left({-1 \over 2} \sum_{i=1}^n \operatorname{tr}((x_i-\overline{x}) (x_i-\overline{x})^T V^{-1}) \right)

(so now we are taking the trace of a p×p matrix!)

=\det(V)^{-n/2} \exp\left({-1 \over 2} \operatorname{tr} \left( \sum_{i=1}^n (x_i-\overline{x}) (x_i-\overline{x})^T V^{-1} \right) \right)
=\det(V)^{-n/2} \exp\left({-1 \over 2} \operatorname{tr} \left( S V^{-1} \right) \right)

where

S=\sum_{i=1}^n (x_i-\overline{x}) (x_i-\overline{x})^T \in \mathbf{R}^{p\times p}.
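
The trace identity used here is easy to check numerically. The sketch below draws arbitrary data and an arbitrary positive-definite V (the seed, sizes and variable names are all assumptions made for illustration) and compares the sum of the n scalar quadratic forms with the trace of S V^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
n, p = 20, 3                            # hypothetical sample size and dimension
x = rng.normal(size=(n, p))             # rows play the role of x_1, ..., x_n
A = rng.normal(size=(p, p))
V = A @ A.T + p * np.eye(p)             # an arbitrary positive-definite matrix

d = x - x.mean(axis=0)                  # row i is (x_i - x_bar)^T
V_inv = np.linalg.inv(V)

# Left side: sum of the n scalar quadratic forms.
sum_quad = sum(d_i @ V_inv @ d_i for d_i in d)

# Right side: trace of the p x p matrix S V^{-1}.
S = d.T @ d                             # sum_i (x_i - x_bar)(x_i - x_bar)^T
trace_form = np.trace(S @ V_inv)

print(np.isclose(sum_quad, trace_form))
```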

Using the spectral theorem

It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix S has a unique positive-definite symmetric square root S^{1/2}. (The matrix S defined above is positive definite with probability 1 when n > p.) We can again use the "cyclic property" of the trace to write

\det(V)^{-n/2} \exp\left({-1 \over 2} \operatorname{tr} \left( S^{1/2} V^{-1} S^{1/2} \right) \right).

Let B = S^{1/2} V^{-1} S^{1/2}. Since det(B) = det(S) det(V)^{-1}, the expression above becomes

\det(S)^{-n/2} \det(B)^{n/2} \exp\left({-1 \over 2} \operatorname{tr} (B) \right).
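
The change of variables can be checked numerically. The following sketch builds the symmetric square root from an eigendecomposition, as the spectral theorem suggests, and compares the two forms of the expression on a log scale to avoid underflow. The helper name sym_sqrt and the test data are assumptions made for illustration.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric positive-definite square root via the spectral decomposition."""
    w, Q = np.linalg.eigh(S)            # S = Q diag(w) Q^T with all w > 0
    return (Q * np.sqrt(w)) @ Q.T       # Q diag(sqrt(w)) Q^T

rng = np.random.default_rng(1)          # arbitrary seed
n, p = 15, 3                            # hypothetical sizes with n > p
d = rng.normal(size=(n, p))
d -= d.mean(axis=0)                     # centered data
S = d.T @ d
A = rng.normal(size=(p, p))
V = A @ A.T + p * np.eye(p)             # an arbitrary positive-definite matrix

R = sym_sqrt(S)
V_inv = np.linalg.inv(V)
B = R @ V_inv @ R

# log of det(V)^{-n/2} exp(-tr(S V^{-1}) / 2)
log_lhs = -0.5 * n * np.linalg.slogdet(V)[1] - 0.5 * np.trace(S @ V_inv)
# log of det(S)^{-n/2} det(B)^{n/2} exp(-tr(B) / 2)
log_rhs = (-0.5 * n * np.linalg.slogdet(S)[1]
           + 0.5 * n * np.linalg.slogdet(B)[1]
           - 0.5 * np.trace(B))
```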

The positive-definite matrix B can be diagonalized. Its determinant is the product of its eigenvalues λ_1, ..., λ_p and its trace is their sum, so the problem of finding the value of B that maximizes

\det(B)^{n/2} \exp\left({-1 \over 2} \operatorname{tr} (B) \right)

reduces to the problem of finding, for each i, the value of λ_i that maximizes

\lambda_i^{n/2} \exp(-\lambda_i/2).

This is just a calculus problem and we get λ_i = n for every i, so that B = n I_p, i.e., n times the p×p identity matrix.
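
The calculus step can be spelled out by maximizing the logarithm of one factor; the name g is introduced here only for illustration. For λ > 0 put

g(\lambda)=\log\left(\lambda^{n/2} \exp(-\lambda/2)\right)={n \over 2}\log\lambda-{\lambda \over 2}.

Then

g'(\lambda)={n \over 2\lambda}-{1 \over 2}=0 \quad\Longleftrightarrow\quad \lambda=n,
g''(\lambda)=-{n \over 2\lambda^2}<0,

so g is strictly concave and λ = n is its unique maximizer.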

Concluding steps

Finally we get

V = S^{1/2} B^{-1} S^{1/2} = S^{1/2} \left({1 \over n} I_p\right) S^{1/2} = {S \over n},

i.e., the p×p "sample covariance matrix"

{S \over n} = {1 \over n}\sum_{i=1}^n (X_i-\overline{X})(X_i-\overline{X})^T

is the maximum-likelihood estimator of the "population covariance matrix" V. At this point we are using a capital X rather than a lower-case x because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. This random matrix can be shown to have a Wishart distribution with n − 1 degrees of freedom.
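
As a rough numerical illustration of the result, the sketch below computes S/n for simulated data, compares it with NumPy's np.cov using bias=True (which also divides by n), and checks that a few nearby positive-definite candidates give a lower likelihood. The function names, the seed and the "true" covariance matrix are assumptions chosen for the example.

```python
import numpy as np

def mle_covariance(x):
    """Maximum-likelihood estimate S / n; rows of x are the observations."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    d = x - x.mean(axis=0)
    return d.T @ d / n

def log_lik(x, V):
    """log L(x_bar, V) up to an additive constant, using tr(S V^{-1})."""
    n = x.shape[0]
    d = x - x.mean(axis=0)
    S = d.T @ d
    # tr(V^{-1} S) equals tr(S V^{-1}) by the cyclic property of the trace.
    return -0.5 * n * np.linalg.slogdet(V)[1] - 0.5 * np.trace(np.linalg.solve(V, S))

rng = np.random.default_rng(2)                  # arbitrary seed
true_V = np.array([[2.0, 0.5], [0.5, 1.0]])     # hypothetical population covariance
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_V, size=200)

V_hat = mle_covariance(x)

# np.cov with bias=True normalizes by n rather than n - 1.
agrees_with_numpy = np.allclose(V_hat, np.cov(x, rowvar=False, bias=True))

# Likelihood at the MLE versus a few perturbed candidates.
best = log_lik(x, V_hat)
others = [log_lik(x, V_hat + eps * np.eye(2)) for eps in (-0.05, 0.05, 0.2)]
```

Note that np.cov divides by n − 1 by default, which gives the unbiased estimator rather than the maximum-likelihood one.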

The contents of this article are licensed from Wikipedia.org under the
GNU Free Documentation License.