$\newcommand{\trace}{\operatorname{trace}}$I recently came across a deduction I couldn't follow. It concerns the maximum likelihood estimate of the covariance matrix for a multivariate Gaussian distribution. At first most of it seems straightforward: pull the exponent down out of the exponential, that kind of thing. But then I noticed the trace and the determinant, and now all bets are off; the more I look, the less I feel like I understand. It goes like this:
\begin{align} 0& =\nabla_\Sigma\sum_{i=1}^n-\frac12\ln((2\pi)^d\det(\Sigma))-\frac12(x_i-\mu)^T\Sigma^{-1}(x_i-\mu) \\ & =-\frac n2\nabla_\Sigma \ln(\det(\Sigma))-\frac12\nabla_\Sigma \trace \left( \Sigma^{-1}\sum_{i=1}^n(x_i-\mu)(x_i-\mu)^T\right) \\ & =-\frac n2\Sigma^{-1}+\frac12\Sigma^{-2}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T \end{align}
Here $\nabla_\Sigma$ denotes the gradient with respect to $\Sigma$.
I've managed to come to terms with some parts of it while writing this question. I'm writing them out here for my own sake and for any fellow classmates.
The first and most obvious issue is the disappearing constant in
$$\nabla_\Sigma\frac{-n}2\ln((2\pi)^d\det(\Sigma))=-\frac n2\nabla_\Sigma \ln(\det(\Sigma))$$
However, seeing as
$$\frac{d}{dx} \ln(ax)=\frac{1}{ax}\cdot a=\frac1x$$
I am willing to assume I understand what's going on: the multiplicative constant $(2\pi)^d$ vanishes under differentiation, just like $a$ does above, so dropping it is not actually a mistake.
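To reassure myself, I ran a quick NumPy check of my own (not part of the original derivation): compare finite-difference gradients of $\ln((2\pi)^d\det(\Sigma))$ and $\ln(\det(\Sigma))$ on a random symmetric positive-definite $\Sigma$.

```python
import numpy as np

# My own sanity check: the constant (2*pi)^d only shifts the log,
# so it should vanish under differentiation.
d = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # a random SPD matrix

def num_grad(f, S, h=1e-6):
    """Entrywise central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(S)
    for i in range(d):
        for j in range(d):
            E = np.zeros_like(S)
            E[i, j] = h
            G[i, j] = (f(S + E) - f(S - E)) / (2 * h)
    return G

g_with = num_grad(lambda S: np.log((2 * np.pi) ** d * np.linalg.det(S)), Sigma)
g_without = num_grad(lambda S: np.log(np.linalg.det(S)), Sigma)
print(np.allclose(g_with, g_without, atol=1e-6))  # prints True
```

The two gradients agree to numerical precision, as expected.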
What's profoundly confusing, however, is that
$$\nabla_\Sigma \ln(\det(\Sigma)) = \Sigma^{-1}$$
We start with a scalar and end up with a matrix. I gather the convention is that the gradient with respect to a matrix is itself a matrix, with $(\nabla_\Sigma f)_{ij}=\partial f/\partial\Sigma_{ij}$, so at least the shapes match.
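I did at least convince myself numerically that the identity holds, assuming the $(i,j)$ entry of $\nabla_\Sigma f$ means $\partial f/\partial\Sigma_{ij}$ (my own check, not a proof):

```python
import numpy as np

# Numerical check of grad_Sigma ln det(Sigma) = Sigma^{-1} for a
# symmetric positive-definite Sigma, taking the (i, j) entry of the
# gradient to be the partial derivative with respect to Sigma_ij.
d = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # random SPD matrix

h = 1e-6
G = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = h
        G[i, j] = (np.log(np.linalg.det(Sigma + E))
                   - np.log(np.linalg.det(Sigma - E))) / (2 * h)

print(np.allclose(G, np.linalg.inv(Sigma), atol=1e-5))  # prints True
```

So the matrix of partial derivatives really does come out as $\Sigma^{-1}$, at least for symmetric $\Sigma$; I just don't see why.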
Another thing: the replacement
$$(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)=\trace(\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T)$$
also confused the **** out of me. Then I noticed that we switched from an inner product $u^Tu$ to an outer product $uu^T$.
I tried a calculation in two dimensions and found the following:
$$\left[\begin{matrix} x & y \end{matrix}\right]\left[\begin{matrix} a & b \\ c & d \end{matrix}\right]\left[\begin{matrix} x \\ y \end{matrix}\right]=ax^2+(b+c)xy+dy^2$$
And at the same time we have
\begin{align} & \trace \left( \left[\begin{matrix} a & b \\ c & d \\ \end{matrix}\right]\left[\begin{matrix} x \\ y \end{matrix}\right]\left[\begin{matrix} x & y \end{matrix}\right]\right) = \trace\left(\left[\begin{matrix} a & b \\ c & d \end{matrix}\right]\left[\begin{matrix} x^2 &xy \\ xy & y^2 \end{matrix}\right]\right) \\[10pt] = {} & \trace\left(\left[\begin{matrix} ax^2 + bxy & axy + by^2 \\ cx^2 + dxy & cxy + dy^2 \end{matrix}\right]\right) = ax^2 +(b+c)xy + dy^2 \end{align}
I know this isn't a proof, but it's good enough for me for now.
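Generalizing my two-dimensional hand calculation, here is a quick numerical check of $u^TAu=\trace(Auu^T)$ in a few dimensions (again a sanity check of my own, not a proof):

```python
import numpy as np

# Check u^T A u = trace(A u u^T) for random A and u in several dimensions.
rng = np.random.default_rng(2)
checks = []
for d in (2, 3, 7):
    A = rng.standard_normal((d, d))
    u = rng.standard_normal(d)
    quad = u @ A @ u                    # the scalar u^T A u
    tr = np.trace(A @ np.outer(u, u))   # trace of A times the outer product
    checks.append(np.isclose(quad, tr))
print(all(checks))  # prints True
```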
The question that remains, however, is how
$$-\nabla_\Sigma \trace(\Sigma^{-1}) = \Sigma^{-2}$$
Again we go from a scalar to a matrix. It must be that I am lacking some theory; any help in this regard would be great. Even better, if you know of focused resources covering the parts I am missing, that would be awesome. I am a bit low on time these months, so the emphasis is on focused.
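In the same spirit as my two-dimensional check above, I did verify this last identity numerically for a symmetric positive-definite $\Sigma$ (so I believe it's true; I just can't derive it):

```python
import numpy as np

# Numerically test -grad_Sigma trace(Sigma^{-1}) = Sigma^{-2}
# for a random symmetric positive-definite Sigma, entry by entry.
d = 3
rng = np.random.default_rng(3)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # random SPD matrix

h = 1e-6
G = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = h
        G[i, j] = (np.trace(np.linalg.inv(Sigma + E))
                   - np.trace(np.linalg.inv(Sigma - E))) / (2 * h)

Sigma_inv = np.linalg.inv(Sigma)
print(np.allclose(-G, Sigma_inv @ Sigma_inv, atol=1e-5))  # prints True
```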
Regardless, any help offered would be greatly appreciated.