$\newcommand{\trace}{\operatorname{trace}}$I recently came across a deduction I couldn't follow. It concerns the maximum likelihood estimate of the covariance matrix for a multivariate Gaussian distribution. At first most of it seems straightforward: pull the exponent down out of the exponential, that kind of thing. But then I noticed the trace and the determinant, and now all bets are off; the more I look, the less I feel like I understand. It goes like this:
\begin{align} 0& =\nabla_\Sigma\sum_{i=1}^n-\frac12\ln((2\pi)^d\det(\Sigma))-\frac12(x_i-\mu)^T\Sigma^{-1}(x_i-\mu) \\ & =-\frac n2\nabla_\Sigma \ln(\det(\Sigma))-\frac12\nabla_\Sigma \trace \left( \Sigma^{-1}\sum_{i=1}^n(x_i-\mu)(x_i-\mu)^T\right) \\ & =-\frac n2\Sigma^{-1}+\frac12\Sigma^{-2}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T \end{align}
Here $\nabla_\Sigma$ denotes the gradient with respect to $\Sigma$.
I've managed to come to terms with some parts of it while writing this question. I'm writing them out here for my own sake and for any fellow classmates.
The first and most obvious issue is the disappearing constant in
$$\nabla_\Sigma\frac{-n}2\ln((2\pi)^d\det(\Sigma))=-\frac n2\nabla_\Sigma \ln(\det(\Sigma))$$
However, seeing as
$$\frac{d}{dx} \ln(ax)=\frac{1}{ax}\cdot a=\frac1x$$
I am willing to assume I understand what's going on: the multiplicative constant $(2\pi)^d$ vanishes under differentiation, just like $a$ does above, so dropping it is not actually a mistake.
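To reassure myself, I ran a quick NumPy check of my own (not part of the original derivation): compare finite-difference gradients of $\ln((2\pi)^d\det(\Sigma))$ and $\ln(\det(\Sigma))$ on a random symmetric positive-definite $\Sigma$.

```python
import numpy as np

# My own sanity check: the constant (2*pi)^d only shifts the log,
# so it should vanish under differentiation.
d = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # a random SPD matrix

def num_grad(f, S, h=1e-6):
    """Entrywise central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(S)
    for i in range(d):
        for j in range(d):
            E = np.zeros_like(S)
            E[i, j] = h
            G[i, j] = (f(S + E) - f(S - E)) / (2 * h)
    return G

g_with = num_grad(lambda S: np.log((2 * np.pi) ** d * np.linalg.det(S)), Sigma)
g_without = num_grad(lambda S: np.log(np.linalg.det(S)), Sigma)
print(np.allclose(g_with, g_without, atol=1e-6))  # prints True
```

The two gradients agree to numerical precision, as expected.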
What's profoundly confusing, however, is that
$$\nabla_\Sigma \ln(\det(\Sigma)) = \Sigma^{-1}$$
We start with a scalar and end up with a matrix. I gather the convention is that the gradient with respect to a matrix is itself a matrix, with $(\nabla_\Sigma f)_{ij}=\partial f/\partial\Sigma_{ij}$, so at least the shapes match.
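I did at least convince myself numerically that the identity holds, assuming the $(i,j)$ entry of $\nabla_\Sigma f$ means $\partial f/\partial\Sigma_{ij}$ (my own check, not a proof):

```python
import numpy as np

# Numerical check of grad_Sigma ln det(Sigma) = Sigma^{-1} for a
# symmetric positive-definite Sigma, taking the (i, j) entry of the
# gradient to be the partial derivative with respect to Sigma_ij.
d = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # random SPD matrix

h = 1e-6
G = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = h
        G[i, j] = (np.log(np.linalg.det(Sigma + E))
                   - np.log(np.linalg.det(Sigma - E))) / (2 * h)

print(np.allclose(G, np.linalg.inv(Sigma), atol=1e-5))  # prints True
```

So the matrix of partial derivatives really does come out as $\Sigma^{-1}$, at least for symmetric $\Sigma$; I just don't see why.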
Another thing: the replacement
$$(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)=\trace(\Sigma^{-1}(x_i-\mu)(x_i-\mu)^T)$$
also confused the **** out of me. Then I noticed that we switched from an inner product $u^Tu$ to an outer product $uu^T$.
I tried a calculation in two dimensions and found the following:
$$\left[\begin{matrix} x & y \end{matrix}\right]\left[\begin{matrix} a & b \\ c & d \end{matrix}\right]\left[\begin{matrix} x \\ y \end{matrix}\right]=ax^2+(b+c)xy+dy^2$$
And at the same time we have
\begin{align} & \trace \left( \left[\begin{matrix} a & b \\ c & d \\ \end{matrix}\right]\left[\begin{matrix} x \\ y \end{matrix}\right]\left[\begin{matrix} x & y \end{matrix}\right]\right) = \trace\left(\left[\begin{matrix} a & b \\ c & d \end{matrix}\right]\left[\begin{matrix} x^2 &xy \\ xy & y^2 \end{matrix}\right]\right) \\[10pt] = {} & \trace\left(\left[\begin{matrix} ax^2 + bxy & axy + by^2 \\ cx^2 + dxy & cxy + dy^2 \end{matrix}\right]\right) = ax^2 +(b+c)xy + dy^2 \end{align}
I know this isn't a proof, but it's good enough for me for now.
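Generalizing my two-dimensional hand calculation, here is a quick numerical check of $u^TAu=\trace(Auu^T)$ in a few dimensions (again a sanity check of my own, not a proof):

```python
import numpy as np

# Check u^T A u = trace(A u u^T) for random A and u in several dimensions.
rng = np.random.default_rng(2)
checks = []
for d in (2, 3, 7):
    A = rng.standard_normal((d, d))
    u = rng.standard_normal(d)
    quad = u @ A @ u                    # the scalar u^T A u
    tr = np.trace(A @ np.outer(u, u))   # trace of A times the outer product
    checks.append(np.isclose(quad, tr))
print(all(checks))  # prints True
```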
The question that remains, however, is how
$$-\nabla_\Sigma \trace(\Sigma^{-1}) = \Sigma^{-2}$$
Again we go from a scalar to a matrix. It must be that I am lacking some theory; any help in this regard would be great. Even better, if you know of focused resources covering the parts I am missing, that would be awesome. I am a bit low on time these months, so the emphasis is on focused.
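In the same spirit as my two-dimensional check above, I did verify this last identity numerically for a symmetric positive-definite $\Sigma$ (so I believe it's true; I just can't derive it):

```python
import numpy as np

# Numerically test -grad_Sigma trace(Sigma^{-1}) = Sigma^{-2}
# for a random symmetric positive-definite Sigma, entry by entry.
d = 3
rng = np.random.default_rng(3)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # random SPD matrix

h = 1e-6
G = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = h
        G[i, j] = (np.trace(np.linalg.inv(Sigma + E))
                   - np.trace(np.linalg.inv(Sigma - E))) / (2 * h)

Sigma_inv = np.linalg.inv(Sigma)
print(np.allclose(-G, Sigma_inv @ Sigma_inv, atol=1e-5))  # prints True
```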
Regardless, any help offered would be greatly appreciated.