
I am studying Maximum Likelihood Estimation, and to do inference with an MLE I need to know its variance. To find that variance, I need the Cramér–Rao Lower Bound, which looks like a Hessian matrix of second derivatives describing the curvature of the log-likelihood. I am mixed up about the relationship between the covariance matrix and the Hessian matrix. I hope to hear some explanation of this; a simple example would be appreciated.


1 Answer


You should first check out this: Basic question about Fisher Information matrix and relationship to Hessian and standard errors.

Suppose we have a statistical model (family of distributions) $\{f_{\theta}: \theta \in \Theta\}$. In the most general case we have $\mathrm{dim}(\Theta) = d$, so this family is parameterized by $\theta = (\theta_1, \dots, \theta_d)^T$. Under certain regularity conditions, we have

$$I_{i,j}(\theta) = -E_{\theta}\left[\frac{\partial^2 l(X; \theta)}{\partial\theta_i\partial\theta_j}\right] = -E_\theta\left[H_{i,j}(l(X;\theta))\right],$$

where $I_{i,j}(\theta)$ is the $(i,j)$ entry of the Fisher Information matrix (viewed as a function of $\theta$), $X$ is the observed value (the sample), and $l$ is the log-likelihood

$$l(X; \theta) = \ln(f_{\theta}(X)),\text{ for some } \theta \in \Theta.$$

So the Fisher Information matrix is the negated expected value of the Hessian of the log-likelihood, evaluated under some $\theta$.
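For a concrete illustration, take the simplest possible case I can think of (a single Bernoulli observation, so $d = 1$ and $\theta = p$):

$$l(X; p) = X\ln p + (1 - X)\ln(1 - p), \qquad \frac{\partial^2 l(X; p)}{\partial p^2} = -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2},$$

and taking the negated expectation (using $E_p[X] = p$) gives

$$I(p) = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}.$$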

Now let's say we want to estimate some vector-valued function $\psi(\theta)$ of the unknown parameter. Usually we want the estimator $T(X) = (T_1(X), \dots, T_d(X))$ to be unbiased, i.e.

$$\forall_{\theta \in \Theta}\ E_{\theta}[T(X)] = \psi(\theta).$$

The Cramér–Rao Lower Bound states that for every unbiased $T(X)$, the covariance matrix $\mathrm{cov}_{\theta}(T(X))$ satisfies

$$\mathrm{cov}_{\theta}(T(X)) \ge \frac{\partial\psi(\theta)}{\partial\theta}I^{-1}(\theta)\left(\frac{\partial\psi(\theta)}{\partial\theta}\right)^T = B(\theta),$$

where the notation $A \ge B$ for matrices $A, B$ means that $A - B$ is positive semi-definite. Further, $\frac{\partial\psi(\theta)}{\partial\theta}$ denotes the Jacobian matrix $J_{i,j}(\psi)$. Note that if we estimate $\theta$ itself, that is $\psi(\theta) = \theta$, the above simplifies to

$$\mathrm{cov}_{\theta}(T(X)) \ge I^{-1}(\theta).$$
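As a simple illustration of this special case (again a standard textbook example, with the model chosen just for concreteness): let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known and $\theta = \mu$. The log-likelihood is

$$l(X; \mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 + \text{const}, \qquad \frac{\partial^2 l(X; \mu)}{\partial\mu^2} = -\frac{n}{\sigma^2},$$

so $I(\mu) = n/\sigma^2$ and the bound says $\mathrm{var}_\mu(T(X)) \ge \sigma^2/n$ for any unbiased estimator of $\mu$. The sample mean $\bar{X}$ is unbiased with $\mathrm{var}_\mu(\bar{X}) = \sigma^2/n$, so it attains the bound exactly.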

But what does this really tell us? For example, recall that

$$\mathrm{var}_{\theta}(T_i(X)) = [\mathrm{cov}_{\theta}(T(X))]_{i,i}$$

and that for every positive semi-definite matrix $A$ the diagonal elements are non-negative:

$$\forall_i\ A_{i,i} \ge 0.$$

Applying this to the positive semi-definite matrix $\mathrm{cov}_{\theta}(T(X)) - B(\theta)$, we conclude that the variance of each component of the estimator is bounded below by the corresponding diagonal element of $B(\theta)$:

$$\forall_i\ \mathrm{var}_{\theta}(T_i(X)) \ge [B(\theta)]_{i,i}.$$

So the CRLB does not tell us the variance of our estimator; rather, it gives a lower bound against which we can check whether our estimator is optimal, i.e., whether it attains the lowest possible covariance among all unbiased estimators.
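If it helps to see actual numbers, below is a minimal simulation sketch in Python (using numpy; it reuses the Bernoulli example from above, with `p_true`, `n`, and the finite-difference step chosen arbitrarily). It checks that the empirical variance of the MLE $\hat{p} = \bar{X}$ matches the CRLB $p(1-p)/n$, and that the negated average Hessian of the log-likelihood approximates $I(p) = 1/(p(1-p))$.

```python
import numpy as np

# Bernoulli(p) model with n i.i.d. observations.
# Analytic Fisher information per observation: I(p) = 1 / (p*(1-p)),
# so the CRLB for estimating p from n observations is p*(1-p)/n.
rng = np.random.default_rng(0)
p_true, n, n_rep = 0.3, 200, 5000

# Empirical variance of the MLE (the sample proportion) over many replications.
p_hat = rng.binomial(n, p_true, size=n_rep) / n
print("empirical var of MLE :", p_hat.var())
print("CRLB  p(1-p)/n       :", p_true * (1 - p_true) / n)

# Negated expected Hessian of the log-likelihood, approximated by averaging
# a central finite-difference second derivative of l(x; p) over simulated data.
def loglik(x, p):
    return x * np.log(p) + (1 - x) * np.log(1 - p)

x = rng.binomial(1, p_true, size=100_000)
h = 1e-4
d2 = (loglik(x, p_true + h) - 2 * loglik(x, p_true) + loglik(x, p_true - h)) / h**2
print("-E[Hessian] (numeric):", -d2.mean())
print("I(p) = 1/(p(1-p))    :", 1 / (p_true * (1 - p_true)))
```

The two pairs of printed numbers should agree up to simulation noise, illustrating both the Hessian/Fisher-information relationship and the fact that the MLE attains the bound in this particular model.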

  • I appreciate your explanation here. I am not really a math person, but I am on my way to learning the math seriously. However, it still looks too abstract to me. I hope there is a gentle example with simple numbers; that would definitely help me understand it.
    – user122358
    Commented Feb 15, 2017 at 4:07
