
Suppose we have a random variable $X \sim f(x|\theta)$. If $\theta_0$ were the true parameter, the likelihood function should be maximized and its derivative equal to zero. This is the basic principle behind the maximum likelihood estimator.

As I understand it, Fisher information is defined as

$$I(\theta) = \Bbb E_\theta \Bigg[\left(\frac{\partial}{\partial \theta}\log f(X|\theta)\right)^2\Bigg ]$$

Thus, if $\theta_0$ is the true parameter, $I(\theta) = 0$. But if $\theta_0$ is not the true parameter, then we will have a larger amount of Fisher information.

My questions:

  1. Does Fisher information measure the "error" of a given MLE? In other words, doesn't the existence of positive Fisher information imply my MLE can't be ideal?
  2. How does this definition of "information" differ from that used by Shannon? Why do we call it information?
  • Why do you write it $E_\theta$? The expectation is over values of $X$ distributed as if they came from your distribution with parameter $\theta$.
    – Neil G
    Commented Feb 14, 2016 at 21:56
  • Also $I(\theta)$ is not zero at the true parameter.
    – Neil G
    Commented Feb 14, 2016 at 21:57
  • The E(S) is zero (i.e., the expectation of the score function), but, as Neil G wrote, the Fisher information (V(S)) is not (usually) zero.
    – Tal Galili
    Commented Mar 17, 2016 at 22:27
  • The term 'Fisher information' is misleading. It is not a type of information. It would be better to speak about the 'Fisher information matrix'; it describes the information and its relation to the parameters of the distribution. Related: Conflicting Definition of Information in Statistics | Fisher Vs Shannon. Commented May 26, 2023 at 23:36
  • This post points out a misleading notation in stats.stackexchange.com/a/197471/295619. Due to the lack of reputation in this community, I cannot comment. All expectations in this post should be taken over $X$ instead of $\theta$. To avoid this misunderstanding, I suggest the author of the post replace all $\mathbb{E}_{\theta}$ with $\mathbb{E}_{x\sim f(x,\theta)}$ and discard the $\dot{\ell}$ and $\ddot{\ell}$ notations. Commented Apr 10 at 15:49

3 Answers


Trying to complement the other answers... What kind of information is Fisher information? Start with the loglikelihood function $$ \ell (\theta) = \log f(x;\theta) $$ as a function of $\theta$ for $\theta \in \Theta$, the parameter space. Assuming some regularity conditions we do not discuss here, we have $\DeclareMathOperator{\E}{\mathbb{E}} \E_\theta \frac{\partial}{\partial \theta} \ell (\theta) = \E_\theta \dot{\ell}(\theta) = 0$ (we write derivatives with respect to the parameter as dots, as here). The variance of the score is the Fisher information $$ I(\theta) = \E_\theta ( \dot{\ell}(\theta) )^2= -\E_\theta \ddot{\ell}(\theta), $$ the last formula showing that it is the (negative) expected curvature of the loglikelihood function.

One often finds the maximum likelihood estimator (mle) of $\theta$ by solving the likelihood equation $\dot{\ell}(\theta)=0$. When the Fisher information, as the variance of the score $\dot{\ell}(\theta)$, is large, the solution to that equation will be very sensitive to the data, giving hope for high precision of the mle. That is confirmed at least asymptotically, the asymptotic variance of the mle being the inverse of the Fisher information.
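To make the identity $I(\theta) = \E_\theta(\dot{\ell}(\theta))^2 = -\E_\theta \ddot{\ell}(\theta)$ concrete, here is a small Monte Carlo sketch (my own illustration, not part of the original answer) for an exponential model with rate $\theta$, where $\dot{\ell}(\theta) = 1/\theta - x$ and $\ddot{\ell}(\theta) = -1/\theta^2$; it assumes NumPy is available.

```python
# Minimal sketch: check E[score] = 0 and Var[score] = -E[curvature] = I(theta0)
# for an Exponential(rate = theta) model, simulating data at the true theta0.
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                                              # true parameter
x = rng.exponential(scale=1 / theta0, size=1_000_000)     # draws from f(x; theta0)

score = 1 / theta0 - x          # d/dtheta log f(x; theta), evaluated at theta0
curvature = -1 / theta0**2      # d^2/dtheta^2 log f(x; theta), constant in x here

print(score.mean())             # ~ 0     : the expected score vanishes at theta0
print(score.var())              # ~ 0.25  : Var[score] = I(theta0) = 1/theta0^2
print(-curvature)               # = 0.25  : -E[curvature], the same quantity
```

Both estimates of $I(\theta_0)$ agree with the exact value $1/\theta_0^2 = 0.25$.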

How can we interpret this? $\ell(\theta)$ is the likelihood information about the parameter $\theta$ from the sample. This can really only be interpreted in a relative sense, as when we use it to compare the plausibilities of two distinct possible parameter values via the loglikelihood difference $\ell(\theta_0) - \ell(\theta_1)$, as in a likelihood ratio test. The rate of change of the loglikelihood is the score function $\dot{\ell}(\theta)$; it tells us how fast the likelihood changes, and its variance $I(\theta)$ how much this varies from sample to sample, at a given parameter value, say $\theta_0$. The equation (which is really surprising!) $$ I(\theta) = - \E_\theta \ddot{\ell}(\theta) $$ tells us there is a relationship (equality) between the variability in the information (likelihood) for a given parameter value, $\theta_0$, and the curvature of the likelihood function for that parameter value. This is a surprising relationship between the variability (variance) of the statistic $\dot{\ell}(\theta) \mid_{\theta=\theta_0}$ and the expected change in likelihood when we vary the parameter $\theta$ in some interval around $\theta_0$ (for the same data). This is really both strange, surprising and powerful!

So what is the likelihood function? We usually think of the statistical model $\{ f(x;\theta), \theta \in \Theta \} $ as a family of probability distributions for the data $x$, indexed by the parameter $\theta$, an element of the parameter space $\Theta$. We think of this model as being true if there exists some value $\theta_0 \in \Theta$ such that the data $x$ actually have the probability distribution $f(x;\theta_0)$. So we get a statistical model by imbedding the true data-generating probability distribution $f(x;\theta_0)$ in a family of probability distributions. But it is clear that such an imbedding can be done in many different ways, each such imbedding will be a "true" model, and they will give different likelihood functions. And, without such an imbedding, there is no likelihood function. It seems that we really do need some help, some principles for how to choose an imbedding wisely!

So, what does this mean? It means that the choice of likelihood function tells us how we would expect the data to change if the truth changed a little bit. But this cannot really be verified by the data, as the data only give information about the true model function $f(x;\theta_0)$ which actually generated them, and nothing about all the other elements in the chosen model. In this way we see that the choice of likelihood function is similar to the choice of a prior in Bayesian analysis: it injects non-data information into the analysis. Let us look at this in a simple (somewhat artificial) example, and look at the effect of imbedding $f(x;\theta_0)$ in a model in different ways.

Let us assume that $X_1, \dotsc, X_n$ are iid $N(\mu=10, \sigma^2=1)$. So that is the true, data-generating distribution. Now, let us embed this in a model in two different ways, model A and model B: $$ A \colon X_1, \dotsc, X_n ~\text{iid}~N(\mu, \sigma^2=1),\mu \in \mathbb{R} \\ B \colon X_1, \dotsc, X_n ~\text{iid}~N(\mu, \mu/10), \mu>0. $$ You can check that the two models coincide with the true distribution when $\mu=10$.

The loglikelihood functions become $$ \ell_A(\mu) = -\frac{n}{2} \log (2\pi) -\frac12\sum_i (x_i-\mu)^2 \\ \ell_B(\mu) = -\frac{n}{2} \log (2\pi) - \frac{n}{2}\log(\mu/10) - \frac{10}{2}\sum_i \frac{(x_i-\mu)^2}{\mu} $$

The score functions (loglikelihood derivatives) are $$ \dot{\ell}_A(\mu) = n (\bar{x}-\mu) \\ \dot{\ell}_B(\mu) = -\frac{n}{2\mu} + \frac{10}{2}\sum_i \left(\frac{x_i}{\mu}\right)^2 - 5 n $$ and the curvatures $$ \ddot{\ell}_A(\mu) = -n \\ \ddot{\ell}_B(\mu) = \frac{n}{2\mu^2} - \frac{10}{2}\sum_i \frac{2 x_i^2}{\mu^3}, $$ so the Fisher information really does depend on the imbedding. Now, we calculate the Fisher information at the true value $\mu=10$: $$ I_A(\mu=10) = n, \\ I_B(\mu=10) = n \cdot \left(\frac{2020}{2000}-\frac1{200}\right) > n, $$ so the Fisher information about the parameter is somewhat larger in model B.
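As a quick numerical check of these two values (my own addition, not part of the original answer), one can simulate many samples from the true $N(10,1)$ distribution and compute the variance of each score at $\mu = 10$; the sketch below assumes NumPy.

```python
# Monte Carlo check of I_A(10) and I_B(10): simulate samples of size n from the
# true N(10, 1) distribution and take the variance of each score at mu = 10.
import numpy as np

rng = np.random.default_rng(1)
n, mu, reps = 50, 10.0, 100_000
x = rng.normal(loc=mu, scale=1.0, size=(reps, n))          # true data-generating model

score_A = n * (x.mean(axis=1) - mu)                        # score of model A at mu = 10
score_B = -n / (2 * mu) + 5 * (x**2).sum(axis=1) / mu**2 - 5 * n   # score of model B

print(score_A.var() / n)    # ~ 1.000 : I_A(10) = n
print(score_B.var() / n)    # ~ 1.005 : I_B(10) = n * (2020/2000 - 1/200)
```

The per-observation information comes out near 1 for model A and near 1.005 for model B, matching the formulas above.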

This illustrates that, in some sense, the Fisher information tells us how fast the information from the data about the parameter would change if the governing parameter changed in the way postulated by the imbedding in a model family. The explanation of the higher information in model B is that model family B postulates that if the expectation increased, then the variance would increase too. So, under model B, the sample variance also carries information about $\mu$, which it does not under model A.

Also, this example illustrates that we really do need some theory for helping us in how to construct model families.

  • Great explanation. Why do you say $\E_\theta \dot{\ell}(\theta) =0$? It's a function of $\theta$ - isn't it 0 only when evaluated at the true parameter $\theta_0$?
    – ihadanny
    Commented Aug 13, 2016 at 6:33
  • Yes, what you say is true, @ihadanny. It is zero when evaluated at the true parameter value. Commented Aug 16, 2016 at 13:59
  • Thanks again @kjetil - so just one more question: is the surprising relationship between the variance of the score and the curvature of the likelihood true for every $\theta$? or only in the neighborhood of the true parameter $\theta_0$?
    – ihadanny
    Commented Aug 16, 2016 at 18:56
  • Again, that relationship is true for the true parameter value. But for that to be of much help, there must be continuity, so that it is approximately true in some neighborhood, since we will use it at the estimated value $\hat{\theta}$, not only at the true (unknown) value. Commented Aug 16, 2016 at 19:31
  • So, the relationship holds for the true parameter $\theta_0$, it almost holds for $\theta_{mle}$ since we assume that it's in the neighborhood of $\theta_0$, but for a general $\theta_1$ it does not hold, right?
    – ihadanny
    Commented Aug 17, 2016 at 7:02

Let's think in terms of the negative log-likelihood function $\ell$. The negative score is its gradient with respect to the parameter value. At the true parameter, the score is zero. Otherwise, it gives the direction towards the minimum of $\ell$ (or, in the case of a non-convex $\ell$, a saddle point or a local minimum or maximum).

The Fisher information measures the curvature of $\ell$ around $\theta$ if the data follow $\theta$. In other words, it tells you how much wiggling the parameter would change your log-likelihood.
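As a concrete illustration (my own sketch, not from the original answer), take a Bernoulli($p$) model, for which $I(p) = 1/(p(1-p))$: the expected drop in log-likelihood per observation when the parameter is wiggled by a small $\delta$ is approximately $\tfrac12 I(p)\,\delta^2$. The script below assumes NumPy.

```python
# Sketch: for Bernoulli(p), the expected per-observation drop in log-likelihood
# from moving p by a small delta is about 0.5 * I(p) * delta^2, where
# I(p) = 1 / (p * (1 - p)) is the Fisher information.
import numpy as np

p, delta = 0.3, 0.01
I_p = 1 / (p * (1 - p))

def expected_loglik(q, p_true=p):
    # E_{X ~ Bernoulli(p_true)} [ log f(X; q) ]
    return p_true * np.log(q) + (1 - p_true) * np.log(1 - q)

drop = expected_loglik(p) - expected_loglik(p + delta)
print(drop)                     # ~ 0.000235 : actual expected drop
print(0.5 * I_p * delta**2)     # ~ 0.000238 : quadratic (curvature) approximation
```

The two numbers agree to leading order in $\delta$: the curvature controlling this drop is exactly the Fisher information.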

Consider that you had a big model with millions of parameters. And you had a small thumb drive on which to store your model. How should you prioritize how many bits of each parameter to store? The right answer is to allocate bits according to the Fisher information (Rissanen wrote about this). If the Fisher information of a parameter is zero, that parameter doesn't matter.

We call it "information" because the Fisher information measures how much this parameter tells us about the data.


A colloquial way to think about it is this: Suppose the parameters are driving a car, and the data is in the back seat correcting the driver. The annoyingness of the data is the Fisher information. If the data lets the driver drive, the Fisher information is zero; if the data is constantly making corrections, it's big. In this sense, the Fisher information is the amount of information going from the data to the parameters.

Consider what happens if you make the steering wheel more sensitive. This is equivalent to a reparametrization. In that case, the data doesn't want to be so loud for fear of the car oversteering. This kind of reparametrization decreases the Fisher information.
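The effect of a reparametrization can be made precise with the standard transformation rule: if $\eta = g(\theta)$, then $I_\eta(\eta) = I_\theta(\theta)\,(d\theta/d\eta)^2$, so the information is rescaled by the squared derivative of the map. The sketch below (my own example, not from this answer) just verifies that rule numerically for a Bernoulli model reparametrized by its log-odds; it assumes NumPy.

```python
# Verify the reparametrization rule I_eta(eta) = I_p(p) * (dp/deta)^2
# for a Bernoulli(p) model with eta = log(p / (1 - p)) (log-odds).
import numpy as np

def I_p(p):
    # Fisher information in the p-parametrization
    return 1 / (p * (1 - p))

def I_eta(eta):
    # Fisher information in the log-odds parametrization, known to be p * (1 - p)
    p = 1 / (1 + np.exp(-eta))
    return p * (1 - p)

p = 0.3
eta = np.log(p / (1 - p))
dp_deta = p * (1 - p)              # derivative of the inverse (logistic) map at eta

print(I_eta(eta))                  # 0.21 : information per unit of eta
print(I_p(p) * dp_deta**2)         # 0.21 : same number via the transformation rule
```

Whether a given reparametrization increases or decreases the Fisher information at a point depends on whether the map stretches or compresses the parameter axis there.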


Complementary to @NeilG's nice answer (+1) and to address your specific questions:

  1. I would say it measures the "precision" rather than the "error" itself.

Remember that the observed Fisher information is the negative Hessian of the log-likelihood evaluated at the ML estimates, and that the estimated standard errors are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. Stemming from this, the (scalar) Fisher information referred to here is the trace of the Fisher information matrix. Since the Fisher information matrix $I$ is a Hermitian positive-semidefinite matrix, its diagonal entries $I_{j,j}$ are real and non-negative; as a direct consequence, its trace $tr(I)$ must be non-negative. This means that, by your assertion, you could only ever have "non-ideal" estimators. So no, a positive Fisher information is not related to how ideal your MLE is. (A small numerical sketch of the standard-error recipe appears after this list.)

  2. The definition differs in the way we interpret the notion of information in the two cases. Having said that, the two measures are closely related.

The inverse of the Fisher information is the minimum variance of an unbiased estimator (Cramér–Rao bound). In that sense the information matrix indicates how much information about the estimated coefficients is contained in the data. By contrast, Shannon entropy has its roots in thermodynamics. It expresses the information content of a particular value of a variable as $-p \cdot \log_2(p)$, where $p$ is the probability of the variable taking on that value. Both are measures of how "informative" a variable is. In the first case, though, you judge this information in terms of precision, while in the second case in terms of disorder; different sides, same coin! :D
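Here is a minimal numerical sketch of the recipe in point 1 (my own example; the Exponential model and every name in it are illustrative assumptions, not something from this thread): treat the observed Fisher information as the negative curvature of the log-likelihood at the MLE, invert it, and take square roots to get standard errors. It assumes NumPy.

```python
# Standard errors from the observed Fisher information for a one-parameter
# Exponential(rate = lam) model; the "matrix" is a 1x1 number here.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 3.0, size=500)       # data with true rate 3

def loglik(lam):
    return len(x) * np.log(lam) - lam * x.sum()

lam_hat = 1 / x.mean()                             # closed-form MLE for this model

# Observed information = -d^2/dlam^2 loglik at the MLE, via a central difference.
h = 1e-4
obs_info = -(loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h**2

se_numeric = np.sqrt(1 / obs_info)                 # sqrt of inverse observed information
se_analytic = lam_hat / np.sqrt(len(x))            # known result for this model

print(lam_hat, se_numeric, se_analytic)            # the two standard errors agree
```

The deeper and more sharply curved the log-likelihood is at its maximum, the larger the observed information and the smaller the standard error, which is exactly the "wiggle room" picture in the recap below.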

To recap: the inverse of the Fisher information matrix $I$ evaluated at the ML estimates is the asymptotic or approximate covariance matrix. As the ML estimates are found at a local minimum (of the negative log-likelihood), graphically the Fisher information shows how deep that minimum is and how much wiggle room you have around it. I found this paper by Lutwak et al. on Extensions of Fisher information and Stam's inequality an informative read on this matter. The Wikipedia articles on the Fisher information metric and on Jensen–Shannon divergence are also good to get you started.

