
I'm trying to understand the connection between estimators and entropy. I may be grasping at straws here, but do estimators reduce the entropy of the random variable being estimated, with better estimators (lower variance) reducing it more?


2 Answers


[Note: I am not entirely sure if this answer interprets the notion of "reduces the entropy" exactly as intended in the question, but this information (drawn largely from Sections 2.8 and 2.10 of Cover & Thomas, Elements of Information Theory) seems at least highly relevant to the question at hand.]

Overview

Consider a sample $X$ drawn from a family of distributions with probability mass functions $\{f_\theta\}$ indexed by the parameter $\theta$. The parameter $\theta$ is the estimand. This post explains how a statistic $T(X)$ for $\theta$ reduces the entropy $H(\theta)$, with a sufficient statistic $T'(X)$ (i.e., a "better estimator") reducing $H(\theta)$ more than any other statistic.

When we say that "a sufficient statistic $T'(X)$ for $\theta$ reduces the entropy $H(\theta)$ more than any other statistic", we mean that $T'(X)$ contains as much information about $\theta$ as the sample $X$ itself does. That is, $I(\theta; X) = I(\theta; T')$ (or, stated in terms of conditional entropies, $H(\theta|X) = H(\theta|T')$). It turns out that, by the Data Processing Inequality, no statistic $T(X)$ contains more information about $\theta$ than the sample $X$ does. Thus, the sufficient statistic contains at least as much information about $\theta$ as any other statistic (i.e., reduces the entropy of $\theta$ at least as much as any other statistic).
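One way to make "reduces the entropy" precise is through the identity $I(\theta; T) = H(\theta) - H(\theta \mid T)$: observing a statistic $T$ lowers our uncertainty about $\theta$ from $H(\theta)$ to the conditional entropy $H(\theta \mid T)$. The Data Processing Inequality (stated below) then says

$$H(\theta \mid T(X)) \;\geq\; H(\theta \mid X), \qquad \text{equivalently} \qquad I(\theta; T(X)) \;\leq\; I(\theta; X),$$

with equality exactly when $T$ is sufficient.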

This sketch of an answer gives rise to two questions: "What is the Data Processing Inequality?" and "What is a sufficient statistic?"

Data Processing Inequality and Sufficient Statistics

The reason why no statistic $T$ can contain more information about $\theta$ than $X$ does is the Data Processing Inequality. Loosely speaking, it states that no statistic $T$ computed from the sample $X$ can increase the amount of information about $\theta$: there is no function $T(X)$ of the sample $X$ whose mutual information $I(\theta; T)$ exceeds $I(\theta; X)$. The best we can do is to find a sufficient statistic $T'(X)$, which contains exactly as much information about $\theta$ as the sample $X$ does.

The actual statement of the Data Processing Inequality involves the notion of a Markov Chain.

(Markov Chain) Random variables $X, Y, Z$ form a Markov Chain $X \rightarrow Y \rightarrow Z$ if the conditional distribution of $Z$ depends only on $Y$; that is, $Z$ is conditionally independent of $X$ given $Y$.

Since the probability mass function $f_\theta$ generates the random variable $X$ and the statistic $T(X)$ is a function of $X$, we are interested in the Markov chain $\theta \rightarrow X \rightarrow T(X)$.

(Data Processing Inequality) If $X \rightarrow Y \rightarrow Z$, then $I(X; Y) \geq I(X; Z)$, with equality if and only if $X \rightarrow Z \rightarrow Y$ is also a Markov Chain.

Since $\theta \rightarrow X \rightarrow T(X)$ is a Markov Chain, the Data Processing Inequality gives us $I(\theta; X) \geq I(\theta; T(X))$. We have equality if and only if $\theta \rightarrow T(X) \rightarrow X$ is also a Markov Chain (i.e., if the random variable $X$ is conditionally independent of $\theta$ given the statistic $T(X)$). In that case we call $T(X) = T'(X)$ a sufficient statistic, because it contains all the information about $\theta$ that the sample $X$ contains.
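To make this concrete, here is a minimal numerical sketch under a toy setup of my own choosing (a two-point prior on $\theta$ and an i.i.d. Bernoulli sample of size 3; the function names are just for illustration). It computes $I(\theta; X)$, $I(\theta; \sum_i X_i)$, and $I(\theta; X_1)$ by exact enumeration: the sum attains $I(\theta; X)$ because it is sufficient, while the single observation $X_1$ falls short.

```python
import itertools
import math

# Toy setup (an assumption for illustration): a two-point prior on theta
# and an i.i.d. Bernoulli(theta) sample of size n.
prior = {0.3: 0.5, 0.7: 0.5}
n = 3
samples = list(itertools.product([0, 1], repeat=n))

def joint(theta, x):
    """P(theta, X = x) under the prior and the Bernoulli likelihood."""
    likelihood = math.prod(theta if xi == 1 else 1 - theta for xi in x)
    return prior[theta] * likelihood

def mutual_information(stat):
    """I(theta; stat(X)) in bits, computed by exact enumeration."""
    p_joint = {}  # joint distribution of (theta, T) with T = stat(X)
    for theta in prior:
        for x in samples:
            key = (theta, stat(x))
            p_joint[key] = p_joint.get(key, 0.0) + joint(theta, x)
    p_t = {}      # marginal distribution of T
    for (_, t), p in p_joint.items():
        p_t[t] = p_t.get(t, 0.0) + p
    return sum(p * math.log2(p / (prior[th] * p_t[t]))
               for (th, t), p in p_joint.items() if p > 0)

print("I(theta; X)      =", mutual_information(lambda x: x))       # whole sample
print("I(theta; sum(X)) =", mutual_information(lambda x: sum(x)))  # sufficient statistic
print("I(theta; X_1)    =", mutual_information(lambda x: x[0]))    # not sufficient
```

Swapping in any other function of the sample never pushes the mutual information above $I(\theta; X)$, which is the Data Processing Inequality in action.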

The Data Processing Inequality therefore lets us say that a sufficient statistic contains at least as much information about $\theta$ as any other statistic (i.e., "reduces the entropy" of $\theta$ at least as much as any other statistic).


@RNG's answer is brilliant, so this is just to give a more conceptual answer.

The most confusing thing (to me) when comparing estimation theory and information theory is that they have opposite notions of information: Fisher information and Shannon information (entropy). They're opposite in the sense that as one goes up, the other tends to go down. This is because a sample $X$ has higher entropy when the typical observation $x$ is less predictable from the other observations, whereas $X$ has higher Fisher information when $x$ is more predictable from the other observations. (Technically, Fisher information is an attribute of the statistical model and its parameterization rather than of the sample itself, but that's not really important here.) So, in the simplest terms, the answer is that entropy and parameter information have an inverse relationship.
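As a concrete sketch of that inverse relationship (a standard textbook example, not something from the answers above): take $n$ i.i.d. observations from $N(\mu, \sigma^2)$ with $\sigma^2$ known. The differential entropy of a single observation and the Fisher information of the sample about $\mu$ are

$$h(X_i) = \tfrac{1}{2}\log\!\left(2\pi e \sigma^2\right), \qquad \mathcal{I}_n(\mu) = \frac{n}{\sigma^2},$$

so as $\sigma^2$ grows, entropy increases while Fisher information decreases.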

One caveat is that this is only true when we assume unbiased measurement. If, for example, scores on a test are positively correlated with an irrelevant parameter (like race or SES for an IQ test), and the sample includes individuals with a range of values on these correlates, this will add variability to the scores. More variance means less Fisher information. However, the additional variance isn't random: bias is systematic and can be accounted for if you model it correctly, so entropy may also decrease.

