
I guess my trouble is not a big one, but here it is: when one applies maximum likelihood, one considers the realization $(x_1, \dots, x_n)$ of a simple random sample (SRS), which leads to ML estimates. But if one wants to talk about things like bias, consistency and so on, one has to refer to estimators instead, right?

For example, in this Wikipedia example for the normal distribution, they end up with $\hat{\mu}=\bar{x}$ and then write $\mathbb{E}(\hat{\mu})=\mu$, which is arguably a bit sloppy, since there is little interest in taking the expected value of a deterministic quantity.

Hence my question: why do most sources (Wikipedia, textbooks, ...) build MLEs from actual data (that is, from a realization of an SRS) rather than from the SRS itself (that is, the collection of random variables $(X_1, \dots, X_n)$), which would provide estimators for which it makes sense to compute expectations and so on? I guess this is a spurious subtlety of little practical interest, since it would suffice to "capital-letter-ize" the estimates to obtain the corresponding estimators, but I still wanted to ask.
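To make the point concrete, here is the "capital-letter-ized" version I have in mind for the normal example, written out explicitly:
$$\hat{\mu}(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^n X_i = \bar X, \qquad \hat{\mu}(x_1, \dots, x_n) = \bar x,$$
so that $\mathbb{E}(\hat{\mu}) = \mathbb{E}(\bar X) = \mu$ is a statement about the estimator (a random variable), whereas $\bar x$ is just a number.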

  • Bias and consistency, for example, are properties of the estimator. On the other hand, $\bar x$ is simply a realization of the estimator. Commented May 8 at 13:12
  • @User1865345 I know. This is precisely why I am asking the question :)
    – MysteryGuy
    Commented May 8 at 13:19

2 Answers


Consider a statistical model $\left(\mathcal X, \mathcal A, \left(\mathbb P_\vartheta\right)_{\vartheta \in \Theta}\right)$ consisting of

  • a set $\mathcal X$ (the sample space),
  • a $\sigma$-algebra $\mathcal A$ on $\mathcal X$,
  • a family of probability measures $\left(\mathbb P_\vartheta\right)_{\vartheta \in \Theta}$ on $\mathcal A$, indexed by a set $\Theta$ (the parameter space) of cardinality greater than one,

in which $\mathbb P_\vartheta$ is dominated by a $\sigma$-finite measure $\mu$ for all $\vartheta \in \Theta$. Denote the density (Radon–Nikodym derivative) of $\mathbb P_\vartheta$ w.r.t. $\mu$ by $\frac{\mathrm d \mathbb P_\vartheta}{\mathrm d \mu}$; and let $\mathcal S$ denote a $\sigma$-algebra on $\Theta$.
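For concreteness, one way the normal example from the question fits this framework (taking $\sigma^2 > 0$ as known, so that $\vartheta$ plays the role of the unknown mean) is: $\mathcal X = \mathbb R^n$ with its Borel $\sigma$-algebra $\mathcal A$, $\Theta = \mathbb R$, $\mu$ the Lebesgue measure on $\mathbb R^n$, and $\mathbb P_\vartheta$ the joint law of $n$ i.i.d. $\mathcal N(\vartheta, \sigma^2)$ observations, so that
$$\frac{\mathrm d \mathbb P_\vartheta}{\mathrm d \mu}(x) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\left(-\frac{(x_i - \vartheta)^2}{2 \sigma^2}\right), \qquad x = (x_1, \dots, x_n) \in \mathbb R^n.$$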

One way to formalize maximum likelihood estimation is to define the likelihood function $\mathcal L$ as the bivariate function $\mathcal L: \Theta \times \mathcal X \to [0, \infty), \mathcal L(\vartheta, x) \mathrel{:=} \frac{\mathrm d \mathbb P_\vartheta}{\mathrm d \mu}(x)$ and call an estimator $\hat \vartheta: (\mathcal X, \mathcal A) \to (\Theta, \mathcal S)$ of $\vartheta \in \Theta$ a maximum likelihood estimator if $\mathcal L(\hat \vartheta(x), x) = \max_{\vartheta \in \Theta} \mathcal L(\vartheta, x)$ holds for all $x \in \cal X$.

An alternative way (which I have come across more frequently) is to define the likelihood function for the outcome $x \in \mathcal X$ as $\mathcal L_x : \Theta \to [0, \infty), \mathcal L_x(\vartheta) \mathrel{:=} \mathcal L(\vartheta, x)$, where $x \in \mathcal X$ is fixed, and call an estimator $\hat \vartheta: (\mathcal X, \mathcal A) \to (\Theta, \mathcal S)$ a maximum likelihood estimator of $\vartheta \in \Theta$ if the estimate $\hat \vartheta(x)$ is a maximizer of $\mathcal L_x$ on $\Theta$ for each $x \in \cal X$.
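Continuing the normal instance above, fixing $x \in \mathbb R^n$ and maximizing the (concave) log-likelihood $\log \mathcal L_x$ over $\vartheta \in \mathbb R$ gives
$$\frac{\partial}{\partial \vartheta} \log \mathcal L_x(\vartheta) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \vartheta) = 0 \iff \vartheta = \bar x,$$
so the estimate is $\hat \vartheta(x) = \bar x$ for each fixed $x$, while the estimator is the map $x \mapsto \bar x$ itself.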

Evidently, both approaches lead to the same definition of a maximum likelihood estimator.
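As a rough numerical illustration of the estimator/estimate distinction (a minimal sketch in Python with NumPy, assuming the normal instance above; the helper name mu_hat is arbitrary): the same function is evaluated once on a single realization, giving an estimate, and then repeatedly on fresh samples to approximate the expectation of the estimator by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_hat(sample):
    """Normal-mean MLE, viewed as a map from data to the parameter space."""
    return np.mean(sample)

mu_true, sigma, n = 2.0, 1.5, 50

# A single realization (x_1, ..., x_n) yields one estimate: a fixed number.
x = rng.normal(mu_true, sigma, size=n)
print("estimate from one realization:", mu_hat(x))

# Repeating the experiment approximates E[mu_hat(X_1, ..., X_n)], which is a
# property of the estimator, not of any single estimate.
estimates = [mu_hat(rng.normal(mu_true, sigma, size=n)) for _ in range(10_000)]
print("Monte Carlo mean of the estimator:", np.mean(estimates))  # close to mu_true
```

The first printed number is the deterministic quantity the question is worried about; statements such as $\mathbb E(\hat \mu) = \mu$ concern the distribution sampled in the second step.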


I agree with you that denoting both a (maximum likelihood) estimator $\hat \vartheta$ of $\vartheta$ and the corresponding estimate $\hat \vartheta(x)$ by $\hat \vartheta$ is formally incorrect, as they are different objects. However, overloading the symbol $\hat \vartheta$ in this way is at least partially justified, since it is usually clear from the context which of the two objects is meant.


Reference

Georgii, H.-O. (2013). Stochastics: Introduction to probability and statistics (E. Baake & M. Ortgiese, Trans.). Walter de Gruyter.

  • Thanks for the insight. I think the fact that the exact distinction between an estimator and an estimate is not settled does not help. For example, in the question you are pointing to in your answer, some people (see Whuber's answer and even the reference in your own answer) still define an estimate as a RV while others don't...
    – MysteryGuy
    Commented May 12 at 8:19

Let's try to unpack the subject a bit, based on my understanding of ML estimation. When we study the properties of an estimation method, we need some assumptions and premises to serve as a bedrock for the theory we build. Those usually come from probability theory, which is why inference is seen as "applied probability" in some fields.

One of the fundamental assumptions of the theory we showcase is that samples are i.i.d. Even though it is not necessary for some of the theoretical results, it is the usual starting point for presenting ML estimation theory. And if we have that in practice, through simple random sampling for example, great: we can rely on those results!

If not, we need 1) other guarantees our theory can give us; 2) other premises we can rely on (such as asymptotic behavior as the sample size grows); or 3) different estimation methods altogether (think GLMs, Mixed Models, GMM, etc.).

Those can be ML estimates as well, but based on different probabilistic structures or assumptions, which enriches the theory of ML estimation.

Let's start a back-and-forth in the comments so I can try and update this answer since it is a bit open-ended 😅

