
The first thing one learns in statistics is to use the sample mean, $\bar{X}$, as an unbiased estimate of the population mean, $\mu$; and much the same holds for the sample variance, $S^2$, as an estimate of $\sigma^2$ (leaving aside Bessel's correction for a second). From these working assumptions, and with the CLT, a great part of basic inferential statistics is taught using Gaussian and t distributions.
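To see those two working assumptions in action, here is a minimal simulation sketch of my own (not part of any course material): it checks numerically that the sample mean is unbiased for $\mu$, and that dividing by $N$ rather than $N-1$ underestimates $\sigma^2$ on average. The particular values of $\mu$, $\sigma$, $N$ and the number of replications are arbitrary choices for illustration.

```python
# Minimal simulation sketch (illustrative values): unbiasedness of the sample mean,
# and the effect of Bessel's correction on the variance estimate.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, N))
xbar = samples.mean(axis=1)                  # sample means
s2_biased = samples.var(axis=1, ddof=0)      # divide by N (no Bessel correction)
s2_unbiased = samples.var(axis=1, ddof=1)    # divide by N-1 (Bessel correction)

print(xbar.mean())        # ~ 5.0  -> unbiased for mu
print(s2_biased.mean())   # ~ 3.6  -> biased low, roughly (N-1)/N of sigma^2 = 4
print(s2_unbiased.mean()) # ~ 4.0  -> unbiased for sigma^2
```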

This, in principle, seems very similar to the setup behind MLE calculations - i.e. estimating a population parameter based on the evidence of a sample. And in both cases the population parameters are really unknown.

My question is whether MLE is, in a sense, the overarching frequentist mathematical framework underpinning the assumptions made in basic (introductory) inferential statistics courses.

Although it makes sense to derive the hat matrix for OLS using MLE, and to confirm that the likelihood is indeed maximized by checking the Hessian matrix, it is also true that through MLE one can "rediscover" some basic results that are taken for granted in introductory courses.

For instance, we can derive the result that the MLE of the mean, $\mu$, of a Gaussian distribution, given the observation of a sample of values ($x_1,\dots,x_N$), equals $\hat\mu=\frac{1}{N}\displaystyle\sum_{i=1}^N x_i$ - i.e. the sample mean; and the MLE of the variance, $\sigma^2$, is $\hat\sigma^2=\frac{1}{N}\displaystyle\sum_{i=1}^N (x_i-\hat\mu)^2$ - i.e. the sample variance without Bessel's correction.
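For completeness, here is a short sketch of that derivation (standard calculus on the Gaussian log-likelihood, assuming i.i.d. observations):

$$\ell(\mu,\sigma^2)=\log\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}=-\frac{N}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2$$

$$\frac{\partial\ell}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu)=0\;\Rightarrow\;\hat\mu=\frac{1}{N}\sum_{i=1}^{N}x_i$$

$$\frac{\partial\ell}{\partial\sigma^2}=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i-\hat\mu)^2=0\;\Rightarrow\;\hat\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_i-\hat\mu)^2$$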

So, in the end, the layman's account would be that what is taught in introductory courses is really supported by a more sophisticated mathematical structure: Maximum Likelihood Estimation, developed by R.A. Fisher, whose main counterpart is Bayesian statistics.

MLE bypasses the need for a prior probability $p(\theta)$ on the population parameter, which Bayes' calculation of the inverse probability or posterior, $p(\theta\mid x_1,\dots,x_n)$, requires via the equation $p(\theta\mid x_1,\dots,x_n) = \Large \frac{p(x_1,\dots,x_n\mid\theta)\,p(\theta)}{p(x_1,\dots,x_n)}$. It does so by working instead with the likelihood $\mathscr{L}(\theta\mid x_1,\dots,x_n)$, defined as the joint probability of the sample viewed as a function of $\theta$, i.e. $p(x_1,\dots,x_n\mid\theta)$, and maximizing its value over $\theta$.
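As a toy illustration of that point (a coin-flip example of my own, not from the question): the likelihood is just the joint probability of the observed data read as a function of $\theta$, and maximizing it requires no prior. The data and the grid of candidate values below are arbitrary.

```python
# Illustrative sketch (coin-flip example): L(theta | data) is p(data | theta)
# read as a function of theta; MLE maximizes it without any prior on theta.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # 6 heads out of 8 tosses
thetas = np.linspace(0.01, 0.99, 99)          # grid of candidate values for theta

# joint probability of the observed data for each candidate theta
likelihood = np.array([np.prod(t**data * (1 - t)**(1 - data)) for t in thetas])

theta_hat = thetas[np.argmax(likelihood)]
print(theta_hat)                              # ~ 0.75 = 6/8, the sample proportion
```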

So we have two general theories, one of them (MLE) barely mentioned in introductory courses, yet mathematically underpinning what is taught in school.

  • You include so many questions in your thread that it could be hard or impossible to answer without writing a book on MLE, so it would be better if you could edit your question to be more precise about what exactly you are asking. As for the general question on MLE, you can find some answers in: stats.stackexchange.com/questions/112451/…
    – Tim
    Commented Mar 15, 2015 at 22:13
  • Great link. The question is broad on purpose, but it focuses on the conceptual difference between MLE and other (more basic) inferential techniques and statistical tests. Commented Mar 15, 2015 at 22:20
  • @toni Such questions are not really valid for the site. They are closed as being too broad or unclear about what you're asking.
    – AdamO
    Commented Mar 15, 2015 at 22:52
  • If this is the case, I will be willing to erase it. I just seem to fall directly into explanations of MLE without a nice segue from basic techniques. Commented Mar 15, 2015 at 22:56
  • Can you focus your question a little more and perhaps ask follow-up questions separately? If it's relevant to do so, you can link back to earlier questions.
    – Glen_b
    Commented Mar 15, 2015 at 23:41

1 Answer


Toni,

It has always been my impression that the theoretical backing of basic inferential statistics is typically the central limit theorem, which motivates both the mean and variance calculations you mention, not necessarily by deeming them the most likely values, but through arguments about the asymptotic correctness of these approximations.

MLE is really designed to answer a slightly different question: given a collection of data, and a set of (parametrized) assumptions, what configuration of the parameters maximizes the likelihood of my data? Of course, it is no coincidence that these values agree with the ones above, given that we expect them to agree asymptotically.
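A minimal numerical sketch of that question (my own, assuming a Gaussian model for the data): handing the negative log-likelihood to a generic optimizer recovers exactly the sample mean and the $1/N$ variance from the question. The data-generating values and starting point are arbitrary.

```python
# Sketch assuming a Gaussian model: maximize the likelihood numerically and
# compare with the closed-form sample mean and 1/N variance.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=200)

def neg_log_lik(params):
    mu, log_sigma = params                       # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1]) ** 2

print(mu_hat, x.mean())          # should agree to numerical precision
print(sigma2_hat, x.var(ddof=0)) # 1/N variance, i.e. no Bessel correction
```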

As for Bayesian inference, I differentiate this once again by the nature of the question being asked. Here we no longer ask which configuration of the parameters makes our data most likely, but rather which configuration of the parameters is most likely. One can arrive quickly, intuitively or symbolically, at the fact that we don't have enough assumptions to answer this question; for this reason a prior distribution on the parameters is necessary. (A common alternative view of Bayesian inference begins with the prior, simply trying to incorporate prior beliefs into statistical inference in a measured way.)
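To make the role of the prior concrete, here is a small sketch of my own (a conjugate Normal prior on $\mu$, with $\sigma^2$ treated as known; all numbers are illustrative): once the prior is specified, the posterior mean is a precision-weighted compromise between the prior mean and the sample mean.

```python
# Sketch: conjugate Normal prior on mu, sigma^2 assumed known.
# Posterior mean = precision-weighted average of prior mean and sample mean.
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                                     # assumed known
x = rng.normal(loc=10.0, scale=np.sqrt(sigma2), size=25)

m0, tau2 = 0.0, 1.0                              # prior: mu ~ N(m0, tau2)
n, xbar = x.size, x.mean()

post_prec = 1.0 / tau2 + n / sigma2              # posterior precision
post_mean = (m0 / tau2 + n * xbar / sigma2) / post_prec

print(xbar)       # MLE: uses the data only
print(post_mean)  # pulled toward the prior mean m0
```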

All in all, if I could only have one index on a canon of mathematical theory, I would index by the question being asked/answered. Answers are pretty useless without their questions, and when you go to apply this stuff, you'll typically start with a question. Sorry for the vague answer, it was just a very overarching question, but I think that the organization of one's understanding of a subject is incredibly important.

