
The weak likelihood principle (WLP) has been summarized as: If a sufficient statistic computed on two different samples has the same value on each sample, then the two samples contain the same inferentially useful information. The WLP is usually described as a "widely accepted" or "very reasonable" statement, but not as a theorem. From this I infer that there exists no proof of the WLP.

My question is, why does the absence of a proof for the WLP not prompt skepticism about it? Or, if not skepticism, then at least pressure to find a proof (or a disproof) among statisticians? Or among mathematicians, for that matter? Why is the WLP not the subject of a Millennium Prize (or its statistical equivalent)? Why do we not regard it as the Fermat's Last Theorem of probability and statistics? (Maybe the parallel postulate would be a better analogy....)

As far as answers go, I'd appreciate either/both of two types: theoretical explanations ("We don't need to prove it because..." or "Actually, there is a theorem, see reference...") or historical explanations ("Early statisticians went through a phase when they tried to find a proof but ultimately settled for an axiom..." or "Fisher bullied people into it..."). (My own search turns up no evidence along any of these lines, but I'd welcome examples if they exist.)

Clarification based on input from responders: I'm calling this the WLP but some may prefer to identify it as the sufficiency principle (SP). I'm okay with that, because the SP implies the WLP. Alternatively, you could say the SP is the mathematical statement made by Fisher and proven by the factorization theorem--that the sufficient statistic contains all the parameter information in the sample, and that the sample conditional on a sufficient statistic is independent of the parameter--and that the WLP takes this a step further by insisting there is no non-likelihood information in the sample that's inferentially useful. I'm okay with that, too. Whether it's called the WLP or SP, and whether it involves only the likelihood function or includes the sampling distribution, both are empirical claims about the best possible estimate calculable on a sample in practice, and there seems to be no imperative for proving either.
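
(For reference, the factorization criterion I have in mind is the standard one, stated here for the discrete case: $T$ is sufficient for $\theta$ exactly when the density factors as

$$p_\theta(x) = g_\theta\bigl(T(x)\bigr)\,h(x),$$

equivalently, when the conditional distribution of the sample given $T$ does not depend on $\theta$.)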

Edit 2: I think an answer is materializing across both answers and both sets of comments. If someone agrees and wants to write this up, or modify further, I'll give it a checkmark. a) Statistics lacks a formal axiomatic system. b) Instead, statistics relies on these things called "principles," which are like axioms or postulates except they arise by convention and are adopted by consensus (implicit or explicit). c) No one expects or even hopes to turn these principles into theorems, because without a formal system of axioms, it may well be impossible, and in any event it's very hard to know how or where to start. d) Birnbaum's proof of the SLP is the exception that proves the rule, in that he was able to deduce a strong principle from two weaker ones (controversially). e) If someone were to prove or contradict the WLP, it would be another such exception.

  • In short (and consistent with the two answers given so far): statistical inference is not an axiomatized mathematical theory. 1) There is no agreement about axioms. 2) Practice is not consistent with any axiomatic system tried so far. Commented Jan 23, 2023 at 16:57
  • @kjetilbhalvorsen: Many proof/disproof methods are effectively axiom-adjacent. For example, because this is an empirical claim, a single counterexample would do. Also, and this is admittedly vague speculation, but perhaps one could prove an "enumerative" theorem--i.e., if the WLP holds for all samples, then surely it holds for this one pair of samples, or pairs like this pair, etc. Even that would be progress. Maybe even extending it to something like the computer-assisted proof of the four color theorem, but proving it for classes of samples instead of classes of maps.
    – virtuolie, Commented Jan 23, 2023 at 17:16
  • @kjetilbhalvorsen: That said, I'm not familiar with any attempts to axiomatize statistics. (Excepting Solomonoff's theory of inductive inference, though that seems more a parallel universe to probabilistic statistics, where sufficiency doesn't emerge naturally. Combinatorics or set theory or graph theory might work for permutation statistics...) Maybe the real question is, why have we not prioritized developing such a system? I'd guess it's because statistics has to be consistent with reality, like physics.
    – virtuolie, Commented Jan 23, 2023 at 17:24
  • Of interest, given your Edit 2 reference to Birnbaum's proof, is that within two years of the proof's publication he had rejected both the SLP and his own conditionality principle (according to Giere 1977), although it took until 1969 for Birnbaum's rejection to appear in one of his own published articles. He rejected both principles because they were inconsistent with the frequentist confidence concept (see also his 1970 letter in Nature). Commented Jan 24, 2023 at 22:05
  • You will be interested in this statistical-science-the-likelihood-principle-issue-is-out Commented Jan 26, 2023 at 13:00

2 Answers


Fermat's Last Theorem is a proposition of Number Theory, so you'd want to prove it from Peano's axioms; the parallel postulate, of Euclidean geometry, so from Euclid's other four postulates: but the Weak Likelihood Principle (a.k.a. the Sufficiency Principle) isn't a proposition of Probability Theory, so it's not obvious what you'd want to prove it from.

Birnbaum (1962) kicked off the approach of giving a formal account of the relationships between various "principles". He took the concept of evidential meaning as basic & the W.L.P. as axiomatic, & went on to derive the Strong Likelihood Principle from this & another axiom, the Conditionality Principle. His formal statement of the W.L.P. is that for inference about a parameter $\theta$ in an experiment $E$, where $T$ is a sufficient statistic for $\theta$, if $T(x) = T(y)$ for samples $x$ & $y$, then $\operatorname{Ev}(E,x) = \operatorname{Ev}(E, y)$; in which $\operatorname{Ev}(E,x) = \operatorname{Ev}(E, y)$ denotes "evidential equivalence" or your "containing the same inferentially useful information". This is not an empirical claim, or even a mathematical one, but purports to constrain (sensible) inferential procedures: if it's entailed by other foundational principles you hold dear, then all well & good; if not then you may try & balance it against those or to eschew it altogether.
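
For concreteness, a standard textbook-style illustration (not Birnbaum's own example): in an experiment $E$ consisting of $n$ Bernoulli$(\theta)$ trials, $T(x) = \sum_i x_i$ is sufficient for $\theta$, so any two sequences with the same number of successes are evidentially equivalent under the W.L.P.:

$$x = (1,1,0,1,0), \qquad y = (0,1,1,0,1), \qquad T(x) = T(y) = 3 \;\Longrightarrow\; \operatorname{Ev}(E,x) = \operatorname{Ev}(E,y);$$

the particular ordering of successes & failures carries no further information about $\theta$.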

The W.L.P. is part & parcel of Bayesian frameworks (e.g. Savage, 1954): the likelihood is all that the data contribute to the calculation of posterior probabilities. (Not necessarily so for the S.L.P.—see Do you have to adhere to the likelihood principle to be a Bayesian?.)
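
To make the step explicit: by the factorization theorem the posterior depends on the data only through a sufficient statistic,

$$\pi(\theta \mid x) \;\propto\; \pi(\theta)\,p_\theta(x) \;=\; \pi(\theta)\,g_\theta\bigl(T(x)\bigr)\,h(x) \;\propto\; \pi(\theta)\,g_\theta\bigl(T(x)\bigr),$$

so $T(x) = T(y)$ guarantees identical posteriors.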

Perhaps more interestingly, purely frequentist desiderata tend to mandate the use of sufficient statistics in estimation & testing—consider the Rao–Blackwell Theorem & the Neyman–Pearson Lemma & their ramifications. On those occasions when a randomized estimator or test does enjoy some kind of optimality, that's more prone to be taken as evincing the need for the W.L.P. than as a counter-example. In complex situations different criteria often clash. For example, a solution to the Behrens–Fisher problem was posted here last year: an exact test with better power properties than several alternatives: the only thing wrong with it is that it violates the W.L.P. (But note that in all cases, it's a matter of 'padding out' the sufficient statistic with random noise, & it makes no odds whether the noise is real—from the ancillary part of the data—or synthetic—introduced by the statistician—see the first bullet point of @Sextus Empiricus' answer. There's no "non-likelihood information" being exploited.)
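
To illustrate the Rao–Blackwell point with a toy case (a simulation sketch with arbitrary illustrative values, not part of the cited results): for i.i.d. Poisson$(\lambda)$ observations, the naive unbiased estimator $\mathbf{1}\{X_1 = 0\}$ of $e^{-\lambda}$ is dominated by its Rao–Blackwellization $E[\mathbf{1}\{X_1 = 0\} \mid \sum_i X_i] = ((n-1)/n)^{\sum_i X_i}$, which depends on the data only through the sufficient statistic.

```python
import numpy as np

# Monte Carlo comparison of a naive unbiased estimator of exp(-lambda)
# with its Rao-Blackwellized version (illustrative values only).
rng = np.random.default_rng(0)
n, lam, reps = 20, 1.5, 100_000

samples = rng.poisson(lam, size=(reps, n))
t = samples.sum(axis=1)                     # sufficient statistic T = sum of the counts

naive = (samples[:, 0] == 0).astype(float)  # unbiased for exp(-lam), but ignores most of the data
rao_blackwell = ((n - 1) / n) ** t          # E[naive | T], a function of T alone

print("target        :", np.exp(-lam))
print("naive         : mean %.4f, var %.6f" % (naive.mean(), naive.var()))
print("Rao-Blackwell : mean %.4f, var %.6f" % (rao_blackwell.mean(), rao_blackwell.var()))
# Both estimators are unbiased; conditioning on the sufficient statistic
# sharply reduces the variance.
```

Both estimators have the same expectation; the improvement comes entirely from discarding the noise that $X_1$ carries beyond $T$, which is the W.L.P.'s point in miniature.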

Fiducial inferences may violate the W.L.P. in a different way—in cases where reduction of the data to a sufficient statistic positively discards information held to be pertinent. See Fraser (1963) for discussion & an example. In fact the difficulty isn't unique to fiducial approaches: the nub of the matter is that a premature reduction may conflate events you'd prefer to separate through conditioning (Kalbfleisch, 1975). (This is generally seen as calling for strictures on when to invoke the W.L.P. rather than for its abandonment.)


Birnbaum (1962), "On the Foundations of Statistical Inference", J. Am. Stat. Assoc., 57, 298

Fraser (1963), "On the Sufficiency & Likelihood Principles", J. Am. Stat. Assoc., 58, 303

Kalbfleisch (1975), "Sufficiency & Conditionality", Biometrika, 62, 2

Savage (1954), The Foundations of Statistics

  • In a rigorous system, having no obvious way to prove a fundamental principle is a reason to prioritize proving it. Consider the question, why are we satisfied for certain statements to be definitions or axioms, but insist on proofs for others? I'd argue the answer is that the former--definitions of a line, natural numbers, sampling distributions--are abstractions. But the WLP is an empirical statement, with direct, practical consequences. An empirical counterexample would disprove it. If it's wrong, it means we're leaving information unused.
    – virtuolie, Commented Jan 23, 2023 at 16:16
  • But maybe I've misunderstood you. Are you saying that, historically, statisticians have avoided trying to prove the WLP/SP, or failed to recognize the value of doing so, *because* there's no obvious place to start?
    – virtuolie, Commented Jan 23, 2023 at 16:19
  • Before anything else, I'm baffled by the notion that the W.L.P. could be empirically disconfirmed. Could you please elaborate? Commented Jan 23, 2023 at 17:09
  • Sure. Keep the "weaker" SP, so a sufficient statistic conveys all probabilistic information in the sample. Then define a certain measure of non-probabilistic information that quantifies manifest or exact properties of the data. If one could show that sample-specific randomness in a probability sample (X, Y) is equivalent to, say, combinatorial information or conditional complexity of Y given X, then conditioning on a non-probabilistic measure would improve estimate accuracy and precision, not by increasing parameter information, but by decreasing randomness.
    – virtuolie, Commented Jan 23, 2023 at 17:40
  • In that case you could improve your estimator by Rao-Blackwellizing it. Commented Jan 23, 2023 at 21:59
  • There are theorems revolving around the sufficient statistic, for instance the Fisher-Neyman factorisation theorem. A corollary is that the likelihood function is a sufficient statistic.

    If we have a sufficient statistic, then we can replace the data-generating process by a (hypothetically) equivalent two-step process: first sample the sufficient statistic (the only step where the parameters are relevant), then sample the rest of the data, which is produced independently of the parameters, based only on the value of the sufficient statistic. (A small simulation illustrating this two-step view is sketched after this list.)

    That view leads to descriptions such as Theorem 6.1 in Lehmann & Casella's Theory of Point Estimation (thanks to Scortchi for mentioning it):

    Theorem 6.1 Let $X$ be distributed according to $P_\theta \in \mathcal{P}$ and let $T$ be sufficient for $\mathcal{P}$. Then, for any estimator $\delta(X)$ of $g(\theta)$, there exists a (possibly randomized) estimator based on $T$ which has the same risk function as $\delta(X)$.

  • The strong likelihood principle states that the likelihood function is the only relevant information in inference, even when it stems from experiments with different distributions.

    There are many cases of frequentist and fiducial inference that are not consistent with this strong likelihood principle, because those methods also take into account the probability distribution of the sufficient statistic along with the likelihood function. That distribution can differ across experiments while the likelihood function is the same.

  • If you weaken the likelihood principle enough, and require the sampling distribution of the sufficient statistic in two samples/experiments to be equal in order for the comparison in the LP to make sense, then it becomes equivalent to a definition of the 'sufficient statistic', and that's not much of a 'theorem'. It is a definition, and a trivial matter of relating different definitions.

    "why there's no proof that only the sufficient statistic is inferentially relevant"

    Because a sufficient statistic is by definition a statistic that contains all inferentially relevant information. The statement requires no proof.
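
For concreteness, here is a minimal simulation sketch of that two-step view (a Bernoulli example; the values are illustrative): given the sufficient statistic $T = \sum_i x_i = t$, the sample is just a uniformly random arrangement of $t$ successes among $n$ positions, and reproducing it requires no knowledge of $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 10, 0.7  # illustrative values

# Step 1: sample the sufficient statistic -- the only place theta enters.
t = rng.binomial(n, theta)

# Step 2: fill in the rest of the data given T = t, without reference to theta:
# conditionally on the total, every arrangement of t successes is equally likely.
x = np.zeros(n, dtype=int)
x[rng.choice(n, size=t, replace=False)] = 1

print("T =", t, " reconstructed sample:", x)
# The two-step draw has exactly the same distribution as sampling
# n independent Bernoulli(theta) variables directly.
```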

  • "Whereof one cannot speak, thereof one must be silent." Commented Jan 23, 2023 at 16:50
  • "The only inferentially useful information in a sample is Fisher information" -- @virtuolie, what is 'inferentially useful information' in mathematical terms? If you cannot express it in mathematics, then it is not going to be a mathematical theorem. Commented Jan 26, 2023 at 12:23
  • @virtuolie if you define 'inferentially useful' as improving performance, then your WLP turns into something like the Rao-Blackwell theorem and the Lehmann-Scheffé theorem, which state (more or less) that you can't beat the performance of the best estimator based on the sufficient statistic. Commented Jan 26, 2023 at 23:02
  • @virtuolie I referred to the RB and LS theorems, but Theorem 6.1 in the reference from Scortchi does it much more efficiently. The proof is only two sentences long (because it is a bit trivial). The RB and LS theorems relate to slightly different, more complicated cases. Commented Jan 27, 2023 at 9:05
  • "...that the second statistic 'depend[s] on the data only through T'": @virtuolie, that is the definition of the sufficient statistic. If it depended on something else, then T would not be sufficient. It is also not what you want to prove. The proof is about the fact that with T you can always construct a statistic that has the same risk function. Commented Jan 27, 2023 at 9:21
