12
$\begingroup$

I am trying to see whether I understand the definition of the $p$-value as used by Sir Ronald A. Fisher and the one used today by frequentist statisticians (I am not sure how better to label the latter).

$p$-value according to Sir R. A. Fisher
A $p$-value for a given realization of a test statistic is the probability of getting an equally extreme or a more extreme realization under repeated sampling, provided that the null hypothesis $H_0$ is true. Extremity is defined in terms of the density of the test statistic (a random variable) given $H_0$, with "more extreme" meaning "having lower density". Note that such a definition of the $p$-value does not involve any reference to an alternative hypothesis. Based on my rudimentary knowledge of the history of statistics, I suppose Fisher would be fine with such a definition. As expressed by Christensen (2005),

The $p$-value is the probability of seeing something as weird or weirder than you actually saw. (There is no need to specify that it is computed under the null hypothesis because there is only one hypothesis.)

Also there:

The key fact in a Fisherian test is that it makes no reference to any alternative hypothesis.

Also (square brackets added by me to mark parts I would deemphasize):

[First, $F$ tests and $\chi^2$ tests are typically rejected only for large values of the test statistic. Clearly, in Fisherian testing, that is inappropriate.] Finding the $p$ value for an $F$ test should involve finding the density associated with the observed $F$ statistic and finding the probability of getting any value with a lower density. [This will be a two-tailed test, rejecting for values that are very large or very close to 0.]

$p$-value used by frequentist statisticians today
I think the definition above (i.e. my attempt at rephrasing Fisher's definition) is not what is normally used today, because every now and then I encounter "one-sided" vs. "two-sided" $p$-values where extremity is clearly defined with reference to a specific alternative hypothesis. The low-density areas on the "uninteresting" side of $H_0$ are thus not counted. Otherwise the definition is the same.
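To make the comparison concrete for myself, here is a small R sketch (my own illustration, not taken from any reference) that computes, for a made-up observed chi-squared statistic, the usual upper-tail p-value and a density-based p-value in the spirit of the Christensen quote above:

# observed chi-squared statistic and its degrees of freedom (made-up numbers)
x_obs <- 12
k     <- 5

# "today's" p-value: only large values of the statistic count as extreme
p_upper <- pchisq(x_obs, k, lower.tail = FALSE)

# density-based p-value: probability of all values whose chi-squared density is
# at most the density at x_obs (assumes x_obs lies above the mode, which is k - 2)
f_obs <- dchisq(x_obs, k)
x_low <- uniroot(function(x) dchisq(x, k) - f_obs,
                 lower = 1e-9, upper = k - 2)$root  # matching low-density point below the mode
p_dens <- pchisq(x_low, k) + pchisq(x_obs, k, lower.tail = FALSE)

c(upper_tail = p_upper, density_based = p_dens)

For a symmetric, unimodal null distribution (e.g. a normal or t statistic), the density-based region coincides with the usual two-sided region.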

Questions

  1. Does my understanding of $p$-value according to Sir R. A. Fisher make sense?
    (Or did I overlook something important or make some mistakes in my explanation?)
  2. Is my understanding of today's definition of the $p$-value correct?

References

  • Christensen, R. (2005). Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician, 59(2), 121–126.

$\endgroup$
14
  • 4
    $\begingroup$ I don't know if he would have used density per se. Imagine a test statistic whose distribution was bi-modal under the null (I can't think of one, but nothing prevents a test statistic from having a weird distribution). In that case, a "low density" could occur in the middle of the distribution, but my intuition is that Fisher wouldn't have the same interpretation there as in the tails. $\endgroup$ Commented Jan 9, 2019 at 18:03
  • 1
    $\begingroup$ Are you referring to p. 124, "[the hypothesis should be rejected] if any relevant feature of the observational record can be shown to [be] sufficiently rare"? That does make it sound like Fisher would have preferred density to the tail proportions. I don't really know for sure. It was just my hunch. $\endgroup$ Commented Jan 9, 2019 at 19:20
  • 1
    $\begingroup$ I see. 2 paragraphs back (p. 124) he quotes Fisher, "In choosing the grounds upon which a general hypothesis should be rejected, personal judgement may and should properly be exercised.” But then turns right around & seems to contradict him, "Nevertheless, the logic of Fisherian testing in no way depends on the source of the test statistic." Your 3rd quote follows from him favoring his interpretation of what the logic of Fisherian testing implies over what Fisher himself appears to have had in mind. $\endgroup$ Commented Jan 9, 2019 at 21:06
  • 3
    $\begingroup$ I don't really have a dog in that fight, but my hunch, above, seems to go with what Fisher says in his quote, whereas the notion that you must use density as opposed to tail proportion goes with what Christensen believes Fisher was obligated to advocate. $\endgroup$ Commented Jan 9, 2019 at 21:08
  • 4
    $\begingroup$ Christensen's article admits that "When discussing Fisherian testing, I make no claim to be expositing exactly what Fisher proposed. I am expositing a logical basis for testing that is distinct from Neyman-Pearson theory and that is related to Fisher’s views." ...so I would not trust this article as a historically-accurate summary of Fisher's own views. $\endgroup$
    – civilstat
    Commented Jul 31, 2023 at 16:59

5 Answers

12
+50
$\begingroup$

My explanation of contemporary and past interpretations of $p$ values is more historical than mathematical. However, I list a number of references that discuss these things, both from the original authors of these perspectives and from modern attestations of what $p$ values "are". Much of this you may already know, and these are simply my thoughts on the matter given what I have read.

Fisherian NHST

Though it is believed that Karl Pearson may have originally formalized the concept of a $p$ value in 1900, Fisher is largely credited with popularizing it, with the Neyman-Pearson framework following later (Salsburg, 2001). You are correct that the Fisherian conceptualization of a $p$ value simply tests the null hypothesis $H_0$. In fact, his book The Design of Experiments (written 10 years after his original conceptualization in Statistical Methods for Research Workers) has an early section (p. 18-19) that argues against Bayesian probability, and with it any "inductive" probability inference based on an alternative hypothesis $H_1$, as he gives three reasons why Bayesian thinking is "wrong":

  1. It has "mathematical contradictions." He doesn't explicitly outline why, though; this is a common issue in Fisher's writings, where he often alludes to results of great mathematical sophistication but omits the details (Salsburg, 2001).
  2. It rests on the notion that Bayesian probability should be apparent to any rational mind. Again, he doesn't provide any philosophical or mathematical argument for this point, but simply states it.
  3. It is only rarely used to justify experiments. Once more, he states this without any evidence, and it is contradictory given that he states beforehand that Bayesian or inductive reasoning had dominated discussions of probability.

With respect to how $p$ values are used now, Fisher did not identify a cutoff criterion for $p$ values, and he noted two important components of using them. First, in his view, using $p$ values was an iterative process that involved testing the hypothesis and obtaining a low enough probability to determine that the results are not due to chance (explained through his famous tea tasting experiment). This matters because he viewed replication over many experiments as an important part of the process, which is unfortunately uncommon today. Second, the level of $p$ value deemed sufficient was context-specific and should not be a hard criterion for judging the evidence in a single experimental design. In fact, on pages 30-31 of the same book, he explains:

This hypothesis, which may or may not be impugned by the result of an experiment, is again characteristic of all experimentation. Much confusion would often be avoided if it were explicitly formulated when the experiment is designed. In relation to any experiment, we may call this the "null hypothesis", and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

Thus only the null is exposited in this form of test, not the alternative. No mention of false positives or negatives is made, though sampling error is described on pages 51-52.

Neyman-Pearson NHST

It was only when Jerzy Neyman and Egon Pearson presented their canonical paper to the Royal Society that this changed. In that paper, communicated by Karl Pearson (Egon's father), they first make the following proposition (p. 290-291):

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. Here, for example, would be such a “rule of behaviour”: to decide whether a hypothesis, $H$, of a given type be rejected or not, calculate a specified character, $x$, of the observed facts; if $x > x_0$, reject $H$, if $x \leq x_0$, accept $H$. Such a rule tells us nothing as to whether in a particular case $H$ is true when $x \leq x_0$ or false when $x > x_0$. But it may often be proved that if we behave according to such a rule, then in the long run we shall reject $H$ when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject $H$ sufficiently often when it is false.

You may notice that they are similarly agnostic but still determined to make some additions to previous thinking. Later in the paper they lay out two important contributions which would forever shape the modern interpretation of $p$ values, both in the correct and the incorrect sense:

  • There should be explicit probability testing of both $H_0$ and $H_1$.
  • There should be two explicitly defined errors: $\alpha$, the false positive, and $\beta$, the false negative.

The second point gave rise to the "table of decisions" we typically see in stats books (from Hurlbert & Lombardi, 2009):

table of decisions from Hurlbert & Lombardi (2009)

With respect to the first point, they note:

It is clear that besides $H_0$ in which we are particularly interested, there will exist certain admissible alternative hypotheses. Denote by $\Omega$ the set of all simple hypotheses, which in a particular problem we consider as admissible. If $H_0$ is a simple hypothesis, it will clearly belong to $\Omega$. If $H_0$ is a composite hypothesis, then it will be possible to specify a part of the set $\Omega$, say $\omega$, such that every simple hypothesis belonging to the sub-set $\omega$ will be a particular case of the composite hypothesis $H_0$. We could say also that the simple hypotheses belonging to the sub-set $\omega$, may be obtained from $H_0$ by means of some additional conditions specifying the parameters of the function (7) which are not specified by the hypothesis $H_0$.

With respect to the second point, they dictate that:

The use of the principle of likelihood in testing hypotheses, consists in accepting for critical regions those determined by the inequality $\lambda \leq C = \text{const.}$ Let us now for a moment consider the form in which judgments are made in practical experience. We may accept or we may reject a hypothesis with varying degrees of confidence; or we may decide to remain in doubt. But whatever conclusion is reached the following position must be recognised. If we reject $H_0$, we may reject it when it is true; if we accept $H_0$, we may be accepting it when it is false, that is to say, when really some alternative is true. These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. We are reminded of the old problem considered by Laplace of the number of votes in a court of judges that should be needed to convict a prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error; is the punishment death or fine; what is the danger to the community of released criminals; what are the current ethical views on punishment? From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimized. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator. The principle upon which the choice of the critical region is determined so that the two sources of errors may be controlled is of first importance.

They go through a very lengthy mathematical justification of this decision-making criterion, but curiously never give the famous $.05$ value for $\alpha$; that value actually seems to be a carryover from tables Fisher produced in another book, where the printing constraints of the time made a few fixed levels convenient to tabulate (Hurlbert & Lombardi, 2009). In fact, Fisher was a long-time harsh critic of the Neyman-Pearson philosophy, even though the two approaches are now lumped together in the modern sense, and that is where Fisherian and modern views differ greatly (Salsburg, 2001).
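To make the "long run" behaviour concrete, here is a small R simulation of such a rule for a one-sided z-test (my own sketch, not from their paper; the sample size, effect size and $\alpha$ are arbitrary choices):

set.seed(1)
alpha <- 0.05
n_rep <- 10000                      # number of repeated experiments ("the long run")
n     <- 25                         # observations per experiment
x0    <- qnorm(1 - alpha)           # cutoff of the rule "reject H if z > x0"

# long run when H0 (mu = 0) is true: the rejection rate approaches alpha
z_null <- replicate(n_rep, sqrt(n) * mean(rnorm(n, mean = 0)))
mean(z_null > x0)                   # ~0.05, the rate of falsely rejecting H0

# long run under a fixed alternative (mu = 0.5): H0 is rejected "sufficiently often"
z_alt <- replicate(n_rep, sqrt(n) * mean(rnorm(n, mean = 0.5)))
mean(z_alt > x0)                    # the power; one minus this is beta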

Modern Interpretations of the $p$ Value

There are now "correct" modern definitions of a $p$ value and clearly incorrect ones. I simply provide the ASA's definition here as the most accurate one (Wasserstein & Lazar, 2016):

Informally, a $p$-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

It is accompanied by the following principles:

  1. $P$-values can indicate how incompatible the data are with a specified statistical model.
  2. $P$-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a $p$-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A $p$-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a $p$-value does not provide a good measure of evidence regarding a model or hypothesis.

How others define this in a modern sense varies greatly. As many have noted, the Fisher-Neyman-Pearson framework is often misinterpreted, sometimes with fatal consequences (Hauer, 2004; Ziliak & McCloskey, 2008). Because of these issues, some have proposed complete abandonment of $p$ values (see Cumming, 2014), or at least lowering cutoffs to $.005$ (Ioannidis, 2018) or even lower values. Others have developed a "neoFisherian" perspective on $p$ value reporting (Hurlbert & Lombardi, 2009), which in some respects isn't "neo" at all in terms of Fisherian thinking:

Our analysis of these matters thus leads us to a recommendation that for standard types of significance assessment, the paleoFisherian and Neyman-Pearsonian paradigms be replaced by a neoFisherian one. The essence of the latter is that a critical $\alpha$ (probability of type I error) is not specified, the terms 'significant' and 'non-significant' are abandoned, that high $p$ values lead only to suspended judgments, and that the so-called "three-valued logic" of Cox, Kaiser, Tukey, Tryon and Harris is adopted.

Some have even gone a step further and proposed an $s$ value (I'm forgetting the exact paper where I saw this, but it is alluded to in Andrade, 2019). Given all of this, what a "modern" $p$ value is may depend largely on the user, as this seems to be an unresolved issue given the current controversies surrounding their usage.

Final Remarks

Much of this is just a word potato salad I came up with from my previous readings of the material, and I may amend this answer with clarification if it isn't clear enough. I will note that The Lady Tasting Tea referenced below is the best resource for learning about the development of the NHST framework, including much of what Fisher conceptualized. Many of the other references below also provide important historical commentary on Fisher's views.

Edit

For those curious, I found the $s$ value paper; it is described in the excerpt below from Greenland (2019):

In an attempt to forestall misinterpretations, $p$ can be described as a measure of the degree of statistical compatibility between $H$ and the data (given the model $A$) bounded by $0$ = complete incompatibility (data impossible under $H$ and $A$) and $1$ = no incompatibility apparent from the test (Greenland et al. 2016). Similarly, in a test of fit of $A$, the resulting $p$ can be interpreted as a measure of the compatibility between $A$ and the data. The scaling of $p$ as a measure is poor, however, in that the difference between (say) $0.01$ and $0.10$ is quite a bit larger geometrically than the difference between $0.90$ and $0.99$. For example, using a test statistic that is normal with mean zero and standard deviation (SD) of $1$ under $H$ and $A$, a $p$ of $0.01$ vs. $0.10$ corresponds to about a $1$ SD difference in the statistic, whereas a $p$ of $0.90$ vs. $0.99$ corresponds to about a $0.1$ SD difference.

One solution to both the directional and scaling problems is to reverse the direction and rescale $p$-values by taking their negative base-$2$ logs, which results in the $s$-value $s = -\log_2(p)$. Larger values of $s$ do correspond to more evidence against $H$. As discussed below this leads to using the $s$-value as a measure of evidence against $H$ given $A$ (or against $A$ when $p$ is from a test of fit of $A$).

Whether this is a "solution" to $p$ values is open for debate, but I simply note that it is one modern take on $p$ values that is worthy of discussion in terms of contemporary $p$ values versus historical ones.
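As a small R illustration of the rescaling described in that excerpt (my own sketch), the four $p$-values Greenland mentions can be converted to $s$-values and to the corresponding $|z|$ positions of a two-sided normal test statistic:

p <- c(0.01, 0.10, 0.90, 0.99)

s <- -log2(p)           # s-values, interpretable as bits of information against H
z <- qnorm(1 - p / 2)   # |z| that yields each p in a two-sided test of a N(0,1) statistic

round(s, 2)             # 6.64 3.32 0.15 0.01
round(z, 2)             # 2.58 1.64 0.13 0.01 -> ~1 SD gap vs. ~0.1 SD gap, as in the quote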

References

  • Andrade, C. (2019). The p value and statistical significance: Misunderstandings, explanations, challenges, and alternatives. Indian Journal of Psychological Medicine, 41(3), 210–215. https://doi.org/10.4103/IJPSYM.IJPSYM_193_19
  • Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
  • Fisher, R. A. (1925). Statistical methods for research workers (11th ed.). Oliver and Boyd.
  • Fisher, R. A. (1935). The design of experiments (9th ed.). Hafner Press.
  • Greenland, S. (2019). Valid p-values behave exactly as they should: Some misleading criticisms of p-values and their resolution with s-values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625
  • Hauer, E. (2004). The harm done by tests of significance. Accident Analysis & Prevention, 36(3), 495–500. https://doi.org/10.1016/S0001-4575(03)00036-8
  • Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349. https://doi.org/10.5735/086.046.0501
  • Ioannidis, J. P. A. (2018). The proposal to lower p value thresholds to .005. JAMA, 319(14), 1429. https://doi.org/10.1001/jama.2018.1536
  • Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242–1249. https://doi.org/10.1080/01621459.1993.10476404
  • Neyman, J., & Pearson, E. S. (1933). IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
  • Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50(302), 157–175. https://doi.org/10.1080/14786440009463897
  • Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. W.H. Freeman.
  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
  • Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.
$\endgroup$
3
  • 2
    $\begingroup$ That lady tasting tea experiment is also a clear example where Fisher uses a one-sided test. $\endgroup$ Commented Dec 1, 2023 at 14:22
  • 1
    $\begingroup$ +1 for Hurlbert & Lombardi's table and their brief note about "illogical decisions"! $\endgroup$
    – civilstat
    Commented Dec 1, 2023 at 17:13
  • 1
    $\begingroup$ Great answer !! $\endgroup$
    – Lynchian
    Commented Dec 23, 2023 at 13:52
6
$\begingroup$

@ShawnHemelstrand's answer contains much of what is needed to understand these issues, but there is something I can add that will help, particularly with issues like those raised in @FrankHarrell's answer.

The Neyman and Pearson sentences often quoted are not always fully explained. (I quote them here by snipping from Shawn's answer. He provides the source.)

We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

That sentence is their justification for forgoing any treatment of the data as straightforward evidence for or against the null hypothesis. The next paragraph then goes on to set out the framework for an evidence-agnostic approach to decision-making.

But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.

That means that the purpose of their tests is not to extract a usable sort of evidence from the data and statistical model, but to use them to make algorithmic decisions that will not be wrong 'too often'. The method that they go on to describe then provides an algorithm that will work well 'in the long run'.

Note that the decision is not much informed by the strength of the evidence! As long as the test statistic falls in the 'critical region' the decision is to discard the null hypothesis. The decision is the same for a test statistic value that is at the inside edge of the critical region as it is for one that falls well inside the region. In p-value terms, that sentence becomes (assuming we are dealing with a predetermined $\alpha$ of 0.03) 'the decision is the same for an observed p-value of 0.029 as it is for a value of 0.00000000001'. Another way to see how the strength of evidence is ignored in the Neyman–Pearsonian method is to consider that p-values of 0.031 and 0.029 would yield the opposite decisions.

The neo-Fisherian responses to those same p-values might be that p=0.029 (and p=0.031) shows only modestly strong evidence against the null, but that the evidence is interesting enough that the experiment should be run again or developed further. The result of p=0.00000000001 corresponds to evidence against the null hypothesis in question compelling enough for the question to be considered answered.

Fisher's approach was to characterise the evidence in the data concerning the particular null hypothesis in question. The Fisherian approach converts the data into evidence against a 'local' hypothesis, and so it is helpful to call that evidence 'local'. That is the idea that the first quoted Neyman & Pearson sentence speaks against.

The Neyman–Pearsonian approach yields a decision. That decision may be correct or incorrect as detailed in the table that Shawn includes, but there is no way to say whether it will be right or wrong for any particular 'local' hypothesis tested. What can be said is the global probability of error associated with the repeated application of the method in the long run. The Neyman–Pearsonian methods are designed with the global error rates in mind, not the properties of any particular decision regarding a local hypothesis.

The evidence-ignoring aspects of the Neyman–Pearsonian approach probably make many of us uncomfortable enough that it needs some justification. What is the benefit of working within that framework? Well, it enables simple accounting. The long-run error rates that can be expected from faithful application of the algorithm can be specified. They can be 'controlled' to the extent that the false positive errors are pre-set in the selection of $\alpha$ and the false negative errors are 'controlled' by the choice of sample size and of a most powerful test procedure.

Now, finally, to the point raised by Frank Harrell's answer. Is the probability of being wrong a Bayesian probability? Yes, it would be for any decision made on the basis of a neo-Fisherian analysis of the strength of evidence. But if we are talking about the global long-run probability of error associated with application of the Neyman–Pearsonian framework, that probability is a well-defined frequency determined by the design of the analysis. That frequency comes from the (theoretical) long-run application of the method, not from any repeated testing of the local hypothesis of interest. Its expression as a probability needs no prior, as it is purely frequentist and non-Bayesian.

$\endgroup$
6
$\begingroup$

Extremity is defined in terms of the density of the test statistic

First, $F$ tests and $\chi^2$ tests are typically rejected only for large values of the test statistic. Clearly, in Fisherian testing, that is inappropriate

This makes little sense to me.

1) No clear use of density in the definition by Fisher

I don't believe that this lowest-density region is part of a definition that stems from Fisher. Aside from there being no clear writings where he instructs that 'extreme' should be defined this way (two-sidedly, in terms of density), the lady tasting tea experiment is a clear example where he deviates from it.

Does Fisher, perhaps in his critique of Mendel's experiments, express p-values based on the density and use a two-sided chi-squared test? No: in that case Fisher also uses one-sided computations of the p-values, but for the lower end of the distribution. See Table V in "Has Mendel's work been rediscovered?", Annals of Science, 1: 115-137 (1936).

table with p-values from Fisher's work

2) Densities are ambiguous

The use of densities can be problematic because they are ambiguous.

Considering not only large values of the test statistic, but also small values when they have a low probability density, is not necessary for doing Fisherian testing.

The 'density' is a relative notion: it depends on how the test statistic is expressed.

For example,

  • the region of highest/lowest density is different when we use $\chi$ versus $\chi^2$ (a numerical check follows below).

  • Or consider a sample of iid normal distributed variables $x_i \sim N(0,1)$.

    On the one hand, the joint distribution of the sample has its highest density at the point where every $x_i = 0$.

    On the other hand, if the sample size is larger than two, the density of the chi-squared statistic at the value $\chi^2 = 0$ (which it takes at that same point) is zero. Instead, the mode of the chi-squared distribution is at $\chi^2 = n-2$.

So this is probably why we often use a one-sided $\chi^2$-test or $F$-test.
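Here is a quick numerical check of the first bullet above (my own sketch; the two candidate values are arbitrary). The same pair of observations swaps its density ordering depending on whether the statistic is expressed on the $\chi^2$ scale or on the $\chi$ scale:

k <- 4                   # degrees of freedom
x <- c(0.5, 9)           # two candidate values of the chi-squared statistic

d_chisq <- dchisq(x, k)  # density on the chi-squared scale

# density of the same observations on the chi scale (y = sqrt(x)),
# via the change-of-variables formula f_chi(y) = 2 * y * f_chisq(y^2)
y     <- sqrt(x)
d_chi <- 2 * y * dchisq(y^2, k)

d_chisq                  # 0.097 0.025 -> x = 9   is the "more extreme" one on the chi-squared scale
d_chi                    # 0.138 0.150 -> x = 0.5 is the "more extreme" one on the chi scale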


Computational example for 2nd point

See below an example with explicit computations in R. It uses Laplace-distributed variables such that, under the null hypothesis $H_0:\mu = 0$, the null distribution of the chi-squared statistic is a chi-squared distribution with 4 degrees of freedom.

What you can see is that for small values of $x,y$ the density in the scatterplot is high, so these are not extreme values. Yet in terms of the chi-squared null distribution the density there is low. You can see this in the little red dot in the center, where the p-value is below 0.2 if we use the chi-squared statistic.

example with Laplace distribution

set.seed(1)
n = 1000
mu = 0

# simulate x and y as Laplace(mu, rate 0.5) variables (exponential magnitude with a random sign)
x = mu + rexp(n,0.5)*(1-2*rbinom(n,1,0.5))
y = mu + rexp(n,0.5)*(1-2*rbinom(n,1,0.5))

# under H0: mu = 0 this statistic follows a chi-squared distribution with 4 df
chi_squared = abs(x-mu)+abs(y-mu)
density = dchisq(chi_squared,4)
# flag the 20% of observations with the lowest null-distribution density (p-value <= 0.2)
significant = rank(density)<=n*0.20

# scatterplot of the sample; flagged observations are drawn in red
plot(x,y,pch=20, col = rgb(significant,0,0,0.3), 
     main = "distribution of \n x ~ Laplace(mu,0.5) \n y ~ Laplace(mu,0.5)" )

# statistic versus its null-distribution density; flagged observations in red
plot(chi_squared,density, col = 1+significant,
     main = "chi-squared value and null distribution density")

legend(5,0.17,
       c("observations with p-value <= 0.2", 
         "observations with p-values > 0.2"),
       col = c(2,1), pch = 20)

Also, if instead of the chi-squared statistic we used its square, then the lowest-density region of the statistic would lie among the highest values only.

Compare the following two histograms of the values simulated under the null hypothesis:

hist(chi_squared, breaks = 40, freq = 0)
hist(chi_squared^2, breaks = 40, freq = 0)

different statistics will give different lowest density regions

$\endgroup$
8
  • $\begingroup$ Are these the same densities we are talking about? Can your example be interpreted as an instance of a test statistic falling in some place of its null distribution? Because I think I am talking about the latter. $\endgroup$ Commented Dec 2, 2023 at 11:15
  • $\begingroup$ @RichardHardy it is not really considered failing. The p-values can be correct. But the issue is that there are no uniquely defined p-values when you consider the highest density region. If you consider two sided chi-squared test, then the p-value will differ if you compute it based on $\chi$ statistic versus based on $\chi^2$ statistic. $\endgroup$ Commented Dec 2, 2023 at 11:21
  • $\begingroup$ I wrote (and meant) falling, not failing. I do not quite follow your argument; I think a bit more detail would help bridge the gap between your examples and how they can be used in hypothesis testing where we have a $H_0$, a test statistic and its distribution under $H_0$. $\endgroup$ Commented Dec 2, 2023 at 11:25
    $\begingroup$ This quote “First, F tests and χ2 tests are typically rejected only for large values of the test statistic. Clearly, in Fisherian testing, that is inappropriate” makes little sense to me. 1) I am unaware of Fisherian testing using the density of the statistic to decide about 'extreme' values and the computation of a p-value. 2) The chi-squared statistic might have a low density for small values, but this contrasts with the joint distribution of the sample having a high density. The statistic that is used changes the density and makes this density-based p-value ambiguous. $\endgroup$ Commented Dec 2, 2023 at 11:27
  • $\begingroup$ The part you explain in 2) is clear to me, but what is your $H_0$ and what are the test statistics you are comparing (explicitly, as functions of the sample)? And what are their distributions under $H_0$? $\endgroup$ Commented Dec 2, 2023 at 11:31
5
$\begingroup$

Technically there is no contradiction. If you for example look at the one- or two-sided t-test for the $H_0$ of a mean being zero, the two-sided test will measure what's "more extreme" by the absolute value of the t-statistic, i.e., both positive and negative deviations will count as "too extreme", whereas the one-sided test will measure extremity by large positive values of the t-statistic only (or negative if you test the other way round). Technically this still fits Fisher's definition; only what counts as "extreme" is different. Fisher himself used one- and two-sided tests, see, e.g., Wikipedia on one- and two-sided tests.
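A minimal R illustration of these two notions of "more extreme" for a t-statistic (my own sketch; the observed value and degrees of freedom are arbitrary):

t_obs <- 2.1   # observed t-statistic
nu    <- 20    # degrees of freedom

# two-sided: "at least as extreme" means |t| >= |t_obs|
p_two <- 2 * pt(abs(t_obs), nu, lower.tail = FALSE)

# one-sided (alternative of a positive effect): "at least as extreme" means t >= t_obs
p_one <- pt(t_obs, nu, lower.tail = FALSE)

c(two_sided = p_two, one_sided = p_one)  # both are P(equally or more extreme | H0)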

In fact I'd say that every test comes with an implicit definition of an alternative, namely that the alternative is the set of models for which the probability of rejection is larger than under the null hypothesis. This is no different for Fisher's tests, even though he wouldn't say it explicitly. (The Neyman-Pearson view is that one should choose a test statistic so that the power is optimal under a specific alternative, whereas for Fisher the test statistic comes first and he wouldn't talk about an alternative, but there is implicitly an alternative involved, see above.)

There is a large variety of interpretations of p-values, and also considerable difference between Fisher and Neyman/Pearson (and what was later made of this controversy), but the technical definition of the p-value is not really controversial. (What is also somewhat controversial is the choice of wording for explaining it in general terms, such as "extreme", "weird", etc.) The reference to densities, by the way, makes sense for certain tests and not for others.

By the way, if you want to be thoroughly confused, read this (including discussion, to which I contribute a tiny bit). I stand by what I wrote above regardless, so I link this for information, but the warning to not read it if you don't want more confusion is serious. (Personally I don't mind confusion so I still recommend to read it to those who can handle it. ;-)

$\endgroup$
12
  • $\begingroup$ From gung's comment: Imagine a test statistic whose distribution was bi-modal under the null (I can't think of one, but nothing prevents a test static from having a weird distribution). In that case, a "low density" could occur in the middle of the distribution Does your "no contradiction" statement hinge on this not being the case (i.e. low density at the tails, not in the middle)? I find low density to be a very intuitive measure of what is atypical under $H_0$ while I think other measures (such as how far in a tail) derive from that but otherwise might not be intrinsically meaningful. $\endgroup$ Commented Dec 1, 2023 at 11:16
  • 2
    $\begingroup$ @RichardHardy In real data analysis I don't think the concept "intrinsically meaningful" gets us very far. We have to define what we mean by "extreme" or "weird" and the definition will depend on what exactly we know and what exactly we're interested in. Being in a low density region is "extreme" according to any definition that measures extremity as inverse proportional to density, and I'm fine with that; but in some situations we may want something else. $\endgroup$ Commented Dec 1, 2023 at 13:40
  • 1
    $\begingroup$ I think your point about the implicit alternative is true. Fisher seemed to completely reject Neyman-Pearson's ideas of an alternative hypothesis, but seemed to have no issue however with accepting claims like $\mu_1 \neq \mu_2$ when rejecting the null (Hurlbert & Lombardi, 2009). $\endgroup$ Commented Dec 1, 2023 at 20:58
  • $\begingroup$ It it hard to argue against in some situations we may want something else :) I would be curious to see an example where a value that has low density under $H_0$ can be considered less "weird/extreme" than one with high density. I wonder if that can be achieved without violating the intrinsic meanings of "weird" and "extreme". Here, I define "weird/extreme" in relation to $H_0$ without a reference to $H_1$. If we introduce $H_1$, things can change. $\endgroup$ Commented Dec 2, 2023 at 10:43
  • $\begingroup$ @RichardHardy Well, a test with distribution of the test statistic so that there is low density somewhere in the middle is a strange thing in the first place. Maybe if you can give me an example where that occurs I could tell you an application where you then still want to reject only in big distance from the center regardless. ;-) $\endgroup$ Commented Dec 2, 2023 at 13:40
3
$\begingroup$

Shawn, that is a wonderful answer, but it's important to point out a subtle problem with a small part of it. The “in the long run we should not often be wrong” statement of Neyman and Pearson is an example of a tragic logic error which Fisher also made at times and which is addressed in Aubrey Clayton's terrific book Bernoulli’s Fallacy. The chance of being wrong about an assertion that an effect is present is an unconditional probability with respect to the unknown parameter, i.e., the probability of a mistake is not conditional on $H_0$. Yet the N&P and Fisher justifications go on to assume $H_0$ in describing the value and meaning of the procedure. The probability of making a mistake in asserting an effect needs to be conditional on the evidence used in making the assertion, not conditional on the unknown value of an effect parameter. When $H_0$ is true, any assertion of an effect is a mistake by definition. The probability of a mistake is the Bayesian posterior probability that the effect is zero or is in the opposite direction. This gets the conditioning right.

For example, the probability that a regulatory agency makes a mistake in approving a drug to be marketed is the probability that the drug doesn't work. It is not the probability of asserting that the drug works when it doesn't work, as we are past that point in the logic flow. If the drug was approved because the probability of efficacy was 0.96, the probability of inefficacy, i.e., the probability of an error, is 0.04.
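To see the two conditionings side by side, here is a small R simulation (my own sketch; the prior fraction of truly effective drugs, the sample size, and the effect size are purely illustrative assumptions):

set.seed(1)
n_drugs <- 100000
works   <- rbinom(n_drugs, 1, 0.10) == 1   # assume 10% of candidate drugs truly work

n      <- 50                               # patients per (one-arm) trial
effect <- 0.3                              # standardized effect when the drug works
z      <- rnorm(n_drugs, mean = ifelse(works, sqrt(n) * effect, 0))
approve <- z > qnorm(0.975)                # assert efficacy at one-sided alpha = 0.025

mean(approve[!works])  # P(assert effect | drug doesn't work): controlled at alpha, ~0.025
mean(!works[approve])  # P(drug doesn't work | effect asserted): the error probability described above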

$\endgroup$
9
  • 1
    $\begingroup$ The “in the long run we should not often be wrong” statement of Neyman and Pearson is an example of a tragic logic error which Fisher also made at times---Absolutely agree. I would say Fisher was often opaque about some of these points (perhaps on purpose) and Neyman and Pearson didn't always provide better answers to some of his postulations. $\endgroup$ Commented Dec 1, 2023 at 13:31
  • 2
    $\begingroup$ I read N-P's statement differently. Reading to the end of that quoted paragraph that includes "in the long run...", their goal is to rarely reject H when it's true H, and rarely fail-to-reject H when it's false. If you are talking about a sequence of problems, and in each problem you divide up the world into two possible states (H and ~H), and you rarely make a mistake under H and also rarely make a mistake under ~H, then indeed in the long run you are not wrong. Your "probability of a mistake" is something different entirely, but that doesn't mean N-P's statement is a logic error. $\endgroup$
    – civilstat
    Commented Dec 1, 2023 at 15:20
  • 1
    $\begingroup$ But the N-P quote is not just about rarely being wrong in acting as if an effect is present; it's about rarely being wrong overall. $\endgroup$
    – civilstat
    Commented Dec 1, 2023 at 17:08
  • 3
    $\begingroup$ As I see it, N-P's intent was: If you define "mistake" as "EITHER conclude H when ~H, OR conclude ~H when H", and if both P(mistake | H) and P(mistake | ~H) are small, then the unconditional P(mistake) can be written as their convex combination and thus is also small. Meanwhile, you seem to be defining P(mistake) ONLY as P(H | the evidence led us to conclude ~H). You might personally find that more useful than N-P's view, but yours is not the only way to define it. Their view isn't a logic error -- they are simply talking about something different than you are. $\endgroup$
    – civilstat
    Commented Dec 1, 2023 at 17:10
  • 2
    $\begingroup$ That's actually consistent with my point. In poker the goal is to maximize your chance of winning the game. It's not to minimize your probability of having bet in those games you lost, which is what is related to $\alpha$. $\endgroup$ Commented Dec 1, 2023 at 22:27
