My explanation of contemporary and past interpretations of $p$ values is more historical than mathematical. However, I list a number of references that discuss these issues, both from the original authors of these perspectives and from modern commentaries on what $p$ values "are". Much of this you may already know, and these are simply my thoughts on the matter given what I have read.
Fisherian NHST
Though Pearson (1900) is often credited with the first formal use of a $p$ value, Fisher is largely credited with popularizing it, followed by the Neyman-Pearson framework (Salsburg, 2001). You are correct that the Fisherian conceptualization of a $p$ value simply tests the null hypothesis $H_0$. In fact, his book The Design of Experiments (written ten years after his original conceptualization in Statistical Methods for Research Workers) has an early section (pp. 18-19) that argues against Bayesian probability, and with it any "inductive" probability inference assumed from the alternative hypothesis $H_1$. He gives three reasons why Bayesian thinking is "wrong":
- It has "mathematical contradictions." He doesn't explicitly outline why, a common issue in Fisher's writing: he often alludes to ideas of great mathematical sophistication but omits the details (Salsburg, 2001).
- The notion that Bayesian probability should be apparent to any rational mind. Again, he provides no philosophical or mathematical argument here; he simply states his objection.
- It is only rarely used to justify experiments. Once more, he states this without evidence, and it is contradictory given that he notes beforehand that Bayesian, or inductive, reasoning had dominated discussions of probability.
With respect to how $p$ values are used now, Fisher did not identify a cutoff criterion for $p$ values, and noted two important components of using them. First, testing with $p$ values was in his view an iterative process: one tests the hypothesis repeatedly and obtains a low enough probability to determine that the results are not due to chance (explained through his famous tea-tasting experiment). This matters because he viewed replication over many experiments as an essential part of the process, something unfortunately uncommon today. Second, the level of $p$ deemed sufficient was context-specific and should not be a hard criterion for judging the evidence from a single experimental design. In fact, on pages 30-31 of the same book, he explains:
This hypothesis, which may or may not be impugned by the result of an
experiment, is again characteristic of all experimentation. Much
confusion would often be avoided if it were explicitly formulated when
the experiment is designed. In relation to any experiment, we may call
this the "null hypothesis", and it should be noted that the null
hypothesis is never proved or established, but is possibly disproved,
in the course of experimentation. Every experiment may be said to
exist only in order to give the facts a chance of disproving the null
hypothesis.
Thus only the null hypothesis is tested in this framework, not the alternative. No mention of false positives or negatives is made, though sampling error is described on pages 51-52.
Neyman-Pearson NHST
It was only when Jerzy Neyman and Egon Pearson presented their canonical paper to the Royal Society that this changed. In their paper, communicated by Karl Pearson (Egon's father), they first make the following proposition (pp. 290-291):
We are inclined to think that as far as a particular hypothesis is
concerned, no test based upon the theory of probability can by itself
provide any valuable evidence of the truth or falsehood of that
hypothesis.
But we may look at the purpose of tests from another view-point.
Without hoping to know whether each separate hypothesis is true or
false, we may search for rules to govern our behaviour with regard to
them, in following which we insure that, in the long run of
experience, we shall not be too often wrong. Here, for example, would
be such a “rule of behaviour”: to decide whether a hypothesis, $H$,
of a given type be rejected or not, calculate a specified character,
$x$, of the observed facts; if $x > x_0$, reject $H$; if $x \leq x_0$, accept $H$. Such a
rule tells us nothing as to whether in a particular case $H$ is true
when $x \leq x_0$ or false when $x > x_0$. But it may often be proved that if we
behave according to such a rule, then in the long run we shall reject
$H$ when it is true not more, say, than once in a hundred times, and in
addition we may have evidence that we shall reject $H$ sufficiently
often when it is false.
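Their "rule of behaviour" is easy to simulate. A minimal sketch follows; the standard-normal statistic, the one-sided cutoff $x_0 = 1.645$, and the variable names are my illustrative choices, not theirs:

```python
import random

random.seed(42)

# Neyman-Pearson "rule of behaviour": reject H whenever the observed
# statistic x exceeds a fixed cutoff x0. Here x ~ N(0, 1) under H, and
# x0 = 1.645 is the one-sided 5% point of the standard normal.
X0 = 1.645
N_EXPERIMENTS = 100_000

rejections = sum(random.gauss(0.0, 1.0) > X0 for _ in range(N_EXPERIMENTS))
rate = rejections / N_EXPERIMENTS

# In the long run, a true H is rejected close to 5% of the time.
print(f"long-run Type I error rate: {rate:.3f}")
```

No single rejection tells us whether $H$ is true in that particular case; only the long-run error rate is controlled, which is exactly the point of the passage above.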
You may detect that they are similarly agnostic about the truth of any single hypothesis, but still determined to add to previous thinking. Later in the paper they make two important contributions which would forever shape the modern interpretation of $p$ values, in both correct and incorrect senses:
- There should be explicit probability testing of both $H_0$ and $H_1$.
- There should be two formally defined errors: $\alpha$, the false positive, and $\beta$, the false negative.
The second point gave rise to the "table of decisions" we typically see in stats books (reconstructed here in its standard form; cf. Hurlbert & Lombardi, 2009):

|                      | $H_0$ true                    | $H_0$ false                  |
|----------------------|-------------------------------|------------------------------|
| Reject $H_0$         | Type I error ($\alpha$)       | Correct decision ($1-\beta$) |
| Fail to reject $H_0$ | Correct decision ($1-\alpha$) | Type II error ($\beta$)      |
With respect to the first point, they note:
It is clear that besides $H_0$ in which we are particularly interested,
there will exist certain admissible alternative hypotheses. Denote by
$\Omega$ the set of all simple hypotheses, which in a particular problem we
consider as admissible. If $H_0$ is a simple hypothesis, it will clearly
belong to $\Omega$. If $H_0$ is a composite hypothesis, then it will be possible
to specify a part of the set $\Omega$, say $\omega$, such that every simple
hypothesis belonging to the sub-set $\omega$ will be a particular case of
the composite hypothesis $H_0$. We could say also that the simple
hypotheses belonging to the sub-set $\omega$, may be obtained from $H_0$ by
means of some additional conditions specifying the parameters of the
function (7) which are not specified by the hypothesis $H_0$.
With respect to the second point, they write:
The use of the principle of likelihood in testing hypotheses, consists
in accepting for critical regions those determined by the inequality
$X \leq C = \text{const.}$ Let us now for a moment consider the form in
which judgments are made in practical experience. We may accept or we
may reject a hypothesis with varying degrees of confidence; or we may
decide to remain in doubt. But whatever conclusion is reached the
following position must be recognised. If we reject $H_0$, we may
reject it when it is true; if we accept $H_0$, we may be accepting it
when it is false, that is to say, when really some alternative is
true. These two sources of error can rarely be eliminated completely;
in some cases it will be more important to avoid the first, in others
the second. We are reminded of the old problem considered by Laplace
of the number of votes in a court of judges that should be needed to
convict a prisoner. Is it more serious to convict an innocent man or
to acquit a guilty? That will depend upon the consequences of the
error; is the punishment death or fine; what is the danger to the
community of released criminals; what are the current ethical views on
punishment? From the point of view of mathematical theory all that we
can do is to show how the risk of the errors may be controlled and
minimized. The use of these statistical tools in any given case, in
determining just how the balance should be struck, must be left to the
investigator. The principle upon which the choice of the critical
region is determined so that the two sources of errors may be
controlled is of first importance.
They go on to a very lengthy mathematical justification of this decision-making criterion, but curiously never give the famous $.05$ value for $\alpha$. That value actually seems to be a carryover from tables produced by Fisher in another book, where the printing constraints of the time made $.05$ convenient for tabulation (Hurlbert & Lombardi, 2009). In fact, Fisher was a long-time harsh critic of the Neyman-Pearson philosophy, even though the two are now lumped together in the modern framework, and this is where Fisherian and modern views differ greatly (Salsburg, 2001).
Modern Interpretations of the $p$ Value
There are now "correct" modern definitions of a $p$ value and clearly incorrect ones. I simply provide the ASA's definition here as the most accurate one (Wasserstein & Lazar, 2016):
Informally, a $p$-value is the probability under a specified statistical
model that a statistical summary of the data (e.g., the sample mean
difference between two compared groups) would be equal to or more
extreme than its observed value.
Along with the following six principles:
- $P$-values can indicate how incompatible the data are with a specified statistical model.
- $P$-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a $p$-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A $p$-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a $p$-value does not provide a good measure of evidence regarding a model or hypothesis.
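The ASA's informal definition above can be made concrete with a small simulation. The following sketch uses a permutation test with made-up data and a two-sided notion of "more extreme"; all of these choices are my own illustrative assumptions:

```python
import random
import statistics

random.seed(1)

# The p-value is the probability, under a specified model (here: the
# null model that group labels are exchangeable), of a summary
# statistic equal to or more extreme than the one observed.
group_a = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
group_b = [4.6, 4.9, 4.4, 4.7, 5.0, 4.5]

observed = statistics.mean(group_a) - statistics.mean(group_b)

pooled = group_a + group_b
n_a = len(group_a)
n_sims = 20_000
count = 0
for _ in range(n_sims):
    random.shuffle(pooled)  # relabel under the null model
    diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
    if abs(diff) >= abs(observed):  # "equal to or more extreme"
        count += 1

p_value = count / n_sims
print(f"observed difference: {observed:.2f}, permutation p-value: {p_value:.4f}")
```

Note that nothing here says anything about the probability that the null model is true, the size of the effect, or its importance, in line with the principles listed above.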
How others define this in a modern sense varies greatly. As many past commentators have noted, the Fisher-Neyman-Pearson framework is often misinterpreted, sometimes with fatal consequences (Hauer, 2004; Ziliak & McCloskey, 2008). Because of these issues, some have proposed abandoning $p$ values entirely (see Cumming, 2014), or at least lowering cutoffs to $.005$ (Ioannidis, 2018) or even lower. Others have developed a "neoFisherian" perspective on $p$ value reporting (Hurlbert & Lombardi, 2009), which in some respects isn't "neo" at all in terms of Fisherian thinking:
Our analysis of these matters thus leads us to a recommendation that
for standard types of significance assessment, the paleoFisherian and
Neyman-Pearsonian paradigms be replaced by a neoFisherian one. The
essence of the latter is that a critical $\alpha$ (probability of type I
error) is not specified, the terms 'significant' and 'non-significant'
are abandoned, that high $p$ values lead only to suspended judgments,
and that the so-called "three-valued logic" of Cox, Kaiser, Tukey,
Tryon and Harris is adopted.
Some have even gone a step further and proposed an $s$ value (I'm forgetting the exact paper where I saw this, but it is alluded to in Andrade, 2019). Given all of this, what a "modern" $p$ value is may depend largely on the user, as the issue seems unresolved given the current controversies surrounding $p$ value usage.
Final Remarks
Much of this is just a word potato salad assembled from my previous readings of the material, and I may amend this answer with clarifications if anything is unclear. I will note that The Lady Tasting Tea, referenced below, is the best resource for learning about the development of the NHST framework, including much of what Fisher conceptualized. Many of the other references below also provide important historical commentary on Fisher's views.
Edit
For those curious, I found the $s$ value paper, which is described below in Greenland (2019):
In an attempt to forestall misinterpretations, $p$ can be described as a
measure of the degree of statistical compatibility between $H$ and the
data (given the model $A$) bounded by $0$ = complete incompatibility (data
impossible under $H$ and $A$) and $1$ = no incompatibility apparent from the
test (Greenland et al. 2016). Similarly, in a test of fit of $A$, the resulting $p$
can be interpreted as a measure of the compatibility between $A$ and the
data. The scaling of $p$ as a measure is poor, however, in that the
difference between (say) $0.01$ and $0.10$ is quite a bit larger
geometrically than the difference between $0.90$ and $0.99$. For example,
using a test statistic that is normal with mean zero and standard
deviation (SD) of $1$ under $H$ and $A$, a $p$ of $0.01$ vs. $0.10$ corresponds to
about a $1$ SD difference in the statistic, whereas a $p$ of $0.90$ vs. $0.99$
corresponds to about a $0.1$ SD difference.
One solution to both the
directional and scaling problems is to reverse the direction and
rescale $p$-values by taking their negative base-$2$ logs, which results
in the $s$-value $s = -\log_2(p)$. Larger values of $s$ do correspond to more
evidence against $H$. As discussed below this leads to using the $s$-value
as a measure of evidence against $H$ given $A$ (or against $A$ when $p$ is
from a test of fit of $A$).
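As a quick sketch of the transform (the helper name `s_value` is mine):

```python
import math

# Greenland's s-value rescales a p-value into bits of information
# against the tested hypothesis: s = -log2(p).
def s_value(p: float) -> float:
    return -math.log2(p)

# The scaling point from the quote: 0.01 vs 0.10 differ by far more
# evidence (about 3.3 bits) than 0.90 vs 0.99 (about 0.14 bits).
for p in (0.01, 0.10, 0.90, 0.99):
    print(f"p = {p:.2f}  ->  s = {s_value(p):.2f} bits")
```

A $p$ of $0.5$ carries exactly one bit of information against $H$, a single coin-flip's worth, which makes for an intuitive anchor on this scale.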
Whether this is a "solution" to the problems with $p$ values is open for debate, but I note it as one modern take worth discussing when comparing contemporary and historical views of $p$ values.
References
- Andrade, C. (2019). The p value and statistical significance: Misunderstandings, explanations, challenges, and alternatives. Indian Journal of Psychological Medicine, 41(3), 210–215. https://doi.org/10.4103/IJPSYM.IJPSYM_193_19
- Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
- Fisher, R. A. (1925). Statistical methods for research workers (11th ed.). Oliver and Boyd.
- Fisher, R. A. (1935). The design of experiments (9th ed.). Hafner Press.
- Greenland, S. (2019). Valid p-values behave exactly as they should: Some misleading criticisms of p-values and their resolution with s-values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625
- Hauer, E. (2004). The harm done by tests of significance. Accident Analysis & Prevention, 36(3), 495–500. https://doi.org/10.1016/S0001-4575(03)00036-8
- Hurlbert, S. H., & Lombardi, C. M. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349. https://doi.org/10.5735/086.046.0501
- Ioannidis, J. P. A. (2018). The proposal to lower p value thresholds to .005. JAMA, 319(14), 1429. https://doi.org/10.1001/jama.2018.1536
- Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242–1249. https://doi.org/10.1080/01621459.1993.10476404
- Neyman, J., & Pearson, E. S. (1933). IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
- Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50(302), 157–175. https://doi.org/10.1080/14786440009463897
- Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. W.H. Freeman.
- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
- Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.