116
$\begingroup$

We already have multiple threads on $p$-values that reveal lots of misunderstandings about them. Ten months ago we had a thread about a psychological journal that "banned" $p$-values; now the American Statistical Association (2016) says that our analysis "should not end with the calculation of a $p$-value".

The American Statistical Association (ASA) believes that the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the $p$-value.

The committee lists other approaches as possible alternatives or supplements to $p$-values:

In view of the prevalent misuses of and misconceptions concerning $p$-values, some statisticians prefer to supplement or even replace $p$-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates. All these measures and approaches rely on further assumptions, but they may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is correct.

So let's imagine a post-$p$-values reality. The ASA lists some methods that can be used in place of $p$-values, but why are they better? Which of them can be a real-life replacement for a researcher who has used $p$-values all their life? I imagine that questions of this kind will appear in a post-$p$-values reality, so let's try to be one step ahead of them. What is a reasonable alternative that can be applied out of the box? Why should this approach convince your lead researcher, editor, or readers?

As this follow-up blog entry suggests, $p$-values are unbeatable in their simplicity:

The p-value requires only a statistical model for the behavior of a statistic under the null hypothesis to hold. Even if a model of an alternative hypothesis is used for choosing a “good” statistic (which would be used for constructing the p-value), this alternative model does not have to be correct in order for the p-value to be valid and useful (i.e.: control type I error at the desired level while offering some power to detect a real effect). In contrast, other (wonderful and useful) statistical methods such as Likelihood ratios, effect size estimation, confidence intervals, or Bayesian methods all need the assumed models to hold over a wider range of situations, not merely under the tested null.
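To make the quoted point concrete, here is a minimal permutation-test sketch in Python (the data and group sizes are invented for illustration): the null distribution of the statistic is built purely by relabelling the observations under $H_0$, with no model of the alternative hypothesis required.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.4, 1.0, size=15)   # hypothetical measurements, group A
b = rng.normal(0.0, 1.0, size=15)   # hypothetical measurements, group B

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])

# Null distribution of the statistic, built purely by relabelling under H0
n_perm = 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_stats[i] = shuffled[:len(a)].mean() - shuffled[len(a):].mean()

p_value = np.mean(np.abs(perm_stats) >= abs(observed))   # two-sided p-value
print(f"observed difference = {observed:.2f}, permutation p = {p_value:.3f}")
```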

Are they really, or is that not the case and could we easily replace them?

I know this is broad, but the main question is simple: what is the best real-life alternative to $p$-values that can be used as a replacement, and why?


ASA (2016). ASA Statement on Statistical Significance and $P$-values. The American Statistician. (in press)

$\endgroup$
9
  • 3
    $\begingroup$ Bound to become a classic question +1! The Bayesian approach, because it allows us to (at least subjectively) answer the question we are often interested in, viz.: "In light of the evidence (data), what is the probability that the hypothesis is true?" $\endgroup$ Commented Mar 8, 2016 at 8:46
  • 10
    $\begingroup$ "Post-$p$-value reality" has a nice dystopian ring to it. $\endgroup$ Commented Mar 8, 2016 at 9:38
  • 4
    $\begingroup$ The discussion papers posted along with the ASA statement are worth reading as some of them have suggestions on what could replace p-values. Supplemental Content $\endgroup$
    – Seth
    Commented Mar 8, 2016 at 20:11
  • 3
    $\begingroup$ I have posted a related question based on another part of the ASA report, one of its warnings about the potential abuses of p-values: How much do we know about p-hacking? $\endgroup$
    – Silverfish
    Commented Mar 9, 2016 at 13:24
  • 1
    $\begingroup$ As a comment to my own question, there is a nice thread that discusses similar topic: stats.stackexchange.com/questions/17897/… $\endgroup$
    – Tim
    Commented Mar 9, 2016 at 13:39

10 Answers

110
+100
$\begingroup$

I will focus this answer on the specific question of what are the alternatives to $p$-values.

There are 21 discussion papers published along with the ASA statement (as Supplemental Materials): by Naomi Altman, Douglas Altman, Daniel J. Benjamin, Yoav Benjamini, Jim Berger, Don Berry, John Carlin, George Cobb, Andrew Gelman, Steve Goodman, Sander Greenland, John Ioannidis, Joseph Horowitz, Valen Johnson, Michael Lavine, Michael Lew, Rod Little, Deborah Mayo, Michele Millar, Charles Poole, Ken Rothman, Stephen Senn, Dalene Stangl, Philip Stark and Steve Ziliak (some of them wrote together; I list all for future searches). These people probably cover all existing opinions about $p$-values and statistical inference.

I have looked through all 21 papers.

Unfortunately, most of them do not discuss any real alternatives, even though the majority are about the limitations, misunderstandings, and various other problems with $p$-values (for a defense of $p$-values, see Benjamini, Mayo, and Senn). This already suggests that alternatives, if any, are not easy to find and/or to defend.

So let us look at the list of "other approaches" given in the ASA statement itself (as quoted in your question):

[Other approaches] include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.

  1. Confidence intervals

Confidence intervals are a frequentist tool that goes hand-in-hand with $p$-values; reporting a confidence interval (or some equivalent, e.g., mean $\pm$ standard error of the mean) together with the $p$-value is almost always a good idea.
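For example, a routine report of an interval estimate together with the $p$-value might be produced like this (a minimal Python sketch with simulated data; an equal-variance two-sample $t$-test is assumed purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=0.5, scale=1.0, size=20)   # hypothetical "treatment" measurements
b = rng.normal(loc=0.0, scale=1.0, size=20)   # hypothetical "control" measurements

t_stat, p_value = stats.ttest_ind(a, b)       # classical two-sample t-test (equal variances)

# 95% CI for the mean difference, using the pooled standard error
diff = a.mean() - b.mean()
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_value:.3f}")
```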

Some people (not among the ASA disputants) suggest that confidence intervals should replace the $p$-values. One of the most outspoken proponents of this approach is Geoff Cumming who calls it new statistics (a name that I find appalling). See e.g. this blog post by Ulrich Schimmack for a detailed critique: A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics. See also We cannot afford to study effect size in the lab blog post by Uri Simonsohn for a related point.

See also this thread (and my answer therein) about the similar suggestion by Norm Matloff, where I argue that when reporting CIs one would still like to have the $p$-values reported as well: What is a good, convincing example in which p-values are useful?

Some other people (not among the ASA disputants either), however, argue that confidence intervals, being a frequentist tool, are as misguided as $p$-values and should also be disposed of. See, e.g., Morey et al. 2015, The Fallacy of Placing Confidence in Confidence Intervals linked by @Tim here in the comments. This is a very old debate.

  2. Bayesian methods

(I don't like how the ASA statement formulates the list. Credible intervals and Bayes factors are listed separately from "Bayesian methods", but they are obviously Bayesian tools. So I count them together here.)

  • There is a huge and very opinionated literature on the Bayesian vs. frequentist debate. See, e.g., this recent thread for some thoughts: When (if ever) is a frequentist approach substantively better than a Bayesian? Bayesian analysis makes total sense if one has good informative priors, and everybody would be only happy to compute and report $p(\theta|\text{data})$ or $p(H_0:\theta=0|\text{data})$ instead of $p(\text{data at least as extreme}|H_0)$—but alas, people usually do not have good priors. An experimenter records 20 rats doing something in one condition and 20 rats doing the same thing in another condition; the prediction is that the performance of the former rats will exceed the performance of the latter rats, but nobody would be willing or indeed able to state a clear prior over the performance differences. (But see @FrankHarrell's answer where he advocates using "skeptical priors".)

  • Die-hard Bayesians suggest to use Bayesian methods even if one does not have any informative priors. One recent example is Krushke, 2012, Bayesian estimation supersedes the $t$-test, humbly abbreviated as BEST. The idea is to use a Bayesian model with weak uninformative priors to compute the posterior for the effect of interest (such as, e.g., a group difference). The practical difference with frequentist reasoning seems usually to be minor, and as far as I can see this approach remains unpopular. See What is an "uninformative prior"? Can we ever have one with truly no information? for the discussion of what is "uninformative" (answer: there is no such thing, hence the controversy).

  • An alternative approach, going back to Harold Jeffreys, is based on Bayesian testing (as opposed to Bayesian estimation) and uses Bayes factors. One of the more eloquent and prolific proponents is Eric-Jan Wagenmakers, who has published a lot on this topic in recent years. Two features of this approach are worth emphasizing here. First, see Wetzels et al., 2012, A Default Bayesian Hypothesis Test for ANOVA Designs for an illustration of just how strongly the outcome of such a Bayesian test can depend on the specific choice of the alternative hypothesis $H_1$ and the parameter distribution ("prior") it posits. Second, once a "reasonable" prior is chosen (Wagenmakers advertises Jeffreys' so called "default" priors), resulting Bayes factors often turn out to be quite consistent with the standard $p$-values, see e.g. this figure from this preprint by Marsman & Wagenmakers:

    Bayes factors vs p-values

    So while Wagenmakers et al. keep insisting that $p$-values are deeply flawed and Bayes factors are the way to go, one cannot but wonder... (To be fair, the point of Wetzels et al. 2011 is that for $p$-values close to $0.05$ Bayes factors only indicate very weak evidence against the null; but note that this can be easily dealt with in a frequentist paradigm simply by using a more stringent $\alpha$, something that a lot of people are advocating anyway.)

    One of the more popular papers by Wagenmakers et al. in defense of Bayes factors is the 2011 Why psychologists must change the way they analyze their data: The case of psi, where they argue that Bem's infamous paper on predicting the future would not have reached its faulty conclusions if it had used Bayes factors instead of $p$-values. See this thoughtful blog post by Ulrich Schimmack for a detailed (and IMHO convincing) counter-argument: Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior.

    See also The Default Bayesian Test is Prejudiced Against Small Effects blog post by Uri Simonsohn.

  • For completeness, I mention that Wagenmakers 2007, A practical solution to the pervasive problems of $p$-values, suggested using BIC as an approximation to the Bayes factor to replace $p$-values (a rough numerical sketch follows). BIC does not depend on the prior and hence, despite its name, is not really Bayesian; I am not sure what to think about this proposal. It seems that more recently Wagenmakers has been more in favour of Bayesian tests with uninformative Jeffreys priors, see above.
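To illustrate the BIC route in rough outline (this is not Wagenmakers' own code; it just applies the standard approximation $\mathrm{BF}_{01}\approx\exp\{(\mathrm{BIC}_1-\mathrm{BIC}_0)/2\}$ to simulated data with Gaussian models):

```python
import numpy as np

def gaussian_bic(residuals, n_params):
    """BIC of a Gaussian model with MLE variance, dropping additive constants
    that are identical across models: n*log(RSS/n) + k*log(n)."""
    n = len(residuals)
    rss = np.sum(residuals ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)            # small true effect, purely illustrative

# Null model: intercept only.  Alternative: intercept + slope (ordinary least squares).
res0 = y - y.mean()
X = np.column_stack([np.ones_like(x), x])
res1 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

bic0 = gaussian_bic(res0, n_params=2)        # intercept + error variance
bic1 = gaussian_bic(res1, n_params=3)        # intercept + slope + error variance

bf01 = np.exp((bic1 - bic0) / 2)             # BIC approximation to the Bayes factor for H0
print(f"approximate BF_01 = {bf01:.2f}  (values < 1 favour the alternative)")
```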


For further discussion of Bayesian estimation vs. Bayesian testing, see Bayesian parameter estimation or Bayesian hypothesis testing? and links therein.

  3. Minimum Bayes factors

Among the ASA disputants, this is explicitly suggested by Benjamin & Berger and by Valen Johnson (the only two papers that are all about suggesting a concrete alternative). Their specific suggestions are a bit different but they are similar in spirit.

  • Berger's ideas go back to Berger & Sellke (1987), and there are a number of papers by Berger, Sellke, and collaborators up until last year elaborating on this work. The idea is that under a spike-and-slab prior, where the point null hypothesis $\mu=0$ gets probability $0.5$ and the remaining probability $0.5$ is spread symmetrically around $0$ over the other values of $\mu$ ("local alternative"), the minimal posterior $p(H_0)$ over all local alternatives, i.e. the minimal Bayes factor, is much higher than the $p$-value. This is the basis of the (much contested) claim that $p$-values "overstate the evidence" against the null. The suggestion is to use this lower bound on the Bayes factor in favour of the null instead of the $p$-value; under some broad assumptions the lower bound turns out to be $-ep\log(p)$, i.e., the $p$-value is effectively multiplied by $-e\log(p)$, which is a factor of around $10$ to $20$ for the common range of $p$-values (a small numerical sketch of this calibration follows this list). This approach has been endorsed by Steven Goodman too.

    Later update: See a nice cartoon explaining these ideas in a simple way.

    Even later update: See Held & Ott, 2018, On $p$-Values and Bayes Factors for a comprehensive review and further analysis of converting $p$-values to minimum Bayes factors. Here is one table from there:

    Minimum Bayes factors

  • Valen Johnson suggested something similar in his PNAS 2013 paper; his suggestion approximately boils down to multiplying $p$-values by $\sqrt{-4\pi\log(p)}$, which is a factor of around $5$ to $10$.
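Here is a small numerical sketch of both calibrations in Python (treat the formulas as the rough rules of thumb quoted above; e.g., the Berger–Sellke bound only holds for $p < 1/e$):

```python
import numpy as np

def min_bf_berger_sellke(p):
    """Lower bound on the Bayes factor for H0 vs H1: -e * p * log(p), valid for p < 1/e."""
    return -np.e * p * np.log(p)

def min_bf_johnson(p):
    """Johnson-style calibration quoted above: p * sqrt(-4 * pi * log(p))."""
    return p * np.sqrt(-4 * np.pi * np.log(p))

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<6}  Berger-Sellke bound = {min_bf_berger_sellke(p):.3f}  "
          f"Johnson calibration = {min_bf_johnson(p):.3f}")
```

For $p=0.05$ the two bounds come out to roughly $0.41$ and $0.31$, i.e. a factor of $6$ to $8$ larger than the $p$-value itself.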


For a brief critique of Johnson's paper, see Andrew Gelman's and @Xi'an's reply in PNAS. For the counter-argument to Berger & Sellke 1987, see Casella & Berger 1987 (different Berger!). Among the ASA discussion papers, Stephen Senn argues explicitly against any of these approaches:

Error probabilities are not posterior probabilities. Certainly, there is much more to statistical analysis than $P$-values but they should be left alone rather than being deformed in some way to become second class Bayesian posterior probabilities.

See also references in Senn's paper, including the ones to Mayo's blog.

  4. The ASA statement lists "decision-theoretic modeling and false discovery rates" as another alternative. I have no idea what they are talking about, and I was happy to see this stated in the discussion paper by Stark:

The "other approaches" section ignores the fact that the assumptions of some of those methods are identical to those of $p$-values. Indeed, some of the methods use $p$-values as input (e.g., the False Discovery Rate).


I am highly skeptical that there is anything that can replace $p$-values in actual scientific practice such that the problems that are often associated with $p$-values (replication crisis, $p$-hacking, etc.) would go away. Any fixed decision procedure, e.g. a Bayesian one, can probably be "hacked" in the same way as $p$-values can be $p$-hacked (for some discussion and demonstration of this see this 2014 blog post by Uri Simonsohn).

To quote from Andrew Gelman's discussion paper:

In summary, I agree with most of the ASA’s statement on $p$-values but I feel that the problems are deeper, and that the solution is not to reform $p$-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

And from Stephen Senn:

In short, the problem is less with $P$-values per se but with making an idol of them. Substituting another false god will not help.

And here is how Cohen put it in his well-known and highly cited (3.5k citations) 1994 paper The Earth is round ($p<0.05$), where he argued very strongly against $p$-values:

[...] don't look for a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist.

$\endgroup$
7
  • 1
    $\begingroup$ @amoeba regarding Wagenmakers and BIC it is good to compare it with the critique, e.g. by Gelman: andrewgelman.com/2008/10/23/i_hate_bic_blah $\endgroup$
    – Tim
    Commented Mar 15, 2016 at 11:17
  • 3
    $\begingroup$ This is a truly impressive answer that deserves to be among the top voted answers on CV. I may add another bounty sometime after Tim's. $\endgroup$ Commented Mar 15, 2016 at 21:03
  • $\begingroup$ Thanks, @gung, I am happy to hear that, it means a lot coming from you. I should say though that I am only superficially familiar with Bayesian testing and have zero hands-on experience with it. So this answer provides a summary of what I've been reading, but it's not really an expert opinion. $\endgroup$
    – amoeba
    Commented Mar 15, 2016 at 22:06
  • 1
    $\begingroup$ No you do not need an informative prior for Bayes to work well. As Spiegelhalter has shown so well, skeptical priors have a major role and are easy to use. Bayesian posterior probabilities have major advantages. $\endgroup$ Commented Jul 12, 2017 at 22:09
  • 2
    $\begingroup$ Great answer, but instead of ending on this capitulation from Gelman "but rather to move toward a greater acceptance of uncertainty and embracing of variation." don't you think it's worth pointing out that the need for dichotomous decisions often arises from the application scenario, not the statistical methods. In those cases, not all but many, the sound statistician may well show that all methods are imperfect but should still try to suggest the least worst one. If he doesn't, the decision still needs to be taken and someone else will use another method, worse than the least worst one. $\endgroup$ Commented Sep 21, 2017 at 13:13
29
$\begingroup$

Here are my two cents.

I think that at some point, many applied scientists stated the following "theorem":

Theorem 1: $p\text{-value}<0.05\Leftrightarrow \text{my hypothesis is true}.$

and most of the bad practices come from here.

The $p$-value and scientific induction

I used to work with people who use statistics without really understanding them, and here is some of what I have seen:

  1. running many possible tests/reparametrisations (without looking once at the distribution of the data) until finding the "good" one: the one giving $p<0.05$;

  2. trying different preprocessing (e.g. in medical imaging) to get the data to analyse until getting the one giving $p<0.05$;

  3. reaching $0.05$ by applying a one-tailed t-test in the positive direction for data with a positive effect and in the negative direction for data with a negative effect (!!).

All of this is done by well-versed, honest scientists with no strong sense that they are cheating. Why? IMHO, because of Theorem 1.

At a given moment, applied scientists may believe strongly in their hypothesis. I even suspect they believe they know it is true, and the fact is that in many situations they have been looking at such data for years, have thought about them while working, walking, sleeping... and they are the best placed to say something about the answer to this question. The fact is, in their mind (sorry, I realize this may sound a bit arrogant), by Theorem 1, if the hypothesis is true then the $p$-value must be lower than $0.05$, no matter the amount of data, how it is distributed, the alternative hypothesis, the effect size, or the quality of the data acquisition. If the $p$-value is not $<0.05$ and the hypothesis is true, then something must be wrong: the preprocessing, the choice of test, the distribution, the acquisition protocol... so we change them... A $p$-value $<0.05$ is just the ultimate key of scientific induction.

On this point, I agree with the two previous answers that confidence or credible intervals make the statistical answer better suited to discussion and interpretation. While the $p$-value is difficult to interpret (IMHO) and ends the discussion, interval estimates can serve a scientific induction that is illustrated by objective statistics but led by expert arguments.

The $p$-value and the alternative hypothesis

Another consequence of Theorem 1 is the belief that if the $p$-value is $>0.05$ then the alternative hypothesis is false. Again, this is something I have encountered many times:

  4. comparing two groups (just because we have the data) for a research hypothesis of the type $\mu_1 \ne \mu_2$: randomly take 10 data points from each of the two groups, compute the $p$-value for $H_0: \mu_1 = \mu_2$, find $p=0.2$, and register somewhere in the brain that there is no difference between the two groups.

A main issue with the $p$-value is that the alternative is never mentioned, while I think that in many cases mentioning it could help a lot. A typical example is point 4, where I proposed to my colleague that we compute the posterior ratio of $p(\mu_1>\mu_2|x)$ to $p(\mu_1<\mu_2|x)$ and got something like 3 (I know this figure is ridiculously low). The researcher asked me whether this means that $\mu_1>\mu_2$ is 3 times more probable than $\mu_2>\mu_1$. I answered that this is one way to interpret it, and she found it amazing, said she should look at more data and write a paper... My point is not that this "3" helps her understand that there is something in the data (again, 3 is clearly anecdotal), but that it highlights that she had been misinterpreting the $p$-value as "$p$-value $>0.05$ means nothing interesting / the groups are equivalent". So, in my opinion, always at least discussing the alternative hypothesis (or hypotheses!) is mandatory: it helps avoid oversimplification and gives elements for debate.
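For what it is worth, here is a minimal sketch of such a posterior-ratio computation (invented data; a flat prior on each group's mean and log-scale is assumed, under which the posterior of each mean is a shifted and scaled $t$ distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(0.3, 1.0, size=10)     # invented data for group 1
x2 = rng.normal(0.0, 1.0, size=10)     # invented data for group 2

def posterior_mean_draws(x, n_draws=100_000):
    """Draws from the posterior of the group mean under a flat prior on (mu, log sigma):
    mu | x ~ xbar + (s / sqrt(n)) * t_{n-1}."""
    n = len(x)
    t_draws = stats.t.rvs(df=n - 1, size=n_draws, random_state=rng)
    return x.mean() + x.std(ddof=1) / np.sqrt(n) * t_draws

mu1, mu2 = posterior_mean_draws(x1), posterior_mean_draws(x2)
p_greater = np.mean(mu1 > mu2)
print(f"P(mu1 > mu2 | data) ~ {p_greater:.2f}, "
      f"posterior odds P(mu1 > mu2)/P(mu1 < mu2) ~ {p_greater / (1 - p_greater):.1f}")
```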

Another related case is when experts want to:

  5. test $\mu_1>\mu_2>\mu_3$: for that, they test and reject $\mu_1=\mu_2=\mu_3$, then conclude $\mu_1>\mu_2>\mu_3$ from the fact that the ML estimates are ordered.

Mentioning the alternative hypothesis is the only way to resolve this case.

So using posterior odds, Bayes factors, or likelihood ratios jointly with confidence/credible intervals seems to reduce the main issues involved.

The common misinterpretation of $p$-value / confidence intervals is a relatively minor flaw (in practice)

While I am a Bayesian enthusiast, I really think that the common misinterpretation of $p$-values and CIs (i.e., that the $p$-value is the probability that the null hypothesis is false, or that the CI is an interval containing the parameter value with 95% probability) is not the main concern for this question (though I am sure it is a major point from a philosophical point of view). Both the Bayesian and the frequentist views have pertinent answers to help practitioners in this "crisis".

My two cents conclusion

Using credible intervals and Bayes factors or posterior odds is what I try to do in my practice with experts (I am also enthusiastic about CI + likelihood ratio). I came to statistics a few years ago mainly by self-study from the web (so many thanks to Cross Validated!) and thus grew up amid the numerous agitations around $p$-values. I do not know whether my practice is a good one, but it is what I pragmatically find to be a good compromise between being efficient and doing my job properly.

$\endgroup$
5
  • $\begingroup$ Maybe you could edit your example to make it clearer: as it stands, what were you calculating, what was the data, and where did the numbers come from? $\endgroup$
    – Tim
    Commented Mar 9, 2016 at 9:04
  • $\begingroup$ @Tim Thanks for the feedback. Which example are you referring to? $\endgroup$
    – beuhbbb
    Commented Mar 9, 2016 at 9:07
  • $\begingroup$ "try to compare (just because we have the data) an hypothesis: take 10 and 10 data, compute p-value. Find p=0.2 ...." $\endgroup$
    – Tim
    Commented Mar 9, 2016 at 9:08
  • 1
    $\begingroup$ I also don't think that "knowing" your hypothesis is true even if the data seem to suggest otherwise is necessarily a bad thing. This is apparently how Gregor Mendel sensed when there was something wrong with his experiments, because he had such a strong intuition that his theories were correct. $\endgroup$
    – dsaxton
    Commented Mar 9, 2016 at 20:48
  • $\begingroup$ @dsaxton Fully agree with you. Maybe it is not so clear, but this is one thing I try to illustrate in my first point: the p-value is not the ultimate key of scientific induction (while it appears to be for a certain audience). It is a statistical measure of the evidence brought by a certain amount of data, under certain conditions. And in a case where you have many external reasons to think that the hypothesis is true but the data do not provide the "good" p-value, other things may be discussed, as you rightly mention. I will try to make it clearer in my answer. $\endgroup$
    – beuhbbb
    Commented Mar 10, 2016 at 8:00
26
$\begingroup$

The only reasons I continue to use $P$-values are

  1. More software is available for frequentist methods than Bayesian methods.
  2. Currently, some Bayesian analyses take a long time to run.
  3. Bayesian methods require more thinking and more time investment. I don't mind the thinking part but time is often short so we take shortcuts.
  4. The bootstrap is a highly flexible and useful everyday technique that is more connected to the frequentist world than to the Bayesian (a minimal sketch follows this list).
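For example, a percentile bootstrap interval for a mean difference takes only a few lines (a minimal sketch with invented data; the percentile method is chosen purely for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
treated = rng.normal(0.5, 1.0, size=30)     # invented outcome data
control = rng.normal(0.0, 1.0, size=30)

# Nonparametric bootstrap: resample each group with replacement
# and recompute the mean difference many times.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    t = rng.choice(treated, size=treated.size, replace=True)
    c = rng.choice(control, size=control.size, replace=True)
    diffs[i] = t.mean() - c.mean()

lo, hi = np.percentile(diffs, [2.5, 97.5])  # percentile bootstrap 95% interval
print(f"mean difference = {treated.mean() - control.mean():.2f}, "
      f"bootstrap 95% CI = ({lo:.2f}, {hi:.2f})")
```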

$P$-values, analogous to highly problematic sensitivity and specificity as accuracy measures, are highly deficient in my humble opinion. The problem with all three of these measures is that they reverse the flow of time and information. When you turn a question from "what is the probability of getting evidence like this if the defendant is innocent" to "what is the probability of guilt of the defendant based on the evidence", things become more coherent and less arbitrary. Reasoning in reverse time makes you have to consider "how did we get here?" as opposed to "what is the evidence now?". $P$-values require consideration of what could have happened instead of what did happen. What could have happened makes one have to do arbitrary multiplicity adjustments, even adjusting for data looks that might have made an impact but actually didn't.

When $P$-values are coupled with highly arbitrary decision thresholds, things get worse. Thresholds almost always invite gaming.

Except for Gaussian linear models and the exponential distribution, almost everything we do with frequentist inference is approximate (a good example is the binary logistic model which causes problems because its log likelihood function is very non-quadratic). With Bayesian inference, everything is exact to within simulation error (and you can always do more simulations to get posterior probabilities/credible intervals).

I've written a more detailed accounting of my thinking and evolution at http://www.fharrell.com/2017/02/my-journey-from-frequentist-to-bayesian.html

$\endgroup$
11
  • 3
    $\begingroup$ (+1) How do you propose we handle the more mundane questions like "does this treatment have any effect at all?" where all we might care about is a simple yes / no answer. Should we still do away with $p$-values in these situations? $\endgroup$
    – dsaxton
    Commented Mar 14, 2016 at 14:00
  • 2
    $\begingroup$ Frank, I don't exactly see how this answers the question about what are the alternatives to $p$-values; can you maybe clarify? Imagine some typical application of a t-test: say, an experimenter comes to you with some performance measures of 40 rats, with 20 experimental and 20 control animals. They want to know if the experimental manipulation changes the performance (in a predicted direction). Usually they would run a t-test or a ranksum test and report a p-value (together with the means, SDs, perhaps confidence interval for the group difference, etc.). What would you suggest to do instead? $\endgroup$
    – amoeba
    Commented Mar 14, 2016 at 14:02
  • 3
    $\begingroup$ My favorite approach would be to use a Bayesian semiparametric model, e.g., Bayesian proportional odds ordinal logistic regression, then get a credible interval and posterior probabilities for the effect of interest. That is a generalization of the Wilcoxon test. If I wanted to go parametric I would use the Bayesian $t$-test of the Box & Tiao extension that allows a prior distribution for the degree of non-normality. $\endgroup$ Commented Mar 14, 2016 at 15:57
  • 1
    $\begingroup$ Frank, thanks. I am not very familiar with Bayesian testing (and have not heard of Box & Tiao before), but my general impression is that the Bayes factor that one gets out of a Bayesian test can depend quite strongly on the specific choice of an uninformative prior that goes in. And these choices can be difficult to motivate. I guess the same goes for credible intervals -- they will strongly depend on the choice of an uninformative prior. Is it not true? If it is, then how should one deal with it? $\endgroup$
    – amoeba
    Commented Mar 14, 2016 at 17:46
  • 2
    $\begingroup$ Yes although I don't use Bayes factors. The frequentist approach chooses a prior too - one that ignores all other knowledge about the subject. I prefer Spiegelhalter's skeptical prior approach. In an ideal world you will let your skeptics provide the prior. $\endgroup$ Commented Mar 14, 2016 at 18:14
8
$\begingroup$

In this thread, there is already a good amount of illuminating discussion on this subject. But let me ask you: "Alternatives to what exactly?" The damning thing about p-values is that they're forced to live between two worlds: decision theoretic inference and distribution free statistics. If you are looking for an alternative to "p<0.05" as a decision theoretic rule to dichotomize studies as positive/negative or significant/non-significant then I tell you: the premise of the question is flawed. You can contrive and find many branded alternatives to $p$-value based inference which have the exact same logical shortcomings.

I'll point out that the way we conduct modern testing in no way agrees with the theory and perspectives of Fisher and Neyman-Pearson, who both contributed greatly to modern methods. Fisher's original suggestion was that scientists should qualitatively compare the $p$-value to the power of the study and draw conclusions from there. I still think this is an adequate approach, which leaves the question of the scientific applicability of the findings in the hands of content experts. Now, the error we find in modern applications is in no way a fault of statistics as a science. Also at play are fishing, extrapolation, and exaggeration. Indeed, if (say) a cardiologist were to lie and claim that a drug which lowers average blood pressure by 0.1 mmHg is "clinically significant", no statistics will ever protect us from that kind of dishonesty.

We need an end to decision theoretic statistical inference. We should endeavor to think beyond the hypothesis. The growing gap between the clinical utility and hypothesis driven investigation compromises scientific integrity. The "significant" study is extremely suggestive but rarely promises any clinically meaningful findings.

This is evident if we inspect the attributes of hypothesis driven inference:

  • The null hypothesis stated is contrived, does not agree with current knowledge, and defies reason or expectation.
  • Hypotheses may be tangential to the point the author is trying to make. Statistics rarely align with much of the ensuing discussion in articles, with authors making far-reaching claims that, for instance, their observational study has implications for public policy and outreach.
  • Hypotheses tend to be incomplete in the sense that they do not adequately define the population of interest, and tend to lead to overgeneralization.

To me, the alternative is a meta-analytic approach, at least a qualitative one. All results should be rigorously vetted against other "similar" findings, with differences described very carefully: in particular the inclusion/exclusion criteria, the units or scales used for exposures/outcomes, and the effect sizes and uncertainty intervals (which are best summarized with 95% CIs).

We also need to conduct independent confirmatory trials. Many people are swayed by one seemingly significant trial, but without replication we cannot trust that the study was done ethically. Many have made scientific careers out of falsification of evidence.

$\endgroup$
1
  • 1
    $\begingroup$ "Fisher's original suggestion was that scientists should qualitatively compare the p-value to the power of the study and draw conclusions there." I love this point---do you have a reference I could cite where Fisher said this? It would be a huge step forward if scientists moved from a simple dichotomy of p<0.05 to a only-slightly-less-simple dichotomy: "If p<0.05 AND power was high, we have reasonably strong evidence. If p>0.05 OR power was low, we'll withhold judgment about this hypothesis until we get more data." $\endgroup$
    – civilstat
    Commented Oct 1, 2018 at 13:38
6
$\begingroup$

The brilliant forecaster Scott Armstrong from Wharton published an article almost 10 years ago titled Significance Tests Harm Progress in Forecasting in the International Journal of Forecasting, a journal that he co-founded. Even though it is about forecasting, it can be generalized to any data analysis or decision making. In the article he states that:

"tests of statistical significance harms scientific progress. Efforts to find exceptions to this conclusion have, to date, turned up none."

This is an excellent read for anyone interested in an antithetical view of significance testing and $P$-values.

The reason I like this article is that Armstrong provides alternatives to significance testing that are succinct and easy to understand, especially for a non-statistician like me. In my opinion this is much better than the ASA article cited in the question:

[image: Armstrong's suggested alternatives to significance testing]

All of these I continue to embrace, and I have since stopped using significance testing or looking at $P$-values, except when I do randomized experimental studies or quasi-experiments. I must add that randomized experiments are very rare in practice except in the pharmaceutical industry/life sciences and in some fields of engineering.

$\endgroup$
3
  • 4
    $\begingroup$ What do you mean "randomized experiments are very rare in practice except in pharmaceutical industry and in some fields in Engineering"? Randomized experiments are everywhere in biology and psychology. $\endgroup$
    – amoeba
    Commented Apr 1, 2016 at 8:42
  • $\begingroup$ I edited it to include life sciences. $\endgroup$
    – forecaster
    Commented Apr 1, 2016 at 11:35
  • 3
    $\begingroup$ Okay, but saying that rand. exp. are "very rare" except in medicine and life sciences and psychology is basically saying that they are "very common". So I am not sure about your point. $\endgroup$
    – amoeba
    Commented Apr 1, 2016 at 12:17
6
$\begingroup$

What is preferred, and why, must depend on the field of study. About 30 years ago, articles started appearing in medical journals suggesting that $p$-values should be replaced by estimates with confidence intervals. The basic reasoning was that a $p$-value just tells you whether an effect was there, whereas the estimate with its confidence interval tells you how big it was and how precisely it has been estimated. The confidence interval is particularly important when the $p$-value fails to reach the conventional level of significance, because it enables the reader to tell whether this is likely due to there genuinely being no difference or to the study being inadequate to find a clinically meaningful difference.

Two references from the medical literature are (1) Langman, M. J. S., Towards estimation and confidence intervals, and (2) Gardner, M. J. and Altman, D. G., Confidence intervals rather than P values: estimation rather than hypothesis testing.

$\endgroup$
5
  • 2
    $\begingroup$ Actually, CI's do not show effect size and precision, check e.g. Morey et al (2015) "The fallacy of placing confidence in confidence intervals" Psychonomic Bulletin & Review: learnbayes.org/papers/confidenceIntervalsFallacy $\endgroup$
    – Tim
    Commented Mar 8, 2016 at 15:34
  • 10
    $\begingroup$ @Tim, nice paper, I have not seen it before; I liked the submarine example. Thanks for the link. But one should say that it is written by true Bayesian partisans: "The non-Bayesian intervals have undesirable, even bizarre properties, which would lead any reasonable analyst to reject them as a means to draw inferences". Any reasonable analyst! Impressive arrogance. $\endgroup$
    – amoeba
    Commented Mar 8, 2016 at 16:23
  • 1
    $\begingroup$ @amoeba agree, I'm just providing counter-example, since, as for me, it is not that obvious that the alternatives are that clear and direct as may appear at first sight. $\endgroup$
    – Tim
    Commented Mar 8, 2016 at 20:44
  • 4
    $\begingroup$ While interesting I didn't find the submarine example all that compelling. No thinking statistician would reason the way the one in the example does. You don't stop thinking and apply a method blindly to all situations just because it's useful in others. $\endgroup$
    – dsaxton
    Commented Mar 10, 2016 at 21:15
  • 3
    $\begingroup$ @amoeba: In that particular quote, "The non-Bayesian intervals" refers specifically to the intervals discussed in that example, not all intervals justified by non-Bayesian logic. See here for more context: stats.stackexchange.com/questions/204530/… $\endgroup$ Commented Mar 31, 2016 at 13:55
3
$\begingroup$

Decision theoretic modeling is superior to $p$-values because it requires the researcher to

  • develop a more sophisticated model that is capable of simulating outcomes in a target population
  • identify and measure attributes of a target population in whom a proposed decision, treatment, or policy could be implemented
  • estimate, by way of simulation, an expected loss in raw units of a target quantity such as life years, quality-adjusted life years, dollars, crop output, etc., and assess the uncertainty of that estimate (a toy sketch follows this list).
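A toy sketch of the expected-loss comparison described in the list above (everything here is invented for illustration: the posterior draws of the effect, the loss scale, and the cost of treatment would all come from a real model and a real elicitation in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented posterior draws of a treatment effect (e.g. quality-adjusted life years gained);
# in practice these would come from a fitted decision model, not a made-up normal.
effect_draws = rng.normal(loc=0.2, scale=0.5, size=100_000)

cost_of_treatment = 0.1    # assumed fixed cost, expressed in the same units as the effect

# Expected loss of each decision, averaged over the posterior uncertainty.
loss_treat    = np.mean(cost_of_treatment - effect_draws)
loss_no_treat = 0.0
print(f"expected loss if we treat: {loss_treat:.3f}; if we do not treat: {loss_no_treat:.3f}")
print("decision:", "treat" if loss_treat < loss_no_treat else "do not treat")
```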

By all means this doesn't preclude null hypothesis significance testing, but it underscores that statistically significant findings are very early, intermediary steps on the path to real discovery, and that we should expect researchers to do much more with their findings.

$\endgroup$
3
$\begingroup$

There is a collection of alternatives in the special issue Statistical Inference in the 21st Century: A World Beyond p < 0.05 of The American Statistician that, I think, deserves special mention. I cannot possibly and won't try to list all the ideas about alternatives and/or additions to p-values that are given in the 43 papers of this special issue, but I warmly recommend reading the editorial Moving to a World Beyond "p < 0.05" to get a solid overview.

$\endgroup$
3
$\begingroup$

The statistical community's responses to the problem tend to assume that the answer lies in statistics. (The applied research community's preferred response is to ignore the problem entirely.)

In a forthcoming comment, colleagues and I argue that purely statistical standard error underestimates uncertainty, and that behavioral researchers should commit to estimating all material components of uncertainty associated with each measurement, as metrologists do in some physical sciences and in legal forensics. When statistical means to estimate some components are unavailable, researchers must rely on nonstatistical means.

Across the broad metrology community, there are research centers devoted to the quantification of uncertainty. Across many other fields, of course, there are no such centers, yet.

Rigdon, E. E., Sarstedt, M., & Becker, J.-M. (in press), Quantify uncertainty in behavioral research. Nature Human Behaviour.

$\endgroup$
2
$\begingroup$

My choice would be to continue using $p$-values, but to add confidence/credible intervals and, possibly for the primary outcomes, prediction intervals. There is a very nice book by Douglas Altman (Statistics with Confidence, Wiley), and thanks to bootstrap and MCMC approaches you can always build reasonably robust intervals.
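For instance, for a normally distributed primary outcome one can report the point estimate, a confidence interval for the mean, and a prediction interval for a new observation in a few lines (a minimal sketch with simulated data; normality is assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(10.0, 2.0, size=25)          # hypothetical primary-outcome measurements

n = len(y)
m, s = y.mean(), y.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)

ci   = (m - t_crit * s / np.sqrt(n),       m + t_crit * s / np.sqrt(n))        # 95% CI for the mean
pred = (m - t_crit * s * np.sqrt(1 + 1/n), m + t_crit * s * np.sqrt(1 + 1/n))  # 95% prediction interval

print(f"mean = {m:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), "
      f"95% prediction interval = ({pred[0]:.2f}, {pred[1]:.2f})")
```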

$\endgroup$
2
  • 6
    $\begingroup$ I think you do not really answer the main question which is "why are they better?"/"Why this approach should convince your lead researcher, editor, or readers?". Can you develop your choice ? $\endgroup$
    – beuhbbb
    Commented Mar 8, 2016 at 11:04
  • 1
    $\begingroup$ 1. That merely enables current practice. 2. There's a tendency to do "backdoor significance testing" with the CI anyway, 3. Significance testing (with p-values or CIs) leads to a low rate of reproducibility (see articles by Tim Lash). 4. Researchers cannot be bothered to prespecify a clinically significant boundary or threshold of effect. $\endgroup$
    – AdamO
    Commented Oct 1, 2018 at 15:14
