$\begingroup$

I have been puzzled for a long time by the way psychologists and medical researchers state that they have 'significant' results, and by the way this statement is relayed to the public, who are misled into thinking the results are in some way important. I also started to wonder why the 'significant' correlation was supposedly all the more remarkable when the sample comprised a very large number of people, since that makes it almost inevitable that some effect, positive or negative, will be detected, no matter how weak.
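To make the large-sample point concrete, here is a minimal simulation sketch (my own illustration, not taken from the thread or the paper; it assumes NumPy and SciPy are available): with a million observations per group, a mean difference of one hundredth of a standard deviation comes out as highly 'significant', yet the standardized effect size (Cohen's d) is negligible.

```python
# Minimal sketch (illustrative, not from the linked thread or paper):
# a huge sample makes a negligible effect "statistically significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000                  # observations per group
true_effect = 0.01             # true mean difference, in standard-deviation units

a = rng.normal(0.0, 1.0, n)
b = rng.normal(true_effect, 1.0, n)

# Two-sample t-test: the p-value shrinks as n grows, regardless of how small
# the true effect is.
t_stat, p_value = stats.ttest_ind(a, b)

# Cohen's d: standardized mean difference, a common effect-size measure.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value   = {p_value:.1e}")   # typically far below 0.05
print(f"Cohen's d = {cohens_d:.3f}")  # around 0.01: practically meaningless
```

With a large enough sample, almost any nonzero difference will clear the 0.05 threshold, which is exactly why an effect size has to be reported alongside the p-value.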

While researching this, I browsed the thread https://datascience.stackexchange.com/questions/89308/p-value-and-effect-size, and in the answer there desertnaut recommended reading the paper Using Effect Size—or Why the P Value Is Not Enough.

The paper makes a very good point, but it seems to be a very obvious one, so I am wondering how long the point has been made. When was it first made? And why does it seem to be ignored?

$\endgroup$

1 Answer

$\begingroup$

Depending on how narrowly the point is pinpointed, the dating can be spread out, but it is old; some of it predates the official introduction of null hypothesis significance testing (NHST) by Fisher. See Nickerson, Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy:

"Criticism of the method, which essentially began with the introduction of the technique (Pearce, 1992), has waxed and waned over the years; it has been intense in the recent past. Apparently, controversy regarding the idea of NHST more generally extends back more than two and a half centuries (Hacking, 1965).".

That "statistical significance" is misleading as to significance in the usual sense of the word was spelled out e. g. by Eysenck in 1960:

"Eysenck (1960) made a case for not using the term significance in reporting the results of research. C. A. Clark (1963) argued that statistical significance tests do not provide the information scientists need and that the null hypothesis is not a sound basis for statistical investigation."

The more recent flare-ups are associated with the APA's near-banishment of $p$-values in 1999 (some psychology journals did banish them), see Hypothesis testing: Fisher vs. Popper vs. Bayes, and with the 2010-2014 unrest that culminated in Nuzzo's 2014 article in Nature, which became one of the most highly viewed in its history. Along with spreading sentiments, also captured by the science journalist Tom Siegfried's contemporaneous quip that "statistical techniques for testing hypotheses… have more flaws than Facebook's privacy policies", it prompted the ASA's unprecedented 2016 policy statement on $p$-values, whose principle #5 reads: "A $p$-value, or statistical significance, does not measure the size of an effect or the importance of a result".

One explanation as to why the point has to be made over and over again can be found in Jeff Leek's post subtitled "why the $p$-value bashers just don't get it":

"Despite their flaws, from a practical perspective it is and oversimplification to point to the use of P-values as the critical flaw in scientific practice. The problem is not that people use P-values poorly it is that the vast majority of data analysis is not performed by people properly trained to perform data analysis... By scientific standards, the growth of data came on at a breakneck pace. Over a period of about 40 years we went from a scenario where data was measured in bytes to terabytes in almost every discipline. Training programs haven’t adapted to this new era... Since most people performing data analysis are not statisticians there is a lot of room for error in the application of statistical methods. This error is magnified enormously when naive analysts are given too many “researcher degrees of freedom”.

[...] P-values can be and are misinterpreted, misused, and abused both by naive analysts and by statisticians. Sometimes these problems are due to statistical naiveté, sometimes they are due to wishful thinking and career pressure, and sometimes they are malicious. The reason is that P-values are complicated and require training to understand. Critics of the P-value argue in favor of a large number of procedures to be used in place of P-values. But when considering the scale at which the methods must be used to address the demands of the current data-rich world, many alternatives would result in similar flaws. This in no way proves that the use of P-values is a good idea, but it does prove that coming up with an alternative is hard."

$\endgroup$
  • $\begingroup$ " Ap-parently, controversy regarding the idea of NHSTmore generally extends back more than two and a halfcenturies (Hacking, 1965). " Copy pasted from a low resolution archive.org pdf or whatever. NHST means "null hypothesis significance testing". That answers my main question. I had thought I might be the first to notice the shortcomings and common misuses of NHST. Who is Siegfried? The short Leek quote is full of typos and logical flaws e.g. this non sequitur: "[...]sometimes they are malicious. The reason is that P-values are complicated and require training to understand." $\endgroup$ Commented Apr 19, 2021 at 2:46
  • $\begingroup$ Great answer by the way. I am sensing that I am not the only one feeling very irritated and somewhat dumbfounded by this situation. $\endgroup$ Commented Apr 19, 2021 at 3:02
  • $\begingroup$ Hard to believe that in six years or so, there is not one comment on his article at his website. $\endgroup$ Commented Apr 19, 2021 at 3:08
  • $\begingroup$ Conifold, Sir, I'd like to chat with you in a chat room or communicate with you on Twitter about P-values. My name on Twitter is @bartshmatthew. $\endgroup$ Commented Apr 29, 2021 at 17:52
