104
$\begingroup$

I'm very new to statistics, and I'm just learning to understand the basics, including $p$-values. But there is a huge question mark in my mind right now, and I kind of hope my understanding is wrong. Here's my thought process:

Aren't all researchers around the world somewhat like the monkeys in the "infinite monkey theorem"? Consider that there are 23887 universities in the world. If each university has 1000 students, that's about 23 million students each year.

Let's say that each year, each student does at least one piece of research, using hypothesis testing with $\alpha=0.05$.

Doesn't that mean that even if all the research samples were pulled from a purely random population, about 5% of them would still "reject the null hypothesis"? Wow. Think about that. That's about a million research papers per year getting published due to "significant" results.

If this is how it works, this is scary. It means that a lot of the "scientific truth" we take for granted is based on pure randomness.

A simple chunk of R code seems to support my understanding:

library(data.table)
# 100,000 simulated "studies" in which the null hypothesis is exactly true:
# each is a one-sample t-test of 10 draws from N(0, 1) against a mean of 0
dt <- data.table(p = sapply(1:100000, function(x) t.test(rnorm(10, 0, 1))$p.value))
dt[p < 0.05, ]        # the "significant" studies
dt[, mean(p < 0.05)]  # their proportion: close to 0.05

So does this article on successful $p$-fishing: I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.

Is this really all there is to it? Is this how "science" is supposed to work?

$\endgroup$
14
  • 31
    $\begingroup$ The true problem is potentially far worse than multiplying the number of true nulls by the significance level, due to pressure to find significance (if an important journal won't publish non-significant results, or a referee will reject a paper that doesn't have significant results, there's pressure to find a way to achieve significance ... and we do see 'significance hunting' expeditions in many questions here); this can lead to true significance levels that are quite a lot higher than they appear to be. $\endgroup$
    – Glen_b
    Commented Jul 19, 2015 at 10:58
  • 5
    $\begingroup$ On the other hand, many null hypotheses are point nulls, and those are very rarely actually true. $\endgroup$
    – Glen_b
    Commented Jul 19, 2015 at 11:35
  • 39
    $\begingroup$ Please do not conflate the scientific method with p-values. Among other things, science insists on reproducibility. That is how papers on, say, cold fusion could get published (in 1989) but cold fusion has not existed as a tenable scientific theory for the last quarter century. Note, too, that few scientists are interested in working in areas where the relevant null hypothesis actually is true. Thus, your hypothesis that "all the research samples were pulled from [a] random population" does not reflect anything realistic. $\endgroup$
    – whuber
    Commented Jul 19, 2015 at 13:22
  • 13
    $\begingroup$ Compulsory reference to the xkcd jelly beans cartoon. Short answer - this is unfortunately happening too often, and some journals are now insisting on having a statistician reviewing every publication to reduce the amount of "significant" research that makes its way into the public domain. Lots of relevant answers and comments in this earlier discussion $\endgroup$
    – Floris
    Commented Jul 19, 2015 at 16:54
  • 8
    $\begingroup$ Perhaps I don't get the complaint... "We successfully defeated 95% of bogus hypotheses. The remaining 5% were not so easy to defeat due to random fluctuations looking like meaningful effects. We should look at those more closely and ignore the other 95%." This sounds exactly like the right sort of behaviour for anything like "science". $\endgroup$ Commented Jul 20, 2015 at 16:55

9 Answers

73
$\begingroup$

This is certainly a valid concern, but this isn't quite right.

If 1,000,000 studies are done and all the null hypotheses are true, then approximately 50,000 will have significant results at p < 0.05. That's what testing at the 0.05 level means. However, the null is essentially never strictly true. But even if we loosen it to "almost true" or "about right" or some such, that would mean that the 1,000,000 studies would all have to be about things like

  • The relationship between social security number and IQ
  • Is the length of your toes related to the state of your birth?

and so on. Nonsense.

One trouble is, of course, that we don't know which nulls are true. Another problem is the one @Glen_b mentioned in his comment - the file drawer problem.

This is why I so much like Robert Abelson's ideas that he puts forth in Statistics as Principled Argument. That is, statistical evidence should be part of a principled argument as to why something is the case and should be judged on the MAGIC criteria:

  • Magnitude: How big is the effect?
  • Articulation: Is it full of "ifs", "ands", and "buts"? (That's bad.)
  • Generality: How widely does it apply?
  • Interestingness
  • Credibility: Incredible claims require a lot of evidence
$\endgroup$
11
  • 4
    $\begingroup$ Could one even say "if 1M studies are done and even if all the null hypotheses are true, then approximately 50,000 will commit a type I error and incorrectly reject the null hypothesis"? If a researcher gets p<0.05 they only know that "h0 is correct and a rare event has occurred OR h0 is incorrect". There's no way of telling which it is by only looking at the results of this one study, is there? $\endgroup$
    – n_mu_sigma
    Commented Jul 19, 2015 at 13:31
  • 5
    $\begingroup$ You can only get a false positive if the positive is, in fact, false. If you picked 40 IVs that were all noise, then you would have a good chance of a type I error. But generally we pick IVs for a reason. And the null is false. You can't make a type I error if the null is false. $\endgroup$
    – Peter Flom
    Commented Jul 19, 2015 at 14:24
  • 6
    $\begingroup$ I don't understand your second paragraph, including the bullet points, at all. Let's say for the sake of argument all 1 million studies were testing drug compounds for curing a specific condition. The null hypothesis for each of these studies is that the drug does not cure the condition. So, why must that be "essentially never strictly true"? Also, why do you say all the studies would have to be about nonsensical relationships, like ss# and IQ? Thanks for any additional explanation that can help me understand your point. $\endgroup$
    – Chelonian
    Commented Jul 19, 2015 at 18:03
  • 11
    $\begingroup$ To make @PeterFlom's examples concrete: the first three digits of an SSN (used to) encode the applicant's zip code. Since the individual states have somewhat different demographics and toe size might be correlated with some demographic factors (age, race, etc), there is almost certainly a relationship between social security number and toe size--if one has enough data. $\endgroup$ Commented Jul 19, 2015 at 22:59
  • 6
    $\begingroup$ @MattKrause good example. I prefer finger count by gender. I am sure if I took a census of all men and all women, I would find that one gender has more fingers on average than the other. Without taking an extremely large sample, I have no idea which gender has more fingers. Furthermore, I doubt as a glove manufacturer I would use finger census data in glove design. $\endgroup$
    – emory
    Commented Jul 20, 2015 at 12:55
43
$\begingroup$

Aren't all researchers around the world somewhat like the "infinite monkey theorem" monkeys?

Remember, scientists are critically NOT like infinite monkeys, because their research behavior--particularly experimentation--is anything but random. Experiments are (at least supposed to be) incredibly carefully controlled manipulations and measurements that are based on mechanistically informed hypotheses that build on a large body of previous research. They are not just random shots in the dark (or monkey fingers on typewriters).

Consider that there are 23887 universities in the world. If each university has 1000 students, that's about 23 million students each year. Let's say that each year, each student does at least one piece of research,

That estimate for the number of published research findings has got to be way, way off. I don't know if there are 23 million "university students" (does that include only universities, or colleges too?) in the world, but I know that the vast majority of them never publish any scientific findings. I mean, most of them are not science majors, and even most science majors never publish findings.

A more likely estimate (some discussion) for the number of scientific publications each year is about 1-2 million.

Doesn't that mean that even if all the research samples were pulled from a purely random population, about 5% of them would still "reject the null hypothesis"? Wow. Think about that. That's about a million research papers per year getting published due to "significant" results.

Keep in mind that not all published research reports significance right at the p = 0.05 level. Often one sees p-values like p<0.01 or even p<0.001. I don't know what the "mean" p-value is over a million papers, of course.

If this is how it works, this is scary. It means that a lot of the "scientific truth" we take for granted is based on pure randomness.

Also keep in mind, scientists are really not supposed to take a small number of results at p around 0.05 as "scientific truth". Not even close. Scientists are supposed to integrate over many studies, each of which has appropriate statistical power, plausible mechanism, reproducibility, magnitude of effect, etc., and incorporate that into a tentative model of how some phenomenon works.

But does this mean that almost all of science is correct? No way. Scientists are human, and fall prey to biases, bad research methodology (including improper statistical approaches), fraud, simple human error, and bad luck. These factors, rather than the p<0.05 convention, are probably the dominant reasons why a healthy portion of published science is wrong. In fact, let's just cut right to the chase and make an even "scarier" statement than the one you have put forth:

Why Most Published Research Findings Are False

$\endgroup$
3
  • 10
    $\begingroup$ I'd say that Ioannidis is making a rigorous argument that backs up the question. Science is not done anything like as well as the optimists answering here seem to think. And a lot of published research is never replicated. Moreover, when replication is attempted, the results tend to back up the Ioannidis argument that much published science is basically bollocks. $\endgroup$
    – matt_black
    Commented Jul 19, 2015 at 18:40
  • 9
    $\begingroup$ It may be of interest that in particle physics our p-value threshold to claim a discovery is 0.00000057. $\endgroup$
    – David Z
    Commented Jul 20, 2015 at 7:57
  • 2
    $\begingroup$ And in many cases, there are no p values at all. Mathematics and theoretical physics are common cases. $\endgroup$
    – Davidmh
    Commented Jul 22, 2015 at 21:43
21
$\begingroup$

Your understanding of $p$-values seems to be correct.

Similar concerns are voiced quite often. What makes sense to compute in your example is not only the number of studies out of 23 million that arrive at false positives, but also the proportion of studies with a significant result that are in fact false positives. This is called the "false discovery rate". It is not equal to $\alpha$ and depends on various other things, such as the proportion of true nulls across your 23 million studies. This is of course impossible to know, but one can make guesses. Some people say that the false discovery rate is at least 30%.

See e.g. this recent discussion of a 2014 paper by David Colquhoun: Confusion with false discovery rate and multiple testing (on Colquhoun 2014). I have been arguing there against this "at least 30%" estimate, but I do agree that in some fields of research the false discovery rate can be a lot higher than 5%. This is indeed worrisome.
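
To make that dependence concrete, here is a minimal back-of-the-envelope sketch in R. The proportions of true nulls and the assumed power are made-up illustrative numbers, not figures from Colquhoun's paper:

# False discovery rate as a function of the (unknown) proportion of true nulls,
# assuming a common alpha and a common power across all studies.
fdr <- function(prop_null, alpha = 0.05, power = 0.8) {
  false_pos <- prop_null * alpha         # true nulls that get rejected anyway
  true_pos  <- (1 - prop_null) * power   # false nulls that get correctly rejected
  false_pos / (false_pos + true_pos)     # share of "discoveries" that are false
}
fdr(0.5)  # ~0.06 if half of the tested nulls are true
fdr(0.9)  # ~0.36 if 90% of the tested nulls are true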

I don't think that saying that the null is almost never true helps here; Type S and Type M errors (as introduced by Andrew Gelman) are not much better than Type I/II errors.

I think what it really means is that one should never trust an isolated "significant" result.

This is even true in high energy physics with its super-stringent $\alpha\approx 10^{-7}$ criterion; we believe the discovery of the Higgs boson partially because it fits the theoretical prediction so well. This is of course much much MUCH more so in other disciplines with far less stringent conventional significance criteria ($\alpha=0.05$) and a lack of very specific theoretical predictions.
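
(That threshold is the particle-physics "5 sigma" convention; as a quick check in R, pnorm(-5) gives the one-sided tail of about 2.9e-07, and 2 * pnorm(-5) gives the two-sided 5.7e-07 quoted in a comment above.)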

Good studies, at least in my field, do not report an isolated $p<0.05$ result. Such a finding would need to be confirmed by another (at least partially independent) analysis, and by a couple of other independent experiments. If I look at the best studies in my field, I always see a whole bunch of experiments that together point at a particular result; their "cumulative" $p$-value (that is never explicitly computed) is very low.
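
One standard way to make such a "cumulative" $p$-value explicit is Fisher's method for combining independent $p$-values; here is a minimal sketch in R (the three $p$-values are invented for illustration):

# Fisher's method: under the global null, -2 * sum(log(p)) over k independent
# p-values follows a chi-squared distribution with 2k degrees of freedom.
fisher_combine <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}
fisher_combine(c(0.04, 0.03, 0.08))  # three modest results combine to p of about 0.005
set.seed(1)
fisher_combine(runif(10))            # ten null (uniform) p-values: the combined p is not small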

To put it differently, I think that if a researcher gets some $p<0.05$ finding, it only means that he or she should go and investigate it further. It definitely does not mean that it should be regarded as "scientific truth".

$\endgroup$
2
  • $\begingroup$ Re "cumulative p values": Can you just multiply the individual p values, or do you need to do some monstrous combinatorics to make it work? $\endgroup$
    – Kevin
    Commented Jul 20, 2015 at 23:02
  • $\begingroup$ @Kevin: one can multiply individual $p$-values, but one needs to adapt the significance threshold $\alpha$. Think of 10 random $p$-values uniformly distributed on [0,1] (i.e. generated under null hypothesis); their product will most likely be below 0.05, but it would be nonsense to reject the null. Look for Fisher's method of combining p-values; there's a lot of threads about it here on CrossValidated too. $\endgroup$
    – amoeba
    Commented Jul 22, 2015 at 9:44
19
$\begingroup$

Your concern is exactly the concern that underlies a great deal of the current discussion in science about reproducibility. However, the true state of affairs is a bit more complicated than you suggest.

First, let's establish some terminology. Null hypothesis significance testing can be understood as a signal detection problem -- the null hypothesis is either true or false, and you can either choose to reject or retain it. The combination of two decisions and two possible "true" states of affairs results in the following table, which most people see at some point when they're first learning statistics:

[Image: the standard 2×2 table crossing the researcher's decision (reject or retain the null) with the true state of affairs (null true or false), yielding correct decisions, Type I errors (false positives), and Type II errors (false negatives)]

Scientists who use null hypothesis significance testing are attempting to maximize the number of correct decisions (shown in blue) and minimize the number of incorrect decisions (shown in red). Working scientists are also trying to publish their results so that they can get jobs and advance their careers.

Of course, bear in mind that, as many other answerers have already mentioned, the null hypothesis is not chosen at random -- instead, it is usually chosen specifically because, based on prior theory, the scientist believes it to be false. Unfortunately, it is hard to quantify the proportion of times that scientists are correct in their predictions, but note that, when scientists are dealing with the "$H_0$ is false" column, they should be worried about false negatives rather than false positives.


You, however, seem to be concerned about false positives, so let's focus on the "$H_0$ is true" column. In this situation, what is the probability of a scientist publishing a false result?

Publication bias

As long as the probability of publication does not depend on whether the result is "significant", then the probability is precisely $\alpha$ -- .05, and sometimes lower depending on the field. The problem is that there is good evidence that the probability of publication does depend on whether the result is significant (see, for example, Stern & Simes, 1997; Dwan et al., 2008), either because scientists only submit significant results for publication (the so-called file-drawer problem; Rosenthal, 1979) or because non-significant results are submitted for publication but don't make it through peer review.

The general issue of the probability of publication depending on the observed $p$-value is what is meant by publication bias. If we take a step back and think about the implications of publication bias for a broader research literature, a research literature affected by publication bias will still contain true results -- sometimes the null hypothesis that a scientist claims to be false really will be false, and, depending on the degree of publication bias, sometimes a scientist will correctly claim that a given null hypothesis is true. However, the research literature will also be cluttered up by too large a proportion of false positives (i.e., studies in which the researcher claims that the null hypothesis is false when really it's true).
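
As a rough illustration of that last point, here is a sketch in R with invented numbers (half of the tested nulls true, a modest true effect and power for the rest):

# When only "significant" results get published, false positives make up a larger
# share of the published literature than of the studies that were actually run.
set.seed(42)
null_true <- runif(10000) < 0.5                   # assume half of all tested nulls are true
p <- sapply(null_true, function(is_null) {
  t.test(rnorm(20, mean = if (is_null) 0 else 0.5))$p.value
})
mean(null_true & p < 0.05)   # false positives as a share of all studies run: ~2.5%
mean(null_true[p < 0.05])    # false positives as a share of "published" (significant) studies: ~8%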

Researcher degrees of freedom

Publication bias is not the only way that, under the null hypothesis, the probability of publishing a significant result will be greater than $\alpha$. When used improperly, certain areas of flexibility in the design of studies and analysis of data, which are sometimes labeled researcher degrees of freedom (Simmons, Nelson, & Simonsohn, 2011), can increase the rate of false positives, even when there is no publication bias. For example, if we assume that, upon obtaining a non-significant result, all (or some) scientists will exclude one outlying data point if this exclusion will change the non-significant result into a significant one, the rate of false positives will be greater than $\alpha$. Given the presence of a large enough number of questionable research practices, the rate of false positives can go as high as .60 even if the nominal rate was set at .05 (Simmons, Nelson, & Simonsohn, 2011).
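
A small simulation along the lines of that example (a sketch that assumes the researcher drops the single most extreme observation, and keeps the exclusion only when it rescues significance):

# Under a true null, giving yourself a second chance at p < .05 inflates the
# false-positive rate above the nominal alpha.
set.seed(123)
one_study <- function(n = 20) {
  x <- rnorm(n)                                  # the null is true: the population mean is 0
  p1 <- t.test(x)$p.value
  if (p1 < 0.05) return(p1)                      # significant on the first try: report it
  p2 <- t.test(x[-which.max(abs(x))])$p.value    # otherwise re-test without the most extreme point
  if (p2 < 0.05) p2 else p1                      # keep the exclusion only if it "works"
}
mean(replicate(20000, one_study()) < 0.05)       # comes out above 0.05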

It's important to note that the improper use of researcher degrees of freedom (which is sometimes known as a questionable research practice; Martinson, Anderson, & de Vries, 2005) is not the same as making up data. In some cases, excluding outliers is the right thing to do, either because equipment fails or for some other reason. The key issue is that, in the presence of researcher degrees of freedom, the decisions made during analysis often depend on how the data turn out (Gelman & Loken, 2014), even if the researchers in question are not aware of this fact. As long as researchers use researcher degrees of freedom (consciously or unconsciously) to increase the probability of a significant result (perhaps because significant results are more "publishable"), the presence of researcher degrees of freedom will overpopulate a research literature with false positives in the same way as publication bias.


An important caveat to the above discussion is that scientific papers (at least in psychology, which is my field) seldom consist of single results. More common are multiple studies, each of which involves multiple tests -- the emphasis is on building a larger argument and ruling out alternative explanations for the presented evidence. However, the selective presentation of results (or the presence of researcher degrees of freedom) can produce bias in a set of results just as easily as in a single result. There is evidence that the results presented in multi-study papers are often much cleaner and stronger than one would expect even if all the predictions of these studies were true (Francis, 2013).


Conclusion

Fundamentally, I agree with your intuition that null hypothesis significance testing can go wrong. However, I would argue that the true culprits producing a high rate of false positives are processes like publication bias and the presence of researcher degrees of freedom. Indeed, many scientists are well aware of these problems, and improving scientific reproducibility is a very active current topic of discussion (e.g., Nosek & Bar-Anan, 2012; Nosek, Spies, & Motyl, 2012). So you are in good company with your concerns, but I also think there are reasons for some cautious optimism.


References

Stern, J. M., & Simes, R. J. (1997). Publication bias: Evidence of delayed publication in a cohort study of clinical research projects. BMJ, 315(7109), 640–645. http://doi.org/10.1136/bmj.315.7109.640

Dwan, K., Altman, D. G., Arnaiz, J. A., Bloom, J., Chan, A., Cronin, E., … Williamson, P. R. (2008). Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS ONE, 3(8), e3081. http://doi.org/10.1371/journal.pone.0003081

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. http://doi.org/10.1037/0033-2909.86.3.638

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. http://doi.org/10.1177/0956797611417632

Martinson, B. C., Anderson, M. S., & de Vries, R. (2005). Scientists behaving badly. Nature, 435, 737–738. http://doi.org/10.1038/435737a

Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102, 460-465.

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153–169. http://doi.org/10.1016/j.jmp.2013.02.003

Nosek, B. A., & Bar-Anan, Y. (2012). Scientific utopia: I. Opening scientific communication. Psychological Inquiry, 23(3), 217–243. http://doi.org/10.1080/1047840X.2012.692215

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. http://doi.org/10.1177/1745691612459058

$\endgroup$
3
10
$\begingroup$

A substantial check on the important issue raised in this question is that "scientific truth" is not based on individual, isolated publications. If a result is sufficiently interesting, it will prompt other scientists to pursue its implications. That work will tend to confirm or refute the original finding. There might be a 1/20 chance of rejecting a true null hypothesis in an individual study, but only a 1/400 chance of doing so twice in a row.
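
(In numbers: with $\alpha = 0.05$ and two independent, pre-planned studies, that is $0.05 \times 0.05 = 0.0025 = 1/400$.)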

If scientists simply repeated experiments until they found "significance" and then published their results, the problem might be as large as the OP suggests. But that's not how science works, at least in my nearly 50 years of experience in biomedical research. Furthermore, a publication is seldom about a single "significant" experiment but rather is based on a set of inter-related experiments (each required to be "significant" on its own) that together provide support for a broader, substantive hypothesis.

A much larger problem comes from scientists who are too committed to their own hypotheses. They then may over-interpret the implications of individual experiments to support their hypotheses, engage in dubious data editing (like arbitrarily removing outliers), or (as I have seen and helped catch) just make up the data.

Science, however, is a highly social process, regardless of the mythology about mad scientists hiding high up in ivory towers. The give and take among thousands of scientists pursuing their interests, based on what they have learned from others' work, is the ultimate institutional protection from false positives. False findings can sometimes be perpetuated for years, but if an issue is sufficiently important the process will eventually identify the erroneous conclusions.

$\endgroup$
7
  • 7
    $\begingroup$ The $1/4000$ estimate may be misleading. If one is in the business of repeating experiments until achieving "significance" and then publishing, then the expected number of experiments needed to publish an initial "significant" result and to follow it up with a second "significant" result is only $40$. $\endgroup$
    – whuber
    Commented Jul 19, 2015 at 13:25
  • 2
    $\begingroup$ Out of 23M studies, we still couldn't tell whether 5,000 results rejected the null hypothesis only due to noise, could we? It really is also a problem of scale. Once you have millions of studies, type I errors will be common. $\endgroup$
    – n_mu_sigma
    Commented Jul 19, 2015 at 13:34
  • 4
    $\begingroup$ If there were only 5000 erroneous conclusions out of 23,000,000 studies I would call that uncommon indeed! $\endgroup$
    – whuber
    Commented Jul 19, 2015 at 14:39
  • 4
    $\begingroup$ In nearly 50 years of doing science and knowing other scientists, I can't think of any who repeated experiments until they achieved "significance." The theoretical possibility raised by @whuber is, in my experience, not a big practical problem. The much bigger practical problem is making up data, either indirectly by throwing away "outliers" that don't fit a preconception, or by just making up "data" to start with. Those behaviors I have seen first hand, and they can't be fixed by adjusting p-values. $\endgroup$
    – EdM
    Commented Jul 19, 2015 at 14:53
  • 3
    $\begingroup$ @EdM "There might be a 1/20 chance of rejecting a true null hypothesis in an individual study, but only a 1/4000 of doing so twice in a row." How did you get the second number? $\endgroup$
    – Aksakal
    Commented Jul 21, 2015 at 18:49
5
$\begingroup$

Just to add to the discussion, here is an interesting post and subsequent discussion about how people commonly misunderstand the p-value.

What should be retained in any case is that a p-value is just a measure of the strength of the evidence against a given hypothesis. A p-value is definitely not a hard threshold below which something is "true" and above which it is only due to chance. As explained in the post referenced above:

results are a combination of real effects and chance, it’s not either/or

$\endgroup$
1
4
$\begingroup$

As also pointed out in the other answers, this only causes problems if you selectively consider the positive results, i.e. those where the null hypothesis is rejected. This is why scientists write review articles, in which they consider previously published research results and try to develop a better understanding of the subject based on them. However, a problem still remains due to so-called "publication bias": scientists are more likely to write up an article about a positive result than about a negative result, and a paper on a negative result is more likely to get rejected for publication than a paper on a positive result.

This is a big problem especially in fields where statistical tests are very important; medicine is a notorious example. This is why it was made compulsory to register clinical trials before they are conducted (e.g. here). You must explain the setup, how the statistical analysis is going to be performed, etc., before the trial gets underway. The leading medical journals will refuse to publish papers if the trials they report on were not registered.

Unfortunately, despite this measure, the system isn't working all that well.

$\endgroup$
1
4
$\begingroup$

This touches on a very important fact about the scientific method: it emphasizes falsifiability. The philosophy of science that is most popular today has Karl Popper's concept of falsifiability as a cornerstone.

The basic scientific process is thus:

  • Anyone can claim any theory they want, at any time. Science will admit any theory which is "falsifiable." The most literal sense of that word is that, if anyone else doesn't like the claim, that person is free to spend the resources to disprove the claim. If you don't think argyle socks cure cancer, you are free to use your own medical ward to disprove it.

  • Because this bar for entry is monumentally low, it is traditional that "Science" as a cultural group will not really entertain any idea until you have made a "good effort" to falsify your own theory.

  • Acceptance of ideas tends to go in stages. You can get your concept into a journal article with one study and a rather low p-value. What that does buy you is publicity and some credibility. If someone is interested in your idea, such as if your science has engineering applications, they may want to use it. At that time, they are more likely to fund an additional round of falsification.

  • This process goes forward, always with the same attitude: believe what you want, but to call it science, I need to be able to disprove it later.

This low bar for entry is what allows science to be so innovative. So yes, there are a large number of theoretically "wrong" journal articles out there. However, the key is that every published article is in theory falsifiable, so at any point in time, someone could spend the money to test it.

This is the key: journals contain not only things which pass a reasonable p-test, but they also contain the keys for others to dismantle them if the results turn out to be false.

$\endgroup$
10
  • 1
    $\begingroup$ This is very idealistic. Some people are concerned that too many wrong papers can create too low signal-to-noise ratio in the literature and seriously slow down or misguide the scientific process. $\endgroup$
    – amoeba
    Commented Jul 20, 2015 at 15:22
  • 1
    $\begingroup$ @amoeba You do bring up a good point. I certainly wanted to capture the ideal case because I find it is oft lost in the noise. Beyond that, I think the question of SNR in the literature is a valid question, but at least it is one that should be balancable. There's already concepts of good journals vs poor journals, so there's some hints that that balancing act has been underway for some time. $\endgroup$
    – Cort Ammon
    Commented Jul 20, 2015 at 16:07
  • $\begingroup$ This grasp of the philosophy of science seems to be several decades out of date. Popperian falsifiability is only "popular" in the sense of being a common urban myth about how science happens. $\endgroup$
    – 410 gone
    Commented Jul 21, 2015 at 17:24
  • $\begingroup$ @EnergyNumbers Could you enlighten me on the new way of thinking? The philosophy SE has a very different opinion from yours. If you look at the question history over there, Popperian falsifiability is the defining characteristic of science for the majority of those who spoke their voice. I'd love to learn a newer way of thinking and bring it over there! $\endgroup$
    – Cort Ammon
    Commented Jul 21, 2015 at 17:56
  • $\begingroup$ New? Kuhn refuted Popper decades ago. If you've got no one post Popperian on philosophy.se, then updating it would seem to be a lost cause - just leave it in the 1950s. If you want to update yourself, then any undergraduate primer from the 21st-century on the philosophy of science should get you started. $\endgroup$
    – 410 gone
    Commented Jul 21, 2015 at 18:32
1
$\begingroup$

Is this how "science" is supposed to work?

That's how a lot of the social sciences work. Not so much the physical sciences. Think of this: you typed your question on a computer. People were able to build these complicated beasts called computers using the knowledge of physics, chemistry and other physical sciences. If the situation were as bad as you describe, none of the electronics would work. Or think of things like the mass of an electron, which is known with insane precision. Electrons pass through billions of logic gates in a computer over and over, and your computer still works and works for years.

UPDATE: To respond to the downvotes I received, I felt inspired to give you a couple of examples.

The first one is from physics: Bystritsky, V. M., et al. "Measuring the astrophysical S factors and the cross sections of the p (d, γ) 3He reaction in the ultralow energy region using a zirconium deuteride target." Physics of Particles and Nuclei Letters 10.7 (2013): 717-722.

As I wrote before, these physicists don't even pretend to do any statistics beyond computing standard errors. There's a bunch of graphs and tables, but not a single p-value or even a confidence interval. The only evidence of statistics is the standard errors, noted as $0.237 \pm 0.061$, for instance.

My next example is from... psychology: Paustian-Underdahl, Samantha C., Lisa Slattery Walker, and David J. Woehr. "Gender and perceptions of leadership effectiveness: A meta-analysis of contextual moderators." Journal of Applied Psychology, 2014, Vol. 99, No. 6, 1129 –1145.

These researchers have all the usual suspects: confidence intervals, p-values, $\chi^2$ etc.

Now, look at some tables from papers and guess which papers they are from:

[Images: a table of results from each of the two papers]

That's the answer to why in one case you need "cool" statistics and in the other you don't: the data are either crappy or they are not. When you have good data, you don't need much statistics beyond standard errors.

UPDATE2: @PatrickS.Forscher made an interesting statement in the comments:

It is also true that social science theories are "softer" (less formal) than physics theories.

I must disagree. In Economics and Finance the theories are not "soft" at all. You can randomly look up a paper in these fields and get something like this:

[Image: an excerpt of formal definitions and equations from the paper]

and so on.

It's from Schervish, Mark J., Teddy Seidenfeld, and Joseph B. Kadane. "Extensions of expected utility theory and some limitations of pairwise comparisons." (2003). Does this look soft to you?

I'm reiterating my point here that when your theories are not good and the data are crappy, you can use the hardest math and still get a crappy result.

In this paper they're talking about utilities, a concept, like happiness or satisfaction, that is absolutely unobservable. What is the utility of having a house vs. eating a cheeseburger? Presumably there's a function where you can plug in "eat cheeseburger" or "live in own house" and it will spit out the answer in some units. As crazy as it sounds, this is what modern economics is built on, thanks to von Neumann.

$\endgroup$
22
  • 1
    $\begingroup$ +1 Not sure why this was downvoted twice. You are basically pointing out that discoveries in physics can be tested with experiments, and most "discoveries" in the social sciences can't be, which doesn't stop them getting plenty of media attention. $\endgroup$
    – Flounderer
    Commented Jul 20, 2015 at 4:53
  • 6
    $\begingroup$ Most experiments ultimately involve some sort of statistical test and still leave room for type 1 errors and misbehaviours like p-value fishing. I think that singling out the social sciences is a bit off mark. $\endgroup$
    – Kenji
    Commented Jul 20, 2015 at 11:23
  • 4
    $\begingroup$ To amend a bit what @GuilhermeKenjiChihaya is saying, the standard deviation of the errors could presumably be used to perform a statistical test in physical experiments. Presumably this statistical test would come to the same conclusion that the authors reach upon viewing the graph with its error bars. The main difference with physics papers, then, is the underlying amount of noise in the experiment, a difference that is independent of whether the logic underlying the use of p-values is valid or invalid. $\endgroup$ Commented Jul 21, 2015 at 20:02
  • 3
    $\begingroup$ Also, @Flounderer, you seem to be using the term "experiment" in a sense with which I am unfamiliar, as social scientists do "experiments" (i.e., randomization of units to conditions) all the time. It is true that social science experiments are difficult to control to the same degree that is present as in physics experiments. It is also true that social science theories are "softer" (less formal) than physics theories. But these factors are independent of whether a given study is an "experiment". $\endgroup$ Commented Jul 21, 2015 at 20:06
  • 2
    $\begingroup$ @Aksakal while I disagree with -1's, I also partly disagree with your criticism of the social sciences. Your example of an economics paper is also not a good example of what social scientists do on a daily basis, because utility theory is a strictly economic/mathematical/statistical concept (so it already has math in it) and it does not resemble e.g. psychological theories that are tested experimentally... However, I agree that it is often the case that statistics are used loosely in many areas of research, including the social sciences. $\endgroup$
    – Tim
    Commented Jul 22, 2015 at 7:15
