
Publication bias. Reproducibility problem. Abusing statistical tests.

These are some of the many criticisms that all fields of science have faced for a long time. If I read an article in Psychological Science and am sceptical of its results, or if I want to apply other statistical techniques to see whether the results remain convincing to me, I can't. I would need to run another experiment. Or if I want to conduct a meta-analysis, having other researchers' raw data may be better than just the mean/CI they report in journals.

If scientists' mission is the public good and the advancement of knowledge, why don't they publish their raw data (of course, they would need to remove information that could identify research participants)? They shouldn't be afraid of others criticising their work. Only truth can endure the test of time.

Nowadays, with the prevalence (and low price) of online storage platforms and sophisticated database management, why don't they do it for the public's good?

EDIT: by raw data, I mean making the dataset public and accessible to everyone (well... at least to researchers).

  • What is the question? Researchers do publish the raw data. Often it is not attached to the research paper, as no one wants to see a 500,000-row observation table spanning 50 pages. Instead, they document their sources and methods so that you can reproduce the results should you choose to. On this note, part of the review process is examining and dissecting the method itself to see if it is flawed.
    – Bluebird
    Commented Oct 11, 2017 at 7:48
  • The Square Kilometre Array telescope is forecast to produce 2 petabytes of raw data a day. How does letting members of the public randomly browse that amount of meaningless (to them) numbers help with any of the problems you mentioned?
    Commented Oct 11, 2017 at 17:55

2 Answers


A few thoughts:

Publication bias

While making data available would help address some problems in publication bias, merely having data sitting on the internet somewhere isn't even close to enough to solve it if people still use journals as content curators.

Reproducibility problem

Available data ≠ reproducible data

I need to run another experiment

This is, in many ways, desirable. "I took the same data and got the same answer" is a fairly low form of reproducibility. Yes, it catches statistical errors and enables you to try new methods, but at least in my field, before something can really be thought of as "reproduced" it needs to be obtained via an entirely different experiment, preferably in a different population. That makes it possible to tell whether there is a consistent effect that occurs in a variety of contexts, or whether it was a fleeting result that was either noise or (more philosophically) just the tail of an effect that is randomly but not perfectly distributed around 1.

Or if I want to conduct a meta-analysis, having other researchers' raw data may be better than just the mean/CI they report in journals.

It depends on the analysis you want to do, but this is not inherently true. Also note that, if this is what you're doing, an email to the researchers may often provide what you need. Both times in my career when I have genuinely needed someone's raw data, I've been able to get it.

If scientists' mission is the public good and the advancement of knowledge

You are making a massive assumption here: That the mission of scientists is the public good. A few notes:

  • Even for idealistic scientists, the actual doing of science doesn't occur in a vacuum. In order to continue doing your science, keep your people paid and the lights on, etc., you have to compete with other labs. Collecting data is often a long and laborious process, and there is a very real temptation to continue mining that data past the initial publication. It is a competitive advantage, and science is competitive.
  • It is not axiomatically true that the public good from the release of data > the public good that comes from a lab being otherwise successful. An idealistic lab sacrificing themselves on the altar of data access doesn't necessarily help.

why don't they publish their raw data (of course, they would need to remove information that could identify research participants)?

This is a considerable hurdle. For example, there are a number of studies I've worked on where identifiable information is essential to the finding in question. This might be a special case, but it's not an uncommon one. There may also be agreements in place preventing this - many minority groups, for example, are very justifiably skeptical of "And then we can do anything with your data we want".

They shouldn't be afraid of others criticising their work. Only truth can endure the test of time.

Very often, the concern is much more about preserving the ability of their data to generate new publications over time.

Nowadays, with the prevalence (and low price) of online storage platforms and sophisticated database management, why don't they do it for the public's good?

Because the public's good doesn't pay my postdoc's salary.

Now this all sounds jaded and bitter and horrible. Which is ironic, because I actually do try to make as much data as possible available to the public. But there are very real constraints, both on the nature of the data themselves and in the doing of science, that stand in the way of automatically making data available. One must be able to acknowledge these when thinking about data accessibility and reproducibility.

In my case, for example, what "the data" is is often a somewhat murky concept, and I find the tendency to treat "I downloaded your code and data and ran it" as reproducibility to be... troubling.


Once upon a time, no data was published.

Today, large amounts of data are published, though in ways that preserve the confidentiality of human subjects.

US Federal grants and contracts for most research now have a requirement that the investigators include a plan to make their original data available to users. One common way is to deposit the data in an archive managed by a large research organization. One example of this is the ICPSR (Interuniversity Consortium for Political and Social Research) Archive which is maintained at the University of Michigan's Institute for Social Research. Other universities have archives for their own research data.

The data is often modified to meet the privacy standards required for federally sponsored human subjects research. This usually means that some identifiers have been modified or removed. For example, age might be reported in intervals of years rather than as exact age. These modifications are a condition of doing the research, so they can't be helped.
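As an illustration of the kind of de-identification described above, here is a minimal sketch of coarsening exact ages into interval labels before release; the record layout, column names, and 5-year interval width are hypothetical choices, not any agency's mandated scheme:

```python
# Sketch of one common de-identification step: replacing exact ages
# with interval labels before a dataset is deposited in an archive.
# The field names and 5-year width below are illustrative assumptions.

def coarsen_age(age: int, width: int = 5) -> str:
    """Map an exact age to an interval label, e.g. 23 -> '20-24'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

records = [{"id": 1, "age": 23}, {"id": 2, "age": 67}]

# Drop the exact age and keep only the coarsened group.
deidentified = [
    {"id": r["id"], "age_group": coarsen_age(r["age"])} for r in records
]
print(deidentified)
# [{'id': 1, 'age_group': '20-24'}, {'id': 2, 'age_group': '65-69'}]
```

Real de-identification goes well beyond this (names, dates, geography, rare combinations of attributes), but the principle is the same: trade precision for privacy before the data leaves the lab.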

Federal survey data is mostly available online from the agency that collected it: the Census Bureau, the Bureau of Labor Statistics, the National Center for Health Statistics, etc. Some of it is free. Some kinds of data, such as patient data from the Centers for Medicare and Medicaid Services, can be obtained if the requestor proves that they can ensure the security of the data. Federal agencies often have data centers where you can do your analyses; any requested results are examined by staff to ensure that their release will not compromise the privacy of respondents.

A vast amount of data is now published. The notion that an investigator owns their data and completely controls its release is dead, at least if someone else paid for its collection.

If you would like some data, contact the principal investigator or the funding agency and ask how to get it, at least in the US.

