10

Say I have a hypothesis and then did the experiment 10 times and collected 10 data points. Five of the data points agree well with my hypothesis, but the other five are outliers. I strongly believe in the validity of my hypothesis (which every experimentalist does I guess), but I can't find an explanation for the outliers, or the possible explanations are too many. Can I still publish with five good data points? How should I deal with the other five? I assume it's unethical to not report them, right?

Update: Sorry I didn't make my situation clear. We did use a lot of different techniques to test our hypothesis. The 50% outlier is just one technique we used. But the other techniques we used all converge very well and support the hypothesis. The 50% outlier for that one technique probably only accounts for 10% of the total data points of all the techniques we used. That's what bothers me and my question is

  1. Can I only report data points for the other techniques we used?
  2. If I have to report data for the technique that has a lot of outliers, do I have to find a reasonable explanation for the outliers?
  • 2
    Related, though the opposite problem: Is it unethical/unscientific to omit outlier data in a publication when they are in FAVOR of your argument?
    – cag51
    Commented Oct 1, 2022 at 4:51
  • 22
    Mother Nature really doesn’t care how much you believe in your hypothesis. You need to care a lot more about what the experiments are telling you about your hypothesis.
    – Jon Custer
    Commented Oct 1, 2022 at 14:24
  • 56
    If 5 out of 10 data points disagree with your hypothesis, and you still "strongly believe" in it, then I have other concerns besides explaining the "outliers."
    – BillOnne
    Commented Oct 1, 2022 at 18:29
  • 2
    What is your definition of an outlier?
    Commented Oct 2, 2022 at 6:49
  • 5
    If the so-called outliers come from a different technique, they aren't outliers but a different population. Outliers are values far away from the rest of the same population. Your problem seems to be that the results from one technique (or more) agree with your hypothesis and the results from another technique don't, which is a reasonable concern, but it is not what you are asking in the question.
    – Pere
    Commented Oct 2, 2022 at 17:32

5 Answers

39

Yes, you have to report all of your data. You also need additional expertise in experimental design, statistics, the scientific method and Quality Control/Quality Assurance.

Say I have a hypothesis and then did the experiment 10 times and collected 10 data points.

The way this is described, the 'experiment' is your one independent data point. The repeat executions of that experiment do not qualify as additional unique data points because they should be highly correlated... but they aren't...

Five of the data points agree well with my hypothesis, but the other five are outliers.

(Sigh.) Half of your data is not an outlier by definition. An observation doesn't become an outlier because it doesn't support your hypothesis.

I strongly believe in the validity of my hypothesis (which every experimentalist does I guess),

Stop this talk right now. This is wrong on every level. As an experimentalist, you definitely do not speak for me. The data does the speaking. Imagine this were a forensics experiment in a court case, you were the prosecution's witness, and you said this under oath. Imagine that for a moment.

but I can't find an explanation for the outliers, or the possible explanations are too many.

Which is it? The hypothesis you do advance has to be testable and falsifiable. Remember that your different experiments are the data points, NOT the individual repetitions of the same experiment. You clearly have enough explanations. If you have too many explanations, reformulate the question: over and over and over.

Can I still publish with five good data points?

Of course you can, provided you submit all the data. It sounds to me like you have bigger problems, though.

How should I deal with the other five?

Include the single experiment-with-multiple-repetitions as 'not supporting our hypothesis'. Formulate a testable, falsifiable, defensible, reproducible hypothesis for those results and discuss how it should be tested as 'future work'.

Get some help in experimental design, statistics, quality.

I assume it's unethical to not report them, right?

It is unethical, yes, but it goes well beyond unethical. The phrase I've heard at work is that you 'know enough to be dangerous'. The path you are dancing around is a mix of fraud and negligence and needs to be addressed today.

Update: Sorry I didn't make my situation clear. We did use a lot of different techniques to test our hypothesis.

Good. These are your 'unique data points'. As written, you only have 'one outlier', not 5. Behold the magic of statistics!

The 50% outlier is just one technique we used. But the other techniques we used all converge very well and support the hypothesis.

Great! Time to submit, pending the new hypothesis for explaining the new/repeat observations.

The 50% outlier for that one technique probably only accounts for 10% of the total data points of all the techniques we used.

Apparently there is a complete lack of uncertainty analysis, statistics, quality control, etc. The technical term is 'physics envy'. :)

That's what bothers me and my question is

  • Can I only report data points for the other techniques we used?

No, but you can change the 'resolution' of reporting from individual executions to unique approaches. Changing the 'resolution' of your aggregate data is the best, most common, most defensible, most reproducible way (that I know of) to fundamentally alter (shall we say 'improve') the data you have available.
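
A rough sketch of what that change of 'resolution' could look like (the technique names and numbers below are made up purely for illustration): report one summary row per technique, with the spread and the number of inconsistent runs disclosed, rather than a flat list of individual executions.

    # Hypothetical per-run measurements grouped by technique; all numbers invented.
    from statistics import mean, stdev

    runs = {
        "technique_A": [1.02, 0.98, 1.01, 0.99],       # agrees with the predicted value ~1.0
        "technique_B": [1.03, 0.97, 1.00],              # agrees with the predicted value ~1.0
        "technique_C": [1.01, 0.99, 0.45, 1.62, 0.10],  # the troublesome technique
    }

    predicted = 1.0
    tolerance = 0.1  # how far a run may deviate and still count as consistent

    # Report at the resolution of techniques, one summary per approach,
    # while still disclosing how many runs were inconsistent.
    for name, values in runs.items():
        n_bad = sum(abs(v - predicted) > tolerance for v in values)
        print(f"{name}: mean={mean(values):.2f}, sd={stdev(values):.2f}, "
              f"runs={len(values)}, inconsistent={n_bad}")

The raw executions still exist and should still be made available (e.g., as supplementary data); only the level at which the summary is presented changes.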

  • If I have to report data for the technique that has a lot of outliers, do I have to find a reasonable explanation for the outliers?

You cannot have 'a lot of outliers'. That is contrary to the definition of an outlier. You can have 'too many'.

You need something that is testable and falsifiable at a minimum. My personal suggestion is not to 'guess the correct answer' but to advance multiple, competing hypotheses. If you advance multiple hypotheses, the odds are better that you will list the 'correct' one. You want a path forward; this is the opportunity to lay that path. With multiple hypotheses advanced, you have room for natural selection to run its course and let the data select the most robust hypothesis.

50

Leaving half the data out is scientific fraud. You have to report all data points, and if they do not agree with your hypothesis, then maybe your hypothesis is wrong.

2
  • Thanks for your comment. The 50% outlier is just one part of the characterization we did. But the other 4 characterizations we did all converge very well and support the hypothesis. That's what bothers me the most. And the 50% outlier for that one characterization probably only accounts for 10% of the total data points of all the characterizations we did.
    – user162189
    Commented Oct 2, 2022 at 3:21
  • 16
    @Simon It's possible that there is an uncontrolled hidden variable causing that so-called outlier. That's why reporting all data is crucial: it allows us to notice that something is wrong and to find an explanation for it.
    – justhalf
    Commented Oct 2, 2022 at 14:42
23

Scientific integrity requires that you report all the data. Period.

You are then free to focus on the parts that support your hypothesis and try to explain those that don't.

9

I read the edit. If the model assumptions are correct, it is totally fine to have a situation where one test rejects your hypothesis while other tests (different approaches) do not. This "multiple-testing" situation is still safe here as long as you do not reject your hypothesis based solely on the result of that one "failed" test without any modification of the procedure (e.g., an adjustment for multiple comparisons). It is safe to report the data and the result of that test.

Furthermore, random variation could be the reason why there are outliers, but depending on the nature of the "outliers" (e.g. if they are too far from each other), there could still be potential problems. I still believe you should do more experiments or examine the datasets again to be sure.
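
A toy simulation makes the first point concrete. Assume, purely for illustration, five independent tests, each with a 5% chance of wrongly contradicting a hypothesis that is in fact true; the chance that at least one of them "fails" is already above 20%.

    # Toy simulation: five independent tests, each with a 5% false-alarm rate,
    # applied to a hypothesis that is actually true. All parameters are invented.
    import random

    random.seed(0)
    n_trials, n_tests, alpha = 100_000, 5, 0.05

    at_least_one_failure = sum(
        any(random.random() < alpha for _ in range(n_tests))
        for _ in range(n_trials)
    )
    print(at_least_one_failure / n_trials)  # roughly 0.23
    print(1 - (1 - alpha) ** n_tests)       # analytic value, about 0.226

This is only a sketch of the multiple-comparisons effect; with dependent tests, or a hypothesis that is not exactly true, the numbers will differ.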

Original answer:

In my opinion, don't do that. Do more experiments if possible. You cannot say that they are outliers if half of these data points disapprove your hypothesis. That is a terrible ratio. It is likely that there is a problem with your experiment/hypothesis.

Furthermore, it is not good practice to remove outliers without a proper explanation. "Outliers" are not objectively well-defined without assumptions. When someone says that some data points are outliers, they are implicitly assuming a structure for the data-generating process (model assumptions), so that they can conclude that these data points are too unlikely to have been generated by that process. Of course, the model assumptions should be made reasonably (according to what we know). Based on this statistical reasoning, they decide to remove those units.
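
A concrete sketch of that reasoning, with invented numbers: once the model assumption is stated explicitly, an "outlier" rule can be written down and applied mechanically, and changing the assumption changes which points get flagged.

    # Illustrative sketch only (invented numbers): which points count as
    # "outliers" depends entirely on the assumed data-generating model and rule.
    from statistics import mean, stdev, median

    data = [0.98, 1.01, 0.99, 1.02, 1.00, 1.63]

    # Rule 1: assume roughly normal data; flag points > 3 sd from the mean.
    mu, sigma = mean(data), stdev(data)
    rule1 = [x for x in data if abs(x - mu) > 3 * sigma]

    # Rule 2: a robust variant based on the median absolute deviation (MAD).
    med = median(data)
    mad = median(abs(x - med) for x in data)
    rule2 = [x for x in data if abs(x - med) > 3.5 * 1.4826 * mad]

    print(rule1)  # [] -- the extreme value inflates the sd, so nothing is flagged
    print(rule2)  # [1.63] -- the MAD-based rule flags it

Neither rule is "the" correct one; each is only as defensible as the assumptions behind it, which is exactly why removing points without stating those assumptions is not acceptable.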

However, if you have no reasonable explanation (model assumptions) to consider the data points as outliers, there is no reason for you to remove them. It is unethical to remove data points just to produce what you want to see rather than to produce the correct result.

Sometimes, "outliers" suggest a completely different conclusion. They raise a different question: "How wrong are the model assumptions?" It's not an ethical problem, but it is malpractice (and a waste of resources) to assume that the model assumptions are always correct (or not too wrong to be accepted) and to ignore the findings. With a proper (reasonable) reconsideration of the model assumptions, the correct conclusion can be derived. Of course, it is unethical to modify the model assumptions just to produce what you want to see, especially when there is nothing "significant" going on.

2
  • "Disapprove" should probably be "disprove".
    – LSpice
    Commented Oct 4, 2022 at 19:03
  • Probably I used the wrong word in this context, but I follow a strict interpretation of hypothesis testing, so generally I do not say "something is disproved" because HT cannot disprove/prove a hypothesis.
    – Neuchâtel
    Commented Oct 5, 2022 at 5:30
8

With five "outliers" out of ten, either your experiments are done quite badly (50% giving wrong results), or your hypothesis is very incomplete. For example, your hypothesis might be "mice do X" but the correct hypothesis would be "male mice do X", which would give 50% "outliers" as long as you use both male and female mice.
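
A toy simulation of that situation (all parameters invented): if an uncontrolled factor splits the sample into two groups with different behaviour, roughly half of the measurements will look like wild "outliers" relative to a single-population hypothesis, even though every single measurement is correct.

    # Toy simulation: a hidden, uncontrolled factor (say, the sex of the mouse)
    # determines whether the effect is present. All parameters are invented.
    import random

    random.seed(1)

    def run_experiment():
        male = random.random() < 0.5           # hidden variable, not controlled
        effect = 1.0 if male else 0.0          # only males show the effect
        return effect + random.gauss(0, 0.05)  # small measurement noise

    measurements = [run_experiment() for _ in range(10)]

    # Against the naive hypothesis "all mice do X" (effect around 1.0),
    # roughly half of the runs look like "outliers".
    consistent = [m for m in measurements if abs(m - 1.0) < 0.3]
    print(f"{len(consistent)} of {len(measurements)} runs fit the naive hypothesis")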

1
  • 3
    This is a good way to examine your experiments privately, and may be well-suited for a "future research" section in a paper. However, OP should not write anything purporting to prove such a hypothesis based on the data already collected; new experiments must be run to evaluate that hypothesis.
    – Angelica
    Commented Oct 2, 2022 at 21:29
