Isn't it an example of the forking paths problem?
Yes.
There are a number of posts on this site that address this issue (and related issues around various aspects of model selection, such as deciding whether or not to assume constant variance).
The degree to which it's a problem can vary quite a bit; in some circumstances it matters only a little (e.g. perhaps it pushes up the significance level noticeably on only one of the tests, and perhaps not by much), while other times it matters more.
Of course it can be hard to gauge quite how much in practice, because the impact on significance level and power (and indeed on the rates of other kinds of error) is a property of the population-process (across all possible samples), while you have only a single data set; the actual situation you're in with real data is naturally unknown (or you wouldn't have needed a hypothesis test in the first place). You can make some assumptions and investigate, but it's easy to fall into the trap of investigating only a few very tame assumptions and then asserting a general stamp of approval.
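For instance (purely as an illustration; the pre-test, sample size and null distribution below are assumptions of mine, not anything from the question), one can simulate a specific two-stage rule, such as "check normality, then pick the t-test or the Mann-Whitney accordingly", under a specific null and see how far the achieved rejection rate drifts from the nominal level; how much it drifts depends heavily on those assumptions.

```python
# Sketch: estimate the achieved type I error rate of a two-stage procedure
# (pre-test for normality, then choose t-test or Mann-Whitney) under an
# assumed null. All settings here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, reps = 20, 0.05, 10_000
rejections = 0
for _ in range(reps):
    # both samples from the same (here exponential) population, so H0 is true
    x = rng.exponential(size=n)
    y = rng.exponential(size=n)
    # stage 1: Shapiro-Wilk pre-test on the pooled sample decides the test
    if stats.shapiro(np.concatenate([x, y])).pvalue > 0.05:
        p = stats.ttest_ind(x, y).pvalue                              # stage 2a
    else:
        p = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue  # stage 2b
    rejections += (p < alpha)
print("achieved type I error rate:", rejections / reps)
```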
It's also not simply that you're choosing between two different competing tests -- typically you're choosing between two distinct hypotheses, and in that sense it's a form of Testing hypotheses suggested by the data, albeit not the worst possible example of it.
In that sense you can be choosing not just between two tests but between two potentially very different conclusions (which may be in opposite directions).
I recently discussed these two related issues here. While I didn't specifically mention Gelman and Loken's forking paths there, it's a term I have raised when discussing the issue at other times (albeit I'm not certain whether I've done so in an answer here or not).
As a general principle, one should be selecting hypotheses at the earliest stages of the study, relatively early in the planning. After all, the question about population parameters that you wanted to answer shouldn't be a mystery to you.
Many people seem to have learned a policy of making their hypotheses deliberately vague; this looks almost designed to enable the poor practice of choosing the hypothesis after seeing the data.
You should then be selecting models and tests before seeing data (ideally before collecting it so even inadvertent data leakage does not occur).
If you're not in a position to choose an explicit distributional model, you may be better off not doing so. This is in many cases a perfectly defensible option, though if sample sizes are very small I would warn against using nonparametric tests at typical significance levels*.
I will say as an aside that the choice between parametric models (NB parametric does not mean 'normal') and nonparametric models does not necessarily dictate the form of the hypothesis. If I write a hypothesis about comparing two population means, I am not locked into a t-test, nor into any of a large collection of other parametric tests of means. It is possible to test means (e.g. hypotheses of equality vs inequality) under weaker assumptions, ones that don't require a parametric model. In short, you can choose nonparametric tests of means, if that's what you wish. However, it is still very useful to decide on a meaningful form of test statistic (not always based on a difference) for that comparison, and where possible to try to get at least an approximately pivotal quantity.
Indeed, even when you have an explicit parametric model, and perhaps a nice powerful test when that model holds, you aren't necessarily restricted to performing that parametric test with that statistic. Often you can retain that power (or at least nearly all of it, except perhaps in very small samples) when the model is close to right, and still get protection against the significance level straying far from the one you intended.
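As a rough sketch of the kind of thing I have in mind (the choice of statistic, the number of resamples and the example data are mine, for illustration only): a two-sample permutation test that uses the Welch t-statistic compares means without relying on the normal model for its significance level, while typically keeping most of the t-test's power when that model is close to right; the studentization helps make the statistic approximately pivotal.

```python
# Sketch: a permutation test of equality of means using the Welch
# t-statistic (a studentized, approximately pivotal quantity).
# The statistic, resample count and example data are illustrative choices.
import numpy as np

def welch_t(x, y):
    # studentized difference in means; no equal-variance assumption
    return (x.mean() - y.mean()) / np.sqrt(
        x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def perm_test_means(x, y, n_perm=10_000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    observed = abs(welch_t(x, y))
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        hits += abs(welch_t(shuffled[:len(x)], shuffled[len(x):])) >= observed
    return (hits + 1) / (n_perm + 1)   # the "+1" keeps the p-value valid

# made-up example data
x = np.array([4.1, 5.0, 6.2, 5.5, 4.8])
y = np.array([5.9, 6.4, 7.1, 6.8, 6.0])
print("permutation p-value:", perm_test_means(x, y))
```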
Many practitioners seem to operate as if their perfectly common variables had never existed before this particular instance of collecting them; I have, many more times than I'd wish to count, been told that they literally know nothing whatever about them, when in fact quite a lot may be known about their variables, enough to formulate perfectly reasonable models. Indeed, very often reasonable models may be selected simply by knowing what you are measuring and what the support of the variables is ("what values they could possibly take").
The process of getting to at least a roughly suitable model, or ruling out a few obvious non-starters, is not especially mysterious.
But a worker within a particular area has much more to draw on than the kind of variable and its support; previous studies, expert knowledge, theory, and so forth should provide a much richer context for model selection.
Even when one constructs a new variable from scratch, very often it's much like a variable that already exists; failing that, there are pilot studies.
Naturally, once in a while you might feel that there are special circumstances that will impact the situation (even under the null), and that nothing but examining some data will do. If there is no pilot stage, you should then plan from the start to split the data, using one part for such things as model selection and any other data-dependent choices, and sequestering the remainder for the actual test.
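A minimal sketch of such a pre-planned split (the split fraction, seed and stand-in data are arbitrary illustrative choices, not a prescription):

```python
# Sketch: pre-planned split into an exploration set (model selection,
# diagnostics, choice of test statistic) and a sequestered confirmation
# set used only once, for the test decided on in advance.
import numpy as np

rng = np.random.default_rng(12345)       # fixed seed, recorded in the analysis plan
data = rng.normal(size=100)              # stand-in for the real observations
idx = rng.permutation(len(data))
n_explore = int(0.3 * len(data))         # 30% for exploration: an arbitrary choice
explore = data[idx[:n_explore]]          # look at this part freely
confirm = data[idx[n_explore:]]          # do not touch until the analysis is fixed
```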
In some circumstances there may be little that can be done: the model is wrong, no more data will be forthcoming, and yet some action is still needed (in a business application, for example). One must admit that the type I error rate is not what it was intended to be and still do something.
I also note that there are some methods for inference after model selection (more typically, but not always, relating to variable selection); a number of papers on this topic can be found on arXiv, for example. It's a topic I should pursue more but have only read a little on. If you intend to use these, they should be in the plan from the start, naturally; choosing them post hoc, after some other approach has already been tried, won't necessarily lead to the procedure working as advertised.
---
* I have many times seen people use tests that have no chance whatever of rejecting the null, in complete ignorance of the fact that the lowest attainable significance level is above the nominal one they're blithely comparing their p-values to (so no possible sample can lead to rejection of the null). Quite recently it happened twice in one week. A common example is using the Mann-Whitney-Wilcoxon with $n_1=n_2=3$, a two-sided alternative and a rejection rule of "reject $H_0$ if $p<0.05$". The lowest possible p-value in that case is $0.10$.
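This is easy to verify directly (the particular values below are arbitrary; only the ranks matter): even with the two groups completely separated, the exact two-sided p-value is $0.10$.

```python
# Check: with n1 = n2 = 3, even complete separation of the groups gives
# an exact two-sided Mann-Whitney p-value of 0.10, so "reject if p < 0.05"
# can never reject. The particular values are arbitrary; ranks are all that matter.
from scipy.stats import mannwhitneyu

x = [1, 2, 3]          # every value in x below every value in y
y = [4, 5, 6]
res = mannwhitneyu(x, y, alternative="two-sided", method="exact")
print(res.pvalue)      # 0.1
```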
Needless to say, it's even worse if there's correction for multiple testing.
This issue of limited attainable significance levels occurs with permutation tests more generally, though some test statistics make it worse than others. On the other hand, I'd also avoid bootstrap tests with very small samples, albeit for different reasons. At the same time, since power will be low for all but quite large effect sizes, if you can reasonably justify a parametric approach it may be worth pursuing in this situation.
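If you're contemplating a permutation or rank test with small samples, the smallest attainable p-value is worth checking at the planning stage. As a rough sketch (assuming equal group sizes, no ties, and a location-type statistic, so that the smallest two-sided p-value is $2/\binom{n_1+n_2}{n_1}$; the sample sizes below are just examples):

```python
# Planning check (not part of the analysis): smallest attainable two-sided
# p-value for a two-sample permutation/rank test with equal group sizes,
# no ties and a location-type statistic: 2 / C(n1 + n2, n1).
from math import comb

for n1, n2 in [(3, 3), (4, 4), (5, 5), (6, 6)]:
    min_p = 2 / comb(n1 + n2, n1)
    print(f"n1={n1}, n2={n2}: smallest two-sided p = {min_p:.4f}")
```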