
Statistical methods are based on model assumptions. For example, an independent one-way ANOVA makes the following assumptions:

  • Normally distributed residuals

  • Homogeneity of variance

  • Independence of observations

Whether or not these assumptions are met will influence the reliability of the independent one-way ANOVA’s results and the conclusions drawn from them (to varying degrees, depending on what assumptions are violated and how).

My question is: When should we check the assumptions of our model? Is it preferable to first check model assumptions or inspect model fit? How might that influence the interpretations and decisions we make thereafter, and why might this be preferable to the other approach?

In the case of general linear models, we first need to fit our model; otherwise we cannot test whether the residuals are normally distributed. But immediately after that we could choose either to check model assumptions or to inspect model fit.
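For concreteness, here is a minimal R sketch of the two orderings (with simulated data; the particular checks shown are just examples, not a recommendation):

set.seed(1)
d <- data.frame(group = factor(rep(c("a", "b", "c"), each = 20)),
                y     = rnorm(60, mean = rep(c(0, 0.5, 1), each = 20)))

fit <- lm(y ~ group, data = d)        # one-way ANOVA as a linear model

# Option A: check assumptions first
shapiro.test(resid(fit))              # normality of residuals
bartlett.test(y ~ group, data = d)    # homogeneity of variance
qqnorm(resid(fit)); qqline(resid(fit))

# Option B: inspect model fit first
summary(fit)                          # R^2 and coefficient table
anova(fit)                            # the ANOVA F-test itself
plot(fit, which = 1)                  # residuals vs fitted values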

In particular I am interested in answers that speak to any of the following three approaches:

  1. The purpose of checking model assumptions is to decide whether the originally chosen test is appropriate for the data, so assumptions should be checked first. A different, more appropriate, test should be used if assumptions are violated, and conclusions should be drawn from this test. This approach is endorsed in textbooks (e.g., Dowdy et al., 2004), and is also the one I’ve encountered in statistics courses I’ve taken.

  2. The purpose of checking model assumptions is to assess the quality of the model we originally chose in light of our data, so assumptions should be checked second. Depending on the severity of violations, conclusions drawn from test results might be reined in, or the model might be respecified. This seems to be the approach endorsed by Fisher (see Spanos, 2017).

  3. Model assumptions are often violated in the real world, so there’s no need to check them. Instead we should choose better default tests that are less constrained and stick with those (e.g., Delacre et al., 2017, 2019). This is also a popular approach; for example, see Section 4 of the Spanos (2017) paper cited above.

This preprint has some good discussion and examples comparing the performance of these approaches in different circumstances and with different tests. It concludes:

“In some setups either running a less constrained test or running the model-based test without preliminary testing have been found superior to the combined procedure involving preliminary [assumption checking to guide test selection].” However, “a sober look at the results reveals that the combined procedures are almost always competitive with at least one of the unconditional tests, and often with them both. It is clear, though, that recommendations need to depend on the specific problem, the specific tests involved. Results often also depend on in what way exactly model assumptions of the model-based test are violated, which is hard to know without some kind of data dependent reasoning.”

So the best-performing approach depends on circumstances and the test used. However, if you were to pick one of the previously mentioned approaches as a general principle to follow when better information isn’t available, which would be the most preferable default?

Please base your answers on experience or evidence.

  • I would also add that checking model assumptions changes the type I error rate or other statistical properties of a test: the p-value you calculate, say from the t-test, corresponds to the p-value of the t-test alone and not to the p-value of the whole procedure of checking assumptions and then selecting either the t-test or the Mann-Whitney test. I am not saying don't check the assumptions, but it's also something to consider when talking about these checks.
    – rep_ho
    Commented Nov 7, 2021 at 12:52
  • This is a good broad question raising deep and difficult issues. Speaking for myself, that is why I am reluctant to try an answer: I despair of being concise and clear and correct all at once. A tiny contribution: although the term assumption is almost always used, it is to some extent misleading. I prefer to talk about ideal conditions. The point is that failure of assumptions doesn't necessarily invalidate a statistical analysis. It is odd that you mention normality of residuals (strictly, errors) first, when that is in several senses the least important assumption of ANOVA.
    – Nick Cox
    Commented Oct 20, 2022 at 8:05
  • The biggest question in my view about ANOVA is whether a summary in terms of means is sensible given the data and the goals of the analysis. Even if the data contain a few outliers, and so normality of errors fails, it could still be that means are the best way to summarize (e.g. because means link to totals, which have substantive meaning). This is a judgment question. Also, many discussions are decades out of date insofar as (for example) bootstrapping allows other ways of getting at confidence intervals. Note also permutation tests, generalized linear models, etc.
    – Nick Cox
    Commented Oct 20, 2022 at 8:10

3 Answers


Some general observations: First, we tend to spend a lot of time using non-robust models when there are many robust options that have equal or greater statistical power. Semiparametric (ordinal) regression for continuous Y is one class of model that is not used often enough. Such models (e.g., the proportional odds model) do not depend on how Y is transformed (as long as the transformation is rank-preserving) and are not affected by outliers in Y (they have the same problems as parametric models with regard to outliers in X). Routine use of semiparametric models would result in fewer assumptions in need of checking.
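A minimal sketch of such a semiparametric fit, assuming the rms package (the data and variable names here are made up, not taken from the answer):

library(rms)

set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- exp(d$x + rnorm(100))        # skewed continuous outcome

ols_fit <- ols(y ~ x, data = d)     # ordinary least squares; sensitive to the scale of y
orm_fit <- orm(y ~ x, data = d)     # semiparametric proportional odds fit

# orm depends on y only through its ranks, so refitting on log(y) changes the
# least-squares slope but leaves the orm coefficient for x unchanged.
orm_log <- orm(log(y) ~ x, data = d)
c(coef(orm_fit)["x"], coef(orm_log)["x"])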

Second, goodness of fit needs to be judged against alternatives. What if a more flexible, well-fitting model requires many more parameters and the net effect is overfitting that makes predictions unreliable and/or makes confidence intervals too wide? Almost always some kind of analysis is required, and the badness of fit of a proposed model should not rule the day when the proposed alternative analysis is actually worse.

On the second point, it is often more useful to think of assessment of goodness of fit in terms of a contest between the proposed model and a more general model. A fully worked out example is here. In that example the proposed model is a proportional odds model in which the effect of a treatment on lowering the odds that $Y\geq j$ is assumed to be the same for all $j$ save the lowest value. There are at least two alternative models: a partial proportional odds model that relaxes this assumption with respect to treatment, and a multinomial logistic model that relaxes the assumption with respect to all predictors. I cast model checking as a contest between these models and use two metrics for comparison: AIC and bootstrap confidence intervals for differences in predicted probabilities of various $Y=j$ from pairs of models. I also show formal likelihood ratio $\chi^2$ tests of goodness of fit by pairwise comparison of these three models. I show that there is a cost of not assuming proportional odds.
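A schematic version of such a contest, not the linked worked example (MASS::polr for the proportional odds model and nnet::multinom for the fully relaxed multinomial model are my stand-ins, and the data are simulated):

library(MASS)    # polr: proportional odds model
library(nnet)    # multinom: multinomial logistic model

set.seed(1)
treat <- factor(rep(c("control", "active"), each = 100))
y     <- factor(sample(1:4, 200, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1)),
                ordered = TRUE)

po_fit <- polr(y ~ treat)                      # one odds ratio shared across cutoffs
mn_fit <- multinom(y ~ treat, trace = FALSE)   # separate effects for every category

AIC(po_fit, mn_fit)   # penalized comparison of the two fits

# Likelihood ratio test of proportional odds against the more general model
lr <- as.numeric(2 * (logLik(mn_fit) - logLik(po_fit)))
df <- attr(logLik(mn_fit), "df") - attr(logLik(po_fit), "df")
pchisq(lr, df, lower.tail = FALSE)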

The above setting is exactly analogous to comparing (1) a linear model with constant variance with (2) a linear model that allows the residual variance $\sigma^2$ to be a function of X.
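That linear-model comparison can be run as the same kind of contest, for instance with generalized least squares (a sketch assuming the nlme package, with simulated data):

library(nlme)

set.seed(1)
d <- data.frame(x = runif(200, 1, 5))
d$y <- 2 + 0.5 * d$x + rnorm(200, sd = 0.5 * d$x)   # residual spread grows with x

fit_const  <- gls(y ~ x, data = d)                   # constant variance
fit_hetero <- gls(y ~ x, data = d,
                  weights = varPower(form = ~ x))    # residual SD modeled as a power of x

anova(fit_const, fit_hetero)   # AIC, BIC, and a likelihood ratio test of the two variance models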

For continuous Y a very cogent approach in my view is to hope for a simple linear model but to allow for departures from that. A Bayesian model could put priors on the amount of variation of $\sigma^2$ and on the amount of non-normality of residuals. By tilting the model towards normal residuals and constant $\sigma^2$ but allowing for departures from those as the sample size allows, one obtains a general solution that does not require binary model choices. A simple example of this is the Bayesian $t$-test.
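One way to set up such a model in practice, as a sketch rather than a recommendation (this assumes the brms package and its default priors, which the answer does not mention):

library(brms)

set.seed(1)
d <- data.frame(group = factor(rep(c("a", "b"), each = 30)),
                y     = c(rnorm(30, 0, 1), rnorm(30, 0.5, 2)))

# Student-t likelihood: the degrees-of-freedom parameter nu is estimated, so the
# model can stay close to normal residuals while allowing heavier tails; modelling
# sigma by group allows, but does not force, unequal variances.
fit <- brm(bf(y ~ group, sigma ~ group),
           family = student(), data = d, refresh = 0)

summary(fit)   # posterior for the group difference, the log-sigma difference, and nu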


Some remarks:

  1. Model assumptions are never fulfilled in reality, and there is therefore no way to make sure or even check that they are fulfilled. The correct question is not whether model assumptions are fulfilled, but rather whether violations of the model assumptions can be suspected that mislead the interpretation of the results. Note that standard model misspecification tests are not always good at assessing this.

  2. Of course this depends on knowledge about which kinds of violations of the model assumptions have which kinds of impact. Regarding standard ANOVA, extreme outliers are known to be harmful; skewness may make inference based on means questionable; and differences in variance raise in the first place the question of how to interpret the obviously existing differences between groups if differences in variance are much clearer and more characteristic of the situation than differences in means. Some literature suggests that the impact of heteroscedasticity, and how much of a problem it is, depends on the group sizes and on whether the bigger variance occurs in a smaller or a larger group. Mild violations of these model assumptions are very often harmless in that the characteristics of the test behaviour are not affected much. Also, with larger numbers of observations, the Central Limit Theorem will justify the application of normal theory for most non-normal distributions. Dependence between observations, though, can very often be harmful and should be modelled if it is clear enough to be detected.

  3. Note that dependence is as much a problem for non-parametric, robust, or "model-free" alternatives to ANOVA as it is for standard ANOVA. Note also (an example is given in the preprint) that there are situations in which standard ANOVA is better than robust/non-parametric alternatives even though the standard ANOVA model assumptions are violated while the non-parametric assumptions hold (this may happen, for example, for distributions with lighter tails than the normal)! This is because parametric theory is often concerned with optimality (which is lost in theory if assumptions are not met, although the parametric method can still be rather good), whereas nonparametric and robust theory is often concerned with general worst-case quality guarantees, which means that these methods, even though legitimate, may not be particularly good in some cases even where their assumptions are fulfilled. This isn't always so: there are situations in which standard ANOVA breaks down and nonparametric/robust alternatives are much better, particularly with outliers. Robust methods may be slightly better than standard ANOVA in the case of heteroscedasticity, but nonparametric alternatives such as Kruskal-Wallis or permutation testing may not be (I'm actually not sure, not even whether this has been investigated in detail; the last statement is rather my intuition). Permutation tests will also still be affected by outliers as long as their test statistic is non-robust! The message here is that nonparametric/robust/"model-free" approaches are not a silver bullet.

  4. Technically (as also mentioned in our paper, which is cited as "this preprint" in the question), applying misspecification tests and then running an ANOVA on the same data conditionally on not having rejected the model assumptions will invalidate the standard ANOVA tests, as correctly noted in a comment by @rep_ho. This is due to the "misspecification paradox"; see https://stats.stackexchange.com/a/592495/247165
    @Björn's advice to test model assumptions on independent (hopefully identically distributed) data where possible is good. Note also that this problem applies not only to formal misspecification testing, but also to looking at visualisations of the same data and making decisions about whether to run the standard ANOVA dependent on these visualisations.

  5. However, given that model assumptions never hold anyway, the violation introduced by the misspecification paradox is usually preferable to leaving bigger problems undetected because the model assumptions were never checked. There are even some situations in which one can show that misspecification testing will not affect the characteristics of the later ANOVA if the assumed model held before misspecification testing, namely where the misspecification test is independent of the statistic ultimately used. This happens, for example, when running misspecification tests only on the residuals in the linear model, and then running a standard normality-based test only if normality is not rejected.

  6. Generally, the formal impact of misspecification testing and of looking at the data will be small under the assumed model if fulfilled assumptions are rejected only with low probability, i.e., by testing at a low level or, when looking at the data, by really only rejecting the model if violations are clear and strong.

  7. Now, getting to recommendations about what to actually do: I don't have a general recipe. When doing consultation, whatever I do or advise will depend on all available background information about the subject matter and the aim of the study. This concerns, for example, the question of whether the mean is an appropriate statistic to summarise group-wise results in the case of skewness or heteroscedasticity. It also concerns the potential consequences of type I and type II errors (outliers will not normally cause type I errors, but can increase the probability of type II errors a great deal), and more generally how "strongly" results will be interpreted.

  8. The first thing to do, always, and if possible before having seen the data, is to think about potentially harmful model assumption violations that could occur in the given situation. Are potential sources of dependence known? These should then be modelled, or people should think hard about data-gathering methods under which these problems don't occur. Dependence can be harmful even in cases where it is hard or impossible to detect from the data! See https://arxiv.org/abs/2108.09227. Also, if I know in advance that outliers can be expected for this kind of data, I'd use something robust or nonparametric. On the other hand, there are situations in which the occurrence of outliers can be ruled out because the value range of the outcomes is limited, and it can also be expected, with good reason, that there is enough variation that not almost all observations concentrate on one side of the scale (in which case stray observations on the other side could still be seen as outliers).

  9. I'm a curious person and just the fact that a certain test or check rejects the model assumption wouldn't stop me from running a method that uses this assumption if I'd have used it otherwise. For a number of checks (based on residuals) this is necessary anyway. If I don't have any indication before seeing the data against standard ANOVA (see above), I'd run it. (Obviously "legal" and procedural concerns such as pre-registration have to be taken into account.)

  10. I'd also look at visual model diagnostics (a small sketch of such checks follows this list). I don't necessarily use formal misspecification tests because, given the earlier discussion, I know what kinds of deviations from the model I am looking for, and misspecification tests don't know that. However, somebody with less experience and knowledge of these things will often do better running a formal misspecification test than doing nothing at all; also, regularly running misspecification tests and comparing them with the visual impression gives us experience about what kinds of apparent deviations from the model can still be compatible with random variation. So I run misspecification tests to inform my intuition rather than to make formal decisions about how to proceed. Anyway, such diagnostics may or may not prompt me to run a nonparametric or robust alternative on top of the standard ANOVA rather than "instead".

  11. Most importantly, I will also look at the raw data and boxplots. The question I ask here is not primarily "are the model assumptions violated?" (that is the case anyway). Rather I ask: "Do I get the impression that the message from looking at the data about group differences is in line with what my tests say?" If yes, I can interpret the test results with some confidence (possibly also confirmed by the observation that standard ANOVA and an alternative approach convey the same message). If not, I try to understand exactly how the data led the tests astray (or how my intuition went wrong - this may happen, too), and I may then explain both what I see and why one or more test results are not in line with it. (This can also concern meaningless significant results, where the sample is so large that a meaninglessly small effect comes out significant, or, on the other hand, non-significance even though a sample too small for the test visually suggests differences between groups. Obviously one should also look at effect sizes and confidence intervals.)
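A small illustration of the kind of checks described in points 10 and 11 (simulated data; the particular plots, and Kruskal-Wallis as the alternative run "on top", are just examples):

set.seed(1)
d <- data.frame(group = factor(rep(c("a", "b", "c"), each = 25)),
                y     = rnorm(75, mean = rep(c(0, 0.3, 0.8), each = 25)))

fit <- lm(y ~ group, data = d)

# Visual diagnostics and the raw data, rather than formal misspecification tests
oldpar <- par(mfrow = c(1, 3))
plot(fit, which = c(1, 2))                   # residuals vs fitted, normal QQ plot
stripchart(y ~ group, data = d, vertical = TRUE,
           method = "jitter", pch = 1)       # raw data by group
par(oldpar)

# Standard ANOVA and a nonparametric alternative, side by side
anova(fit)
kruskal.test(y ~ group, data = d)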

Overall I'm probably closest to option 2 in the list in the question.

  • Outstanding answer (+1). I made various edits that are mostly minor but please check my deletion of a "not" in #8. A small point of disagreement: I find (conventional) box plots oversold, especially in the context of ANOVA (1) especially if they don't show means too (2) because they often omit detail of importance (3) because people are over-confident about their interpretation.
    – Nick Cox
    Commented Oct 26, 2022 at 13:13
  • @NickCox Thanks. I edited #8 as the deleted "not" was in fact correct, but my original double negation there was hard to read. I'm looking at boxplots but of course also at means and other plots, so I'm not that worried about what boxplots omit. Of course people should know the limits of any plot. – Commented Oct 26, 2022 at 14:56
  • Do you have a reference for this claim: "There are even some situations in which one can show that misspecification testing will not affect the characteristics of the later ANOVA in case the assumed model held before misspecification testing, namely where the misspecification test is independent of the finally used statistic. This happens for example when running misspecification tests on only the residuals in the linear model." I was under the impression that this is not the case.
    – Björn
    Commented Oct 26, 2022 at 15:11
  • @Björn This is pretty easy to show; I have seen Aris Spanos mentioning it in some publications, I believe even the one cited in the question. – Commented Oct 26, 2022 at 15:13
  • @ChristianHennig Did you see some proofs or extensive simulations there? I have seen such statements from that author before in some other discussions (e.g. here: stats.stackexchange.com/questions/303887/…), but I was not clear on the basis of the statements.
    – Björn
    Commented Oct 26, 2022 at 15:17

It's useful to distinguish different scenarios.

In the first scenario, strict type I error control, as well as pre-specification (otherwise the first is hard to imagine), is required; think of a randomized Phase III trial for a new drug. In this scenario, you would check your assumptions before you even run your trial, using data that are as relevant as possible. Because it is well known that switching analysis methods based on pre-tests (e.g., for normality*) leads to some type I error inflation (and other issues), you would not adopt an analysis strategy in which you first check assumptions on the trial data and then determine your analysis method based on that (even if you could pre-specify the conditions and how they influence the analysis). Of course, certain small deviations can be looked at, judged to be largely irrelevant, and ignored, because many models are very robust to mild deviations (e.g., there's a reason why things like blood pressure are usually analyzed using linear regression with a baseline value in the model, even though we know the model cannot be precisely true since blood pressure cannot be negative). For other situations, appropriate remedial actions are needed (e.g., we know it's much better to log-transform urine albumin-creatinine ratios than to analyze untransformed values).

Many scientific investigations fall into this first category, even if they are not as tightly regulated as drug trials or overseen by regulatory bodies. Very often we have a clear up-front plan, and relevant similar data are available that allow us to check our assumptions well enough beforehand that assumption checking on the actual data is usually more or less a waste of time.

The second setting is when things are completely new: we are trying out new measurement methods, trying to understand something about a new system (or how a new drug behaves in the human body), and so on. In this setting there may simply not be anything similar we can base reliable assumptions on. Here we may have to be quite flexible and explore a good bit. However, we should then also not sell the result as a definitive confirmatory experiment. That's why some larger definitive experiments have pilot parts that make it possible to settle on assumptions.

The third setting is the most difficult: when you don't know for sure whether you are in the first or the second setting, because there is some data or prior knowledge that may or may not be relevant, and/or may be ambiguous, and/or something might turn out to be very different (e.g., the first immuno-oncology treatments with delayed treatment effects). However, it's very hard to say anything general about these situations.

* Based on the discussion under another post, here's an illustration of a pre-test for normality based only on the residuals causing type I error inflation.

library(tidyverse)

set.seed(1234)

# Simulate many two-group data sets under the null hypothesis (no group difference)
# and record, for each one, the linear-model p-value, the Wilcoxon p-value, and the
# Shapiro-Wilk p-value for normality of the linear-model residuals.
corrs <- map_dfr(1:1000000, function(x) {
  sim_data <- tibble(y = rnorm(n = 50),
                     x = rbernoulli(n = 50, p = 0.5))   # random group labels
  lmfit <- lm(y ~ 1 + x, data = sim_data)
  wtres <- wilcox.test(sim_data$y[sim_data$x == 0],
                       sim_data$y[sim_data$x == 1])
  data.frame(test_statistic    = summary(lmfit)$coeff["xTRUE", "t value"],
             lm_pval           = summary(lmfit)$coeff["xTRUE", "Pr(>|t|)"],
             wt_pval           = wtres$p.value,
             shapiro_wilk_pval = shapiro.test(lmfit$residuals)$p.value)
})

# "conditional" is the combined procedure: use the Wilcoxon p-value if the
# Shapiro-Wilk pre-test rejects normality at the 0.1 level, else the lm p-value.
# The qbeta() calls give Clopper-Pearson 95% confidence limits for each
# procedure's type I error rate.
corrs %>%
  mutate(conditional = ifelse(shapiro_wilk_pval <= 0.1, wt_pval, lm_pval),
         signif1 = (lm_pval <= 0.05),
         signif2 = (wt_pval <= 0.05),
         signif3 = (conditional <= 0.05)) %>%
  summarize(lm_lcl     = qbeta(0.025, sum(signif1), 1 + length(signif1) - sum(signif1)),
            lm_ucl     = qbeta(0.975, 1 + sum(signif1), length(signif1) - sum(signif1)),
            wilcox_lcl = qbeta(0.025, sum(signif2), 1 + length(signif2) - sum(signif2)),
            wilcox_ucl = qbeta(0.975, 1 + sum(signif2), length(signif2) - sum(signif2)),
            cond_lcl   = qbeta(0.025, sum(signif3), 1 + length(signif3) - sum(signif3)),
            cond_ucl   = qbeta(0.975, 1 + sum(signif3), length(signif3) - sum(signif3)))

This gives the following results for tests with a nominal level of 0.05, showing a type I error inflation when a pre-test for normality is used.

Analysis method and Clopper-Pearson confidence interval for the type I error rate:

  • Linear model: 0.0495 to 0.0503
  • Two-sample Wilcoxon (Mann-Whitney) test: 0.0482 to 0.0490
  • Conditional (switch to Wilcoxon if the Shapiro-Wilk test of normality of residuals has p-value <= 0.1): 0.0512 to 0.0521
  • $\begingroup$ "Conditional" is a good word for this process. Slightly better terminology is "adaptive". $\endgroup$ Commented Sep 9, 2023 at 16:08
