140
$\begingroup$

Certain hypotheses can be tested using Student's t-test (maybe using Welch's correction for unequal variances in the two-sample case), or by a non-parametric test like the Wilcoxon paired signed rank test, the Wilcoxon-Mann-Whitney U test, or the paired sign test. How can we make a principled decision about which test is most appropriate, particularly if the sample size is "small"?

Many introductory textbooks and lecture notes give a "flowchart" approach where normality is checked (either – inadvisedly – by normality test, or more broadly by QQ plot or similar) to decide between a t-test or non-parametric test. For the unpaired two-sample t-test there may be a further check for homogeneity of variance to decide whether to apply Welch's correction. One issue with this approach is the way the decision on which test to apply depends on the observed data, and how this affects the performance (power, Type I error rate) of the selected test.

Another problem is how hard it is to check normality in small data sets: formal tests have low power, so violations may well go undetected, and similar issues apply when eyeballing the data on a QQ plot. Even egregious violations could go undetected, e.g. if the distribution is a mixture but no observations were drawn from one component of the mixture. Unlike for large $n$, we can't lean on the safety net of the Central Limit Theorem and the asymptotic normality of the test statistic and t distribution.
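
To illustrate the first point, here is a minimal R sketch of how rarely a formal normality test flags even a strongly skewed distribution at this sort of sample size (the exponential alternative, $n = 10$ and the Shapiro-Wilk test are arbitrary choices for illustration):

```r
# How often does a 5% Shapiro-Wilk test reject when the n = 10 observations
# actually come from a (strongly skewed) exponential distribution?
set.seed(1)
mean(replicate(10000, shapiro.test(rexp(10))$p.value < 0.05))
# the rejection rate is well below 1, i.e. the violation is usually missed
```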

One principled response to this is "safety first": with no way to reliably verify the normality assumption in a small sample, stick to non-parametric methods. Another is to consider any grounds for assuming normality, theoretical (e.g. the variable is a sum of several random components and the CLT applies) or empirical (e.g. previous studies with larger $n$ suggest the variable is normal), and to use a t-test only if such grounds exist. But this usually only justifies approximate normality, and on low degrees of freedom it's hard to judge how near normal it needs to be to avoid invalidating a t-test.

Most guides to choosing a t-test or non-parametric test focus on the normality issue. But small samples also throw up some side-issues:

  • If performing an "unrelated samples" or "unpaired" t-test, whether to use a Welch correction? Some people use a hypothesis test for equality of variances, but here it would have low power; others check whether SDs are "reasonably" close or not (by various criteria). Is it safer simply to always use the Welch correction for small samples, unless there is some good reason to believe population variances are equal?

  • If you see the choice of methods as a trade-off between power and robustness, claims about the asymptotic efficiency of the non-parametric methods are unhelpful. The rule of thumb that "Wilcoxon tests have about 95% of the power of a t-test if the data really are normal, and are often far more powerful if the data is not, so just use a Wilcoxon" is sometimes heard, but if the 95% only applies to large $n$, this is flawed reasoning for smaller samples.

  • Small samples may make it very difficult, or impossible, to assess whether a transformation is appropriate for the data since it's hard to tell whether the transformed data belong to a (sufficiently) normal distribution. So if a QQ plot reveals very positively skewed data, which look more reasonable after taking logs, is it safe to use a t-test on the logged data? On larger samples this would be very tempting, but with small $n$ I'd probably hold off unless there had been grounds to expect a log-normal distribution in the first place.

  • What about checking assumptions for the non-parametrics? Some sources recommend verifying a symmetric distribution before applying a Wilcoxon test (treating it as a test for location rather than for stochastic dominance), which brings up similar problems to checking normality. If the reason we are applying a non-parametric test in the first place is blind obedience to the mantra of "safety first", then the difficulty of assessing skewness from a small sample would apparently push us towards the lower-powered paired sign test.

With these small-sample issues in mind, is there a good - hopefully citable - procedure to work through when deciding between t and non-parametric tests?

There have been several excellent answers, but a response considering other alternatives to rank tests, such as permutation tests, would also be welcome.

$\endgroup$
10
  • 2
    $\begingroup$ I should explain what a "method for choosing a test" might be - introductory texts often use flowcharts. For unpaired data, maybe: "1. Use some method to check if both samples are normally distributed (if not go to 3), 2. Use some method to check for unequal variances: if so, perform two-sample t-test with Welch's correction, if not, perform without correction. 3. Try transforming data to normality (if works go to 2 else go to 4). 4. Perform U test instead (possibly after checking various assumptions)." But many of these steps seem unsatisfactory for small n, as I hope my Q explains! $\endgroup$
    – Silverfish
    Commented Oct 29, 2014 at 16:01
  • 2
    $\begingroup$ Interesting question (+1) and a brave move to set up a bounty. Looking forward for some interesting answers. By the way, what I often see applied in my field is a permutation test (instead of either t-test or Mann-Whitney-Wilcoxon). I guess it could be considered a worthy contender as well. Apart from that, you never specified what you mean by "small sample size". $\endgroup$
    – amoeba
    Commented Nov 4, 2014 at 1:09
  • 2
    $\begingroup$ It is not clear to me why one would assert that nonparametric tests (rank sum or sign rank) would require symmetry? $\endgroup$
    – Alexis
    Commented Nov 5, 2014 at 22:36
  • 4
    $\begingroup$ @Silverfish " if the results are seen as a statement about location" That is an important caveat, as these tests are most generally statements about evidence for H$_{0}: P(X_{A} > X_{B}) =0.5$. Making additional distributional assumptions narrows the scope of inference (e.g. tests for median difference), but are not generally requisites for the tests. $\endgroup$
    – Alexis
    Commented Nov 6, 2014 at 3:09
  • 2
    $\begingroup$ It might be worth exploring just how "flawed" the "95% power for the Wilcoxon" reasoning is for small samples (in part it depends on what, exactly, one does, and how small is small). If for example, you're happy to conduct tests at say 5.5% instead of 5%, should that be the nearest suitable achievable significance level, the power often tends to hold up fairly well. One can of course - at the "power calculation" stage before you collect data - figure out what the circumstances may be and get a sense of what the properties of the Wilcoxon are at the sample sizes you're considering. $\endgroup$
    – Glen_b
    Commented Oct 22, 2015 at 22:07

8 Answers 8

89
+100
$\begingroup$

I am going to address the questions in a slightly different order from the way they were raised.

I've found textbooks and lecture notes frequently disagree, and would like a system to work through the choice that can safely be recommended as best practice, and especially a textbook or paper this can be cited to.

Unfortunately, some discussions of this issue in books and so on rely on received wisdom. Sometimes that received wisdom is reasonable, sometimes it is less so (at the least in the sense that it tends to focus on a smaller issue when a larger problem is ignored); we should examine the justifications offered for the advice (if any justification is offered at all) with care.

Most guides to choosing a t-test or non-parametric test focus on the normality issue.

That’s true, but it’s somewhat misguided for several reasons that I address in this answer.

If performing an "unrelated samples" or "unpaired" t-test, whether to use a Welch correction?

This (to use it unless you have reason to think variances should be equal) is the advice of numerous references. I point to some in this answer.

Some people use a hypothesis test for equality of variances, but here it would have low power. Generally I just eyeball whether the sample SDs are "reasonably" close or not (which is somewhat subjective, so there must be a more principled way of doing it) but again, with low n it may well be that the population SDs are rather further apart than the sample ones.

Is it safer simply to always use the Welch correction for small samples, unless there is some good reason to believe population variances are equal? That’s what the advice amounts to. The properties of the tests are adversely affected when the choice between them is based on a preliminary assumption test.

Some references on this can be seen here and here, though there are more that say similar things.

The equal-variances issue has many characteristics in common with the normality issue – people want to test it, yet advice suggests that conditioning the choice of test on the result of such a preliminary test can adversely affect the performance of both kinds of subsequent test – so it’s better simply not to assume what you can’t adequately justify (by reasoning about the situation, using information from other studies relating to the same variables, and so on).

However, there are differences. One is that – at least in terms of the distribution of the test statistic under the null hypothesis (and hence, its level-robustness) - non-normality is less important in large samples (at least in respect of significance level, though power might still be an issue if you need to find small effects), while the effect of unequal variances under the equal variance assumption doesn’t really go away with large sample size.

What principled method can be recommended for choosing which is the most appropriate test when the sample size is "small"?

With hypothesis tests, what matters (under some set of conditions) is primarily two things:

  • What is the actual type I error rate?

  • What is the power behaviour like?

We also need to keep in mind that if we're comparing two procedures, changing the first will change the second (that is, if they’re not conducted at the same actual significance level, you would expect that higher $\alpha$ is associated with higher power).

(Of course we're usually not so confident we know what distributions we're dealing with, so the sensitivity of those behaviors to changes in circumstances also matter.)

With these small-sample issues in mind, is there a good - hopefully citable - checklist to work through when deciding between t and non-parametric tests?

I will consider a number of situations in which I’ll make some recommendations, considering both the possibility of non-normality and unequal variances. In every case, take mention of the t-test to imply the Welch-test:

  • n medium-large

Non-normal (or unknown), likely to have near-equal variance:

If the distribution is heavy-tailed, you will generally be better with a Mann-Whitney, though if it’s only slightly heavy, the t-test should do okay. With light-tails the t-test may (often) be preferred. Permutation tests are a good option (you can even do a permutation test using a t-statistic if you're so inclined). Bootstrap tests are also suitable.

Non-normal (or unknown), unequal variance (or variance relationship unknown):

If the distribution is heavy-tailed, you will generally be better with a Mann-Whitney – if inequality of variance is only related to inequality of mean, i.e. if H0 is true the difference in spread should also be absent. GLMs are often a good option, especially if there’s skewness and spread is related to the mean. A permutation test is another option, with a similar caveat as for the rank-based tests. Bootstrap tests are a good possibility here.

Zimmerman and Zumbo (1993)$^{[1]}$ suggest a Welch-t-test on the ranks, which they say performs better than the Wilcoxon-Mann-Whitney in cases where the variances are unequal.
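
The mechanics of that rank-Welch idea are easy to sketch (a toy example with made-up lognormal data of my own, not the reference's simulation):

```r
# Welch t-test computed on the ranks of the pooled sample
set.seed(1)
x <- rlnorm(12)                  # made-up skewed samples with unequal spread
y <- rlnorm(15, sdlog = 2)
r <- rank(c(x, y))
t.test(r[seq_along(x)], r[-seq_along(x)])   # t.test() applies the Welch correction by default
```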

  • n moderately small

Rank tests are reasonable defaults here if you expect non-normality (again with the above caveat). If you have external information about shape or variance, you might consider GLMs. If you expect things not to be too far from normal, t-tests may be fine.

  • n very small

Because of the problem with getting suitable significance levels, neither permutation tests nor rank tests may be suitable, and at the smallest sizes, a t-test may be the best option (there’s some possibility of slightly robustifying it). However, there’s a good argument for using higher type I error rates with small samples (otherwise you’re letting type II error rates inflate while holding type I error rates constant). Also see de Winter (2013)$^{[2]}$.

The advice must be modified somewhat when the distributions are both strongly skewed and very discrete, such as Likert scale items where most of the observations are in one of the end categories. Then the Wilcoxon-Mann-Whitney isn’t necessarily a better choice than the t-test.

Simulation can help guide choices further when you have some information about likely circumstances.
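
For instance, a small simulation along the following lines (an illustrative sketch; the lognormal population, $n = 8$ per group and the size of the shift are arbitrary choices) estimates the type I error rate and power of the Welch t-test and the Wilcoxon-Mann-Whitney at a nominal 5% level:

```r
# Estimated rejection rates at nominal 5% for two small samples from a skewed population
set.seed(123)
reject <- function(shift, reps = 10000, n = 8) {
  p <- replicate(reps, {
    x <- rlnorm(n)
    y <- rlnorm(n) + shift
    c(t = t.test(x, y)$p.value,        # Welch correction is t.test()'s default
      w = wilcox.test(x, y)$p.value)
  })
  rowMeans(p < 0.05)
}
reject(shift = 0)   # no shift (true null): estimated type I error rates
reject(shift = 1)   # shifted alternative: estimated power
```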

I appreciate this is something of a perennial topic, but most questions concern the questioner's particular data set, sometimes a more general discussion of power, and occasionally what to do if two tests disagree, but I would like a procedure to pick the correct test in the first place!

The main problem is how hard it is to check the normality assumption in a small data set:

It is difficult to check normality in a small data set, and to some extent that's an important issue, but I think there's another issue of importance that we need to consider. A basic problem is that trying to assess normality as the basis of choosing between tests adversely impacts the properties of the tests you're choosing between.

Any formal test for normality would have low power so violations may well not be detected. (Personally I wouldn't test for this purpose, and I'm clearly not alone, but I've found this of little use when clients demand that a normality test be performed because that's what their textbook or old lecture notes or some website they found once declares should be done. This is one point where a weightier-looking citation would be welcome.)

Here’s an example of a reference (there are others) which is unequivocal (Fay and Proschan, 2010$^{[3]}$):

The choice between t- and WMW DRs should not be based on a test of normality.

They are similarly unequivocal about not testing for equality of variance.

To make matters worse, it is unsafe to use the Central Limit Theorem as a safety net: for small n we can't rely on the convenient asymptotic normality of the test statistic and t distribution.

Nor even in large samples -- asymptotic normality of the numerator doesn’t imply that the t-statistic will have a t-distribution. However, that may not matter so much, since you should still have asymptotic normality (e.g. CLT for the numerator, and Slutsky’s theorem suggest that eventually the t-statistic should begin to look normal, if the conditions for both hold.)

One principled response to this is "safety first": as there's no way to reliably verify the normality assumption on a small sample, run an equivalent non-parametric test instead.

That’s actually the advice that the references I mention (or link to mentions of) give.

Another approach I've seen but feel less comfortable with is to perform a visual check and proceed with a t-test if nothing untoward is observed ("no reason to reject normality", ignoring the low power of this check). My personal inclination is to consider whether there are any grounds for assuming normality, theoretical (e.g. variable is sum of several random components and CLT applies) or empirical (e.g. previous studies with larger n suggest variable is normal).

Both those are good arguments, especially when backed up with the fact that the t-test is reasonably robust against moderate deviations from normality. (One should keep in mind, however, that "moderate deviations" is a tricky phrase; certain kinds of deviations from normality may impact the power performance of the t-test quite a bit even though those deviations are visually very small - the t-test is less robust to some deviations than others. We should keep this in mind whenever we're discussing small deviations from normality.)

Beware, however, the phrasing "suggest the variable is normal". Being reasonably consistent with normality is not the same thing as normality. We can often reject actual normality with no need even to see the data – for example, if the data cannot be negative, the distribution cannot be normal. Fortunately, what matters is closer to what we might actually have from previous studies or reasoning about how the data are composed, which is that the deviations from normality should be small.

If so, I would use a t-test if data passed visual inspection, and otherwise stick to non-parametrics. But any theoretical or empirical grounds usually only justify assuming approximate normality, and on low degrees of freedom it's hard to judge how near normal it needs to be to avoid invalidating a t-test.

Well, that’s something we can assess the impact of fairly readily (such as via simulations, as I mentioned earlier). From what I've seen, skewness seems to matter more than heavy tails (but on the other hand I have seen some claims of the opposite - though I don't know what that's based on).

For people who see the choice of methods as a trade-off between power and robustness, claims about the asymptotic efficiency of the non-parametric methods are unhelpful. For instance, the rule of thumb that "Wilcoxon tests have about 95% of the power of a t-test if the data really are normal, and are often far more powerful if the data is not, so just use a Wilcoxon" is sometimes heard, but if the 95% only applies to large n, this is flawed reasoning for smaller samples.

But we can check small-sample power quite easily! It’s easy enough to simulate to obtain power curves as here.
(Again, also see de Winter (2013)$^{[2]}$).

Having done such simulations under a variety of circumstances, both for the two-sample and one-sample/paired-difference cases, the small sample efficiency at the normal in both cases seems to be a little lower than the asymptotic efficiency, but the efficiency of the signed rank and Wilcoxon-Mann-Whitney tests is still very high even at very small sample sizes.

At least that's if the tests are done at the same actual significance level; you can't do a 5% test with very small samples (at least not without randomized tests, for example), but if you're prepared to perhaps do (say) a 5.5% or a 3.2% test instead, then the rank tests hold up very well indeed compared with a t-test at that significance level.
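
The attainable levels are easy to work out exactly for a given pair of sample sizes; a sketch for $m = n = 5$ (the 5-and-5 case is my choice of illustration), using the exact null distribution of the Mann-Whitney statistic available in R:

```r
# Attainable two-sided significance levels of the exact Wilcoxon-Mann-Whitney, m = n = 5
m <- n <- 5
u <- 0:(m * n)                      # support of the Mann-Whitney U statistic
p <- dwilcox(u, m, n)               # exact null probabilities
k <- 0:((m * n) %/% 2 - 1)
alpha <- sapply(k, function(kk) sum(p[u <= kk | u >= m * n - kk]))
round(alpha, 4)
# a 5% test is not attainable exactly; the nearest levels are about 3.2% and 5.6%
```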

Small samples may make it very difficult, or impossible, to assess whether a transformation is appropriate for the data since it's hard to tell whether the transformed data belong to a (sufficiently) normal distribution. So if a QQ plot reveals very positively skewed data, which look more reasonable after taking logs, is it safe to use a t-test on the logged data? On larger samples this would be very tempting, but with small n I'd probably hold off unless there had been grounds to expect a log-normal distribution in the first place.

There’s another alternative: make a different parametric assumption. For example, with skewed data one might in some situations reasonably consider a gamma distribution, or some other skewed family, as a better approximation – in moderately large samples we might just use a GLM, but in very small samples it may be necessary to look to a small-sample test – in many cases simulation can be useful.
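
As a sketch of that route (my own made-up data; the gamma family and log link are just one plausible choice):

```r
# Two-group comparison under a gamma assumption, via a gamma GLM with log link
set.seed(1)
d <- data.frame(y = c(rgamma(8, shape = 2, rate = 2),
                      rgamma(8, shape = 2, rate = 1)),
                group = rep(c("A", "B"), each = 8))
fit <- glm(y ~ group, family = Gamma(link = "log"), data = d)
summary(fit)$coefficients   # the group coefficient tests a multiplicative difference in means
```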

Alternative 2: robustify the t-test (but taking care about the choice of robust procedure so as not to heavily discretize the resulting distribution of the test statistic) - this has some advantages over a very-small-sample nonparametric procedure such as the ability to consider tests with low type I error rate.

Here I'm thinking along the lines of using say M-estimators of location (and related estimators of scale) in the t-statistic to smoothly robustify against deviations from normality. Something akin to the Welch statistic, like:

$$\frac{\tilde{x}-\tilde{y}}{\tilde{S}_p}$$

where $\tilde{S}_p^2=\frac{\tilde{s}_x^2}{n_x}+\frac{\tilde{s}_y^2}{n_y}$, with $\tilde{x}$, $\tilde{s}_x$ etc. being robust estimates of location and scale respectively.

I'd aim to reduce any tendency of the statistic to discreteness - so I'd avoid things like trimming and Winsorizing, since if the original data were discrete, trimming etc will exacerbate this; by using M-estimation type approaches with a smooth $\psi$-function you achieve similar effects without contributing to the discreteness. Keep in mind we're trying to deal with the situation where $n$ is very small indeed (around 3-5, in each sample, say), so even M-estimation potentially has its issues.

You could, for example, use simulation at the normal to get p-values (if sample sizes are very small, I'd suggest that over bootstrapping - if sample sizes aren't so small, a carefully-implemented bootstrap may do quite well, but then we might as well go back to Wilcoxon-Mann-Whitney). There'd be a scaling factor as well as a d.f. adjustment to get to what I'd imagine would then be a reasonable t-approximation. This means we should get the kind of properties we seek very close to the normal, and should have reasonable robustness in the broad vicinity of the normal. There are a number of issues that come up that would be outside the scope of the present question, but I think in very small samples the benefits should outweigh the costs and the extra effort required.
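
A rough sketch of the idea (not a worked-out procedure; the Huber location estimate, the MAD-based scale that MASS::huber() returns, and the tiny made-up samples are all illustrative choices, and the calibration details would need the care described above):

```r
library(MASS)   # for huber()

# Welch-like statistic built from M-estimates of location and a robust scale
rob_stat <- function(x, y) {
  hx <- huber(x); hy <- huber(y)   # $mu = Huber M-estimate of location, $s = MAD scale
  (hx$mu - hy$mu) / sqrt(hx$s^2 / length(x) + hy$s^2 / length(y))
}

x <- c(0.2, 1.1, 0.7, 5.3)         # made-up very small samples
y <- c(2.4, 3.0, 2.2, 2.9, 8.1)
obs <- rob_stat(x, y)

# reference distribution by simulation at the normal (the statistic is
# location/scale invariant, so the standard normal suffices under that null)
null <- replicate(20000, rob_stat(rnorm(length(x)), rnorm(length(y))))
mean(abs(null) >= abs(obs))        # simulated two-sided p-value
```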

[I haven't read the literature on this stuff for a very long time, so I don't have suitable references to offer on that score.]

Of course if you didn't expect the distribution to be somewhat normal-like, but rather similar to some other distribution, you could undertake a suitable robustification of a different parametric test.

What if you want to check assumptions for the non-parametrics? Some sources recommend verifying a symmetric distribution before applying a Wilcoxon test, which brings up similar problems to checking normality.

Indeed. I assume you mean the signed rank test*. In the case of using it on paired data, if you are prepared to assume that the two distributions are the same shape apart from location shift you are safe, since the differences should then be symmetric. Actually, we don't even need that much; for the test to work you need symmetry under the null; it's not required under the alternative (e.g. consider a paired situation with identically-shaped right skewed continuous distributions on the positive half-line, where the scales differ under the alternative but not under the null; the signed rank test should work essentially as expected in that case). The interpretation of the test is easier if the alternative is a location shift though.

*(Wilcoxon’s name is associated with both the one and two sample rank tests – signed rank and rank sum; with their U test, Mann and Whitney generalized the situation studied by Wilcoxon, and introduced important new ideas for evaluating the null distribution, but the priority between the two sets of authors on the Wilcoxon-Mann-Whitney is clearly Wilcoxon’s -- so at least if we only consider Wilcoxon vs Mann&Whitney, Wilcoxon goes first in my book. However, it seems Stigler's Law beats me yet again, and Wilcoxon should perhaps share some of that priority with a number of earlier contributors, and (besides Mann and Whitney) should share credit with several discoverers of an equivalent test.[4][5] )

References

[1]: Zimmerman, D.W. and Zumbo, B.N. (1993),
"Rank transformations and the power of the Student t-test and Welch t′-test for non-normal populations,"
Canadian Journal of Experimental Psychology, 47: 523–39.

[2]: J.C.F. de Winter (2013),
"Using the Student’s t-test with extremely small sample sizes,"
Practical Assessment, Research and Evaluation, 18:10, August, ISSN 1531-7714
https://openpublishing.library.umass.edu/pare/article/id/1434/

[3]: Michael P. Fay and Michael A. Proschan (2010),
"Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules,"
Stat Surv; 4: 1–39.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2857732/

[4]: Berry, K.J., Mielke, P.W. and Johnston, J.E. (2012),
"The Two-sample Rank-sum Test: Early Development,"
Electronic Journal for History of Probability and Statistics, Vol.8, December

[5]: Kruskal, W. H. (1957),
"Historical notes on the Wilcoxon unpaired two-sample test,"
Journal of the American Statistical Association, 52, 356–360.

$\endgroup$
10
  • 2
    $\begingroup$ A couple of things I'd like clarification on. There's several points where you mention e.g. "If the distribution is heavy-tailed, ..." (or skewed etc) - presumably this should be read as "if it's reasonable to assume that the distribution will be heavy-tailed" (from theory/previous studies/whatever) rather than "if the sample is heavy-tailed", otherwise we are back at multi-step testing again which is the thing we are trying to avoid? (It seems to me that a central issue in this topic is how to justify beliefs or assumptions about distributions, without reading too much into the sample.) $\endgroup$
    – Silverfish
    Commented Nov 10, 2014 at 16:51
  • $\begingroup$ Yes, that should be understood as "population is either known to be heavy-tailed, or may reasonably expected to be heavy tailed". That certainly includes things like theory (or sometimes even general reasoning about the situation that doesn't quite reach the status of theory), expert knowledge, and previous studies. It's not suggesting testing for heavy-tailedness. In situations where it's simply unknown, it may be worth investigating how bad things might be under various distributions which might be plausible for the specific situation you have. $\endgroup$
    – Glen_b
    Commented Nov 10, 2014 at 17:13
  • 1
    $\begingroup$ Any chance that this already excellent answer could incorporate a little more detail on what options there might be to "robustify" the t-test? $\endgroup$
    – Silverfish
    Commented Nov 10, 2014 at 18:59
  • $\begingroup$ Silverfish - I'm not sure if I sufficiently addressed your question asking for detail on robustifying. I'll add a little more now. $\endgroup$
    – Glen_b
    Commented Jan 28, 2015 at 1:33
  • $\begingroup$ Many thanks for the addition, I thought that added a lot to the quality of this answer. Now this question has settled down a bit, and generated a good set of responses, I'd like to give the original question a good copy-edit and remove anything which might be misleading (for the benefit of readers who don't read past the question!). Is it okay when I do so for me to make appropriate edits to your response so quotes match with the reorganized question? $\endgroup$
    – Silverfish
    Commented Feb 17, 2015 at 17:21
32
$\begingroup$

In my view the principled approach recognizes that (1) tests and graphical assessments of normality have insufficient sensitivity and graph interpretation is frequently not objective, (2) multi-step procedures have uncertain operating characteristics, (3) many nonparametric tests have excellent operating characteristics under situations in which parametric tests have optimum power, and (4) the proper transformation of $Y$ is not generally the identity function, and nonparametric $k$-sample tests are invariant to the transformation chosen (not so for one-sample tests such as the Wilcoxon signed rank test). Regarding (2), multi-step procedures are particularly problematic in areas such as drug development where oversight agencies such as FDA are rightfully concerned about possible manipulation of results. For example, an unscrupulous researcher might conveniently forget to report the test of normality if the $t$-test results in a low $P$-value.

Putting all this together, some suggested guidance is as follows:

  1. If there is not a compelling reason to assume a Gaussian distribution before examining the data, and no covariate adjustment is needed, use a nonparametric test.
  2. If covariate adjustment is needed, use the semiparametric regression generalization of the rank test you prefer. For the Wilcoxon test this is the proportional odds model and for a normal scores test this is probit ordinal regression.

These recommendations are fairly general although your mileage may vary for certain small sample sizes. But we know that for larger samples the relative efficiency of the Wilcoxon 2-sample test and signed rank tests compared to the $t$-test (if equal variance holds in the 2-sample case) is $\frac{3}{\pi}$ and that the relative efficiency of rank tests is frequently much greater than 1.0 when the Gaussian distribution does not hold. To me, the information loss in using rank tests is very small compared to the possible gains, robustness, and freedom from having to specify the transformation of $Y$.

Nonparametric tests can perform well even if their optimality assumptions are not satisfied. For the $k$-sample problem, rank tests make no assumptions about the distribution for a given group; they only make assumptions for how the distributions of the $k$ groups are connected to each other, if you require the test to be optimal. For a $-\log-\log$ link cumulative probability ordinal model the distributions are assumed to be in proportional hazards. For a logit link cumulative probability model (proportional odds model), the distributions are assumed to be connected by the proportional odds assumptions, i.e., the logits of the cumulative distribution functions are parallel. The shape of one of the distributions is irrelevant. Details may be found here in Chapter 15 of Handouts.

There are two types of assumptions of a frequentist statistical method that are frequently considered. The first is assumptions required to make the method preserve type I error. The second relates to preserving type II error (optimality; sensitivity). I believe that the best way to expose the assumptions needed for the second are to embed a nonparametric test in a semiparametric model as done above. The actual connection between the two is from Rao efficient score tests arising from the semiparametric model. The numerator of the score test from a proportional odds model for the two-sample case is exactly the rank-sum statistic.

For background information on ordinal models see this. For equivalence of Wilcoxon and proportional odds tests see this.
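
As a small illustration of the two-sample connection (toy data made up here; rms::orm is one implementation that fits cumulative probability ordinal models directly to a continuous response):

```r
library(rms)
set.seed(1)
d <- data.frame(y = c(rexp(20), rexp(20, rate = 0.5)),
                group = factor(rep(c("A", "B"), each = 20)))
wilcox.test(y ~ group, data = d)   # Wilcoxon-Mann-Whitney two-sample test
orm(y ~ group, data = d)           # proportional odds model; as noted above, the
                                   # numerator of its score test for group is the
                                   # rank-sum statistic
```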

$\endgroup$
16
  • 1
    $\begingroup$ Thanks for this, I'm very sympathetic to the philosophy of this answer - for instance, lots of sources suggest I should at least eyeball-check data for normality before deciding on a test. But this sort of multi-step procedure clearly, albeit subtly, influences how the tests operate. $\endgroup$
    – Silverfish
    Commented Nov 4, 2014 at 14:06
  • 1
    $\begingroup$ Some queries: (1) suppose there's good reason to assume a Gaussian distribution a priori (eg previous studies) so we prefer a t-test. For tiny $n$ there's no point trying to assess normality - there'd be no way to detect its breach. But for $n=15$ or so, a QQ plot may well show up eg if there's severe skew. Does the philosophy of avoiding multi-step procedures mean we should simply justify our normality assumption, then proceed without checking the apparent distribution of our data? Similarly, in the k sample case, should we by default assume unequal variances rather than try to check it? $\endgroup$
    – Silverfish
    Commented Nov 4, 2014 at 14:19
  • 3
    $\begingroup$ (+1) I am wondering what is your take on Mann-Whitney-Wilcoxon vs. permutation tests (I am referring to Monte Carlo permutation test, when group labels are shuffled e.g. $10\,000$ times and $p$-value is computed directly as the number of shuffles resulting in a larger group difference)? $\endgroup$
    – amoeba
    Commented Nov 4, 2014 at 15:55
  • 4
    $\begingroup$ Permutation tests are ways to control type I error but do not address type II error. A permutation test based on suboptimal statistics (e.g., ordinary mean and variance when the data come from a log-Gaussian distribution) will suffer in terms of power. $\endgroup$ Commented Nov 4, 2014 at 16:26
  • 3
    $\begingroup$ Yes Chapter 15 in the Handouts is expanded into a new chapter in the upcoming 2nd edition of my book which I'll submit to the publisher next month. $\endgroup$ Commented Nov 6, 2014 at 12:40
15
$\begingroup$

Rand Wilcox in his publications and books makes some very important points, many of which were listed by Frank Harrell and Glen_b in earlier posts.

  1. The mean is not necessarily the quantity we want to make inferences about. There may be other quantities that better exemplify a typical observation.
  2. For t-tests, power can be low even for small departures from normality.
  3. For t-tests, observed probability coverage can be substantially different than nominal.

Some key suggestions are:

  1. A robust alternative is to compare trimmed means or M-estimators using the t-test. Wilcox suggests 20% trimmed means (a minimal base-R sketch of Yuen's trimmed-means test follows this list).
  2. Empirical Likelihood methods are theoretically more advantageous (Owen, 2001) but not necessarily so for medium to small n.
  3. Permutation tests are great if one needs to control Type I error, but one cannot get confidence intervals.
  4. For many situations, Wilcox proposes the bootstrap-t to compare trimmed means. In R, this is implemented in the functions yuenbt, yhbt in the WRS package.
  5. Percentile bootstrap may be better than percentile-t when the amount of trimming is ≥ 20%. In R this is implemented in the function pb2gen in the aforementioned WRS package.
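
For concreteness, here is a minimal base-R sketch of Yuen's (1974) trimmed-means test (suggestion 1 above), written out so it does not depend on any particular package interface; the WRS functions named above are the maintained (and bootstrap) implementations, and the toy data here are made up:

```r
# Yuen's test comparing 20% trimmed means
yuen_test <- function(x, y, tr = 0.2) {
  g <- function(z) floor(tr * length(z))          # number trimmed per tail
  winvar <- function(z) {                         # Winsorized sample variance
    zs <- sort(z); k <- g(z)
    var(pmin(pmax(zs, zs[k + 1]), zs[length(z) - k]))
  }
  h1 <- length(x) - 2 * g(x); h2 <- length(y) - 2 * g(y)
  d1 <- (length(x) - 1) * winvar(x) / (h1 * (h1 - 1))
  d2 <- (length(y) - 1) * winvar(y) / (h2 * (h2 - 1))
  tstat <- (mean(x, trim = tr) - mean(y, trim = tr)) / sqrt(d1 + d2)
  df <- (d1 + d2)^2 / (d1^2 / (h1 - 1) + d2^2 / (h2 - 1))
  c(statistic = tstat, df = df, p.value = 2 * pt(-abs(tstat), df))
}

set.seed(1)
yuen_test(rlnorm(15), rlnorm(15) + 1)             # made-up skewed samples
```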

Two good references are Wilcox (2010) and Wilcox (2012).

$\endgroup$
9
$\begingroup$

Bradley, in his work Distribution-Free Statistical Tests (1968, pp. 17–24), presents thirteen contrasts between what he calls "classical" and "distribution-free" tests. Note that Bradley differentiates between "non-parametric" and "distribution-free," but for the purposes of your question this difference is not relevant. Included in those thirteen are elements that relate not just to the derivations of the tests, but also to their applications. These include:

  • Choice of significance level: Classical tests have a continuous range of significance levels; distribution-free tests usually have only a discrete set of attainable significance levels, so the classical tests offer more flexibility in setting said level.
  • Logical validity of rejection region: Distribution-free test rejection regions can be less intuitively understandable (neither necessarily smooth nor continuous) and may cause confusion as to when the test should be considered to have rejected the null hypothesis.
  • Type of statistics which are testable: To quote Bradley directly: "Statistics defined in terms of arithmetical operations upon observation magnitudes can be tested by classical techniques, whereas those defined by order relationships (rank) or category-frequencies, etc. can be tested by distribution-free methods. Means and variances are examples of the former and medians and interquartile ranges, of the latter." Especially when dealing with non-normal distributions, the ability to test other statistics becomes valuable, lending weight to the distribution-free tests.
  • Testability of higher-order interactions: Much easier under classical tests than distribution-free tests.
  • Influence of sample size: This is a rather important one in my opinion. When sample sizes are small (Bradley says around n = 10), it may be very difficult to determine if the parametric assumptions underlying the classical tests have been violated or not. Distribution-free tests do not have these assumptions to be violated. Moreover, even when the assumptions have not been violated, the distribution-free tests are often almost as easy to apply and almost as efficient. So for small sample sizes (less than 10, possibly up to 30) Bradley favors an almost routine application of distribution-free tests. For large sample sizes, the Central Limit Theorem tends to overwhelm parametric violations in that the sample mean and sample variance will tend to the normal, and the parametric tests may be superior in terms of efficiency.
  • Scope of Application: By being distribution-free, such tests are applicable to a much larger class of populations than classical tests assuming a specific distribution.
  • Detectability of violation of assumption of a continuous distribution: Easy to see in distribution-free tests (e.g. existence of tied scores), harder in parametric tests.
  • Effect of violation of assumption of a continuous distribution: If the assumption is violated the test becomes inexact. Bradley spends time explaining how the bounds of the inexactitude can be estimated for distribution-free tests, but there is no analogous routine for classical tests.
$\endgroup$
4
  • 1
    $\begingroup$ Thank you for the citation! Bradley's work seems quite old so I suspect it does not have much work on modern simulation studies to compare efficiencies and Type I/II error rates in various scenarios? I'd also be interested in what he suggests about Brunner-Munzel tests - should they be used instead of a U test if variances in the two groups are not known to be equal? $\endgroup$
    – Silverfish
    Commented Nov 5, 2014 at 22:11
  • 1
    $\begingroup$ Bradley does discuss efficiencies, although most of the time, it is in the context of asymptotic relative efficiency. He brings sources sometimes for statements about finite sample-size efficiency, but as the work is from 1968, I'm sure much better analyses have been done since then. Speaking of which, If I have it right, Brunner and Munzel wrote their article in 2000, which explains why there is no mention of it in Bradley. $\endgroup$
    – Avraham
    Commented Nov 5, 2014 at 23:23
  • $\begingroup$ Yes that would indeed explain it! :) Do you know if there is a more up to date survey than Bradley? $\endgroup$
    – Silverfish
    Commented Nov 5, 2014 at 23:25
  • $\begingroup$ A brief search shows that there are a lot of recent texts on non-parametric statistics. For example: Nonparametric Statistical Methods (Hollander et al, 2013), Nonparametric Hypothesis Testing: Rank and Permutation Methods with Applications in R (Bonnini et al, 2014), Nonparametric Statistical Inference, Fifth Edition (Gibbons and Chakraborti, 2010). There are many others which come up in various searches. As I don't have any, I cannot make any recommendations. Sorry. $\endgroup$
    – Avraham
    Commented Nov 6, 2014 at 0:20
7
$\begingroup$

One aspect that does not seem to have been raised in the other answers is that different tests (e.g., the t-test vs. the Mann-Whitney U test) test different things.

The t-test is a test on the difference of means between two populations. Per Wikipedia on the Mann-Whitney U test:

[T]he Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW/MWU), Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis that, for randomly selected values $X$ and $Y$ from two populations, the probability of $X$ being greater than $Y$ is equal to the probability of $Y$ being greater than $X$.

Here are little illustrations to show that these are different things.


Consider two populations. Population 1 is a mixture between a uniform distribution on $[0,1]$ with weight $0.4$ and a uniform distribution on $[2,3]$ with weight $0.6$. Population 2 is a uniform distribution on $[1.4,2.0]$.

Both populations have the same mean: $$ 0.4\times 0.5+0.6\times 2.5 = 1.7. $$ Running t-tests on samples from both populations will yield p-values that are uniform on $[0,1]$. In one out of twenty cases we will get $p<.05$ and, by rejecting the null hypothesis of equal means at the conventional alpha level, commit a type I error.

However, if we draw $X$ at random from population 1 and $Y$ from population 2, then $P(X<1)=0.4$ and $P(X>2)=0.6$. On the other hand, $Y$ is always between $1.4$ and $2.0$. Thus $$ P(X<Y) = 0.4 \neq 0.6 = P(X>Y), $$ so the null hypothesis that the Mann-Whitney U test tests is false. Repeated applications of the MWU to new samples will yield p-values that are not uniform, and every failure to reject ($p>.05$) is a type II error.

We have two populations with equal means, i.e., the null hypothesis that the t-test tests for is true, but the null hypothesis that the MWU tests for is false.
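
A quick simulation of the first scenario bears this out (a sketch; the per-group sample size of 50 and the number of replications are my choices):

```r
set.seed(1)
rpop1 <- function(n) ifelse(runif(n) < 0.4, runif(n, 0, 1), runif(n, 2, 3))  # mixture
rpop2 <- function(n) runif(n, 1.4, 2.0)
sim <- replicate(5000, {
  x <- rpop1(50); y <- rpop2(50)
  c(t = t.test(x, y)$p.value, mwu = wilcox.test(x, y)$p.value)
})
rowMeans(sim < 0.05)
# the t-test rejects about 5% of the time (its null, equal means, is true);
# the MWU rejects far more often (its null, P(X>Y) = P(X<Y), is false)
```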


Conversely, if the weights for the mixture in population 1 are no longer $0.4/0.6$, but $0.5/0.5$, then the means are not equal any more (so the null hypothesis for the t-test is false), but $P(X<Y)=P(X>Y)$, i.e., the null hypothesis for the MWU is true.


Thus, the t-test and the Mann-Whitney U test test different things. We can run either one, and decide which one to use based on a variety of factors. However, if we decide to replace the t-test with the MWU, then we should be aware that we are testing a different quantity, i.e., we are asking a different question of the data. (Based on my experience, the question that the MWU addresses is much harder to understand, especially for non-statisticians.)

If I wanted to replace the t-test with a nonparametric alternative, I would usually go for a bootstrap or permutation test on the mean, because that is presumably what I am interested in (otherwise, why consider the t-test in the first place?).

And of course, it makes sense to first consider what question we should be asking of the data, and then to choose the appropriate test.

$\endgroup$
1
  • $\begingroup$ Mann-Whitney U and t-test can be testing the same thing/hypothesis when that thing/hypothesis is that 'two populations follow the same distribution'. The one is sensitive to deviations of the mean, the other to deviations of stochastic dominance. Both relate to the same tendency of a variable to be larger than the other. $\endgroup$ Commented Jun 27 at 11:12
5
$\begingroup$

Starting to answer this very interesting question.

For non-paired data:

Performance of five two-sample location tests for skewed distributions with unequal variances by Morten W. Fagerland and Leiv Sandvik (behind paywall) runs a series of experiments with 5 different tests (t-test, Welch U, Yuen-Welch, Wilcoxon-Mann-Whitney and Brunner-Munzel) for different combinations of sample size, sample ratio, departure from normality, and so on. The paper ends up suggesting Welch U in general.

But Appendix A of the paper lists the results for each combination of sample sizes, and for small sample sizes (m = 10 and n = 10 or 25) the results are more mixed (as expected) – in my reading of the results (not the authors'), Welch U and Brunner-Munzel seem to perform equally well, and the t-test also does well in the m = 10, n = 10 case.

This is what I know so far.

For a "fast" solution, I used to cite Increasing Physicians’ Awareness of the Impact of Statistics on Research Outcomes: Comparative Power of the t-test and Wilcoxon Rank-Sum Test in Small Samples Applied Research by Patrick D. Bridge and Shlomo S. Sawilowsky (also behind paywall) and go straight to Wilcoxon no matter the sample size – but caveat emptor: see, for example, Should we always choose a nonparametric test when comparing two apparently nonnormal distributions? by Eva Skovlund and Grete U. Fenstad.

I have not yet found any similar results for paired data.

$\endgroup$
3
    $\begingroup$ I appreciate the citations! For clarification, is the "Welch U" being referred to the same test also known as the "Welch t" or "Welch-Aspin t" or (as I perhaps improperly called it in the question) "t test with Welch correction"? $\endgroup$
    – Silverfish
    Commented Nov 4, 2014 at 13:51
  • $\begingroup$ As far as I understand from the paper, Welch U is not the usual Welch-Aspin - it does not use the Welch–Satterthwaite equation for the degrees of freedom, but a formula that has a difference of the cube and the square of the sample size. $\endgroup$ Commented Nov 4, 2014 at 15:24
  • $\begingroup$ Is it still a t-test though, despite its name? Everywhere else I search for "Welch U" I seem to find it's referring to the Welch-Aspin, which is frustrating. $\endgroup$
    – Silverfish
    Commented Nov 5, 2014 at 22:27
3
$\begingroup$

Simulating the difference of means of Gamma populations

Comparing the t-test and the Mann Whitney test

Summary of results

  • When the variance of the two populations is the same, the Mann Whitney test has greater true power but also a greater true type I error rate than the t-test.
  • For a large sample (N = 1000), the minimum true type I error rate for the Mann Whitney test is 9%, whereas the t-test has a true type I error rate of 5%, as required by the experiment setup (reject $H_0$ for p-values below 5%).
  • When the variances of the two populations differ, the Mann Whitney test leads to a large type I error rate even when the means are the same. This is expected, since the Mann Whitney test is testing for a difference in distributions, not in means.
  • The t-test is robust to differences in variance when the means are identical.

Experiment 1) Different means, same variance

Consider two gamma distributions parametrized using k (shape) and scale $\theta$, with parameters

  • $X_1$: gamma with $k = 0.5$ and $\theta = 1$ hence mean $E[X_1] = k\theta = 0.5$ and variance $Var[X_1] = k\theta^2 = 0.5$
  • $X_2$: gamma with $k = 1.445$ and $\theta = 0.588235$, hence mean $E[X_2] = k\theta = 0.85$ and variance $Var[X_2] = k\theta^2 = 0.5$

We will be testing for a difference in means of samples from $X_1$ and $X_2$. Here the setup is chosen such that $X_1$ and $X_2$ have the same variance, hence the true Cohen's d is 0.5:

$$ d = (.85 - .5) / \sqrt{.5} = 0.5$$

We will compare two testing methods, the two-sample t-test and the Mann Whitney non-parametric test, and simulate the true Type I error rate and power of these tests for different sample sizes (assuming we reject the null hypothesis for $p$-values < 0.05)

  • $H_0: \mu_{X_1} = \mu_{X_2} = 0.5$
  • $H_1: \mu_{X_1} \neq \mu_{X_2}$

The true type I error rate is calculated as $P(\text{reject} | H_0)$ and the true power is calculated as $P(\text{reject} | H_1)$. We simulate thousands of experiments under the true $H_0$ and $H_1$ distributions.
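
A sketch of how Experiment 1 can be re-implemented (my own re-implementation; the per-group sample size of 10, the number of replications, and the assumption that $H_0$ is simulated by drawing both samples from $X_1$ are all my choices):

```r
set.seed(42)
n <- 10; reps <- 5000
sim <- function(k2, theta2) {
  p <- replicate(reps, {
    x <- rgamma(n, shape = 0.5, scale = 1)        # X1: mean 0.5, variance 0.5
    y <- rgamma(n, shape = k2,  scale = theta2)   # X2
    c(t = t.test(x, y)$p.value, mw = wilcox.test(x, y)$p.value)
  })
  rowMeans(p < 0.05)
}
sim(0.5, 1)             # both groups from X1 (true H0): estimated type I error
sim(1.445, 0.588235)    # X2 has mean 0.85 (true H1): estimated power
```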

Sources:

Population distributions

[Figure: population distributions of $X_1$ and $X_2$]

Simulation results

[Figure: simulated type I error and power by sample size]

Discussion

  • As expected, the sample mean is not normally distributed for small sample size ($N = 10$) as shown by the distribution skew and kurtosis. For larger sample size, the distribution is approximately normal
  • For all sample sizes, the Mann Whitney test has more power than the t-test, and in some cases by a factor of 2
  • For all sample sizes, the Mann Whitney test has a greater type I error rate, by a factor of roughly 2–3
  • The t-test has low power for small sample sizes

Discussion: when the variances of the two populations are indeed the same, the Mann Whitney test greatly outperforms the t-test in terms of power for small sample sizes, but has a higher Type I error rate.


Experiment 2: Different variances, same mean

  • $X_1$: gamma with $k = 0.5$ and $\theta = 1$ hence mean $E[X_1] = k\theta = .5$ and variance $Var[X_1] = k\theta^2 = .5$
  • $X_2$: gamma with $k = 0.25$ and $\theta = 2$, hence mean $E[X_2] = k\theta = 0.5$ and variance $Var[X_2] = k\theta^2 = 1$

Here we won't be able to compute the power, because the simulation does not contain a true $H_1$ scenario. However, we can compute the type I error rate when $Var[X_1] = Var[X_2]$ and when $Var[X_1] \neq Var[X_2]$.

Discussion: results from the simulation show that the t-test is very robust to unequal variances, with a type I error rate close to 5% for all sample sizes. As expected, the Mann Whitney test performs poorly in this case, since it is not testing for a difference in means but for a difference in distributions.

[Figure: simulated type I error rates, Experiment 2]

$\endgroup$
0
2
$\begingroup$

Consider the following links:

Is normality testing 'essentially useless'?

Need and best way to determine normality of data

To simplify things: since non-parametric tests perform reasonably well even for normal data, why not always use them for small samples?

$\endgroup$
