$\begingroup$

I want to test by simulation whether the Wilcoxon test is more robust than the Student t-test for non-normally distributed data.

For example, I'm testing the uniform and the exponential distributions. I don't know if I simulated it wrong or if I missed something, but I can't find any robustness advantage of the Wilcoxon test over the Student t-test.

I simulated the populations so that the mean of population A is one point higher than that of population B.

## UNIFORM ------
populationA <- round(runif(100000, 73, 75), 1)
populationB <- round(runif(100000, 72, 74), 1)
t.test(populationA, populationB)

plot(density(populationA))
plot(density(populationB))

n <- 10   # also tried n <- 5

sub_popA <- sample(populationA, size = n)
sub_popB <- sample(populationB, size = n)

t.test(sub_popA, sub_popB)$p.value
wilcox.test(sub_popA, sub_popB)$p.value

With different sample sizes, I found that the Student t-test's p-value was closer to the truth than the Wilcoxon test's.

It is the same for the exponential distribution: I found no superiority of the Wilcoxon test over the Student t-test, even after repeating the sampling 100 times and counting the number of rejections.

## EXPONENTIAL ------
populationA <- rexp(100000, 1)
populationB <- rexp(100000, 1/2)
t.test(populationA, populationB)

sub_popA <- sample(populationA, size = 10)
sub_popB <- sample(populationB, size = 10)
t.test(sub_popA, sub_popB)
wilcox.test(sub_popA, sub_popB)
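
The repeated-sampling experiment described above (drawing many subsamples and counting how often each test rejects) could be sketched like this; a minimal illustration, assuming we keep the two exponential populations defined above and reject at the 5% level:

```r
set.seed(42)
n_rep <- 1000
p_t <- p_w <- numeric(n_rep)

for (i in 1:n_rep) {
  sub_popA <- sample(populationA, size = 10)
  sub_popB <- sample(populationB, size = 10)
  p_t[i] <- t.test(sub_popA, sub_popB)$p.value
  p_w[i] <- wilcox.test(sub_popA, sub_popB)$p.value
}

# Proportion of replicates in which each test rejects at alpha = 0.05
mean(p_t < 0.05)
mean(p_w < 0.05)
```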
$\endgroup$
  • $\begingroup$ In fact, the t-test is a quite robust test for non-normal data. Inferences have to be made with caution in this case, but you always have the bootstrap. $\endgroup$ Commented Apr 29 at 12:41
  • $\begingroup$ Maybe you could elaborate a bit more on what you did and what you expected to happen. For instance, in your code you seem to compare p-values, but that really only makes sense when you simulate a lot of samples. Then you talk about the t-test being "closer to the truth" and I think there might be multiple things you could mean by that, also things quite far apart from p-values. $\endgroup$ Commented Apr 29 at 14:38
  • $\begingroup$ The meaning of "robust ... for a specified distribution" is unclear, because it seems to merge two distinct concepts: robustness refers to performance when there are departures from distributional assumptions, whereas referring to a distribution seems to be asking about the power of a test. Could you please clarify what you mean? $\endgroup$
    – whuber
    Commented Apr 29 at 14:49

4 Answers

$\begingroup$

With respect to your choices of distribution, the uniform distribution is not something you need to worry about being robust against. In the old days, a common way of simulating a Normal distribution was to average 12 uniform variates, which implies that a t-statistic based on two samples of size six would be close to the desired distribution under the null hypothesis. In fact, the asymptotic relative efficiency of the Wilcoxon to the t-test for the uniform distribution is... $1.0$. (For the Normal, it's $0.955$.)

Generally speaking (but not always), you want to protect against outliers relative to the "base" distribution (in this case, the Normal), which can be thought of as generated by a distribution with "fatter" tails than the Normal.

Let's do a more extensive simulation with 10,000 repeats of the procedure (100 repeats is far too few to draw any conclusions except in the most egregious cases), with our base distribution being the fat-tailed $t(3)$ distribution:

library(data.table)

# Store the p-values from each of 10,000 simulated experiments
reject <- data.table(t = rep(0, 10000), w = rep(0, 10000))

for (i in 1:nrow(reject)) {
    x1 <- rt(10, 3)       # fat-tailed t(3) sample
    x2 <- rt(10, 3) - 2   # same distribution, shifted by -2

    reject$t[i] <- t.test(x1, x2)$p.value
    reject$w[i] <- wilcox.test(x1, x2)$p.value
}

# Rejection rates at the 1% level
reject[, .(t_reject = mean(t < 0.01), Wilcox_reject = mean(w < 0.01))]

which gives us the following:

> reject[, .(t_reject = mean(t < 0.01), Wilcox_reject = mean(w < 0.01))]
   t_reject Wilcox_reject
1:    0.551         0.625
> reject[, .(t_reject = mean(t < 0.05), Wilcox_reject = mean(w < 0.05))]
   t_reject Wilcox_reject
1:    0.766        0.8414

Clearly favoring the Wilcoxon test.

Now for your Exponential distribution test. Your two Exponential distributions differ in scale, not location; this makes it harder to detect changes in the mean. However, with a larger number of repeats of the experiment, we can still see a difference:

for (i in 1:nrow(reject)) {
    x1 <- rexp(10)
    x2 <- rexp(10)/3
    
    reject$t[i] <- t.test(x1,x2)$p.value
    reject$w[i] <- wilcox.test(x1,x2)$p.value
}

> reject[, .(t_reject = mean(t < 0.01), Wilcox_reject = mean(w < 0.01))]
   t_reject Wilcox_reject
1:   0.1176        0.2313
> reject[, .(t_reject = mean(t < 0.05), Wilcox_reject = mean(w < 0.05))]
   t_reject Wilcox_reject
1:   0.4536        0.4697

If, instead of rescaling the Exponential distributions, we add a location parameter and rerun the tests, testing for differences in location, we get the following:

for (i in 1:nrow(reject)) {
    x1 <- rexp(10)
    x2 <- rexp(10) + 0.5
    
    reject$t[i] <- t.test(x1,x2)$p.value
    reject$w[i] <- wilcox.test(x1,x2)$p.value
}

> reject[, .(t_reject = mean(t < 0.01), Wilcox_reject = mean(w < 0.01))]
   t_reject Wilcox_reject
1:   0.0708        0.1283
> reject[, .(t_reject = mean(t < 0.05), Wilcox_reject = mean(w < 0.05))]
   t_reject Wilcox_reject
1:    0.222         0.313

Note also that the t-test will fail when the underlying distributions do not have a finite variance, but the Wilcoxon will not.
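
That last point can be checked with the same setup; the following is a sketch, reusing the `reject` table from above and drawing from the $t(1)$ (Cauchy) distribution, which has no finite mean or variance:

```r
for (i in 1:nrow(reject)) {
    x1 <- rt(10, 1)       # Cauchy: infinite variance
    x2 <- rt(10, 1) - 2   # same distribution, shifted

    reject$t[i] <- t.test(x1, x2)$p.value
    reject$w[i] <- wilcox.test(x1, x2)$p.value
}

reject[, .(t_reject = mean(t < 0.05), Wilcox_reject = mean(w < 0.05))]
```

Here the Wilcoxon rejection rate should remain substantially higher, since ranks are unaffected by the extreme draws that inflate the denominator of the t-statistic.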

$\endgroup$
$\begingroup$

It's worth noting that the Wilcoxon test is more "robust" (in terms of power) because it compares a quantity that is less sensitive to outliers than the mean, much as the median is more robust to outliers than the mean... but that isn't helpful if you actually care about the mean.

To demonstrate, I will borrow @jbowman's code, except that instead of comparing two t's with different means, I will compare two different parametric families (normal + exponential) that actually do have the same mean.

library(data.table)
reject <- data.table(t = rep(0,10000), w = rep(0, 10000))

for (i in 1:nrow(reject)) {
  x1 <- rnorm(400, mean=1)
  x2 <- rexp(400, 1) 
  
  reject$t[i] <- t.test(x1,x2)$p.value
  reject$w[i] <- wilcox.test(x1,x2)$p.value
}

reject[, .(t_reject = mean(t < 0.05), Wilcox_reject = mean(w < 0.05))]

>    t_reject Wilcox_reject
        <num>         <num>
1:     0.0484        0.4639

Note that in this case, if we care about the mean, we have a type I error rate of ~0.464 for the Wilcoxon test, while the t-test has 0.0484, much closer to the target type I error rate.

With the data I often work with, we have so many extreme outliers that our t-tests are often underpowered... but swapping them out for Wilcoxon tests would measure something completely different from the mean and would likely mislead us, given that we do care about means.

$\endgroup$
$\begingroup$
  1. When evaluating the performance characteristics of tests, you should look at type I and type II error probabilities (uniformity of p-values under the null can be seen as an alternative). Simulating these with any precision requires you to generate lots of data sets (at least 1000, I'd say) with the same true parameters and sample sizes. Your code seems to generate only a single one for each combination, so whatever you observe there could be due to random variation.

  2. Note in particular that you'd need to simulate both type I and type II error probabilities, because it's not really a great achievement of a test to have a large power (small type II error probability) if it is anticonservative, i.e., the type I error probability is too large.

  3. An exact problem definition requires stating which situations are supposed to count as the null hypothesis (H0) being true or violated (I call this the "interpretative null hypothesis", as opposed to the formal one, which comes with restrictive model assumptions). This is not always trivial. In your examples it may seem clear (if the distributions in the two groups are not equal, you want to reject; otherwise not), but one can imagine situations in which it is not (for example, two normal distributions with different variances and the same or very similar means). In fact the two exponential distributions, as mentioned in another answer, are not related by a mean shift, and this situation may not be covered by any existing theory for the t-test (I'm not sure; I haven't looked for it). However, I think it is realistic that a t-test would in practice be used in a situation like this, and that in this case we arguably would want to reject, so ultimately I think this is fine as a situation to try out, even though you should try to be clearer about the basis for doing so (for which I hope this helps).

  4. Note that the remark in some other answers that what the Wilcoxon tests is not the same thing as what the t-test tests is also connected to item 3. What is of interest is to compare them in situations in which the "interpretative H0" is the same even if the formal one isn't; this is probably the case in many practical situations, but not necessarily all of them. One needs to keep in mind in particular that in a practical situation we don't normally know what the true underlying distribution is, whether the two groups truly have equal variances, or even whether one distribution is stochastically larger than the other. This means it isn't so easy to nail down practical situations in which we would clearly be interested in one of these tests but not the other.

  5. Having said all this, what you report here agrees with my own experiments (some of which here: "Should we test the model assumptions before running a model-based test?"; some more such simulations are done in literature referenced there). In fact the t-test is better than the Wilcoxon in a good number of situations in which the formal (normality) null hypothesis for the t-test is not fulfilled. In some of these situations, the power gain of the t-test is in fact bigger than the power gain under the nominal model, i.e., assuming normality.

  6. What is referred to as "Robust Statistics" is somewhat different from nonparametric statistics, of which the Wilcoxon is an example. However, "robustniks" (people working on and promoting robust statistics) would normally say things such as "the Wilcoxon is more robust than the t-test". This is because robustniks are usually interested in worst-case situations, and the power of the t-test can break down badly under outliers and heavy tails (try simulating from a $t_1$- or $t_2$-distribution with a mean shift to see this). The quality loss of the Wilcoxon in the uniform and exponential situations is probably more harmless than the loss of the t-test in such extreme situations. (I think I have heard or seen, though I don't remember where, that there is an upper bound on the quality loss of the Wilcoxon compared to the t-test, but not the other way round.)

  7. All this is not in disagreement with the nonparametric theory behind the Wilcoxon, as this theory only states that the Wilcoxon keeps its level and is unbiased against a large class of alternatives, however it does not state that it is optimal (except in a very particular special case), and neither is there theory that states that the t-test is generally bad under non-normality, or worse than the Wilcoxon. In fact the t-test can be justified for many non-normal situations by appeal to the Central Limit Theorem (but not where variances don't exist such as $t_1$ and $t_2$, where the t-test is indeed worse than Wilcoxon).
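
The level check from points 1 and 2 can be sketched as follows; a minimal illustration, assuming we take the null case where both groups come from the same exponential distribution, so that the rejection rate estimates the type I error probability:

```r
set.seed(1)
n_rep <- 2000
p_t <- p_w <- numeric(n_rep)

# H0 true: both samples drawn from the same distribution
for (i in 1:n_rep) {
  x1 <- rexp(10)
  x2 <- rexp(10)
  p_t[i] <- t.test(x1, x2)$p.value
  p_w[i] <- wilcox.test(x1, x2)$p.value
}

# Both rates should be close to the nominal 0.05 if the tests hold their level
mean(p_t < 0.05)
mean(p_w < 0.05)
```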

$\endgroup$
$\begingroup$

I am not sure what the purpose of your simulation is, or how you expect the Wilcoxon-Mann-Whitney U test (MW-U) to be more robust.

  1. The Student t and the MW-U do not test the same thing. One is a test of a difference of means, the other of stochastic order. $H_0$ for one is $\mu_X=\mu_Y$, for the other $P(X>Y)+\frac{1}{2}P(X=Y)=.5$. So, depending on how you select your 2 simulated samples, you will get very different power, unsurprisingly. And because stochastic superiority is a much more "complex" property, involving not just a single parameter (the mean) but the whole shape of the distribution, it is indeed easier, in many cases, to detect a difference.
    FYI, one can easily create 2 samples A & B where $\mu_A>\mu_B$ at whatever significance level you choose, but $P(B>A)>.5$ at the same significance level. That is, they will give you "apparently contradictory" results (which is not a contradiction, because they do not test the same null). So "comparing" them may be a bit of an apples/oranges exercise.
  2. "Robustness" is usually used to describe the lack of misbehavior of a test under departures from its distributional assumptions. Because the MW-U test, in its fundamental form, makes absolutely no assumption about the 2 parent distributions, nor any other assumption of any sort, it is, by definition, robust. Not so for the Student t: it needs some normality (or, more correctly, not-too-extreme non-normality) and also equality of variances (though the Welch t-test avoids this assumption). Note, however, that the Student t-test suffers from $\alpha$-error inflation when the variances are different (a problem addressed by the Welch test), and so does the MW-U (see Zimmerman, D. W. (1999). Type I Error Probabilities of the Wilcoxon-Mann-Whitney Test and Student T Test Altered by Heterogeneous Variances and Equal Sample Sizes. Perceptual and Motor Skills, 88(2), 556-558. https://doi.org/10.2466/pms.1999.88.2.556).
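
The "apparently contradictory" situation from point 1 can be sketched like this; a made-up example (the mixture weights and sample sizes are arbitrary choices), in which A has the larger mean but B is stochastically larger:

```r
set.seed(7)
# A: usually 0, occasionally 10 -> mean about 4, but most values are small
A <- sample(c(0, 10), 300, replace = TRUE, prob = c(0.6, 0.4))
# B: tightly concentrated near 1 -> mean about 1
B <- rnorm(300, mean = 1, sd = 0.05)

mean(A) - mean(B)        # positive: A has the larger mean
mean(outer(B, A, ">"))   # around 0.6: P(B > A) exceeds 0.5

t.test(A, B)             # detects A's larger mean
wilcox.test(A, B)        # detects B's stochastic superiority
```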
$\endgroup$
  • $\begingroup$ This answer is, unfortunately, incorrect in several ways and also mildly incoherent. First, in your opening sentence, you have "...how you expect that Wilcoxon Mann Whitney U (MW-U) would be more robust?", whereas in the second sentence of part 2, you state "...MW-U test, in its fundamental form, ... is, by definition, robust." Hmmm... However, the second statement is false; distribution-free in no way implies robustness. Consider the sample mean as a counterexample to this. As Peter Huber observed, robust, distribution-free, and nonparametric are actually not closely related properties. $\endgroup$
    – jbowman
    Commented Apr 29 at 21:33
  • $\begingroup$ You then contradict yourself later in the paragraph when you state that "...the Student t test suffers from $\alpha$-error inflation when the variances are different ... and so does the MW-U", which implies the test is not in fact robust against differences in variances. Furthermore, your initial statement in part 1 ignores the fact that the two null hypotheses are equivalent when there is only a shift in location between the two distributions, or, in the case of the Exponential, a shift in scale. In these cases, and others that can easily be constructed, the nulls are equivalent. $\endgroup$
    – jbowman
    Commented Apr 29 at 21:40
  • $\begingroup$ Can you provide your definition of "robustness"? How do you evaluate it? Then maybe we can discuss whether a test is, or is not, robust. My statements stem from the fact that the OP seems to be comparing the power of the 2 tests (since he creates samples which are not compatible with the null), and not their robustness. And a test which makes no distributional assumptions is, by definition, robust. It may have other flaws, etc., but it does not care about breaking the assumptions because it makes none... If a test is robust by definition, how can one test that robustness? Simple logic. $\endgroup$
    – jginestet
    Commented Apr 29 at 21:47
  • $\begingroup$ Power is indeed part of it, with the level being the other part. If we didn't care about power when the assumptions of a test are violated, we'd save ourselves the effort and just generate a $U(0,1)$ variate, rejecting when $u \leq 0.05$ or some such, because the assumptions are (almost) always violated to some degree. Studying what happens to the power of a test as assumptions are violated has a long history; I find it in Robust Statistics (Hampel et. al.), my 1986 edition, and I'm quite sure it wasn't novel then. $\endgroup$
    – jbowman
    Commented Apr 29 at 22:54
  • $\begingroup$ 1. Someone learning statistics from Wikipedia has much more learning to do. 2. You are changing the subject to one of my definition of robustness, and throughout this thread, and the other thread we are engaged on, you fail to address any of the points I make. 3. A trivial example of applying the Wikipedia definition to a distribution-free test is a violation of the i.i.d. assumption; consider the usual distribution-free test of the location of the median based on the order statistics. Its performance breaks down as the correlation between observations increases. $\endgroup$
    – jbowman
    Commented May 2 at 18:58
