
I have data from an experiment that I analyzed using t-tests. The dependent variable is interval-scaled, and the data are either unpaired (i.e., 2 groups) or paired (i.e., within-subjects). For example (within subjects):

x1 <- c(99, 99.5, 65, 100, 99, 99.5, 99, 99.5, 99.5, 57, 100, 99.5, 
        99.5, 99, 99, 99.5, 89.5, 99.5, 100, 99.5)
y1 <- c(99, 99.5, 99.5, 0, 50, 100, 99.5, 99.5, 0, 99.5, 99.5, 90, 
        80, 0, 99, 0, 74.5, 0, 100, 49.5)

However, the data are not normal, so one reviewer asked us to use something other than the t-test. As one can easily see, the data are not only not normally distributed, but the distributions also differ between conditions:

[histograms of the two conditions]

Therefore, the usual nonparametric tests, the Mann-Whitney U test (unpaired) and the Wilcoxon signed-rank test (paired), cannot be used, as they require equal distributions between conditions. Hence, I decided that some resampling or permutation test would be best.

Now, I am looking for an R implementation of a permutation-based equivalent of the t-test, or any other advice on what to do with the data.

I know that there are some R packages that can do this for me (e.g., coin, perm, exactRankTests, etc.), but I don't know which one to pick. So, if somebody with some experience using these tests could give me a kick-start, that would be ubercool.

UPDATE: It would be ideal if you could provide an example of how to report the results from this test.


7 Answers


It shouldn't matter that much, since the test statistic will always be the difference in means (or something equivalent). Small differences can come from the implementation of the Monte-Carlo methods. Trying the three packages with your data, using a one-sided test for two independent groups:

DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
library(coin)                    # for oneway_test(), pvalue()
pvalue(oneway_test(DV ~ IV, alternative="greater", 
                   distribution=approximate(B=9999)))
[1] 0.00330033

library(perm)                    # for permTS()
permTS(DV ~ IV, alternative="greater", method="exact.mc", 
       control=permControl(nmc=10^4-1))$p.value
[1] 0.003

library(exactRankTests)          # for perm.test()
perm.test(DV ~ IV, paired=FALSE, alternative="greater", exact=TRUE)$p.value
[1] 0.003171822

To check the exact p-value with a manual calculation of all permutations, I'll restrict the data to the first 9 values.

x1 <- x1[1:9]
y1 <- y1[1:9]
DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
pvalue(oneway_test(DV ~ IV, alternative="greater", distribution="exact"))
[1] 0.0945907

permTS(DV ~ IV, alternative="greater", exact=TRUE)$p.value
[1] 0.0945907

# perm.test() gives different result due to rounding of input values
perm.test(DV ~ IV, paired=FALSE, alternative="greater", exact=TRUE)$p.value
[1] 0.1029412

# manual exact permutation test
idx  <- seq(along=DV)                 # indices to permute
idxA <- combn(idx, length(x1))        # all possibilities for different groups

# function to calculate difference in group means given index vector for group A
getDiffM <- function(x) { mean(DV[x]) - mean(DV[!(idx %in% x)]) }
resDM    <- apply(idxA, 2, getDiffM)  # difference in means for all permutations
diffM    <- mean(x1) - mean(y1)       # empirical difference in group means

# p-value: proportion of permuted differences at least as extreme as the observed one
(pVal <- sum(resDM >= diffM) / length(resDM))
[1] 0.0945907

coin and exactRankTests are both from the same author, but coin seems to be more general and extensive, also in terms of documentation. exactRankTests is not actively developed anymore. I'd therefore choose coin (also because of informative functions like support()), unless you don't want to deal with S4 objects.
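As a rough sketch of what support() gives you (continuing the restricted 9-value example above, and based on my reading of the coin interface, so treat the details as assumptions to verify), you can inspect the exact null distribution directly:

library(coin)                  # already loaded above; for support(), dperm()
ot   <- oneway_test(DV ~ IV, alternative="greater", distribution="exact")
supp <- support(ot)            # all values the test statistic can take under H0
pr   <- dperm(ot, supp)        # exact probability of each of those values
plot(supp, pr, type="h", xlab="test statistic", ylab="probability")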

EDIT: for two dependent samples (the paired, within-subjects case), the syntax is

id <- factor(rep(1:length(x1), 2))    # factor for participant
pvalue(oneway_test(DV ~ IV | id, alternative="greater",
                   distribution=approximate(B=9999)))
[1] 0.00810081
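For comparison only (this is not part of the permutation output above, and it assumes the full 20-value x1 and y1 from the question, with DV and IV rebuilt from them), the corresponding classical t-tests would be:

t.test(DV ~ IV, alternative="greater")                   # unpaired (Welch) t-test
t.test(x1, y1, paired=TRUE, alternative="greater")       # paired t-test for the within-subjects case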
  • Thanks for your great answer! Two more questions: Does your second example mean that coin in fact provides all possible permutations and is an exact test? Is there any benefit of not providing an exact test in my case?
    – Henrik
    Commented Jan 10, 2011 at 13:45
  • (+1) It is no surprise that the (unpaired) t-test yields essentially the same p-value, 0.000349. Despite what the reviewer said, the t-test is applicable to these data. The reason is that the sampling distributions of the means are approximately normal, even though the distributions of the data are not. Moreover, as you can see from the results, the t-test actually is more conservative than the permutation test. (This means that a significant result with the t-test indicates the permutation test is likely to be significant, too.)
    – whuber
    Commented Jan 10, 2011 at 14:09
  • @Henrik For certain situations (chosen test and numerical complexity), coin can indeed calculate the exact permutation distribution (without actually going through all permutations, there are more elegant algorithms than that). Given the choice, the exact distribution seems preferable, but the difference to a Monte-Carlo approximation with a high number of replicates should be small.
    – caracal
    Commented Jan 10, 2011 at 14:26
  • @Caracal Thanks for the clarification. One question remains: the data I presented are paired. Hence, I need the equivalent of the paired t-test. Is oneway_test the correct function? And if so, which one is correct for non-paired data?
    – Henrik
    Commented Jan 10, 2011 at 14:34
  • @Henrik The coin author wrote me that oneway_test() cannot calculate the exact distribution for the dependent case; you have to go with the MC approximation (only wilcoxsign_test() is suitable for the exact test, see the sketch right after this comment thread). I didn't know this and would prefer an error in that case, but MC should be accurate enough with a high number of replicates.
    – caracal
    Commented Jan 11, 2011 at 19:30
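Following up on that last comment, a minimal sketch of the exact test for the paired case via wilcoxsign_test() (my reading of the coin interface; note that this is a signed-rank statistic, not a permutation test on the raw means, and it assumes the full 20-value x1 and y1 from the question):

library(coin)                            # for wilcoxsign_test()
# two-sided by default; check the formula's sign convention before adding alternative=
wilcoxsign_test(x1 ~ y1, distribution="exact")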

A few comments are, I believe, in order.

1) I would encourage you to try multiple visual displays of your data, because they can capture things that are lost by (graphs like) histograms, and I also strongly recommend that you plot on side-by-side axes. In this case, I do not believe the histograms do a very good job of communicating the salient features of your data. For example, take a look at side-by-side boxplots:

boxplot(x1, y1, names = c("x1", "y1"))

[side-by-side boxplots of x1 and y1]

Or even side-by-side stripcharts:

stripchart(c(x1,y1) ~ rep(1:2, each = 20), method = "jitter", group.names = c("x1","y1"), xlab = "")

[side-by-side stripcharts of x1 and y1]

Look at the centers, spreads, and shapes of these! About three-quarters of the $x1$ data fall well above the median of the $y1$ data. The spread of $x1$ is tiny, while the spread of $y1$ is huge. Both $x1$ and $y1$ are highly left-skewed, but in different ways. For example, $y1$ has five (!) repeated values of zero.

2) You didn't explain in much detail where your data come from, nor how they were measured, but this information is very important when it comes time to select a statistical procedure. Are your two samples above independent? Are there any reasons to believe that the marginal distributions of the two samples should be the same (except for a difference in location, for example)? What were the considerations prior to the study that led you to look for evidence of a difference between the two groups?

3) The t-test is not appropriate for these data because the marginal distributions are markedly non-normal, with extreme values in both samples. If you like, you could appeal to the CLT (due to your moderately-sized sample) to use a $z$-test (which would be similar to a $t$-test for large samples), but given the skewness (in both variables) of your data I would not judge such an appeal very convincing. Sure, you can use it anyway to calculate a $p$-value, but what does that do for you? If the assumptions aren't satisfied, then a $p$-value is just a statistic; it doesn't tell you what you (presumably) want to know: whether there is evidence that the two samples come from different distributions.

4) A permutation test would also be inappropriate for these data. The single and often-overlooked assumption for permutation tests is that the two samples are exchangeable under the null hypothesis. That would mean that they have identical marginal distributions (under the null). But you are in trouble, because the graphs suggest that the distributions differ both in location and scale (and shape, too). So, you can't (validly) test for a difference in location because the scales are different, and you can't (validly) test for a difference in scale because the locations are different. Oops. Again, you can do the test anyway and get a $p$-value, but so what? What have you really accomplished?

5) In my opinion, these data are a perfect (?) example that a well chosen picture is worth 1000 hypothesis tests. We don't need statistics to tell the difference between a pencil and a barn. The appropriate statement in my view for these data would be "These data exhibit marked differences with respect to location, scale, and shape." You could follow up with (robust) descriptive statistics for each of those to quantify the differences, and explain what the differences mean in the context of your original study.
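As a rough illustration of such robust summaries (the particular functions are my choice, not part of the original answer):

summ <- function(v) c(median=median(v), MAD=mad(v), IQR=IQR(v), min=min(v), max=max(v))
rbind(x1=summ(x1), y1=summ(y1))   # robust location and spread, plus the range, per group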

6) Your reviewer is probably (and sadly) going to insist on some sort of $p$-value as a precondition to publication. Sigh! If it were me, given the differences with respect to everything I would probably use a nonparametric Kolmogorov-Smirnov test to spit out a $p$-value that demonstrates that the distributions are different, and then proceed with descriptive statistics as above. You would need to add some noise to the two samples to get rid of ties. (And of course, this all assumes that your samples are independent which you didn't state explicitly.)
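A sketch of that last suggestion (the noise level is arbitrary and only meant to break the ties; this again presumes independent samples):

set.seed(1)                                  # make the added noise reproducible
ks.test(x1 + rnorm(length(x1), sd=1e-6),
        y1 + rnorm(length(y1), sd=1e-6))     # KS test after breaking ties with tiny noise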

This answer is a lot longer than I originally intended it to be. Sorry about that.

  • I'm curious if you'd consider the following an appropriate quasi-visualized approach: bootstrap estimates for the moments of the two groups (means, variances, and higher moments if you so desire), then plot these estimates and their confidence intervals, looking for the degree of overlap between groups on each moment. This lets you talk about potential differences across the variety of distribution characteristics. If the data are paired, compute the difference scores and bootstrap the moments of this single distribution (a rough sketch of this idea appears right after this comment thread). Thoughts? Commented Jan 10, 2011 at 17:21
  • (+1) Good analysis. You're perfectly right that the results are obvious and one doesn't need to press the point with a p-value. You may be a little extreme in your statement of (3), because the t-test does not require normally distributed data. If you are concerned, there exist adjustments for skewness (e.g., Chen's variant): you could see whether the p-value for the adjusted test changes the answer. If not, you're likely ok. In this particular situation, with these (highly skewed) data, the t-test works fine.
    – whuber
    Commented Jan 10, 2011 at 17:58
  • (+1) Nice catch! And very good comments.
    – chl
    Commented Jan 10, 2011 at 18:26
  • We seem to be accepting the notion that the underlying distribution is "similar" to the random instantiation. So couldn't one pose the question: are these both from beta(0.25, 0.25), and then the test would be whether they have the same (non-)centrality parameter? And wouldn't that justify using either a permutation test or Wilcoxon?
    – DWin
    Commented Jan 10, 2011 at 21:13
  • There is a lot of good info here, but it is very strongly worded & mixed w/ some things that don't make much sense to me. Eg, re #3, my understanding of the relationship b/t the z-test & the t-test is that they're essentially the same, but z is used when the SD is known a-priori & t is used when the SD is estimated from the data. I don't see how, if the CLT guarantees the normality of the sampling dists, this licenses the z-test while leaving the t-test invalid. I also don't think $\neq$ SD's invalidates t if the CLT covers you, so long as an adjustment (eg, Welch–Satterthwaite) is used. Commented May 24, 2012 at 16:58
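A rough sketch of the bootstrap idea raised in the first comment above (my own code using the boot package; since the data are paired, the difference scores are bootstrapped, and 2000 resamples is an arbitrary choice):

library(boot)                                    # for boot(), boot.ci()
d    <- x1 - y1                                  # paired difference scores
stat <- function(v, i) c(mean(v[i]), var(v[i]))  # first two moments; add more if desired
b    <- boot(d, stat, R=2000)
boot.ci(b, type="perc", index=1)                 # percentile CI for the mean difference
boot.ci(b, type="perc", index=2)                 # percentile CI for the variance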

As this question popped up again, I may add another answer, inspired by a recent blog post via R-Bloggers from Robert Kabacoff, the author of Quick-R and R in Action, that uses the lmPerm package.

However, this method produces results that contrast sharply (and are very unstable) with the ones produced by the coin package in the answer by @caracal (the p-value of the within-subjects analysis there is 0.008). The analysis takes the data preparation from @caracal's answer as well:

x1 <- c(99, 99.5, 65, 100, 99, 99.5, 99, 99.5, 99.5, 57, 100, 99.5, 
        99.5, 99, 99, 99.5, 89.5, 99.5, 100, 99.5)
y1 <- c(99, 99.5, 99.5, 0, 50, 100, 99.5, 99.5, 0, 99.5, 99.5, 90, 
        80, 0, 99, 0, 74.5, 0, 100, 49.5)

DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
id <- factor(rep(1:length(x1), 2)) 

library(lmPerm)

summary(aovp( DV ~ IV + Error(id)))

produces:

> summary(aovp( DV ~ IV + Error(id)))
[1] "Settings:  unique SS "

Error: id
Component 1 :
          Df R Sum Sq R Mean Sq
Residuals 19    15946       839


Error: Within
Component 1 :
          Df R Sum Sq R Mean Sq Iter Pr(Prob)  
IV         1     7924      7924 1004    0.091 .
Residuals 19    21124      1112                
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

If you run this multiple times, the p-value jumps around between ~.05 and ~.1.

Although this is an answer to the question, allow me to pose a question of my own at the end (I can move this to a new question if desired):
Any ideas why this analysis is so unstable and why it produces p-values that diverge so much from the coin analysis? Did I do something wrong?
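One way to gauge how much of this jumpiness is plain Monte-Carlo error (a quick check I would add here, not a definitive diagnosis) is to repeat the coin approximation several times and look at the spread of its p-values:

library(coin)
pvals <- replicate(20, as.numeric(
  pvalue(oneway_test(DV ~ IV | id, alternative="greater",
                     distribution=approximate(B=9999)))))
range(pvals)   # spread attributable to Monte-Carlo error alone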

  • It may be better to ask this as a separate question, if it really is a question you want answered, rather than another possible solution you want to list w/ the rest. I notice that you specify an error strata, but @caracal does not; that would be my first guess at the difference b/t this output & his. Also, when simulating, values typically jump around; for reproducibility, you specify the seed, eg set.seed(1); for greater precision in the MC estimate, you increase the number of iterations; I'm unsure if either of those are the 'right' answer to your question, but they are probably relevant. Commented May 24, 2012 at 17:22
  • Again, I suggest benchmarking the MC results against the manual calculations using the full permutation (re-randomization) test. See the code for your example for a comparison of oneway_anova() (always close to correct result) and aovp() (typically far off from the correct result). I don't know why aovp() gives wildly varying results, but at least for this case here they're implausible. @gung the last call to oneway_test(DV ~ IV | id, ...) in my original answer specified the error strata for the dependent case (different syntax than used by aov()).
    – caracal
    Commented May 24, 2012 at 19:11
  • @caracal, you're right. I didn't look at the last code block after the edit. I was looking at the top code block--sloppy on my part. Commented May 24, 2012 at 19:25
  • I don't really need the answer. It is just another possibility that is worth mentioning here. Unfortunately, it is far off from the other results, which is also worth noting.
    – Henrik
    Commented May 25, 2012 at 9:23
  • @Henrik run aovp with maxExact=1000. If it takes too long, set iter=1000000 and Ca=0.001. The calculation terminates when the estimated standard error of p is less than Ca*p. (Lower values give more stable results.)
    – xmjx
    Commented Jul 19, 2013 at 7:57

My comments are not about implementation of the permutation test but about the more general issues raised by these data and the discussion of it, in particular the post by G. Jay Kerns.

The two distributions actually look quite similar to me EXCEPT for the group of 0s in Y1, which are much different from the other observations in that sample (next smallest is about 50 on the 0-100 scale) as well as all those in X1. I would first investigate whether there was anything different about those observations.

Second, assuming those 0s do belong in the analysis, saying the permutation test isn't valid because the distributions appear to differ begs the question. If the null were true (distributions are identical), could you (with reasonable probability) get distributions looking as different as these two? Answering that's the whole point of the test, isn't it? Maybe in this case some will consider the answer obvious without running the test, but with these smallish, peculiar distributions, I don't think I would.

  • It appears that this should be one or more comments, not an answer. If you click the small gray "add comment", you can place your thoughts in the conversation below the question or a particular answer, where they belong. You do make substantive points here, but it is not clear that this is the appropriate place for them. Commented Aug 26, 2012 at 2:32
  • @gung It takes a little reputation to be able to post a comment ;-).
    – whuber
    Commented Aug 26, 2012 at 20:51
  • This is a good point about applicability of the permutation test. As to the obviousness of the difference between the groups, perhaps it's a matter of experience :-). For intuition, because it's apparent the key difference is in the small values, we might inquire about the chances that the seven smallest in a set of 40 values happened to fall in one random subset of 20. Roughly, each has about a 1/2 chance of being in the subset or its complement, so all seven will be in the same group with chances around $2(1/2)^7 \approx .01$. This mental arithmetic provides quick initial guidance (a one-line exact check of it appears right after this thread).
    – whuber
    Commented Aug 26, 2012 at 21:00
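For what it's worth, the back-of-the-envelope figure in the last comment can be checked exactly in one line of R (my arithmetic, not part of the comment):

2 * choose(33, 13) / choose(40, 20)   # P(all seven smallest land in the same half of 20); about 0.008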

Are these scores proportions? If so, you certainly shouldn't be using a Gaussian parametric test, and while you could go ahead with a non-parametric approach like a permutation test or a bootstrap of the means, I'd suggest that you'll get more statistical power by employing a suitable non-Gaussian parametric approach. Specifically, any time you can compute a proportion measure within a unit of interest (e.g., a participant in an experiment), you can and probably should use a mixed effects model that specifies observations with binomially distributed error. See Dixon 2004.
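A minimal sketch of what such a model could look like with lme4 (my illustration only; Dixon 2004 does not prescribe this code, and the data frame dat with columns successes, failures, condition, and id is hypothetical):

library(lme4)                                   # for glmer()
# dat (hypothetical): one row per participant x condition, with the score
# decomposed into binomial counts (successes out of successes + failures)
fit <- glmer(cbind(successes, failures) ~ condition + (1 | id),
             data=dat, family=binomial)
summary(fit)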

  • The scores are not proportions but estimates by participants on a 0 to 100 scale (the presented data are means of estimates on several items with that scale).
    – Henrik
    Commented Jan 10, 2011 at 13:57
  • Then non-parametrics would seem the traditional way to go. That said, I've wondered if such scale data might usefully be inferred to derive from a binomial process and thereby analyzed as such. That is, you say that each score is the mean of several items, and let's say each item is a 10 point scale, in which case I'd represent a response of, say, "8" as a series of trials, 8 of which have the value 1 and two of which have the value 0, all labelled with the same label in an "item" variable. With this expanded/binomial-ized data, you could then compute the binomial mixed effects model (a small sketch of this expansion follows this comment thread). Commented Jan 10, 2011 at 14:24
  • Following from my previous comment, I should note that in the expanded/binomial-ized data, you could model the "item" variable as either a fixed or random effect. I think I'd lean toward modelling it as a fixed effect because presumably you might be interested in not only accounting for but also assessing the item differences and any possible interaction between item and other predictor variables. Commented Jan 10, 2011 at 14:26
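A tiny sketch of the expansion described in the comments above (illustrative only; the 10-point item scale and the function name are assumptions):

# expand one hypothetical 10-point item response into Bernoulli "trials"
expand_item <- function(score, scale_max=10) {
  data.frame(trial=rep(c(1, 0), c(score, scale_max - score)))
}
expand_item(8)   # 8 ones and 2 zeros, to be labelled with participant and item IDs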

Just adding another approach: ezPerm() from the ez package:

> # preparing the data
> DV <- c(x1, y1)
> IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
> id <- factor(rep(1:length(x1), 2))
> df <- data.frame(id=id,DV=DV,IV=IV)
>
> library(ez)
> ezPerm( data = df, dv = DV, wid = id, within = IV, perms = 1000)
|=========================|100%              Completed after 17 s 
  Effect     p p<.05
1     IV 0.016     *

This seems to be consistent with oneway_test() from the coin package:

> library(coin)
> pvalue(oneway_test(DV ~ IV | id,  distribution=approximate(B=999999)))
[1] 0.01608002
99 percent confidence interval:
 0.01575782 0.01640682

However, notice that this is not quite the same example provided by @caracal. In his example, he includes alternative="greater", hence the difference in p-values: ~0.008 (one-sided) vs. ~0.016 (two-sided).

The aovp() function (from the lmPerm package) suggested in one of the answers produces suspiciously lower p-values and runs suspiciously fast, even when I try high values for the Iter, Ca, and maxIter arguments:

library(lmPerm)
summary(aovp(DV ~ IV + Error(id/IV), data=df,  maxIter = 1000000000))
summary(aovp(DV ~ IV + Error(id/IV), data=df,  Iter = 1000000000))
summary(aovp(DV ~ IV + Error(id/IV), data=df,  Ca = 0.00000000001))

That said, the arguments seem to slightly reduce the variation of the p-values from between ~.03 and ~.1 (I got a bigger range than the one reported by @Henrik) to between 0.03 and 0.07.


One more example: MKinfer::perm.t.test(). It's quite fast.

I don't know which one of the above it matches, because none of the posters above set a seed, so the results will never be exactly comparable. I use a seed of 1000.

> set.seed(1000)
> MKinfer::perm.t.test(x1, y1)

    Permutation Welch Two Sample t-test

data:  x1 and y1
(Monte-Carlo) permutation p-value = 0.007 
permutation difference of means (SE) = 28.1 (10.7) 
95 percent (Monte-Carlo) permutation percentile confidence interval:
  7.35 49.15

Results without permutation:
t = 3, df = 22, p-value = 0.009
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.67 48.63
sample estimates:
mean of x mean of y 
     95.1      67.0 

and for the second data setup (DV and IV as defined earlier in the thread), with a one-sided alternative:

> set.seed(1000)
> MKinfer::perm.t.test(DV~IV, alternative="greater")

    Permutation Welch Two Sample t-test

data:  DV by IV
(Monte-Carlo) permutation p-value = 0.004 
permutation difference of means (SE) = 28.1 (10.7) 
95 percent (Monte-Carlo) permutation percentile confidence interval:
 10.5  Inf

Results without permutation:
t = 3, df = 22, p-value = 0.005
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 11.2  Inf
sample estimates:
mean in group A mean in group B 
           95.1            67.0 