44
$\begingroup$

I use "boot" package to compute an approximated 2-sided bootstrapped p-value but the result is too far away from p-value of using t.test. I can't figure out what I did wrong in my R code. Can someone please give me a hint for this

time = c(14,18,11,13,18,17,21,9,16,17,14,15,
         12,12,14,13,6,18,14,16,10,7,15,10)
group=c(rep(1:2, each=12))
sleep = data.frame(time, group)

require(boot)
diff = function(d1,i){
    d = d1[i,]
    Mean= tapply(X=d$time, INDEX=d$group, mean)
    Diff = Mean[1]-Mean[2]
    Diff
}

set.seed(1234)
b3 = boot(data = sleep, statistic = diff, R = 5000, strata=sleep$group)

pvalue = mean(abs(b3$t) > abs(b3$t0))
pvalue 

The 2-sided bootstrapped p-value (pvalue) = 0.4804 but the 2-sided p-value of t.test is 0.04342. Both p-values are around 11 times difference. How can this happen?

$\endgroup$
3
  • $\begingroup$ how comes b3$t0 has two entries? $\endgroup$
    – Xi'an
    Commented Jan 7, 2012 at 7:07
  • 1
    $\begingroup$ it’s a colname! $\endgroup$
    – Elvis
    Commented Jan 7, 2012 at 7:07
  • 2
    $\begingroup$ You are calculating a $p$-value incorrectly. The documentation says that $t0$ is the observed statistic, not the null distribution as the notation would suggest. You need to come up with an estimate of the sampling dist-n under the null. See my answer for more info. Try mean(abs(b3$t0) < abs(b3$t-mean(b3$t))) for a bias uncorrected test. $\endgroup$
    – AdamO
    Commented May 5, 2017 at 17:05

3 Answers 3

36
$\begingroup$

You are using bootstrap to generate data under the empirical distribution of the observed data. This can be useful to give a confidence interval on the difference between the two means:

> quantile(b3$t,c(0.025,0.975))
     2.5%     97.5% 
0.4166667 5.5833333 

To get a $p$-value, you need to generate permutations under the null hypothesis. This can be done eg like this:

diff2 = function(d1,i){
    d = d1; 
    d$group <- d$group[i];  # randomly re-assign groups
    Mean= tapply(X=d$time, INDEX=d$group, mean)
    Diff = Mean[1]-Mean[2]
    Diff
}

> set.seed(1234)
> b4 = boot(data = sleep, statistic = diff2, R = 5000)
> mean(abs(b4$t) > abs(b4$t0))
[1] 0.046

In this solution, the size of groups is not fixed, you randomly reassign a group to each individual by bootstraping from the initial group set. It seems legit to me, however a more classical solution is to fix the number of individuals of each group, so you just permute the groups instead of bootstraping (this is usually motivated by the design of the experiment, where the group sizes are fixed beforehand):

> R <- 10000; d <- sleep
> b5 <- numeric(R); for(i in 1:R) { 
+    d$group <- sample(d$group, length(d$group)); 
+    b5[i] <- mean(d$time[d$group==1])-mean(d$time[d$group==2]); 
+ }
> mean(abs(b5) > 3)
[1] 0.0372
$\endgroup$
8
  • 11
    $\begingroup$ This is technically the permutation test, not a bootstrap p-value. $\endgroup$
    – AdamO
    Commented May 3, 2017 at 18:03
  • 1
    $\begingroup$ @AdamO I agree that what is presented in this answer is permutation test (and its slightly modified variant); this is because during the resampling the groups are pooled. In contrast, in a bootstrap-based test, values for each group should be sampled using only the data for that same group. Here is one answer explaining how to do it: stats.stackexchange.com/a/187630/28666. $\endgroup$
    – amoeba
    Commented May 3, 2017 at 19:06
  • $\begingroup$ @amoeba I think the answer you link is also a permutation test, related to the bootstrap only insofar as they involve resampling. It's perfectly fine to report, but for reporting it is two methods that are being used. A nonparametric bootstrap is technically unable to generate data under a null hypothesis. See my answer for some of how p-values are generated from a bootstrap distribution. $\endgroup$
    – AdamO
    Commented May 3, 2017 at 19:20
  • $\begingroup$ @AdamO I guess it's a question of terminology, but I don't see how the procedure described in the linked answer can be called a "permutation" test because nothing is permuted there: resampled values for each group are generated using the data from that group only. $\endgroup$
    – amoeba
    Commented May 3, 2017 at 19:31
  • 1
    $\begingroup$ Elvis, I think the first piece of code in your answer is permutation test too. When you resample, you pool the groups together! This is what defines permutation test. $\endgroup$
    – amoeba
    Commented May 4, 2017 at 18:41
48
$\begingroup$

There are numerous ways of calculating bootstrap CIs and p-values. The main issue is that it is impossible for the bootstrap to generate data under a null hypothesis. The permutation test is a viable resampling based alternative to this. To use a proper bootstrap you must make some assumptions about the sampling distribution of the test statistic.

A comment about lack of invariance of testing: it is entirely possible to find 95% CIs not inclusive of the null yet a p > 0.05 or vice versa. To have better agreement, the calculation of bootstrap samples under the null must be dope as $\beta^*_0 = \hat{\beta} - \hat{\beta}^*$ rather than $\beta^*_0 = \hat{\beta}^* - \hat{\beta} $. That is to say if the density is skewed right in the bootstrap sample, the density must be skewed left in the null. It is not really possible to invert tests for CIs with non-analytical (e.g. resampling) solutions such as this.

normal bootstrap

One approach is a normal bootstrap where you take the mean and standard deviation of the bootstrap distribution, calculate the sampling distribution under the null by shifting the distribution and using the normal percentiles from the null distribution at the point of the estimate in the original bootstrap sample. This is a reasonable approach when the bootstrap distribution is normal, visual inspection usually suffices here. Results using this approach are usually very close to robust, or sandwich based error estimation which is robust against heteroscedasticity and/or finite sample variance assumptions. The assumption of a normal test statistic is a stronger condition of the assumptions in the next bootstrap test I will discuss.

percentile bootstrap

Another approach is the percentile bootstrap which is what I think most of us consider when we speak of the bootstrap. Here, the bootstrapped distribution of parameter estimates an empirical distribution of the sample under the alternative hypothesis. This distribution can possibly be non-normal. A 95% CI is easily calculated by taking the empirical quantiles. But one important assumption is that such a distribution is pivotal. This means that if the underlying parameter changes, the shape of the distribution is only shifted by a constant, and the scale does not necessarily change. This is a strong assumption! If this holds, you can generate the "distribution of the statistic under the null hypothesis" (DSNH or $\mathcal{F}_0^*$) by subtracting the bootstrap distribution from the estimates, then calculating what percentage of the DSNH is "more extreme" than your estimate by using $2 \times \min (\mathcal{F}_0^*(\hat{\beta}), 1-\mathcal{F}_0^*(\hat{\beta}))$

Studentized bootstrap

The easiest bootstrap solution to calculating $p$-values is to use a studentized bootstrap. With each bootstrap iteration, calculate the statistic and its standard error and return the student statistic. This gives a bootstrapped student distribution for the hypothesis which can be used to calculate cis and p-values very easily. This also underlies the intuition behind the bias-corrected-accelerated bootstrap. The t-distribution shifts much more easily under the null since outlying results are downweighted by their corresponding high variance.

Programming example

As an example, I'll use the city data in the bootstrap package. The bootstrap confidence intervals are calculated with this code:

ratio <- function(d, w) sum(d$x * w)/sum(d$u * w)
city.boot <- boot(city, ratio, R = 999, stype = "w", sim = "ordinary")
boot.ci(city.boot, conf = c(0.90, 0.95),
        type = c("norm", "basic", "perc", "bca"))

and produce this output:

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL : 
boot.ci(boot.out = city.boot, conf = c(0.9, 0.95), type = c("norm", 
    "basic", "perc", "bca"))

Intervals : 
Level      Normal              Basic         
90%   ( 1.111,  1.837 )   ( 1.030,  1.750 )   
95%   ( 1.042,  1.906 )   ( 0.895,  1.790 )  

Level     Percentile            BCa          
90%   ( 1.291,  2.011 )   ( 1.292,  2.023 )   
95%   ( 1.251,  2.146 )   ( 1.255,  2.155 )  
Calculations and Intervals on Original Scale

The 95% CI for the normal bootstrap is obtained by calculating:

with(city.boot, 2*t0 - mean(t) + qnorm(c(0.025, 0.975)) %o% sqrt(var(t)[1,1]))

The p-value is thus obtained:

> with(city.boot, pnorm(abs((2*t0 - mean(t) - 1) / sqrt(var(t)[1,1])), lower.tail=F)*2)
[1] 0.0315

Which agrees that the 95% normal CI does not include the null ratio value of 1.

The percentile CI is obtained (with some differences due to methods for ties):

quantile(city.boot$t, c(0.025, 0.975))

And the p-value for the percentile bootstrap is:

cvs <- quantile(city.boot$t0 - city.boot$t + 1, c(0.025, 0.975))
mean(city.boot$t > cvs[1] & city.boot$t < cvs[2])

Gives a p of 0.035 which also agrees with the confidence interval in terms of the exclusion of 1 from the value. We cannot in general observe that, while the width of the percentile CI is nearly as wide as the normal CI and that the percentile CI is further from the null that the percentile CI should provide lower p-values. This is because the shape of the sampling distribution underlying the CI for the percentile method is non-normal.

$\endgroup$
6
  • $\begingroup$ It's a very interesting answer @AdamO, but could you provide with some examples? On R, you can use function boot.ci and use argument "type" to choose a studentized CI (you can also choose a BCA CI). However, how can you calculate p-values? Are you using the estimate or the test statistic? I had a similar question which reply would be greatly appreciated. $\endgroup$ Commented Aug 14, 2017 at 13:11
  • 1
    $\begingroup$ +1 for clear explanation of the benefits of the studentized bootstrap. $\endgroup$ Commented Aug 14, 2017 at 18:34
  • $\begingroup$ @KevinOunet I gave two examples of replicating p-values from CIs in the boot package. Does this help? $\endgroup$
    – AdamO
    Commented Feb 7, 2018 at 16:06
  • 2
    $\begingroup$ Thank you @AdamO, that helps indeed! Could you provide a last example for studentized bootstrap? $\endgroup$ Commented Feb 15, 2018 at 10:02
  • $\begingroup$ I have tried to use the percentile method from boot.pval r package and the results I found using the percentile approach here was different. I am more inclined to think that @AdamO approach is more reliable. But could you please provide a reference for your method of percentile bootstrap p-value calculation? $\endgroup$
    – StatsBio
    Commented Oct 26, 2021 at 11:27
40
$\begingroup$

The answer of Elvis relies on permutations but in my opinion it does not make clear what is wrong with the original bootstrap approach. Let me discuss a solution based solely on bootstrap.

The crucial problem of your original simulation is that bootstrap always provides you with the TRUE distribution of the test statistic. However, when computing the p-value you have to compare the obtained value of the test statistic to its distribution UNDER H0, i.e. not with the true distribution!

[Let's make it clear. For example, it is known that the test statistic T of the classical t-test has the classical "central" t-distribution under H0 and a noncentral distribution in general. However, everyone is familiar with the fact that the observed value of T is compared to the classical "central" t-distribution, i.e. one does not try to obtain the true [noncenral] t-distribution to make the comparison with T.]

Your p-value 0.4804 is so large, because the observed value "t0" of the test statistic Mean[1]-Mean[2] lies very close to the centre of the bootstrapped sample "t". It is natural and typically it is always so [i.e. irrespective of the validity of H0], because the bootstrapped sample "t" emulates the the ACTUAL distribution of Mean[1]-Mean[2]. But, as noted above [and also by Elvis], what you really need is the distribution of Mean[1]-Mean[2] UNDER H0. It is obvious that

  1. under H0 the distribution of Mean[1]-Mean[2] will be centered around 0,

  2. its shape does not depend on the validity of H0.

These two points imply that the distribution of Mean[1]-Mean[2] under H0 can be emulated by the bootstrapped sample "t" SHIFTED so that it is centered around 0. In R:

b3.under.H0 <- b3$t - mean(b3$t)

and the corresponding p-value will be:

mean(abs(b3.under.H0) > abs(b3$t0))

which gives you a "very nice" value of 0.0232. :-)

Let me note that the the point "2)" mentioned above is called "translation equivariance" of the test statistic and it does NOT have to hold in general! I.e. for some test statistics, shifting of the bootstrapped "t" does not provide you with a valid estimate of the distribution of the test statistic under HO! Have a look at this discussion and especially at the reply of P. Dalgaard.

Your testing problem does yields a perfectly symmetric distribution of the test statistic, but keep in mind that there are some problems with obtaining TWO-SIDED p-values in case of skewed bootstrapped distribution of the test statistic. Again, read the above link.

[And finally, I would use the "pure" permutation test in your situation; i.e. the second half of Elvis answer. :-)]

$\endgroup$
1
  • 3
    $\begingroup$ Even if permutation test is preferable to bootstrap here, this answer directly answers the question of what went wrong with the original approach and really clarifies the underlying conceptual issue. I made essentially the same mistake, finding theoretical and bootstrapped p-values in major disagreement, and google having matched this question to my query well, I found this most helpful. $\endgroup$
    – rmwenz
    Commented May 31, 2021 at 0:40

Not the answer you're looking for? Browse other questions tagged or ask your own question.