I have been trying to find a simple way to use the bootstrap for a hypothesis test that involves more than two samples. The motivation for using the bootstrap is the usual one: the test statistic is complicated, and we don't want to make parametric assumptions. One method that I think would work for my purpose is the so-called basic bootstrap, described in the textbook Bootstrap Methods and Their Application (Davison and Hinkley 1997, eq. 5.6, p. 194).
Problem formulation: We have 16 independent observations
$$ \{ x_i \}_{i=1}^{16} $$
where $\{ x_1, x_2, x_3, x_4 \}$, $\{ x_5, x_6, x_7, x_8 \}$, $\{ x_9, x_{10}, x_{11}, x_{12} \}$, $\{ x_{13}, x_{14}, x_{15}, x_{16} \}$ are four random samples drawn from four different populations. We denote the means of the respective populations as $\mu_1, \mu_2, \mu_3, \mu_4$. I want to test
$$ H_0: (\mu_1 - \mu_2) - (\mu_3 - \mu_4) = 0 \\ H_1: (\mu_1 - \mu_2) - (\mu_3 - \mu_4) \neq 0 $$
The test statistic is
$$ t = \left(\frac{x_1 + x_2 + x_3 + x_4}{4} - \frac{x_5 + x_6 + x_7 + x_8}{4}\right) - \left(\frac{x_9 + x_{10} + x_{11} + x_{12}}{4} - \frac{x_{13} + x_{14} + x_{15} + x_{16}}{4}\right) $$
Bootstrap: I resample each of the 4 sets independently, with replacement. That is, I draw random functions $\sigma : \{ 1, \dots, 16 \} \to \{ 1, \dots, 16 \}$, with each $\sigma(i)$ chosen independently and uniformly from the block containing $i$, so that
$$ \sigma(\{1, 2, 3, 4\}) \subseteq \{1, 2, 3, 4\} \\ \sigma(\{5, 6, 7, 8\}) \subseteq \{5, 6, 7, 8\} \\ \sigma(\{9, 10, 11, 12\}) \subseteq \{9, 10, 11, 12\} \\ \sigma(\{13, 14, 15, 16\}) \subseteq \{13, 14, 15, 16\} $$
Then the resampled statistic would be
$$ t^* = \left(\frac{x_{\sigma(1)} + x_{\sigma(2)} + x_{\sigma(3)} + x_{\sigma(4)}}{4} - \frac{x_{\sigma(5)} + x_{\sigma(6)} + x_{\sigma(7)} + x_{\sigma(8)}}{4}\right) - \left(\frac{x_{\sigma(9)} + x_{\sigma(10)} + x_{\sigma(11)} + x_{\sigma(12)}}{4} - \frac{x_{\sigma(13)} + x_{\sigma(14)} + x_{\sigma(15)} + x_{\sigma(16)}}{4}\right) $$
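In code, the block-wise resampling can be sketched like this (NumPy; the data here are made up and stand in for the 16 observations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for the 16 observations (not from the question).
x = rng.normal(size=16)
blocks = [x[i:i + 4] for i in range(0, 16, 4)]

def statistic(b1, b2, b3, b4):
    # t = (mean of block 1 - mean of block 2) - (mean of block 3 - mean of block 4)
    return (b1.mean() - b2.mean()) - (b3.mean() - b4.mean())

t = statistic(*blocks)

# Resample each block independently, with replacement, keeping block sizes at 4.
N = 999
t_star = np.empty(N)
for i in range(N):
    resampled = [rng.choice(b, size=b.size, replace=True) for b in blocks]
    t_star[i] = statistic(*resampled)
```

Resampling within blocks rather than pooling all 16 values keeps the four populations separate, which matches the constraint on $\sigma$ above.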
Assume we have $N = 999$ resamples $t^*_i$ with order statistics $t^*_{(i)}$. Then using the basic bootstrap method, we would have the $95\%$ (or $\alpha = 0.05$) two-sided confidence interval
$$ \left[ 2 t - t^*_{((N+1)(1-\alpha/2))}, 2 t - t^*_{((N+1)(\alpha/2))} \right] = \left[ 2 t - t^*_{(975)}, 2 t - t^*_{(25)} \right] $$
or one-sided confidence intervals
$$ \left[ 2 t - t^*_{((N+1)(1-\alpha))},\infty\right) = \left[ 2 t - t^*_{(950)}, \infty \right) \\ \left(-\infty, 2 t - t^*_{((N+1)\alpha)}\right] = \left(-\infty, 2 t - t^*_{(50)}\right] $$
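A minimal sketch of the two-sided basic-bootstrap interval and the rejection rule, again on placeholder data (regenerated so the snippet runs on its own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data and resamples, as in the earlier sketch.
x = rng.normal(size=16)
blocks = [x[i:i + 4] for i in range(0, 16, 4)]
stat = lambda bs: (bs[0].mean() - bs[1].mean()) - (bs[2].mean() - bs[3].mean())
t = stat(blocks)
N = 999
t_star = np.sort([stat([rng.choice(b, size=4, replace=True) for b in blocks])
                  for _ in range(N)])  # order statistics t*_(1) <= ... <= t*_(999)

alpha = 0.05
# Basic bootstrap, two-sided: [2t - t*_(975), 2t - t*_(25)] (1-based order stats).
lower = 2 * t - t_star[round((N + 1) * (1 - alpha / 2)) - 1]  # uses t*_(975)
upper = 2 * t - t_star[round((N + 1) * (alpha / 2)) - 1]      # uses t*_(25)
# Reject H0 at the 5% level if 0 lies outside the interval.
reject = not (lower <= 0 <= upper)
```

Note the `round(...)` before indexing: `(N + 1) * (1 - alpha / 2)` is not exactly 975 in floating point, so truncating with `int(...)` could pick the wrong order statistic.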
Thus, we can reject $H_0$ at the 5% significance level if $0$ is not in the interval. Although not stated in the reference above, I believe we can also use this process to obtain one-sided P-values for the test statistic $t$ using (respectively)
$$ p = \frac{1 + \sum_{i=1}^{999} \mathbf{1}\{ 2t - t^*_i \geq 0 \}}{1000} \\ p = \frac{1 + \sum_{i=1}^{999} \mathbf{1}\{ 2t - t^*_i \leq 0 \}}{1000} $$
or a two-sided P-value
$$ p = 2 \left(\frac{1 + \min\left(\sum_{i=1}^{999} \mathbf{1}\{ 2 t - t^*_i \geq 0 \}, \sum_{i=1}^{999} \mathbf{1}\{ 2 t - t^*_i \leq 0 \}\right)}{1000}\right) $$
(Note: the last value above can be $> 1$, in which case I would set it to $1$).
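The P-value formulas above can be sketched as follows, once more on self-contained placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data and resamples (regenerated so this snippet runs on its own).
x = rng.normal(size=16)
blocks = [x[i:i + 4] for i in range(0, 16, 4)]
stat = lambda bs: (bs[0].mean() - bs[1].mean()) - (bs[2].mean() - bs[3].mean())
t = stat(blocks)
N = 999
t_star = np.array([stat([rng.choice(b, size=4, replace=True) for b in blocks])
                   for _ in range(N)])

# One-sided P-values from the basic-bootstrap pivot 2t - t*; the "+1" in the
# numerator (and N + 1 = 1000 in the denominator) keeps the estimate above 0.
p_ge = (1 + np.sum(2 * t - t_star >= 0)) / (N + 1)
p_le = (1 + np.sum(2 * t - t_star <= 0)) / (N + 1)
# Two-sided: double the smaller one-sided value, capped at 1.
p_two = min(1.0, 2 * min(p_ge, p_le))
```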
Question: Does the above procedure for determining the confidence intervals and P-values seem correct, even though it uses a contrast of four sample means instead of the difference of two shown in most examples?