
I wonder if someone knows any general rules of thumb regarding the number of bootstrap samples one should use, based on characteristics of the data (number of observations, etc.) and/or the variables included?

  • I've been curious about this too, as I'm planning a simulation analysis. Is there any reason not to go for as many samples as are feasible/practicable? Aside from environmental concerns (e.g., electricity expenditure) and personal concerns (e.g., exceeding critical thresholds for sustainable nerdiness, transitioning into pure geekdom), I'm not seeing any contraindications in the answers so far (+1s all around, BTW)... Commented Feb 10, 2014 at 20:02
  • @Nick I largely agree -- I generally use as many as I can afford to wait for (usually topping out at a million, though not always), but typically regard 1000 as a pretty clear lower bound. As a first try I often do 1K to get timing information, and then work out how many multiples of that I'm prepared to wait for the actual answer.
    – Glen_b
    Commented Feb 11, 2014 at 3:58
  • If the time-consuming part of the process is generating simulations, and observations from them can be aggregated easily (as they often can with a little extra coding), it seems like there's little excuse not to err on the side of overachieving. I guess it could escalate out of hand over time if people all did this and forgot why, but since that's probably never going to be the case... Having a minimal threshold that people aim for needlessly seems a bit counterproductive, if the alternative (just going for more until there's really no remaining room for doubt) is thereby discouraged implicitly. Commented Feb 11, 2014 at 4:29
  • I just bootstrap until I see a clear convergence. If you want to ease concerns of reviewers, I would just include a visualization of bootstrap iterations vs. resulting estimate to illustrate the convergence.
    – user35780
    Commented Apr 28, 2017 at 13:13
  • North et al. (2002) provides some guidelines I have found helpful: DOI 10.1086/341527 [ncbi.nlm.nih.gov/pmc/articles/PMC379178/pdf/AJHGv71p439.pdf] Commented Nov 11, 2019 at 22:11

7 Answers

Answer (score 53)

My experience is that statisticians won't take simulations or bootstraps seriously unless the number of iterations exceeds 1,000. Monte Carlo (MC) error is a big issue that's a little underappreciated. For instance, this paper used Niter = 50 to demonstrate LASSO as a feature selection tool. My thesis would have taken a lot less time to run had 50 iterations been deemed acceptable! I recommend always inspecting the histogram of the bootstrap samples: their distribution should appear fairly regular. I don't think any plain numerical rule will suffice, and it would be overkill to perform, say, a double bootstrap to assess MC error.

Suppose you were estimating the mean of a ratio of two independent standard normal random variables; someone might recommend bootstrapping it since the integral is difficult to compute. With basic probability theory under your belt, you would recognize that this ratio is a Cauchy random variable, whose mean does not exist. Any leptokurtic distribution requires many more bootstrap iterations than a better-behaved Gaussian counterpart, and in this case 1,000, 100,000, or 10,000,000 bootstrap samples would all be insufficient to estimate something that doesn't exist. The histogram of these bootstrap replicates would continue to look irregular and wrong.
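
As a concrete illustration, here is a minimal R sketch (my own, not part of the original answer) comparing the bootstrap distribution of the mean for well-behaved Gaussian data and for Cauchy-like data; the sample size, number of replicates, and seed are arbitrary choices for demonstration.

```r
# Minimal sketch (not from the original answer): bootstrap the mean of a
# Cauchy-like sample (ratio of two standard normals) and of a Gaussian sample,
# then compare the histograms of the bootstrap replicates.
set.seed(1)
n <- 200      # hypothetical sample size
B <- 10000    # number of bootstrap replicates

x_cauchy   <- rnorm(n) / rnorm(n)   # the mean of this distribution does not exist
x_gaussian <- rnorm(n)              # well-behaved comparison sample

boot_means <- function(x, B) {
  replicate(B, mean(sample(x, length(x), replace = TRUE)))
}

bm_cauchy   <- boot_means(x_cauchy, B)
bm_gaussian <- boot_means(x_gaussian, B)

op <- par(mfrow = c(1, 2))
hist(bm_gaussian, breaks = 50, main = "Bootstrap means, Gaussian data")
hist(bm_cauchy,   breaks = 50, main = "Bootstrap means, Cauchy-like data")
par(op)
# The Gaussian histogram looks regular; the Cauchy-like one stays ragged and
# dominated by a few extreme replicates no matter how large B gets.
```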

There are a few more wrinkles to that story. In particular, the bootstrap is only really justified when the moments of the data-generating probability model exist. That's because you are using the empirical distribution function as a stand-in for the actual probability model, assuming they have the same mean, standard deviation, skewness, 99th percentile, etc.

In short, a bootstrap estimate of a statistic and its standard error can be trusted only when the histogram of the bootstrap replicates appears regular beyond reasonable doubt and when the bootstrap itself is justified in the first place (i.e. the relevant moments of the data-generating model exist).

  • I've always seen large bootstrap samples as well. However, in "An Introduction to the Bootstrap" (1994), Efron and Tibshirani report that you can get a decent estimate with B = 25, and that with B = 200 the coefficient of variation approaches that of B = infinity. They provide a table of the coefficients of variation for various B (pp. 52-53; both pages are available on Google Books). Commented Nov 24, 2015 at 20:14
  • What does MC error mean? Commented Mar 14, 2023 at 14:36
  • "MC error" refers to Monte Carlo error. This is essentially deviation arising from the random sampling process. Functions in R like sample() generate pseudorandom numbers, which themselves arise through a deterministic and reproducible process. Commented Apr 23, 2023 at 23:34
Answer (score 24)

I start by responding to something raised in another answer: why such a strange number as "$599$" (number of bootstrap samples)?

This applies also to Monte Carlo tests (to which bootstrap tests are equivalent when the underlying statistic is pivotal), and it comes from the following requirement: for the test to be exact, with $\alpha$ the desired significance level and $B$ the number of bootstrap samples, the following relation must hold:

$$\alpha \cdot (1+B) = \text{integer}$$

Now consider the typical significance levels $\alpha_1 = 0.1$ and $\alpha_2 = 0.05$.

We have

$$B_1 = \frac {\text{integer}}{0.1} - 1,\;\;\; B_2 = \frac {\text{integer}}{0.05} - 1$$

This "minus one" is what leads to proposed numbers like "$599$", in order to ensure an exact test.

I took the following information from Davidson, R., & MacKinnon, J. G. (2000). Bootstrap tests: How many bootstraps? Econometric Reviews, 19(1), 55-68 (the working paper version is freely downloadable).

As regards a rule of thumb, the authors examine the case of bootstrapped p-values and suggest that for tests at the $0.05$ level the minimum number of samples is about 400 (so $399$), while for tests at the $0.01$ level it is about 1500 (so $1499$).

They also propose a pre-testing procedure to determine $B$ endogenously. After simulating their procedure they conclude:

"It is easy to understand why the pretesting procedure works well. When the null hypothesis is true, B can safely be small, because we are not concerned about power at all. Similarly, when the null is false and test power is extremely high, B does not need to be large, because power loss is not a serious issue. However, when the null is false and test power is moderately high, B needs to be large in order to avoid loss of power. The pretesting procedure tends to make B small when it can safely be small and large when it needs to be large."

At the end of the paper they also compare it to another procedure that has been proposed in order to determine $B$ and they find that theirs performs better.

  • What does "integer" stand for here? Commented Sep 6, 2021 at 13:21
  • And by "bootstrap samples" we mean "drawing $x$ numbers $B$ times" from some distribution, correct? But what should the minimal size $x$ be if the dataset has, say, 1M examples? Commented Sep 6, 2021 at 13:33
Answer (score 23)

Edit:

If you are serious about having enough samples, run your bootstrap procedure with what you hope are enough samples a number of times and see how much the bootstrap estimates "jump around". If the repeated estimates do not differ much (where "much" depends on your specific situation), you are most likely fine. Of course, you can quantify how much the repeated estimates jump around by calculating their sample SD or similar.
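
For instance, a minimal R sketch of that check (my own; the data, statistic, and numbers are arbitrary assumptions for illustration):

```r
# Rerun the whole bootstrap several times and see how much the estimate of the
# bootstrap standard error "jumps around".
set.seed(42)
x <- rexp(100)     # hypothetical data set
B <- 2000          # candidate number of bootstrap samples

one_bootstrap_se <- function(x, B) {
  # bootstrap standard error of the median
  sd(replicate(B, median(sample(x, length(x), replace = TRUE))))
}

reps <- replicate(20, one_bootstrap_se(x, B))   # 20 independent bootstrap runs
summary(reps)
sd(reps)   # if this is small relative to mean(reps), B is probably large enough
```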

If you want a reference and a rule of thumb, Wilcox (2010) writes that "599 is recommended for general use." But this should be considered only a guideline, or perhaps the minimum number of samples you should consider. If you want to be on the safe side, there is no reason (if it is computationally feasible) why you should not generate an order of magnitude more samples.

On a personal note I tend to run 10,000 samples when estimating "for myself" and 100,000 samples when estimating something passed on to others (but this is quick as I work with small datasets).

Reference

Wilcox, R. R. (2010). Fundamentals of modern statistical methods: Substantially improving power and accuracy. Springer.

  • 599? Five hundred ninety-nine? What on Earth could be an argument in favour of this number?
    – amoeba
    Commented Feb 10, 2014 at 19:07
  • Ask Wilcox (2010), I guess... I'm curious too though; maybe Rasmus would grace us with a little more context surrounding the quote? Commented Feb 10, 2014 at 19:54
  • It's unclear to me where 599 comes from too... I've added some better advice to the answer, though... Commented Feb 10, 2014 at 22:44
  • @amoeba You can read the "passage" for yourself. This is an example of exceptionally unclear writing in statistics, and in particular it applies only to inference on the trimmed mean with Winsorized standard error estimates.
    – AdamO
    Commented Feb 11, 2014 at 3:24
Answer (score 16)

There are some situations where you can tell either beforehand or after a few iterations that huge numbers of bootstrap iterations won't help in the end.

  • You hopefully have an idea beforehand of the order of magnitude of precision required for meaningful interpretation of the results. If you don't, maybe it is time to learn a bit more about the problem behind the data analysis. In any case, after a few iterations you may be able to estimate how many more iterations are needed.

  • Obviously, if you have extremely few cases (say, the ethics committee allowed 5 rats), you don't need to think about tens of thousands of iterations. Maybe it would be better to look at all possible draws. And maybe it would be even better to stop and think about how certain any kind of conclusion can (not) be when based on 5 rats.

  • Think about the total uncertainty of the results. In my field, the part of the uncertainty that you can measure and reduce by bootstrapping may be only a minor part of the total uncertainty (e.g. due to restrictions in the design of the experiments, important sources of variation are often not covered by the experiment - say, we start with experiments on cell lines although the final goal will of course be patients). In this situation it doesn't make sense to run too many iterations: it won't help the final result anyway, and moreover it may introduce a false sense of certainty.

  • A related (though not exactly the same) issue occurs during out-of-bootstrap or cross validation of models: you have two sources of uncertainty, the finite (and in my case usually very small) number of independent cases and the (in)stability of the bootstrapped models. Depending on the setup of your resampling validation, you may have only one of them contributing to the resampling estimate. In that case, you can use an estimate of the other source of variance to judge what certainty you need to achieve with the resampling, and when it stops helping the final result.

  • Finally, while my thoughts so far were about how to do fewer iterations, here's a practical consideration in favor of doing more:
    In practice my work is not done after the bootstrap is run. The output of the bootstrap needs to be aggregated into summary statistics and/or figures, the results need to be interpreted, and the paper or report needs to be written. Much of this can already be done with preliminary results from a few iterations of the bootstrap (if the results are clear, they show up already after a few iterations; if they are borderline, they'll stay borderline). So I often set up the bootstrapping in a way that allows me to pull preliminary results, so I can go on working while the computer computes. That way it doesn't bother me much if the bootstrapping takes another few days.

Answer (score 14)

TL;DR: 10,000 seems to be a good rule of thumb; e.g., with this many or more bootstrap samples, the estimated p-value will be within 0.01 of the "true" p-value for the method about 95% of the time.

I only consider the percentile bootstrap approach below, which is the most commonly used method (to my knowledge) but also admittedly has weaknesses and shouldn't be used with small samples.

Reframing slightly. It can be useful to compute the uncertainty that the bootstrap procedure itself adds to the results, to get a sense of how much of the final uncertainty comes from using a finite number of bootstrap samples. Note that this does not address possible weaknesses of the bootstrap (e.g. see the link above), but it does help evaluate whether there are "enough" bootstrap samples in a particular application. Generally, the error related to the number of bootstrap samples N goes to zero as N goes to infinity, and the question is: how big should N be for the error associated with a small number of bootstrap samples to be small?

Bootstrap uncertainty in a p-value. The imprecision in an estimated p-value pv_est (the p-value estimated from the bootstrap) is about 2 * sqrt(pv_est * (1 - pv_est) / N), where N is the number of bootstrap samples. This is valid if pv_est * N and (1 - pv_est) * N are both >= 10. If one of these is smaller than 10, the formula is less precise but still gives very roughly the right neighborhood.

Bootstrap error in a confidence interval. If using a 95% confidence interval, look at the variability of the quantiles of the bootstrap distribution near 2.5% and 97.5% by checking the percentiles at (for the 2.5th percentile) 2.5 +/- 2 * 100 * sqrt(0.025 * 0.975 / N). This communicates the uncertainty of the lower end of the 95% confidence interval due to the number of bootstrap samples taken; a similar check should be done at the upper end. If this estimate is somewhat volatile, be sure to take more bootstrap samples!
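
A minimal R sketch of these two checks (my own worked version of the formulas above, with illustrative numbers):

```r
# Monte Carlo error of a bootstrap p-value as a function of N.
pv_mc_error <- function(pv_est, N) 2 * sqrt(pv_est * (1 - pv_est) / N)
pv_mc_error(0.5, 10000)    # worst case ~0.01, which is where the TL;DR comes from

# Percentile positions to check for the lower end of a 95% percentile CI.
lower_end_check <- function(N) 2.5 + c(-1, 1) * 2 * 100 * sqrt(0.025 * 0.975 / N)
lower_end_check(1000)    # roughly 1.5 and 3.5 (percent)
lower_end_check(10000)   # roughly 2.2 and 2.8 (percent)
```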

  • Pick any $n$ and any multiplicative value (2-fold, 10-fold?), and I can give you a probability model for which maximum likelihood has that value as relative efficiency to the bootstrap.
    – AdamO
    Commented Feb 12, 2014 at 21:18
  • The bootstrap uncertainty formula is intriguing. What is its basis?
    – Tripartio
    Commented Jul 21, 2023 at 4:09
Answer (score 6)

Most bootstrapping applications I have seen report around 2,000 to 100,000 iterations. In modern practice with adequate software, the salient issues with the bootstrap are the statistical ones, more so than time and computing capacity. For novice users working in Excel, only a few hundred iterations may be feasible before advanced Visual Basic programming is required. R, however, makes generating thousands of bootstrap replicates easy and straightforward.
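
As an illustration of how little code this takes, here is a minimal R sketch using the boot package (my own choice of package, data, and statistic, not the answer's):

```r
library(boot)

x <- rnorm(50)                                  # hypothetical data
stat <- function(data, idx) mean(data[idx])     # statistic of interest
b <- boot(x, stat, R = 10000)                   # 10,000 bootstrap replicates
boot.ci(b, type = "perc")                       # percentile confidence interval
```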

  • Bootstrapping is easy to code from scratch in almost any programming language, especially modern scripting languages, and many languages also have implementations available already. The same can be said for sampling permutations.
    – Galen
    Commented Nov 27, 2021 at 3:48
Answer (score 5)

Data-driven theory-backed procedure

If you want a formal treatment of the subject, a good method comes from a pioneering paper by Andrews & Buchinsky (2000, Econometrica): do a small number of bootstrap replications, see how stable or noisy the estimator is, and then, based on a target accuracy measure, increase the number of replications until you are confident that the resampling-related error is below a chosen bound with a chosen certainty. Our helper here is a central limit theorem in which the asymptotics are in B: B is chosen depending on a user-chosen bound on the relative deviation of the Monte Carlo approximation of the quantity of interest based on B simulations. This quantity can be a standard error, a p-value, a confidence interval, or a bias correction.

The closeness is measured by the relative deviation $R^*$ of the B-replication bootstrap quantity from the infinite-replication quantity (or, to be more precise, the one that requires $n^n$ replications): $R^* := (\hat\lambda_B - \hat\lambda_\infty)/\hat\lambda_\infty$. The idea is to find a B such that the actual relative deviation of the statistic of interest is less than a chosen bound (usually 5%, 10%, 15%) with a specified high probability $1-\tau$ (usually $\tau = 5\%$ or $10\%$). Then,

$$\sqrt{B} \cdot R^* \xrightarrow{d} \mathcal{N}(0, \omega),$$

where $\omega$ can be estimated using a relatively small (usually 200–300) preliminary bootstrap sample that one should be doing in any case.

Here is the general formula for the number of necessary bootstrap replications $B$:

$$ B \ge \omega \cdot (Q_{\mathcal{N}(0, 1)}(1-\tau/2) / r)^2,$$

where $r$ is the maximum allowed relative discrepancy (i.e. accuracy), $1-\tau$ is the probability that this desired relative accuracy bound is achieved, $Q_{\mathcal{N}(0, 1)}$ is the quantile function of the standard Gaussian distribution, and $\omega$ is the asymptotic variance of $R^*$. The only unknown quantity here is $\omega$, which represents the variance due to simulation randomness.

The general 3-step procedure for choosing B is like this:

  1. Compute the approximate preliminary number $B_1 := \lceil \omega_1 (Q_{\mathcal{N}(0, 1)}(1-\tau/2) / r)^2 \rceil$, where $\omega_1$ is a very simple theoretical formula from Table III in Andrews & Buchinsky (2000, Econometrica).
  2. Using these $B_1$ samples, compute an improved estimate $\hat\omega_{B_1}$ using a formula from Table IV (ibid.).
  3. With this $\hat\omega_{B_1}$ compute $B_2 := \lceil\hat\omega_{B_1} (Q_{\mathcal{N}(0, 1)}(1-\tau/2) / r)^2 \rceil$ and take $B_{\mathrm{opt}} := \max(B_1, B_2)$.

If necessary, this procedure can be iterated to improve the estimate of $\omega$, but the 3-step procedure as it stands already tends to yield conservative estimates that ensure the desired accuracy is achieved. A simplified version is to take a fixed $B_1 = 1000$ (doing 1000 bootstrap replications in any case) and then carry out steps 2 and 3 to compute $\hat\omega_{B_1}$ and $B_2$; a sketch of the final calculation is given below.
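
A minimal R sketch of that final calculation (my own; omega_hat below is a made-up placeholder value, in practice it comes from the Table IV estimator in Andrews & Buchinsky):

```r
# Final step of the Andrews & Buchinsky rule: B >= omega * (q_{1 - tau/2} / r)^2.
required_B <- function(omega_hat, r, tau) {
  ceiling(omega_hat * (qnorm(1 - tau / 2) / r)^2)
}

omega_hat <- 1.4   # hypothetical estimate from ~1000 preliminary replications
required_B(omega_hat, r = 0.10, tau = 0.10)   # a few hundred replications
required_B(omega_hat, r = 0.05, tau = 0.05)   # a couple of thousand replications
```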

Example (Table V, ibid.): to compute a bootstrap 95% CI for the linear regression coefficients, in most practical settings, to be 90% sure that the relative CI length discrepancy does not exceed 10%, 700 replications are sufficient in half of the cases, and to be 95% sure, 850 replications. However, requiring a smaller relative error (5%) increases B to 2000 for $\tau=10\%$ and to 2700 for $\tau=5\%$.

This agrees with the formula for B above: if one seeks to reduce the relative discrepancy $r$ by a factor of $k$, the optimal B goes up roughly by a factor of $k^2$, whilst increasing the confidence level that the desired closeness is reached merely changes the critical value of the standard normal (1.96 → 2.57 for 95% → 99% confidence).

Concise practical advice

This being said, we should realise that not everyone is a theoretical econometrician with deep bootstrap knowledge, so here is my quick rule of thumb.

  1. B >= 1000, otherwise your paper will be rejected with something like ‘We are not in the Pentium-II era’ from Referee 2.
  2. Ideally, B >= 10000; try to do it if your computer can handle it.
  3. You could check if your B yields the desired probability $1-\tau$ of achieving the desired relative accuracy $r$ for the values thereof that are psychologically comfortable for you (e.g. $r= 5\%$ and $\tau=5\%$).
  4. If not, increase B to the value dictated by the A&B 3-stage procedure described above.
  5. In general, for any actual accuracy of your bootstrapped quantity, to increase the desired relative accuracy by a factor of k, increase B by a factor of $k^2$.

Happy bootstrapping!

  • Hello! Is r in your formula the same as alpha in Andrews & Buchinsky's Table 3? Commented Feb 16 at 16:21
  • @GabrielDeOliveiraCaetano r? Maybe tau? I tried following the notation of A&B2000 as closely as possible, and yes, their Table 3 should be the right reference. Commented Feb 19 at 22:46
  • I meant that in your formula you use r to represent the accuracy ("where r is the maximum allowed relative discrepancy (i.e. accuracy)"), but in Table 3 of A&B2000 they use an alpha, which I think represents the same thing, though I am not sure I understood it correctly. Sorry, I'm not trying to nitpick your notation; it is just that I'm trying to implement this in my analysis and I want to make sure I'm doing it right. Commented Feb 20 at 12:40
  • @GabrielDeOliveiraCaetano Sorry for the delay. A&B2000 use the notation pdb for the percentage deviation bound (relative deviation ⋅ 100), and in the post above I am not multiplying; $R^*$ is the relative deviation. In their Table 3, depending on the quantity being bootstrapped, $\alpha$ may appear in the desired $\alpha$-quantile (an unknown value that one is estimating via bootstrap). If you are interested in confidence intervals, then $\alpha$ is the quantile. However, if you are using standard errors, you have no $\alpha$. In general $R^* := (\hat\lambda_B - \hat\lambda_\infty)/\hat\lambda_\infty$ with $\sqrt{B}\,R^*$ asymptotically $\mathcal{N}(0, \omega)$, but the $\omega$ for the SE, bias, and p-value does not depend on $\alpha$. Commented Mar 6 at 15:09
