
Say I have a sample of finite size $N$, from which I compute an estimate $\hat{\theta}$ of some statistic $\theta$. I want to plot this sample estimate with an error bar.

To compute the error, I am using bootstrapping, i.e. taking samples with replacement from my original sample, each of the same size $N$. I then compute the statistic on each bootstrap sample, $\theta^*$, and bin these values into a bootstrap distribution.
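For concreteness, here is a minimal sketch of that resampling step in Python/NumPy; `sample` and `statistic` are placeholder names for my own data and estimator, not anything standard:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_replicates(sample, statistic, B=10_000):
    """Draw B bootstrap samples (rows resampled with replacement,
    each of the original size N) and evaluate the statistic on each."""
    sample = np.asarray(sample)
    N = sample.shape[0]
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)   # indices drawn with replacement
        reps[b] = statistic(sample[idx])
    return reps
```

Binning the returned array into a histogram gives the bootstrap distribution of $\theta^*$, while `statistic(sample)` is the original estimate $\hat{\theta}$.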

If the bootstrap assumption holds true, meaning my sample is representative of the underlying population, then

  1. The bootstrap standard error will approximate the actual standard error. In other words, the standard deviation of the bootstrap distribution will approximate the standard deviation of the sampling distribution of $\theta$, built by taking many samples of size $N$ with replacement from the population, and binning all their sample estimates $\hat{\theta}$. Mathematically, $\sigma(\theta^*)\approx \sigma(\hat{\theta})$.
  2. The bias of the bootstrap distribution will approximate the bias of the sampling distribution. In other words, the distance between the mean of the bootstrap distribution and my original sample estimate will approximate the distance between the mean of the sampling distribution and the true value (computed from the population). Mathematically, $\overline{\theta^*} -\hat{\theta} \approx \overline{\hat{\theta}}-\theta$. (Both quantities are computed in the sketch just after this list.)
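Continuing the sketch above, both quantities can be read straight off the bootstrap replicates (the variable names are mine, purely illustrative):

```python
import numpy as np

def bootstrap_se_and_bias(theta_star, theta_hat):
    """Bootstrap standard error and bootstrap estimate of the bias,
    given the replicates theta_star and the original estimate theta_hat."""
    theta_star = np.asarray(theta_star)
    se_boot = theta_star.std(ddof=1)            # approximates sigma(theta_hat)
    bias_boot = theta_star.mean() - theta_hat   # approximates mean(theta_hat) - theta
    return se_boot, bias_boot
```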

Coming back to my aim: I do not have access to the population, only to a sample of it, and I want to show a sensible error bar on my sample estimate, $\hat{\theta}$.

If the bootstrap assumption holds true and I find that, after building the bootstrap distribution from my sample, the bias is negligible $(\overline{\theta^*} -\hat{\theta} \approx 0)$, then the bootstrap standard error, $\sigma(\theta^*)$, is a measure both of

  1. The precision of my sample estimate, as $\sigma(\theta^*)\approx\sigma(\hat{\theta})$, which is the variability of sample estimates around the mean.
  2. The accuracy of my sample estimate, as $\overline{\theta^*} -\hat{\theta} \approx \overline{\hat{\theta}}-\theta\approx 0$, which means that the standard error measures the variability of sample estimates around the true (or "correct") value.

This is in line with the Wikipedia page on error bars, which says:

Error bars are graphical representations of the variability of data and used on graphs to indicate the error or uncertainty in a reported measurement. They give a general idea of how precise a measurement is, or conversely, how far from the reported value the true (error free) value might be.

As I understand it, the last part is a statement of accuracy.

Problem

Now, all of that is fine, but the question is what to do when there is actually a non-zero bias. In this case, point 1 above still holds: the bootstrap standard error measures the precision of my sample estimate. However, point 2 does not hold, i.e. the variability that the standard error measures is no longer around the true value (as it is displaced from the mean), so the standard error is no longer related to the accuracy of my sample estimate.

I am not happy with that because I would like my errors to indicate, as the wikipedia article states, how far from the reported value the true (error free) value might be.

Potential solutions

I can think of two different ways of achieving what I want, but I am not sure they are sensible, as I am not an expert in statistics. Let me call the non-zero bias $\alpha = \overline{\theta^*} -\hat{\theta} \approx \overline{\hat{\theta}}-\theta$. The potential solutions are

  1. Given the last equality above, $\alpha$ is (approximately) the average deviation of the sample estimates from the true value. A simple, perhaps naïve, way to deal with it could be to "correct" for it, subtracting it from my sample estimate to get $\hat{\theta}-\alpha$. The result would be a (bootstrap) bias-corrected estimate, which I would then show with the bootstrap standard error as its error bar.
  2. Instead of computing the bootstrap standard error, $\sigma(\theta^*)$, i.e. the (square root of the) mean squared deviation of the bootstrap estimates from their mean, $\overline{\theta^*}$, I could compute their deviation from my sample estimate, $\hat{\theta}$. Calling this quantity $\epsilon(\hat{\theta})$, and using $\hat{\theta}=\overline{\theta^*}-\alpha$, we have $$\begin{align} \epsilon^2(\hat{\theta}) &= \frac{1}{B}\sum_{b=1}^{B}(\theta^*_b-\hat{\theta})^2\\ &= \frac{1}{B}\sum_{b=1}^{B}(\theta^*_b-\overline{\theta^*}+\alpha)^2\\ &= \frac{1}{B}\sum_{b=1}^{B}\left[(\theta^*_b-\overline{\theta^*})^2+\alpha^2+2\alpha(\theta^*_b-\overline{\theta^*})\right]\\ &= \frac{1}{B}\sum_{b=1}^{B}(\theta^*_b-\overline{\theta^*})^2+\alpha^2+\frac{2\alpha}{B}\sum_{b=1}^{B}(\theta^*_b-\overline{\theta^*})\\ &= \sigma^2(\theta^*)+\alpha^2+2\alpha\,(\overline{\theta^*}-\overline{\theta^*})\\ &= \sigma^2(\theta^*)+\alpha^2. \end{align}$$ So this quantity adds the bias in quadrature to the bootstrap standard error, $$\epsilon(\hat{\theta})=\sqrt{\sigma^2(\theta^*)+\alpha^2},$$ and, as a consequence, it is a measure of both the precision and the accuracy of my sample estimate $\hat{\theta}$, as I would like. As the bias approaches zero, it reduces to the bootstrap standard error, $\epsilon(\hat{\theta})\to\sigma(\theta^*)$. (A code sketch of both potential solutions follows this list.)
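A minimal sketch of both potential solutions, reusing the `theta_star`/`theta_hat` names from the earlier sketches; the standard deviation below uses the $1/B$ convention so that it matches the algebra exactly:

```python
import numpy as np

def bias_aware_error_bars(theta_star, theta_hat):
    theta_star = np.asarray(theta_star)
    alpha = theta_star.mean() - theta_hat    # bootstrap estimate of the bias
    sigma_star = theta_star.std()            # 1/B convention, as in the derivation

    # Potential solution 1: bias-corrected estimate, reported with the bootstrap SE
    theta_corrected = theta_hat - alpha

    # Potential solution 2: RMS deviation of the replicates around theta_hat,
    # which equals sqrt(sigma^2(theta*) + alpha^2)
    eps = np.sqrt(np.mean((theta_star - theta_hat) ** 2))
    assert np.isclose(eps, np.hypot(sigma_star, alpha))

    return theta_corrected, sigma_star, eps
```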

Question(s)

Are any of my potential solutions sensible? Have I made any wrong assumptions? Any ideas would be highly appreciated.

Edit

In response to @EdM, my statistic is the following: $$\theta=\frac{1}{2}\arctan\left(\frac{\sigma_{xy}^2}{|\sigma_{xx}^2-\sigma_{yy}^2|}\right),$$

where, given a 2D sample of $x$ and $y$ coordinates, $\sigma_{ij}^2=\langle ij \rangle - \langle i \rangle \langle j \rangle$.

This variable $\theta$ measures the angle by which the direction of highest dispersion in the plane is tilted with respect to the coordinate axes. In Galactic astronomy we call it the vertex deviation, with symbol $l_\mathrm{v}$. One important thing to note is that its range is limited to $[-45^\circ, 45^\circ]$.
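For concreteness, a direct translation of this statistic into Python (my own helper function, converting the result to degrees so that the $[-45^\circ, 45^\circ]$ range is explicit):

```python
import numpy as np

def vertex_deviation(x, y):
    """Vertex deviation l_v, in degrees, of a 2D sample of (x, y) coordinates,
    following the definition above; undefined when sigma_xx^2 == sigma_yy^2."""
    cov = np.cov(x, y, bias=True)            # <ij> - <i><j>
    s_xx, s_yy, s_xy = cov[0, 0], cov[1, 1], cov[0, 1]
    return 0.5 * np.degrees(np.arctan(s_xy / abs(s_xx - s_yy)))
```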

Below I simulate the type of bias this variable exhibits. The left and right panels were produced starting from two different populations. Each panel shows several sampling distributions for different sample sizes, and the legend indicates their mean value of $l_\mathrm{v}$. Distributions for lower $N$ develop a positive bias relative to the true value (i.e. the magnitude of their mean $l_\mathrm{v}$ decreases). The limited range of the variable seems to be at least part of the issue.

[Figure: sampling distributions of $l_\mathrm{v}$ for several sample sizes $N$; left and right panels correspond to two different populations.]
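The simulation behind a figure like this can be sketched as follows; the bivariate normal population and its covariance matrix are invented here purely for illustration and are not my actual data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: a tilted bivariate normal (illustrative numbers only)
cov_pop = np.array([[2.0, 0.8],
                    [0.8, 1.0]])

def vertex_deviation(x, y):
    c = np.cov(x, y, bias=True)
    return 0.5 * np.degrees(np.arctan(c[0, 1] / abs(c[0, 0] - c[1, 1])))

l_v_true = 0.5 * np.degrees(np.arctan(cov_pop[0, 1] / abs(cov_pop[0, 0] - cov_pop[1, 1])))

for N in (20, 100, 1000):
    estimates = [vertex_deviation(*rng.multivariate_normal([0, 0], cov_pop, size=N).T)
                 for _ in range(2000)]
    # compare the mean of each sampling distribution with the population value
    print(N, np.mean(estimates), "vs true", l_v_true)
```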

  • Welcome to Cross Validated! Please edit the question to say more about what X, Y and Z represent in your particular study, and the examples of bias that you have found. In general, if there is bias between a statistic calculated from the mean of bootstrapped samples and the same statistic calculated on the full data set, then the statistic itself is probably biased to start with and needs to be evaluated further. There are bootstrap methods that take bias and skew into account.
    – EdM
    Commented May 9 at 18:14
  • Thanks for your comment @EdM. I have edited my question to add those details in. Could you give me some more information/links on the bootstrap methods that take bias and skew into account?
    – Luismi98
    Commented May 11 at 2:39
  • Can't give a complete response right now. Look at this answer for some discussion of different types of bootstrap. Your problem seems similar to what's found with Shannon entropy, discussed on this page.
    – EdM
    Commented May 11 at 2:49
  • Model assumptions generally don't perfectly hold ("all models are wrong but some are useful"), so you will generally have the problem that whatever you do in statistics will not be exactly valid in the actual situation. I don't see a problem with having an error bar based on standard bootstrap saying what it is and what potential shortcoming you see. In some situations you may be able to improve, but this has to rely on additional assumptions, which are fine if they correspond to knowledge you actually have. Otherwise I don't see much improvement.
    Commented May 11 at 11:40

2 Answers

Answer 1

This page and its many links provide an introduction to the issues that arise in bootstrapping once you get beyond the simplest cases.

It's easy to forget that the reliability of bootstrapping to evaluate a statistic depends (among other things) on its being a pivotal quantity. That means that the shape and scale of its distribution don't change as its value changes; the whole distribution just shifts by a constant. With the fixed limits on your statistic, that clearly can't be the case, as your plots show.

As a result, your assumption that the distribution of its values among multiple bootstrapped samples represents its actual distribution doesn't hold. That's particularly the case for your bootstrapping, which seems to use the "percentile" bootstrap. (See Wikipedia for different types of bootstrapping.)

One thing to try is to evaluate the distribution of your bootstrapped estimates of the statistic around the original value of the statistic in the data sample. That's often called the "empirical" or "basic" bootstrap. It most directly follows the bootstrap principle: sampling with replacement from the observed data represents the process of taking the original data from the underlying distribution. I think that's close to what you're suggesting in your proposed solution, but I haven't thought it through completely. There's no need to re-invent things, as other well-established methods are already implemented, for example, in the boot package in R.
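As a rough sketch of that idea (not the `boot` package's implementation; `theta_star` and `theta_hat` here are the bootstrap replicates and the original estimate from the question's setup), the basic bootstrap interval reflects the bootstrap quantiles around the original estimate:

```python
import numpy as np

def basic_bootstrap_ci(theta_star, theta_hat, level=0.95):
    """Basic ("empirical") bootstrap interval: reflect the bootstrap
    quantiles of theta_star around the original estimate theta_hat."""
    lo_q, hi_q = np.quantile(theta_star, [(1 - level) / 2, (1 + level) / 2])
    return 2 * theta_hat - hi_q, 2 * theta_hat - lo_q
```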

Even with the basic bootstrap, you can't always be assured that your "error bars" have the nominal coverage. For example, nominal 95% confidence intervals constructed this way won't necessarily contain the true value in 95% of repeated samples.

The "bias-corrected and accelerated" bootstrap was designed to improve on those simpler methods, to deal with both bias and the skew that can affect coverage. That's probably the simplest choice in your situation. In extreme cases (like sampling from a log-normal distribution), however, even that isn't adequate.
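If you'd rather not code it yourself, `scipy.stats.bootstrap` (SciPy ≥ 1.7) provides a BCa option; here is a sketch for a paired statistic like the vertex deviation, with stand-in data in place of your real coordinates:

```python
import numpy as np
from scipy.stats import bootstrap

def vertex_deviation(x, y):
    c = np.cov(x, y, bias=True)
    return 0.5 * np.degrees(np.arctan(c[0, 1] / abs(c[0, 0] - c[1, 1])))

# stand-in data; replace with the real (x, y) coordinates
rng = np.random.default_rng(2)
x, y = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=200).T

res = bootstrap((x, y), vertex_deviation, paired=True, vectorized=False,
                n_resamples=9999, confidence_level=0.95, method='BCa')
print(res.confidence_interval, res.standard_error)
```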

Answer 2

Standard error doesn't tell you anything at all about the accuracy of the experiment. Say you want to measure the weight of some population of mice, but accidentally leave a 1kg weight on the scale - nothing about the spread of the data or the sample size can indicate that all your values are 1kg too high, even if your standard error shrinks arbitrarily close to 0. You can estimate the mean weight with arbitrary precision by collecting more mice from the same population, but whether that value is accurate comes down to your methodology and whether the measurement is biased or not.

In such a case, it wouldn't make any sense to try to shift your observed value toward the true unbiased value (ignoring the fact that you don't know the true value to begin with). The standard error tells you the likely range of a parameter given some observations. You can't just ignore the data and draw an error bar centered around some arbitrary "true" value that you found in some other way but which was never indicated by your data. If you find that your mice weigh an average of 1030 grams with low standard error in your "extra kg" experiment, there is no reason to draw an error bar that stretches toward the true value of 30 grams instead. Nothing about your data indicated that 30 grams was a feasible mean weight - as far as the data tell you, you never even saw a mouse weighing less than 1 kg.

To do this would be to observe a bunch of data, compute the standard-error range in which the parameter could feasibly lie, and then throw all of that out and declare that you know better than the data anyway - that the feasible range for the parameter should be forced to stretch toward a value you already "knew" was correct, which might lie well outside the range the data told you was feasible. A standard error purely describes variability within the data observed in an experiment; it says nothing about whether your data actually reflect reality, and it should not try to account for variability in data you did not observe.

See If "Standard error" and "Confidence intervals" measure precision of measurement, then what are the measurements of accuracy?

  • I have edited my question to make it clearer. Could you let me know if the new version changes your answer in any way?
    – Luismi98
    Commented May 11 at 2:40
  • To be clearer, I think this does not answer my question because I am talking about the bias of a biased estimator, not a measurement bias due to some systematic/human error in an experiment. My bias shows up when performing bootstrapping from my sample, while in the scenario you are referring to this would not be the case.
    – Luismi98
    Commented May 11 at 13:04
