
I have data that can reasonably be assumed to be iid samples from some distribution. My goal is to put a confidence interval on the population variance. Notationally, I have IID $X_i$, $i = 1, \ldots, n$, with mean $\mu$ and variance $\sigma^2$, both unknown. The sample size $n$ varies from 200 to 20,000. Let $S_n^2 = \frac{\sum_i (X_i -\bar{X}_n )^2}{n-1}$ denote the sample variance.

A plot of my data is trimodal, so it definitely does not seem to come from a normal distribution. Computing confidence intervals for the sample mean $\bar{X}_n$ is no problem, and I have planned diagnostics to make sure the distribution of the sample mean is normal with variance near $S_n^2/n$.

How can I create a confidence interval for $\sigma$ based on $S_n$, and what diagnostics can give me confidence that it is correct?


Related questions:

  1. What is the name of the distribution of unbiased sample variance for a sample from Gaussian distribution?
  2. Chi-squared confidence interval for variance

Both 1. and 2. are concerned with the case where the $X_i$ are IID normally distributed, which almost certainly does not hold here.

  3. How to prove $(\bar{X}-\mu)/(\hat{S}/\sqrt{n})$ is Student $t$ with $n-1$ degrees of freedom if the $X_i$ are iid $N(\mu, \sigma)$? The proof in the answer to question 3 uses an interesting technique: writing $S_n^2 = \sum_{i = 1}^{n-1}Y_i^2$ for iid positive $Y_i$. It seems like applying the CLT to this fact would provide a solution.

  • +1. I wouldn't trust the CLT for this unless the distributional tails are short -- sample variances are much more variable than sample means and always have skewed distributions. Your sample sizes are adequate for bootstrapping a confidence interval. – whuber (Jun 2, 2023)

3 Answers


Let's take a look at a bootstrap-based approach and compare the results to the CLT-based confidence interval. First, I'll define a population distribution that is trimodal, skewed, and heavy-tailed (compared to the normal).

rmydist <- function(n){
  # Equal-weight mixture: standard normal, shifted gamma, scaled/shifted beta
  i  <- sample(3, n, TRUE)    # component labels
  x1 <- rnorm(n)
  x2 <- rgamma(n, 1.2, 0.5) + 2
  x3 <- rbeta(n, 1.8, 0.5)*3 - 4
  x  <- (i==1)*x1 + (i==2)*x2 + (i==3)*x3
  return(x)
}

# Plot histogram with huge n
x <- rmydist(1e8)
hist(x, breaks=30)

[Histogram of the trimodal population distribution, based on the sample of 1e8 draws]

The "true" variance, based on this really huge sample from the population, is $\sigma^2 \approx 8.612$ (this matches the exact variance, which can be computed using the Law of Total Variance).

Now we can compute 95% confidence intervals for $n=200$ using (i) the percentile bootstrap, (ii) the accelerated bootstrap, (iii) the accelerated bootstrap on the log-sd scale (see @whuber's comment), and (iv) the chi-squared approximation.

# Generate data
set.seed(12345)
n <- 200
x <- rmydist(n)

# Confidence interval with (percentile) bootstrap
B <- 10000
boot <- rep(NA, B)
for(i in 1:B){
  xnew <- sample(x, n, TRUE)
  boot[i] <- var(xnew)
}
CI_1 <- quantile(boot, probs=c(0.025, 0.975))

# Confidence interval with accelerated bootstrap
#devtools::install_github("knrumsey/quack")
library(quack)
a <- est_accel(x, var)
CI_2 <- quack::boot_accel(x, var, alpha=0.05, a=a)

# Confidence interval with accelerated bootstrap (log-scale)
logsd <- function(xx) log(sd(xx))
a <- est_accel(x, logsd)
tmp  <- quack::boot_accel(x, logsd, alpha=0.05, a=a)
CI_3 <- exp(tmp)^2

# Confidence interval based on CLT
chi <- qchisq(c(0.025, 0.975), n-1)
CI_4 <- (n-1)*var(x)/rev(chi)

The confidence intervals for these four cases come out to be:

| Method | Lower bound | Upper bound |
|---|---|---|
| Percentile bootstrap | $6.52$ | $9.52$ |
| Accelerated bootstrap | $6.74$ | $9.76$ |
| Accelerated bootstrap (log-sd) | $6.74$ | $9.80$ |
| CLT | $6.68$ | $9.91$ |

Note that all four methods capture the "true value" of $8.612$ here. But one dataset isn't very interesting, so let's perform a simulation study.


Simulation study

We can repeat the R analysis conducted above one thousand times for each of the four methods. We are interested in (i) empirical coverage (the proportion of replications in which each method captures the true value) and (ii) the width of the confidence interval.
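Here is a minimal sketch of the simulation loop for the percentile bootstrap (reusing rmydist, n, and the true variance from above; the other methods follow the same pattern, and the number of bootstrap replicates is reduced for speed):

n_sim <- 1000
B     <- 1000                  # bootstrap replicates per dataset (reduced for speed)
sigma2_true <- 8.612
res <- replicate(n_sim, {
  x  <- rmydist(n)
  bt <- replicate(B, var(sample(x, n, TRUE)))
  ci <- quantile(bt, probs = c(0.025, 0.975))
  c(ci[1] < sigma2_true && ci[2] > sigma2_true,   # captured the true value?
    diff(ci))                                     # interval width
})
rowMeans(res)   # empirical coverage and average width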

Edit: Thanks to @COOLSerdash, who extended this simulation study to include the methods discussed by Bonett (2006) and O'Neill (2014) (see also @Ben's answer). I have added these methods to the table below for ease of comparison.

| Method | Empirical coverage | Interval width (average) |
|---|---|---|
| Percentile bootstrap | 91.3% | 4.22 |
| Accelerated bootstrap | 93.0% | 4.57 |
| Accelerated bootstrap (log-sd) | 93.0% | 4.58 |
| CLT | 86.8% | 3.44 |
| Bonett (2006) | 93.5% | 4.64 |
| O'Neill (2014) | 91.9% | 4.39 |

It is interesting to note that all of these methods undercover relative to the nominal level (95%), but the CLT-based method is especially over-confident, yielding narrow intervals that fail to capture the true value more often than they should.

  • +1. In case you ever wish to go further, I would be tempted to bootstrap the logarithm of the SD. In many ways that's a more natural way to express it and ought to yield nicer sampling distributions. – whuber (Jun 2, 2023)
  • Wow! Five stars. This is a great experiment to empirically see how to move forward! Are there theoretical foundations for these bootstrap methods, for which this empirical experiment provides an example? (Jun 5, 2023)
  • Yes, the bootstrap algorithm has nice theoretical guarantees under relatively minor assumptions. The original work is due to Efron, but there are lots of introductions/tutorials/references to be found online. I will try to add an example of bootstrapping on the log scale (per @whuber's comment) as soon as I can find a little time. – knrumsey (Jun 5, 2023)
  • The point of suggesting bootstrapping the log is that this shouldn't need acceleration. – whuber (Jun 21, 2023)
  • @whuber, I might be misunderstanding something in your suggestion. If I perform a standard percentile bootstrap using boot[m] <- log(sd(xnew)), I can extract a confidence interval with CI <- quantile(boot, probs=c(0.025, 0.975)). But converting the CI back to an interval for $\sigma^2$, i.e. CI <- exp(CI)^2, yields the same answer as before (for $n=200$, at least). Am I interpreting you correctly, or did you have something different in mind? – knrumsey (Jun 21, 2023)

Theory

Bonett (2006) proposed an approximate confidence interval for the variance and standard deviation of nonnormal distributions. It's based on a variance-stabilizing transformation of $\hat{\sigma}^2$, namely $\log(\hat{\sigma}^2)$, and the application of the delta method. Specifically, $\operatorname{Var}(\log(\hat{\sigma}^2))\approx\{\gamma_{4} - (n-3)/n\}/(n - 1)$, where $\gamma_4 = \mu_4/\sigma^4$ and $\mu_4$ is the fourth central moment of the population. In practice, $\gamma_4$ is unknown and has to be estimated. Bonett suggested $$ \bar{\gamma}_4=n\sum(Y_i - m)^4/\left(\sum(Y_i - \bar{Y})^2\right)^2 $$ where $m$ is a trimmed mean with trim proportion $1/\left\{2(n - 4)^{1/2}\right\}$. Curto (2021) proposed modifying this estimator by replacing $m$ with the median.

Bonett's $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is given by $$ \exp\left\{\log(c\hat{\sigma}^2)\pm z_{1-\alpha/2}\operatorname{se}\right\}\tag{1} $$ where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution, $\operatorname{se} = c\left[\left\{\bar{\gamma}_4 - (n - 3)/n\right\}/(n - 1)\right]^{1/2}$, and $c=n/(n - z_{1-\alpha/2})$ is an empirically determined small-sample adjustment factor. Taking the square root of the endpoints of $(1)$ gives a confidence interval for $\sigma$.

Burch (2017) suggested yet another method which is more complicated than the one detailed here.


Example

Using the trimodal distribution given by @Ben, and the ci_sigma2 function defined in the R code section below, we have

set.seed(142857)
x   <- c(rnorm(100, mean = 8,  sd = 4),
         rnorm( 50, mean = 12, sd = 3),
         rnorm(150, mean = 16, sd = 2))
ci_sigma2(x, alpha = 0.05)
[1] 19.46593 26.81422

So an approximate 95% confidence interval for the population variance is $(19.47,\ 26.81)$.


Simulation

Let's see how well the interval $(1)$ performs. I will use the same distribution (rmydist) as @knrumsey did, with $n=200$. Here's the code for the simulation:

set.seed(142857)
res <- replicate(1e4, {
  x <- rmydist(200)
  ci_tmp <- ci_sigma2(x, 0.05)
  c(ifelse(ci_tmp[1] < 8.612 && ci_tmp[2] > 8.612, 1, 0),
    diff(ci_tmp))
})
rowMeans(res)
[1] 0.935500 4.641079

The coverage is $0.935$, which is similar to the other methods, while the average width is a bit larger.

The confidence interval proposed by @Ben has a simulated coverage of $0.919$ with an average width of $4.39$:

library(stat.extend)
set.seed(142857)
res <- replicate(1e4, {
  x <- rmydist(200)
  ci_tmp <- as.data.frame(CONF.var(alpha = 0.05, x = x))
  c(ifelse(ci_tmp$Lower < 8.612 && ci_tmp$Upper > 8.612, 1, 0),
    ci_tmp$Upper - ci_tmp$Lower)
})
rowMeans(res)
[1] 0.919100 4.387456

R code

The following R code implements Bonett's method:

gamma4 <- function(y) {
  # Estimate of the kurtosis gamma_4 using a trimmed mean (Bonett, 2006)
  n <- length(y)
  m <- mean(y, trim = 1/(2*(n - 4)^(1/2)))
  # m <- median(y) # Suggested by Curto (2021)
  n*sum((y - m)^4)/sum((y - mean(y))^2)^2
}

ci_sigma2 <- function(x, alpha = 0.05) {
  # Bonett's interval (1) for the population variance
  n <- length(x)
  z <- qnorm(1 - alpha/2)
  cfac <- n/(n - z)   # small-sample adjustment factor c
  se <- cfac*((gamma4(x) - (n - 3)/n)/(n - 1))^(1/2)
  exp(log(cfac*var(x)) + c(-1, 1)*z*se)
}

The function VarCI from the DescTools package implements a number of different methods, including the one from Bonett described above.
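For example, a usage sketch (check ?VarCI for the methods available in your version of DescTools):

library(DescTools)
# 95% CI for the variance of x using Bonett's method
VarCI(x, method = "bonett", conf.level = 0.95)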


References

Bonett, D. G. (2006). Approximate confidence interval for standard deviation of nonnormal distributions. Computational Statistics & Data Analysis, 50(3), 775-782.

Burch, B. D. (2017). Distribution-dependent and distribution-free confidence intervals for the variance. Statistical Methods & Applications, 26, 629-648.

Curto, J. D. (2021). Confidence intervals for means and variances of nonnormal distributions. Communications in Statistics - Simulation and Computation, 1-17.


Even though your underlying distribution isn't normal (I note that it is trimodal), if $n$ is sufficiently large and the kurtosis of the underlying distribution is finite, you can rely on the central limit theorem to ensure that the sample variance is asymptotically chi-squared (see this related answer). This allows you to construct a confidence interval for the true variance of the underlying distribution or for a finite population.

O'Neill (2014) examines a range of useful moment results in sampling problems, and this paper includes the construction of confidence intervals for the variance, including intervals for the true variance of the underlying distribution or a finite population variance. These confidence intervals are based on the asymptotic result (Result 14, p. 285):

$$\frac{S_n^2}{\sigma^2} \overset{\text{Approx}}{\sim} \frac{\text{ChiSq}(DF_n)}{DF_n} \quad \quad \quad \quad \quad DF_n \equiv \frac{2n}{\kappa - (n-3)/(n-1)},$$

where $\kappa$ is the kurtosis of the underlying distribution. Using this statistic as a quasi-pivotal quantity yields a confidence interval for the variance parameter $\sigma^2$ given by (p. 287):

$$\text{CI}(1-\alpha) \equiv \bigg[ \frac{DF_n}{\chi^2_{\alpha-\theta, DF_n}} , \frac{DF_n}{\chi^2_{1-\theta, DF_n}} \bigg] s_n^2.$$

In this confidence interval, the value $0 \leqslant \theta \leqslant \alpha$ is a control parameter that can be optimised to produce the shortest confidence interval (see also O'Neill 2022 for discussion of the optimisation). Alternatively, you can set $\theta=\alpha/2$ if you want an equal-tailed confidence interval.

Since the degrees-of-freedom parameter in this approach depends on the kurtosis parameter $\kappa$, obtaining the confidence interval requires you to either stipulate the kurtosis or substitute an estimator for it. In my view, a reasonable approach is to substitute the sample kurtosis as an estimator (there are a few versions of these, so pick a favourite) and then optimise to obtain the shortest confidence interval as a default method; a minimal sketch of this approach follows below.
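Here is a minimal sketch of that default approach (not the stat.extend implementation; it plugs in the simple moment estimator of the kurtosis and optimises $\theta$ numerically):

# O'Neill (2014) interval with plug-in sample kurtosis; theta = alpha/2
# gives the equal-tail version. Quantiles are upper-tail, as in the formula.
ci_var_oneill <- function(x, alpha = 0.05, theta = alpha/2) {
  n   <- length(x)
  s2  <- var(x)
  kap <- n*sum((x - mean(x))^4)/sum((x - mean(x))^2)^2   # sample kurtosis
  DF  <- 2*n/(kap - (n - 3)/(n - 1))
  DF*s2/qchisq(c(alpha - theta, 1 - theta), DF, lower.tail = FALSE)
}

# Shortest interval: choose theta in (0, alpha) to minimise the width
ci_var_shortest <- function(x, alpha = 0.05) {
  width <- function(theta) diff(ci_var_oneill(x, alpha, theta))
  theta_opt <- optimize(width, c(1e-6, alpha - 1e-6))$minimum
  ci_var_oneill(x, alpha, theta_opt)
}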


Computing the confidence interval: The above method requires optimisation of the control parameter $\theta$ if you want the shortest confidence interval. Fortunately, this is already done for you in the CONF.var function in the stat.extend package. This function allows you to compute a shortest confidence interval for the variance by specifying either your full sample data or the sample variance. By default, the function will use the sample kurtosis as an estimator in computing the degrees-of-freedom, but you can override this value if you prefer (here we give the default interval).

#Produce a sample from a trimodal distribution
set.seed(142857)
X   <- c(rnorm(100, mean = 8,  sd = 4),
         rnorm( 50, mean = 12, sd = 3),
         rnorm(150, mean = 16, sd = 2))

#Compute a 99% confidence interval for the variance
library(stat.extend)
CONF.var(alpha = 0.01, x = X)

        Confidence Interval (CI) 
 
99.00% CI for variance parameter for infinite population 
Interval uses 300 data points from data X with sample variance = 22.6973 
and sample kurtosis = 2.8546
Computed using nlm optimisation with 5 iterations (code = 1) 

[18.5251761558458, 27.8496448203003]

The above example shows a simple computation for the confidence interval of the underlying variance parameter of the distribution. The CONF.var function also allows you to get a confidence interval for a finite population variance or even for the finite unsampled part of a population.

  • +1. Just a heads up: I just tried downloading the stat.extend package, but it was removed from CRAN. – COOLSerdash (Jun 21, 2023)
  • @COOLSerdash: Okay, I'll have a look at that. I think they may have changed the package requirements recently, so it might need an update. You can probably get it from the archive for now. – Ben (Jun 21, 2023)
  • Looks awesome! If I have time to get a Python implementation up, I may post it. (Jul 8, 2023)