We should indeed worry about the distinction between the usual Bessel-corrected sample standard deviation, $s$, and the population standard deviation, $\sigma$, when we don't know $\sigma$ (in the case that we did know $\sigma$, we would typically want to use it, but nearly always we won't).
One standard way to construct a confidence interval (at least when it's possible to do so) is via a pivotal quantity (a.k.a. a pivot). See https://en.wikipedia.org/wiki/Pivotal_quantity
A pivotal quantity ($Q$, say) is a function of the data and the parameter of interest ($\mu$ in your case) whose distribution doesn't depend on unknown parameters -- so changing the value of $\mu$ wouldn't change the distribution of $Q$, and, crucially, neither would altering any other unknown parameter (such as $\sigma$).
Speaking loosely, if you know the distribution of some pivotal quantity, $Q$, you can then construct a probabilistic interval for $Q$ (it's a random variable), and then back out a confidence interval for the parameter ($\mu$ in this case).
When $X_i$, $i=1,2,...,n$ are independent and identically distributed $\operatorname{N}(\mu,\sigma^2)$, it's possible to show that $T=\frac{\bar{X}-\mu}{s/\sqrt{n}}$ has a $t$ distribution with $n-1$ degrees of freedom; that is, $T$ is a pivotal quantity. From an interval for $T$ (which is a function of $\mu$, as you can see explicitly in the formula above), we can then obtain a confidence interval for $\mu$ by manipulating the algebraic expression for the probabilistic interval for $T$.
This approach sidesteps any direct worry about the error in estimating $\sigma$: we use $s$ to standardize $\bar{X}-\mu$ (the numerator of $T$) and then work with the resulting distribution of that statistic.
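As a minimal sketch of the algebra just described, here's how the $t$ pivot turns into an interval in code. The sample values are purely illustrative, and I'm assuming `scipy` is available for the $t$ critical value:

```python
# Illustrative sketch: a 95% CI for mu built from the pivot
# T = (xbar - mu) / (s / sqrt(n)) ~ t with n-1 df.
import math
from scipy import stats

x = [4.2, 5.1, 3.8, 5.6, 4.9, 4.4, 5.0, 3.9]  # made-up sample
n = len(x)
xbar = sum(x) / n
# Bessel-corrected sample standard deviation s (divisor n-1)
s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # upper alpha/2 critical value

# Rearranging  P(-t_crit <= (xbar - mu)/(s/sqrt(n)) <= t_crit) = 1 - alpha
# for mu gives the interval  xbar -/+ t_crit * s / sqrt(n):
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(ci)
```

Note that the only step specific to the normal model is knowing that the pivot has a $t_{n-1}$ distribution; the rest is rearranging the inequality inside the probability statement.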
A number of answers on site discuss pivotal quantities (I recommend trying a search, which should turn up some additional helpful discussion). There are some notes at [1], though numerous other sets of notes can be found. Many undergraduate statistics textbooks discuss this approach.
---
It is indeed the case that an interval based on $s$ will sometimes be narrower and sometimes wider than an interval based on the unknown $\sigma$. If we did know $\sigma$, we would indeed see that a higher proportion of the intervals based on $s$ miss $\mu$ when $s<\sigma$. The problem is that in practice we have no idea when this has happened -- we don't know $\sigma$, so we can't judge when $s$ was 'small' or 'large'; that conditional probability is not something we have access to.
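The conditional behaviour described above is easy to check by simulation, since the simulator (unlike the analyst) knows $\sigma$. This is an illustrative sketch assuming `numpy` and `scipy`; the parameter choices are arbitrary:

```python
# Illustrative simulation: t-intervals based on s miss mu more often in
# the replications where s happened to come out below sigma -- but the
# analyst, not knowing sigma, can't tell which replications those are.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 1.0, 5, 20000
t_crit = stats.t.ppf(0.975, df=n - 1)

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # Bessel-corrected s per replication
miss = np.abs(xbar - mu) > t_crit * s / np.sqrt(n)

small_s = s < sigma                      # only knowable because we set sigma
print(miss[small_s].mean(), miss[~small_s].mean())  # miss rate by s group
```

The first printed miss rate (for $s<\sigma$) comes out well above the second, yet the overall miss rate still sits near $\alpha = 0.05$, which is the next point.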
This is not a problem for our interval, however, because by working directly with the distribution of the pivotal quantity $T$, we can make sure that our interval has the desired long run coverage rate, $1-\alpha$. The fact that an interval based on $\sigma$ would be different is true but not relevant to us; the long-run property (that under repeated sampling, a long-run proportion $1-\alpha$ of the intervals we construct this way will overlap $\mu$) is maintained, by construction.
Intuitively speaking, how is it that overall the $1-\alpha$ rate is maintained? The distribution of the pivot $T$, compared to the pivot $Z$ based on a known $\sigma$, has a slightly larger variance and is heavier-tailed. The distribution of the sample standard deviation is right-skewed: $s$ is more often a little smaller than $\sigma$, and sometimes larger (occasionally considerably so, at least for small $n$). Consequently, if we had used $z$ tables to construct our interval based on $s$, as we might have done if we made the error of treating $s$ as if it were $\sigma$, the intervals would not quite attain the desired $1-\alpha$ coverage.
The actual intervals based on $s$ (via the pivot $T$) are therefore a little "wider" on average (if you look at a t-table, you'll see that the upper-tail critical value that leaves an area of $\alpha/2$ above it is larger than the corresponding value from a normal table). This extra bit of width from the $t$ on average exactly "adjusts" for the tendency for intervals based on $s$ to be too narrow if you had used $z$ tables instead.
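To make that concrete, here is an illustrative coverage simulation (again assuming `numpy` and `scipy`, with arbitrary parameter choices) comparing intervals built from $s$ with $z$ critical values against proper $t$ intervals:

```python
# Illustrative simulation: using z critical values with s undercovers,
# while the t interval attains (long-run) coverage close to 1 - alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 2.0, 5, 20000
z_crit = stats.norm.ppf(0.975)           # ~1.96
t_crit = stats.t.ppf(0.975, df=n - 1)    # larger than z_crit

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)   # s / sqrt(n)

cover_z = (np.abs(xbar - mu) <= z_crit * se).mean()  # z table with s: too low
cover_t = (np.abs(xbar - mu) <= t_crit * se).mean()  # t interval: near 0.95
print(cover_z, cover_t)
```

With $n=5$ the $z$-with-$s$ coverage falls noticeably short of $0.95$, while the $t$ interval's coverage sits right around it; the gap shrinks as $n$ grows, since $t_{n-1}$ approaches the standard normal.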
For some more detail on the relation between $Z$, $T$, $s$ and $\sigma$, see https://stats.stackexchange.com/a/110365/805
[1]: C. J. Geyer, "Stat 5102 Notes: More on Confidence Intervals", Feb 24, 2003, https://www.stat.umn.edu/geyer/old03/5102/notes/ci.pdf (Internet Archive copy in case the original disappears: http://web.archive.org/web/20220221153635/https://www.stat.umn.edu/geyer/old03/5102/notes/ci.pdf)