I know that when a sample is non-normally distributed, I need to use non-parametric alternatives, as when the assumptions of parametric methods are violated, its coverage will decrease.
Let's say that I want to compute the confidence interval variance of a non-normal data. I chose variance because variance is sensitive to non-normality than mean.
To my knowledge, the equation for the CI of variance for normal distribution is:
$$\frac{(n-1)s^{2}}{\chi^{2}_{\frac{\alpha}{2}, df}} \leq \sigma^2 \leq \frac{(n-1)s^{2}}{\chi^{2}_{1-\frac{\alpha}{2}, df}}$$
However, if I generate a skewed & exponential distribution multiple times, I get 100% coverage of population variance within the analytical CI of variance. Here's the Python code snippet:
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
count_norm = []
count_expon = []
count_skew = []
np.random.seed(42)
iteration = 1000
for i in range(iteration):
# generate samples
norm = stats.norm.rvs(size=1000, scale=100)
expon = stats.expon.rvs(size=1000, scale=100)
skew = stats.skewnorm.rvs(a=10, size=1000, scale=100)
# sample variance
var_norm = np.var(norm, ddof=1)
var_expon = np.var(expon, ddof=1)
var_skew = np.var(skew, ddof=1)
# analytical CI
lo_norm, hi_norm = (
(len(norm) - 1) * np.var(norm, ddof=1) / stats.chi2.ppf(1 - 0.05 / 2, len(norm) - 1),
(len(norm) - 1) * np.var(norm, ddof=1) / stats.chi2.ppf(0.05 / 2, len(norm) - 1)
)
lo_expon, hi_expon = (
(len(expon) - 1) * np.var(expon, ddof=1) / stats.chi2.ppf(1 - 0.05 / 2, len(expon) - 1),
(len(expon) - 1) * np.var(expon, ddof=1) / stats.chi2.ppf(0.05 / 2, len(expon) - 1)
)
lo_skew, hi_skew = (
(len(skew) - 1) * np.var(skew, ddof=1) / stats.chi2.ppf(1 - 0.05 / 2, len(skew) - 1),
(len(skew) - 1) * np.var(skew, ddof=1) / stats.chi2.ppf(0.05 / 2, len(skew) - 1)
)
# coverage test
if lo_norm <= var_norm <= hi_norm:
count_norm.append(1)
else:
count_norm.append(0)
if lo_expon <= var_expon <= hi_expon:
count_expon.append(1)
else:
count_expon.append(0)
if lo_skew <= var_skew <= hi_skew:
count_skew.append(1)
else:
count_skew.append(0)
print('coverage of normal dist =', sum(count_norm) / iteration)
print('coverage of exponential dist =', sum(count_expon) / iteration)
print('coverage of skewed dist =', sum(count_skew) / iteration)
>>> coverage of normal dist = 1.0
>>> coverage of exponential dist = 1.0
>>> coverage of skewed dist = 1.0
I am not sure how many people can understand this code, because most of the question answerers in this community are familiar with R...
Can I get an explanation of why I'm getting 100% coverage of population variance from a non-normal distribution using analytical confidence interval of variance, even when the assumption of normality is violated?
Or it would be great if anyone can disapprove my simulation result with his custom R simulation result... because what I'm getting is contradicting what I know.
Thanks