$\begingroup$

Assume we have a sample of size $n$ from an unspecified continuous distribution $F(\cdot)$. We wish to construct a tolerance interval to contain $(100\,\beta)\%$ of the population with a pre-specified confidence level $\gamma$ based on our sample. However, a conventional tolerance interval does not guarantee that the central $(100\,\beta)\%$ of the population is contained. In other words, a tolerance interval does not contain the $(1 - \beta)/2$ and $(1 + \beta)/2$ quantiles of the population distribution with confidence $\gamma$.

On the other hand, a tolerance interval that does contain the central $(100\,\beta)\%$ of the population with confidence $\gamma$ is called an equal-tailed$^{[1, 2]}$ or central tolerance interval. Meeker, Hahn & Escobar (2017) call them "tolerance intervals to control both tails"$^{[3]}$ (section E5.2).

While Liu et al. (2021) detail the construction of a nonparametric tolerance interval, they explicitly omit the details of how to construct an equal-tailed nonparametric tolerance interval to save space in their paper.

Because one-sided tolerance limits are equivalent to one-sided confidence limits for percentiles, would one possibility be to calculate a one-sided lower confidence limit for the $0.025$ quantile and a one-sided upper confidence limit for the $0.975$ quantile? I also found the paper by Hayter (2014)$^{[4]}$, which describes a method to calculate simultaneous nonparametric confidence intervals for percentiles, but I was not able to implement the proposed algorithm in R and test it.
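To make this idea concrete, here is a small sketch of my own (with an arbitrary $n$) of such one-sided nonparametric bounds based on order statistics; note that pairing two one-sided $\gamma$ bounds does not by itself guarantee joint confidence $\gamma$:

```r
# Sketch (not a joint-gamma guarantee): one-sided nonparametric confidence
# bounds for the 0.025 and 0.975 quantiles, each at one-sided level gamma.
n     <- 200  # arbitrary sample size for illustration
gamma <- 0.95

# Largest l with P(X_(l) <= x_0.025) = P(Binom(n, 0.025) >= l) >= gamma
l <- max(which(1 - pbinom((1:n) - 1, n, 0.025) >= gamma))
# Smallest u with P(X_(u) >= x_0.975) = P(Binom(n, 0.975) <= u - 1) >= gamma
u <- min(which(pbinom((1:n) - 1, n, 0.975) >= gamma))
c(lower = l, upper = u)  # for n = 200: order statistics 2 and 199
```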

Question: How can the calculations shown here be modified so that the resulting nonparametric tolerance interval contains the central $(100\,\beta)\%$ of the unknown continuous population distribution with confidence $\gamma$?

References

$[1]$: Liu, W., Bretz, F., & Cortina-Borja, M. (2021). Reference range: Which statistical intervals to use?. Statistical methods in medical research, 30(2), 523-534. (link)

$[2]$: Jan, S. L., & Shieh, G. (2018). The Bland-Altman range of agreement: Exact interval procedure and sample size determination. Computers in biology and medicine, 100, 247-252. (link)

$[3]$: Meeker, W. Q., Hahn, G. J., & Escobar, L. A. (2017). Statistical intervals: a guide for practitioners and researchers. 2nd ed. John Wiley & Sons. (link)

$[4]$: Hayter, A. J. (2014). Simultaneous confidence intervals for several quantiles of an unknown distribution. The American Statistician, 68(1), 56-62. (link)

$\endgroup$
  • $\begingroup$ A tiny modification of the solution at stats.stackexchange.com/a/166839/919 (taken from Hahn & Meeker, first ed.) will do the trick. $\endgroup$ – whuber Commented Jun 14, 2023 at 13:59
  • $\begingroup$ @whuber Thanks, that's reassuring. Could you give me a hint regarding the modification? I guess that the requirement ${\Pr}_F(F(X_u) - F(X_l) \lt \gamma)$ has to be modified to include the tails, right? $\endgroup$ Commented Jun 14, 2023 at 16:12
  • $\begingroup$ If I understand your needs, you want to split that probability so that $1-F(X_u)\approx \gamma/2$ and $F(X_l) \approx \gamma/2.$ Exactly how you do that depends on how closely you need to adhere to symmetry. $\endgroup$ – whuber Commented Jun 14, 2023 at 18:19
  • $\begingroup$ "the restrictions that there is no more than (1−P)/2 of the sampled population below the lower tolerance limit and no more than (1−P)/2 of the sampled population above the upper tolerance limit." The tolerance interval will not always have exactly those percentages above/below the interval as there is some confidence γ. So how do you wish to measure these (1-P)/2 probabilities of the distribution being above/below the tolerance, in terms of equal probabilities on average, or in terms of equal probabilities of rates that the interval fails? $\endgroup$ Commented Jun 19, 2023 at 12:15
  • $\begingroup$ @SextusEmpiricus Upon a second reading, I think this sentence does not reflect what I want (I deleted this section now). I simply want that with confidence $\gamma$, the lower and upper tolerance limits contain both quantiles. I think the limits in my answer and the paper do exactly this. A few simulations seem to confirm this. $\endgroup$ Commented Jun 19, 2023 at 13:16

2 Answers

$\begingroup$

For an ordered sample $X_{(1)}, X_{(2)}, \dots , X_{(n)}$ of iid $X \sim f(x)$, you want to pick ranks $1 \leq i < j \leq n$ such that, with at least a prescribed probability/confidence $\gamma$, the corresponding order statistics straddle the two quantiles $Q_X(p_1) < Q_X(p_2)$ of the distribution:

$$P\left[X_{(i)} < Q_X(p_1) < Q_X(p_2)< X_{(j)} \right] \geq \gamma$$


This process is similar to drawing uniformly distributed variables (representing the variables $X$ on the quantile scale of the distribution, since $F(X) \sim U(0,1)$) and observing how many fall below $p_1$ and how many fall above $p_2$.

If $i$ or more are below $p_1$, this is equivalent to the order statistic $X_{(i)}$ being below $Q_X(p_1)$. If $n+1-j$ or more are above $p_2$, this is equivalent to $X_{(j)}$ being above $Q_X(p_2)$.

See, for example, the image below for a sample of 88 cases drawn from a standard normal distribution. The following events are equivalent:

  • There are $15$ or more cases of the normal distributed variable $X$ below $-0.8416$.
  • There are $15$ or more cases of the uniform distributed variable $F(X)$ below $0.2$.
  • The $15$-th order statistic $X_{(15)}$ is below $-0.8416$.
  • The $15$-th order statistic of the sample $F(X_{(15)})$ is below $0.2$.

[image: a sample of 88 cases from a standard normal distribution illustrating the equivalent events]
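A quick simulation sketch (my own check, with an arbitrarily chosen number of replicates) confirms that the probability of the first event matches the binomial probability implied by the second:

```r
# Sketch: the event X_(15) < qnorm(0.2) has the same probability as
# "Binom(88, 0.2) >= 15", illustrating the equivalences above.
set.seed(42)
n <- 88
m <- 2e4  # number of simulated samples (arbitrary choice)
below <- replicate(m, sort(rnorm(n))[15] < qnorm(0.2))
exact <- 1 - pbinom(14, n, 0.2)  # P(at least 15 of 88 below the 0.2 quantile)
c(simulated = mean(below), exact = exact)
```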


Let's define the three variables

  • $K_A$, the number of items from the sample that are below $Q_X(p_1)$
  • $K_B$, the number of items from the sample that are between $Q_X(p_1)$ and $Q_X(p_2)$
  • $K_C$, the number of items from the sample that are above $Q_X(p_2)$

These numbers follow a multinomial distribution

$$P(k_A,k_B,k_C) = \frac{n!}{k_A!k_B!k_C!} p_1^{k_A} (p_2-p_1)^{k_B}(1-p_2)^{k_C} $$
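As a sanity check (a sketch with arbitrary example values), the displayed pmf agrees with R's built-in `dmultinom`:

```r
# Check the trinomial pmf above against dmultinom (arbitrary example values).
p1 <- 0.2; p2 <- 0.8; n <- 10
kA <- 3; kB <- 5; kC <- 2  # kA + kB + kC = n
manual  <- factorial(n) / (factorial(kA) * factorial(kB) * factorial(kC)) *
  p1^kA * (p2 - p1)^kB * (1 - p2)^kC
builtin <- dmultinom(c(kA, kB, kC), prob = c(p1, p2 - p1, 1 - p2))
all.equal(manual, builtin)  # TRUE
```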

And you are looking for values $i$ and $j$ such that the joint probability is at least $\gamma$

$$P[k_A \geq i \text{ and } k_C \geq n+1-j] \geq \gamma$$

Alternatively, you can compute the complement:

$$P[k_A < i \text{ or } k_C < n+1-j] = P[k_A < i] + P[k_C < n+1-j] - P[k_A < i \text{ and } k_C < n+1-j] \leq 1-\gamma$$


Example computation when we have a symmetric situation $i = n+1-j$ and $p_1 = 1-p_2$.

Specific case: If we have iid standard normal samples of size $n = 88$ and we choose an interval based on the order statistics $i=11$ and $j = n+1-11 = 78$, then the interval will cover both the 20-th and 80-th percentile (that is, the central 60% of the distribution) in 95% of the cases. The computation gives $0.952507$ and a simulation with one million samples gives $0.952775$.

gamma = 0.95
p1 = 0.2   # lower tail probability, P(X < Q_X(0.2))
p2 = 0.2   # upper tail probability, P(X > Q_X(0.8))
n = 88
k = 0:n


### compute the joint distribution of the multinomial
### M[x+1, y+1] = P(K_A = x, K_C = y)
Mmulti = function(p1, p2, n) {
  M = outer(0:n, 0:n, FUN = Vectorize(function(x, y) {
      if (x + y <= n) {
         p1^x * p2^y * (1-p1-p2)^(n-x-y) *
           factorial(n)/factorial(x)/factorial(y)/factorial(n-x-y)
      } else {
         0
      }
  }))
  return(M)
}

### compute a cdf-type distribution for the multinomial
### element k+1 gives P(K_A <= k and K_C <= k)
Pmulti = function(p1, p2, n) {
  M = Mmulti(p1, p2, n)
  sapply(1:(n+1), FUN = function(k) sum(M[1:k, 1:k]))
}


### compute the levels and check the desired number in a curve
pboth   = Pmulti(p1, p2, n)          # P(K_A <= k and K_C <= k)
psingle = pbinom(k, n, p1)           # P(K_A <= k), same for K_C by symmetry
peither = psingle + psingle - pboth  # P(K_A <= k or K_C <= k)
plot(peither)
lines(c(0, n), 1 - c(gamma, gamma))



### simulations

m = 10^6
kc = 11
lower = rep(NA,m)
upper = rep(NA,m)

q1 = qnorm(0.2)
q2 = qnorm(0.8)

set.seed(1)
for (i in 1:m) {
  x = sort(rnorm(n))
  lower[i] = x[kc]
  upper[i] = x[n+1-kc]
}

### compute cases that have coverage of the center
cover = (lower<q1)*(upper>q2)

### compare 
mean(cover)    ## 0.952775
1-peither[11]  ## 0.9525073

### not like this: the coverage events in the two tails are dependent,
### so multiplying the marginal probabilities gives the wrong answer
1-mean(lower < q1)*mean(upper > q2) ## 0.04681587
$\endgroup$
  • $\begingroup$ So the main trouble is to compute the cumulative distribution of a multinomial distributed variable. $\endgroup$ Commented Jun 20, 2023 at 13:51
  • $\begingroup$ Note that, without the condition $i+j = n+1$, there can be multiple solutions $i$ and $j$ that have $\gamma$ confidence to cover the central $\beta$ and can not be improved (smaller interval) by increasing $i$ or decreasing $j$. $\endgroup$ Commented Jun 20, 2023 at 13:54
  • $\begingroup$ Thanks. You basically get the same solution as presented in my answer. I think the paper by Hayter 2014 gives an algorithm for calculating the multinomial probability in section 2.2. I didn't manage to implement it though. $\endgroup$ Commented Jun 20, 2023 at 14:37
  • $\begingroup$ I guess that your sum does it already a bit faster $$\sum_{i = 0}^{r - 1}b(i; n, p_1)B\left(n - s; n - i, \frac{(1-p_2)}{(1 - p_1)}\right)$$ Is there an algorithm that improves on it? There is an article by Levin, "A Representation for Multinomial Cumulative Distribution Functions", that discusses a representation of the multinomial as a constrained multivariate Poisson distributed variable, but it still has some sort of sum expression that is needed to compute the final result (unless you use an approximation). $\endgroup$ Commented Jun 20, 2023 at 14:41
  • $\begingroup$ The pmultinom package implements the algorithm by Levin. You could use that to calculate the multinomial cdf. $\endgroup$ Commented Jun 21, 2023 at 19:10
$\begingroup$

Theory

Let $X_1, X_2, \ldots, X_n$ be a random sample from a continuous distribution with distribution function $F(x)$ and density $f(x)$. Denote the order statistics by $Y_1, Y_2, \ldots, Y_n$ and let $x_p$ be the quantile of order $p$ of the distribution of $X$. Wilks$^{[1]}$ defines an outer confidence/tolerance interval at level at least $\gamma$ for the quantile interval $(x_{p_1}, x_{p_2})$, $p_1<p_2$, by $$ \operatorname{Pr}(Y_r < x_{p_1} <x_{p_2}<Y_s)\geq \gamma\tag{1} $$ This is a tolerance interval that captures at least a fraction $\beta = p_2 - p_1$ of the parent distribution while controlling the tails. These intervals are known in the literature under several names: strong tolerance intervals (TI), central TI or equal-tailed TI. Guenther$^{[2]}$ gives the formula for calculating the probability $(1)$ as $$ \operatorname{Pr}(Y_r < x_{p_1} <x_{p_2}<Y_s) = 1 - B(r - 1; n, p_1) - B(n - s; n, 1 - p_2) + \sum_{i = 0}^{r - 1}b(i; n, p_1)B\left(n - s; n - i, \frac{(1-p_2)}{(1 - p_1)}\right) $$ where $b(d;n,p)$ and $B(d;n,p)$ denote the probability mass function and the cdf of a binomial distribution with parameters $n$ and $p$, respectively. For symmetrically chosen order statistics, we can set $s = n - r + 1$.

On the other hand, inner tolerance intervals are defined$^{[1, 2]}$ as $$ \operatorname{Pr}(x_{p_1}<Y_r<Y_s<x_{p_2})\geq\gamma \tag{2} $$ which controls both tails of the distribution with content no more than $p_2-p_1$ and tolerance coefficient at least $\gamma$. The probability $(2)$ can be calculated by $$ \operatorname{Pr}(x_{p_1}<Y_r<Y_s<x_{p_2}) = \operatorname{Pr}(Y_r < x_{p_1} <x_{p_2}<Y_s) - 1 + B(r - 1;n,p_1) + B(n - s;n, 1 - p_2) $$
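As a numeric sanity check (a sketch with arbitrarily chosen $n$, $r$, $s$, $p_1$, $p_2$), the inner-probability identity agrees with a direct trinomial computation:

```r
# Verify the inner/outer identity against a direct trinomial sum
# (arbitrary example values).
n <- 15; r <- 3; s <- 13; p1 <- 0.1; p2 <- 0.9

# Guenther's formula for the outer probability
i <- 0:(r - 1)
pouter <- 1 - pbinom(r - 1, n, p1) - pbinom(n - s, n, 1 - p2) +
  sum(dbinom(i, n, p1) * pbinom(n - s, n - i, (1 - p2) / (1 - p1)))

# Inner probability via the identity
pinner <- pouter - 1 + pbinom(r - 1, n, p1) + pbinom(n - s, n, 1 - p2)

# Direct computation: Pr(x_{p1} < Y_r < Y_s < x_{p2}) is the probability of
# fewer than r observations below x_{p1} and fewer than n + 1 - s above x_{p2}
direct <- 0
for (kA in 0:(r - 1)) for (kC in 0:(n - s)) {
  direct <- direct + dmultinom(c(kA, n - kA - kC, kC),
                               prob = c(p1, p2 - p1, 1 - p2))
}
all.equal(pinner, direct)  # TRUE
```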

Additional helpful papers on the topic are $[3]$ and $[4]$.

Example 1

Assume that we have a sample of size $n=29$ and want to find a symmetric $(\beta = 0.75, \gamma = 0.2)$ central tolerance interval so that the central $75\%$ of the population is captured with confidence $1-0.2 = 0.8$. The percentiles are chosen symmetrically, so $x_{p_1}=x_{(1 - 0.75)/2}=x_{0.125}$ and $x_{p_2}=x_{(1 + 0.75)/2}=x_{0.875}$. We wish to choose the interval that satisfies the requirements above while minimizing $s - r$.

Using the R function below, we find that the intervals $(Y_1, Y_{29})$, $(Y_1, Y_{28})$ and $(Y_2, Y_{29})$ all have a probability of at least $0.8$. Only $(Y_1, Y_{29})$ is symmetric, with $s = 29 - 1 + 1 = 29$, but its probability of $0.959$ is very conservative, while the other two intervals both have probability $0.874$.

Example 2

As above, assume that we have a sample of size $n=29$ and want to use the interval $(Y_1, Y_{29})$ as a tolerance interval with confidence $1 - \gamma = 1 - 0.2 = 0.8$. What central proportion of the population can we capture with that?

Using the R function below with a root finding algorithm, we find that $\beta = 0.85$. So the probability that $(Y_1, Y_{29})$ will include the $p_1 = 0.075, p_2 = 1 - 0.075 = 0.925$ quantiles is $0.8$.
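The root finding can be sketched with `uniroot` (my own sketch; symmetric tails with $r = 1$, $s = n$, using Guenther's formula directly so the snippet is self-contained):

```r
# Sketch of Example 2: find beta such that (Y_1, Y_29) is a central
# tolerance interval with confidence 0.8 (symmetric tails).
n <- 29
pouter <- function(beta) {
  p1 <- (1 - beta) / 2
  p2 <- (1 + beta) / 2
  # Guenther's formula with r = 1, s = n
  1 - pbinom(0, n, p1) - pbinom(0, n, 1 - p2) +
    dbinom(0, n, p1) * pbinom(0, n, (1 - p2) / (1 - p1))
}
beta <- uniroot(function(b) pouter(b) - 0.8, c(0.5, 0.99))$root
round(beta, 2)  # 0.85
```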

R code

The following R function calculates the probability that the chosen order statistics $Y_r$ and $Y_s$ contain the central $x_{p_1}$ and $x_{p_2}$ percentiles of the population distribution. This is done using the option type = "outer". The type = "inner" option calculates the probability $(2)$, what Wilks$^{[1]}$ calls an inner confidence/tolerance interval.

p_tol <- Vectorize(function(r, s, n, p1, p2, type = c("inner", "outer")) {
  
  type <- match.arg(type)
  # Summation index
  i <- 0:(r - 1)
  
  # Probability of outer tolerance interval
  pouter <- 1 - pbinom(r - 1, n, p1) - pbinom(n - s, n, 1 - p2) + sum(dbinom(i, n, p1)*pbinom(n - s, n - i, (1 - p2)/(1 - p1)))

  if (type %in% "inner") {
    # Probability of inner tolerance interval
    pinner <- pouter - 1 + pbinom(r - 1, n, p1) + pbinom(n - s, n, 1 - p2)
    pinner
  } else {
    pouter
  }
  
}, c("r", "s"))
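For reference, the headline number from Example 1 can also be reproduced standalone from Guenther's formula (equivalent to calling the function above with r = 1, s = 29, n = 29, p1 = 0.125, p2 = 0.875 and type = "outer"):

```r
# Standalone check of Example 1: outer coverage probability of (Y_1, Y_29)
# for the central 75% (p1 = 0.125, p2 = 0.875) when n = 29.
n <- 29; r <- 1; s <- 29; p1 <- 0.125; p2 <- 0.875
i <- 0:(r - 1)
pouter <- 1 - pbinom(r - 1, n, p1) - pbinom(n - s, n, 1 - p2) +
  sum(dbinom(i, n, p1) * pbinom(n - s, n - i, (1 - p2) / (1 - p1)))
round(pouter, 3)  # 0.959
```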

References

$[1]$: Wilks, S. S. (1962). Mathematical statistics. John Wiley & Sons, New York.

$[2]$: Guenther, W. C. (1985). Two-sided distribution-free tolerance intervals and accompanying sample size problems. Journal of quality technology, 17(1), 40-43. (link)

$[3]$: Reiss, R. D., & Ruschendorf, L. (1976). On Wilks' distribution-free confidence intervals for quantile intervals. Journal of the American Statistical Association, 71(356), 940-944. (link)

$[4]$: Krewski, D. (1976). Distribution-free confidence intervals for quantile intervals. Journal of the American Statistical Association, 71(354), 420-422. (link)

$\endgroup$
