4
$\begingroup$

I was reading about order statistics on Wikipedia [retrieved 29 June 2022]:

[Image: excerpt from Wikipedia giving the formulae for the distribution of order statistics]

Apparently, if we have a sample of $k$ elements (e.g., $x_1, x_2, \ldots, x_k$) and assume a probability distribution for these elements, we can also determine the probability distribution of each of the $k$ order statistics. It seems that the very definition of an order statistic is based on some underlying assumption of a probability distribution.

My Question: Does this mean that, for a given sample, it is fundamentally impossible to compute the distribution of the $k$th order statistic without choosing some probability distribution (i.e., that it cannot be done non-parametrically)?

$\endgroup$
4
  • 7
    $\begingroup$ Perhaps I am confused about what you mean by "for some sample", but if you have data and want to know the $k$th order statistic, you arrange the values in ascending order and take the point in position $k$. $\endgroup$
    – Dave
    Commented Jun 29, 2022 at 18:48
  • 1
    $\begingroup$ The material you've quoted gives the distribution of the order statistics. $\endgroup$
    – Sycorax
    Commented Jun 29, 2022 at 18:57
  • $\begingroup$ @Dave: thank you for your reply! But what if I want the distribution of the $k$th order statistic? Is this still possible in a non-parametric setting? Thank you! $\endgroup$
    – stats_noob
    Commented Jun 29, 2022 at 18:59
  • 2
    $\begingroup$ (Expanded to Answer format below.) For a sample of size n=2, there are two order statistics: the min and the max. Obviously, the parent distribution influences the distributions of both. The distribution of the minimum of two standard normal random variables (which can take negative values) differs from the distribution of the minimum of two standard exponential random variables (which is exponential with rate 2 and can never take negative values). $\endgroup$
    – BruceET
    Commented Jun 29, 2022 at 21:01

3 Answers

2
$\begingroup$

The formulae you have cited here give the true distribution of the order statistics in the case where the underlying sample values are IID random variables from the true distribution $F_X$. If the form of $F_X$ is unknown then the distribution of the order statistics is likewise unknown, so it is not possible to compute it exactly. However, it is indeed possible to estimate the distribution of the order statistics non-parametrically.

There are several ways you could go about non-parametric estimation of the distribution of order statistics for an IID sample. One simple method would be to use a standard non-parametric estimator for $F_X$ (e.g., a kernel density estimator) and then substitute the resulting estimate into the formula for the distribution of the order statistics to yield a "plug-in" estimator. In this case, you would choose an appropriate kernel CDF/density for your KDE (respectively denoted as $H$ and $h$) and then your estimator for the distribution of the $k$th order statistic would be:

$$\hat{f}_{X_{(k)}}(x) = \frac{n!}{(k-1)!(n-k)!} \hat{f}_n(x) [\hat{F}_n(x)]^{k-1} [1-\hat{F}_n(x)]^{n-k},$$

where:

$$\begin{align} \hat{F}_n(x) \equiv \frac{1}{n} \sum_{i=1}^n H \Big( \frac{x-x_i}{\hat{\lambda}} \Big) \quad \quad \quad \quad \quad \hat{f}_n(x) \equiv \frac{1}{n \hat{\lambda}} \sum_{i=1}^n h \Big( \frac{x-x_i}{\hat{\lambda}} \Big), \end{align}$$

and $\hat{\lambda}$ is an estimated value for the KDE bandwidth $\lambda$. This simple "plug-in" estimator does not rely on any parametric assumptions about the true form of $F_X$ and it will give you a locally-consistent estimator for the true distribution of the order statistics.
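
A minimal sketch of this plug-in estimator in R, assuming a Gaussian kernel (so $H$ is `pnorm` and $h$ is `dnorm`) and the rule-of-thumb bandwidth from `bw.nrd0()`. The function name `order_stat_density` and the simulated data are hypothetical, chosen only for illustration. The sketch uses the fact that the constant and the two powers in the formula for $\hat{f}_{X_{(k)}}$ together form the Beta$(k, n-k+1)$ density evaluated at $\hat{F}_n(x)$:

```r
# Plug-in estimate of the density of the k-th order statistic.
# Assumptions (not from the original answer): Gaussian kernel and
# the bw.nrd0() rule-of-thumb bandwidth.
order_stat_density <- function(x, data, k, bw = bw.nrd0(data)) {
  n <- length(data)
  # Kernel estimates of the CDF and the density at each point of x
  F_hat <- sapply(x, function(t) mean(pnorm((t - data) / bw)))
  f_hat <- sapply(x, function(t) mean(dnorm((t - data) / bw)) / bw)
  # n!/((k-1)!(n-k)!) * u^(k-1) * (1-u)^(n-k) is the Beta(k, n-k+1)
  # density at u, so dbeta() evaluates the order-statistic factor
  dbeta(F_hat, shape1 = k, shape2 = n - k + 1) * f_hat
}

set.seed(1)
sample_data <- rnorm(50)                  # stand-in data
grid <- seq(-6, 6, length.out = 2001)
dens <- order_stat_density(grid, sample_data, k = 25)
total_mass <- sum(dens) * diff(grid)[1]   # Riemann sum over the grid
total_mass                                # should be close to 1
```

Because $\hat{f}_n$ is the derivative of $\hat{F}_n$, the estimated density integrates to one, which the last line checks numerically.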

$\endgroup$
16
$\begingroup$

"Hey, Dave, how likely is it that my second-largest measurement is at least five?"

"What are you measuring?"

"Could be anything!"

"If you're measuring the number of meters between planets, then I'd say it's pretty likely. If you're measuring the number of times one gives birth to triplets before age twenty, then I'd say it's pretty unlikely."

"...so how likely is it that my second-largest value is at least five?"

If you have no idea what the distribution is, then you shouldn't be able to know much about its statistics.

The same argument applies to simple statistics like the sample mean. The mean number of meters between planets is likely to be greater than five, while the mean number of times people give birth to triplets by age twenty is unlikely to be greater than five. The original distribution influences the distribution of the statistic.
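
To put numbers on this, here is a small sketch assuming $n = 10$ IID observations. The second-largest value is at least five exactly when at least two observations are at least five, so the probability is a binomial tail that depends on the parent distribution only through $p = P(X \ge 5)$. The two normal parents and the function name are hypothetical choices for illustration:

```r
# P(second-largest of n IID draws >= threshold)
#   = P(at least two draws >= threshold), a binomial tail probability.
# p_exceed is P(X >= threshold) for a single draw from the parent.
prob_second_largest_at_least <- function(n, p_exceed) {
  pbinom(1, size = n, prob = p_exceed, lower.tail = FALSE)
}

n <- 10
# Hypothetical parents: Normal(0, 1) versus Normal(10, 1)
p_small <- pnorm(5, mean = 0,  sd = 1, lower.tail = FALSE)
p_large <- pnorm(5, mean = 10, sd = 1, lower.tail = FALSE)

prob_low  <- prob_second_largest_at_least(n, p_small)  # essentially 0
prob_high <- prob_second_largest_at_least(n, p_large)  # essentially 1
c(prob_low, prob_high)
```

The same sample size and the same threshold give answers at opposite ends of the unit interval, purely because the parent distribution changed.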

$\endgroup$
10
$\begingroup$

This was formerly a comment, but per your comment, I expanded it a bit.

For a sample of size $n=2,$ there are two order statistics: the min and the max. Obviously, the parent distribution influences the distributions of both. The distribution of the minimum of two standard normal random variables (which can take negative values) differs from the distribution of the minimum of two standard exponential random variables (which is exponential with rate 2 and can never take negative values). (Using R for a brief simulation.)

    set.seed(2021)
    mn = replicate(10^6, min(rnorm(2))) # min of two std normals
    summary(mn)
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    -5.331726 -1.107641 -0.544854 -0.563720  0.000805  3.415120 

    me = replicate(10^6, min(rexp(2)))  # min of two exponentials
    summary(me)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     0.0000  0.1442  0.3467  0.5000  0.6935  6.6056 
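
As a check on the simulation (a short derivation, assuming independent draws as in the code above), the exponential case can be worked out exactly:

$$P\big(\min(X_1, X_2) > t\big) = P(X_1 > t)\, P(X_2 > t) = e^{-t} e^{-t} = e^{-2t}, \qquad t \ge 0,$$

so the minimum is exponential with rate 2, with mean $1/2$ and median $(\ln 2)/2 \approx 0.3466$, in line with the simulated values 0.5000 and 0.3467. For two independent standard normals, $E[\min(Z_1, Z_2)] = -1/\sqrt{\pi} \approx -0.5642$, in line with the simulated mean of $-0.563720$.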
$\endgroup$
