
I have a dataset of 30,155 names and, out of curiosity, I found that the longest name has 68 characters, which is quite large considering that the mean and SD are 24.78 and 5.64, respectively. Based on this, I thought about the following question: what is the most accurate way to estimate the probability of seeing a name with a length greater than or equal to 68?

That is, what is the best way to estimate the probability of seeing a value greater than or equal to the maximum of a dataset, based on the dataset itself?

I've come up with 3 methods:

Using frequency

Since there was 1 name of length greater than or equal to 68 among the 30,155 names, the probability is around $1/30155 \approx 3\cdot10^{-5}$.
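(As an aside, one way to quantify the uncertainty of this frequency estimate, along the lines of the binomial estimators mentioned in the comments below, is an exact Clopper–Pearson interval. A minimal sketch, assuming only the count of 1 hit in 30,155 trials:)

```python
from scipy.stats import beta

k, n = 1, 30155                      # one name of length >= 68 out of 30,155
p_hat = k / n                        # point estimate, ~3.3e-5
# Exact Clopper-Pearson 95% interval for a binomial proportion
lo = beta.ppf(0.025, k, n - k + 1)
hi = beta.ppf(0.975, k + 1, n - k)
print(p_hat, lo, hi)
```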

Using standard deviation

Since 68 is 7.66 standard deviations above the mean, one can calculate that, for a normal distribution (my dataset visually resembles a normal distribution), the probability should be $\approx 9\cdot10^{-15}$, which is obviously way off.
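(For reference, a minimal sketch of this normal-tail calculation, using only the summary statistics quoted above:)

```python
from scipy.stats import norm

mean, sd = 24.78, 5.64
z = (68 - mean) / sd        # ~7.66 standard deviations above the mean
print(norm.sf(z))           # upper-tail probability, ~9e-15
```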

Using KDE

With Python, I used the gaussian_kde function from SciPy, sampled 100,000,000 name lengths from the KDE, and obtained a probability of $\approx 1.65\cdot 10^{-5}$, which is smaller than the frequency estimate but of the same order.
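(A minimal sketch of this sampling approach. Since the actual name lengths aren't shown here, the `lengths` array below is a synthetic stand-in with the quoted mean and SD, and the number of draws is reduced to keep it cheap:)

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for the real 30,155 name lengths (placeholder data)
rng = np.random.default_rng(0)
lengths = rng.normal(24.78, 5.64, 30_155)

kde = gaussian_kde(lengths)
draws = kde.resample(10_000_000)[0]   # resample returns an array of shape (1, n)
print((draws >= 68).mean())           # Monte Carlo estimate of P(X >= 68)
```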

The KDE method seems, to me, to be the most reasonable way, but I'm not sure.

  • There's no basis, either empirical or theoretical, to suppose a Normal distribution is a good approximation for the extreme tail of a distribution. Any KDE merely substitutes a location mixture of the tail of its kernel for the extreme tail, and is therefore worthless as an estimator unless its tail is expected to be similar to that of the parent distribution -- and generally a Gaussian KDE doesn't suffice. Look, then, to non-parametric Binomial estimators or study extreme value distributions.
    – whuber, Jun 28 at 22:07
  • There are many ways to estimate this probability, but as long as you don't define 'accurately' and 'best way', there can be no definitive answer.
    – Commented Jun 29 at 16:57

1 Answer


First of all, notice that the data are on a discrete scale, so a fair parametric assumption could be a Pareto/Zipf distribution or, even more simply and probably more adequately, a Poisson distribution ($X$ = length of the name). Lastly, a bootstrap approach with continuous smoothing (a smoothed bootstrap) is also a reasonable choice.

You can write down the CDF of the gaussian_kde explicitly, since it equals $F(t)=\frac{1}{n}\sum_{i=1}^n \Phi\left(\frac{t-y_i}{h}\right)$, so it is easy to compute $1-F(67.5)$ directly (I applied a slight continuity correction, which I guess is valid as in the normal approximation to a sum of Bernoullis). Note that a larger $h$ would put more weight on the tail, but it is not recommended to alter the smoothing parameter $h$ obtained by cross-validation (or Sheather–Jones).

However, a general way to answer this kind of question is importance sampling, which simulates under a distribution that makes the rare event less rare and then corrects by the true probability density function. For example, if you want to calculate $\Phi(-10)$, you could generate $V_1,\dots,V_B$ from $\mathcal N(-10,1)$ and take the mean of $I(V_b\le-10)\cdot\frac{\phi(V_b)}{\phi(V_b+10)}$. Check whether this approach helps with your problem. Of course, the same approach also applies to discrete probabilities.
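(A minimal sketch of the closed-form tail calculation described above, again with a synthetic stand-in for the data since the real lengths aren't available here. It reads off the bandwidth $h$ that gaussian_kde actually chose and evaluates $1-F(67.5)$ as a mixture of normal tails rather than by simulation:)

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Synthetic stand-in for the real name lengths (placeholder data)
rng = np.random.default_rng(0)
lengths = rng.normal(24.78, 5.64, 30_155)

kde = gaussian_kde(lengths)
h = np.sqrt(kde.covariance[0, 0])          # bandwidth actually used by SciPy
# 1 - F(t) = (1/n) * sum_i Phi((y_i - t) / h), here with t = 67.5
tail = norm.sf((67.5 - lengths) / h).mean()
print(tail)
```

(And a sketch of the $\Phi(-10)$ importance-sampling example, with the weights computed on the log scale to avoid underflow:)

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
v = rng.normal(-10.0, 1.0, 100_000)                  # proposal: N(-10, 1)
log_w = norm.logpdf(v) - norm.logpdf(v, loc=-10.0)   # phi(v) / phi(v + 10)
est = np.mean((v <= -10.0) * np.exp(log_w))
print(est, norm.cdf(-10.0))                          # estimate vs. exact value
```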

  • Bootstrapping doesn't work well for extremes of distributions. In terms of this question, the data sample won't contain any values greater than the highest that was observed (68), so the data will have no information about rare names with "length greater... than 68." See several pages on this site: this one, or this one, or this one, among others.
    – EdM, Jun 29 at 13:46
