I have a dataset of 30,155 names, and out of curiosity I checked that the longest name has 68 characters, which is quite large given that the mean and SD of the lengths are 24.78 and 5.64, respectively. This led me to the question: what is the most accurate way to estimate the probability of seeing a name with length greater than or equal to 68?
That is, what is the best way to estimate the probability of seeing a value greater than or equal to the maximum of a dataset, based on the dataset itself?
I've come up with three methods:
Using frequency
Since there was 1 name of length greater than or equal to 68 among the 30,155 names, the probability is around $1/30155 \approx 3\cdot10^{-5}$.
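For concreteness, the frequency estimate is just the empirical tail count divided by the sample size (the counts below are the ones reported above; the actual dataset isn't reproduced here):

```python
# Empirical (frequency) estimate of P(length >= 68).
n_total = 30_155    # names in the dataset
n_extreme = 1       # names with length >= 68
p_freq = n_extreme / n_total
print(f"frequency estimate: {p_freq:.2e}")  # ~3.32e-05
```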
Using standard deviation
Since 68 is 7.66 standard deviations above the mean, one can calculate that, for a normal distribution (my dataset visually looks normal), the probability should be $\approx 9\cdot10^{-15}$, which is obviously way off.
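This tail probability can be computed directly from the normal survival function, using the sample mean and SD reported above as the distribution's parameters:

```python
from scipy.stats import norm

mean, sd, x = 24.78, 5.64, 68.0   # sample statistics reported above
z = (x - mean) / sd               # ~7.66 standard deviations
p_tail = norm.sf(x, loc=mean, scale=sd)  # upper tail P(X >= 68), ~9e-15
print(f"z = {z:.2f}, normal tail estimate: {p_tail:.2e}")
```

The huge discrepancy with the frequency estimate reflects how fast a Gaussian tail decays: a normal model makes 7.66-sigma events astronomically rare, even when the bulk of the data looks normal.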
Using KDE
With Python, I used the `gaussian_kde` class from SciPy, sampled 100,000,000 name lengths from the fitted KDE, and obtained a probability of $\approx 1.65\cdot 10^{-5}$, which is smaller than the frequency estimate but of the same order.
The KDE method seems to me the most reasonable of the three, but I'm not sure.