$\begingroup$

I learned in statistics that the first, second, and third quartiles can be represented as in the figure below.

[Figure: first, second, and third quartiles marked on a distribution curve]

I came across this part of the article, Step 4 - Feature Engineering. In this portion of the article, they used quantiles:

strain.append(np.quantile(X,0.01))
strain.append(np.quantile(X,0.05))
strain.append(np.quantile(X,0.95))
strain.append(np.quantile(X,0.99))

Can we represent the 0.01, 0.05, 0.95, and 0.99 quantiles on a normal distribution curve like this?

[Figure: normal distribution curve with the 0.01, 0.05, 0.95, and 0.99 quantiles marked]

Is this a correct representation? What do the 0.01, 0.05, 0.95, and 0.99 quantile values define?

$\endgroup$
  • 1
    $\begingroup$ In short, yes!! $\endgroup$ Commented Jun 3, 2020 at 20:05
  • 1
    $\begingroup$ The probability that a random point from the distribution will be less than or equal to the X quantile is equal to X. So, for example, 0.95 of the density of the distribution is to the left of the 0.95 quantile. $\endgroup$
    – Ryan Volpi
    Commented Jun 3, 2020 at 20:17
  • 2
    $\begingroup$ No-one has pointed out that the first diagram is wildly inaccurate. The shaded areas supposedly make up 50% of the total area and the unshaded areas the other 50% but to my eye the shaded areas make up much more than the unshaded. Also, if the diagram is supposed to be a normal distribution, then the first or lower and third or upper quartiles lie between the mean (median) and the inflections, which are at one standard deviation from the mean, not outside them. These facts alone should lead to distrust in the source specified. $\endgroup$
    – Nick Cox
    Commented Jun 4, 2020 at 7:54
  • $\begingroup$ I glanced at the website mentioned and didn't study it in detail. I stopped after seeing another unreliable diagram, which purports to show a normal distribution and others with greater or lesser kurtosis, but in practice shows three similar distributions with different spread and approximately the same kurtosis. The fact that this is a common error naturally doesn't stop it being an error. $\endgroup$
    – Nick Cox
    Commented Jun 4, 2020 at 7:59

1 Answer

$\begingroup$

You may be confusing population quantiles with the sample quantiles that estimate them. Your population quantiles are appropriately represented in your figures.

Population quantiles. If random variable $X \sim \mathsf{Norm}(\mu = 100, \sigma = 15),$ then quantiles $.01, .05, .25, .50, .75, .95, .99$ of the distribution can be found in R by using the quantile function qnorm. (The quantile function is sometimes called the 'inverse CDF' function.)

q = round(qnorm(c(.01,.05,.25,.50,.75,.95,.99), 100, 15),3);  q
[1]  65.105  75.327  89.883 100.000 110.117 124.673 134.895 
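
For readers coming from the Python side of the question, here is a minimal sketch of the same computation using only the standard library. I am assuming that statistics.NormalDist.inv_cdf is an acceptable stand-in for R's qnorm; the variable names are my own.

```python
from statistics import NormalDist

# Population distribution from the example: Norm(mu = 100, sigma = 15).
dist = NormalDist(mu=100, sigma=15)

probs = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]

# inv_cdf(p) is the quantile function: the value q with P(X <= q) = p.
pop_q = [round(dist.inv_cdf(p), 3) for p in probs]
print(pop_q)
```

The printed values should agree with the R output to three decimal places.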

These quantiles (at vertical lines) can be displayed along with the density function of $\mathsf{Norm}(100, 15)$ as shown in the graph below.

 curve(dnorm(x, 100, 15), 50, 150, col="blue", lwd=2, ylab="PDF",
      main="Density of NORM(100, 15) with Various Quantiles")
   abline(h=0, col="green2");  abline(v=0, col="green2")
   abline(v=q, col="red", lty="dotted", lwd=2)

[Figure: density of NORM(100, 15) with dotted vertical lines at the quantiles]

The total area (representing probability) under the density curve is $1.$ Areas to the left of the three left-most vertical lines are $.01,.05,$ and $.25,$ respectively.
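
To make the "area equals probability" statement concrete, here is a rough numerical check in Python. It is only a sketch: the helper area_left_of, the lower integration limit of -50 (ten standard deviations below the mean), and the number of steps are my own choices, not anything from the article.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)

def area_left_of(q, lower=-50.0, steps=20000):
    """Trapezoid-rule area under the density between `lower` and q."""
    width = (q - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(q))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width

# The area to the left of quantile p should be (approximately) p itself.
for p in (0.01, 0.05, 0.25):
    q = dist.inv_cdf(p)
    print(p, round(area_left_of(q), 4), round(dist.cdf(q), 4))
```

The second and third columns should both be close to the first: integrating the density up to a quantile recovers the probability that defines it.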

Sample quantiles. If I have a sufficiently large sample from this distribution, then I can find the quantiles of the sample. For example, the 50th sample percentile (quantile .5) is the sample median. These sample quantiles estimate the corresponding population quantiles. Generally speaking, larger samples give better estimates. I will use $n = 1000$ in my example.

set.seed(2020) # for reproducibility
x = round(rnorm(1000, 100, 15), 3)

Here are some summary statistics of the sample, including the sample first quartile (quantile .25), the sample median, and the sample third quartile (quantile .75). The boxplot uses the quartiles [left and right edges of the box, since it is drawn horizontally] and the median [line inside the box], so we show it also.

summary(x)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   52.51   89.30   99.14   99.60  109.58  155.54 

boxplot(x, col="skyblue2", horizontal=T,
        main="n=1000; Boxplot of Sample from NORM(100,15)")

[Figure: horizontal boxplot of the sample, n = 1000, from NORM(100, 15)]

Without extra arguments, the R function quantile shows the minimum and maximum values in the sample and the three quartiles shown in the summary.

quantile(x)
       0%       25%       50%       75%      100% 
 52.50800  89.30475  99.13750 109.57850 155.54300 

In order to get our full list of quantiles, we need to specify them individually.

samp.q = quantile(x, c(.01,.05,.25,.50,.75,.95,.99));  samp.q
       1%        5%       25%       50%       75%       95%       99% 
 63.76255  74.46450  89.30475  99.13750 109.57850 126.38775 136.60263 

In particular, notice that population quantile .05 (which is $75.327$ from earlier) is estimated by the sample quantile .05 (which is $74.465$ just above).
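
Here is a hedged Python sketch of the same comparison, for anyone who wants to see what a call like np.quantile(X, 0.05) is doing. The function sample_quantile is my own hand-written version of the linear-interpolation rule (R's default type 7, which is also NumPy's default "linear" method). The sample is freshly simulated with Python's generator, so it will not reproduce the R numbers above.

```python
import math
import random
from statistics import NormalDist

def sample_quantile(data, p):
    """Linear-interpolation sample quantile (R type 7 / NumPy 'linear')."""
    xs = sorted(data)
    h = (len(xs) - 1) * p            # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

rng = random.Random(2020)            # arbitrary seed; not R's sample
x = [rng.gauss(100, 15) for _ in range(1000)]

dist = NormalDist(100, 15)
for p in (0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99):
    # sample estimate next to the population value it estimates
    print(p, round(sample_quantile(x, p), 3), round(dist.inv_cdf(p), 3))
```

Each sample quantile should land near, but not exactly on, its population counterpart, with the extreme quantiles (.01 and .99) typically showing the largest discrepancies.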

Finally, we show a histogram of the $n=1000$ observations along with the population density curve. Now the vertical dotted lines show the positions of our chosen sample quantiles.

hist(x, prob=T, col="skyblue2", main="Histogram of Sample")
 curve(dnorm(x, 100, 15), add=T, col="blue", lwd=2)
 abline(v=samp.q, col="purple", lty="dotted", lwd=2)

[Figure: histogram of the sample with the NORM(100, 15) density curve and dotted vertical lines at the sample quantiles]

Numbers of observations at or to the left of the three left-most vertical lines are $10, 50,$ and $250,$ respectively, out of $1000.$
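
These counts follow from how the default sample quantile interpolates: with 1000 observations, quantile .01 lands between the 10th and 11th smallest values, so exactly 10 observations lie at or to its left. A rough Python check, assuming a freshly simulated sample (not the R data) and the standard-library statistics.quantiles with method="inclusive", which as far as I know uses the same interpolation rule as R's default:

```python
import random
from statistics import quantiles

rng = random.Random(2020)    # arbitrary seed; not the R sample
x = [rng.gauss(100, 15) for _ in range(1000)]

# 99 cut points: pct[0] is the 1st percentile, pct[98] the 99th.
pct = quantiles(x, n=100, method="inclusive")

counts = {}
for k in (1, 5, 25):
    q = pct[k - 1]                           # k-th sample percentile
    counts[k] = sum(1 for v in x if v <= q)  # observations at or left of q
print(counts)
```

With 1000 distinct values, the counts should come out as 10, 50, and 250 whatever the seed, because they depend only on the ranks, not on the values themselves.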

Note: All of the above is about quantiles for a normal distribution because your question deals only with normal distributions. But @Nick Cox makes a good point that quantiles are used similarly for other distributions. For example, here is a plot of an exponential distribution that has rate $\lambda = 0.1$ (hence mean $\mu = 10),$ with vertical lines at the same quantiles used above for the normal distribution.

q = round(qexp(c(.01,.05,.25,.50,.75,.95,.99), 0.1),3);  q
[1]  0.101  0.513  2.877  6.931 13.863 29.957 46.052
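
For the exponential distribution the quantile function has a closed form: solving $1 - e^{-\lambda q} = p$ gives $q = -\ln(1-p)/\lambda.$ A small Python sketch of that formula (variable names are mine):

```python
import math

rate = 0.1
probs = [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]

# Solve 1 - exp(-rate * q) = p for q; log1p(-p) computes log(1 - p).
exp_q = [round(-math.log1p(-p) / rate, 3) for p in probs]
print(exp_q)
```

The values should match the qexp output to three decimal places.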

curve(dexp(x, 0.1), 0, 60, col="blue", lwd=2, ylab="PDF", n=10001,
      main="Density of EXP(mean=10) with Various Quantiles")
  abline(h=0, col="green2");  abline(v=0, col="green2")
  abline(v=q, col="red", lty="dotted", lwd=2)

[Figure: density of EXP(mean = 10) with dotted vertical lines at the quantiles]

$\endgroup$
  • 1
    $\begingroup$ (+1) As always this is a well-crafted and lucid answer. But somewhere there needs to be a flag that this is all about the normal distribution and that quantiles are defined much more widely. $\endgroup$
    – Nick Cox
    Commented Jun 4, 2020 at 8:02
