
I understand the purpose of a z-score and how you calculate $z_1$ and $z_2$, given $x_1$ and $x_2$ for some normal random variable $X$:

$$Z = \frac{x-\mu}{\sigma}.$$

If $x-\mu = \sigma$, then $Z = 1.00$, which tells us that the sample point $x$ is precisely one standard deviation ($\sigma$) to the right of its mean ($\mu$).

Likewise, if $x-\mu = -\sigma$, then $Z = -1.00$, which tells us that the sample point $x$ is precisely one standard deviation ($\sigma$) to the left of its mean ($\mu$).
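For instance, with $\mu = 100$, $\sigma = 15$, and $x = 130$, I get $Z = \frac{130 - 100}{15} = 2.00$, i.e. $x$ lies exactly two standard deviations to the right of the mean.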

However, what all sources I've come across fail to properly explain is where the values listed in a $z$-table come from.

I am interested in calculating $$P(z_1<Z<z_2)$$ by hand. Everything I've read online just tells me to look up the value in a table. Where do those values come from? How is the above probability computed? And how is it any simpler than computing $P(x_1 < X < x_2)$?

  • It comes from numerical integration of the pdf associated with the normal distribution. It can't really be done by hand (at least, not to a reasonable degree of precision in a reasonable amount of time).
    – Xander Henderson
    Commented Feb 9, 2018 at 14:46
  • The advantage does not lie in "being simpler" but in the fact that exactly one table is enough for all pairs $(\mu,\sigma)$. Where does it come from? It is just the evaluation of $\Phi(z)$, where $\Phi$ denotes the CDF of a random variable that has the standard normal distribution. This evaluation has been done and is at our disposal, so what is the profit of doing it again?
    – drhab
    Commented Feb 9, 2018 at 14:51
  • $\begingroup$ "This evaluation has been done and is at our disposal, so what is the profit of doing it again?" Well, I mean, without understanding how it was derived, can I really trust that "this is just how it is"? That's one of my major problems w/ statistics. A lot of things are mysteriously difficult to derive...yet someone did derive them. $\endgroup$ Commented Feb 9, 2018 at 15:12
  • @AleksandrH Yes, we can really put our trust in it. IMHO it is more essential for us (workers in statistics and probability) to know what it is (not so much how it was calculated). In my former comment I already told you what it is. The calculation is mathematics too, of course, but it has much less of my interest. In that area I simply live in good faith.
    – drhab
    Commented Feb 9, 2018 at 15:21
  • @AleksandrH The problem is that most people who want to learn statistics want to learn how to use stats as a tool. They don't really need or want to know the details of the deep, dark theory. And the theory itself requires some work: probably some measure theory and rigorous probability theory (both are typically advanced undergraduate or even graduate-level courses of study), as well as some fairly rigorous statistical theory. The application vs. theory of statistics is like the difference between learning to drive a car and learning to build one.
    – Xander Henderson
    Commented Feb 9, 2018 at 16:40

1 Answer


A normally distributed random variable $X$ has an associated probability density function (pdf) given by $$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}},$$ where $\mu$ is the mean or expected value of the random variable, and $\sigma$ is the standard deviation. You might try picking some values for $\mu$ and $\sigma$ and graphing the result: in each case, you should see a bell-shaped curve, though the location and height of the curve will vary depending on the parameters you choose.

The probability that the random variable falls between two values is the area under the curve between those two values. We use integrals to find these areas, so $$ P(x_1 < X < x_2) = \int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi\sigma^2}} \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}\, \mathrm{d}x. $$ It turns out that this integrand does not have an antiderivative in terms of elementary functions; without going into the details of what that means, the upshot is that the integral cannot be computed "exactly" in closed form. The best that we can do is use numerical methods to approximate the values that this integral takes. We can get approximations that are as good as we like (if we spend enough time and/or computer power on it), but all we'll ever have are approximations. We can then print up tables and tables of these approximations, and use those tables for calculations.
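To make "numerical methods" concrete, here is a minimal sketch of one such approximation using composite Simpson's rule. This is only an illustration, not the method behind any published table; the function names `normal_pdf` and `normal_probability` and the step count `n` are my own choices.

```
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) of a normal random variable with mean mu and std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_probability(x1, x2, mu, sigma, n=1000):
    """Approximate P(x1 < X < x2) with composite Simpson's rule (n must be even)."""
    h = (x2 - x1) / n
    total = normal_pdf(x1, mu, sigma) + normal_pdf(x2, mu, sigma)
    for i in range(1, n):
        weight = 4 if i % 2 == 1 else 2
        total += weight * normal_pdf(x1 + i * h, mu, sigma)
    return total * h / 3

# P(85 < X < 115) for X ~ Normal(mu=100, sigma=15): roughly 0.6827
print(normal_probability(85, 115, mu=100, sigma=15))
```

This is exactly the kind of computation a table-maker (or your calculator) does many times over, just with far more care about error bounds.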

But there is a problem with this idea! The value of the integral will depend on the parameters $\mu$ and $\sigma$! This means that if we change these values even a little, then we have to compute an entirely new table. This is clearly untenable, so we have to use some other trick. The trick is to "standardize" our normal random variables. It turns out that if $X$ is a normal random variable with mean $\mu$ and standard deviation $\sigma$, then $$ Z = \frac{X-\mu}{\sigma} $$ is a standard normal random variable: it is normal with mean $0$ and standard deviation $1$. Note that this is the formula used to compute the $z$-score of a normal random variable!
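To spell out why this works for probabilities, substitute $u = \frac{x-\mu}{\sigma}$ (so $\mathrm{d}x = \sigma\,\mathrm{d}u$) in the integral above; the $\sigma$'s cancel and every normal probability turns into a standard normal one:
$$ P(x_1 < X < x_2) = \int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi\sigma^2}} \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}\,\mathrm{d}x = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-u^2/2}\,\mathrm{d}u = P(z_1 < Z < z_2), $$
where $z_i = \frac{x_i - \mu}{\sigma}$. This is exactly the probability in the question, and it no longer depends on $\mu$ or $\sigma$ except through the limits of integration.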

Since we can turn any normal random variable into a standard normal random variable, we only need one table of values! Yay! The basic idea is that we compute a huge table of values for $P(Z < z_0)$ (using computers at this point in history), then standardize a normal random variable whenever we want to work with it.
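As an illustration of how such a table could be produced (again, not the original computation behind any published table), here is a sketch that builds a miniature $z$-table. It uses the identity $\Phi(z) = \tfrac12\bigl(1 + \operatorname{erf}(z/\sqrt{2})\bigr)$, with Python's `math.erf` doing the underlying numerical approximation; the helper name `standard_normal_cdf` is my own.

```
import math

def standard_normal_cdf(z):
    """Phi(z) = P(Z < z) for a standard normal Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A miniature z-table: P(Z < z) for z = 0.00, 0.10, ..., 1.00
for i in range(11):
    z = i / 10
    print(f"z = {z:4.2f}   P(Z < z) = {standard_normal_cdf(z):.4f}")
```

A printed $z$-table is just this output carried out to finer increments (typically steps of $0.01$) and more rows.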

Long story short: Getting exact values for probabilities associated to normal random variables is generally not possible. Computers can be used to find very good approximations, but we don't want to have a different table for every set of parameters. Since we can standardize any normal random variable, we only need to generate one table in order to work with any normal random variable.

The rest of the story: All of the above basically harkens back to the pre-computer era, or the calculator-free classroom. Modern computers can perform calculations fast enough to get 7- or 15-digit approximations in a fraction of a second, and most statistical (and even spreadsheet!) software has normal distributions built in. The user inputs the value of $x$, the mean, and the standard deviation, and the computer turns the crank and spits out a numerical approximation almost instantly. I would guess that the computer first standardizes the input and actually performs the approximation for a standard normal distribution, but I don't actually know the nitty-gritty of statistical or spreadsheet software, or calculator programming.
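For a concrete example of what "built in" means: Python's standard library has shipped a normal distribution since version 3.8, and one call replaces the whole table lookup.

```
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)   # any normal distribution
print(X.cdf(115) - X.cdf(85))      # P(85 < X < 115), roughly 0.6827

Z = NormalDist()                   # standard normal (mu=0, sigma=1)
print(Z.cdf(1.0) - Z.cdf(-1.0))    # same probability after standardizing
```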

The moral of this story is that tables are an anachronism, and have been replaced by computers in real-world use.

  • Thank you! This made a lot of sense. But I imagine $z$-tables don't have all conceivable values, right? Since you run into the same problem as with the "$X$" table (changing $\mu$ and $\sigma$ even slightly gives a different answer). Commented Feb 9, 2018 at 15:11
  • @AleksandrH With a $z$-table, there is only one parameter to change: the upper bound of integration. This is typically done along some regular interval, say in increments of $0.01$. When you add more parameters, the number of entries that you have to print becomes unmanageable; I am currently staring at a table for the binomial distribution that runs to 10 pages, and only considers experiments with fewer than 20 trials. Since we can normalize or standardize our normal variables, it makes more sense to do this.
    – Xander Henderson
    Commented Feb 9, 2018 at 16:29
