$\begingroup$

I am reading *All of Nonparametric Statistics* by Larry Wasserman. On page 12, he defines the empirical distribution function as follows:

The empirical distribution function $\hat{F_n}$ is the CDF that puts mass $\frac{1}{n}$ at each data point $X_i$. Formally,

$$\hat{F_n}(x)=\frac{1}{n}\sum^{n}_{i=1}I(X_i\le x)$$

where

$$I(X_i\le x)=\begin{cases} 1 & \text{if } X_i \le x\\ 0 & \text{if } X_i>x \end{cases}$$

My questions are:

  1. Why is $\frac{1}{n}$ called mass?

  2. The CDF puts mass $\frac{1}{n}$ to each data point $X_i$, then, by my understanding, it should be $\frac{1}{n}X_1+\frac{1}{n}X_2+...+\frac{1}{n}X_n$.

Why is it instead $\hat{F_n}(x)=\frac{1}{n}\sum^{n}_{i=1}I(X_i\le x)$? It seems to me that this formula puts mass $\frac{1}{n}$ on each indicator function $I(X_i \le x)$ rather than on $X_i$.

What is the meaning of "puts" something "at each data point"?

$\endgroup$

1 Answer

$\begingroup$

> Why is $\frac{1}{n}$ called mass?

The term "mass" refers to an amount of probability at a single discrete point, as distinct from "density" in relation to continuous distributions.
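One way to make "mass" concrete is to write the empirical distribution as a probability assignment on sets. The symbol $\hat{P}_n$ is notation introduced here for illustration and is not part of the quoted definition. The second equality assumes the observed values are all distinct; with ties, a value observed $k$ times would carry mass $k/n$.

$$\hat{P}_n(A)=\frac{1}{n}\sum^{n}_{i=1}I(X_i\in A), \qquad \hat{P}_n(\{X_j\})=\frac{1}{n}$$

Taking $A=(-\infty,x]$ gives back $\hat{F_n}(x)=\hat{P}_n\big((-\infty,x]\big)=\frac{1}{n}\sum^{n}_{i=1}I(X_i\le x)$, so the CDF is just the accumulated mass at or to the left of $x$.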

> The CDF puts mass $\frac{1}{n}$ to each data point $X_i$, then, by my understanding, it should be $\frac{1}{n}X_1+\frac{1}{n}X_2+...+\frac{1}{n}X_n$.

That is a statement rather than a question, but the understanding it expresses is mistaken in a couple of ways at once, so I can discuss that.

First, the expression $\frac{1}{n}X_1+\frac{1}{n}X_2+\dots+\frac{1}{n}X_n$ is actually the sample mean (as a random variable): it literally averages the values. I presume you meant instead to write an expression for the empirical probability function. Keep in mind, though, that we are dealing with a distribution function, not a probability function, so for each possible value of $x$ you need the proportion of the empirical probability that lies at or to the left of $x$. That is how a distribution function represents probability $1/n$ at each point:

[Figure: empirical probability function and empirical cdf]

These are two different representations of the same underlying object. You can see that the empirical pmf shows a mass of $1/n$ at each observed value, while the ecdf shows a height that increases by $1/n$ at each observed value (and that this corresponds to $1/n$ times the sum of indicator functions you mentioned).
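A minimal Python sketch of the two representations, alongside the sample mean for contrast. The sample `data` and the helper names `epmf` and `ecdf` are hypothetical, chosen only for illustration; exact fractions are used so the $1/n$ steps are visible.

```python
from collections import Counter
from fractions import Fraction


def epmf(data):
    """Empirical pmf: mass count/n at each distinct observed value."""
    n = len(data)
    counts = Counter(data)
    return {value: Fraction(counts[value], n) for value in sorted(counts)}


def ecdf(data, x):
    """Empirical CDF: (1/n) * sum over i of the indicator I(X_i <= x)."""
    n = len(data)
    return Fraction(sum(1 for xi in data if xi <= x), n)


data = [2, 5, 7, 11]  # hypothetical sample with n = 4 distinct values

# Each observed value carries mass 1/4.
for value, mass in epmf(data).items():
    print("mass at", value, "=", mass)

# The ecdf rises by 1/4 at each observed value and is flat in between.
for x in [1, 2, 6, 11]:
    print("ecdf at", x, "=", ecdf(data, x))

# The weighted sum (1/n) * X_1 + ... + (1/n) * X_n is the sample mean,
# a single number rather than a function of x.
sample_mean = Fraction(sum(data), len(data))
print("sample mean =", sample_mean)
```

Note that `ecdf` depends on `x` only through which observations lie at or to its left, whereas `sample_mean` uses the observed values themselves as the things being averaged.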

> What is the meaning of "puts" something "at each data point"?

I'm not sure exactly what causes the difficulty here; the words essentially take their ordinary meanings. The images above show a proportion of $1/n$ at each observed value $x_i$, and if you treat the epmf and the ecdf as a pmf and a cdf respectively, those proportions are probabilities. Possibly what confuses you is the treatment of $\hat{F}$ as an active entity (one that can "put" things somewhere). Would it be easier to understand if it said "has" rather than "puts"? If that doesn't help, you'll have to make it clearer what you need explained there.
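To illustrate treating the ecdf as a cdf, here is a small Python sketch that computes probabilities of intervals from it. The sample and the function names are hypothetical, and the sketch assumes nothing beyond the indicator-sum definition quoted in the question.

```python
from fractions import Fraction


def ecdf(data, x):
    """Empirical CDF: proportion of observations at or to the left of x."""
    return Fraction(sum(1 for xi in data if xi <= x), len(data))


def prob_interval(data, a, b):
    """P(a < X <= b) when X follows the empirical distribution of data."""
    return ecdf(data, b) - ecdf(data, a)


data = [2, 5, 7, 11]  # hypothetical sample, n = 4

# Two observations (5 and 7) lie in (4, 8], each carrying mass 1/4.
p = prob_interval(data, 4, 8)

# The same probability obtained by counting the points in the interval.
direct = Fraction(sum(1 for xi in data if 4 < xi <= 8), len(data))

print(p, direct)
```

The difference of ecdf values and the direct count agree because the only probability in any interval is the $1/n$ sitting at each observed value inside it.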

$\endgroup$
