$\begingroup$

I am using the following definition of the central limit theorem:

Suppose $X_1, X_2, \dots, X_n$ are independent and identically distributed with $E(X_i)=\mu$ and $Var(X_i)=\sigma^2$.

Then as $n\to\infty$, the distribution of $Z_n=\frac{X_1+X_2+\dots+X_n-n\mu}{\sigma\sqrt n}$ approaches $Normal(0,1)$.

My question is about the value of $n$.

Say we have a population of 20 students with the following ages

[12 14 15 19 20
 13 15 16 17 18
 21 22 23 24
 19 17 16 15 11]

So what is $n$? Is $n$ here the number of students you take for a sample (say $n=5$ gives a sample such as $[21, 22, 24, 12, 19]$), or could $n=5$ mean taking 5 samples of size 2, for example

$[12, 14]$
$[13, 17]$
$[15, 11]$
$[17, 16]$
$[19, 18]$

So is $n$ the number of students in the sample, or the number of times you take a sample?

$\endgroup$
  • 1
    $\begingroup$ n is a sample size $\endgroup$
    – Aksakal
    Commented Jun 5, 2021 at 18:49
  • $\begingroup$ So are the sample size and the number of samples the same thing? $\endgroup$ Commented Jun 5, 2021 at 18:54
  • $\begingroup$ n is the number of students. The number of samples would be 1. $\endgroup$
    – Michael M
    Commented Jun 5, 2021 at 19:34
  • $\begingroup$ I see, so the bigger the sample size (the more students you take for 1 sample), the more likely it is that $Z_n$ will be normally distributed as you take more samples. $\endgroup$ Commented Jun 5, 2021 at 19:39
  • $\begingroup$ I think I get it: each $X_i$ is one sampled value, so if you have $n$ of them, you have $n$ sampled values and the sample size is $n$. $\endgroup$ Commented Jun 5, 2021 at 19:54

1 Answer

$\begingroup$

Your population is an unusual one to use for an example of the Central Limit Theorem, but not an impossible choice. Before I discuss your finite population, let me give two more standard examples that may be easier to understand.

Continuous uniform population. Suppose your population is $\mathsf{Unif}(0,1)$ which has mean $\mu = 0.5,$ variance $\sigma^2 = 1/12,$ and standard deviation $\sigma = \sqrt{1/12} = 0.2887$ (to four places). This is a continuous population with infinitely many elements.
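For readers who want to check those constants, here is a short derivation from the $\mathsf{Unif}(0,1)$ density $f(x) = 1$ on $(0,1)$. It only restates standard facts, using the symbols already introduced above.

$$\mu = \int_0^1 x\,dx = \tfrac12, \qquad E(X^2) = \int_0^1 x^2\,dx = \tfrac13, \qquad \sigma^2 = E(X^2) - \mu^2 = \tfrac13 - \tfrac14 = \tfrac1{12}.$$

Taking the square root gives $\sigma = \sqrt{1/12} \approx 0.2887.$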

Suppose you take a sample $X_1, X_2, \dots, X_{20}$ of size $n=20$ from this population, using R to do the sampling. It has sample mean $\bar X = 0.5508$ and sample standard deviation $S = 0.3135.$ The sample mean estimates the population mean $\mu = 0.5$ and the sample SD estimates the population SD $\sigma = 0.2887.$ With a sample size as small as $n=20,$ we cannot expect excellent estimates.

set.seed(123)
x = runif(20)
mean(x)
[1] 0.5508084
sd(x)
[1] 0.313471

Then one may wonder about the distribution of the random variable $\bar X.$ Simple statistical theory, which you may already know, says that the expected value is $E(\bar X) = \mu = 0.5$ and that $SD(\bar X) = \sigma/\sqrt{n} = \sigma/\sqrt{20} = 0.0645.$ The Central Limit Theorem says that, for large $n,$ the shape of the distribution of $\bar X$ will be nearly normal. Is $n = 20$ large enough for $\bar X$ to have a roughly normal distribution? That problem can be solved to give the exact density function of $\bar X,$ but instead we will take a large number of samples of size $n = 20,$ plot a histogram of their means, and see whether it looks anything like normal. [In the computer code I use a.20 for a vector of 100,000 sample means.]

set.seed(124)
a.20 = replicate(10^5, mean(runif(20)))
mean(a.20)
[1] 0.4999757     # aprx 0.5
sd(a.20)
[1] 0.06460389    # aprx 0.0645

The mean and standard deviation of the 100,000 simulated values of $\bar X$ are close to the theoretical values mentioned earlier.
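To make the two roles explicit, here is a minimal sketch that keeps the sample size $n$ separate from the number of simulated samples. The names `reps`, `sizes`, `sim.sd` and `theory.sd`, the seed, and the particular sample sizes are my own choices for illustration. The sketch checks $SD(\bar X) = \sigma/\sqrt{n}$ for three values of $n$ while holding the number of simulated samples fixed.

```r
# n    = sample size: how many observations are averaged into ONE sample mean
# reps = number of simulated samples; used only to see the distribution of the mean
set.seed(2021)                 # arbitrary seed chosen for this sketch
reps  <- 10000                 # illustrative choice, not part of the theorem
sigma <- sqrt(1/12)            # population SD of Unif(0,1)

sizes <- c(5, 20, 80)          # three illustrative sample sizes n

# For each n, simulate `reps` sample means and take their standard deviation
sim.sd <- sapply(sizes, function(n) sd(replicate(reps, mean(runif(n)))))

# Theoretical SD of the sample mean: sigma / sqrt(n)
theory.sd <- sigma / sqrt(sizes)

data.frame(n = sizes, simulated = sim.sd, theory = theory.sd)
```

Changing `reps` only makes the simulated column less noisy. It is `n` that changes the spread of $\bar X,$ and it is the $n$ that appears in the theorem.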

Now we look at the histogram of the standardized $\bar X$'s and see that it very nearly matches the density curve of $\mathsf{Norm}(0,1).$

z = (a.20 - 0.5)/0.0645
hdr = "n = 20: Means of Uniform Samples"
hist(z, prob=T, br=50, col="skyblue2", main=hdr)
curve(dnorm(x), add=T, lwd=2, col="red")

[Figure: histogram of standardized means of uniform samples, n = 20, with the standard normal density curve in red.]

[For details of the convergence of sums (thus means) of observations from $\mathsf{Unif}(0,1)$ as sample size increases see Wikipedia on Irwin-Hall distributions.]

Continuous exponential population. If we take samples of size $n = 20$ from an exponential population with rate $\lambda = 0.1$ and mean $\mu = 10,$ we see that the theoretical values $E(\bar X) = 10$ and $SD(\bar X) = 10/\sqrt{20} = 2.2361$ are closely approximated by the mean and SD of 100,000 simulated sample means $\bar X.$ However, means from a skewed exponential distribution converge to a normal shape more slowly than means from a uniform distribution.

set.seed(125)
a.20 = replicate(10^5, mean(rexp(20, 0.1)))
mean(a.20)
[1] 10.00275  # aprx 10
sd(a.20)
[1] 2.238413  # aprx 2.2361
z = (a.20 - 10)/2.2361
hdr = "n = 20: Means of Exponential Samples"
hist(z, prob=T, br=50, col="skyblue2", main=hdr)
curve(dnorm(x), add=T, lwd=2, col="red")

[Figure: histogram of standardized means of exponential samples, n = 20, with the standard normal density curve in red.]

For sufficiently large $n,$ the shape of the distribution of $\bar X$ from an exponential population becomes very close to normal, but $n = 20$ is not sufficiently large for samples from an exponential. [The actual distribution of $\bar X$ in this situation is a somewhat right-skewed gamma distribution with shape parameter $20.]$
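The remaining skewness can be quantified without simulation. This is a minimal sketch relying on the standard fact that the mean of $n$ independent exponential variables with rate $\lambda$ has a gamma distribution with shape $n$ and rate $n\lambda.$ The variable names and the choice to compare tail areas beyond two standard errors are mine.

```r
n      <- 20
lambda <- 0.1
mu     <- 1 / lambda           # population mean, 10
se     <- mu / sqrt(n)         # SD of the sample mean, about 2.2361

# Exact distribution of the sample mean: Gamma(shape = n, rate = n * lambda)
upper.exact <- pgamma(mu + 2 * se, shape = n, rate = n * lambda, lower.tail = FALSE)
lower.exact <- pgamma(mu - 2 * se, shape = n, rate = n * lambda)

# A normal approximation assigns the same area to both tails
tail.normal <- pnorm(-2)

c(upper.exact = upper.exact, lower.exact = lower.exact, normal = tail.normal)
```

If the normal approximation were very good, all three numbers would nearly agree. For $n = 20$ the exact upper tail is heavier than the normal tail and the exact lower tail is lighter, which is the right skew mentioned above.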

Discrete finite population. You propose the population in the vector pop below, which has mean $\mu = 17.2105$ and standard deviation $\sigma = 3.5774.$ [Notice that the denominator of the variance here is the population size $N = 19.$]

pop=c(12, 14, 15, 19, 20, 13, 15, 16, 17, 18, 
      21, 22, 23, 24, 19, 17, 16, 15, 11)
N = length(pop)
mu = mean(pop); mu
[1] 17.21053
vr = var(pop)*(N-1)/N
sg = sqrt(vr); sg
[1] 3.577399

Now suppose I want to take a random sample (necessarily, with replacement) of size $n = 50$ from this finite population. Again here the sample mean $\bar X$ has $E(\bar X) = \mu = 17.2105$ and $SD(\bar X) = 3.577399/\sqrt{50} = 0.5059.$ These values are reasonably well approximated by the simulation of 100,000 sample means.

set.seed(126)
a.20 = replicate(10^5, mean(sample(pop,50,rep=T)))
mean(a.20)
[1] 17.21085  # aprx 17.2105
sd(a.20)
[1] 0.5040913 # aprx 0.5059 

z = (a.20 - 17.2105)/0.5059
hdr = "n = 50: Means of Samples from Finite Population"
hist(z, prob=T, br=50, col="skyblue2", main=hdr)
curve(dnorm(x), add=T, lwd=2, col="red")

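[Figure: histogram of standardized means of samples from the finite population, n = 50, with the standard normal density curve in red.]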

The match to a standard normal distribution is not bad; averaging $n = 50$ observations has begun to imitate a continuous normal distribution. While there are only $N = 19$ values in the population (14 of them unique), there are $204$ distinct averages among the $100{,}000$ simulated samples of size $n=50.$ [There are about 280 distinct means for samples of size 100, and about 800 for $n = 1000.$ Very gradually, as $n$ increases, the distribution of the sample means becomes more nearly like that of sample means from a continuous distribution.]

length(pop)
[1] 19
length(unique(pop))
[1] 14
length(unique(a.20))
[1] 204
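To tie this back to the original question, here is a minimal sketch using the same population of ages with $n = 5.$ The names `one.sample`, `one.mean`, `reps` and `means`, and the seed, are hypothetical choices for illustration. In it, $n$ is the number of students in one sample, and `reps` is how many times the sampling is repeated, which is not part of the theorem.

```r
pop <- c(12, 14, 15, 19, 20, 13, 15, 16, 17, 18,
         21, 22, 23, 24, 19, 17, 16, 15, 11)

set.seed(2021)                  # arbitrary seed chosen for this sketch
n <- 5                          # sample size: students in ONE sample

# One sample of n students gives ONE value of the sample mean
one.sample <- sample(pop, n, replace = TRUE)
one.mean   <- mean(one.sample)

# Repeating the sampling many times shows how that mean varies;
# reps is an illustrative simulation setting, not the n of the theorem
reps  <- 10000
means <- replicate(reps, mean(sample(pop, n, replace = TRUE)))

length(one.sample)    # equals n
length(means)         # equals reps
mean(means)           # should be close to mean(pop)
```

The five samples of size 2 listed in the question would correspond to `n <- 2` and `reps <- 5` in this notation.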
$\endgroup$
