ANOVA formulas not matching with correct answer

Question

I'm trying to compare the results of a one-way between-groups ANOVA computed by hand with those given by R. Weirdly, I get completely different results.

I used formulas given by my professor:

\begin{align} MS_{between} &= \frac{\sum_{p=1}^{K}{(\bar{X}_p-\bar{X}_G)^2}}{K-1} \\ MS_{within} &= \frac{\sum_{p=1}^{K}{(nK-1)S^2_K}}{N-K} \\ F &= \frac{MS_{between}}{MS_{within}}, \end{align} where $K$ represents the number of levels, $N$ the number of observations, $\bar{X}_p$ the mean of the $p$th group, and $\bar{X_G}$ the grand mean, or the mean of the group means.

I was given \begin{align} K &=3 \\ N&=15 \\ \bar{X} &= \{38.2, 20.6, 14.8\} \\ \bar{X}_G & = 24.533 \\ S^2 &= \{66.7, 69.3, 61.7\} \end{align} and thus obtained \begin{align} SS_{between} &= 296.9867 \\ SS_{within} &= 8698.8 \\ df_{between} &= 2 \\ df_{within} &= 12 \\ MS_{between} &= 148.4933 \\ MS_{within} &= 724.9 \\ F &= .2048 \end{align}.

However, this is not the correct answer. I'm supposed to get the values in the ANOVA table reported by R (the table image is not reproduced here).

And I'm not quite sure what I have done wrong. I know this community isn't particularly for R programming, but I will attach the code I used to calculate these, just in case:

score = c(38, 47, 39, 25, 42, 22, 19, 8, 23, 31, 14, 26, 11,
18, 5)
lev = as.factor(rep(c("p", "l", "m"), each = 5))
dat = data.frame(score, lev)
xbar1 = mean(dat[dat$lev == "p", 1])
xbar2 = mean(dat[dat$lev == "l", 1])
xbar3 = mean(dat[dat$lev == "m", 1])
var1 = var(dat[dat$lev == "p", 1])
var2 = var(dat[dat$lev == "l", 1])
var3 = var(dat[dat$lev == "m", 1])
xbar = c(xbar1, xbar2, xbar3)
var =  c(var1, var2, var3)
xbarg = mean(xbar)
SSbet = sum((xbar-xbarg)^2)
SSwith = sum((15*3-1)*var)
dfbet = 3-1
dfwith = 15 -3
MSbet = SSbet / dfbet
MSwith = SSwith / dfwith
f = MSbet / MSwith

Haven't played with your code, but at a first glance, your formula for SS-within looks off. It's multiplying all your variances by 44, then summing them. Not sure why you'd want to multiply by 44 if you have a sample of 3 groups of 5? — Amaan M, Commented Dec 15, 2021 at 3:51
Ah, I think this is where the error is, thank you @Amaan! It's supposed to be $n_k$, not $nk$. — JerBear, Commented Dec 15, 2021 at 4:06

BruceET · Accepted Answer · 2021-12-17 00:09:34Z

Consider a one-factor ANOVA with $k$ levels of the factor, and $r$ replications at each level. One says that there are two estimates of the underlying population variance $\sigma^2$ common to each level.

(1) $s_w^2$ is valid whether or not the population means for the $k$ levels are equal. It is the mean of the $k$ sample variances for the levels.

mean(c(66.7,69.3,61.7))
[1] 65.9

(2) $s_a^2$ is valid only if all $k$ levels have equal population means $\mu;$ otherwise it tends to be too large. The sample mean of each of the $k$ samples of size $r$ can be taken as an observation from a normal population with mean $\mu$ and variance $\sigma^2/r.$ Thus $s_a^2$ is estimated as $r$ times the sample variance of the $k$ level means.

(3) The F-statistic is $s_a^2/s_w^2,$ so it tends to be about $1$ if the null hypothesis of equal means is true and greater than $1$ if the null hypothesis is false. In an ANOVA table, $s_a^2$ is called MS(Factor) and $s_w^2$ is called MS(Resid).
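In symbols, the three points above can be sketched as follows. The notation is assumed for illustration and does not come from the question: $\bar{x}_i$ and $s_i^2$ are the sample mean and sample variance at level $i,$ and $\bar{x}$ is the mean of the $k$ level means.

\begin{align} s_w^2 &= \frac{1}{k}\sum_{i=1}^{k} s_i^2 \\ s_a^2 &= \frac{r}{k-1}\sum_{i=1}^{k}(\bar{x}_i-\bar{x})^2 \\ F &= \frac{s_a^2}{s_w^2} \end{align}

Under the usual assumptions (normal data, equal variances, equal population means), $F$ has an F-distribution with $k-1$ and $k(r-1)$ degrees of freedom.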

Now consider the following fictitious data for such a balanced one-way ANOVA with $k=3$ levels and $r=15$ replications per level. [Computations in R.]

set.seed(1216)
x1 = rnorm(15, 100, 10)
x2 = rnorm(15, 105, 10)
x3 = rnorm(15, 110, 10)
x = c(x1,x2,x3)
g = as.factor(rep(1:3, each=15))

anova(lm(x~g))

Analysis of Variance Table

Response: x
          Df Sum Sq Mean Sq F value  Pr(>F)  
g          2  783.9  391.96  4.9402 0.01183 *
Residuals 42 3332.3   79.34                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Then MS(Resid) $=s_w^2 = 79.34$ can be computed as follows from the original data:

mean(c(var(x1), var(x2), var(x3)))
[1] 79.33974

Also, MS(Factor) $= s_a^2 = 391.96$ can be computed as:

15*var(c(mean(x1),mean(x2),mean(x3)))
[1] 391.9579
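As a self-contained check, the two estimates can be combined into the F-statistic and P-value and compared with the ANOVA table. This is only a sketch: it regenerates the simulated data with the same seed, and the names `s2_w`, `s2_a`, `f_manual` and `p_manual` are made up for illustration.

```r
# Regenerate the simulated data (same seed and parameters as above).
set.seed(1216)
x1 <- rnorm(15, 100, 10)
x2 <- rnorm(15, 105, 10)
x3 <- rnorm(15, 110, 10)
x <- c(x1, x2, x3)
g <- as.factor(rep(1:3, each = 15))

k <- 3   # number of levels
r <- 15  # replications per level

# The two variance estimates computed "by hand".
s2_w <- mean(c(var(x1), var(x2), var(x3)))        # MS(Resid)
s2_a <- r * var(c(mean(x1), mean(x2), mean(x3)))  # MS(Factor)

# F-statistic and upper-tail P-value with k - 1 and k(r - 1) degrees of freedom.
f_manual <- s2_a / s2_w
p_manual <- pf(f_manual, df1 = k - 1, df2 = k * (r - 1), lower.tail = FALSE)

# The same quantities taken from R's ANOVA table.
tab <- anova(lm(x ~ g))
f_table <- tab[["F value"]][1]
p_table <- tab[["Pr(>F)"]][1]

print(c(f_manual = f_manual, f_table = f_table))
print(c(p_manual = p_manual, p_table = p_table))
```

The hand-computed and table values should agree, because in a balanced design the two computations are algebraically the same.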

Now for specifics of your question:

I am not sure of the notation you use in your question, but it looks as if you are given the three level sample variances and the three level sample means. So you should be able to use them to verify the numbers in your ANOVA table. The computation for MS(Resid) matches your ANOVA table, but not your earlier results.

mean(c(66.7,69.3,61.7))
[1] 65.9
5*var(c(38.2,20.6,14.8))
[1] 742.4667
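Continuing from these two numbers, here is a sketch of the rest of the computation for your data. It assumes $k=3$ groups of $r=5$ scores each, which is what the code in the question implies, and the variable names are made up for illustration. It also runs `anova` on the raw scores from the question as a cross-check.

```r
# Summary statistics given in the question (k = 3 groups, r = 5 scores each).
means <- c(38.2, 20.6, 14.8)
vars  <- c(66.7, 69.3, 61.7)
k <- length(means)
r <- 5

ms_resid  <- mean(vars)       # mean of the group variances
ms_factor <- r * var(means)   # r times the variance of the group means
f_stat    <- ms_factor / ms_resid
p_value   <- pf(f_stat, df1 = k - 1, df2 = k * (r - 1), lower.tail = FALSE)

# Cross-check against R's ANOVA table, using the raw scores from the question.
score <- c(38, 47, 39, 25, 42, 22, 19, 8, 23, 31, 14, 26, 11, 18, 5)
lev   <- as.factor(rep(c("p", "l", "m"), each = 5))
tab   <- anova(lm(score ~ lev))

print(round(c(MS_factor = ms_factor, MS_resid = ms_resid, F = f_stat), 4))
print(p_value)
print(tab)
```

With these inputs the F-statistic comes out at about 11.27 on 2 and 12 degrees of freedom, nowhere near the .2048 obtained in the question.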

I think the definitions of constants in your original displayed equations may be as unclear to you as they are to me.
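For what it is worth, here is one reading of your displayed equations that reproduces the numbers above. It assumes $n_p$ denotes the number of observations in group $p$ (here $n_p = 5$ for every group), as the comments under the question suggest, and that the between-groups sum is weighted by $n_p.$ This is my guess at the intended formulas, not necessarily what your professor wrote.

\begin{align} MS_{between} &= \frac{\sum_{p=1}^{K} n_p(\bar{X}_p-\bar{X}_G)^2}{K-1} = \frac{5(296.9867)}{2} = 742.4667 \\ MS_{within} &= \frac{\sum_{p=1}^{K}(n_p-1)S^2_p}{N-K} = \frac{4(66.7+69.3+61.7)}{12} = 65.9 \end{align}

Because every group has the same size here, taking $\bar{X}_G$ as the mean of the group means gives the same value as the mean of all $N$ observations.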

I hope my demonstration with known data will help you figure out whether the displayed equations are unclear or wrong. (Or whether you have made a typographical or computational or coding error.)
