2
$\begingroup$

The corrected sum of squares is the sum of squares of the deviations of a set of values about its mean.

$$ S = \sum_{i=1}^k\space\space(x_i - \bar x)^2 $$

We can calculate the mean in a streaming fashion, as follows:

$$ m_n = \frac{n-1}{n}m_{(n-1)} + \frac{1}{n}x_n $$

I understand the intuition behind this: the sum of the previous $n-1$ values was divided by $n-1$, so by multiplying those values by $\frac{n-1}{n}$, we down-weight the sum properly.

However, we can also calculate the full corrected sum of squares as follows:

$$ S_n = S_{n-1} + \frac{n-1}{n}\left( x_n - m_{n-1}\right)^2 $$

However, I don't have a good intuition for why this works. It looks like we use the previous corrected sum of squares value, and then add the square of the current value's deviation from the mean of all the previous values.

But, this algorithm doesn't make sense to me, even if it was derived logically.

These formulas are from "Note on a Method for Calculating Corrected Sums of Squares and Products" by B.P Welford.

$\endgroup$
3
  • $\begingroup$ I think something like $S_n -S_{n-1}= (x_n^2-m_{n-1}^2) - n(m_n^2 -m_{n-1}^2)$ may be a little more intuitive: you need to add something for the additional term, and adjust for the effect of changing the mean on all the terms. Then try tidying the right hand side $\endgroup$
    – Henry
    Commented Feb 26 at 1:58
  • $\begingroup$ @Henry, I tried re-arranging the formula, but it gave me a very messy expression that doesn't seem to simplify to yours: $S_n - S_{n-1} = \frac{n-1}{n}\left(x_n^2-2x_nm_{n-1}+m_{k-1}^2\right)$. Additionally, it's surprising to me that the update does not modify the old corrected sum of squares, $S_{n-1}$ at all. Given that the old value is based on an old mean. $\endgroup$
    – Foobar
    Commented Feb 26 at 3:26
  • $\begingroup$ I have added an answer $\endgroup$
    – Henry
    Commented Feb 26 at 12:01

1 Answer 1

2
$\begingroup$

Let $k_n = \frac1n \sum\limits_{i=1^n} x_i^2$, i.e. the mean of the sum of squares.

From your mean updating, you have $$k_n = \frac{n-1}{n}k_{n-1} + \frac{1}{n}x_n^2.$$

Recalling expressions for variance: $$k_n = m_n^2 + \frac1n S_n \text{ and so } k_{n-1} = m_{n-1}^2 + \frac1{n-1} S_n.$$

Substituting gives $$m_n^2 + \frac1n S_n = \frac{n-1}{n}(m_{n-1}^2 + \frac1{n-1} S_{n-1}) + \frac{1}{n}x_n^2$$ and multiplying through by $n$ and rearranging gives $$S_n = S_{n-1}+ (x_n^2-m_{n-1}^2) - n(m_n^2 -m_{n-1}^2)$$ which is approaching what you are asking for and may be a little more intuitive: you need to add something for the additional term, and adjust for the effect of changing the mean on all the squares of differences.

If you now use $m_n = \frac{n-1}{n}m_{n-1} + \frac{1}{n}x_n$, you get $S_n$ in terms of $S_{n-1}$, $x_n$ and $m_{n-1}$: $$S_n = S_{n-1}+ \left(x_n^2-m_{n-1}^2\right) - n\left(\left(\frac{n-1}{n}m_{n-1} + \frac{1}{n}x_n\right)^2 -m_{n-1}^2\right)$$ and expanding and tidying up the right hand side gives the original claim that $$S_n = S_{n-1}+ \frac{n-1}{n}\left(x_n-m_{n-1}\right)^2.$$

$\endgroup$
1
  • $\begingroup$ If you want to track and check calculations, it might be worth having some actual numbers. For example you might have $n=4$ and $x_i=6,12,18,36$, so $m_3=12$ and $m_4=18$, $S_3=72$ and $S_4=504$, with my $k_3=168$ and $k_4=450$. $\endgroup$
    – Henry
    Commented Feb 26 at 13:20

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .