
I'm working with a Regression Through the Origin model as described in this reference. The model is $Y_i = \beta X_i + \epsilon_i$ and the least squares estimate of $\beta$ is $\hat\beta = \frac{\sum X_i Y_i}{\sum X_i^2}$.

I'm confused about how the error term influences the confidence interval of $\hat\beta$. In the reference above, the estimated variance of $\hat\beta$ is $\widehat{\text{Var}}(\hat\beta) = \hat\sigma^2 / \sum X_i^2$ with $\hat\sigma^2 = \text{SSE}/(n-1)$, so it is proportional to the SSE, computed as $\sum \left(Y_i-\hat\beta X_i\right)^2$.
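For concreteness, here is a minimal numpy/scipy sketch of these formulas on made-up data (the $n-1$ degrees of freedom reflect the single estimated parameter):

```python
import numpy as np
from scipy import stats

# Regression through the origin on made-up data: Y = beta * X + error
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.3, 2.8, 4.2])

beta_hat = np.sum(x * y) / np.sum(x ** 2)         # beta-hat = sum(X*Y) / sum(X^2)
sse = np.sum((y - beta_hat * x) ** 2)             # SSE = sum of squared residuals
var_beta = (sse / (len(x) - 1)) / np.sum(x ** 2)  # Var(beta-hat) = sigma2-hat / sum(X^2)

# 95% CI from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
half_width = t_crit * np.sqrt(var_beta)
print(beta_hat, (beta_hat - half_width, beta_hat + half_width))
```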

What's confusing me here is that a datapoint $i$ with $Y_i \neq 0$ and $X_i = 0$ has no effect on $\hat\beta$ (it contributes nothing to either sum in the estimator), but can have a large effect on the SSE. Accordingly, including or excluding points with $X_i = 0$ can have a large effect on the confidence interval of the fit, even though they have no effect on the point estimate itself.
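A quick sketch (again with made-up numbers) demonstrating this: appending a point at $X_i = 0$ leaves $\hat\beta$ untouched but inflates the SSE.

```python
import numpy as np

def rto(x, y):
    """Return (beta_hat, SSE) for regression through the origin."""
    b = np.sum(x * y) / np.sum(x ** 2)
    return b, np.sum((y - b * x) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.1, 2.9])
print(rto(x, y))            # baseline beta-hat and SSE

# Add a point at x = 0 with a large y: beta-hat is identical, SSE jumps
print(rto(np.append(x, 0.0), np.append(y, 5.0)))
```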

Why does this happen, and doesn't this suggest there should be a better way of estimating $\text{Var}(\hat\beta)$? Shouldn't a data point that by construction has no effect on the estimate also have no effect on the variance of the estimate?

• I got a partial understanding from the response to this: stats.stackexchange.com/questions/457383/… There's an underlying assumption of homoscedasticity, so the error at $x = 0$ is as good an estimate of the error variance as the error at any other point, and it is valid to include. Commented Jun 11 at 21:56

1 Answer


This type of datapoint is actually very helpful for estimating the variance

This is quite unsurprising when you think a bit more about it. Just as estimating a mean is a very different thing from estimating a variance, estimating the slope of a line is a very different thing from estimating the variance of the deviations of values around that line. It is possible for information to be very helpful in estimating the mean of a distribution and very unhelpful in estimating its variance, or vice versa.

A datapoint at $x_i=0$ cannot affect the slope estimate because it tells you nothing about the behaviour of the true regression line that you don't already know from the fact that it passes through the origin. For this datapoint the deviation $y_i \neq 0$ must be pure "error" that is not attributable to the regression, so it does not contribute to the slope estimator. However, it is precisely because it represents error (i.e., deviation from the true regression line) that this point does give you information about the error variance in the regression, and therefore about the variance of the slope estimator.

As a thought experiment to confirm this intuition, imagine that you only had one datapoint with $x_i \neq 0$, so that your estimated slope would just be the slope of the line through the origin and that single point (which we will here call the "slope-informative point"). Now, imagine that all your other datapoints fall at $x_i = 0$. If these non-slope-informative datapoints are all tightly packed around $y_i \approx 0$ then that tells you that the error variance in the model is low, which means that the error variance for the slope-informative point is low, which means that the true regression value at that point is close to the observed value. Naturally, this will mean that your estimated slope is more accurate and your confidence interval for the true value of the slope will be narrow.

Conversely, if those non-slope-informative datapoints are all large deviations from zero (i.e., $|y_i| \gg 0$) then that tells you that the error variance in the model is high, which means that the error variance for the slope-informative point is high, which means that the true regression value at that point is potentially far away from the observed value. Naturally, this will mean that your estimated slope is less accurate and your confidence interval for the true value of the slope will be wide.
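Here is a small numpy/scipy simulation of this thought experiment (hypothetical data and helper, assuming homoscedastic normal errors): one slope-informative point at $x = 1$ plus twenty points at $x = 0$. The slope estimate is identical in both cases; the CI width is driven entirely by the spread of the points at zero.

```python
import numpy as np
from scipy import stats

def rto_ci(x, y, level=0.95):
    """Slope estimate and confidence interval for regression through the origin."""
    b = np.sum(x * y) / np.sum(x ** 2)
    sigma2 = np.sum((y - b * x) ** 2) / (len(x) - 1)   # SSE / (n - 1)
    se = np.sqrt(sigma2 / np.sum(x ** 2))
    t = stats.t.ppf(0.5 + level / 2, df=len(x) - 1)
    return b, (b - t * se, b + t * se)

rng = np.random.default_rng(0)
x = np.concatenate(([1.0], np.zeros(20)))  # one slope-informative point, 20 at x = 0

# Tightly packed y-values at x = 0 -> small SSE -> narrow CI for the slope
y_tight = np.concatenate(([2.0], rng.normal(0.0, 0.05, 20)))
print(rto_ci(x, y_tight))

# Large deviations at x = 0 -> large SSE -> wide CI, same slope estimate
y_noisy = np.concatenate(([2.0], rng.normal(0.0, 2.0, 20)))
print(rto_ci(x, y_noisy))
```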

