
On page 4 of https://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf, it is stated that the regressors have zero correlation with the residuals in OLS, but I don't think this is true.

The assertion is based on the fact that $$ X^Te = 0 $$ where $e$ are the residuals $y - \hat{y}$.

But why does this mean the regressor is uncorrelated with the residual?

I tried to derive this using the definition of covariance for two random variables, where $X_p$ is the random variable corresponding to the $p$-th regressor: \begin{align} cov(X_p, e) &= E[(X_p - \mu_{X_p})(e - \mu_e)] \\ &= E[X_p e - \mu_{X_p} e - \mu_e X_p + \mu_{X_p} \mu_e] \\ &= E[X_p e] - \mu_{X_p} \mu_e \end{align}

We know that $E[X_p e] = 0$, but $X_p$ is uncorrelated with $e$ only if at least one of their means is zero.

Edit. I think there may be a mistake in my derivation. I do not believe $E[X_p e] = 0$.
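For concreteness, here is a small numerical check in R (simulated data; the names and numbers are just illustrative). It only looks at the fitted residuals from one sample, so it says nothing about the population expectation I am unsure about above.

set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)   # any data-generating process works here
fit <- lm(y ~ x1 + x2)             # model with an intercept
e   <- resid(fit)
X   <- model.matrix(fit)           # design matrix, including the column of 1s
crossprod(X, e)                    # X^T e: zero up to floating-point error
c(mean(e), cov(x1, e), cov(x2, e)) # residual mean and sample covariances: also ~0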

  • Since you don't think this is true, what counterexample have you come up with? This will help us understand how you interpret the meaning of "correlation" in this context. The ambiguity of meaning lies in the fact that $X$ is explicitly not a random variable, but $e$ is.
    – whuber
    Commented Jun 25, 2020 at 22:19
  • @whuber I just edited the OP with my derivation, which I think is a counterexample? I interpret correlation as the definition of correlation (covariance divided by the product of the standard deviations of the two random variables). Also, I believe $X$ is a random variable, or I should say the matrix $X$ consists of $M$ random variables where $M$ is the number of regressors. Commented Jun 25, 2020 at 22:27
  • See also stats.stackexchange.com/questions/207841/… Commented Jun 26, 2020 at 13:09

2 Answers


In any model with an intercept, the residuals are uncorrelated with the predictors $X$ by construction; this is true whether or not the linear model is a good fit and it has nothing to do with assumptions.

It's important here to distinguish between the residuals and the unobserved things often called the errors.

The (sample) covariance between the residuals $R$ and $X$ is $$\frac{1}{n}\sum RX-\frac{1}{n}\left(\sum R\right)\frac{1}{n}\left(\sum X\right).$$ If the model includes an intercept, $\sum R=0$, so the covariance is just $\frac{1}{n}\sum RX$. But the normal equations defining $\hat\beta$ are $X^T(Y-\hat Y)=0$, i.e., $\frac{1}{n}\sum XR=0$.

So the residuals and $X$ are exactly uncorrelated.

When there is actually a model $$Y = X\beta+e$$ the assumption that the errors $e$ are uncorrelated with $X$ is necessary to make $\hat\beta$ unbiased for $\beta$ (and we assume the errors have mean zero to make the intercept identifiable). So $E[X^Te]=0$ is an assumption, not a theorem.

The residuals typically are not uncorrelated with $Y$. Neither are the errors.
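Here is a minimal R sketch of that difference (simulated data, purely illustrative). With an intercept, the sample correlation between the residuals and $x$ is zero up to floating point; without an intercept, the residuals are still orthogonal to $x$ (the normal equation $\sum x_iR_i=0$ still holds), but they generally neither sum to zero nor are uncorrelated with $x$.

set.seed(2)
x <- runif(30, 1, 10)
y <- 5 + 2*x + rnorm(30)     # data with a nonzero intercept

r1 <- resid(lm(y ~ x))       # fit with an intercept
r0 <- resid(lm(y ~ x - 1))   # fit without an intercept

cor(r1, x)                   # zero, up to floating-point error
sum(x * r0)                  # still ~0: the no-intercept normal equation
sum(r0)                      # generally nonzero: residuals need not sum to zero
cor(r0, x)                   # hence generally nonzero as well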

  • Couple of questions. (1) Regarding your last sentence: $\hat{Y}$ is uncorrelated with the residuals, but $Y$ is not, right? (2) In our $X$ matrix we have a column of 1s, so we end up with $\langle \boldsymbol{1}_n, r \rangle = 0$, and for this to hold, $\sum_i r_i = 0$. But if our offset evaluates to zero, don't we still have a column of 1s in the $X$ matrix? It's just that when you compute $X\hat{\beta}$, the first column of $X$ ends up being multiplied by $\hat{\beta}_0 = 0$. Commented Jun 26, 2020 at 2:46
  • That's right. $Y$ is correlated with the residuals, $\hat Y$ isn't. You would almost always have an intercept in the model, but you certainly can specify models without one, and they even have uses, such as for ratio estimation in surveys. The residuals in models like that don't add to zero. Commented Jun 26, 2020 at 10:03
  • +1: In ordinary least squares, the mean of the residuals $e_i=y_i-\hat{y}_i$ will be $0$. If it is not, and instead $\frac{1}{n}\sum (y_i-\hat{y}_i)= k \not = 0$, then $\hat{y}_i+k$ will be a better least squares estimate than $\hat{y}_i$ in the sense that $\sum(y_i-(\hat{y}_i+k))^2 = \sum(y_i-\hat{y}_i)^2 -nk^2\lt \sum(y_i-\hat{y}_i)^2$. Indeed, the constant term in the OLS fit deals with this automatically.
    – Henry
    Commented Jun 26, 2020 at 12:49
  • I think I'm still a little confused about the concept of "having an intercept." If you formulate the model so that it "has" an offset, but the $\hat{\beta}$ element corresponding to the offset, $\hat{\beta}_0$, happens to evaluate to zero, your residuals still sum to zero, right? But if instead you formulated the model without the $\hat{\beta}_0$ term, then your residuals are no longer guaranteed to sum to zero? Not sure if these questions make sense to others? Commented Jun 26, 2020 at 15:13
  • I see that in your proof you assume that the model contains an intercept, so $\sum R = 0$. Would the claim that the regressors are uncorrelated with the residuals still hold if the model did not have an intercept?
    – timeinbaku
    Commented Mar 14 at 19:53

Consider the model $$Y_i = 3 + 4x_i + e_i,$$ where $e_i \stackrel{iid}{\sim} \mathsf{Norm}(0, \sigma=1).$

A version of this is simulated in R as follows:

set.seed(625)
x = runif(20, 1, 23)           # 20 predictor values, uniform on (1, 23)
y = 3 + 4*x + rnorm(20, 0, 1)  # responses from the true line plus N(0, 1) errors

Of course, one anticipates a linear association between $x_i$ and $Y_i,$ otherwise there is not much point trying to fit a regression line to the data.

cor(x,y)
[1] 0.9991042

Let's do the regression procedure.

reg.out = lm(y ~ x)
reg.out

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      3.649        3.985  

So the true intercept $\beta_0= 3$ from the simulation has been estimated as $\hat \beta_0 = 3.649$, and the true slope $\beta_1 =4$ has been estimated as $\hat \beta_1 = 3.985.$ A summary of results shows rejection of the null hypotheses $\beta_0 = 0$ and $\beta_1 = 0.$

summary(reg.out)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.42617 -0.61995 -0.04733  0.41389  2.63963 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.64936    0.52268   6.982 1.61e-06 ***
x            3.98474    0.03978 100.167  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9747 on 18 degrees of freedom
Multiple R-squared:  0.9982,    Adjusted R-squared:  0.9981 
F-statistic: 1.003e+04 on 1 and 18 DF,  p-value: < 2.2e-16

Here is a scatterplot of the data along with a plot of the regression line through the data.

plot(x,y, pch=20)
abline(reg.out, col="blue")

[Scatterplot of the simulated data with the fitted regression line]

With $\hat Y_i = \hat\beta_0 + \hat\beta_1 x_i,$ the residuals are $r_i = Y_i - \hat Y_i.$ They are the vertical distances between the $Y_i$ and the regression line at each $x_i.$

We can retrieve their values as follows:

r = reg.out$residuals
summary(r)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.42617 -0.61995 -0.04733  0.00000  0.41389  2.63963 

The regression procedure ensures that $\bar r = 0,$ which is why their Mean was not shown in the previous summary.

Also, the residuals are exactly uncorrelated with the $x_i$; the least squares fit (with an intercept) guarantees that. Their correlation with the $Y_i$ is not exactly zero, but because the regression line captures nearly all of the variation here, it is small.

cor(r,x);  cor(r,y)
[1] -2.554525e-16
[1] 0.04231753
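Incidentally, when the model has an intercept the sample correlation between the residuals and the $Y_i$ is exactly $\sqrt{1 - R^2}$ (because $r$ is orthogonal to $\hat Y$ and has mean $0$), so it is small here only because the fit is so close. A quick check, using the reg.out object above:

sqrt(1 - summary(reg.out)$r.squared)   # agrees with cor(r, y) above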

Because the errors are normally distributed, it is fair to do a formal test to see if the null hypothesis $\rho_{rY} = 0$ is rejected. It is not.

cor.test(r,y)

        Pearson's product-moment correlation

data:  r and y
t = 0.1797, df = 18, p-value = 0.8594
alternative hypothesis: 
  true correlation is not equal to 0
95 percent confidence interval:
 -0.4078406  0.4759259
sample estimates:
       cor 
0.04231753 

Maybe this demonstration helps you to see why you should not expect to see the correlations you mention in your question. If you are still puzzled, maybe you can clarify your doubts by making reference to the regression procedure above.

  • Thanks for this visualization. I was mostly looking for a theoretical perspective on why this is the case. Specifically, in that link they state that $X^T e = 0$ means the residuals are not correlated with the regressors, but I don't understand how that implication comes about. Commented Jun 25, 2020 at 23:22
  • Hi: I'm not sure what you're confused about, but if it's an issue with your formula, the mean of the residuals is assumed to be zero. If it were not assumed to be zero, then the model itself wouldn't really make sense, because the residuals would then be "explaining" the dependent variable to some extent. The residuals are not supposed to explain anything; they are what's left over after the other variables are used to explain the dependent variable $Y$.
    – mlofton
    Commented Jun 26, 2020 at 0:21
  • The mean of the residuals is zero if you have an intercept and isn't otherwise. It's the mean of the errors that is assumed to be zero. Commented Jun 26, 2020 at 1:21
  • @ThomasLumley: (1) Mean of errors assumed to be 0: as in my "$e_i \stackrel{iid}{\sim} \mathsf{Norm}(0, \sigma=1).$" (2) I have an intercept. Mean of residuals is zero: as in my "The regression procedure ensures that $\bar r=0.$" Exact purpose of your comment is unclear.
    – BruceET
    Commented Jun 26, 2020 at 2:18
