
Let $y \sim N(X\beta, \Sigma)$. The residuals from OLS regression can be written as $$\hat{e} = (I-H)y$$ where $H = X(X^TX)^{-1}X^T$ is the orthogonal projection onto the column space of $X$. The fitted values can be written as $$\hat{y} = Hy.$$
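
For concreteness, here is a small R check with arbitrary simulated data (the design matrix and coefficients are made up for illustration) that $Hy$ reproduces the fitted values and $(I-H)y$ the residuals reported by lm():

set.seed(1)
X <- cbind(1, 1:5)                      # design matrix: intercept plus one regressor
y <- drop(X %*% c(2, 0.5)) + rnorm(5)   # simulated response
H <- X %*% solve(crossprod(X)) %*% t(X) # hat matrix H = X (X'X)^{-1} X'
fit <- lm(y ~ X[, 2])
all.equal(drop(H %*% y), unname(fitted(fit)))                # TRUE
all.equal(drop((diag(5) - H) %*% y), unname(residuals(fit))) # TRUE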

Since $\hat{y}$ and $\hat{e}$ are uncorrelated, they are independent by the normality assumption on $y$.
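
(To spell out the uncorrelated part, taking the usual homoskedastic case $\Sigma = \sigma^2 I$ and using that $H$ is symmetric and idempotent: $$\operatorname{Cov}(\hat{y}, \hat{e}) = \operatorname{Cov}(Hy, (I-H)y) = H(\sigma^2 I)(I-H)^T = \sigma^2(H - H^2) = 0,$$ and because $(\hat{y}, \hat{e})$ is a linear transformation of the Normal vector $y$, it is jointly Normal, so zero covariance gives independence.)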

My question is: what does it intuitively mean for them to be independent? Is it that changing the value of $\hat{y}$ does not change $\hat{e}$? But since $\hat{e} = y - \hat{y}$, that does not seem to be true.


2 Answers


First consider what the definition of independence of two vector-valued random variables $x$ and $y$ comes down to: the probability that $x$ is in some event $\mathcal A$ and $y$ is in some event $\mathcal B$ is the product of the chances of these events.
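
In symbols: $$\Pr(x \in \mathcal A,\ y \in \mathcal B) = \Pr(x \in \mathcal A)\,\Pr(y \in \mathcal B) \quad\text{for all (measurable) events } \mathcal A, \mathcal B.$$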

It helps to recast this in terms of conditional probabilities: independence means the chance that $y \in \mathcal B,$ conditional on $x\in\mathcal A,$ does not depend on $\mathcal A,$ no matter what $\mathcal B$ might be. (For technical reasons it is best to restrict this criterion to events where there is a nonzero chance $x\in\mathcal A.$)
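
That is, $$\Pr(y \in \mathcal B \mid x \in \mathcal A) = \Pr(y \in \mathcal B) \quad\text{whenever } \Pr(x \in \mathcal A) > 0.$$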

When $x$ and $y$ are $1$-vectors -- that is, they are ordinary numerical random variables -- we can check independence by drawing a sample $(x_i,y_i),i=1,2,\ldots, n$ from the joint distribution of $(x,y)$ and looking at the scatterplot of the sample. The foregoing characterization in terms of conditional distributions means that as you scan left to right across this scatterplot, the vertical distributions of the points do not change. Be careful, though: the visual impression of the distribution can change because in some of these vertical windows there won't be many points and you will have a hard time seeing the distribution. You need to pay attention to the relative densities of the points within their vertical strips.
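
One way to make that strip-by-strip comparison concrete (a sketch with freshly simulated data, not the samples used for the figures below) is to cut $x$ into vertical bins and compare the conditional quartiles of $y$ across bins; under independence they should be roughly constant:

set.seed(17)
x <- rnorm(5000)
y <- rnorm(5000)  # independent of x by construction
strips <- cut(x, quantile(x, 0:5/5), include.lowest = TRUE)      # five vertical strips
sapply(split(y, strips), quantile, probs = c(0.25, 0.50, 0.75))  # per-strip quartiles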

The archetypal example of independence is an uncorrelated bivariate Normal variable $(x,y).$ Here is a scatterplot of such a variable.

Figure 1: $\hat e_1$ against $\hat y_1$, Normal errors.

(The axis labels will be explained below.)

I have drawn it with an aspect ratio that gives the $x$ values the same amount of spread (horizontally) as the $y$ values (vertically). The lack of correlation is evidenced by the relatively even, circular cloud. In any narrow vertical strip, the relative numbers of points are given by the same Normal distribution -- same mean, same variance -- regardless of where the strip might be located.

Independence of longer vectors $x$ and $y$ implies independence of any (measurable) functions of them, say $f(x)$ and $g(y).$ Taking $f$ to be the function $\pi_i$ giving the value of coordinate $i$ and $g$ to be the function $\pi_j$ giving the value of coordinate $j$ ("projection functions"), we may look for lack of independence by reviewing a scatterplot of $(\pi_i(x), \pi_j(y)).$ The logic is this: when the scatterplot reveals lack of independence, that implies $(x,y)$ cannot be independent, either. When all such scatterplots suggest independence, that does not demonstrate independence of $(x,y)$ (but it does suggest it).

The first figure was drawn in this fashion. The two vectors involved are $(\hat y, \hat e)$ for an ordinary least squares regression with the model $y = \varepsilon$ using four observations with a single explanatory variable set to the values $1,2,3,$ and $4.$ The errors $\varepsilon$ were iid standard Normal. That figure plots the first predicted value $\hat y_1$ and the first residual $\hat e_1$ for $2000$ random instances of this model. To show that its appearance is no fluke, here is the plot of $\hat e_4$ against $\hat y_3$ from the same sample:

Figure 2: $\hat e_4$ against $\hat y_3$, Normal errors.

The theory implies all such scatterplots look like these nice even circular clouds when (a) the errors are iid Normal and (b) the model is fit using ordinary least squares.

On the other hand, after repeating this exercise but taking the error distribution to be iid $\Gamma(1)-1$ (a shifted Exponential with mean $0$), I obtained this version of the first figure:

Figure 3: $\hat e_1$ against $\hat y_1$, shifted Gamma errors.

This is an example of obvious lack of independence. For instance, when $\hat y_1$ is close to $-2$ (at the left of the figure), the residual $\hat e_1$ tends to be close to $1;$ but when $\hat y_1$ is much larger than $-2,$ the distribution of the residual $\hat e_1$ is much more spread out. The analog of the second figure, comparing the fourth residual to the third predicted value, also exhibits obvious lack of independence:

Figure 4: $\hat e_4$ against $\hat y_3$, shifted Gamma errors.

I wish to emphasize that the explanatory variables, the least squares parameters (intercept and slope), and the fitting method (least squares) are the same for these two models: the only thing that changed between the first two and second two figures was the shape of the error distributions from Normal to a shifted Gamma. (See the code below.) That change alone destroyed the independence between the predictions $\hat y$ and residuals $\hat e.$

After studying such scatterplots for the second model, given any of the predicted values $\hat y_i,$ you would be able to make predictions about any of the residuals $\hat e_j$ and they would be more accurate, on average, than not knowing $\hat y_i$ (a quick numerical check of this appears after the code below). For the first (Normal) model, though, information about any (or even all!) of the predicted values $\hat y_i$ by themselves would not give you useful information about any of the residuals, and vice versa: the residuals are not informative about the predicted values associated with them.


If you would like to create figures like these, here is the R code that produced them.

n <- 4          # Amount of data
x <- seq.int(n) # Explanatory variable values
y.0 <- 0*x/n    # The model

k <- 1          # Gamma parameter (if used)
method <- "Normal"
# method <- paste0("Gamma(", k, ")")
#
# Sample the joint distribution.
#
n.sim <- 2000   # Sample size from the joint distribution of (explanatory, observed)
sim <- replicate(n.sim, {
  if (method != "Normal") {
    y <- y.0 + rgamma(length(y.0), k, k) - 1
  } else {
    y <- y.0 + rnorm(length(y.0))
  }
  fit <- lm(y ~ x )
  c(predict(fit), residuals(fit))
})
sim <- array(sim, c(n, 2, n.sim)) # Each `sim[,,k]` has 2 columns of predictions and residuals
#
# Make a scatterplot of predictions and residuals from this sample.
#
i <- 3 # Component of the prediction vector to plot
j <- 4 # Component of the residual vector to plot
plot(sim[i,1,], sim[j,2,], las=2, pch=21, bg="#00000010", col="#00000020",
     main=paste(method, "Errors"), cex.main=1.1,
     xlab = bquote(hat(y)[.(i)]), ylab = "") 
mtext(bquote(hat(e)[.(j)]), side=2, line=2.5, las=2)
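
As a follow-up check of the claim about predicting residuals from predicted values, here is a short sketch (reusing the `sim` array built above; the quartile binning is my own choice) that groups $\hat y_1$ into quartile bins and summarizes $\hat e_1$ within each bin. With Normal errors the per-bin mean and spread barely move; with the shifted Gamma errors they change markedly:

#
# Conditional summaries of one residual given one predicted value.
#
i <- 1; j <- 1
groups <- cut(sim[i,1,], quantile(sim[i,1,], 0:4/4), include.lowest = TRUE)
t(sapply(split(sim[j,2,], groups), function(e) c(mean = mean(e), sd = sd(e))))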

Geometrically, it means the error is orthogonal to the prediction.

Roughly, OLS finds a point in the column space of $X$ which is closest to $y$ (assuming $y$ is not already in the column space). In the included picture, the vector in the grey plane is the prediction, and the vector outside the plane (connected to the prediction via the dotted line) is the data.

The dotted line is the error. It is quite literally $y - \hat{y}$ and you'll notice that the error is orthogonal to the prediction. Independence in this case manifests as orthogonality.
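
To make the orthogonality explicit (this uses only the least squares geometry, not Normality): since $H$ is symmetric and idempotent, $$\hat{y}^T \hat{e} = (Hy)^T (I-H)y = y^T(H - H^2)y = 0,$$ so the prediction and the error are always perpendicular.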

Figure: the data vector $y$, its projection $\hat y$ onto the column space of $X$ (the grey plane), and the error $y - \hat y$ (dotted line).

  • The same picture and the same explanation of orthogonality apply even when $y$ is not multivariate Normal--nothing about the OLS geometry changes--but then the conclusion would be incorrect, wouldn't it? At the very least, we have to conclude lack of independence would still "manifest as orthogonality." Thus, this explanation doesn't seem to capture what independence really means in this context. The point of Normality is that the random variation of $y$ can be decomposed into independent random variations in the projection $\hat y$ and the residual $y-\hat y.$ – whuber, Mar 16, 2021
