First consider what the definition of independence of two vector-valued random variables $x$ and $y$ comes down to: the probability that $x$ is in some event $\mathcal A$ and $y$ is in some event $\mathcal B$ is the product of the chances of these events.
It helps to recast this in terms of conditional probabilities: independence means the chance that $y \in \mathcal B,$ conditional on $x\in\mathcal A,$ does not depend on $\mathcal A,$ no matter what $\mathcal B$ might be. (For technical reasons it is best to restrict this criterion to events where there is a nonzero chance $x\in\mathcal A.$)
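To make the product rule concrete, here is a small numerical check (my own illustration, not part of the answer above): for an independent pair, the empirical frequency of the joint event should match the product of the marginal frequencies. The particular events $\mathcal A$ and $\mathcal B$ below are arbitrary choices.

```r
# Empirically check the product rule for an independent pair (x, y).
# The events A = {x > 1} and B = {|y| < 1/2} are arbitrary choices.
set.seed(17)
n <- 1e5
x <- rnorm(n)
y <- rnorm(n)
A <- x > 1
B <- abs(y) < 1/2
c(joint = mean(A & B), product = mean(A) * mean(B))
```

The two numbers agree up to sampling error; for a dependent pair they generally would not.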
When $x$ and $y$ are $1$-vectors -- that is, they are ordinary numerical random variables -- we can check independence by drawing a sample $(x_i,y_i),i=1,2,\ldots, n$ from the joint distribution of $(x,y)$ and looking at the scatterplot of the sample. The foregoing characterization in terms of conditional distributions means that as you scan left to right across this scatterplot, the vertical distributions of the points do not change. Be careful, though: the visual impression can be misleading, because some vertical windows will contain few points, making their distributions hard to discern. You need to pay attention to the relative densities of the points within their vertical strips.
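The strip-by-strip comparison can also be done numerically (this sketch is my own illustration): cut the $x$ axis into strips and summarize the vertical distribution of $y$ within each.

```r
# Compare the vertical distribution of y across strips of x.  When x and y
# are independent, the strip means and SDs agree up to sampling error.
set.seed(17)
x <- rnorm(2000)
y <- rnorm(2000)                          # Independent of x
strip <- cut(x, quantile(x, 0:4/4), include.lowest = TRUE)
rbind(Mean = tapply(y, strip, mean), SD = tapply(y, strip, sd))
```

For a dependent pair, such a table typically shows means or spreads drifting across the strips, mirroring what the eye picks up in the scatterplot.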
The archetypal example of independence is an uncorrelated bivariate Normal variable $(x,y).$ Here is a scatterplot of such a variable.
![Figure 1](https://cdn.statically.io/img/i.sstatic.net/OULym.png)
(The axis labels will be explained below.)
I have drawn it with an aspect ratio that gives the $x$ values the same amount of spread (horizontally) as the $y$ values (vertically). The lack of correlation is evidenced by the relatively even, circular cloud. In any narrow vertical strip, the relative numbers of points are given by the same Normal distribution -- same mean, same variance -- regardless of where the strip might be located.
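A cloud like this can be sketched as follows (a generic reconstruction, not the code that produced the figure above); setting `asp = 1` enforces the equal horizontal and vertical spread just described.

```r
# Draw an uncorrelated bivariate Normal cloud with equal spread in both
# directions.  (A generic sketch, not the code behind the figure.)
set.seed(17)
z <- matrix(rnorm(2 * 2000), ncol = 2)    # 2000 draws of an uncorrelated pair
plot(z, asp = 1, pch = 21, bg = "#00000010", col = "#00000020",
     xlab = expression(x), ylab = expression(y))
```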
Independence of longer vectors $x$ and $y$ implies independence of any (measurable) functions of them, say $f(x)$ and $g(y).$ Taking $f$ to be the function $\pi_i$ giving the value of coordinate $i$ and $g$ to be the function $\pi_j$ giving the value of coordinate $j$ ("projection functions"), we may look for lack of independence by reviewing a scatterplot of $(\pi_i(x), \pi_j(y)).$ The logic is this: when the scatterplot reveals lack of independence, that implies $(x,y)$ cannot be independent, either. When all such scatterplots suggest independence, that does not demonstrate independence of $(x,y)$ (but it does suggest it).
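In code, this scan over coordinate pairs might look like the following (a hypothetical helper of my own, not from the original): given samples of the two vectors as matrix rows, plot every coordinate of one against every coordinate of the other.

```r
# Plot pi_i(x) against pi_j(y) for every coordinate pair (i, j).
# `x` and `y` are matrices: one row per observation, one column per coordinate.
scan.projections <- function(x, y) {
  op <- par(mfrow = c(ncol(x), ncol(y)), mar = c(4, 4, 1, 1))
  for (i in seq_len(ncol(x)))
    for (j in seq_len(ncol(y)))
      plot(x[, i], y[, j], pch = 21, bg = "#00000010", col = "#00000020",
           xlab = bquote(x[.(i)]), ylab = bquote(y[.(j)]))
  par(op)
}
# Example: two independent 2-vectors give four featureless clouds.
set.seed(17)
scan.projections(matrix(rnorm(400), ncol = 2), matrix(rnorm(400), ncol = 2))
```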
The first figure was drawn in this fashion. The two vectors involved are $(\hat y, \hat e)$ for an ordinary least squares regression with the model $y = \varepsilon$ using four observations with a single explanatory variable set to the values $1,2,3,$ and $4.$ The errors $\varepsilon$ were iid standard Normal. That figure plots the first predicted value $\hat y_1$ and the first residual $\hat e_1$ for $2000$ random instances of this model. To show that its appearance is no fluke, here is the plot of $\hat e_4$ against $\hat y_3$ from the same sample:
![Figure 2](https://cdn.statically.io/img/i.sstatic.net/s5RPF.png)
The theory implies all such scatterplots look like these nice even circular clouds when (a) the errors are iid Normal and (b) the model is fit using ordinary least squares.
On the other hand, after repeating this exercise but taking the error distribution to be iid $\Gamma(1)-1$ (a shifted Exponential with mean $0$), I obtained this version of the first figure:
![Figure 3](https://cdn.statically.io/img/i.sstatic.net/Im7Om.png)
This is an example of obvious lack of independence. For instance, when $\hat y_1$ is close to $-2$ (at the left of the figure), the residual $\hat e_1$ tends to be close to $1;$ but when $\hat y_1$ is much larger than $-2,$ the distribution of the residual $\hat e_1$ is much more spread out. The analog of the second figure, comparing the fourth residual to the third predicted value, also exhibits obvious lack of independence:
![Figure 4](https://cdn.statically.io/img/i.sstatic.net/4Ptea.png)
I wish to emphasize that the explanatory variables, the model parameters (intercept and slope), and the fitting method (least squares) are the same for these two models: the only thing that changed between the first pair of figures and the second was the shape of the error distribution, from Normal to a shifted Gamma. (See the code below.) That change alone destroyed the independence between the predictions $\hat y$ and residuals $\hat e.$
After studying such scatterplots for the second model, given any of the predicted values $\hat y_i,$ you could make predictions about any of the residuals $\hat e_j$ that would be more accurate, on average, than predictions made without knowing $\hat y_i.$ For the first (Normal) model, though, knowing any (or even all!) of the predicted values $\hat y_i$ would give you no useful information about any of the residuals, and vice versa: the residuals are not informative about the predicted values associated with them.
If you would like to create figures like these, here is the R
code that produced them.
```r
n <- 4                  # Amount of data
x <- seq.int(n)         # Explanatory variable values
y.0 <- 0*x/n            # The model
k <- 1                  # Gamma parameter (if used)
method <- "Normal"
# method <- paste0("Gamma(", k, ")")
#
# Sample the joint distribution.
#
n.sim <- 2000           # Sample size from the joint distribution of (explanatory, observed)
sim <- replicate(n.sim, {
  if (method != "Normal") {
    y <- y.0 + rgamma(length(y.0), k, k) - 1
  } else {
    y <- y.0 + rnorm(length(y.0))
  }
  fit <- lm(y ~ x)
  c(predict(fit), residuals(fit))
})
sim <- array(sim, c(n, 2, n.sim))  # Each `sim[,,k]` has 2 columns of predictions and residuals
#
# Make a scatterplot of predictions and residuals from this sample.
#
i <- 3                  # Component of the prediction vector to plot
j <- 4                  # Component of the residual vector to plot
plot(sim[i,1,], sim[j,2,], las=2, pch=21, bg="#00000010", col="#00000020",
     main=paste(method, "Errors"), cex.main=1.1,
     xlab = bquote(hat(y)[.(i)]), ylab = "")
mtext(bquote(hat(e)[.(j)]), side=2, line=2.5, las=2)
```