I am currently studying the textbook Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. Chapter 1 Introduction says the following:
In this section we give graphical illustrations of how the second (Bayesian) method works on some simple regression and classification examples.
We first consider a simple 1-d regression problem, mapping from an input $x$ to an output $f(x)$. In Figure 1.1(a) we show a number of sample functions drawn at random from the prior distribution over functions specified by a particular Gaussian process which favours smooth functions. This prior is taken to represent our prior beliefs over the kinds of functions we expect to observe, before seeing any data. In the absence of knowledge to the contrary we have assumed that the average value over the sample functions at each $x$ is zero. Although the specific random functions drawn in Figure 1.1(a) do not have a mean of zero, the mean of the $f(x)$ values for any fixed $x$ would become zero, independent of $x$, as we kept on drawing more functions. At any value of $x$ we can also characterize the variability of the sample functions by computing the variance at that point. The shaded region denotes twice the pointwise standard deviation; in this case we used a Gaussian process which specifies that the prior variance does not depend on $x$.
Suppose that we are then given a dataset $\mathcal{D} = \{(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2)\}$ consisting of two observations, and we wish now to only consider functions that pass through these two data points exactly. (It is also possible to give higher preference to functions that merely pass “close” to the datapoints.) This situation is illustrated in Figure 1.1(b). The dashed lines show sample functions which are consistent with $\mathcal{D}$, and the solid line depicts the mean value of such functions. Notice how the uncertainty is reduced close to the observations. The combination of the prior and the data leads to the posterior distribution over functions.
If more datapoints were added one would see the mean function adjust itself to pass through these points, and that the posterior uncertainty would reduce close to the observations. ...
I am confused by this part:
Notice how the uncertainty is reduced close to the observations.
What do the authors mean by "close to the observations"? I can see that the shaded region (twice the pointwise standard deviation) is minimal at the points where all the sample functions take the same value as their mean, but it isn't totally clear to me what the authors are referring to.
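To make the pointwise standard deviation concrete, here is a minimal numerical sketch of the situation in Figure 1.1(b): a zero-mean GP with a squared-exponential kernel, conditioned exactly on two observations, evaluated at one test input near an observation and one far from both. (The specific input locations, observed values, and length-scale below are made up for illustration; they are not the ones used in the book's figure.)

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2)),
    # a smooth prior with constant prior variance 1 at every x.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# Two noise-free observations (hypothetical values, standing in for D)
X = np.array([-1.5, 1.0])
y = np.array([0.5, -0.8])

# Test inputs: one near the first observation, one far from both
Xs = np.array([-1.4, 4.0])

K = rbf(X, X) + 1e-9 * np.eye(2)   # small jitter for numerical stability
Ks = rbf(Xs, X)
Kss = rbf(Xs, Xs)

# Standard GP posterior for a zero-mean prior:
#   mean = K_* K^{-1} y,    cov = K_** - K_* K^{-1} K_*^T
Kinv = np.linalg.inv(K)
mean = Ks @ Kinv @ y
cov = Kss - Ks @ Kinv @ Ks.T
std = np.sqrt(np.maximum(np.diag(cov), 0.0))

print(std)  # posterior std: small near the observation, near 1 far away
```

Running this, the posterior standard deviation at $x_* = -1.4$ (a tenth of a length-scale from an observation) comes out close to zero, while at $x_* = 4.0$ it is essentially back at the prior value of 1. "Close to the observations" is exactly this: the shrinking of the pointwise posterior variance at inputs within roughly a length-scale of a data point.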