
I am currently studying the textbook Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. Chapter 1 Introduction says the following:

[Figure 1.1 from the book] In this section we give graphical illustrations of how the second (Bayesian) method works on some simple regression and classification examples.

We first consider a simple 1-d regression problem, mapping from an input $x$ to an output $f(x)$. In Figure 1.1(a) we show a number of sample functions drawn at random from the prior distribution over functions specified by a particular Gaussian process which favours smooth functions. This prior is taken to represent our prior beliefs over the kinds of functions we expect to observe, before seeing any data. In the absence of knowledge to the contrary we have assumed that the average value over the sample functions at each $x$ is zero. Although the specific random functions drawn in Figure 1.1(a) do not have a mean of zero, the mean of $f(x)$ values for any fixed $x$ would become zero, independent of $x$ as we kept on drawing more functions. At any value of $x$ we can also characterize the variability of the sample functions by computing the variance at that point. The shaded region denotes twice the pointwise standard deviation; in this case we used a Gaussian process which specifies that the prior variance does not depend on $x$.

Suppose that we are then given a dataset $\mathcal{D} = \{(\mathbf{\mathrm{x}}_1,y_1),(\mathbf{\mathrm{x}}_2,y_2)\}$ consisting of two observations, and we wish now to only consider functions that pass through these two data points exactly. (It is also possible to give higher preference to functions that merely pass “close” to the datapoints.) This situation is illustrated in Figure 1.1(b). The dashed lines show sample functions which are consistent with $\mathcal{D}$, and the solid line depicts the mean value of such functions. Notice how the uncertainty is reduced close to the observations. The combination of the prior and the data leads to the posterior distribution over functions.

If more datapoints were added one would see the mean function adjust itself to pass through these points, and that the posterior uncertainty would reduce close to the observations. ...

I am confused by this part:

Notice how the uncertainty is reduced close to the observations.

What do the authors mean by "close to the observations"? I can see that the shaded region – twice the pointwise standard deviation – is narrowest at the points where all the sample functions take the same value as their mean, but it isn't totally clear to me what the authors are referring to.

  • They mean close to $x_1$ or $x_2$, don’t they? The uncertainty in the predicted $f(x_1 + e)$ is smaller for $e$ closer to $0$. How much smaller depends on the function variance hyperparameter, if I recall correctly.
    – Jonathan
    Commented Dec 26, 2020 at 7:29
  • @Jonathan Hmm, I'm not sure. It sounds like they're saying that figure 1.1(b) shows that "the uncertainty is reduced close to the observations", but this isn't clear to me. – Commented Dec 26, 2020 at 7:33

1 Answer


A function is a map $f: X \to Y$, and a Gaussian process learns to approximate such a function from data. The example says that you are given two points $\mathcal{D} = \{(\mathbf{\mathrm{x}}_1,y_1),(\mathbf{\mathrm{x}}_2,y_2)\}$, presumably located around $0.2$ and $0.55$, as can be guessed from the second plot, which shows the posterior predictive distribution. The uncertainty there drops close to zero, since we know exactly what the relation between $x$ and $y$ is at those points. If the learned approximation is to be correct, it needs to pass through those points, so the functions sampled from the Gaussian process (the distribution over functions) need to pass through them as well. There is no uncertainty about the value of $f(x)$ at those particular inputs. Moreover, since the functions are continuous, the outputs at nearby inputs must be similar, so the uncertainty also decreases close to the known points. If you are using a Gaussian process that assumes noise-free data, the uncertainty can go all the way down to zero at the observations, while with noisy data there would be some non-zero uncertainty around the datapoints.
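
To make this concrete, below is a minimal numerical sketch (my own code, not from the book) of noise-free GP regression with a squared-exponential kernel. The observation locations $0.2$ and $0.55$ are just the values guessed above, and the target values and length-scale are arbitrary illustrative choices; the only point is that the posterior standard deviation collapses to (numerically) zero at the observed inputs and grows as you move away from them.

```python
# Minimal sketch: posterior uncertainty of a noise-free GP shrinks to zero
# at the observed inputs. Observation locations 0.2 and 0.55 are guesses
# from the figure; targets and length-scale are arbitrary illustrations.
import numpy as np

def rbf_kernel(a, b, length_scale=0.1):
    """Squared-exponential covariance k(a, b) for 1-d inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Two noise-free observations, roughly as in Figure 1.1(b).
X_train = np.array([0.2, 0.55])
y_train = np.array([0.5, -0.4])          # arbitrary illustrative targets

# Test grid over the input range.
X_test = np.linspace(0.0, 1.0, 101)

# Standard GP regression equations (zero prior mean, no observation noise;
# a small jitter keeps the Cholesky factorisation numerically stable).
K = rbf_kernel(X_train, X_train) + 1e-10 * np.eye(len(X_train))
K_s = rbf_kernel(X_train, X_test)
K_ss = rbf_kernel(X_test, X_test)

L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
v = np.linalg.solve(L, K_s)

post_mean = K_s.T @ alpha                     # posterior mean
post_var = np.diag(K_ss) - np.sum(v ** 2, 0)  # pointwise posterior variance
post_std = np.sqrt(np.clip(post_var, 0.0, None))

# The standard deviation is (numerically) zero at the training inputs and
# grows as we move away from them: "the uncertainty is reduced close to
# the observations".
for x0 in X_train:
    i = np.argmin(np.abs(X_test - x0))
    print(f"std near x = {x0:.2f}: {post_std[i]:.4f}")
print(f"std far from data (x = 1.0): {post_std[-1]:.4f}")
```

With noisy observations you would add $\sigma_n^2 I$ to `K` instead of the tiny jitter, and the standard deviation at the training inputs would then bottom out around $\sigma_n$ rather than at zero.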

  • I'm having difficulty understanding your answer. "The uncertainty there goes close to zero" – where are you referring to? The uncertainty from $0.2$ to $0.55$ is given by the shaded region, no? So how can it be said that it goes close to zero? – Commented Dec 26, 2020 at 8:35
  • @ThePointer the shaded region, which shows the uncertainty, shrinks to zero width (in the y-axis direction) around those points.
    – Tim
    Commented Dec 26, 2020 at 8:40
  • You're referring to the point where all the curves intersect at around $x = 0.55$? – Commented Dec 26, 2020 at 8:48
  • @ThePointer yes.
    – Tim
    Commented Dec 26, 2020 at 8:52
  • Ahh, ok, I understand now. – Commented Dec 26, 2020 at 9:05
