
When I was reading this lecture, the posterior predictive distribution was introduced as follows:

$$ P(Y \mid D, X)=\int_{\mathbf{w}} P(Y, \mathbf{w} \mid D, X) d \mathbf{w}=\int_{\mathbf{w}} P(Y \mid \mathbf{w}, D, X) P(\mathbf{w} \mid D) d \mathbf{w} $$
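To unpack what this integral means operationally, here is a minimal Monte Carlo sketch (my own illustration, not from the lecture), assuming a Bayesian linear model $y = \mathbf{w}^T \mathbf{x} + \varepsilon$ whose posterior over $\mathbf{w}$ is Gaussian; the posterior parameters and noise level below are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: Bayesian linear regression with a Gaussian posterior over w
# that we pretend has already been computed from the data D.
post_mean = np.array([1.0, -0.5])   # posterior mean of w (made up)
post_cov = 0.1 * np.eye(2)          # posterior covariance of w (made up)
noise_var = 0.25                    # observation noise variance (made up)

x_test = np.array([0.8, 1.2])       # a test input

# Approximate P(Y | D, x) = ∫ P(Y | w, x) P(w | D) dw by drawing w from the
# posterior and then drawing Y from the conditional model for each sample.
w_samples = rng.multivariate_normal(post_mean, post_cov, size=10_000)
y_samples = w_samples @ x_test + rng.normal(0.0, np.sqrt(noise_var), size=10_000)

print("predictive mean ≈", y_samples.mean())
print("predictive variance ≈", y_samples.var())
```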

In general this integral is hard to compute. However, one can instead assume that, before we observe the training labels, the labels are jointly drawn from a zero-mean Gaussian prior:

$$ \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \\ y_t \end{array}\right] \sim \mathcal{N}(0, \Sigma) $$

Notice that $y_1,\dots,y_n$ are the training labels and $y_t$ is the test label (written $y_*$ below). So under this assumption, I think the distribution the test data are drawn from can differ from that of the training data (as long as everything is jointly Gaussian).
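As I understand it, in GP regression this joint covariance $\Sigma$ is built from a kernel on the inputs, so that, partitioned into training and test blocks (with observation noise $\sigma^2 I$ on the training block), it has the structure

$$ \Sigma=\left[\begin{array}{cc} K+\sigma^{2} I & K_{*} \\ K_{*}^{T} & K_{**} \end{array}\right], $$

which is how $\Sigma$ connects to the kernel matrices appearing in the final formulas below.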

Then, by utilizing the Gaussian conditioning property:

$$ \text{Let the Gaussian random variable } y=\left[\begin{array}{l} y_A \\ y_B \end{array}\right] \text{ have mean } \mu=\left[\begin{array}{l} \mu_A \\ \mu_B \end{array}\right] \text{ and covariance matrix } \Sigma=\left[\begin{array}{ll} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right]. \text{ Then} $$

$$ y_A \mid y_B \sim \mathcal{N}\left(\mu_A+\Sigma_{A B} \Sigma_{B B}^{-1}\left(y_B-\mu_B\right),\ \Sigma_{A A}-\Sigma_{A B} \Sigma_{B B}^{-1} \Sigma_{B A}\right) $$
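To convince myself of this rule, here is a small numerical check (with made-up covariance entries), computing the conditional mean and covariance directly from the formula:

```python
import numpy as np

# A toy joint Gaussian over (y_A, y_B); all entries are made up.
mu = np.zeros(3)                           # joint mean
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])        # joint covariance
A, B = [0], [1, 2]                         # y_A is the first coordinate, y_B the rest

y_B = np.array([0.4, -0.2])                # an observed value of y_B

S_AA = Sigma[np.ix_(A, A)]
S_AB = Sigma[np.ix_(A, B)]
S_BB = Sigma[np.ix_(B, B)]

# Conditional mean and covariance, exactly as in the formula above.
cond_mean = mu[A] + S_AB @ np.linalg.solve(S_BB, y_B - mu[B])
cond_cov = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)

print("E[y_A | y_B] =", cond_mean)
print("Var[y_A | y_B] =", cond_cov)
```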

Applying this property, we can find the predictive distribution $y_* \mid D, \mathbf{x}_* \sim \mathcal{N}\left(\mu_{y_* \mid D}, \Sigma_{y_* \mid D}\right)$, where

$$ \begin{aligned} \mu_{y_* \mid D} &= K_*^T\left(K+\sigma^2 I\right)^{-1} y, \\ \Sigma_{y_* \mid D} &= K_{**}-K_*^T\left(K+\sigma^2 I\right)^{-1} K_*, \end{aligned} $$

where the kernel matrices $K_*$, $K_{**}$, $K$ are functions of $\mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{x}_*$.
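Written out in code, these two formulas look as follows (a minimal sketch with an RBF kernel and toy 1-D data, both my own assumptions rather than the lecture's exact setup):

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    # Squared-exponential kernel; one common choice (assumed, not prescribed by the slides).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)

# Toy 1-D training data (made up) and a handful of test inputs.
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-3, 3, 5)[:, None]
noise_var = 0.01

K = rbf(X, X)               # K:    train vs. train (n x n)
K_star = rbf(X, X_star)     # K_*:  train vs. test  (n x m)
K_ss = rbf(X_star, X_star)  # K_**: test vs. test   (m x m)

# Posterior predictive mean and covariance, exactly the formulas above.
A = K + noise_var * np.eye(len(X))
mu_star = K_star.T @ np.linalg.solve(A, y)
Sigma_star = K_ss - K_star.T @ np.linalg.solve(A, K_star)

print("predictive means:", mu_star)
print("predictive variances:", np.diag(Sigma_star))
```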

My question is: I know distribution shift is a hard problem, and much classical machine learning theory relies on the assumption that the test data are drawn from the same distribution as the training data. Here, if I am willing to assume that the test data are Gaussian (but follow a different Gaussian than the training data), can I use this method to get predictions that are robust to distribution shift?

As I noted, $\Sigma$ is very important here, and in practice I don't think I can access it. In the slides, its entries are approximated by a kernel computed on the corresponding features, so the choice of kernel and the covariates of my training and test data seem very important. Intuitively, this makes me think the prediction for $y_*$ depends on how similar it is to the training points, where similarity is measured by a certain kernel on the covariates. This idea reminds me of matching methods based on features in general.
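That "weighted by similarity" reading can be made literal: the predictive mean $K_*^T\left(K+\sigma^2 I\right)^{-1} y$ is a linear combination of the training labels, with weights set by kernel similarity. A tiny self-contained sketch with made-up numbers:

```python
import numpy as np

# Two training points; all numbers below are made up for illustration.
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])        # train-train kernel matrix
k_star = np.array([0.9, 0.2])     # kernel similarity of the test point to each training point
y = np.array([1.0, -1.0])         # training labels
noise_var = 0.1

# The predictive mean k_*^T (K + sigma^2 I)^{-1} y, rewritten as weights @ y.
weights = np.linalg.solve(K + noise_var * np.eye(2), k_star)
print("weights on training labels:", weights)  # the more similar point gets the larger weight
print("predictive mean:", weights @ y)
```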

Any comments or thoughts are appreciated!
