
This is a statistics/probability question formulated in the context of machine learning (problem 6.17 in Bishop's 'Deep Learning' book). We are modelling the conditional distribution $p(\mathbf{t}|\mathbf{x})$ (both $\mathbf{x}$ and $\mathbf{t}$ are vectors) using a mixture of Gaussians as follows: $$ p(\mathbf{t}|\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k(\mathbf{x}),\sigma_k^2(\mathbf{x})), $$ where $\mathcal{N}$ denotes the pdf of the normal distribution, and the mixing coefficients $\pi_k(\mathbf{x})$, the means $\boldsymbol{\mu}_k(\mathbf{x})$, and the variances $\sigma_k^2(\mathbf{x})$ are governed by the outputs of our neural network, which takes $\mathbf{x}$ as its input.
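To make the parameterisation concrete, here is a minimal NumPy sketch of how I picture the network head working: the raw outputs are split into mixing-coefficient logits, means, and log standard deviations, and the mixture density is then evaluated directly from the formula above. The array layout, the isotropic-covariance assumption, and all of the names below are my own illustration, not anything prescribed by the book.

```python
import numpy as np

def mixture_params(raw, K, D):
    """Split raw network outputs into (pi, mu, sigma).

    raw is a 1-D array of length K + K*D + K, interpreted as
    [pi logits | component means | log standard deviations].
    """
    logits = raw[:K]
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                        # softmax: mixing coefficients sum to 1
    mu = raw[K:K + K * D].reshape(K, D)   # unconstrained component means
    sigma = np.exp(raw[K + K * D:])       # exp keeps the scale parameters positive
    return pi, mu, sigma

def gaussian_pdf(t, mu, sigma):
    """Isotropic D-dimensional Gaussian density N(t | mu, sigma^2 I)."""
    D = t.shape[0]
    return (2.0 * np.pi * sigma**2) ** (-D / 2.0) * \
        np.exp(-np.sum((t - mu) ** 2) / (2.0 * sigma**2))

def conditional_density(raw, t, K):
    """p(t|x) = sum_k pi_k(x) N(t | mu_k(x), sigma_k(x)^2 I)."""
    pi, mu, sigma = mixture_params(raw, K, t.shape[0])
    return sum(pi[k] * gaussian_pdf(t, mu[k], sigma[k]) for k in range(K))
```

For instance, with `K = 2` and `D = 3`, `conditional_density(np.random.randn(2 + 2*3 + 2), np.zeros(3), K=2)` returns a positive density value.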

Now, for training vectors $\mathbf{t}_n$, $\mathbf{x}_n$, we introduce the following variables:

$$ \gamma_{nk} = \gamma_k(\mathbf{t}_n|\mathbf{x}_n) = \frac{\pi_k(\mathbf{x}_n)\mathcal{N}_{nk}}{\sum_{l=1}^K\pi_l(\mathbf{x}_n)\mathcal{N}_{nl}}, $$

where $\mathcal{N}_{nk}$ denotes $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k(\mathbf{x}_n),\sigma_k^2(\mathbf{x}_n))$.

The problem asks us to show that the variables $\gamma_{nk}$ can be viewed as the posterior probabilities $p(k|\mathbf{t})$ for the components of the mixture distribution $p(\mathbf{t}|\mathbf{x})$ in which the mixing coefficients $\pi_k(\mathbf{x})$ are viewed as $\mathbf{x}$-dependent prior probabilities $p(k)$.

Now, it seems to me that you could partition over the components and, by the law of total probability, write $p(\mathbf{t}|\mathbf{x})$ as the sum over $k$ of the probability of choosing the $k$th component (i.e. $\pi_k(\mathbf{x})$) times the conditional density of $\mathbf{t}$ given that the $k$th Gaussian was chosen. I am not sure, however, how to write everything down so that it is rigorous and clear. I think some of my confusion stems from the fact that the authors of the textbook are often quite loosey-goosey with their math, which leads to imprecise statements. I would appreciate it if someone could write a comprehensive answer, correcting the notation above (or my sketch below) if necessary.
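For concreteness, here is the construction I have in mind, written with a latent component label $z \in \{1,\dots,K\}$ (the symbol $z$ is my own notation, not the book's):

$$ p(z=k|\mathbf{x}) = \pi_k(\mathbf{x}), \qquad p(\mathbf{t}|z=k,\mathbf{x}) = \mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k(\mathbf{x}),\sigma_k^2(\mathbf{x})), $$

so that the law of total probability recovers the mixture,

$$ p(\mathbf{t}|\mathbf{x}) = \sum_{k=1}^{K} p(z=k|\mathbf{x})\,p(\mathbf{t}|z=k,\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k(\mathbf{x}),\sigma_k^2(\mathbf{x})), $$

and Bayes' theorem would then give

$$ p(z=k|\mathbf{t}_n,\mathbf{x}_n) = \frac{p(z=k|\mathbf{x}_n)\,p(\mathbf{t}_n|z=k,\mathbf{x}_n)}{\sum_{l=1}^{K} p(z=l|\mathbf{x}_n)\,p(\mathbf{t}_n|z=l,\mathbf{x}_n)} = \frac{\pi_k(\mathbf{x}_n)\,\mathcal{N}_{nk}}{\sum_{l=1}^{K} \pi_l(\mathbf{x}_n)\,\mathcal{N}_{nl}} = \gamma_{nk}, $$

but I am not sure whether this is the intended level of rigour or whether the latent-variable construction needs further justification.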

  • These $\gamma_{nk}$ are the responsibilities familiar from the Expectation Maximization algorithm; $\pi_k$ is the probability that a vector was generated from the distribution $N(\mu_k,\sigma_k)=N_k$. With standard maximum likelihood, if you observe $t_n$ the likelihood is $N(t_n|\mu,\sigma)$; but now the probability that you observed $t_n$ AND it was generated by distribution $N_k$ is $\pi_k N_{nk}$. So the numerator is a joint probability and the denominator is the probability that you observe $t_n$. Using Bayes' theorem we have $P(N_k, t_n)/P(t_n)=P(N_k \mid t_n)$, which is a posterior probability.
    – Ted Black
    Commented Feb 2 at 0:44
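A quick numerical illustration of the comment above, with hypothetical parameter values chosen only for the example (scalar $t$, two components): the responsibilities are the joint probabilities $\pi_k \mathcal{N}_{nk}$ renormalised by the marginal $\sum_l \pi_l \mathcal{N}_{nl}$.

```python
import numpy as np

# Hypothetical mixture parameters at a fixed input x_n (not from the book).
pi    = np.array([0.3, 0.7])     # mixing coefficients pi_k(x_n), i.e. the priors p(k)
mu    = np.array([-1.0, 2.0])    # component means mu_k(x_n)
sigma = np.array([0.5, 1.0])     # component standard deviations sigma_k(x_n)

t_n = 0.4                        # an observed scalar target

def normal_pdf(t, mu, sigma):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

joint    = pi * normal_pdf(t_n, mu, sigma)  # p(k) * p(t_n | k): joint probabilities
marginal = joint.sum()                      # p(t_n), by the law of total probability
gamma    = joint / marginal                 # posteriors p(k | t_n) = gamma_{nk}

print(gamma, gamma.sum())                   # the gamma_{nk} are nonnegative and sum to 1
```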
