
This paper gives a somewhat gentle introduction to Bayesian inference: http://www.miketipping.com/papers/met-mlbayes.pdf

I got to Section 2.3 without much trouble but got stuck from that section onwards. It starts by presenting a probabilistic regression framework in which the likelihood of all the data is given as:

$$ p(t|x,w,\sigma^2) = \prod_{n}p\left(t_n|x_n,w,\sigma^2\right) $$ where $t_n=y(x_n;w)+\epsilon_n$ is the 'target' value. Next, given a set of parameters $w$ and a hyperparameter $\alpha$, the prior is given as: $$ p(w|\alpha)=\prod_{m}\left(\frac{\alpha}{2\pi}\right)^{1/2}\exp\left({-\frac{\alpha}{2}w_m^2}\right) $$
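If I read the noise term correctly, $\epsilon_n$ is zero-mean Gaussian with variance $\sigma^2$, so each factor in the product would be (my own unpacking, not a formula quoted verbatim from the paper): $$ p\left(t_n|x_n,w,\sigma^2\right) = \left(2\pi\sigma^2\right)^{-1/2}\exp\left(-\frac{\left\{t_n-y(x_n;w)\right\}^2}{2\sigma^2}\right) $$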

I can then compute the posterior $p\left(w|t,\alpha,\sigma^2\right)$. What I don't understand is the following:

  • In the first equation above, how should I interpret the product over the $N$ pairs of data $(t_n,x_n)$? Let's say I get two initial measurements from the real world: is $p\left(t|x,w,\sigma^2\right)$ supposed to give me a single real-valued probability? And how do I account for $w$, since it is not known yet?
  • As far as I understand it, $w$ is supposed to be a vector of size $M$ where $w_i$ contains the $i$th estimated value. Now, how can a prior for $w$ refer to its own vector elements if I don't know them yet? Shouldn't a prior be an independent distribution such as a Gaussian or a Beta? Also, shouldn't a prior be independent of hyperparameters?
  • Figure 4 on page 8 of the article shows samples drawn from the prior and from the posterior for an example using the $y=\sin(x)$ function with added Gaussian noise of variance 0.2. How could I plot something similar in, say, Octave/Matlab or R?

I don't have a strong background in statistics, so forgive me if this is too basic. Any help is appreciated.

Thanks in advance!

5 Comments
  • To answer your second question, the prior has $w$ as a variable because it is a function of $w$. It maps every possible value of $w$ to a probability density. Furthermore, it is a Gaussian. See footnote 3 in that paper... I think you're getting confused by the subtle distinction between a probability density and a likelihood function.
    – jerad
    Commented Dec 2, 2012 at 22:56
  • @jerad OK, would that answer the first question a little as well? Since $t_n$ and $x_n$ are known, is the first equation also a function of $w$? Thanks!
    – jokerbrb
    Commented Dec 3, 2012 at 9:57
  • The first equation is a distribution over $t$ conditional on some $x,w,\sigma^2$.
    – jerad
    Commented Dec 3, 2012 at 20:48
  • Thanks. This is part of my confusion. Take a look at this video lecture, for instance: videolectures.net/mlss09uk_bishop_ibi, and jump to minute 10:33 (Bayesian inference). There he says that $p(\hat{x}|\theta)$ is a function over $\theta$ given the new observed values $\hat{x}$.
    – jokerbrb
    Commented Dec 4, 2012 at 15:06
  • Yes, well, as explained in the Wikipedia article on likelihood functions, it is merely a matter of perspective. I think any time you see $p(\cdot)$ you should try to visualize a plot with probability on the y-axis and parameters on the x-axis. You can either evaluate that function for a parameter value and return a probability, or you can view it as a function of the variables, i.e. the whole plot.
    – jerad
    Commented Dec 4, 2012 at 15:22

1 Answer


First question:

The product is the joint probability of the sample, often also called the likelihood (see the footnote on page 5). Yes, it gives you a single number once $w$ and $\sigma^2$ are fixed: it is simply the individual probabilities multiplied together, since the observations are assumed independent. This equation is an intermediate step. From there on, they drop $x$ from the notation and end up with equation (11), where this likelihood is combined with a prior and a normalizing constant. That is the essence of Bayesian inference: we don't know the parameter $w$, but we know that the data depend on it, so Bayes' theorem lets us turn a prior distribution over $w$ into a posterior distribution once the data are observed.
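As a minimal R sketch (the two data points, the linear model $y(x;w)=w_1+w_2x$ and the value of $\sigma^2$ below are made up purely for illustration, not taken from the paper): with $w$ fixed, the product collapses to a single number; with the data held fixed, the same expression can be read as a function of $w$.

```r
# Minimal sketch: two assumed data points, an assumed linear model
# y(x; w) = w1 + w2*x, and an assumed noise variance sigma^2.
xn <- c(0.2, 0.7)
tn <- c(0.35, 0.60)
sigma2 <- 0.2

# Joint probability (likelihood) of the whole sample for ONE fixed w:
# the per-point Gaussian densities multiplied together -> a single number.
lik <- function(w) prod(dnorm(tn, mean = w[1] + w[2] * xn, sd = sqrt(sigma2)))

lik(c(0.1, 0.8))   # one real value for this particular w

# Data fixed, w varying: the same expression read as a function of w.
w2_grid <- seq(-2, 2, length.out = 100)
plot(w2_grid, sapply(w2_grid, function(w2) lik(c(0.1, w2))),
     type = "l", xlab = "w2", ylab = "likelihood")
```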

Second question:

The vector $\mathbf{w}=(w_1, w_2, \dots, w_M)$ does not contain estimates. It contains the random variables $w_1, w_2, \dots, w_M$, i.e. the parameters. I'm not sure where you see them referencing themselves: the prior is just a function of $\mathbf{w}$, assigning a density to any candidate value you plug in.
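To make that concrete, here is a minimal R sketch (the values of $\alpha$ and $M$ are assumed for illustration): the prior is just a product of independent zero-mean Gaussians with precision $\alpha$, so it can be evaluated at, or sampled from for, any candidate $\mathbf{w}$ without ever knowing the "true" values.

```r
# Minimal sketch: the prior p(w | alpha) as a product of independent
# zero-mean Gaussians with (assumed) precision alpha.
alpha <- 2
prior <- function(w) prod(dnorm(w, mean = 0, sd = sqrt(1 / alpha)))

prior(c(0.1, -0.5, 1.2))   # density assigned to this candidate w

# Drawing a w from the prior is equally direct:
M <- 3
w_draw <- rnorm(M, mean = 0, sd = sqrt(1 / alpha))
```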
