$\begingroup$

In Probabilistic Machine Learning (Murphy, 2022, p. 8) I'm stuck in this part:

1.2.1.6 Maximum likelihood estimation

When fitting probabilistic models, it is common to use the negative log probability as our loss function: $$ \ell(y, f(x; \theta )) = -\log p(y|f(x; \theta)) \tag{1.13} $$ The reasons for this are explained in Section 5.1.6.1, but the intuition is that a good model (with low loss) is one that assigns a high probability to the true output $y$ for each corresponding input $x$.

$\DeclareMathOperator*{\argmin}{arg\,min}$ I've seen MLE in the form $\argmin_\theta -\log p(x; \theta)$ and it is intuitive in the sense that I'm trying to find the parameter $\theta$ such that the likelihood of observing the data $x$ is the highest. It is not hard to visualize that. For example, we could just think of normally distributed points on the $x$ axis and see that a normal distribution would 'fit' the data quite well.
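To make that visualization concrete, here is a small sketch of my own (not from the book, and the data values are made up): for i.i.d. points assumed Gaussian, minimizing the negative log-likelihood over the mean and variance has a closed-form solution, namely the sample mean and the (biased) sample variance.

```python
import math

# Hypothetical data, assumed drawn from a Gaussian.
data = [1.2, 0.8, 1.1, 0.9, 1.0]

# Closed-form MLE for a Gaussian: sample mean and (biased) sample variance.
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)

def neg_log_likelihood(mu, var):
    """-log p(data; mu, var) for an i.i.d. Gaussian model."""
    return sum(
        0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
        for x in data
    )

# Any other choice of parameters gives a higher negative log-likelihood.
assert neg_log_likelihood(mu_hat, var_hat) < neg_log_likelihood(0.0, 1.0)
```
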

Without getting into information theory, I would appreciate the answer for the following questions:

But what about conditional distributions? What is the intuition behind finding the parameter $\theta$ that makes the true output $y$ most probable given the model's prediction $f(x;\theta)$? What guarantees that $y$ has anything to do with the predicted $\hat y$? How does $y = \hat y$ minimize this function? And finally, can someone show how one would calculate this for, say, $y = [0, 1, 0]$ (a one-hot vector representing class 2) and $f(x;\theta) = [0.15, 0.85, 0]$?

$\endgroup$
  • $\begingroup$ With MLE you assume that the data are drawn from a distribution $p$ that is defined by some parameters $\theta$ and then you try to estimate $\theta$ by minimizing $-\log p$. With probabilistic models you don't make any assumptions about $p$; instead you use some input data $x$ and a neural network characterized by parameters $\theta$ to derive the probability of $x$ belonging to some discrete class. The label in this case is $y=(0,1,0)$; the model returns $(0.15,0.85,0)$. So $p(y|f(x;\theta))=0.85$. $\endgroup$
    – Ted Black
    Commented Apr 9 at 9:17
  • $\begingroup$ So is it just notation? I mean, we are not using Bayes rule or directly calculating the conditional probability (P(AUB)/P(B)). How can one simply assume that the conditional probability of y given some prediction is just the “index” of y at the prediction? What if y was not a degenerate distribution but something like (0.10, 0.80, 0.10) $\endgroup$ Commented Apr 9 at 17:46
  • $\begingroup$ The neural network generates some vector $(\pi_1,\ldots,\pi_n)$ for $n$ classes; so the probability of $x$ belonging to class $i$ is $\pi_i$. A model that tends to mislabel will assign low probability to the correct label which means $-\log\pi_i$ will be a large positive number and this is a property that we want for loss functions in general. $\endgroup$
    – Ted Black
    Commented Apr 9 at 17:55
  • $\begingroup$ Sorry, I meant P(A and B) not P(AUB) $\endgroup$ Commented Apr 9 at 18:13
  • $\begingroup$ Yes, Ted. Intuitively that makes sense, but why is it a conditional probability if it is not calculated by any conditional probability formula? $\endgroup$ Commented Apr 9 at 18:32

1 Answer

$\begingroup$

What guarantees that $y$ has anything to do with the predicted $\hat y$?

This is a great question. In general, you don't have guarantees. Consider, for example, the case where you are optimizing over all measurable functions: you would choose a function that maximizes the likelihood (and overfits). If your set of models is constrained (e.g. regularized) appropriately, then under some conditions the selected model will predict $y$ well.

How does $y=\hat y$ minimize this function?

I think there is some confusion here; the notation used in the book is unclear (at least to me, after just skimming a few pages). The notation $p(y|f(x;\theta))$ represents the probability of the label being equal to $y$ given the value of $f(x;\theta)$. I would look closely at the examples on the preceding pages to get an idea of what the notation is meant to represent. In any case, I wouldn't get too hung up on this (it is still page 8 :) ); the notation is not especially formal or explicit here.

For intuition, consider this case: if $\theta^*$ is the optimal choice and gives perfect prediction accuracy, then $f(x;\theta^*)$ is always equal to $y$. So $p(y|f(x;\theta^*))$ is just $p(y|y)$, which is of course 1.
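As for the numeric example in the question, here is a sketch under my reading of the notation (that $p(y|f(x;\theta))$ is simply the probability the model assigns to the true class): with a one-hot $y$, the negative log probability reduces to $-\log$ of the predicted probability at the true class index.

```python
import math

y = [0, 1, 0]            # true label: class 2, as a one-hot vector
f = [0.15, 0.85, 0.0]    # model's predicted distribution over the classes

# With one-hot y, -log p(y | f(x; theta)) picks out the entry where y == 1.
# (The `if t > 0` filter skips the zero entries, avoiding log(0).)
loss = -sum(t * math.log(p) for t, p in zip(y, f) if t > 0)
print(loss)  # equals -log(0.85), roughly 0.1625
```

Note that a perfect prediction, $f(x;\theta) = [0, 1, 0]$, would give a loss of $-\log 1 = 0$, the minimum possible value.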

$\endgroup$
