In *Probabilistic Machine Learning* (Murphy, 2022, p. 8), I'm stuck on this part:
> **1.2.1.6 Maximum likelihood estimation**
>
> When fitting probabilistic models, it is common to use the negative log probability as our loss function:
> $$ l(y, f(x; \theta)) = -\log p(y \mid f(x; \theta)) \tag{1.13} $$
> The reasons for this are explained in Section 5.1.6.1, but the intuition is that a good model (with low loss) is one that assigns a high probability to the true output $y$ for each corresponding input $x$.
$\DeclareMathOperator*{\argmin}{arg\,min}$ I've seen MLE in the form $\argmin_\theta -\log p(x; \theta)$, and it is intuitive in the sense that I'm trying to find the parameter $\theta$ under which the observed data $x$ is most likely. That is not hard to visualize: for example, with normally distributed points on the $x$ axis, a normal distribution centered on them would 'fit' the data quite well.
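To pin down what I mean by 'fit' in this unconditional case, here is a minimal numpy sketch (the data and numbers are made up purely for illustration): the negative log likelihood of a Gaussian is lower at the sample mean and standard deviation than at a mismatched parameter choice.

```python
import numpy as np

# Toy data: the kind of normally distributed points on the x axis I have in mind.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)

def neg_log_lik(mu, sigma, x):
    """Negative log likelihood of x under Normal(mu, sigma^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

# The MLE for a Gaussian is the sample mean and (biased) standard deviation;
# the negative log likelihood is smaller there than at a mismatched choice.
print(neg_log_lik(x.mean(), x.std(), x))   # small: good fit
print(neg_log_lik(0.0, 1.0, x))            # larger: poor fit
```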
Without getting into information theory, I would appreciate answers to the following questions:
- What about conditional distributions? What is the intuition behind finding the parameter $\theta$ that makes the true output $y$ most probable under the predicted distribution $f(x;\theta)$?
- What guarantees that $y$ has anything to do with the predicted $\hat y$? How does $y = \hat y$ minimize this loss?
- Finally, how would one calculate this for, say, $y = [0, 1, 0]$ (a one-hot vector representing class 2) and $f(x;\theta) = [0.15, 0.85, 0]$? My tentative attempt is sketched below.
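For the last question, here is my tentative attempt, assuming $f(x;\theta)$ is read as the parameter vector of a categorical distribution over the three classes (please correct me if that is not what Equation 1.13 means):

```python
import numpy as np

y     = np.array([0.0, 1.0, 0.0])     # one-hot true label: class 2
y_hat = np.array([0.15, 0.85, 0.0])   # f(x; theta): predicted class probabilities

# Negative log likelihood of the true label under the predicted categorical
# distribution: -log p(y | f(x; theta)) = -sum_k y_k * log(y_hat_k).
# With a one-hot y this just picks out -log of the probability assigned to the
# true class, so I index instead of summing to avoid 0 * log(0).
loss = -np.log(y_hat[y.argmax()])
print(loss)   # -log(0.85) ≈ 0.163
```

If that reading is right, the loss for this example is just $-\log 0.85 \approx 0.163$, and it reaches $0$ only when the model puts probability $1$ on the true class.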