
Suppose I have a data set $\{(x^i, t^i)\}_{i = 1, \ldots, n}$ generated i.i.d., where the $t^i \in \{1, -1\}$ are binary targets.

We would like to run logistic regression, which is based on maximizing the joint conditional likelihood (conditional maximum likelihood estimation):

$$\theta^\star = \arg\max_\theta \thinspace p(t^1, \ldots, t^n| x^1, \ldots, x^n; \theta)$$

Using the i.i.d. assumption, we get

$$\theta^\star = \arg\max_\theta \thinspace \prod\limits_{i = 1}^n p(t^i| x^i; \theta)$$

Taking the log, we obtain $$\theta^\star = \arg\max_\theta \thinspace \sum\limits_{i = 1}^n \log p(t^i| x^i; \theta)$$
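For concreteness, here is a minimal numerical sketch of this estimator. The specific model form $p(t \mid x; \theta) = 1/(1 + e^{-t\,\theta^\top x})$, the synthetic data, and the use of `scipy.optimize` are my own illustrative choices, not part of the setup above:

```python
# Minimal sketch of conditional maximum likelihood for logistic regression,
# assuming p(t | x; theta) = sigmoid(t * theta . x) with targets t in {-1, +1}.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))                       # inputs x^1, ..., x^n
theta_true = np.array([2.0, -1.0])                # "true" parameter for the synthetic data
p_plus = 1.0 / (1.0 + np.exp(-X @ theta_true))    # P(t = +1 | x)
t = np.where(rng.random(n) < p_plus, 1, -1)       # targets t^1, ..., t^n

def neg_log_likelihood(theta):
    # -sum_i log p(t^i | x^i; theta) = sum_i log(1 + exp(-t^i * theta . x^i))
    return np.sum(np.logaddexp(0.0, -t * (X @ theta)))

theta_star = minimize(neg_log_likelihood, x0=np.zeros(d), method="BFGS").x
print(theta_star)  # close to theta_true for moderately large n
```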


My question is: why do we maximize the joint likelihood over $\theta$, when what we are actually interested in is maximizing the probability that, for every $i$, given $x^i$ we obtain $t^i$?

In other words, it seems we should solve the following problem instead:

$$ \theta^\star = \arg\max_\theta \thinspace p(t^i|x^i; \theta), \forall i = 1,\ldots, n$$

Why don't we solve this problem instead? Is it ill-posed?


3 Answers


If I understand your question correctly, you are asking why we look at the joint distribution and not each data point separately. The answer is that we want to find the best-fitting parameter for the whole dataset. A single point in the parameter space will generally not maximize the likelihood of every individual observation, but there is some point in the parameter space that gives the highest overall likelihood.

$\theta^\star = \arg\max_\theta \sum_{i=1}^{n} \log p(t^{(i)} \mid x^{(i)}; \theta)$ is the point, in the parameter space of the function class we are considering, that best fits the dataset. Note that maximizing the probability of the true labels conditioned on the input points, as here, is conditional (discriminative) maximum likelihood. The alternative is full (generative) maximum likelihood, where you estimate the parameter most likely to have generated the joint distribution of inputs and labels; the two give the same $\theta^\star$ whenever the marginal distribution of $x^{(i)}$ does not depend on $\theta$.
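As a rough numerical illustration of this point (my own toy example with a one-dimensional $\theta$ and the logistic model $p(t \mid x; \theta) = \sigma(t\,\theta x)$; none of the values or variable names come from the question):

```python
# Each observation alone "prefers" an extreme theta (its individual likelihood
# keeps increasing toward a grid edge), but the sum of log-likelihoods over the
# whole dataset is maximized at a single compromise value. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
t = np.where(rng.random(50) < 1.0 / (1.0 + np.exp(-1.5 * x)), 1, -1)

thetas = np.linspace(-10, 10, 2001)                    # bounded grid of candidate thetas
loglik = -np.logaddexp(0.0, -np.outer(t * x, thetas))  # log p(t^i | x^i; theta), shape (50, 2001)

per_point_best = thetas[np.argmax(loglik, axis=1)]     # maximizer for each point on its own
joint_best = thetas[np.argmax(loglik.sum(axis=0))]     # maximizer of the joint log-likelihood

print(per_point_best[:10])   # each is pushed to +10 or -10, depending on the point
print(joint_best)            # one compromise value, well inside the grid
```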

  • Yes, precisely. It seems almost intuitive that what we want is a model that takes an input and gives us the correct label *for that input* only, and that we do this for each data-target pair in the entire dataset. So from your answer, you are saying that the $\theta$ that is optimal for one pair may not work for another pair, and we need something that works for the entire dataset rather than for each individual data point. – Fraïssé, Jan 28, 2020
  • But I still wonder if the $\theta^\star$ that maximizes the joint conditional likelihood is the correct one to use. – Fraïssé, Jan 28, 2020
  • You cannot use statistical methods on a single data point. So I guess our best bet would be to see what works for most of the data and just use that, which is why we look at the joint distribution. (Jan 28, 2020)

By the i.i.d. assumption, all observed data points come from the same joint distribution of $(x, t)$: hence, all are characterized by the same $\theta$.

This assumption allows us to use all the data points together, increasing the statistical power of our estimate of the single $\theta$.

Maximizing each observation's likelihood separately is perfectly legitimate: it is as if we had $n$ separate single-observation samples. We will get $n$ estimates of $\theta$, possibly all different if there are no ties in the sample. Assume no ties for simplicity.

But say we do this. What has our estimator learned? How will the estimator react when we feed it the observation $x^{n+1}$, which is different from all of the first $n$ values, and ask it for the estimated $t^{n+1}$? Which of the $n$ available $\hat \theta$'s will the estimator use?

But assume now that the observation $x^{n+1}$ is identical to one of the first $n$ observations, say $x^{n+1} = x^j$. In such a case it appears that the strategy of maximizing each observation's likelihood and obtaining "individual" $\hat \theta^i$'s pays off: the estimator should use $\hat \theta^j$, which is tailor-made for this value of $x$. But why use $\hat \theta^j$ at all, since if $x^{n+1} = x^j$ the best prediction is simply $t^{n+1} = t^j$?
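A toy sketch of the first issue (my own example with a one-dimensional logistic model and made-up data, not part of the argument above): the $n$ per-observation estimates give contradictory predictions for a new input, while the pooled estimate gives a single one.

```python
# With one theta-hat fitted per observation, the prediction for a new x depends
# entirely on which theta-hat you happen to pick; the pooled estimate does not.
# Purely illustrative: 1-D logistic model p(t | x; theta) = sigmoid(t * theta * x).
import numpy as np

x = np.array([-2.0, -0.5, 0.3, 1.0, 2.5])
t = np.array([-1,    1,  -1,   1,   1])                 # note the two "noisy" labels

thetas = np.linspace(-10, 10, 2001)
loglik = -np.logaddexp(0.0, -np.outer(t * x, thetas))   # log p(t^i | x^i; theta)

theta_per_point = thetas[np.argmax(loglik, axis=1)]     # one estimate per observation
theta_pooled = thetas[np.argmax(loglik.sum(axis=0))]    # single estimate from all the data

x_new = 0.1                                             # an x not seen in the sample
print(1.0 / (1.0 + np.exp(-theta_per_point * x_new)))   # n conflicting values of P(t=+1 | x_new)
print(1.0 / (1.0 + np.exp(-theta_pooled * x_new)))      # one pooled prediction
```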


Here is the problem with trying to learn $\theta^\star$ by solving the problem below:

$$ \theta^\star = \arg\max_\theta \thinspace p(t^i|x^i; \theta), \; \forall \; i = 1,\ldots, n.$$

Suppose your data set is as follows, and your parameter is one-dimensional

$$\begin{array}{|c|c|c|} \hline \text{Observation} & \text{x} & \text{t} \\ \hline \text{1} & 3 & 1\\ \hline \text{2} & 8 & 1 \\ \hline \text{3} & 3 & 0 \\ \hline \end{array}$$

So your model will try to learn a $\theta^\star$ which maximizes both $p(t^1=1 \mid x^1=3;\theta)$ and $p(t^3=0 \mid x^3=3;\theta)$, but for any reasonable model the maximizing values of $\theta$ will be different. Therefore a $\theta^\star$ which simultaneously maximizes all of the individual $p(t^i \mid x^i;\theta)$ will not exist.

On the other hand, a unique $\theta^\star$ which maximizes the joint likelihood will exist for many models.
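A quick numerical check of this (my own encoding: I recode the table's $t=0$ as $t=-1$ to match the question's $\{1, -1\}$ convention and use the one-dimensional logistic model $p(t \mid x; \theta) = \sigma(t\,\theta x)$):

```python
# Observations 1 and 3 share x = 3 but carry opposite labels, so their individual
# likelihoods are maximized at opposite ends of any range of theta: no single theta
# maximizes both. The joint log-likelihood still has one in-between maximizer.
import numpy as np

x = np.array([3.0, 8.0, 3.0])
t = np.array([1, 1, -1])                                # the table's t = 0 recoded as -1

thetas = np.linspace(-5, 5, 2001)
loglik = -np.logaddexp(0.0, -np.outer(t * x, thetas))   # log p(t^i | x^i; theta)

print(thetas[np.argmax(loglik[0])])            # best theta for observation 1: +5 (grid edge)
print(thetas[np.argmax(loglik[2])])            # best theta for observation 3: -5 (grid edge)
print(thetas[np.argmax(loglik.sum(axis=0))])   # joint maximizer: a single value in between
```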
