$\begingroup$

This is a very interesting problem. I wonder whether Bayesian statistics as a whole is neglecting it, or whether I am just confused. I will illustrate it with Bayesian Maximum a Posteriori (MAP) estimation, and later we can discuss what happens when the full posterior distribution is considered (e.g. via MCMC).

We are looking for the MAP estimate of the parameters $\theta$:

$$\theta_{MAP} = \mathrm{argmax}_\theta \ \ p(\theta|y) = \mathrm{argmax}_\theta \ \ {p(y|\theta) \cdot p(\theta) \over p(y)} = \mathrm{argmax}_\theta \ \ p(y|\theta) \cdot p(\theta) = \mathrm{argmax}_\theta \ \ \mathrm{log}(p(y|\theta) \cdot p(\theta)) = \mathrm{argmax}_\theta \ \ \mathrm{log} \ p(y|\theta) + \mathrm{log} \ p(\theta)$$

where $\mathrm{log} \ p(y|\theta)$ is the log-likelihood and $\mathrm{log} \ p(\theta)$ is the log-prior. Now, this is interesting: the magnitude of the log-likelihood varies strongly with data size; I get values like $-4414.6$ for $N=7576$ and $-59.4$ for $N=59$. But the magnitude of the log-prior stays the same.

So, purely as a function of data size, the log-prior will get a different weight in the estimation relative to the log-likelihood! Which looks like a different model. Question 1: Am I right in this conclusion?
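The scaling is easy to see numerically. A minimal sketch (assuming a Gaussian likelihood and a wide Gaussian prior, both hypothetical choices just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta = 0.0  # a candidate parameter value (hypothetical)

log_liks = {}
for n in (59, 7576):
    y = rng.normal(size=n)                               # simulated data of size n
    log_liks[n] = stats.norm.logpdf(y, loc=theta).sum()  # grows roughly linearly in n
log_prior = stats.norm.logpdf(theta, scale=10.0)         # a fixed number, independent of n
print(log_liks, log_prior)
```

The summed log-likelihood is roughly $N$ times the average per-observation term, while the log-prior is a single constant, so the balance between the two shifts with $N$.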

So (Question 2), wouldn't it make more sense to "normalize" the log-likelihood by the data size, like this?

$$\mathrm{argmax}_\theta \ \ {1 \over N} \mathrm{log} \ p(y|\theta) + \mathrm{log} \ p(\theta)$$
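To see what this normalization does to the estimate, here is a sketch using a conjugate toy model (a standard-normal prior on the mean of Gaussian data; all choices hypothetical). With the standard MAP objective the estimate approaches the MLE $\bar y$ as $n$ grows; with the $1/N$-scaled likelihood the prior keeps a constant relative weight and never washes out:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, size=200)   # data with true mean 1
n, ybar = y.size, y.mean()

def neg_map(t):
    # standard MAP objective: -(log-likelihood + log-prior)
    return -(stats.norm.logpdf(y, loc=t).sum() + stats.norm.logpdf(t))

def neg_map_scaled(t):
    # log-likelihood divided by n, as proposed above
    return -(stats.norm.logpdf(y, loc=t).sum() / n + stats.norm.logpdf(t))

theta_map = optimize.minimize_scalar(neg_map).x         # analytically n*ybar/(n+1)
theta_scaled = optimize.minimize_scalar(neg_map_scaled).x  # analytically ybar/2
```

In this conjugate setting the scaled objective has its maximum at $\bar y / 2$ regardless of $n$: the prior keeps pulling the estimate halfway to zero no matter how much data arrives.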

For example, in Yi et al (2011), they do this:

[Images from Yi et al. (2011): their penalized objective, in which the penalty term is weighted by a factor $\lambda$.]

I already tried this approach and found that different values of $\lambda$ will yield different cross-validation results (which doesn't yet imply that the likelihood should be normalized by data size, but it shows that the relative scale of the prior to the likelihood is important).

There are approaches, such as regularization, where this issue is handled explicitly. But, surprisingly, when I look at the papers and current practice in my area (in the context of Gaussian processes), taking care of this important issue (as in the paper above) is rare; most studies do not address it at all and simply stay with the Bayesian MAP approach. This confuses me, because:

(Question 3) Isn't the MAP approach wrong a priori, because the weight of the log-prior (i.e. the penalization or regularization, however you interpret the prior) relative to the log-likelihood changes with data size?

An interesting follow-up discussion would be whether the problem also applies when we don't just care about the maximum of the distribution, as in MAP, but study the whole posterior distribution, e.g. via MCMC. Shouldn't we then be looking at this instead?

$${p(y|\theta)^{1 \over N} \cdot p(\theta) \over p(y)}$$


Reference: Yi, G., Shi, J. Q., & Choi, T. (2011). Penalized Gaussian process regression and classification for high-dimensional nonlinear data. Biometrics, 67(4), 1285–1294. doi:10.1111/j.1541-0420.2011.01576.x

$\endgroup$
  • $\begingroup$ That the influence of the prior vanishes as we collect more and more data is a well-known result and makes intuitive sense. This property allows prior beliefs to change with enough data. $\endgroup$ Commented Apr 22, 2020 at 14:53
  • $\begingroup$ @DemetriPananos wow, thanks for the interesting comment! But let's be realistic about how priors are used: most people do not have any "prior beliefs" (as priors are commonly advertised); most people in fact use priors to penalize model complexity, which shrinks the parameters towards a simple model. When you start thinking about priors like this, it becomes less intuitive: why should the weight of the penalization term change with data size? $\endgroup$
    – Tomas
    Commented Apr 22, 2020 at 15:20
  • $\begingroup$ I'm not sure if that is how "most" people use priors, and frankly that seems like a sweeping generalization. Regularization certainly is a use of priors (a la Bayesian ridge regression for example), but priors are and certainly should be used to impart valuable information to the model prior to seeing data. $\endgroup$ Commented Apr 22, 2020 at 15:28
  • $\begingroup$ @DemetriPananos I have read tens, maybe well over a hundred papers, and I've seen maybe just one or two that actually had prior beliefs and used priors to model them... most people have no "beliefs" about what the value of the $\rho$ parameter of their PC Matérn prior should be... Most studies use weakly informative priors with shrinkage properties, i.e. whose only function is to penalize model complexity (whether the authors know it or not :-)) $\endgroup$
    – Tomas
    Commented Apr 22, 2020 at 15:40
  • $\begingroup$ What gets published and what gets used in, say, industry, can be two very different things. When you have access to domain experts, as I do - and have for most of my career - informative priors become the norm, not the exception. One does have to be wary of the trap of the experts' opinions being partially based on the same data that goes into the likelihood function, however! For publication, though, people prefer uninformative priors, because it makes it clear to the reader that the results were driven by the data, not by the prior. $\endgroup$
    – jbowman
    Commented Apr 22, 2020 at 16:05

1 Answer

$\begingroup$

Before answering your question, let's first make the Bayesian mindset clear:

The whole point of Bayesian inference is to use the posterior distribution to summarize everything we know about a given random quantity (such as $\theta$ in your case).

Where "everything we know" contains two parts:

  • part 1: the prior information about $\theta$, represented by the prior distribution $p(\theta)$;
  • part 2: the information from the observations, represented by the likelihood $p(y|\theta)$.

To combine the two parts of information, you of course need to assign a weight to each of them; one of them will carry more weight than the other.

And as you have noticed, the sample size $n$ acts as the information weight for part 2. This makes sense: the more samples you observe, the more convincingly they speak about $\theta$, so they should play a bigger role in the posterior (keep in mind that the posterior is just a way to represent the combined information of part 1 and part 2).

As for part 1, a prior strength is always assigned when the prior is defined (for example, when using an inverse-Wishart distribution as the prior for a covariance matrix, the degrees of freedom of the inverse-Wishart distribution are the prior strength). This prior strength acts as the information weight for part 1.

What if you assign weight $1/n$ to the observations, by multiplying the log-likelihood by $1/n$ as you did in the question? Then the weight of part 2 relative to part 1 will not increase as new data are observed, which means that no matter how many samples you observe, your total knowledge about $\theta$ won't change.

For example:

Say I want to model a coin toss, using $\mathrm{Beta}(2,2)$ as the prior distribution for the probability of getting a head (here the prior strength is $2$).

After 10 observations, say 4 heads and 6 tails, the posterior distribution will be $$ \theta \sim \mathrm{Beta}(4+2,6+2) $$ The variance of $\theta$ is $\mathrm{Var}(\theta) = 0.0163$; the variance shows how uncertain you are about $\theta$. The MAP estimate in this case is $\theta=0.4167$. If you increase your sample size to 1000, with 400 heads and 600 tails, then the posterior will be $$ \theta \sim \mathrm{Beta}(400+2,600+2) $$ The variance is now $\mathrm{Var}(\theta) = 0.0002$, and the MAP estimate is $\theta=0.4002$. You can see that the variance is much smaller when more samples have been observed: increasing the number of samples increases the amount of information in the posterior, and more information results in less uncertainty. Also, the MAP estimate moves closer to the MLE as the sample size increases, because the observations carry more weight in the posterior and thus have more say in the estimate.
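These numbers can be checked directly with `scipy.stats.beta`; the mode formula $(a-1)/(a+b-2)$ of the Beta distribution gives the MAP estimate:

```python
from scipy import stats

posterior = {}
for heads, tails in ((4, 6), (400, 600)):
    a, b = heads + 2, tails + 2          # Beta(2, 2) prior plus observed counts
    var = stats.beta(a, b).var()         # posterior variance
    map_est = (a - 1) / (a + b - 2)      # mode of the Beta posterior = MAP
    posterior[heads + tails] = (var, map_est)
print(posterior)
```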

But if you weight the log-likelihood by $1/n$, then for 10 and 1000 samples alike the posterior will be the same: $$ \theta \sim \mathrm{Beta}(0.4+2,0.6+2) $$ The variance is $\mathrm{Var}(\theta)=0.0416$ and won't shrink as more samples are observed. The MAP estimate will be $0.4667$ and won't show a tendency towards the MLE.
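The tempered case can be checked the same way: raising the binomial likelihood to the power $1/n$ scales the counts down to the observed proportions, so the posterior is identical for both sample sizes:

```python
from scipy import stats

tempered = {}
for n, heads in ((10, 4), (1000, 400)):
    a = heads / n + 2            # tempered pseudo-count of heads: heads/n = 0.4
    b = (n - heads) / n + 2      # tempered pseudo-count of tails: tails/n = 0.6
    tempered[n] = (stats.beta(a, b).var(), (a - 1) / (a + b - 2))
print(tempered)
```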

$\endgroup$
