
Is it true that Bayesian methods don't overfit? (I saw some papers and tutorials making this claim)

For example, if we apply a Gaussian Process to MNIST (handwritten digit classification), but only show it a single sample, will it revert to the prior distribution for any inputs different from that single sample, however small the difference?

  • I was just thinking: is there a mathematically precise way to define "overfitting"? If there is, you can likely also build features into a likelihood function or a prior to avoid it happening. My thinking is that this notion sounds similar to "outliers". – Commented May 5, 2019 at 13:49
  • Your example of a posterior "reverting" to the prior is known as posterior collapse. It's not a bug of the Bayesian method but rather an indication of insufficient information in your sample (at least for parts of your model). In a non-Bayesian setting your model would simply be unidentified. – Durden, Mar 5 at 17:17

2 Answers


No, it is not true. Bayesian methods will certainly overfit the data. There are a couple of things that make Bayesian methods more robust against overfitting, but you can also make them more fragile.

The combinatoric nature of Bayesian hypotheses, rather than the binary hypotheses of null hypothesis methods, allows for multiple comparisons when you lack the "true" model. A Bayesian posterior effectively penalizes an increase in model structure, such as adding variables, while rewarding improvements in fit. The penalties and gains are not optimizations, as they would be in non-Bayesian methods, but shifts in probability driven by the new information.
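
As a toy numerical illustration of that automatic complexity penalty (my own sketch, not part of the original answer): compare the marginal likelihood of a fixed fair-coin model with that of a model whose bias parameter gets a uniform prior and is integrated out. On data that look perfectly fair, the simpler model receives the higher evidence, even though the flexible model fits at least as well at its best-fitting parameter value.

```python
# Toy example: Bayesian Occam's razor via marginal likelihoods.
# M0: coin is fair, p = 0.5 (no free parameters).
# M1: p ~ Uniform(0, 1) (one free parameter, integrated out).
import numpy as np
from scipy.special import comb, betaln

n, k = 10, 5  # 10 flips, 5 heads -- data that look perfectly "fair"

# Marginal likelihood of M0: Binomial(k | n, 0.5)
log_ev_m0 = np.log(comb(n, k)) + n * np.log(0.5)

# Marginal likelihood of M1: integral of Binomial(k | n, p) over p ~ Uniform(0, 1)
# = C(n, k) * B(k + 1, n - k + 1), which simplifies to 1 / (n + 1)
log_ev_m1 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1)

print(np.exp(log_ev_m0))  # ~0.246
print(np.exp(log_ev_m1))  # ~0.091 -> the more flexible model is penalized automatically
```

The complexity penalty emerges from integrating over the parameter rather than from any explicit penalty term added by hand.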

While this generally gives a more robust methodology, there is an important constraint: you must use proper prior distributions. There is a tendency to mimic Frequentist methods by using flat priors, but this does not assure a proper solution. There are articles on overfitting in Bayesian methods, and it appears to me that the sin is usually trying to be "fair" to non-Bayesian methods by starting with strictly flat priors. The difficulty is that the prior plays an important role in normalizing the likelihood.
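
To make the role of the prior concrete, here is a minimal sketch (mine, not the answer's): after three successes in three trials, the maximum likelihood estimate of the success probability jumps to 1.0, while a flat Beta(1, 1) prior already pulls the posterior mean back to 0.8 and a mildly informative Beta(5, 5) prior pulls it to about 0.62.

```python
# Sketch: how the prior tempers a tiny sample in a Beta-Binomial model.
n, k = 3, 3  # three trials, three successes

mle = k / n  # 1.0 -- the "overfit" point estimate from the data alone

def posterior_mean(a, b, k, n):
    """Posterior mean of p under a Beta(a, b) prior after k successes in n trials."""
    return (a + k) / (a + b + n)

print(mle)                         # 1.00
print(posterior_mean(1, 1, k, n))  # 0.80  flat Beta(1, 1) prior
print(posterior_mean(5, 5, k, n))  # 0.62  mildly informative Beta(5, 5) prior
```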

Bayesian models are intrinsically optimal in Wald's sense of admissibility, but there is a hidden bogeyman in there: Wald assumes the prior is your true prior and not some prior you are using so that editors won't ding you for putting too much information in it. They are not optimal in the same sense that Frequentist models are. Frequentist methods begin with the optimization of minimizing the variance while remaining unbiased.

That is a costly optimization in that it discards information and is not intrinsically admissible in Wald's sense, though it frequently is admissible in practice. So Frequentist models provide an optimal fit to the data, given unbiasedness. Bayesian models are neither unbiased nor optimal fits to the data; this is the trade you are making to minimize overfitting.

Bayesian estimators are intrinsically biased estimators (unless special steps are taken to make them unbiased) and are usually a worse fit to the data. Their virtue is that they never use less information than an alternative method to find the "true model," and this additional information makes Bayesian estimators never riskier than alternative methods, particularly when working out of sample. That said, there will always exist a sample that could have been randomly drawn that would systematically "deceive" the Bayesian method.
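
Here is a minimal simulation of that bias-for-risk trade (my own sketch, assuming, as Wald does, that the prior used really is the true prior; the variances and sample size are arbitrary choices): the conjugate posterior mean in a normal model is biased toward the prior mean for any fixed θ, yet averaged over θ drawn from that prior it has a lower mean squared error than the unbiased sample mean.

```python
# Sketch: biased Bayes estimator vs unbiased sample mean, evaluated under the
# assumption that the prior N(0, tau^2) is the *true* prior.
import numpy as np

rng = np.random.default_rng(0)
tau2, sigma2, n = 1.0, 4.0, 5   # prior variance, noise variance, sample size
n_sims = 100_000

theta = rng.normal(0.0, np.sqrt(tau2), n_sims)               # true means drawn from the prior
xbar = theta + rng.normal(0.0, np.sqrt(sigma2 / n), n_sims)  # sample means around them

# Posterior mean under the conjugate normal model: shrink xbar toward the prior mean 0
shrink = tau2 / (tau2 + sigma2 / n)
bayes = shrink * xbar

print(np.mean((xbar - theta) ** 2))   # risk of the unbiased estimator  ~ sigma2/n = 0.8
print(np.mean((bayes - theta) ** 2))  # risk of the biased Bayes estimator ~ 0.44 (lower)
```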

As to the second part of your question, if you were to analyze a single sample, the posterior would be forever altered in all its parts and would not revert to the prior unless a second sample exactly cancelled out all the information in the first. At least in theory this is true. In practice, if the prior is sufficiently informative and the observation sufficiently uninformative, the impact can be so small that a computer cannot measure the difference because of the limited number of significant digits. It is possible for an effect to be too small for a computer to register any change in the posterior.
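
To tie this back to the Gaussian process in the question, here is a small sketch (mine; it uses scikit-learn's GaussianProcessRegressor on a 1-D input as a stand-in for the MNIST classifier, and the RBF kernel and single training point are my own choices): with one observation, predictions far from that point are numerically indistinguishable from the prior, even though, mathematically, the posterior differs from the prior everywhere the kernel is nonzero.

```python
# Sketch: a GP fit to a single observation reverts (numerically) to its prior
# far from that observation, but not exactly at any finite distance.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# optimizer=None keeps the kernel hyperparameters fixed at their prior values
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None)
gp.fit(np.array([[0.0]]), np.array([1.0]))   # a single training point: f(0) = 1

for x in [0.0, 1.0, 3.0, 10.0]:
    mean, std = gp.predict(np.array([[x]]), return_std=True)
    print(x, mean[0], std[0])
# Near x = 0 the posterior mean is ~1 with std ~0; by x = 10 the mean is ~0 and
# the std is ~1, i.e. the zero-mean, unit-variance prior -- the update is still
# there in principle, but far below anything you would notice numerically.
```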

So the first answer is "yes": you can overfit a sample using a Bayesian method, particularly if you have a small sample size and improper priors. The second answer is "no": Bayes' theorem never forgets the impact of prior data, though the effect can be so small that you miss it computationally.

  • In "They begin with the optimization of minimizing the variance while remaining unbiased.", what is "They"? – Commented Mar 4, 2017 at 9:10
  • Only very few models (essentially a set of measure zero) permit the formation of unbiased estimators. For example, in a normal $N(\theta, \sigma^2)$ model, there is no unbiased estimator of $\sigma$. Indeed, most times we maximize a likelihood, we end up with a biased estimator. – Andrew M, Sep 30, 2017 at 15:03
  • @AndrewM: There is an unbiased estimator of $\sigma$ in a normal model – stats.stackexchange.com/a/251128/17230. – Commented Apr 25, 2018 at 13:15
  • @nbro No, I do not. I have not worked in neural networks in so many years that little I would say would be trustworthy. – Commented Apr 14, 2020 at 20:45
  • "Bayesian models are intrinsically biased models": do you perhaps mean estimators rather than models? (I am thinking about model bias in terms of the bias-variance trade-off.) Also, "Bayesian models [are] never less risky than alternative models": I'm not exactly sure what you mean by risky, but is it perhaps the opposite? I found in another answer of yours that "all Bayesian estimators <...> are intrinsically the least risky way to calculate an estimator." – Commented Apr 5, 2021 at 4:11

Something to be aware of is that, here as practically everywhere else, a significant problem in Bayesian methods can be model misspecification.

This is an obvious point, but I thought I'd still share a story.

A vignette from back in undergrad...

A classic application of Bayesian particle filtering is to track the location of a robot as it moves around a room. Movement expands uncertainty while sensor readings reduce uncertainty.

I remember coding up some routines to do this. I wrote out a sensible, theoretically motivated model for the likelihood of observing various sonar readings given the true values. Everything was precisely derived and coded beautifully. Then I go to test it...

What happened? Total failure! Why? My particle filter rapidly thought that the sensor readings had eliminated almost all uncertainty. My point cloud collapsed to a point, but my robot wasn't necessarily at that point!

Basically, my likelihood function was bad; my sensor readings weren't as informative as I thought they were. I was overfitting. A solution? I mixed in a ton more Gaussian noise (in a rather ad-hoc fashion), the point cloud ceased to collapse, and then the filtering worked rather beautifully.
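
Here is a hypothetical 1-D re-creation of that failure mode (not the original code; the noise levels, motion model, and variable names are invented for illustration): a particle filter whose measurement model assumes far less sensor noise than the data actually contain collapses its effective sample size almost immediately.

```python
# Hypothetical 1-D re-creation of the overconfident-likelihood failure mode.
import numpy as np

rng = np.random.default_rng(1)
n_particles, n_steps = 1000, 20
true_sensor_sd = 0.5       # how noisy the sensor actually is
assumed_sensor_sd = 0.05   # what the overconfident likelihood assumes

x_true = 0.0
particles = rng.normal(0.0, 2.0, n_particles)   # initial position uncertainty

for t in range(n_steps):
    # Motion step: the robot moves +1 per step; process noise expands uncertainty
    x_true += 1.0
    particles += 1.0 + rng.normal(0.0, 0.1, n_particles)

    # Measurement step: a noisy reading of the true position
    z = x_true + rng.normal(0.0, true_sensor_sd)

    # Weights under the *assumed* sensor model (log-space for numerical safety)
    log_w = -0.5 * ((z - particles) / assumed_sensor_sd) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Effective sample size: collapses to a handful of particles when the
    # likelihood is overconfident
    print(t, round(1.0 / np.sum(w ** 2), 1))

    # Multinomial resampling: degenerate weights wipe out particle diversity
    particles = rng.choice(particles, size=n_particles, p=w)
```

Raising `assumed_sensor_sd` to the true 0.5 or above (the "mix in more noise" fix) should keep the effective sample size in the hundreds instead of collapsing toward a single particle.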

Moral?

As Box famously said, "all models are wrong, but some are useful." Almost certainly, you won't have the true likelihood function, and if it's sufficiently off, your Bayesian method may go horribly awry and overfit.

Adding a prior doesn't magically solve problems stemming from assuming observations are i.i.d. when they aren't, assuming the likelihood has more curvature than it really does, and so on.
