
I recently watched this talk by Eric J. Ma and checked his blog entry, where he quotes Radford Neal saying that Bayesian models do not overfit (but that they can overfit), and that when using them we do not need test sets to validate them (to me, the quotes seem to be about using a validation set to adjust the parameters rather than about test sets). Honestly, the arguments do not convince me, and I don't have access to the book, so could you give a more detailed and rigorous argument for, or against, such a statement?

By the way, in the meantime Eric Ma has pointed me to this discussion of the same topic.

  • One major hole in this argument with regard to that talk: if you're doing MCMC and you don't fully explore the posterior, your inference is totally invalid. If you are doing inference on a Bayesian neural network, you almost certainly have not explored very large portions of the posterior using MCMC. Therefore, you'd better split your data to double-check your inference! – Cliff AB, Mar 1, 2019 at 5:55
  • One thing to consider is what we are evaluating or validating. It may be that we don't use all the information we have (either in the prior or the likelihood); checking model fit can help answer this question. – Commented May 5, 2019 at 10:56

3 Answers


If we use "the one true model" and "true priors" reflecting appropriately captured prior information, then as far as I am aware a Bayesian truly does not have an overfitting problem, and the posterior predictive distribution given very little data will be suitably uncertain. However, if we use some kind of pragmatically chosen model (e.g. we have decided that the hazard rate is constant over time so that an exponential model is appropriate, or that some covariate is not in the model, which amounts to a point prior of zero on its coefficient) with some default uninformative or regularizing priors, then we really do not know whether this still applies. In that case the choice of (hyper-)priors has some arbitrariness to it that may or may not result in good out-of-sample predictions.

Thus, it is then very reasonable to ask whether the hyperparameter choice (i.e. the parameters of the hyperpriors) in combination with the chosen likelihood will perform well. In fact, you could easily decide that it is a good idea to tune your hyperparameters to obtain some desired predictive performance. From that perspective, a validation set (or cross-validation) to tune hyperparameters and a test set to confirm performance make perfect sense.
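To make this concrete, here is a minimal sketch (my own illustration, not part of the original answer) of what such tuning could look like: a conjugate Bayesian linear regression with prior $\beta \sim N(0,\tau^2 I)$ and known noise scale, where the prior scale `tau` acts as the hyperparameter and is chosen by 5-fold cross-validation on the held-out log posterior-predictive density. The data, grid, and variable names are all hypothetical.

```python
# A minimal sketch: tuning the hyperprior scale tau of a conjugate Bayesian
# linear regression by cross-validated log posterior-predictive density.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, sigma = 80, 5, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=sigma, size=n)

def log_pred_density(X_tr, y_tr, X_te, y_te, tau, sigma):
    """Exact posterior-predictive log density for the conjugate Gaussian model."""
    S = np.linalg.inv(X_tr.T @ X_tr / sigma**2 + np.eye(X_tr.shape[1]) / tau**2)
    m = S @ X_tr.T @ y_tr / sigma**2                 # posterior mean of beta
    mu = X_te @ m                                    # predictive mean
    var = np.einsum("ij,jk,ik->i", X_te, S, X_te) + sigma**2  # predictive variance
    return norm.logpdf(y_te, loc=mu, scale=np.sqrt(var)).sum()

# 5-fold cross-validation over a grid of candidate prior scales.
folds = np.array_split(rng.permutation(n), 5)
for tau in [0.1, 0.5, 1.0, 5.0, 50.0]:
    score = sum(
        log_pred_density(np.delete(X, f, 0), np.delete(y, f), X[f], y[f], tau, sigma)
        for f in folds
    )
    print(f"tau={tau:5.1f}  held-out log predictive density = {score:8.2f}")
```

The point is only that once `tau` has been tuned this way it has been adapted to the data, so a separate test set is the natural way to confirm the resulting performance.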

I think this is closely related to a number of discussions by Andrew Gelman on his blog (see e.g. blog entry 1, blog entry 2, blog entry 3 on LOO for Stan, and discussions of posterior predictive checks), where he talks about his concerns regarding the (in some sense correct) claim that a Bayesian should not need to check whether their model makes sense, and about practical Bayesian model evaluation.

Of course, we are very often most interested in using Bayesian methods precisely in settings where we have relatively little data and want to exploit somewhat informative priors. At that point it may become tricky to have enough data left over to get anywhere with validation and evaluation on a test set.


So, I answered the question on overfitting that you reference, I watched the video, and I read the blog post. Radford Neal is not saying that Bayesian models do not overfit. Let us remember that overfitting is the phenomenon of noise being treated as signal and impounded into the parameter estimate. That is not the only source of model selection error. Neal's discussion is broader, though by venturing into the idea of a small sample size he ventured into the discussion of overfitting.

Let me partially revise my prior posting: rather than saying that Bayesian models can overfit, I would say that all Bayesian models overfit, but do so in a way that improves prediction. Again, going back to the definition of confusing signal with noise: the uncertainty in Bayesian methods, the posterior distribution, is the quantification of that uncertainty as to what is signal and what is noise. In using the whole posterior for inference and prediction, Bayesian methods do impound noise into estimates of signal. Overfitting and other sources of model classification error are a different type of problem in Bayesian methods.

To simplify, let us adopt the structure of Ma's talk and focus on linear regression, avoiding the deep-learning discussion because, as he points out, the alternative methods he mentions are just compositions of functions, and there is a direct link between the logic of linear regression and that of deep learning.

Consider the following potential model $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$ Let's create a broad sample of size $N$ composed of two subsamples, $n_1$ and $n_2$, where $n_1$ is the training set and $n_2$ is the validation set. We will see why, subject to a few caveats, Bayesian methods do not need a separate training and validation set.

For this discussion, we need to create eight more parameters, one for each candidate model. Call them $m_1,\dots,m_8$. They follow a multinomial distribution and have proper priors, as do the regression coefficients. The eight models are $$y=\beta_0,$$ $$y=\beta_0+\beta_1x_1,$$ $$y=\beta_0+\beta_2x_2,$$ $$y=\beta_0+\beta_3x_3,$$ $$y=\beta_0+\beta_1x_1+\beta_2x_2,$$ $$y=\beta_0+\beta_1x_1+\beta_3x_3,$$ $$y=\beta_0+\beta_2x_2+\beta_3x_3,$$ and $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$

Now we need to get into the weeds of the differences between Bayesian and Frequentist methods. In the training set, $n_1$, the modeler using Frequentist methods chooses just one model. The modeler using Bayesian methods is not so restricted. Although the Bayesian modeler could use a model selection criterion to find just one model, they are also free to use model averaging. The Bayesian modeler is also free to change the selected model in midstream during the validation segment. Moreover, the modeler using Bayesian methods can mix and match between selection and averaging, as sketched below.
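As a rough sketch of the selection-versus-averaging distinction (my own illustration, not the procedure used in this answer), the following approximates the posterior probabilities of the eight candidate models with the common BIC approximation to the marginal likelihood, $p(\text{data}\mid m)\approx\exp(-\mathrm{BIC}/2)$ under equal prior model probabilities, and then forms a model-averaged prediction. A fully Bayesian treatment would instead integrate the coefficients out under their proper priors; the data here are simulated and hypothetical.

```python
# A rough sketch of Bayesian model averaging over eight candidate regressions,
# using the BIC approximation to each model's marginal likelihood.
from itertools import chain, combinations
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                       # columns: x1, x2, x3
y = 2.0 + 1.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=n)

# All subsets of {x1, x2, x3}, including the intercept-only model: 8 models.
subsets = list(chain.from_iterable(combinations(range(3), k) for k in range(4)))

def fit_ols(cols):
    """OLS fit on an intercept plus the given columns; returns design, coefs, BIC."""
    Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    bic = n * np.log(rss / n) + Z.shape[1] * np.log(n)
    return Z, beta, bic

fits = [fit_ols(cols) for cols in subsets]
bics = np.array([f[2] for f in fits])
weights = np.exp(-(bics - bics.min()) / 2)        # relative evidence per model
weights /= weights.sum()                          # approximate posterior model probabilities

for cols, w in zip(subsets, weights):
    names = [f"x{c + 1}" for c in cols] or ["intercept only"]
    print(f"model with {names}: weight {w:.4f}")

# Model-averaged in-sample predictions: a weighted mix of each model's predictions.
y_bma = sum(w * (Z @ beta) for (Z, beta, _), w in zip(fits, weights))
```

Selection would simply keep the single highest-weight model; averaging keeps them all, weighted by their (approximate) posterior probabilities.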

To give a real-world example, I tested 78 models of bankruptcy. Of the 78 models, the combined posterior probability of 76 of them was about one ten-thousandth of one percent. The other two models were roughly 54 percent and 46 percent respectively. Fortunately, they also did not share any variables. That allowed me to select both models and ignore the other 76. When I had all the data points for both, I averaged their predictions based on the posterior probabilities of the two models, using only one model when I had missing data points that precluded the other. While I did have a training set and validation set, it wasn’t for the same reason a Frequentist would have them. Furthermore, at the end of every day over two business cycles, I updated my posteriors with each day’s data. That meant that my model at the end of the validation set was not the model at the end of the training set. Bayesian models do not stop learning while Frequentist models do.

To go deeper, let us get concrete with our models. Let us assume that during the training sample the best-fitting Frequentist model and the Bayesian model chosen by model selection matched, or, alternatively, that its weight under model averaging was so great that it was almost indistinguishable from the Frequentist choice. We will imagine this model to be $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$ Let's also imagine that the true model in nature is $$y=\beta_0+\beta_1x_1+\beta_3x_3.$$

Now let's consider the difference in the validation set. The Frequentist model is overfitted to the data. Let's assume that by some point $n_2^i$ the model selection or validation procedure had changed the selection to the true model in nature. Further, if model averaging was used, then the true model in nature carried weight in the prediction long before the choice of models was clear-cut. E.T. Jaynes spends some time discussing this issue in his tome on probability theory, Probability Theory: The Logic of Science. I have the book at work so I cannot give you a good citation, but you should read it; its ISBN is 978-0521592710.

Models are parameters in Bayesian thinking and as such are random, or if you would prefer, uncertain. That uncertainty does not end during the validation process. It is continually updated.

Because of the differences between Bayesian and Frequentist methods, there are other types of cases that also must be considered. The first comes from parameter inference, the second from formal predictions. They are not the same thing in Bayesian methods. Bayesian methods formally separate out inference and decision making. They also separate out parameter estimation and prediction.

Let's imagine, without loss of generality, that a model would be successful if $\hat{\sigma}^2<k$ and a failure otherwise. We are going to ignore the other parameters because it would be a lot of extra work to get at a simple idea. For the modeler using Bayesian methods, this is a very different type of question than it is for the one using Frequentist methods.

For the Frequentist, a hypothesis test is formed based on the training set. The modeler using Frequentist methods would test whether the estimated variance is greater than or equal to $k$ and attempt to reject the null over the sample of size $n_2$, fixing the parameters to those discovered in $n_1$.

For the modeler using Bayesian methods, parameter estimates would be formed from sample $n_1$, and the posterior density from $n_1$ would become the prior for sample $n_2$. Assuming the exchangeability property holds, the posterior estimate from $n_2$ is then assured to be equal, in every sense of the word, to the probability estimate formed from the joint sample. Splitting the data into two samples is equivalent, by force of the math, to not having split it at all.
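A minimal conjugate example (my own, not the author's model) shows the arithmetic behind this claim: for exchangeable Bernoulli data with a Beta prior, updating on $n_1$ and then using that posterior as the prior for $n_2$ gives exactly the same posterior as updating once on the pooled sample.

```python
# A minimal sketch of "the posterior from n1 becomes the prior for n2":
# sequential conjugate updating equals batch updating for exchangeable data.
import numpy as np

rng = np.random.default_rng(2)
data = rng.binomial(1, 0.3, size=100)
n1, n2 = data[:60], data[60:]

a0, b0 = 1.0, 1.0                                 # Beta(1, 1) prior

# Sequential: update on n1, then use that posterior as the prior for n2.
a1, b1 = a0 + n1.sum(), b0 + len(n1) - n1.sum()
a2, b2 = a1 + n2.sum(), b1 + len(n2) - n2.sum()

# Batch: update once on the pooled sample.
aj, bj = a0 + data.sum(), b0 + len(data) - data.sum()

print((a2, b2) == (aj, bj))                       # True: identical posteriors
```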

For predictions, a similar issue holds. Bayesian methods have a predictive distribution that is also updated with each observation, whereas the Frequentist one is frozen at the end of sample $n_1$. The predictive density can be written as $\Pr(\tilde{x}=k\mid\mathbf{X})$. If $\tilde{x}$ is the prediction and $\mathbf{X}$ is the sample, then where are the parameters, which we will denote $\theta$? They have been integrated out: $\Pr(\tilde{x}=k\mid\mathbf{X})=\int \Pr(\tilde{x}=k\mid\theta)\,\pi(\theta\mid\mathbf{X})\,\mathrm{d}\theta$. Although Frequentist prediction systems do exist, most people just treat the point estimates as if they were the true parameters and calculate residuals. Bayesian methods would instead score each prediction against the predictive density rather than against one single point. These predictions do not depend upon the parameters, which distinguishes them from the point methods used in Frequentist solutions.
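As a small, hypothetical illustration of scoring against the predictive density rather than a single point: given posterior draws of $\theta$ (here the mean of a normal model with known noise scale), the posterior predictive density at a new observation can be approximated by averaging the likelihood over those draws, and the observation is then scored by its log predictive density instead of a plug-in residual.

```python
# A small sketch: Monte Carlo approximation of the posterior predictive density
# p(x_new | X) = E_posterior[ p(x_new | theta) ] for a normal model with
# unknown mean and known standard deviation, and the log score of a new point.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma = 1.0
X = rng.normal(loc=2.0, scale=sigma, size=50)     # observed sample

# Conjugate posterior for the mean under a flat prior: N(xbar, sigma^2 / n).
post_mean, post_sd = X.mean(), sigma / np.sqrt(len(X))
theta_draws = rng.normal(post_mean, post_sd, size=10_000)

def log_pred(x_new):
    """log p(x_new | X), averaging the likelihood over posterior draws."""
    return np.log(np.mean(norm.pdf(x_new, loc=theta_draws, scale=sigma)))

x_new = 2.7
print("log predictive density:", log_pred(x_new))
print("plug-in residual:", x_new - post_mean)     # the point-estimate alternative
```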

As a side note, formal Frequentist predictive densities do exist, built from the standard errors, and scoring could be done on them, but this is rare in practice. If there is no specific prior knowledge, then the two sets of predictions should be identical for the same set of data points. They will nonetheless end up differing, because the Bayesian solution keeps updating through the second sample and so conditions on $n_1+n_2>n_1$ observations, impounding more information.

If there is no material prior information and Frequentist predictive densities are used rather than point estimates, then for a fixed sample the results of the Bayesian and Frequentist methods will be identical if a single model is chosen. If there is prior information, then the Bayesian method will tend to generate more accurate predictions, and this difference can be very large in practice. Further, if model averaging is used, then it is quite likely that the Bayesian method will be more robust. If you use model selection and freeze the Bayesian predictions, then there is no difference from using a Frequentist model with Frequentist predictions.

I used a test and validation set because my data were not exchangeable. As a result, I needed to solve two problems. The first is similar to burn-in in MCMC methods: I needed a good set of parameter estimates to start my test sequence, so I used fifty years of prior data to obtain a good prior density with which to begin the validation period. The second problem was that I needed some standardized period to test over so that the test would not be questioned; I used the two prior business cycles as dated by the NBER.

  • But then, say that you estimated a MAP for a linear regression model with "uninformative" priors. This would be equivalent to obtaining the maximum likelihood estimate for the model, so ML doesn't need a test set either, assuming exchangeability? – Tim, Mar 13, 2018 at 7:48
  • "Overfitting is the phenomenon of noise being treated as signal and impounded into the parameter estimate": I believe this definition is specific to additive noise models. Otherwise overfitting vs. underfitting is not so well defined. – Commented Mar 13, 2018 at 8:54
  • @CagdasOzgenc thanks. Do you have a suggested edit? – Commented Mar 13, 2018 at 16:54
  • @Tim I never mentioned the MAP estimator. If you reduce the problem down to the MAP estimator, then you surrender the robustness. The MAP estimator is the point that minimizes a cost function over a density, which can be problematic for projections if the density lacks a sufficient statistic; the MAP estimator intrinsically loses information. If you were using the MAP estimator, which is not in the original question and clearly not part of Ma's presentation, then you create a different set of problems for yourself. – Commented Mar 13, 2018 at 16:58
  • I note that even here you cut down the options. Why not consider terms like $x_2\times x_3$ or $\cos(x_1^2)$ as candidates in your model? There is an infinite class of models you could build, yet you have chosen just $8$. What in your prior information tells you this is "enough"? – Commented May 5, 2019 at 11:03

I have also pondered this question and my tentative answer is a very practical one. Please take this with a pinch of salt.

Suppose you have a model that has no parameters. For instance, your model could be a curve that predicts the growth of some scalar quantity over time, and you have chosen this particular curve because it is prescribed by some available domain knowledge.

Since your model has no parameters, there is no need for a test set. To evaluate how well your model does, you can apply your model on the entire dataset. By applying, I mean you can check how well your chosen curve goes through observed data and use some criterion (e.g. likelihood) to quantify the goodness of the fit.
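A tiny sketch with made-up numbers of what this looks like: a growth curve whose rate is fixed entirely by (hypothetical) domain knowledge is scored on all of the data at once with a Gaussian log-likelihood; since nothing is tuned, nothing needs to be held out.

```python
# A tiny sketch of evaluating a parameter-free model: a growth curve fixed in
# advance by domain knowledge is scored on the whole dataset via log-likelihood.
import numpy as np
from scipy.stats import norm

t = np.arange(10)
observed = np.array([1.0, 1.3, 1.7, 2.3, 2.9, 3.9, 5.1, 6.6, 8.7, 11.2])

def fixed_curve(t):
    """Exponential growth with a rate of 0.27, prescribed in advance, not fitted."""
    return np.exp(0.27 * t)

# Goodness of fit of the fixed curve over the entire dataset.
log_lik = norm.logpdf(observed, loc=fixed_curve(t), scale=0.5).sum()
print("log-likelihood of the fixed curve:", log_lik)
```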

Now, in practice, our model will have some parameters. The Bayesian methodology sets the goal of calculating the marginal log-likelihood, which involves integrating out all the model parameters. The marginal log-likelihood quantifies how well the model explains the data (or should I say how well the data support the model?). By integrating out, we are left with no parameters to tune or optimise. I will risk saying that this seems to me very similar to the case where we had a model with no parameters, the similarity being that we do not need to adapt any parameters to the observed dataset.
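For a model simple enough to integrate analytically, the marginal likelihood can be written down directly. A minimal sketch (hypothetical priors and data): for Bernoulli observations with a Beta prior, $p(\text{data})=\int p(\text{data}\mid\theta)\,p(\theta)\,d\theta = B(a+k,\,b+n-k)/B(a,b)$, which scores the data with no parameter left to tune, and two rival priors can be compared on their evidence, as discussed below.

```python
# A minimal sketch of "integrating out the parameters": closed-form marginal
# likelihood of Bernoulli data under a Beta(a, b) prior, for two rival priors.
import numpy as np
from scipy.special import betaln

data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # hypothetical binary outcomes
n, k = len(data), int(data.sum())

def log_marginal_likelihood(a, b):
    """log p(data | Beta(a, b) prior) for an ordered Bernoulli sequence."""
    return betaln(a + k, b + n - k) - betaln(a, b)

print("flat prior     :", log_marginal_likelihood(1, 1))
print("sceptical prior:", log_marginal_likelihood(2, 8))
```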

There is a nice quote that I have read in this forum which states that "optimisation is the root of all evil in statistics" (I think it originates from user DikranMarsupial). In my understanding, this quote says that once you have stated your model assumptions, all you have to do is "turn the Bayesian crank". In other words, as long as you can be Bayesian (i.e. integrate out the parameters), you have no reason to worry about overfitting, as you are considering all possible settings of your parameters (with density dictated by the prior) according to your model assumptions. If instead you need to optimise a parameter, it is difficult to tell whether you are over-adapting it to the particular data you observe (overfitting) or not. One practical way of testing for overfitting is, of course, holding out a test set which is not used when optimising the parameter.

In the presence of rival models that effectively challenge your assumptions, you can compare them using the marginal log-likelihood in order to find the most likely one. Of course, somebody naughty may posit a model which perfectly replicates the observed data (trivially, it could be the data itself). In such a case, I am not sure how I would defend myself. In real life, however, it would be hard to motivate such a contrived model in a setting such as physics, where explanation-based models are required.

