I answered the question on overfitting that you reference, and I have watched the video and read the blog post. Radford Neal is not saying that Bayesian models do not overfit. Let us remember that overfitting is the phenomenon of noise being treated as signal and impounded into the parameter estimate. That is not the only source of model selection error. Neal's discussion is broader, though by venturing into the idea of a small sample size he also ventured into the discussion of overfitting.
Let me partially revise my prior posting, that Bayesian models can overfit, to say that all Bayesian models overfit, but do so in a way that improves prediction. Again, going back to the definition of confusing noise with signal, the posterior distribution in Bayesian methods is the quantification of the uncertainty as to what is signal and what is noise. In doing so, Bayesian methods are impounding noise into estimates of signal, as the whole posterior is used in inference and prediction. Overfitting and other sources of model classification error are a different type of problem in Bayesian methods.
To simplify, let us adopt the structure of Ma’s talk and focus on linear regression and avoid the deep learning discussion because, as he points out, the alternative methods he mentions are just compositions of functions and there is a direct linkage between the logic of linear regression and deep learning.
Consider the following potential model $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$ Let's create a broad sample of size $N$ composed of two subsamples, $n_1,n_2$, where $n_1$ is the training set and $n_2$ is the validation set. We will see why, subject to a few caveats, Bayesian methods do not need a separate training and validation set.
For this discussion, we need to create eight more parameters, one for each model. They are $m_1,\dots,m_8$. They follow a multinomial distribution and have proper priors, as do the regression coefficients. The eight models are $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3,$$ $$y=\beta_0,$$ $$y=\beta_0+\beta_1x_1,$$ $$y=\beta_0+\beta_2x_2,$$ $$y=\beta_0+\beta_3x_3,$$ $$y=\beta_0+\beta_1x_1+\beta_2x_2,$$ $$y=\beta_0+\beta_1x_1+\beta_3x_3,$$ and $$y=\beta_0+\beta_2x_2+\beta_3x_3.$$
Now we need to get into the weeds of the differences between Bayesian and Frequentist methods. In the training set, $n_1$, the modeler using Frequentist methods chooses just one model. The modeler using Bayesian methods is not so restricted. Although the Bayesian modeler could use a model selection criterion to find just one model, they are also free to use model averaging. The Bayesian modeler is also free to change selected models in midstream in the validation segment. Moreover, the modeler using Bayesian methods can mix and match between selection and averaging.
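To make the idea of models as parameters concrete, here is a minimal sketch of how posterior probabilities for the eight models could be computed. It is not the procedure from any of the sources discussed; it assumes simulated data, a known noise variance `sigma2`, a zero-mean Gaussian prior with variance `tau2` on every coefficient, and a uniform prior over the models. Under those assumptions the marginal likelihood of each model is available in closed form. The function names are my own.

```python
import itertools
import numpy as np

def log_marginal_likelihood(X, y, sigma2=1.0, tau2=10.0):
    """Log evidence of y when beta ~ N(0, tau2*I) and noise ~ N(0, sigma2*I).

    Integrating beta out gives y ~ N(0, sigma2*I + tau2*X X^T).
    """
    n = len(y)
    cov = sigma2 * np.eye(n) + tau2 * X @ X.T
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def posterior_model_probs(x, y, sigma2=1.0, tau2=10.0):
    """Map each subset of predictor indices to its posterior model probability."""
    n, p = x.shape
    subsets = [s for r in range(p + 1)
               for s in itertools.combinations(range(p), r)]
    log_ev = []
    for s in subsets:
        # Every model keeps the intercept; s picks which predictors enter.
        X = np.column_stack([np.ones(n)] + [x[:, j] for j in s])
        log_ev.append(log_marginal_likelihood(X, y, sigma2, tau2))
    log_ev = np.array(log_ev)
    # Uniform model prior (an assumption), so it cancels in the normalization.
    w = np.exp(log_ev - log_ev.max())
    return dict(zip(subsets, w / w.sum()))

# Simulated data whose "true model in nature" uses x1 and x3 only.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * x[:, 0] - 1.5 * x[:, 2] + rng.normal(size=200)

probs = posterior_model_probs(x, y)
for subset, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(subset, round(float(prob), 4))
```

Indices are zero-based, so `(0, 2)` is the model with $x_1$ and $x_3$. Model selection would keep only the top entry; model averaging would keep all eight weights.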
To give a real-world example, I tested 78 models of bankruptcy. Of the 78 models, the combined posterior probability of 76 of them was about one ten-thousandth of one percent. The other two models were roughly 54 percent and 46 percent respectively. Fortunately, they also did not share any variables. That allowed me to select both models and ignore the other 76. When I had all the data points for both, I averaged their predictions based on the posterior probabilities of the two models, using only one model when I had missing data points that precluded the other. While I did have a training set and validation set, it wasn’t for the same reason a Frequentist would have them. Furthermore, at the end of every day over two business cycles, I updated my posteriors with each day’s data. That meant that my model at the end of the validation set was not the model at the end of the training set. Bayesian models do not stop learning while Frequentist models do.
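The averaging rule described above can be sketched in a few lines. This is only an illustration of the logic, not my production code: the function name is hypothetical, `None` stands in for a prediction that cannot be formed because of missing data, and the default weights are the rounded posterior probabilities quoted above.

```python
def averaged_prediction(pred_a, pred_b, p_a=0.54, p_b=0.46):
    """Combine two models' predictions using posterior model probabilities.

    A prediction of None means that model could not be evaluated
    (for example, a required input was missing), so the other is used alone.
    """
    if pred_a is None and pred_b is None:
        raise ValueError("neither model can produce a prediction")
    if pred_a is None:
        return pred_b
    if pred_b is None:
        return pred_a
    # Renormalize because the discarded models held a sliver of probability.
    return (p_a * pred_a + p_b * pred_b) / (p_a + p_b)

both = averaged_prediction(0.2, 0.4)       # both models available
only_b = averaged_prediction(None, 0.4)    # model A lacks data
```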
To go deeper, let us get concrete with our models. Let us assume that during the training sample the best fit Frequentist model and the Bayesian model using model selection matched or, alternatively, that the model weight in model averaging was so great that it was almost indistinguishable from the Frequentist model. We will imagine this model to be $$y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3.$$ Let’s also imagine that the true model in nature is $$y=\beta_0+\beta_1x_1+\beta_3x_3.$$
Now let's consider the difference in the validation set. The Frequentist model is overfitted to the data. Let’s assume that by some point $n_2^i$ the model selection or validation procedure had changed the selection to the true model in nature. Further, if model averaging was used, then the true model in nature carried weight in the prediction long before the choice of models was clear-cut. E.T. Jaynes in his tome on probability theory spends some time discussing this issue. I have the book at work so I cannot get you a good citation, but you should read it. Its ISBN is 978-0521592710.
Models are parameters in Bayesian thinking and as such are random, or if you would prefer, uncertain. That uncertainty does not end during the validation process. It is continually updated.
Because of the differences between Bayesian and Frequentist methods, there are other types of cases that also must be considered. The first comes from parameter inference, the second from formal predictions. They are not the same thing in Bayesian methods. Bayesian methods formally separate out inference and decision making. They also separate out parameter estimation and prediction.
Let’s imagine, without loss of generality, that a model would be successful if $\hat{\sigma^2}<k$ and a failure otherwise. We are going to ignore the other parameters because it would be a lot of extra work to get at a simple idea. For the modeler using Bayesian methods, this is a very different type of question than it is for the one using Frequentist methods.
For the Frequentist, a hypothesis test is formed based on the training set. The modeler using Frequentist methods would test whether the estimated variance is greater than or equal to $k$ and attempt to reject the null over the sample whose size is $n_2$ by fixing the parameters to those discovered in $n_1$.
The modeler using Bayesian methods would form parameter estimates from sample $n_1$, and the posterior density from $n_1$ would become the prior for sample $n_2$. Assuming the exchangeability property holds, it is assured that the posterior estimate after $n_2$ is equal, in all senses of the word, to the posterior estimate formed from the joint sample. Splitting them into two samples is equivalent by force of math to having not split them at all.
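This equivalence is easy to check numerically. The sketch below is an illustration under assumptions of my own choosing: simulated data, a known noise variance, and a conjugate Gaussian prior on the coefficients, so that the posterior has a closed form. It updates on $n_1$, uses that posterior as the prior for $n_2$, and compares the result with a single update on the pooled sample.

```python
import numpy as np

def update(mean, prec, X, y, sigma2=1.0):
    """Conjugate Gaussian update of regression coefficients (known noise variance).

    mean, prec: prior mean vector and prior precision matrix.
    Returns the posterior mean and posterior precision.
    """
    new_prec = prec + X.T @ X / sigma2
    new_mean = np.linalg.solve(new_prec, prec @ mean + X.T @ y / sigma2)
    return new_mean, new_prec

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 3))])
beta_true = np.array([1.0, 2.0, 0.0, -1.5])   # illustrative values only
y = X @ beta_true + rng.normal(size=300)

prior_mean, prior_prec = np.zeros(4), np.eye(4) / 10.0

# Two stages: the posterior from n_1 (first 200 rows) is the prior for n_2.
m1, P1 = update(prior_mean, prior_prec, X[:200], y[:200])
m2, P2 = update(m1, P1, X[200:], y[200:])

# One stage: all 300 rows at once.
mj, Pj = update(prior_mean, prior_prec, X, y)

same = np.allclose(m2, mj) and np.allclose(P2, Pj)
print(same)
```

The two routes agree up to floating-point error because the precisions and the precision-weighted means simply add across the subsamples.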
For predictions, a similar issue holds. Bayesian methods have a predictive distribution that is also updated with each observation, whereas the Frequentist one is frozen at the end of sample $n_1$. The predictive density can be written as $\Pr(\tilde{x}=k|\mathbf{X})$. If $\tilde{x}$ is the prediction and $\mathbf{X}$ is the sample, then where are the parameters, which we will denote $\theta$? They have been marginalized out, so the prediction is conditioned only on the data. Although Frequentist prediction systems do exist, most people just treat the point estimates as the true parameters and calculate residuals. Bayesian methods would score each prediction against the predicted density rather than just one single point. These predictions do not depend upon the parameters, which is different from the point methods used in Frequentist solutions.
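Written out, and assuming the new observation is independent of the sample given $\theta$, the predictive density is the likelihood of the new observation averaged over the posterior: $$\Pr(\tilde{x}=k|\mathbf{X})=\int_{\theta\in\Theta}\Pr(\tilde{x}=k|\theta)\Pr(\theta|\mathbf{X})\,\mathrm{d}\theta.$$ Here $\Theta$ is my notation for the parameter space. If model averaging is used, the same logic applies one level up, with each model's predictive density weighted by its posterior probability: $$\Pr(\tilde{x}=k|\mathbf{X})=\sum_{i=1}^{8}\Pr(\tilde{x}=k|m_i,\mathbf{X})\Pr(m_i|\mathbf{X}).$$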
As a side note, formal Frequentist predictive densities do exist using the standard errors, and scoring could be done on them, but this is rare in practice. If there is no specific prior knowledge, then the two sets of predictions should be identical for the same set of data points. They will end up differing because $n_1+n_2>n_1$ and so the Bayesian solution will impound more information.
If there is no material prior information and if Frequentist predictive densities are used rather than point estimates, then for a fixed sample the results of the Bayesian and Frequentist methods will be identical if a single model is chosen. If there is prior information, then the Bayesian method will tend to generate more accurate predictions. This difference can be very large in practice. Further, if there is model averaging, then it is quite likely that the Bayesian method will be more robust. If you use model selection and freeze the Bayesian predictions, then there is no difference from using a Frequentist model with Frequentist predictions.
I used a test and validation set because my data was not exchangeable. As a result, I needed to solve two problems. The first is similar to burn-in in MCMC methods. I needed a good set of parameter estimates to start my test sequence, and so I used fifty years of prior data to get a good prior density to start my validation test. The second problem was that I needed some form of standardized period to test in so that the test would not be questioned. I used the two prior business cycles as dated by NBER.