
I want to do model selection based on the best-fit/MAP/marginal posterior I find from an MCMC and likelihood maximization. I have a likelihood $\mathcal{L}(X \mid \theta)$ and informative priors $\pi(\theta)$ on my parameters, and I can find the posterior $p(\theta)$, its maximum $\hat{p}$ (the posterior value at the MAP estimate), and the maximized likelihood $\hat{\mathcal{L}}$, all for several candidate models.

I am now interested in testing whether one model $M_1$ is preferred over another model $M_2$ or not. I found that the most common way to do this is to compare the Bayesian Information Criterion

$$ BIC(M_i) = -2 \ln \hat{\mathcal{L}} + k \ln n $$

where $k$ is the number of free parameters in the model and $n$ is the number of data points. If $BIC(M_1) < BIC(M_2)$, then $M_1$ is preferred over $M_2$; so far so good.
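
For concreteness, here is a minimal sketch of the comparison I have in mind (Python, with made-up numbers; `loglik_hat`, `k`, and `n` stand in for the values from each fit):

```python
import numpy as np

def bic(loglik_hat, k, n):
    """BIC from the maximized log-likelihood, k free parameters, n data points."""
    return -2.0 * loglik_hat + k * np.log(n)

n = 500  # number of data points (same data for both models)
bic_m1 = bic(loglik_hat=-1234.5, k=3, n=n)  # made-up fit results
bic_m2 = bic(loglik_hat=-1230.1, k=6, n=n)
print("M1 preferred" if bic_m1 < bic_m2 else "M2 preferred")
```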

However, the derivation of the BIC assumes that the prior contribution $\ln \pi(\hat{\theta})$ is negligible compared to the log-likelihood $\ln \hat{\mathcal{L}}$ (or at least that $\pi(\hat{\theta})$ is not close to zero). In my case, I see that this is not necessarily true.

  1. Is there an information criterion that includes the prior term in the comparison?

  2. Why do Schwarz (and others) ignore the prior term? Can't we use a criterion with $-2 \ln \hat{\mathcal{L}}$ replaced by $-2 \ln \hat{p}$ (i.e., comparing the maxima a posteriori)?


1 Answer


BIC often doesn't work well with small sample sizes; the prior is (usually) pretty much irrelevant with larger sample sizes. Having said that, you certainly can use $\hat{p}$ in place of the maximized likelihood in an ad hoc manner; see the Wikipedia page on BIC, where that substitution is suggested as well.
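
For instance, a minimal sketch of that ad hoc substitution (my own illustration; `loglik_at_map` and `logprior_at_map` stand for the log-likelihood and log-prior evaluated at your MAP estimate):

```python
import numpy as np

def bic_mle(loglik_hat, k, n):
    # standard BIC, using the maximized log-likelihood
    return -2.0 * loglik_hat + k * np.log(n)

def bic_map(loglik_at_map, logprior_at_map, k, n):
    # ad hoc variant: swap in the (unnormalized) log-posterior height at the MAP
    return -2.0 * (loglik_at_map + logprior_at_map) + k * np.log(n)
```

Just keep in mind that the second version is no longer Schwarz's BIC; the extra prior term stays $O(1)$ while $k \ln n$ grows, so the two versions agree asymptotically.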

However, the derivation of BIC - see On the Derivation of the Bayesian Information Criterion - gets the $k \ln n$ term by assuming a flat prior, or, more generally, a noninformative prior, and a large $n$. If you don't want to make these assumptions, your version of "BIC" won't have the $k \ln n$ term; instead, it will look like equation (4) in the linked paper:

$$2 \ln P(y \mid M_i) \approx 2 \ln f(y \mid \tilde{\theta}_i) + 2 \ln g_i(\tilde{\theta}_i) + \vert\theta_i\vert \ln(2\pi) + \ln\bigl\vert\tilde{H}^{-1}_{\theta_i}\bigr\vert$$

where $M_i$ refers to model $i$, $\tilde{\theta}_i$ is the posterior mode of the parameters belonging to model $i$, $g_i$ is the prior on the parameters for model $i$, $\vert\theta_i\vert$ is the number of parameters in model $i$, and $\tilde{H}^{-1}_{\theta_i}$ is the inverse of the Hessian of the negative log posterior, $-\ln[f(y \mid \theta_i)\,g_i(\theta_i)]$, evaluated at the MAP estimate. BIC itself is derived by making some simplifying assumptions (i.e., the large-$n$, flat-prior ones) and showing how this equation reduces to the familiar BIC equation as a result.
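
As a rough numerical sketch of evaluating that approximation directly (my own illustration, not code from the paper): `log_lik` and `log_prior` are user-supplied functions of the parameter vector, the posterior mode is found with `scipy.optimize.minimize`, and $\tilde{H}$ is approximated by a finite-difference Hessian of the negative log posterior.

```python
import numpy as np
from scipy.optimize import minimize

def numerical_hessian(f, x, eps=1e-5):
    """Central-difference Hessian; adequate for a low-dimensional sketch."""
    k = x.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k)
            ej = np.zeros(k)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps**2)
    return H

def two_log_evidence_laplace(log_lik, log_prior, theta0):
    """Approximate 2*ln P(y|M) as in eq. (4): 2*ln f(y|th~) + 2*ln g(th~)
    + |th|*ln(2*pi) + ln|H~^{-1}|, where th~ is the posterior mode and H~ is
    the Hessian of the negative (unnormalized) log posterior at th~."""
    neg_log_post = lambda th: -(log_lik(th) + log_prior(th))
    theta_map = minimize(neg_log_post, theta0).x   # posterior mode (MAP)
    k = theta_map.size
    H = numerical_hessian(neg_log_post, theta_map)
    _, logdet_H = np.linalg.slogdet(H)             # ln|H|; ln|H^{-1}| = -ln|H|
    return (2.0 * log_lik(theta_map) + 2.0 * log_prior(theta_map)
            + k * np.log(2.0 * np.pi) - logdet_H)
```

Comparing this quantity across models (larger is better) is equivalent to comparing $-2 \ln P(y \mid M_i)$ (smaller is better), which is what a BIC-style criterion approximates.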

A quick overview of the logic follows. As $n \to \infty$, the influence of the prior term $2 \ln g_i(\tilde{\theta}_i)$ and of the $\vert \theta_i \vert \ln(2\pi)$ term becomes negligible, as both remain constant while the remaining terms grow with $n$. The last term on the r.h.s. is the negative log-determinant of (approximately) the observed Fisher information matrix; some algebra and the weak law of large numbers, made explicit in the linked paper, show that $\tilde{H}_{\theta_i} \to n\,\mathcal{I}_{\theta_i}$, where $\mathcal{I}_{\theta_i}$ is the expected Fisher information matrix for a single observation, with dimension equal to the number of elements of $\theta_i$. Consequently, $\ln\vert\tilde{H}^{-1}_{\theta_i}\vert \approx -\vert\theta_i\vert \ln n$ up to an additive constant. Setting $k = \vert \theta_i \vert$, negating, and dropping the constant terms gets us the rest of the way there.
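
Spelled out (in my notation), the determinant step is

$$\ln\bigl\vert \tilde{H}^{-1}_{\theta_i} \bigr\vert \approx \ln\bigl\vert \bigl(n\,\mathcal{I}_{\theta_i}\bigr)^{-1} \bigr\vert = -\vert\theta_i\vert \ln n - \ln\bigl\vert \mathcal{I}_{\theta_i} \bigr\vert,$$

where the second term does not grow with $n$ and is dropped along with the prior and $\ln(2\pi)$ terms. Negating what remains leaves $-2 \ln f(y \mid \tilde{\theta}_i) + \vert\theta_i\vert \ln n$, which is the familiar BIC once $\tilde{\theta}_i$ is replaced by the MLE $\hat{\theta}_i$ (the two coincide asymptotically when the prior is flat).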
