
The AIC and BIC are both methods of assessing model fit penalized for the number of estimated parameters. As I understand it, BIC penalizes models more for free parameters than does AIC. Beyond a preference based on the stringency of the criteria, are there any other reasons to prefer AIC over BIC or vice versa?

  • I think it is more appropriate to call this discussion "feature" selection or "covariate" selection. To me, model selection is much broader, involving specification of the distribution of errors, the form of the link function, and the form of the covariates. When we talk about AIC/BIC, we are typically in the situation where all aspects of model building are fixed except the selection of covariates.
    – user13273
    Commented Aug 13, 2012 at 21:17
  • Deciding which specific covariates to include in a model does commonly go by the term model selection, and there are a number of books with "model selection" in the title that are primarily about deciding which covariates/parameters to include in the model. Commented Aug 24, 2012 at 14:44
  • I don't know if your question applies specifically to phylogeny (bioinformatics), but if so, this study can provide some thoughts on this aspect: ncbi.nlm.nih.gov/pmc/articles/PMC2925852
    – tlorin
    Commented Jan 3, 2018 at 9:09
  • The merged question also asks about KIC; please update the question text and state a definition of KIC, preferably with a link.
    – smci
    Commented Dec 12, 2018 at 22:05
  • @smci I've added stats.stackexchange.com/questions/383923/… to allow people to dig into questions related to the KIC if interested. Commented Dec 20, 2018 at 14:43

13 Answers

Answer (score 243)

Your question implies that AIC and BIC try to answer the same question, which is not true. The AIC tries to select the model that most adequately describes an unknown, high-dimensional reality. This means that reality is never in the set of candidate models being considered. BIC, by contrast, tries to find the TRUE model among the set of candidates. I find it quite odd to assume that reality is instantiated in one of the models that the researchers built along the way. This is a real issue for BIC.

Nevertheless, there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models. Overfitting occurs when the wrong model fits the data better than the generating one. The point of these simulations is to see how well AIC and BIC correct these overfits (a minimal sketch of such a simulation is given after this paragraph). Usually, the results point to the fact that AIC is too liberal and still frequently prefers a more complex, wrong model over a simpler, true model. At first glance these simulations seem to be really good arguments, but the problem with them is that they are meaningless for AIC. As I said before, AIC does not consider that any of the candidate models being tested is actually true. According to AIC, all models are approximations to reality, and reality should never have low dimensionality, at least not lower than that of some of the candidate models.
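To make the simulation recipe concrete, here is a minimal sketch in Python (mine, not from any published recovery study); the two nested Gaussian linear models, the sample size, and the number of replications are arbitrary illustrations:

```python
# Model-recovery sketch: data come from the simpler "model A"; we check how
# often AIC and BIC pick it over the over-parameterized "model B".
import numpy as np

rng = np.random.default_rng(0)

def gaussian_ic(y, X):
    """Return (AIC, BIC) for an OLS fit of y on X (X already contains the intercept)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                       # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    p = k + 1                                        # regression coefficients + variance
    return -2 * loglik + 2 * p, -2 * loglik + p * np.log(n)

n_obs, n_sims = 100, 500
aic_picks_true = 0
bic_picks_true = 0
for _ in range(n_sims):
    x1, x2 = rng.normal(size=(2, n_obs))
    y = 1.0 + 2.0 * x1 + rng.normal(size=n_obs)         # model A (x1 only) generates the data
    X_a = np.column_stack([np.ones(n_obs), x1])
    X_b = np.column_stack([np.ones(n_obs), x1, x2])     # model B adds an irrelevant covariate
    aic_a, bic_a = gaussian_ic(y, X_a)
    aic_b, bic_b = gaussian_ic(y, X_b)
    aic_picks_true += aic_a < aic_b
    bic_picks_true += bic_a < bic_b

print(f"AIC recovers the generating model in {aic_picks_true / n_sims:.0%} of runs")
print(f"BIC recovers the generating model in {bic_picks_true / n_sims:.0%} of runs")
```

In runs like this BIC typically recovers the generating model more often, which is exactly the kind of result such simulations take as evidence against AIC.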

My recommendation is to use both AIC and BIC. Most of the time they will agree on the preferred model; when they don't, just report it.

If you are unhappy with both AIC and BIC and have free time to invest, look up Minimum Description Length (MDL), a totally different approach that overcomes the limitations of AIC and BIC. There are several measures stemming from MDL, like normalized maximum likelihood or the Fisher information approximation. The problem with MDL is that it is mathematically demanding and/or computationally intensive.

Still, if you want to stick to simple solutions, a nice way of assessing model flexibility (especially when the numbers of parameters are equal, rendering AIC and BIC useless) is the parametric bootstrap, which is quite easy to implement (a rough sketch follows below). Here is a link to a paper on it.
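One way to set up such a parametric bootstrap (my own rough sketch, not necessarily the procedure in the linked paper; the two non-nested regression models and all numbers are invented for illustration) is to simulate from one fitted model and see what fit differences it typically produces:

```python
# Parametric-bootstrap sketch for two non-nested models with the same number
# of parameters: model A (y ~ log x) vs model B (y ~ sqrt x); all data simulated.
import numpy as np

rng = np.random.default_rng(2)

def fit_rss(y, X):
    """OLS fit of y on X; returns (residual sum of squares, coefficients)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r, beta

n = 80
x = rng.uniform(0.5, 3.0, size=n)
y_obs = 1.0 + 2.0 * np.log(x) + rng.normal(scale=0.3, size=n)   # stands in for the observed data

X_a = np.column_stack([np.ones(n), np.log(x)])    # model A
X_b = np.column_stack([np.ones(n), np.sqrt(x)])   # model B

rss_a, beta_a = fit_rss(y_obs, X_a)
rss_b, _ = fit_rss(y_obs, X_b)
observed_delta = rss_b - rss_a                    # > 0 favours model A

# Simulate from fitted model A and record the fit differences it produces.
sigma_a = np.sqrt(rss_a / n)
deltas_under_a = []
for _ in range(1000):
    y_sim = X_a @ beta_a + rng.normal(scale=sigma_a, size=n)
    d_a, _ = fit_rss(y_sim, X_a)
    d_b, _ = fit_rss(y_sim, X_b)
    deltas_under_a.append(d_b - d_a)

print(observed_delta, np.percentile(deltas_under_a, [5, 50, 95]))
```

Repeating the loop with model B as the generator gives a second reference distribution; seeing which of the two the observed difference is typical of is what makes the comparison flexibility-aware.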

Some people here advocate the use of cross-validation. I have personally used it and don't have anything against it, but the issue with it is that the choice of the sample-cutting rule (leave-one-out, K-fold, etc.) is an unprincipled one.

  • The difference can be viewed purely from a mathematical standpoint -- BIC was derived as an asymptotic expansion of log P(data) where the true model parameters are sampled according to an arbitrary, nowhere-vanishing prior; AIC was similarly derived with the true parameters held fixed. Commented Jan 24, 2011 at 5:57
  • You said that "there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models." Would you be so kind as to point to some references? I'm curious about them! :)
    – deps_stats
    Commented May 3, 2011 at 16:21
  • I do not believe the statements in this post.
    – user9352
    Commented May 2, 2012 at 14:06
  • (-1) Great explanation, but I would like to challenge an assertion. @Dave Kellen Could you please give a reference for the idea that the TRUE model has to be in the set for BIC? I would like to investigate this, since in this book the authors give a convincing proof that this is not the case.
    – gui11aume
    Commented May 27, 2012 at 21:47
  • When you work through the proof of the AIC, for the penalty term to equal the number of linearly independent parameters, the true model must hold. Otherwise it is equal to $\text{Trace}(J^{-1} I)$, where $J$ is the variance of the score and $I$ is the expectation of the Hessian of the log-likelihood, with these expectations evaluated under the truth but the log-likelihoods coming from a mis-specified model. I am unsure why many sources comment that the AIC is independent of the truth. I had this impression, too, until I actually worked through the derivation.
    – Andrew M
    Commented Oct 6, 2017 at 22:05
Answer (score 105)

Though AIC and BIC are both driven by the maximum likelihood estimate and penalize free parameters in an effort to combat overfitting, they do so in ways that result in significantly different behavior. Let's look at one commonly presented version of the methods (which results from stipulating normally distributed errors and other well-behaved assumptions):

  • ${\bf AIC} = -2 \ln\left(\text{likelihood}\right) + 2k$

and

  • ${\bf BIC} = -2\ln\left(\text{likelihood}\right) + k\ln(N)$

where:

  • $k$ = model degrees of freedom
  • $N$ = number of observations

The best model in the group compared is the one that minimizes these scores, in both cases. Clearly, AIC does not depend directly on sample size. Moreover, generally speaking, AIC presents the danger that it might overfit, whereas BIC presents the danger that it might underfit, simply in virtue of how they penalize free parameters ($2k$ in AIC; $\ln(N) \cdot k$ in BIC). Diachronically, as data are introduced and the scores are recalculated, at relatively low $N$ (7 and fewer) BIC is more tolerant of free parameters than AIC, but less tolerant at higher $N$ (as the natural log of $N$ overcomes 2); a small numerical sketch of this crossover follows below.
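For instance, a tiny sketch (mine, with arbitrary sample sizes) of the per-parameter penalties; the crossover happens between $N = 7$ and $N = 8$, since $\ln(N) > 2$ once $N > e^2 \approx 7.39$:

```python
# Compare the per-parameter penalties of AIC (constant 2) and BIC (ln N).
import numpy as np

for n in (4, 7, 8, 20, 100, 1000):
    aic_penalty = 2.0                # per free parameter, independent of N
    bic_penalty = np.log(n)          # per free parameter, grows with N
    stricter = "BIC" if bic_penalty > aic_penalty else "AIC"
    print(f"N={n:4d}: AIC adds {aic_penalty:.2f} per parameter, "
          f"BIC adds {bic_penalty:.2f} -> {stricter} penalizes more")
```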

Additionally, AIC is aimed at finding the best approximating model to the unknown data generating process (via minimizing expected estimated K-L divergence). As such, it fails to converge in probability to the true model (assuming one is present in the group evaluated), whereas BIC does converge as N tends to infinity.

So, as in many methodological questions, which is to be preferred depends upon what you are trying to do, what other methods are available, and whether or not any of the features outlined (convergence, relative tolerance for free parameters, minimizing expected K-L divergence) speak to your goals.

  • Nice answer. A possible alternative take on AIC and BIC is that AIC says that "spurious effects" do not become easier to detect as the sample size increases (or that we don't care if spurious effects enter the model), whereas BIC says that they do. One can see this from the OLS perspective, as in Raftery's 1994 paper: an effect becomes approximately "significant" (i.e. the larger model is preferred) under AIC if its t-statistic satisfies $|t|>\sqrt{2}$, and under BIC if $|t|>\sqrt{\log(n)}$. Commented May 13, 2011 at 14:33
  • Nice answer, +1. I especially like the caveat about whether the true model is actually present in the group evaluated. I would argue that "the true model" is never present. (Box & Draper said that "all models are false, but some are useful", and Burnham & Anderson call this "tapering effect sizes".) Which is why I am unimpressed by the BIC's convergence under unrealistic assumptions and more impressed by the AIC's aiming at the best approximation among the models we actually look at. Commented Nov 24, 2012 at 20:00
Answer (score 94)

My quick explanation is

  • AIC is best for prediction as it is asymptotically equivalent to cross-validation.
  • BIC is best for explanation as it allows consistent estimation of the underlying data-generating process.
  • AIC is equivalent to K-fold cross-validation, BIC is equivalent to leave-one-out cross-validation. Still, both theorems hold only in the case of linear regression.
    – user88
    Commented Jul 24, 2010 at 8:23
  • mbq, it's AIC/LOO (not LKO or K-fold) and I don't think the proof in Stone 1977 relied on linear models. I don't know the details of the BIC result.
    – ars
    Commented Jul 24, 2010 at 11:01
  • ars is correct. It's AIC=LOO and BIC=K-fold, where K is a complicated function of the sample size. Commented Jul 24, 2010 at 12:42
  • Congratulations, you've got me; I was in a hurry writing that and so I made this error; obviously it's as Rob wrote it. Nevertheless it is from Shao 1995, where there was an assumption that the model is linear. I'll analyse Stone, but still I think you, ars, may be right, since LOO in my field has an equally bad reputation as the various *ICs.
    – user88
    Commented Jul 24, 2010 at 20:10
  • BIC is also equivalent to cross-validation, but a "learning" type of cross-validation. For BIC the CV procedure is to predict the first observation with no data (prior information alone), then "learn" from the first observation and predict the second, then learn from the first and second and predict the third, and so on. This is true because of the representation $p(D_1\dots D_n|MI)=p(D_1|MI)\prod_{i=2}^{n}p(D_i|D_1\dots D_{i-1}MI)$. Commented Apr 4, 2012 at 7:31
Answer (score 30)

In my experience, BIC results in serious underfitting and AIC typically performs well, when the goal is to maximize predictive discrimination.

  • Super delayed, but since this is still ranking high on Google, do you mind elaborating on what area you are working in? I'm just curious whether there is some effect of domain we should look at. Commented Apr 9, 2018 at 17:01
  • @verybadatthis: clinical biostatistics (just google "Frank Harrell", he has a web presence)
    – Ben Bolker
    Commented Jan 20, 2019 at 22:28
  • That is from the perspective of optimising predictive performance, then. In terms of variable-selection consistency, BIC (or for high-dimensional settings eBIC or mBIC) is preferred... Commented Jul 13, 2023 at 12:04
Answer (score 20)

An informative and accessible "derivation" of AIC and BIC by Brian Ripley can be found here: http://www.stats.ox.ac.uk/~ripley/Nelder80.pdf

Ripley provides some remarks on the assumptions behind the mathematical results. Contrary to what some of the other answers indicate, Ripley emphasizes that AIC is based on assuming that the model is true. If the model is not true, a general computation reveals that the "number of parameters" has to be replaced by a more complicated quantity (sketched below). Some references are given in Ripley's slides. Note, however, that for linear regression (strictly speaking, with a known variance) this generally more complicated quantity simplifies to the number of parameters.
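For reference, in one common treatment (Takeuchi's information criterion, TIC; my notation, not necessarily the one in Ripley's slides) the more complicated quantity is a trace term. With $\hat\theta$ the maximum likelihood estimate under the working model $f$,

$$ \mathrm{TIC} = -2\ln L(\hat\theta) + 2\,\operatorname{tr}\!\bigl(\hat{J}^{-1}\hat{K}\bigr), \qquad \hat{J} = -\frac{1}{n}\sum_{i=1}^{n}\nabla^2_\theta \ln f(y_i\mid\hat\theta), \qquad \hat{K} = \frac{1}{n}\sum_{i=1}^{n}\nabla_\theta \ln f(y_i\mid\hat\theta)\,\nabla_\theta \ln f(y_i\mid\hat\theta)^{\top}. $$

If the model is correctly specified, the information matrix equality gives $J = K$, so $\operatorname{tr}(J^{-1}K) = k$ and the penalty collapses back to AIC's $2k$.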

  • (+1) However, Ripley is wrong on the point where he says that the models must be nested. There is no such constraint in Akaike's original derivation, or, to be clearer, in the derivation that uses the AIC as an estimator of the Kullback-Leibler divergence. In fact, in a paper that I'm working on, I show somewhat "empirically" that the AIC can even be used for model selection of covariance structures (different numbers of parameters, clearly non-nested models). From the thousands of simulations of time series that I ran with different covariance structures, in none of them did the AIC get it wrong...
    – Néstor
    Commented Aug 14, 2012 at 17:06
  • ...if "the correct" model is in fact in the set of models (this, however, also implies that for the models I'm working on, the variance of the estimator is very small... but that's only a technical detail).
    – Néstor
    Commented Aug 14, 2012 at 17:07
  • @Néstor, I agree. The point about the models being nested is strange.
    – NRH
    Commented Aug 16, 2012 at 6:43
  • When selecting covariance structures for longitudinal data (mixed-effects models or generalized least squares), AIC can easily find the wrong structure if there are more than 3 candidate structures, and in that case you will have to use the bootstrap or other means to adjust for the model uncertainty caused by using AIC to select the structure. Commented Apr 4, 2016 at 12:40
Answer (score 10)

From what I can tell, there isn't much difference between AIC and BIC. They are both mathematically convenient approximations one can make in order to efficiently compare models. If they give you different "best" models, it probably means you have high model uncertainty, which is more important to worry about than whether you should use AIC or BIC. I personally like BIC better because it asks more (less) of a model if it has more (less) data to fit its parameters - kind of like a teacher asking for a higher (lower) standard of performance if their student has more (less) time to learn about the subject. To me this just seems like the intuitive thing to do. But then I am certain there also exist equally intuitive and compelling arguments for AIC as well, given its simple form.

Now, any time you make an approximation, there will surely be some conditions under which that approximation is rubbish. This can certainly be seen for AIC, where there exist many "adjustments" (AICc) to account for certain conditions which make the original approximation bad. This is also present for BIC, because various other more exact (but still efficient) methods exist, such as fully Laplace approximations to mixtures of Zellner's g-priors (BIC is an approximation to the Laplace approximation method for integrals).

One place where they are both crap is when you have substantial prior information about the parameters within any given model. AIC and BIC unnecessarily penalise models where parameters are partially known compared to models which require parameters to be estimated from the data.

One thing I think is important to note is that BIC does not assume a "true" model a) exists, or b) is contained in the model set. BIC is simply an approximation to an integrated likelihood $P(D|M,A)$ ($D$=Data, $M$=model, $A$=assumptions). Only by multiplying by a prior probability and then normalising can you get $P(M|D,A)$. BIC simply represents how likely the data were if the proposition implied by the symbol $M$ is true. So from a logical viewpoint, any propositions that would lead one to the same BIC approximation are equally supported by the data. So if I state $M$ and $A$ to be the propositions

$$\begin{array}{ll} M_{i}: & \text{the ith model is the best description of the data} \\ A: & \text{out of the set of K models being considered, one of them is the best} \end{array}$$

and then continue to assign the same probability models (same parameters, same data, same approximations, etc.), I will get the same set of BIC values. It is only by attaching some sort of unique meaning to the logical letter "M" that one gets drawn into irrelevant questions about "the true model" (echoes of "the true religion"). The only thing that "defines" $M$ is the mathematical equations which use it in their calculations, and this hardly ever singles out one and only one definition. I could equally put in a prediction proposition about $M$ ("the ith model will give the best predictions"). I personally can't see how this would change any of the likelihoods, and hence how good or bad BIC will be (the same goes for AIC, although AIC is based on a different derivation).
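For reference, the standard route from the integrated likelihood to BIC is a Laplace approximation around the maximum likelihood estimate $\hat{\theta}$ (a sketch in my notation, with $k$ parameters, $n$ observations, and $\hat{I}$ the average observed information):

$$ P(D|M,A) = \int P(D|\theta,M,A)\,P(\theta|M,A)\,d\theta \;\approx\; P(D|\hat{\theta},M,A)\,\left(\frac{2\pi}{n}\right)^{k/2}|\hat{I}|^{-1/2}\,P(\hat{\theta}|M,A), $$

so that

$$ -2\ln P(D|M,A) \;=\; -2\ln P(D|\hat{\theta},M,A) + k\ln n + O(1) = \mathrm{BIC} + O(1), $$

dropping the terms that stay bounded as $n$ grows. Nothing in this argument requires that any of the propositions $M$ be "true"; it only requires that the integrand be peaked enough for the approximation to be reasonable, which is in line with the point above.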

And besides, what is wrong with the statement "If the true model is in the set I am considering, then there is a 57% probability that it is model B"? It seems reasonable enough to me, or you could go with the more "soft" version, "there is a 57% probability that model B is the best out of the set being considered".

One last comment: I think you will find about as many opinions about AIC/BIC as there are people who know about them.

Answer (score 10)

Indeed, the only difference is that BIC is AIC extended to take the number of objects (samples) into account. I would say that while both are quite weak (in comparison to, for instance, cross-validation), it is better to use AIC, since more people will be familiar with the abbreviation -- indeed, I have never seen a paper or a program where BIC would be used (still, I admit that I'm biased towards problems where such criteria simply don't work).

Edit: AIC and BIC are equivalent to cross-validation provided two important assumptions hold -- the ones under which they are defined, i.e. that the model is fitted by maximum likelihood and that you are only interested in the model's performance on the training data. In the case of collapsing some data into some kind of consensus they are perfectly OK.
In the case of making a prediction machine for some real-world problem the first is false, since your training set represents only a scrap of information about the problem you are dealing with, so you just can't optimize your model; the second is false, because you expect your model to handle new data for which you can't even expect the training set to be representative. And to this end CV was invented: to simulate the behavior of the model when confronted with independent data. In the case of model selection, CV gives you not only an approximation of the quality but also the distribution of that approximation, so it has the great advantage that it can say "I don't know; whatever new data come, either of them can be better."

  • Does that mean that for certain sample sizes BIC may be less stringent than AIC? Commented Jul 23, 2010 at 21:36
  • "Stringent" is not the best word here; rather, more tolerant of parameters. Still, yup, for the common definitions (with the natural log) it happens for 7 and fewer objects.
    – user88
    Commented Jul 23, 2010 at 22:13
  • AIC is asymptotically equivalent to cross-validation. Commented Jul 24, 2010 at 1:47
  • @mbq - I don't see how cross-validation overcomes the "un-representativeness" problem. If your training data are un-representative of the data you will receive in the future, you can cross-validate all you want, but it will be unrepresentative of the "generalisation error" that you are actually going to be facing (as "the true" new data are not represented by the non-modeled part of the training data). Getting a representative data set is vital if you are to make good predictions. Commented May 13, 2011 at 14:21
  • @mbq - my point is that you seem to "gently reject" IC-based selection based on an alternative which doesn't fix the problem. Cross-validation is good (although is the computation worth it?), but un-representative data can't be dealt with using a data-driven process. At least not reliably. You need to have prior information which tells you how it is un-representative (or, more generally, what logical connections the "un-representative" data have to the actual future data you will observe). Commented May 13, 2011 at 17:12
Answer (score 7)

As you mentioned, AIC and BIC are methods to penalize models for having more regressor variables. A penalty function is used in these methods, which is a function of the number of parameters in the model.

  • When applying AIC, the penalty function is z(p) = 2 p.

  • When applying BIC, the penalty function is z(p) = p ln(n), which is based on interpreting the penalty as deriving from prior information (hence the name Bayesian Information Criterion).

When $n$ is large the two criteria will produce quite different results: BIC applies a much larger penalty for complex models, and hence will lead to simpler models than AIC. However, as stated in the Wikipedia article on BIC:

it should be noted that in many applications..., BIC simply reduces to maximum likelihood selection because the number of parameters is equal for the models of interest.

  • Note that AIC is also equivalent to ML when the dimension doesn't change. Your answer makes it seem like this is only true for BIC. Commented May 13, 2011 at 12:10
Answer (score 7)

AIC and BIC are information criteria for comparing models. Each tries to balance model fit and parsimony and each penalizes differently for number of parameters.

AIC is the Akaike information criterion; the formula is $$\text{AIC}= 2k - 2\ln(L)$$ where $k$ is the number of parameters and $L$ is the maximized likelihood; with this formula, smaller is better. (I recall that some programs output the opposite, $2\ln(L) - 2k$, but I don't remember the details.)

BIC is the Bayesian information criterion; the formula is $$\text{BIC} = k \ln(n) - 2\ln(L)$$ and it favors more parsimonious models than AIC.

I haven't heard of KIC.

  • I haven't heard of KIC either, but for AIC and BIC have a look at the linked question, or search for AIC: stats.stackexchange.com/q/577/442
    – Henrik
    Commented Sep 16, 2011 at 10:30
  • (This reply was merged from a duplicate question that also asked for an interpretation of "KIC".)
    – whuber
    Commented Sep 16, 2011 at 17:49
  • The models don't need to be nested to be compared with AIC or BIC.
    – Macro
    Commented Apr 5, 2012 at 13:03
Answer (score 6)

AIC should rarely be used, as it is really only valid asymptotically. It is almost always better to use AICc (AIC with a correction for finite sample size). AIC tends to overparameterize: that problem is greatly lessened with AICc. The main exception to using AICc is when the underlying distributions are heavily leptokurtic. For more on this, see the book Model Selection by Burnham & Anderson.
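For concreteness, a minimal sketch (mine, not from the book) of the standard small-sample correction $\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}$; the numbers below are purely illustrative:

```python
# AICc adds a correction that matters when n is small relative to k and
# vanishes as n grows, so AICc converges to AIC for large samples.

def aicc(loglik, k, n):
    """AICc = AIC + 2k(k+1)/(n - k - 1); requires n > k + 1."""
    aic = -2 * loglik + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

print(aicc(loglik=-50.0, k=5, n=20))    # ~114.3: sizeable correction
print(aicc(loglik=-50.0, k=5, n=2000))  # ~110.0: essentially plain AIC
```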

  • So, what you are saying is that AIC doesn't punish models sufficiently for their parameters, so using it as a criterion may lead to overparametrization, and you recommend the use of AICc instead. To put this back in the context of my initial question: since BIC is already more stringent than AIC, is there a reason to use AICc over BIC? Commented Jan 25, 2011 at 5:41
  • What do you mean by "AIC is valid asymptotically"? As pointed out by John Taylor, AIC is inconsistent. I think his comments contrasting AIC with BIC are the best ones given. I do not see the two as being the same as cross-validation. They all have the nice property that they usually peak at a model with fewer than the maximum number of variables. But they can all pick different models. Commented May 6, 2012 at 1:05
Answer (score 5)

Very briefly:

  • AIC approximately minimizes the prediction error (in terms of expected Kullback-Leibler (KL) divergence, which measures the discrepancy between the true model and the estimated model) and is asymptotically equivalent to leave-one-out cross-validation (LOOCV) (Stone 1977). It is not consistent, though, which means that even with a very large amount of data ($n$ going to infinity), and even if the true model is among the candidate models, the probability of selecting the true model based on the AIC criterion would not approach 1. Instead, it would retain too many features.
  • BIC is an approximation to the integrated marginal likelihood $P(D|M,A)$ ($D=\textrm{Data}$, $M=\textrm{model}$, $A=\textrm{assumptions}$), which under a flat prior is equivalent to seeking the model that maximizes $P(M|D,A)$. Its advantage is that it is consistent, which means that with a very large amount of data ($n$ going to infinity), and if the true model is among the candidate models, the probability of selecting the true model based on the BIC criterion approaches 1. This comes at a slight cost to prediction performance, though, if $n$ is small. BIC is also equivalent to leave-$k$-out cross-validation (LKOCV) where $k=n[1−1/(\log(n)−1)]$, with $n=$ sample size (Shao 1997); a small numerical illustration of this $k$ follows below. There are many different versions of the BIC, though, which come down to making different approximations of the marginal likelihood or assuming different priors. E.g., instead of using a prior that is uniform over all possible models as in the original BIC, EBIC uses a prior that is uniform over models of fixed size (Chen & Chen 2008), whereas BICq uses a Bernoulli distribution specifying the prior probability for each parameter to be included.
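A quick look (my own illustration, with arbitrary sample sizes) at how large that leave-$k$-out hold-out is according to the formula quoted above:

```python
# k = n * (1 - 1/(log(n) - 1)): the leave-k-out size quoted from Shao (1997).
import numpy as np

for n in (50, 100, 1000, 10_000, 1_000_000):
    k = n * (1 - 1 / (np.log(n) - 1))
    print(f"n={n:>9,}: leave out k ≈ {k:,.0f} ({k / n:.0%} of the data)")
```

The held-out fraction grows with $n$, which fits the general point that consistent selection requires the validation portion to dominate asymptotically.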

Note that within the context of L0-penalized GLMs (where you penalize the log-likelihood of your model by lambda times the number of nonzero coefficients, i.e. the L0-norm of your model coefficients) you can optimize the AIC or BIC (or eBIC, mBIC, GIC) objectives directly and pre-pick an appropriate level of regularisation a priori, which is what is done in the l0ara R package and my own L0glm package, which is in development (see here for some benchmarks); a brute-force illustration of optimizing an information criterion directly is sketched below. To me this makes more sense than what is done, e.g., in the case of LASSO or elastic net regression in glmnet, where the optimization of one objective (LASSO or elastic net regression) is followed by the tuning of the regularization parameter(s) based on some other objective (which e.g. minimizes cross-validation prediction error, AIC or BIC).
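As a toy stand-in for those packages (this is not the l0ara or L0glm algorithm, just an exhaustive search over small Gaussian linear models that makes the "optimize the information criterion directly" idea concrete):

```python
# Exhaustive best-subset selection that minimizes BIC directly; feasible only
# for a handful of candidate predictors, purely for illustration.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)   # only x0 and x2 matter

def bic_of_subset(subset):
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = Xs.shape[1] + 1                          # coefficients + error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return -2 * loglik + k * np.log(n)

best = min((s for r in range(p + 1) for s in combinations(range(p), r)),
           key=bic_of_subset)
print("subset chosen by BIC:", best)             # typically (0, 2)
```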

Syed (2011) on page 10 notes "We can also try to gain an intuitive understanding of the asymptotic equivalence by noting that the AIC minimizes the Kullback-Leibler divergence between the approximate model and the true model. The Kullback-Leibler divergence is not a distance measure between distributions, but really a measure of the information loss when the approximate model is used to model the ground reality. Leave-one-out cross validation uses a maximal amount of data for training to make a prediction for one observation. That is, $n −1$ observations as stand-ins for the approximate model relative to the single observation representing “reality”. We can think of this as learning the maximal amount of information that can be gained from the data in estimating loss. Given independent and identically distributed observations, performing this over $n$ possible validation sets leads to an asymptotically unbiased estimate."

Note that the LOOCV error can also be calculated analytically from the residuals and the diagonal of the hat matrix, without having to actually carry out any cross-validation. This would always be an alternative to the AIC, as the AIC just provides an asymptotic approximation of the LOOCV error.
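A small sketch of that shortcut for ordinary least squares (the identity $e_i/(1-h_{ii})$ for the leave-one-out residual is exact in this case; the data are simulated purely for illustration):

```python
# PRESS: the exact LOOCV sum of squared errors from a single OLS fit.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = y - H @ y                              # ordinary residuals
loo_resid = e / (1 - np.diag(H))           # leave-one-out residuals, no refitting
press = np.sum(loo_resid ** 2)             # sum of squared LOO prediction errors

# Brute-force check: actually refit n times and predict the held-out point.
press_check = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    press_check += (y[i] - X[i] @ b) ** 2

print(press, press_check)                  # agree up to floating-point error
```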

References

Stone M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society Series B. 39, 44–7.

Shao J. (1997) An asymptotic theory for linear model selection. Statistica Sinica 7, 221-242.

  • A much better understanding than a lot of the other posts here. If people are interested in reading more about this (and an alternative that is likely superior to AIC/BIC) I'd suggest reading this paper by Andrew Gelman et al.: stat.columbia.edu/~gelman/research/published/… Commented Mar 9, 2020 at 16:26
Answer (score 2)
  • AIC and BIC are both penalized-likelihood criteria. They are usually written in the form $-2\ln L + kp$, where $L$ is the likelihood function, $p$ is the number of parameters in the model, and $k$ is 2 for AIC and $\ln(n)$ for BIC.
  • AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth.
  • BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model.
  • Both criteria are based on various assumptions and asymptotic approximations.
  • AIC always has a chance of choosing too big a model, regardless of n. BIC has very little chance of choosing too big a model if n is sufficient, but it has a larger chance than AIC, for any given n, of choosing too small a model.

References:

  1. https://www.youtube.com/watch?v=75BOMuXBSPI
  2. https://www.methodology.psu.edu/resources/AIC-vs-BIC/
Answer (score 2)

From Shmueli (2010, Statistical Science), on the difference between explanatory and predictive modelling:

A popular predictive metric is the in-sample Akaike Information Criterion (AIC). Akaike derived the AIC from a predictive viewpoint, where the model is not intended to accurately infer the “true distribution”, but rather to predict future data as accurately as possible (see, e.g., Berk, 2008; Konishi and Kitagawa, 2007). Some researchers distinguish between AIC and the Bayesian information criterion (BIC) on this ground. Sober (2002) concluded that AIC measures predictive accuracy while BIC measures goodness of fit:

In a sense, the AIC and the BIC provide estimates of different things; yet, they almost always are thought to be in competition. If the question of which estimator is better is to make sense, we must decide whether the average likelihood of a family [=BIC] or its predictive accuracy [=AIC] is what we want to estimate

Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3): 289–310. https://doi.org/10.1214/10-STS330.

