
In the scikit-learn documentation, I found the following comment about AIC:

Information-criterion based model selection is very fast, but it relies on a proper estimation of degrees of freedom. The criteria are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are actually generated by this model. They also tend to break when the problem is badly conditioned (more features than samples).
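(For context, I believe this passage refers to IC-based estimators such as LassoLarsIC; here is a minimal example of the kind of selection it describes, on synthetic data of my own:)

```python
# Minimal example (synthetic data) of the IC-based selection the quote
# describes: LassoLarsIC fits once along the LARS path and picks the
# regularization strength by AIC.
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(100)

model = LassoLarsIC(criterion="aic").fit(X, y)
print("alpha chosen by AIC:", model.alpha_)
```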

My questions are:

  1. Why would AIC break when we have more features than samples?
  2. Why are AIC and BIC commonly used in forecasting models like ARIMA?
  • "assume the model is correct" does not belong there. Commented May 10, 2021 at 5:19
  • Here is why information criteria may be preferred to cross validation in time series: "AIC versus cross validation in time series: the small sample case". Commented May 10, 2021 at 7:54
  • @RichardHardy AIC requires that model specification (the functional form) is correct. This is in fact what is fixed in TIC: ssc.wisc.edu/~bhansen/718/NonParametrics14.pdf Commented Sep 26, 2021 at 14:18
  • @CagdasOzgenc, as far as I remember this is not the case. In fact, I would say the lack of such an unrealistic requirement is one of the hallmarks of AIC. Perhaps Hansen is discussing a special case or a special use of AIC? Commented Sep 26, 2021 at 14:49
  • @RichardHardy Very few people truly understood the derivation, in my opinion. The terms in "true risk" and "empirical risk" don't really cancel out when the truth is not in the search space. See faculty.washington.edu/yenchic/19A_stat535/Lec7_model.pdf and ejwagenmakers.com/2003/elephant.pdf, pp. 582. Commented Sep 26, 2021 at 16:40

3 Answers


What alternatives do we have in model selection for prediction?

  • The main ones are cross validation and information criteria.

Why are the latter attractive in the time series setting?

  • Information criteria are less computationally intensive. You only need to fit the model once to calculate an information criterion, in contrast to most applications of cross validation (see the short sketch after this list). Computational efficiency is extra desirable in the time series setting, as many basic time series models (ARMA, GARCH and the like) tend to be rather computationally demanding (more so than, say, linear regression).
  • Information criteria are also more effective in utilizing the data, as the model is estimated on the entire sample rather than just a training subset. The latter is important in small data sets* and especially in time series settings. In small data sets, we do not want to leave out too much data for testing, as then there is very little data left for training/estimation. Leave-one-out cross validation (LOOCV), which holds out only a single observation at a time, works well in a cross-sectional setting; however, it is often inapplicable in the time series setting due to the mutual dependence of the observations. Other types of validation that are applicable are much more data-costly. For more details, see "AIC versus cross validation in time series: the small sample case".

*Information criteria have an asymptotic justification, so their use is not unproblematic in small samples. Nevertheless, a more efficient use of the data is more desirable than a less efficient use. By using the entire sample for estimation, you are closer to asymptotics than by using, say, 2/3 of the sample.
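A concrete toy illustration of the single-fit point above (my own sketch, with synthetic data, not part of the original answer): each candidate order is estimated exactly once on the full sample, and the criterion is read off the fitted model, with no refitting on folds.

```python
# Sketch: AIC-based order selection for an ARIMA model with statsmodels.
# Each candidate model is estimated once on the full sample; no folds.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = rng.standard_normal(200).cumsum()  # toy random-walk series

aic = {}
for p in range(3):
    for q in range(3):
        aic[(p, q)] = ARIMA(y, order=(p, 1, q)).fit().aic

best = min(aic, key=aic.get)
print("order (p, q) with lowest AIC:", best)
```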


First off, as Richard Hardy comments, information criteria do not assume we have the true model. Quite to the contrary. For instance, AIC estimates the Kullback-Leibler distance between the proposed model and the true data generating process (up to an offset), and picking the model with minimal AIC amounts to choosing the one with the smallest distance to the true DGP. See Burnham & Anderson (2002, Model selection and multi-model inference: a practical information-theoretic approach) or Burnham & Anderson (2004, Sociological Methods & Research) for an accessible treatment. They also go into the justification for BIC.
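To state the connection explicitly (a standard restatement, with $\hat L$ the maximized likelihood and $k$ the number of estimated parameters):

$$\mathrm{AIC} = -2\log\hat{L} + 2k,$$

and, up to an additive constant that is the same for every candidate model, $\mathrm{AIC}/2$ is an asymptotically unbiased estimate of the expected Kullback-Leibler divergence between the true DGP and the fitted model; minimizing AIC across candidates therefore targets the KL-closest one.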

Information criteria break down with overparameterized models, but that's not really a problem of the ICs. Instead, it's that every overparameterized model that is not regularized breaks down, and that "normal" ICs don't work with regularized models. (I believe there are IC variants that apply to regularized models, but am not an expert in this.)
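A toy numerical illustration of that breakdown (my own addition, synthetic data): with more features than samples, an unregularized least-squares fit interpolates the data, the residual variance collapses to zero, and the likelihood term in the AIC degenerates, so the criterion can no longer rank models.

```python
# Why a plain AIC breaks when p > n: the unregularized fit memorizes the
# sample, so the Gaussian log-likelihood (and hence AIC) is degenerate.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                       # more features than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # pure noise target

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # minimum-norm interpolator
sigma2 = np.mean((y - X @ beta) ** 2)        # ~0 up to rounding error

# The log-likelihood of the Gaussian model explodes as sigma2 -> 0 ...
loglik = -0.5 * n * (np.log(2 * np.pi * max(sigma2, 1e-300)) + 1)
# ... so AIC -> -infinity regardless of how many junk features we add.
print("residual variance:", sigma2, "AIC:", -2 * loglik + 2 * p)
```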

ICs are used in forecasting model selection because of the above argument about distances to true DGPs. A related argument is that the AIC asymptotically estimates a monotone function of the prediction error (section 4.3.1 in Lütkepohl, 2005, New Introduction to Multiple Time Series Analysis, who also goes into other model selection criteria). Also, ICs are not the only tool used: some people prefer using holdout sets, but that means you need more data.
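For reference, the Lütkepohl result reads roughly as follows (sketched from memory, for a $K$-dimensional VAR of order $m$ estimated on $T$ observations, with $\tilde{\Sigma}_u(m)$ the residual covariance estimate):

$$\mathrm{AIC}(m) = \log\det\tilde{\Sigma}_u(m) + \frac{2mK^2}{T}, \qquad \log\mathrm{FPE}(m) \approx \mathrm{AIC}(m) + c_T,$$

where $c_T$ does not depend on $m$, so minimizing AIC is asymptotically equivalent to minimizing the (log) final prediction error, i.e. a monotone function of the one-step prediction error.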

  • Regarding "AIC asymptotically estimates a monotone function of the prediction error", I have a related question: "Equivalence of AIC and LOOCV under mismatched loss functions" (and more related questions linked in there). Commented May 10, 2021 at 8:24
  • I think the primary reason is that there are few if any good alternatives. The real issue is which information criterion you use: AIC, BIC, etc. There are differing opinions on that. Note that some use hold-out data sets (and MAPE, MSE, etc.) to choose which model is best. I do, but that is not considered a statistical approach, I assume. – user54285 Commented May 10, 2021 at 23:21
  • Burnham is wrong. The fact that somebody wrote a book doesn't make them right. AIC requires that model specification (the functional form) is correct. This is in fact what is fixed in TIC: ssc.wisc.edu/~bhansen/718/NonParametrics14.pdf Commented Sep 26, 2021 at 14:19

First of all, sorry, this was supposed to be a comment rather than an answer. The question has already been answered well. I just wanted to add that even though ICs aim at minimizing the distance to the true DGP, they might not always be able to do so. The true DGP is unknown, and there is no best way to identify the model closest to it. However, you can aid the ICs with the autocorrelation and partial autocorrelation functions. Just looking at these plots will give you an idea of what your model should look like in terms of lags. This narrows down your pool of candidate models, and you can then select the one with the lowest IC. In my understanding, ICs look at how well the model fits the distribution of the data but do not incorporate how the data are distributed over time. Incorporating autocorrelation/partial autocorrelation plots helps to bridge the gap. I would love to be corrected if I am wrong.
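A small sketch of the workflow I mean (toy placeholder data; the candidate list is whatever the plots suggest):

```python
# Sketch: use ACF/PACF plots to shortlist lag orders, then let the IC
# choose among the shortlisted candidates.
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = rng.standard_normal(300).cumsum()   # placeholder series

plot_acf(y, lags=20)    # slow decay hints at differencing / AR structure
plot_pacf(y, lags=20)   # a cutoff at lag p hints at an AR(p) term

candidates = [(1, 1, 0), (2, 1, 0), (1, 1, 1)]   # shortlist from the plots
aic = {order: ARIMA(y, order=order).fit().aic for order in candidates}
print("lowest-AIC candidate:", min(aic, key=aic.get))
```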

  • I think you are in fact wrong. ICs are based on the likelihood, and the likelihood accounts for all of the things you mention. Commented Sep 26, 2021 at 14:52
