33
$\begingroup$

I am interested in model selection in a time series setting. For concreteness, suppose I want to select an ARMA model from a pool of ARMA models with different lag orders. The ultimate intent is forecasting.

Model selection can be done by

  1. cross validation,
  2. use of information criteria (AIC, BIC),

among other methods.

Rob J. Hyndman provides a way to do cross validation for time series. For relatively small samples, the sample size used in cross validation may be qualitatively different from the original sample size. For example, if the original sample size is 200 observations, then one could start cross validation by fitting on the first 100 observations and forecasting the 101st, then expanding the window to 101, 102, ..., 199 observations, obtaining 100 one-step cross-validation errors. Clearly, a model that is reasonably parsimonious for 200 observations may be too large for 100 observations, and thus its validation error will be large. Thus cross validation is likely to systematically favour too-parsimonious models. This is an undesirable effect due to the mismatch in sample sizes.
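To make the procedure concrete, here is a minimal sketch of the expanding-window (rolling-origin) evaluation described above, assuming Python with statsmodels; the candidate pool, the data file name and the function name are illustrative, not part of the original question.

```python
# Expanding-window (rolling-origin) cross validation for ARMA order selection:
# fit on the first t observations, forecast observation t+1, and accumulate
# the squared one-step forecast errors. Refitting at every origin is slow but
# mirrors the procedure described above.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def rolling_origin_mse(y, pq, start=100):
    """Mean squared one-step forecast error for an ARMA(p, q) model."""
    p, q = pq
    errors = []
    for t in range(start, len(y)):
        fit = ARIMA(y[:t], order=(p, 0, q)).fit()
        forecast = fit.forecast(steps=1)[0]
        errors.append((y[t] - forecast) ** 2)
    return np.mean(errors)

# Illustrative candidate pool of (p, q) lag orders and data file.
candidate_orders = [(1, 0), (1, 1), (2, 1), (2, 2)]
y = np.loadtxt("series.txt")          # e.g. 200 observations
cv_scores = {pq: rolling_origin_mse(y, pq) for pq in candidate_orders}
best_by_cv = min(cv_scores, key=cv_scores.get)
```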

An alternative to cross validation is using information criteria for model selection. Since I care about forecasting, I would use AIC. Even though AIC is asymptotically equivalent to minimizing the out-of-sample one-step forecast MSE for time series models (according to this post by Rob J. Hyndman), I doubt this is relevant here since the sample sizes I care about are not that large...
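Continuing the illustrative sketch above, the information-criterion route needs only one fit of each candidate on the full sample (again assuming statsmodels; `candidate_orders` and `y` are the hypothetical names from the previous snippet):

```python
# AIC-based selection: fit each candidate ARMA order once on the full sample
# and pick the order with the smallest AIC (a corrected criterion such as
# AICc could be substituted where the library provides it).
aic_scores = {(p, q): ARIMA(y, order=(p, 0, q)).fit().aic
              for (p, q) in candidate_orders}
best_by_aic = min(aic_scores, key=aic_scores.get)
```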

Question: should I choose AIC over time series cross validation for small/medium samples?

A few related questions can be found here, here and here.

$\endgroup$
13
  • 1
    $\begingroup$ I would imagine BIC is also equivalent to a "longer" forecast (m-step-ahead), given its link to leave-k-out cross validation. For 200 observations, though, it probably doesn't make much difference (a penalty of roughly 5p instead of 2p). $\endgroup$ Commented Feb 25, 2015 at 13:11
  • 1
    $\begingroup$ @CagdasOzgenc, I asked Rob J. Hyndman regarding whether cross validation is likely to systematically favour too-parsimonious models in the context given in the OP and got a confirmation, so that is quite encouraging. I mean, the idea I was trying to explain in the chat seems to be valid. $\endgroup$ Commented Feb 26, 2015 at 7:10
  • $\begingroup$ There are theoretical reasons for favoring AIC or BIC: if one starts from likelihood and information theory, then a metric based on those has well-known statistical properties. But often one is dealing with a data set that is not so large. $\endgroup$
    – Analyst
    Commented Jun 15, 2018 at 19:41
  • 3
    $\begingroup$ I've spent a fair amount of time trying to understand AIC. The equivalence in that statement rests on numerous approximations that amount to versions of the CLT. I personally think this makes AIC very questionable for small samples. $\endgroup$
    – meh
    Commented Jun 15, 2018 at 20:39
  • 1
    $\begingroup$ @IsabellaGhement, why should it? There is no reason to restrict ourselves to this particular use of cross validation. This is not to say that cross validation cannot be used for model assessment, of course. $\endgroup$ Commented Dec 8, 2018 at 19:45

4 Answers

7
$\begingroup$

Setting theoretical considerations aside, the Akaike information criterion is just the likelihood penalized by the degrees of freedom. It follows that AIC accounts for the uncertainty in the data (the -2LL term) and assumes that more parameters lead to a higher risk of overfitting (the 2k term). Cross-validation just looks at the test-set performance of the model, with no further assumptions.
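Spelled out (this is the standard definition, not quoted from the answer), the criterion being described is
$$\mathrm{AIC} = -2\ln \hat{L} + 2k,$$
where $\hat{L}$ is the maximized likelihood and $k$ is the number of estimated parameters; the first term rewards fit and the second penalizes model size.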

If you care mostly about making predictions and you can assume that the test set(s) will be reasonably similar to the real-world data, you should go for cross-validation. The possible problem is that when your data set is small, splitting it leaves you with small training and test sets. Less data for training is bad, and less data for the test set makes the cross-validation results more uncertain (see Varoquaux, 2018). If your test sample is insufficient, you may be forced to use AIC, but keep in mind what it measures and what assumptions it makes.

On the other hand, as already mentioned in the comments, the guarantees AIC gives you are asymptotic, and that is not the case with small samples. Small samples may be misleading about the uncertainty in the data as well.

$\endgroup$
3
  • $\begingroup$ Thanks for your answer! Would you have any specific comment regarding the undesirable effect of the much smaller sample size in cross validation due to the time series nature of the data? $\endgroup$ Commented Aug 2, 2019 at 17:02
  • $\begingroup$ You mention this also in your question, and I do not see it as unavoidable. One could think of fitting a state-space model and performing "leave-one-out" cross validation. The omitted value each time might be "predicted" using the Kalman smoother. Would this not be a form of cross validation with a sample size nearly that of the original set? $\endgroup$
    – F. Tusell
    Commented Apr 21, 2020 at 11:34
  • $\begingroup$ @F.Tusell, not sure if you meant to address me here, but I did not get notified as you did not include my name preceded with @. Anyway, what you are proposing is an interesting idea. $\endgroup$ Commented Feb 6, 2021 at 8:01
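Picking up F. Tusell's comment above, here is a rough sketch of how that leave-one-out idea could be approximated, again assuming statsmodels (whose state-space models handle missing observations via the Kalman filter). Note that, for simplicity, it uses the prediction at the omitted point from the filter rather than the two-sided smoothed estimate the comment refers to, and the function, data and order names are illustrative.

```python
# Leave-one-out-style error for an ARMA model in state-space form: mask one
# observation at a time (so it is excluded from parameter estimation), let the
# Kalman filter handle the gap, and compare the model's prediction at that
# point to the held-out value. Refitting n times is expensive for long series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def loo_mse(y, pq):
    p, q = pq
    errors = []
    for t in range(len(y)):
        y_masked = y.astype(float).copy()
        y_masked[t] = np.nan                     # omit observation t
        fit = ARIMA(y_masked, order=(p, 0, q)).fit()
        pred = fit.predict(start=t, end=t)[0]    # prediction at the masked point
        errors.append((y[t] - pred) ** 2)
    return np.mean(errors)
```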
5
$\begingroup$

Hm - if your ultimate goal is to predict, why do you intend to do model selection at all? As far as I know, it is well established both in the "traditional" statistical literature and the machine learning literature that model averaging is superior when it comes to prediction. Put simply, model averaging means that you estimate all plausible models, let them all predict and average their predictions weighted by their relative model evidence.
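One common way to form the evidence-based weights mentioned above is via Akaike weights; a minimal sketch follows (the AIC values and forecasts below are hypothetical placeholders, not taken from the post).

```python
# AIC-weighted model averaging (Akaike weights): each model's forecast is
# weighted by exp(-0.5 * Delta_i), where Delta_i is its AIC difference from
# the best model, and the weights are normalized to sum to one.
import numpy as np

aics = np.array([812.4, 810.1, 814.9])       # hypothetical AIC values
forecasts = np.array([5.2, 5.6, 4.9])        # hypothetical one-step forecasts

deltas = aics - aics.min()
weights = np.exp(-0.5 * deltas)
weights /= weights.sum()
combined_forecast = float(weights @ forecasts)
```

This is essentially the kind of AIC-weighted combination mentioned in the comments below.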

A useful reference to start is https://journals.sagepub.com/doi/10.1177/0049124104268644

They explain this quite simply and refer to the relevant literature.

Hope this helps.

$\endgroup$
2
  • 2
    $\begingroup$ Thanks, this is a good idea. Even so, it may make sense to discard the poorest models from the average, and for that I need estimates of predictive ability for individual models. $\endgroup$ Commented May 11, 2020 at 6:27
  • 2
    $\begingroup$ +1. Shameless piece of self-promotion: I looked at combining exponential smoothing methods for forecasting based on the kind of AIC-weighted combinations Burnham & Anderson propose (Kolassa, 2011, IJF). $\endgroup$ Commented May 10, 2021 at 8:26
0
$\begingroup$

My suggestion is to do both and see. Using AIC is straightforward: the smaller the AIC, the better the model. But one cannot rely on AIC alone and declare that model the best. So, if you have a pool of ARIMA models, take each one, produce forecasts for the existing values, and see which model predicts the existing time series data most closely. Then check the AIC as well and, considering both, come to a good choice. There are no hard and fast rules; just go for the model that predicts best.

$\endgroup$
1
  • 1
    $\begingroup$ Thank you for your answer! I am looking for a principled way to select between the different methods of model selection. While you are right that "there are no hard and fast rules", we need clear guidelines under hypothetical ideal conditions to assist us in the messy real-world situations. So while I generally agree with your standpoint, I do not find your answer particularly helpful. $\endgroup$ Commented Jan 30, 2019 at 10:19
0
$\begingroup$

Hyndman & Athanasopoulos, "Forecasting: Principles and Practice" (3rd edition), suggest the AICc for short time series. Section 13.7 states:

However, with short series, there is not enough data to allow some observations to be withheld for testing purposes, and even time series cross validation can be difficult to apply. The AICc is particularly useful here, because it is a proxy for the one-step forecast out-of-sample MSE. Choosing the model with the minimum AICc value allows both the number of parameters and the amount of noise to be taken into account.
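For reference (the formula is standard and not part of the quoted passage), the AICc adds a small-sample correction to the AIC:
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1},$$
where $k$ is the number of estimated parameters and $n$ the sample size; the correction grows as $n$ approaches $k$, which is why it matters for short series.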

$\endgroup$
