
Is the Cross Validation Error more "informative" than AIC, BIC and the Likelihood Ratio Test?

As far as I understand:

  • The Likelihood Ratio Test is used to determine: given some data, is one fitted statistical model (i.e. a specific set of parameter estimates) significantly more "likely" to have generated the observed data than some alternative statistical model? (In practice, this alternative is often a nested version of the same model in which some or all of the parameters are restricted to 0.) This is usually formulated as a hypothesis test, with the likelihood ratio test statistic asymptotically following a Chi-Square distribution (Wilks' chi-square approximation of the likelihood ratio).

  • The AIC and BIC both weigh the performance of a fitted statistical model against the "complexity" of that model. Both loosely convey the same idea as "Occam's Razor" (a concept in philosophy): given the choice between a simpler model (a model with fewer parameters) and a complex model (a model with more parameters), provided both models perform equally well, the simpler model is preferable. This also ties into the idea of overfitting (traditionally, models with good in-sample performance but many parameters were thought likely to overfit and predict new data poorly, i.e. the Bias-Variance Tradeoff). "Better" models are said to have lower values of AIC and BIC - yet there is no absolute threshold on "how low", only a relative comparison (e.g. Model 1 AIC = 234,841 and Model 2 AIC = 100,089: is Model 2 significantly better than Model 1, or are both models nowhere near acceptable?). A minimal numerical sketch of the Likelihood Ratio Test and of AIC/BIC follows below.
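Here is a minimal sketch of both ideas, assuming simulated data and a pair of nested ordinary least squares models (the variables and coefficients are invented purely for illustration):

```python
# Hypothetical example: likelihood ratio test and AIC/BIC for two nested
# linear models fit to simulated data (all numbers are made up).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # X[:, 2] is irrelevant

full = sm.OLS(y, sm.add_constant(X)).fit()             # intercept + 3 slopes
reduced = sm.OLS(y, sm.add_constant(X[:, :1])).fit()   # intercept + 1 slope

# Likelihood ratio test: 2*(llf_full - llf_reduced) is asymptotically
# chi-square with df equal to the number of restricted parameters (Wilks)
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.4g}")

# AIC/BIC trade off fit against complexity; lower is "better", but only
# relative to other models fit to the same data
print(f"AIC: full = {full.aic:.1f}, reduced = {reduced.aic:.1f}")
print(f"BIC: full = {full.bic:.1f}, reduced = {reduced.bic:.1f}")
```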


On the other hand, Cross Validation (e.g. K-Fold Cross Validation, Leave-One-Out Cross Validation) is said to show how badly a statistical model overfits the available data - and if the model overfits, (on a heuristic level) it is likely to predict new data poorly. Cross Validation repeatedly fits the same type of model to subsets of the available data, records the model error on the corresponding held-out portion, and averages that error (i.e. performance, e.g. MSE, F-Score, Accuracy) over all folds to give the Cross Validation Error (a rough sketch is below). Thus, we can obtain similar insights about our statistical model from Cross Validation as we can from the Likelihood Ratio Test and the AIC/BIC.
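A rough sketch of the k-fold procedure, assuming a linear regression on simulated data (the estimator and the error metric are placeholders for whatever model and metric you actually use):

```python
# Rough sketch of k-fold cross-validation on simulated data; the model
# and the metric (MSE) are stand-ins for whatever you are actually fitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(size=200)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on k-1 folds
    preds = model.predict(X[test_idx])                          # predict the held-out fold
    fold_errors.append(mean_squared_error(y[test_idx], preds))  # error on the held-out fold

cv_error = np.mean(fold_errors)  # the cross-validation error
print(f"5-fold CV MSE = {cv_error:.3f}")
```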


This leads me to my question: Is the Cross Validation Error more "informative" than AIC, BIC and the Likelihood Ratio Test?

Here are my general thoughts:

1) When you have large datasets and statistical models with many parameters (e.g. Deep Neural Networks), the Cross Validation procedure can be very computationally expensive (e.g. thousands of model fits might be required). 50 years ago, when computers were weaker, it might not have been possible to perform Cross Validation on such statistical models - whereas the Likelihood Ratio Test, AIC and BIC are far less computationally expensive. Thus, originally, researchers might have favored the Likelihood Ratio Test, AIC and BIC over Cross Validation.

2) AIC and BIC are usually interpreted only in relative terms, e.g. Model 1 AIC = 234,841 and Model 2 AIC = 100,089: is Model 2 significantly better than Model 1, or are both models nowhere near acceptable? On the other hand, you can perform Cross Validation on a simple model vs. a complex model (e.g. a regression model with 3 parameters vs. 5 parameters) and measure the Cross Validation Error of both models. In essence, this should allow you to compare model complexity against model performance - similar to the information that AIC and BIC provide (a small numerical comparison is sketched below).
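A small comparison along these lines, assuming simulated data in which only two of four candidate regressors matter (the printed figures are illustrative, not from any real study):

```python
# Hypothetical comparison of a simpler vs. a more complex regression on the
# same simulated data: AIC from statsmodels, CV error from scikit-learn.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 4))
y = 0.5 + X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)  # only the first two regressors matter

candidates = {"simple (3 params)": X[:, :2], "complex (5 params)": X}
for name, Xd in candidates.items():
    aic = sm.OLS(y, sm.add_constant(Xd)).fit().aic
    cv_mse = -cross_val_score(LinearRegression(), Xd, y,
                              scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{name:20s}  AIC = {aic:8.1f}   5-fold CV MSE = {cv_mse:.3f}")
```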

3) When it comes to inference-based models, it becomes conceptually difficult to implement Cross Validation.


For example, suppose that instead of fitting a regression model to your data, you decide to fit an entire probability distribution to your data. Instead of the model parameters being regression coefficients $\beta_0, \beta_1, \beta_2$, etc., the model parameters are now the means, variances and covariances of the different variables (e.g. a multivariate normal distribution):


Probability distributions are more informative than regression models: suppose you want to predict the age of a giraffe using its weight and height.

  • A regression model would only allow you to predict age for different combinations of weight and height, and to provide confidence intervals on the parameter estimates for weight and height.

  • A probability distribution (i.e. an inference-based model) would also allow you to predict the age for different combinations of weight and height - but in addition it would allow you to answer more in-depth questions, such as "what is the most probable weight of a giraffe that is 20 years old and 15 ft tall?" (the expectation of the conditional distribution - available in closed form for a multivariate normal, or via MCMC sampling more generally) or "what is the probability of observing a giraffe that weighs less than 500 lbs?" (the marginal probability distribution). A sketch of both queries follows this list.
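A sketch of both queries, assuming a multivariate normal over (age, height, weight) whose mean vector and covariance matrix are invented numbers purely for illustration:

```python
# Sketch of conditional and marginal queries for a fitted multivariate normal
# over (age, height, weight). The mean and covariance below are invented.
import numpy as np
from scipy import stats

mu = np.array([12.0, 14.0, 1800.0])            # [age (yr), height (ft), weight (lb)]
Sigma = np.array([[9.0,   1.5,   600.0],
                  [1.5,   1.0,   150.0],
                  [600.0, 150.0, 90000.0]])

# Conditional mean of weight given age = 20 and height = 15 (closed form for
# a multivariate normal): mu_w + S_wo S_oo^{-1} (x_o - mu_o)
obs_idx, target_idx = [0, 1], [2]
x_obs = np.array([20.0, 15.0])
S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
S_wo = Sigma[np.ix_(target_idx, obs_idx)]
cond_mean = mu[target_idx] + S_wo @ np.linalg.solve(S_oo, x_obs - mu[obs_idx])
print(f"E[weight | age=20, height=15] ~ {cond_mean[0]:.0f} lb")

# Marginal probability that weight < 500 lb (the marginal of an MVN is normal)
p = stats.norm.cdf(500.0, loc=mu[2], scale=np.sqrt(Sigma[2, 2]))
print(f"P(weight < 500 lb) ~ {p:.3f}")
```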

I suppose that, in theory, a Cross Validation procedure could be devised for measuring the error of probability distribution models (fit a probability distribution on 70% of the data, and for each measurement in the test set (30%) see how close the expected value of the conditional distribution is to the true measurement ... then repeat "k" times). But generally, the Likelihood Ratio Test (or other likelihood-based criteria) is used more often to assess the fit of a probability distribution to some data.
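A minimal sketch of one such fold, assuming the fitted distribution is a multivariate normal estimated from simulated data; it scores the held-out 30% both by the conditional-mean prediction error and by the held-out log-likelihood:

```python
# One fold of the scheme sketched above: fit a multivariate normal on a
# training split, then score it on held-out data. All data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mu = np.array([0.0, 1.0, 2.0])
A = rng.normal(size=(3, 3))
true_Sigma = A @ A.T + np.eye(3)
data = rng.multivariate_normal(true_mu, true_Sigma, size=500)

train, test = data[:350], data[350:]                      # 70% / 30% split
mu_hat, Sigma_hat = train.mean(axis=0), np.cov(train, rowvar=False)

# Predict the 3rd variable from the first two via the conditional mean
S_oo = Sigma_hat[:2, :2]
S_to = Sigma_hat[2:, :2]
pred = mu_hat[2] + (S_to @ np.linalg.solve(S_oo, (test[:, :2] - mu_hat[:2]).T)).ravel()
mse = np.mean((test[:, 2] - pred) ** 2)

# Held-out log-likelihood of the fitted distribution
heldout_ll = stats.multivariate_normal(mu_hat, Sigma_hat).logpdf(test).mean()
print(f"held-out conditional-mean MSE = {mse:.3f}, mean held-out log-lik = {heldout_ll:.3f}")
```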

Are my conclusions somewhat correct? Are there instances where Cross Validation proves to be more informative than AIC, BIC and the Likelihood Ratio Test (and vice versa)?

Thanks!


Note: I have never come across any performance metric (e.g. AIC, BIC, Likelihood Ratio Test) which allows you to determine the error of statistical models such as Gaussian Process Regression. I always imagined that perhaps manually creating a Cross Validation loop would be the only way to measure the error/overfit of Gaussian Process models.

  • Some comments: 1) AIC does have an interpretation as twice the negative expected log-likelihood (as mentioned e.g. here). Hence, it is not only a relative measure. Moreover, it is in a sense a measure of error. 2) Cross validation used to be computationally infeasible for some complex models, but AIC/BIC are algebraically infeasible since the models' likelihood and degrees of freedom can be very hard to obtain. – Commented Nov 8, 2021 at 6:38
  • @Richard Hardy: thank you for your reply! I did not know that AIC/BIC can be algebraically infeasible for complex models! I guess this leads me to a point: in general, are there any instances where AIC/BIC can prove to be more "useful" than cross validation? Thank you! – stats_noob, Commented Nov 8, 2021 at 6:48
  • @stats555 This isn't an ideal point as I would usually try to do CV in addition to AIC/BIC, but sometimes the additional data to do CV is not available. This is an unfortunate situation that in my opinion should be avoided when possible, but it is an option to compute AIC/BIC in the absence of CV in order to perform model selection. – Galen, Commented Nov 8, 2021 at 6:51
  • AIC/BIC are more useful because they are computationally cheaper; there is no need to refit the data $k$ times as in $k$-fold CV. This is relevant in numerous problems. Also, time-series CV in the form of rolling windows uses data less efficiently, so AIC and BIC have an advantage there; see e.g. this. – Commented Nov 8, 2021 at 7:10
  • @Richard Hardy: I routinely accept answers on stackoverflow when there is code involved, because I can check and see if the code provided runs. When it comes to math, I am in no position to really comment on the answers - I greatly appreciate the answers and find them very useful, but I am not sure I can accept them because I have no way of "checking to see if they are correct" (unlike the answers on stackoverflow that are code), because I am not a mathematician. – stats_noob, Commented Dec 28, 2021 at 8:48

2 Answers


Another thing to bring up in addition to the answers that already exist: AIC, BIC etc. can be really good (i.e. cheap to evaluate, they use all the data, they let you do things like AIC model averaging) in the specific circumstances where you can define them and they are valid. What do I mean by this limitation? For some model classes, especially with a lot of regularization, it can be very hard to even define these criteria (e.g. what is the AIC or BIC - especially in terms of the number of parameters - of XGBoost, a random forest or a convolutional neural network?), even if there are various extensions like DIC. Additionally, your model may be mis-specified in important ways (e.g. you are using some kind of time series model like ARIMA, but you know that you are mis-specifying the true underlying correlation of records over time). In those situations, I worry whether the likelihood that enters AIC or BIC is right and whether it may inappropriately overstate (usually the bigger worry) the evidence for one model vs. another.

Cross-validation is also quite good for optimizing metrics that are not easy to optimize directly as a likelihood function. One example is AuROC: while there are some tricks and attempts to define loss functions that optimize it directly, it's not straightforward; instead, you can fit a model using a standard likelihood function and then make your choices based on what maximizes AuROC in cross-validation (a sketch of this workflow follows).
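A sketch of that workflow, assuming a synthetic binary classification problem and logistic regression, where the regularization strength is chosen by cross-validated AuROC rather than by the likelihood itself:

```python
# Sketch: fit a model with a standard likelihood (logistic regression), but
# pick the regularization strength by cross-validated AuROC. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

best_C, best_auc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    # 5-fold cross-validated AuROC for this regularization strength
    auc = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y,
                          scoring="roc_auc", cv=5).mean()
    if auc > best_auc:
        best_C, best_auc = C, auc

print(f"chosen C = {best_C}, cross-validated AuROC = {best_auc:.3f}")
```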

These factors mean that, e.g. in prediction competitions such as those on Kaggle, forms of cross-validation are usually the go-to method for model evaluation and for making modeling choices.

I may be overly negative about the likelihood ratio test, and, yes, I realize that you could re-phrase AIC model selection as likelihood-ratio-based selection at a particular alpha level (though I'd recommend model averaging rather than model selection anyway), but for the purposes where one would consider AIC, BIC or cross-validation, I don't find hypothesis tests all that useful. Sure, a null hypothesis test is useful when you have a pre-specified model for an experiment (e.g. a randomized controlled trial of drug A vs. placebo for disease X), but it's a lot less useful for building a good model that performs well by some metric.

I don't really see the distinction for inference models beyond this point. You could clearly define a meaningful cross-validation metric for the example you describe.

I suspect many examples where one technique is used and another might be just as good (or even better) come down to historical precedent in certain research communities. E.g. in some areas AIC is super-popular, in others train-test splits and/or cross-validation, others are really keen on hypothesis tests, and - to name another option that has not yet been mentioned - there are also various forms of bootstrapping.


@RichardHardy already gave a partial answer in the comments:

1) AIC does have an interpretation as twice the negative expected log-likelihood (as mentioned e.g. here). Hence, it is not only a relative measure. Moreover, it is in a sense a measure of error. 2) Cross validation used to be computationally infeasible for some complex models, but AIC/BIC are algebraically infeasible since the models' likelihood and degrees of freedom can be very hard to obtain.

Extending it, notice that things like LR tests or AIC are computed on your training data, as opposed to out-of-sample approaches like holding out a test set for validation, $k$-fold cross-validation, LOOCV, etc. When using the former metrics, you are assuming that what they measure tells you something relevant for judging the potential out-of-sample performance of the model. When using some form of cross-validation, you are directly measuring the out-of-sample performance. Of course, your test set is a subsample of the data you gathered, so if your data is not representative of the population, the cross-validation metrics would be biased as well.

Moreover, as noted by Richard, cross-validation may be simpler to carry out (it works for whatever model you want, no math is needed) but more computationally expensive, so there are cases where you would prefer one approach over the other.

You are not always concerned with out-of-sample performance. Machine learning is concerned with making predictions and favors cross-validation; statistics is concerned with inference and often uses in-sample metrics. See The Two Cultures: statistics vs. machine learning? for details.

Finally, these metrics do not necessarily make sense in a machine learning scenario: for example, AIC penalizes the number of parameters, and you wouldn't do that for a deep learning model, where the number of parameters is always huge and is not your biggest concern.

  • Isn't AIC trying to approximate the LOOCV error? Therefore we cannot say AIC is for stats, CV for prediction, when they are both concerned about predictions. – rep_ho, Commented Nov 8, 2021 at 9:38
  • @rep_ho yes, it is an approximation. Also yes, this is an overstatement, since it is not the case that statistics doesn't care at all about out-of-sample performance. It is about how much weight stats vs ML put on it. Of course, this assumes that you want to make the distinction at all, which is disputable. – Tim, Commented Nov 8, 2021 at 9:58
  • What I meant is that AIC and CV are both trying to estimate out-of-sample performance, so there isn't really a difference between the goals of these two approaches. There might be a difference in what people use them for. – rep_ho, Commented Nov 8, 2021 at 9:59
  • @rep_ho what I'm trying to say is that CV directly measures the out-of-sample performance, while AIC & Co. make some assumptions (e.g. fewer parameters is better) and do this indirectly. They are based on different ways of looking at the problem. – Tim, Commented Nov 8, 2021 at 10:03
  • OK, thanks for the clarification. – rep_ho, Commented Nov 8, 2021 at 10:06
