
In general, there is a vast number of ways to select models or features in machine learning and statistics: for example, empirical methods such as cross-validation and the bootstrap, or in-sample penalties such as AIC, BIC, and Mallows's $C_p$. But I am wondering how a researcher should choose a suitable model selection method for a specific problem.

As far as I know, people sometimes prefer AIC/BIC over cross-validation because of their computational efficiency, or when the sample size is too small for cross-validation to be reliable. In this thread it is noted that AIC is asymptotically equivalent to leave-one-out cross-validation (LOOCV), suggesting a large-sample equivalence between information criteria and cross-validation. This means that as the sample size grows (and ignoring computational issues), information criteria and cross-validation should give similar conclusions.

For my current project, I am building a logistic regression model on a dataset with a sample size of about 8000, and I want to choose the best subset of features. Say I have the feature set [A, B, C] versus [A, B, C, D, E, F] and would like to determine which set gives better performance.

I face a similar issue to Variable selection: combining AIC and Cross validation. When I use AIC/BIC to evaluate my models, features D, E, F turn out to be insignificant and [A, B, C] performs better (in terms of AIC/BIC).

But when I use 10-fold cross-validation to evaluate the models, [A, B, C, D, E, F] gives significantly better performance on the validation folds than [A, B, C].
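For concreteness, the comparison I am running is along the lines of this sketch (the DataFrame `df`, the binary outcome column "y", and the feature names A–F are placeholders for my real data):

```python
# Sketch of the comparison: two candidate feature sets, scored by in-sample
# AIC/BIC and by 10-fold cross-validated log-loss. The DataFrame `df`, the
# outcome column "y", and the feature names "A".."F" are placeholders.
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

feature_sets = {"ABC": ["A", "B", "C"],
                "ABCDEF": ["A", "B", "C", "D", "E", "F"]}

for name, cols in feature_sets.items():
    X, y = df[cols], df["y"]

    # In-sample information criteria from a plain maximum-likelihood fit.
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

    # Out-of-sample performance: mean log-loss over 10 folds.
    # A very large C keeps sklearn's fit close to unpenalized MLE.
    cv_logloss = -cross_val_score(
        LogisticRegression(max_iter=1000, C=1e6),
        X, y, cv=10, scoring="neg_log_loss").mean()

    print(f"{name}: AIC={fit.aic:.1f}  BIC={fit.bic:.1f}  "
          f"10-fold CV log-loss={cv_logloss:.4f}")
```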

Therefore, I wonder whether there is a good way to choose a model selection method, especially when different methods give conflicting conclusions.

  • Spell-checkers are useful things. – Commented Jun 16, 2022 at 10:39
  • So you need a model selection criterion selection criterion. – Commented Jun 16, 2022 at 11:44
  • "How you should choose..." depends critically on what you want to achieve. What is the point of your model? What would it mean for your model to be better or worse? Why do you need variable selection at all? Are you trying to build a model that will be able to accurately predict the response in the future? Are you trying to test a priori hypotheses about the features? If you only want to know whether {D, E, F} make a significant contribution after controlling for {A, B, C}, fit the full model and perform a simultaneous test (sketched just below). If you're trying to do something else, what you should do depends... – Commented Jun 17, 2022 at 19:44
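If the simultaneous test suggested in the last comment is what is needed, a minimal sketch of one way to do it is a likelihood-ratio test between the nested logistic fits (same placeholder `df` and column names as above):

```python
# Sketch: simultaneous (likelihood-ratio) test of whether D, E, F contribute
# beyond A, B, C, comparing nested logistic regressions. `df` and the column
# names are the same placeholders as above.
import statsmodels.api as sm
from scipy import stats

reduced = sm.Logit(df["y"], sm.add_constant(df[["A", "B", "C"]])).fit(disp=0)
full = sm.Logit(df["y"],
                sm.add_constant(df[["A", "B", "C", "D", "E", "F"]])).fit(disp=0)

lr_stat = 2 * (full.llf - reduced.llf)       # likelihood-ratio statistic
df_diff = full.df_model - reduced.df_model   # 3 extra parameters here
p_value = stats.chi2.sf(lr_stat, df_diff)    # chi-squared tail probability
print(f"LR = {lr_stat:.2f}, df = {df_diff:.0f}, p = {p_value:.4g}")
```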

3 Answers

Answer 1 (score 11)

Nothing strange here.

  • If all model selection methods always gave the same results, we wouldn't need multiple criteria; we would just pick an arbitrary one.
  • AIC and BIC explicitly penalize the number of parameters, while cross-validation does not, so again it's not surprising that they suggest a model with fewer parameters (though nothing prohibits cross-validation from picking a model with fewer parameters); see the formulas below.
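For reference, with $k$ estimated parameters, sample size $n$, and maximized likelihood $\hat{L}$, the explicit penalties are
$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},$$
whereas cross-validation penalizes additional parameters only implicitly, through whatever out-of-fold error they cause.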
  • "AIC and BIC explicitly penalize the number of parameters, while cross-validation does not, so again it's not surprising that they suggest a model with fewer parameters" – this seems to be a broadly incorrect conclusion based on a false dichotomy. The asymptotic equivalence between AIC/BIC and certain versions of cross-validation shows that. – Commented Jun 16, 2022 at 10:30
  • @RichardHardy asymptotic guarantees are not exact equivalence, as this example shows. AIC and BIC have explicit penalties; CV penalizes implicitly. – Tim, Commented Jun 16, 2022 at 12:02
  • My point is that your second bullet point is not helpful, or even misleading. The opposite statement to what you wrote would be equally (in)valid: cross-validation implicitly penalizes the number of parameters, AIC/BIC do not, so again, it's not surprising that they suggest a model with more parameters. It is not helpful to explain a chance event by invoking some nonexistent systematic reason. – Commented Jun 16, 2022 at 12:37
  • @RichardHardy, it's not necessarily a chance event; I suspect it is systematic. In the low-data regime, AIC/BIC are not equivalent to cross-validation: they penalize the number of parameters more strongly than cross-validation does. The question is likely in the low-data regime. – D.W., Commented Jun 16, 2022 at 18:14
  • @RichardHardy I don't agree. In real life we don't have infinite sample sizes. It's an optimization problem where in one case we directly penalize something and in the other case we don't; we just know it should lead to approximately the same result. That makes a difference. – Tim, Commented Jun 16, 2022 at 21:30
Answer 2 (score 7)

"AIC is asymptotically equivalent to leave-1-out cross-validation (LOOCV)"

  • It's not equivalent to 10-fold cross-validation, which is what you're comparing it to.
  • It's only asymptotically equivalent, so the two methods don't always give the same answer; they're only approximately the same.

It's not really clear how you're doing the train/test/validation split when cross-validating, so I can't really address your final question. Note, though, that the "best" model depends on the amount of data you're using: for example, a model with 5 features may perform well when trained on the full dataset, but could overfit when trained on only a subset of it during cross-validation.
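For example, one quick check (a minimal sketch, assuming the same hypothetical DataFrame `df` and columns A–F as in the question) is whether the cross-validation ranking of the two feature sets is stable across different numbers of folds, since fewer folds mean smaller training sets per fit:

```python
# Sketch: does the CV ranking of the two feature sets change with the number
# of folds (and hence with the size of each training split)? `df`, "y", and
# the column names A..F are the same placeholders used in the question.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

feature_sets = {"ABC": ["A", "B", "C"],
                "ABCDEF": ["A", "B", "C", "D", "E", "F"]}

for k in (3, 5, 10, 20):
    for name, cols in feature_sets.items():
        loss = -cross_val_score(LogisticRegression(max_iter=1000, C=1e6),
                                df[cols], df["y"],
                                cv=k, scoring="neg_log_loss").mean()
        print(f"{k:>2}-fold CV log-loss, {name}: {loss:.4f}")
```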

  • The result the OP is getting is the opposite of intuitive: AIC is asymptotically equivalent to LOOCV, and LOOCV should select a larger model than 10-fold CV. – Commented Jun 16, 2022 at 10:38
  • Feature selection/model selection is a bad idea in this context (and in most other contexts), as it results in very wrong standard errors at the end and a lot of false confidence due to covered-up model uncertainty. Pre-specify the model and stick to it. – Commented Jun 16, 2022 at 12:37
  • @FrankHarrell, in most contexts that I am familiar with, prespecifying a model does not work. It does not work for prediction (because of unnecessarily poor performance), for inference (since the data typically do not meet the model's and/or estimator's assumptions, so inference from the model is not valid), or for description (since the data are not allowed to speak for themselves). If we stuck to prespecified models, the whole field of machine learning would immediately have to be thrown out the window. – Commented Jun 16, 2022 at 13:32
  • Richard, that is largely a mirage. Variable selection only seems to work; it invalidates much of the model, especially the standard errors, and causes overfitting. Full-model fits typically outperform subset-model fits, especially if using some shrinkage. Prespecification works better not only for inference but also for prediction, about 2/3 of the time in my experience over the last few decades. Data reduction (unsupervised learning) in tandem with pre-specified models often works even better. – Commented Jun 16, 2022 at 20:41
Answer 3 (score 5)

Maybe you should concentrate more on methods intended specifically for feature selection rather than model selection. Model selection methods like cross-validation or AIC try to compare models independently of how they differ (this is only approximately true, but it should suffice here), whereas feature selection methods concentrate on comparing models that differ only in the features they include.

I have had good experience with, e.g., random forest based feature selection, but there are many others, some of them more specialized, like spike-and-slab.
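For instance, here is a minimal sketch of one random-forest-based screen, using scikit-learn's permutation importance and the hypothetical `df` / columns A–F from the question:

```python
# Sketch: rank candidate features by random-forest permutation importance as a
# feature-selection screen. `df`, "y", and columns A..F are placeholders again.
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = df[["A", "B", "C", "D", "E", "F"]], df["y"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)

# Features with higher mean importance degrade held-out performance more when
# permuted, i.e. they carry more predictive signal for this forest.
for col, m in sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{col}: {m:.4f}")
```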

Having said that, the results of those methods often contradict one another, especially in less simple cases. Use the features that rank highly under multiple methods as suggestions, and then select the model that works best for you with respect to other cost criteria such as complexity and runtime.
