In general, there is a vast number of ways to do model/feature selection in machine learning or statistics: for example, empirical methods like cross-validation and the bootstrap, or in-sample penalties such as AIC, BIC, and Mallows's $C_p$. What I am wondering is how a researcher should choose a suitable model selection method for a specific problem.
To my knowledge, people sometimes prefer AIC/BIC over cross-validation because the information criteria are cheaper to compute, or because the sample size is too small for cross-validation to be reliable. According to this thread, AIC is asymptotically equivalent to leave-one-out cross-validation (LOOCV); the analogous result for BIC involves leave-$v$-out cross-validation with a large held-out fraction. This suggests that as the sample size grows (and ignoring computational issues), information criteria and cross-validation should give similar conclusions.
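For reference, the standard definitions, with $k$ estimated parameters, $n$ observations, and maximized likelihood $\hat{L}$:

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}.$$

The $\ln n$ factor means that at $n \approx 8000$ (so $\ln n \approx 9$), BIC charges each extra parameter roughly 4.5 times what AIC does, which is relevant to the disagreement below.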
For my current project, I am building a logistic regression model on a dataset with sample size ~8000, and I want to choose the best subset of features for the model. Say I have the feature set [A, B, C] versus [A, B, C, D, E, F], and I would like to determine which set gives better performance.
I face a similar issue to Variable selection: combining AIC and Cross validation. When I use AIC/BIC to evaluate my models, the features D, E, F turn out to be insignificant and [A, B, C] performs better (in terms of AIC/BIC).
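A minimal sketch of this comparison in Python, assuming the data sit in a pandas DataFrame `df` with predictor columns `A`..`F` and a binary outcome column `y` (all names are placeholders):

```python
import statsmodels.api as sm

# Hypothetical data: df has predictor columns "A".."F" and a 0/1 outcome "y".
X_small = sm.add_constant(df[["A", "B", "C"]])
X_full = sm.add_constant(df[["A", "B", "C", "D", "E", "F"]])

fit_small = sm.Logit(df["y"], X_small).fit(disp=0)
fit_full = sm.Logit(df["y"], X_full).fit(disp=0)

# Lower AIC/BIC means the criterion prefers that model.
print(f"[A,B,C]        AIC={fit_small.aic:.1f}  BIC={fit_small.bic:.1f}")
print(f"[A,B,C,D,E,F]  AIC={fit_full.aic:.1f}  BIC={fit_full.bic:.1f}")
```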
But when I use 10-fold cross-validation to evaluate the models, [A, B, C, D, E, F] gives significantly better performance on the validation folds than [A, B, C].
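Again a minimal sketch of what I am doing, under the same placeholder names; I score with log-loss so that the cross-validation metric is on the same likelihood scale that AIC/BIC are built on:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Unpenalized fit to match the maximum-likelihood models scored by AIC/BIC
# (scikit-learn regularizes by default; use penalty="none" on versions < 1.2).
model = LogisticRegression(penalty=None, max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for cols in (["A", "B", "C"], ["A", "B", "C", "D", "E", "F"]):
    scores = cross_val_score(model, df[cols], df["y"], cv=cv,
                             scoring="neg_log_loss")
    print(cols, f"mean validation log-loss = {-scores.mean():.4f}")
```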
Therefore, I wonder whether there is a good way to choose a model selection method, especially when different methods give conflicting conclusions.
A comment on the question suggests: if you want to know whether {D, E, F} make a significant contribution after controlling for {A, B, C}, fit the full model and perform a simultaneous test. If you're trying to do something else, what you should do depends...
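One standard form of the simultaneous test that comment describes is a likelihood-ratio test of $H_0$: the coefficients of D, E, F are all zero. A sketch, reusing the two `statsmodels` fits from above (and still assuming the placeholder names):

```python
from scipy import stats

# Likelihood-ratio test: 2 * (logL_full - logL_reduced) ~ chi-squared
# with df equal to the number of restricted coefficients (3 here).
lr_stat = 2 * (fit_full.llf - fit_small.llf)
df_diff = int(fit_full.df_model - fit_small.df_model)
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"LR statistic = {lr_stat:.2f}, df = {df_diff}, p = {p_value:.4g}")
```

A small p-value here says that D, E, F jointly add significant fit, even if no individual coefficient looks significant on its own.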