
I will use an elastic net to estimate a regression model that will later be used for forecasting.

I have a grid of $\alpha$ values in $[0,1]$ representing the mix between the $L_1$ and $L_2$ penalties, and a grid of $\lambda$ values for the overall amount of penalization.
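For concreteness, one common parameterization of the elastic net objective (the one used by glmnet; the exact scaling of $\lambda$ and $\alpha$ varies across software) is

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2 + \lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right).$$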

There are at least two alternatives for selecting the optimal combination $(\alpha,\lambda)$:

  1. Perform leave-one-out cross-validation (LOOCV) and pick the combination $(\alpha,\lambda)$ that delivers the lowest MSE on the validation sets (perhaps applying the one-standard-error rule towards parsimony); see the sketch after this list.
  2. Use the whole sample and pick the combination $(\alpha,\lambda)$ that delivers the lowest AIC.
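Here is a minimal sketch of alternative 1, assuming scikit-learn's `ElasticNet` as the estimator. Note that scikit-learn calls the mixing parameter `l1_ratio` and the overall penalty weight `alpha`, so its names are swapped relative to my notation; the data and grids below are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))              # toy data for illustration
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(50)

param_grid = {
    "l1_ratio": np.linspace(0.05, 1.0, 5),     # my alpha: L1 vs L2 mix
    "alpha": np.logspace(-3, 1, 10),           # my lambda: penalty strength
}

# LOOCV: each observation serves once as the single validation point;
# the score is the squared error averaged over all n folds.
search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)   # (alpha, lambda) pair with lowest LOOCV MSE
```

With $n$ observations this requires $n$ fits per grid point, which is part of what makes alternative 2 attractive on speed grounds.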

In the second alternative, the degrees of freedom used in the AIC would be the effective degrees of freedom of the elastic net. (I suppose the latter should be obtainable, since the effective degrees of freedom are known for both the LASSO and ridge regression.) A sketch of this alternative follows.
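Below is a sketch of alternative 2, under two assumptions that should be stated loudly: (i) the ridge-type effective-degrees-of-freedom formula applied to the active set, $\mathrm{df} \approx \operatorname{tr}\!\bigl(X_A (X_A^\top X_A + \lambda_2 I)^{-1} X_A^\top\bigr)$, is an acceptable approximation for the elastic net, and (ii) the errors are Gaussian, so that $\mathrm{AIC} = n\log(\mathrm{RSS}/n) + 2\,\mathrm{df}$ up to an additive constant. The helper `elastic_net_aic` is a hypothetical name of mine, not a library function.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net_aic(X, y, mix, lam):
    """Gaussian AIC for one (alpha, lambda) pair; `mix` is the L1/L2 mix."""
    n = len(y)
    model = ElasticNet(alpha=lam, l1_ratio=mix, max_iter=10_000)
    model.fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)

    active = np.flatnonzero(model.coef_)   # indices of nonzero coefficients
    if active.size == 0:
        df = 0.0                           # intercept-only fit
    else:
        Xa = X[:, active]
        # scikit-learn's ElasticNet minimizes
        #   ||y - Xb||^2 / (2n) + lam*(mix*||b||_1 + (1-mix)/2*||b||^2),
        # so the implied ridge shift on the active block is n*lam*(1-mix).
        lam2 = n * lam * (1.0 - mix)
        H = Xa @ np.linalg.solve(Xa.T @ Xa + lam2 * np.eye(active.size), Xa.T)
        df = np.trace(H)

    # Intercept df is a constant across the grid and is omitted here.
    return n * np.log(rss / n) + 2 * df

# Pick the (alpha, lambda) pair minimizing AIC over the same grids as above.
```

Some reassurance for the approximation: at $\alpha = 1$ the formula reduces to the number of nonzero coefficients, which is the known unbiased estimate of the LASSO's degrees of freedom (Zou, Hastie & Tibshirani, 2007), and at $\alpha = 0$ it reduces to the usual ridge hat-matrix trace.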

Question: Which of 1. and 2. is better and why?

Some thoughts:

  • In the context of feature selection, LOOCV is known to be asymptotically equivalent to AIC-based selection (Stone, 1977). So asymptotically I would expect both 1. and 2. to yield the same result. But what about finite samples?
  • Alternative 2. could be preferred for speed, since it requires only a single fit per $(\alpha,\lambda)$ combination rather than $n$ fits.
  • Alternative 2. requires specifying the error distribution.
  • Is it fine to use effective degrees of freedom when calculating AIC?

Here are a couple of related questions: this and this.

  • If the model is to be used for forecasting, why are you interested in parsimony rather than simply using ridge regression and using information from all the predictors?
    – EdM, Oct 5, 2015 at 14:53
  • Indeed, I am interested in forecasting, and parsimony per se plays no role. Still, when it comes to forecasting, ridge regression does not beat LASSO or elastic net by design, does it? I therefore opt for the elastic net, which has the flexibility to choose a data-based balance between the ridge and LASSO penalties. If it was the one-standard-error rule that caught your attention, I just thought it is a standard, so why not give it a try. On the other hand, it might not make much sense to be conservative when parsimony is not a goal I am seeking, so I might just give it up.
    – Commented Oct 5, 2015 at 18:12
  • I think in general practitioners will fix $\alpha=0.5$ when fitting an elastic net model and use cross-validation only to select $\lambda$. Searching over a grid can lead to overfitting even if you're using cross-validation. Nevertheless, there is an interval-search algorithm implemented in the c060 package that will select the optimal parameter combination.
    – user230309, Commented Oct 11, 2015 at 16:13
  • Apparently, @FrankHarrell has a number of answers mentioning (successful) use of effective AIC (e.g. this), i.e. AIC calculated using effective degrees of freedom. So if I understand it correctly, selecting $\lambda$ using effective AIC can be a good idea.
    – Commented Oct 11, 2015 at 19:41
  • 1
    $\begingroup$ +1. But cross-validation does not necessarily mean leave-one-out. It is generally believed to have high variance and the usual recommendation is to use something like 10-fold CV instead. This is not asymptotically equivalent to AIC anymore (and I am not sure what exactly are the conditions under which LOOCV is asympt. eq. to AIC). What I see in machine learning community, is that people tend to use cross-validation as the method of choice. $\endgroup$
    – amoeba
    Commented Oct 20, 2015 at 22:37
