
Suppose you have a linear model which you believe has too many variables: a cubic in 10 lags, for example. You believe, without being certain, that it is probably quadratic, maybe even linear, that only four to six of the lags matter, and that, as a result, your current equation probably overfits badly. Your goal is a good predictive model.

Relaxed LASSO, in at least some of its variants, uses the LASSO for variable selection and then switches to ridge regression for shrinkage. This generally does a better job of forecasting than pure LASSO or the elastic net. I am not certain how its out-of-sample performance compares to pure ridge regression, but I believe it is worse, though not by much, once the tuning parameter is chosen optimally by some version of cross-validation in each case.
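To fix ideas, here is a minimal sketch of the two-stage procedure I have in mind (LASSO for selection, then ridge for shrinkage), assuming scikit-learn and synthetic data; the dataset sizes and tuning grids are placeholders, not from any paper.

```python
# Two-stage fit: LASSO (with CV-tuned penalty) selects variables,
# then ridge (with its own CV-tuned penalty) refits on the survivors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic stand-in for the lag design matrix described above.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# Stage 1: LASSO, used only to pick the non-zero coefficients.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of surviving variables

# Stage 2: ridge regression, refit on the selected columns only.
ridge = RidgeCV(alphas=np.logspace(-4, 4, 25)).fit(X[:, selected], y)
print("kept", len(selected), "of", X.shape[1], "variables")
```

The same `X` and `y` are reused in the sketch below.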

Here is my question: when comparing models that differ only in the variables they include, variable selection (as by LASSO selecting variables with non-zero coefficients) and model selection (as by choosing the set of variables that minimises the AIC) are pragmatically doing the same job, however conceptually distinct they may be. Now suppose that in each of those two cases (selection via LASSO or via AIC) we then do ridge regression on the resulting model, as in relaxed LASSO, and suppose the variables selected by AIC and LASSO are not identical. Do we know which selection of variables is likely to give better out-of-sample forecasts after the tuning parameter has been set correctly for each?
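A hypothetical head-to-head of the two routes, each followed by ridge, might look like this; it assumes scikit-learn and statsmodels, reuses `X` and `y` from the sketch above, and `forward_aic` is an illustrative helper I wrote for this post, not a library function. Which route wins will depend on the data.

```python
# Compare LASSO-selected vs. AIC-selected variable sets, each refit
# with ridge, on a single held-out split.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
alphas = np.logspace(-4, 4, 25)

def forward_aic(X, y):
    """Greedy forward selection minimising the AIC of an OLS fit."""
    remaining, chosen = set(range(X.shape[1])), []
    best = sm.OLS(y, np.ones_like(y)).fit().aic  # intercept-only model
    while remaining:
        aics = {j: sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit().aic
                for j in remaining}
        j, aic = min(aics.items(), key=lambda kv: kv[1])
        if aic >= best:  # no candidate improves AIC; stop
            break
        best, chosen = aic, chosen + [j]
        remaining.discard(j)
    return chosen

routes = {
    "AIC   -> ridge": forward_aic(X_tr, y_tr),
    "LASSO -> ridge": list(np.flatnonzero(LassoCV(cv=5).fit(X_tr, y_tr).coef_)),
}
for name, vars_ in routes.items():
    ridge = RidgeCV(alphas=alphas).fit(X_tr[:, vars_], y_tr)
    print(name, "held-out R^2:", round(ridge.score(X_te[:, vars_], y_te), 3))
```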

  • Any thoughts on my answer? – Commented Jul 19, 2021 at 6:38

1 Answer


Relaxed LASSO does LASSO (not ridge) in both steps. If a method first did LASSO and then ridge, calling it relaxed LASSO would seem misleading to me.

More importantly, LASSO and AIC are apples and oranges: LASSO is a fitting method, while AIC is an information criterion. (E.g., there are works that use AIC instead of cross-validation to determine the penalty intensity of LASSO, so you can have both at once if you like.) A better way to phrase this would be a comparison of LASSO with OLS (both being fitting methods), where the former is tuned using cross-validation while the latter uses model selection (stepwise? full subset?) based on AIC.
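To illustrate that parenthetical point, a minimal sketch assuming scikit-learn, whose `LassoLarsIC` tunes the LASSO penalty by AIC (or BIC) while `LassoCV` tunes it by cross-validation; the synthetic data are a stand-in for yours.

```python
# AIC and cross-validation as two ways of tuning the same LASSO penalty.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsIC

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

lasso_aic = LassoLarsIC(criterion="aic").fit(X, y)  # penalty chosen by AIC
lasso_cv = LassoCV(cv=5).fit(X, y)                  # penalty chosen by CV
print("penalty chosen by AIC:", lasso_aic.alpha_)
print("penalty chosen by CV :", lasso_cv.alpha_)
```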

To address your question directly, we have to recall the "no free lunch" theorem: there is no prediction method that is best for all problems. There exist problems for which one method is better and other problems for which another method is better. E.g., in the original LASSO paper, the author shows problems where LASSO beats ridge and vice versa (I think subset selection is included in the comparison there, too, and it both beats and is beaten by LASSO and ridge). I would not be surprised if the original relaxed LASSO paper contains similar comparisons. In summary, you have to figure out the best method for each problem without relying on overly broad generalizations.
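To make the point concrete, here is a toy simulation of my own (not from either paper) in which LASSO tends to win when the true coefficient vector is sparse and ridge tends to win when it is dense; the exact numbers will vary with the seed and noise level.

```python
# "No free lunch" in miniature: the better of LASSO vs. ridge depends
# on whether the true coefficient vector is sparse or dense.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
n, p = 150, 40
X = rng.randn(n, p)

# Sparse truth: 3 strong coefficients; dense truth: 40 small ones.
truths = [("sparse", np.r_[np.full(3, 5.0), np.zeros(p - 3)]),
          ("dense ", rng.randn(p) * 0.5)]

for name, beta in truths:
    y = X @ beta + rng.randn(n) * 2.0
    for label, model in [("lasso", LassoCV(cv=5)),
                         ("ridge", RidgeCV(alphas=np.logspace(-3, 3, 25)))]:
        r2 = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name} truth, {label}: mean CV R^2 = {r2:.3f}")
```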

A somewhat related thread is "AIC versus cross validation in time series: the small sample case".

  • Thanks, Richard! I see that I was mistaken about the second step in relaxed LASSO. But a lot of people use LASSO + cross-validation to do feature selection, and not just shrinkage, in a sparse high-dimensional setting. I think that is why LASSO is more popular than ridge regression in those settings, even though the latter often gives better results. Is your answer different for LASSO + CV than for LASSO alone? – andrewH, Commented Aug 14, 2021 at 17:30
  • @andrewH, my answer is not different. But see the link in my updated answer for a similar query. Also, I do not think ridge would typically yield better results than LASSO in sparse high-dimensional settings, because these settings are sparse, and sparsity is what LASSO excels at. The problem may be that they are not really sparse, so that ridge is closer to reality than LASSO. A pair of examples: truly sparse ~ gene expression data (only a few genes out of millions produce some feature); not sparse ~ economic or financial data (everything affects everything, but the effects may be very small). – Commented Aug 15, 2021 at 18:29
