
On page 223 in An Introduction to Statistical Learning, the authors summarise the differences between ridge regression and lasso. They provide an example (Figure 6.9) of when "lasso tends to outperform ridge regression in terms of bias, variance, and MSE".

I understand why lasso can be desirable: it results in sparse solutions since it shrinks many coefficients to 0, resulting in simple and interpretable models. But I do not understand how it can outperform ridge when only predictions are of interest (i.e. how is it getting a substantially lower MSE in the example?).

With ridge, if many predictors have almost no effect on the response (with a few predictors having a large effect), won't their coefficients simply be shrunk to a small number very close to zero... resulting in something very similar to lasso? So why would the final model have worse performance than lasso?


2 Answers


You are right to ask this question. In general, when a proper accuracy scoring rule is used (e.g., mean squared prediction error), ridge regression will outperform lasso. Lasso spends some of the information trying to find the "right" predictors and it's not even great at doing that in many cases. Relative performance of the two will depend on the distribution of true regression coefficients. If you have a small fraction of nonzero coefficients in truth, lasso can perform better. Personally I use ridge almost all the time when interested in predictive accuracy.
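To make the "depends on the distribution of true coefficients" point concrete, here is a minimal simulation sketch (my own illustration, not from the answer or the book): the same ridge/lasso pipeline is fit once with a dense truth (every predictor contributes a little) and once with a sparse truth (only a handful of predictors matter). The sample sizes, noise level, coefficient values, and alpha grid are arbitrary assumptions, and which method wins can shift somewhat with the random seed.

```python
# Minimal sketch: ridge vs lasso test MSE under a dense vs a sparse truth.
# All simulation settings (n, p, noise level, alpha grid) are illustrative choices.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, p = 100, 1000, 45

def ridge_lasso_mse(beta):
    # Simulate a training and a test set from a linear model with the given truth.
    X_tr = rng.standard_normal((n_train, p))
    X_te = rng.standard_normal((n_test, p))
    y_tr = X_tr @ beta + rng.standard_normal(n_train)
    y_te = X_te @ beta + rng.standard_normal(n_test)
    # Tune the penalty by cross-validation for both methods.
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_tr, y_tr)
    lasso = LassoCV(cv=10).fit(X_tr, y_tr)
    return (mean_squared_error(y_te, ridge.predict(X_te)),
            mean_squared_error(y_te, lasso.predict(X_te)))

beta_dense = np.full(p, 0.3)   # every predictor has a small effect
beta_sparse = np.zeros(p)
beta_sparse[:5] = 2.0          # only a handful of predictors matter

print("dense truth  (ridge, lasso) test MSE:", ridge_lasso_mse(beta_dense))
print("sparse truth (ridge, lasso) test MSE:", ridge_lasso_mse(beta_sparse))
```

Typically the dense truth favours ridge and the sparse truth favours lasso, which is the pattern the answer describes.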

  • Are there instances when you are not interested in predictive accuracy? Commented Mar 5, 2018 at 20:34
  • @WalrustheCat Some folks, stereotypically coming from Stanford, advocate the use of Lasso in high-dimensional variable selection. Presumably, Frank meant "... primarily interested in predictive accuracy" rather than simply "... interested in predictive accuracy", though, in my opinion, the difference between these two is too pedantic to be useful. Commented Mar 5, 2018 at 20:44
  • I've never understood the "regularization as dimensionality reduction" approach. You could perform dimensionality reduction, either through lasso regularization or not, and then use the best regularization function for your original problem on the resulting features. But I digress. Commented Mar 5, 2018 at 21:17
  • From "In general [...] ridge regression will outperform lasso" and "If you have a small fraction of nonzero coefficients in truth, lasso can perform better" it seems to follow that in most prediction problems the ground truth is not sparse. Is this what you are saying? – amoeba Commented Mar 5, 2018 at 21:39
  • Yes, mainly. If you knew the ground truth "in distribution" you would create a Bayesian prior distribution for the unknown regression coefficients that would get you optimal results. And even when, say, 3/4 of the predictors have exactly zero effect, ridge is competitive with lasso. Commented Mar 5, 2018 at 22:19

I think the specific setup of the example you reference is key to understanding why lasso outperforms ridge: only 2 of 45 predictors are actually relevant.

This borders on a pathological case: lasso, specifically designed to shrink coefficients all the way to zero, performs exactly as intended, while ridge has to carry a large number of useless terms (even if their effect is shrunk close to zero, it is still a non-zero effect).
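A rough way to see that mechanism is to recreate a "2 relevant predictors out of 45" design (the exact coefficient values, sample size, and noise level below are my own guesses, not the book's simulation) and inspect the fitted coefficients: lasso sets most of the 43 irrelevant coefficients exactly to zero, while ridge leaves every one of them small but nonzero, and that accumulated noise is what shows up as extra prediction error.

```python
# Rough recreation of a "2 of 45 relevant" design to compare fitted coefficients.
# The coefficient values, sample size, and noise level are illustrative assumptions.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(1)
n, p = 100, 45
beta = np.zeros(p)
beta[:2] = 3.0                       # only two predictors truly matter
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)
lasso = LassoCV(cv=10).fit(X, y)

for name, coef in [("ridge", ridge.coef_), ("lasso", lasso.coef_)]:
    # How many coefficients are exactly zero, and how much total weight
    # lands on the 43 predictors that have no true effect?
    print(f"{name}: exact zeros = {np.sum(coef == 0)}, "
          f"sum |coef| on irrelevant predictors = {np.abs(coef[2:]).sum():.3f}")
```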

