
On page 223 in An Introduction to Statistical Learning, the authors summarise the differences between ridge regression and lasso. They provide an example (Figure 6.9) of when "lasso tends to outperform ridge regression in terms of bias, variance, and MSE".

I understand why lasso can be desirable: it results in sparse solutions since it shrinks many coefficients to 0, resulting in simple and interpretable models. But I do not understand how it can outperform ridge when only predictions are of interest (i.e. how is it getting a substantially lower MSE in the example?).

With ridge, if many predictors have almost no effect on the response (with a few predictors having a large effect), won't their coefficients simply be shrunk to a small number very close to zero... resulting in something very similar to lasso? So why would the final model have worse performance than lasso?


2 Answers


You are right to ask this question. In general, when a proper accuracy scoring rule is used (e.g., mean squared prediction error), ridge regression will outperform lasso. Lasso spends some of the information trying to find the "right" predictors and it's not even great at doing that in many cases. Relative performance of the two will depend on the distribution of true regression coefficients. If you have a small fraction of nonzero coefficients in truth, lasso can perform better. Personally I use ridge almost all the time when interested in predictive accuracy.
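To make the "depends on the distribution of true coefficients" point concrete, here is a minimal simulation sketch (my own illustration, not from the answer or the book): the same ridge/lasso pipeline is fit once with a dense truth (every predictor contributes a little) and once with a sparse truth (only a handful of predictors matter). The sample sizes, noise level, coefficient values, and alpha grid are arbitrary assumptions, and which method wins can shift somewhat with the random seed.

```python
# Minimal sketch: ridge vs lasso test MSE under a dense vs a sparse truth.
# All simulation settings (n, p, noise level, alpha grid) are illustrative choices.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, p = 100, 1000, 45

def ridge_lasso_mse(beta):
    # Simulate a training and a test set from a linear model with the given truth.
    X_tr = rng.standard_normal((n_train, p))
    X_te = rng.standard_normal((n_test, p))
    y_tr = X_tr @ beta + rng.standard_normal(n_train)
    y_te = X_te @ beta + rng.standard_normal(n_test)
    # Tune the penalty by cross-validation for both methods.
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_tr, y_tr)
    lasso = LassoCV(cv=10).fit(X_tr, y_tr)
    return (mean_squared_error(y_te, ridge.predict(X_te)),
            mean_squared_error(y_te, lasso.predict(X_te)))

beta_dense = np.full(p, 0.3)   # every predictor has a small effect
beta_sparse = np.zeros(p)
beta_sparse[:5] = 2.0          # only a handful of predictors matter

print("dense truth  (ridge, lasso) test MSE:", ridge_lasso_mse(beta_dense))
print("sparse truth (ridge, lasso) test MSE:", ridge_lasso_mse(beta_sparse))
```

Typically the dense truth favours ridge and the sparse truth favours lasso, which is the pattern the answer describes.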

  • Are there instances when you are not interested in predictive accuracy? Commented Mar 5, 2018 at 20:34
  • @WalrustheCat Some folks, stereotypically coming from Stanford, advocate the use of Lasso in high-dimensional variable selection. Presumably, Frank meant "... primarily interested in predictive accuracy" rather than simply "... interested in predictive accuracy", though, in my opinion, the difference between these two is too pedantic to be useful. Commented Mar 5, 2018 at 20:44
  • I've never understood the "regularization as dimensionality reduction" approach. You could perform dimensionality reduction, either through lasso regularization or not, and then use the best regularization function for your original problem on the resulting features. But I digress. Commented Mar 5, 2018 at 21:17
  • From "In general [...] ridge regression will outperform lasso" and "If you have a small fraction of nonzero coefficients in truth, lasso can perform better" it seems to follow that in most prediction problems the ground truth is not sparse. Is this what you are saying? – amoeba Commented Mar 5, 2018 at 21:39
  • Yes, mainly. If you knew the ground truth "in distribution" you would create a Bayesian prior distribution for the unknown regression coefficients that would get you optimal results. And even when, say, 3/4 of the predictors have exactly zero effect, ridge is competitive with lasso. Commented Mar 5, 2018 at 22:19

I think the specific setup of the example you reference is key to understanding why lasso outperforms ridge: only 2 of 45 predictors are actually relevant.

This borders on a pathological case: lasso, specifically designed to shrink coefficients all the way to zero, performs exactly as intended, while ridge has to carry a large number of useless terms (even if their effect is shrunk close to zero, it is still a non-zero effect).
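A rough way to see that mechanism is to recreate a "2 relevant predictors out of 45" design (the exact coefficient values, sample size, and noise level below are my own guesses, not the book's simulation) and inspect the fitted coefficients: lasso sets most of the 43 irrelevant coefficients exactly to zero, while ridge leaves every one of them small but nonzero, and that accumulated noise is what shows up as extra prediction error.

```python
# Rough recreation of a "2 of 45 relevant" design to compare fitted coefficients.
# The coefficient values, sample size, and noise level are illustrative assumptions.
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(1)
n, p = 100, 45
beta = np.zeros(p)
beta[:2] = 3.0                       # only two predictors truly matter
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)
lasso = LassoCV(cv=10).fit(X, y)

for name, coef in [("ridge", ridge.coef_), ("lasso", lasso.coef_)]:
    # How many coefficients are exactly zero, and how much total weight
    # lands on the 43 predictors that have no true effect?
    print(f"{name}: exact zeros = {np.sum(coef == 0)}, "
          f"sum |coef| on irrelevant predictors = {np.abs(coef[2:]).sum():.3f}")
```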

