13
$\begingroup$

This question might be dumb, but I noticed that there are two different formulations of the Lasso regression. We know that the Lasso problem is to minimize an objective consisting of the squared loss plus the $L_1$ penalty term, expressed as follows: $$ \min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 $$

But I have often seen the Lasso estimator written as $$ \hat{\beta}_n(\lambda) = \arg\min_{\beta} \left\{ \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 \right\} $$

My question is, are they equivalent? Where does the factor $\frac{1}{2n}$ come from? The connection between the two formulations is not obvious to me.

[Update] I guess another question I should ask is:

Why is there the second formulation? What's the advantage, theoretically or computationally, of formulating the problem that way?

$\endgroup$
2
  • 2
    $\begingroup$ If you set $\lambda$ in the second formulation equal to $1/(2n)$ times the $\lambda$ in the first formulation, then the objective function in the second formulation is $1/(2n)$ times the objective function in the first formulation. In effect, you have merely changed the units of measurement of the loss. How do you suppose that would change the optimal values of $\beta$? $\endgroup$
    – whuber
    Commented Sep 30, 2014 at 19:13
  • $\begingroup$ Thanks, @Whuber. That makes sense to me. Then why is there the latter formulation? What's the advantage, theoretically or computationally, of formulating the problem that way? $\endgroup$
    – SixSigma
    Commented Sep 30, 2014 at 19:19

1 Answer

12
$\begingroup$

They are indeed equivalent since you can always rescale $\lambda$ (see also @whuber's comment). From a theoretical perspective, it is a matter of convenience but as far as I know it is not necessary. From a computational perspective, I actually find the $1/(2n)$ quite annoying, so I usually use the first formulation if I am designing an algorithm that uses regularization.
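As a quick numerical check of the rescaling argument, here is a minimal sketch in plain Python. It is not taken from any particular library: the solver `lasso_cd` is a hypothetical, bare-bones coordinate descent written for illustration, and the simulated data, the value `lam1 = 4.0`, and the number of sweeps are all arbitrary assumptions. It solves the first formulation with $\lambda$ and the second with $\lambda / (2n)$, then compares the two coefficient vectors.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, scale, n_sweeps=200):
    """Hypothetical helper: minimize scale * ||y - X b||_2^2 + lam * ||b||_1
    by cyclic coordinate descent. X is a list of rows, y a list of floats."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n))
            # Exact one-dimensional minimizer for coordinate j.
            new_bj = soft_threshold(rho, lam / (2.0 * scale)) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b

# Simulated data (an assumption for illustration only).
random.seed(0)
n, p = 30, 5
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + 0.1 * random.gauss(0.0, 1.0) for row in X]

lam1 = 4.0
# First formulation: ||y - X b||^2 + lam1 * ||b||_1
beta_first = lasso_cd(X, y, lam1, scale=1.0)
# Second formulation: (1 / (2n)) * ||y - X b||^2 + (lam1 / (2n)) * ||b||_1
beta_second = lasso_cd(X, y, lam1 / (2 * n), scale=1.0 / (2 * n))

max_diff = max(abs(a - c) for a, c in zip(beta_first, beta_second))
print("largest coefficient difference:", max_diff)
```

The two runs should agree up to floating-point rounding, because the soft-threshold level depends only on the ratio `lam / (2 * scale)`, which is the same in both calls. The practical consequence is that a $\lambda$ tuned under one convention has to be divided (or multiplied) by $2n$ before being used with software that follows the other.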

A little backstory: When I first started learning about penalized methods, I got annoyed carrying the $1/(2n)$ around everywhere in my work so I preferred to ignore it -- it even simplified some of my calculations. At that time my work was mainly computational. More recently I have been doing theoretical work, and I have found the $1/(2n)$ indispensable (even vs., say, $1/n$).

More details: When you try to analyze the behaviour of the Lasso as a function of the sample size $n$, you frequently have to deal with sums of iid random variables, and in practice it is generally more convenient to analyze such sums after normalizing by $n$ (think law of large numbers / central limit theorem, or if you want to get fancy, concentration of measure and empirical process theory). If you don't have the $1/n$ term in front of the loss, you ultimately end up rescaling something at the end of the analysis, so it's generally nicer to have it there to start with. The $1/2$ is convenient because it cancels out some annoying factors of $2$ in the analysis (e.g. when you take the derivative of the squared loss term).
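To make the role of the two constants concrete, here is a short sketch. The linear model $y = X\beta^* + \varepsilon$, the true coefficient vector $\beta^*$, the noise $\varepsilon$, and the subgradient $\hat z$ are notation assumed here for illustration; none of them appear in the question. The gradient of the scaled loss is $$ \nabla_\beta \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 \right\} = -\frac{1}{n} X^\top (y - X\beta), $$ so the $2$ from differentiating the square is absorbed by the $1/2$, and the stationarity condition for the second formulation reads $$ \frac{1}{n} X^\top (y - X\hat\beta) = \lambda \hat z, \qquad \hat z \in \partial \|\hat\beta\|_1 . $$ Substituting $y = X\beta^* + \varepsilon$ gives $$ \frac{1}{n} X^\top X (\beta^* - \hat\beta) + \frac{1}{n} X^\top \varepsilon = \lambda \hat z . $$ The $j$-th entry of $\frac{1}{n} X^\top \varepsilon$ is the average $\frac{1}{n} \sum_{i=1}^n x_{ij} \varepsilon_i$, which is exactly the kind of normalized sum that the law of large numbers and concentration inequalities handle directly. Under the first formulation the same condition would be $2 X^\top (y - X\hat\beta) = \lambda \hat z$, and the factors of $2$ and $n$ would have to be carried through by hand.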

Another way to think of this is that when doing theory, we are generally interested in the behaviour of solutions as $n$ increases -- that is, $n$ is not some fixed quantity. In practice, when we run the Lasso on some fixed dataset, $n$ is indeed fixed from the perspective of the algorithm / computations. So having the extra normalizing factor out front isn't all that helpful.

These may seem like annoying matters of convenience, but after spending enough time manipulating these kinds of inequalities, I've learned to love the $1/(2n)$.

$\endgroup$
2
  • 3
    $\begingroup$ Once you realize what those normalizing constants are for, you start seeing them everywhere. $\endgroup$ Commented May 12, 2015 at 17:47
  • $\begingroup$ Thank you for this explanation. We are so proud to read your great experiences in this domain. Thank you again $\endgroup$
    – Christina
    Commented May 18, 2015 at 13:12
