I'm studying the book "Introduction to Statistical Learning", and in Chapter 6, "Linear Model Selection and Regularization", there is a short section on the "Bayesian Interpretation for Ridge Regression and the Lasso" whose reasoning I don't understand.
A Bayesian viewpoint for regression assumes that the coefficient vector $\beta$ has some prior distribution, say $p(\beta)$, where $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top$. The likelihood of the data can be written as $f(Y|X, \beta)$, where $X = (X_1, X_2, \dots, X_p)$. Multiplying the prior distribution by the likelihood gives us (up to a proportionality constant) the posterior distribution, which takes the form $$p(\beta|X, Y) \propto f(Y|X,\beta)p(\beta|X) = f(Y|X, \beta)p(\beta),$$ where the proportionality above follows from Bayes’ theorem, and the equality above follows from the assumption that $X$ is fixed.
The authors assume the linear model $$Y = \beta_0 + X_1\beta_1 + \dots + X_p\beta_p + \epsilon,$$ and suppose the errors are independent and drawn from a normal distribution. There is also an assumption that $p(\beta) = \prod_{j=1}^p g(\beta_j)$ for some density function $g$.
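Written out explicitly, as a sketch under these assumptions: take $n$ observations, let $y_i$ be the response and $x_{ij}$ the value of predictor $j$ for observation $i$, and let the errors have mean zero and variance $\sigma^2$. The symbols $n$, $y_i$, $x_{ij}$ and $\sigma^2$, as well as the mean-zero assumption, are introduced here and are not taken from the passage above. Independent normal errors would then give the likelihood $$f(Y|X,\beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2}{2\sigma^2}\right),$$ so the posterior is proportional to this product multiplied by $\prod_{j=1}^p g(\beta_j)$.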
From these assumptions, the book then states:
It turns out that ridge regression and the lasso follow naturally from two special cases of $g$:
- If $g$ is a Gaussian distribution with mean zero and standard deviation a function of $\lambda$, then it follows that the posterior mode for $\beta$ (that is, the most likely value for $\beta$, given the data) is given by the ridge regression solution. (In fact, the ridge regression solution is also the posterior mean.)
- If $g$ is a double-exponential (Laplace) distribution with mean zero and scale parameter a function of $\lambda$, then it follows that the posterior mode for $\beta$ is the lasso solution. (However, the lasso solution is not the posterior mean, and in fact, the posterior mean does not yield a sparse coefficient vector.)
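The two claims can at least be checked numerically. Below is a minimal sketch in pure Python for a single predictor with no intercept. Everything in it is an assumption made for illustration and not taken from the book: the simulated data, the values of `sigma`, `tau` and `b`, and the helper names `neg_log_posterior` and `argmin_golden`. It also assumes the correspondences $\lambda = \sigma^2/\tau^2$ for a Gaussian prior with standard deviation $\tau$ and $\lambda = 2\sigma^2/b$ for a Laplace prior with scale $b$, obtained by comparing the negative log posterior with the residual sum of squares plus a penalty.

```python
import math
import random

random.seed(0)

# Simulated data: one predictor, no intercept (all values are illustrative).
sigma = 1.0   # error standard deviation, assumed known
tau = 0.5     # standard deviation of the Gaussian prior on beta (assumption)
b = 0.1       # scale of the Laplace prior on beta (assumption)
x = [random.gauss(0, 1) for _ in range(50)]
y = [0.8 * xi + random.gauss(0, sigma) for xi in x]

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))


def neg_log_posterior(beta, neg_log_prior):
    """Negative log posterior of beta, dropping terms that do not depend on beta."""
    rss = sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
    return rss / (2 * sigma ** 2) + neg_log_prior(beta)


def argmin_golden(f, lo=-10.0, hi=10.0, tol=1e-9):
    """Golden-section search for the minimiser of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, c_hi = lo, hi
    while c_hi - a > tol:
        c = c_hi - inv_phi * (c_hi - a)
        d = a + inv_phi * (c_hi - a)
        if f(c) < f(d):
            c_hi = d   # minimiser lies in [a, d]
        else:
            a = c      # minimiser lies in [c, c_hi]
    return (a + c_hi) / 2


# Posterior mode under a Gaussian prior N(0, tau^2) versus the ridge solution,
# which minimises RSS + lam * beta^2 with lam = sigma^2 / tau^2.
ridge_mode = argmin_golden(
    lambda beta: neg_log_posterior(beta, lambda t: t ** 2 / (2 * tau ** 2)))
ridge_closed = sxy / (sxx + sigma ** 2 / tau ** 2)

# Posterior mode under a Laplace prior with scale b versus the lasso solution,
# which minimises RSS + lam * |beta| with lam = 2 * sigma^2 / b
# (soft-thresholding of sxy at lam / 2 = sigma^2 / b).
lasso_mode = argmin_golden(
    lambda beta: neg_log_posterior(beta, lambda t: abs(t) / b))
threshold = sigma ** 2 / b
lasso_closed = math.copysign(max(abs(sxy) - threshold, 0.0), sxy) / sxx

print("ridge:", ridge_mode, ridge_closed)
print("lasso:", lasso_mode, lasso_closed)
```

In this toy setting the numerically located posterior modes should agree with the closed-form ridge and lasso solutions up to the tolerance of the search. That only confirms the statements for one simulated example; it is not a derivation.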
I don't understand why ridge regression and the lasso follow from these two special cases of $g$, or why those particular distributions are the ones that arise. Could anyone please help explain this? Thank you so much in advance for all your help!