16

I am wondering about the difference between them. Basically they do the same job in the end, finding the coefficients of the parameters, but they just look different in the way the coefficients are found. To me, the least squares method seems to use differentiation and matrix form to find the coefficients, while the pseudo-inverse seems to use matrix manipulation only, so how can I state the difference between them? Or is there no difference at all?

1
  • The Moore-Penrose pseudo-inverse matrix, by definition, provides a least squares solution. But the concept of least squares can also be derived from maximum likelihood estimation under a normal model. Commented Mar 10, 2017 at 9:27

4 Answers

15

In the context of linear regression, 'least squares' means that we want to find the coefficients that minimize the squared error. It doesn't specify how this minimization should be performed, and there are many possibilities. Multiplying the response vector by the Moore-Penrose pseudoinverse of the regressor matrix is one way to do it, and is therefore one approach to least squares linear regression (as others have pointed out).

Differences between methods can arise when the regressor matrix does not have full rank. This can happen, for example, when the number of variables exceeds the number of data points. In this case, there are infinitely many choices of optimal coefficients. Methods differ in how they choose one solution out of this infinite set. The distinguishing characteristic of the pseudoinverse method in this situation is that it returns the solution with minimum $\ell_2$ norm.
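To make the minimum-norm property concrete, here is a small NumPy sketch (my own toy example with made-up data, not part of the original answer; the only library routines assumed are `np.linalg.pinv` and `np.linalg.lstsq`). It builds a regressor matrix with more variables than data points, so the least squares problem has infinitely many solutions, and checks that the pseudoinverse picks the one with the smallest $\ell_2$ norm:

```python
import numpy as np

rng = np.random.default_rng(0)

# More variables (10) than data points (5): X cannot have full column rank,
# so there are infinitely many coefficient vectors with the same squared error.
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)

# Least squares solution via the Moore-Penrose pseudoinverse.
w_pinv = np.linalg.pinv(X) @ y

# lstsq also returns the minimum-norm least squares solution, so it should agree.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding any null-space direction of X gives another equally good solution ...
e0 = np.eye(10)[0]
null_dir = e0 - np.linalg.pinv(X) @ X @ e0   # component of e0 in the null space of X
w_other = w_pinv + null_dir

print(np.allclose(X @ w_pinv, X @ w_other))               # True: identical fitted values
print(np.linalg.norm(w_pinv) < np.linalg.norm(w_other))   # True: pinv solution has smaller norm
print(np.allclose(w_pinv, w_lstsq))                       # True: matches lstsq's min-norm solution
```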

2
  • This is the right answer, but I would be more specific: it returns the minimum L2 norm solution. There are infinitely many ways you could define your norm, and it is important to note that the solution will not be the best one in, for example, the L0 or L-infinity norm sense.
    – boomkin
    Commented Aug 10, 2018 at 9:56
  • Very true. I meant L2 implicitly, but edited to be more specific, as you suggest.
    – user20160
    Commented Aug 10, 2018 at 10:42
4

It depends on what you mean by "differentiation techniques". There are two methods that I could understand by that:

  • Use differentiation to derive the gradient, then perform gradient descent on the error surface. However, this would be rather unusual for linear regression (but not for other types of regression).

  • Use differentiation to derive the gradient, then use that to analytically determine a minimum by setting the gradient to zero.

The first method is very different from the pseudo-inverse. The second is not. If you perform the differentiation and solve the equation resulting from setting the gradient to zero, you will get exactly the pseudo-inverse as a general solution.

If you think about this, it makes a lot of sense. If different techniques led to different coefficients, it would be hard to tell which ones are correct. If they produce the same coefficients, it should also be the case that you can derive the equations used for one method from those of the other.
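As a concrete check of this claim, here is a short NumPy sketch (my own toy data, not from the answer) that fits the same full-rank problem twice, once by gradient descent on the squared-error surface and once via the pseudo-inverse; with a small enough step size and enough iterations the two coefficient vectors coincide to numerical tolerance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy full-rank regression problem: 100 points, intercept column plus 3 features.
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

# Method 1: gradient descent on the error surface f(w) = ||y - Xw||^2.
w_gd = np.zeros(4)
lr = 0.001
for _ in range(50_000):
    grad = -2 * X.T @ (y - X @ w_gd)   # gradient of the sum of squared errors
    w_gd -= lr * grad

# Method 2: set the gradient to zero analytically -> pseudo-inverse solution.
w_pinv = np.linalg.pinv(X) @ y

print(np.allclose(w_gd, w_pinv))   # True: both methods yield the same coefficients
```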

4

As has been pointed out in the other answers, multiplying by the pseudoinverse is one of the ways of obtaining a least squares solution.

It is easy to see why. Let us say you have $k$ points in $n$-dimensional space, collected in a design matrix whose leading column of ones accounts for the intercept: $$X = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ 1 & x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{k1} & x_{k2} & x_{k3} & \dots & x_{kn} \end{bmatrix}$$

Let each corresponding point have a value in $Y$: $$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix}$$

You want to find a set of weights (including the intercept weight $w_0$ that multiplies the column of ones) $$W = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}$$

such that the squared error between $XW$ and $Y$ is minimized; that is, the least squares solution $\min_W f(W)$, where $f(W) = (Y-XW)^T(Y-XW)$ (you can easily check that $f(W)$ is the sum of squared errors).

We do that by finding the derivative of $f(W)$ with respect to $W$ and setting it to $0$:

$$\frac{\partial f}{\partial W} = \frac{\partial}{\partial W}\left[(Y-XW)^T(Y-XW)\right] = \frac{\partial}{\partial W}\left(Y^TY - W^TX^TY - Y^TXW + W^TX^TXW\right) = \frac{\partial}{\partial W}\left(Y^TY - 2Y^TXW + W^TX^TXW\right) = -2Y^TX + 2W^TX^TX$$

Here the two middle terms were combined because $W^TX^TY$ is a scalar and therefore equals its own transpose $Y^TXW$.

Setting the derivative to $0$:

$$2W^TX^TX = 2Y^TX$$ $$X^TXW = X^TY$$ $$(X^TX)^{-1}X^TXW = (X^TX)^{-1}X^TY$$ $$W = (X^TX)^{-1}X^TY$$

So in this way we derive the pseudo-inverse matrix $X^+ = (X^TX)^{-1}X^T$ as the solution to the least squares problem (provided $X^TX$ is invertible, i.e. $X$ has full column rank).
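As a quick numerical sanity check of this derivation (an illustrative NumPy sketch with arbitrary random data, not part of the original answer), the normal-equation solution $(X^TX)^{-1}X^TY$ does coincide with $X^+Y$ computed by `np.linalg.pinv` when $X$ has full column rank:

```python
import numpy as np

rng = np.random.default_rng(42)

# Tall, full-column-rank design matrix with a leading column of ones, as above.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
Y = rng.normal(size=50)

# Normal-equation solution W = (X^T X)^{-1} X^T Y from the derivation.
W_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Pseudo-inverse solution W = X^+ Y.
W_pinv = np.linalg.pinv(X) @ Y

print(np.allclose(W_normal, W_pinv))   # True: the two expressions agree
```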

3

The pseudo-inverse solution is based on the least squares error, as Łukasz Grad pointed out. That is, you are actually solving the minimization problem

$E(W) =\frac{1}{2}\sum \left(y^{(i)}-W ^Tx^{(i)}\right)^2$

by differentiating the error with respect to $W$. Then you get the solution $W = \left(X^TX\right)^{-1}X^TY$. (Note that the pseudo-inverse is not an inverse, so you cannot interpret the solution as $X^{-1}Y$, which might seem to follow from $XW = Y$ by direct matrix manipulation. How to compute the pseudo-inverse is another topic.)
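To see that distinction numerically (again only an illustrative NumPy sketch, not part of the original answer): for a tall, full-column-rank $X$ the pseudo-inverse equals $(X^TX)^{-1}X^T$ and acts as a left inverse of $X$, but not as a two-sided inverse, and an ordinary $X^{-1}$ does not even exist for a non-square matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 3))   # non-square, full column rank (almost surely)

X_pinv = np.linalg.pinv(X)

# The pseudo-inverse matches the closed form (X^T X)^{-1} X^T here.
print(np.allclose(X_pinv, np.linalg.inv(X.T @ X) @ X.T))   # True

# It is a left inverse of X ...
print(np.allclose(X_pinv @ X, np.eye(3)))    # True

# ... but not a right inverse, so it is not an ordinary matrix inverse.
print(np.allclose(X @ X_pinv, np.eye(20)))   # False
```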

If you are asking about the covariance-based solution $W = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}$, it can be interpreted as a direct solution based on the linear relation between $X$ and $Y$. Actually this solution is also strictly deduced from the least squares error, and the difference from the pseudo-inverse one is nonessential. It is still the pseudo-inverse solution, but it uses the knowledge that your line will definitely go through the point of mean values $(\bar{X},\bar{Y})$. So the error measure can be rewritten as

$E(W) =\frac{1}{2}\sum \left((y^{(i)}-\bar{y})-W ^T(x^{(i)}-\bar{x})\right)^2$

When you use $x-\bar{x}$ to represent $x$ and $y-\bar{y}$ to represent $y$, your solution with the pseudo-inverse is the same as the one with the covariance. The difference is that now you have to compute the intercept separately, because by subtracting the mean values of $x$ and $y$ you virtually center the coordinates at $(\bar{x}, \bar{y})$ and your line passes through it, hence the intercept is zero. You have to map the new coordinate system back to the original one by computing the intercept as $w_{0} = \bar{y} -W^{T}\bar{x}$.
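The equivalence is easy to verify numerically for simple one-variable regression; the sketch below (my own toy data, not part of the original answer) compares the covariance formula with the separately computed intercept against the pseudo-inverse fit on the design matrix $[1, x]$:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.5 * x + 4.0 + 0.2 * rng.normal(size=200)

# Covariance-based solution: slope from cov/var, intercept from the means.
w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # bias=True matches np.var's normalization
w0 = y.mean() - w1 * x.mean()

# Pseudo-inverse solution on the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
w0_pinv, w1_pinv = np.linalg.pinv(X) @ y

print(np.allclose([w0, w1], [w0_pinv, w1_pinv]))   # True: same fitted line
```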

