
I'm looking at the mean squared error loss function with gradient descent in machine learning. I'm building a single-neuron network (perceptron) with a linear output. For example:

Input × Weight + Bias → linear activation → output.

Let's say the output is 40 while I expect 20. That means gradient descent has to adjust the weight and bias so that the output moves from 40 towards 20.

What I don't understand about mean squared error + gradient descent is: why is this number 40 displayed as a point on a parabola?

Does this parabola represent all possible outcomes? Why isn't it just a line? How do I know where on the parabola the point "40" is?

[image of a parabola]

  • Are you using a specific loss function here? Please use edit to explain which one. I expect it is MSE, and if so that should cover everything needed to answer your question. (Commented Apr 18, 2021 at 9:38)
  • Every minimum point can be approximated by a parabola. – Kostya (Apr 18, 2021 at 10:18)
  • @Kostya: Yes, but you probably would not draw a parabola for e.g. $\mathcal{L}(\hat{y}, y) = |\hat{y} - y|$. (Commented Apr 18, 2021 at 12:02)
  • I suspect the illustration isn't meant to be taken literally; instead, I suspect the author is intending to illustrate that gradient descent attempts to solve the minimization problem by moving downward toward what is (hopefully) a global minimum on some complex surface. (Commented Apr 19, 2021 at 14:07)
  • I added "mean squared error" to the question for clarity. The question is still why all the possible "wrong" loss values (say, 40, 80, 14...) would happen to be points on a parabola. That connection is not explained in most tutorials. – Kokodoko (Apr 19, 2021 at 21:39)

1 Answer


Mean squared error (MSE) is a quadratic function: the further your network's output is from the optimum, the larger the MSE gets, and it grows quadratically. Take $o_{expected}=20$ and $o_{net}=40$ as an example. Your MSE is then 400, because $MSE = (o_{expected}-o_{net})^2 = (20-40)^2 = 400$.

Just imagine $y = x^2$, with $x$ being the output of your network. If you want to shift the parabola so its minimum is at $20$, the formula becomes $y = (20-x)^2$. For every new case you train the net on, you get a different parabola with different parameters.
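To make this concrete, here is a minimal sketch (plain Python/NumPy, my own illustration rather than code from the question) that tabulates the squared error for a few hypothetical network outputs. It shows that the loss traces the parabola $y = (20-x)^2$, and that an output of 40 sits at the point $(40, 400)$ on it:

```python
# Minimal sketch (assumes NumPy is available): tabulate the squared error
# for a range of hypothetical network outputs. The target 20 and the
# current output 40 come from the example in the question.
import numpy as np

expected = 20.0                        # the value we want the neuron to output
outputs = np.linspace(0.0, 40.0, 9)    # hypothetical outputs: 0, 5, 10, ..., 40
losses = (expected - outputs) ** 2     # squared error for each output

for o, loss in zip(outputs, losses):
    note = "  <- current output from the example" if o == 40.0 else ""
    print(f"output = {o:5.1f}   loss = {loss:6.1f}{note}")

# The printed values trace the parabola y = (20 - x)^2: the loss is 0 at
# output = 20 and grows quadratically on either side, so an output of 40
# sits at the point (40, 400) on that curve.
```

So the parabola in the tutorials is just this picture: every possible output corresponds to one point on the curve, and gradient descent moves that point downhill towards the minimum at the target value.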

