
I am puzzled about why Newton's method is not used more widely for backpropagation, instead of, or in addition to, gradient descent.

I have seen this same question, and the widely accepted answer claims:

Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative

I went and looked. According to the Wikipedia article on Newton's method:

Geometrically, $(x_1, 0)$ is the intersection of the $x$-axis and the tangent of the graph of $f$ at $(x_0, f(x_0))$. The process is repeated until a sufficiently accurate value is reached

I really don't get where and why the second derivative should ever be calculated.

I also saw this similar question, and the accepted answer, in short, was:

the reason is that the cost functions mentioned might not have any zeroes at all, in which case Newton's method will fail to find the minima

This sounds very similar to the problem of vanishing gradients in gradient descent, and would probably admit roughly the same workarounds; in any case, it still doesn't explain why the second derivative is required.

Please explain why the calculation of the second derivative is needed in order to apply Newton's method to back-propagation.

  • Short answer is that you should be using Newton's method to find roots of the derivative of a function: en.wikipedia.org/wiki/Newton%27s_method_in_optimization#
    – ExabytE
    Commented Nov 9, 2018 at 15:43
  • It's definitional; Newton's method uses second derivatives. If you don't use second derivatives, it's not Newton's method. That was the innovation of Newton's method: using second derivatives to accelerate convergence for many problems.
    – jbowman
    Commented Nov 9, 2018 at 16:02

2 Answers


My guess at your confusion:

Newton's method is often used to solve two different (but related) problems:

  1. Find $x$ such that $f(x) = 0$
  2. Find $x$ to minimize $g(x)$

A connection between the two problems: when $g'(x)=0$ solves the minimization problem

If $g$ is continuous and differentiable, a necessary condition for an optimum of the unconstrained minimization problem (2) is that the derivative $g'(x) = 0$. Let $f(x) = g'(x)$. If the first-order condition $g'(x) = 0$ is also a sufficient condition for an optimum (e.g. $g$ is convex), then (1) and (2) are the same problem.

Applying Newton's method, the update step for problem (1) is: $$ x_{t+1} = x_{t} - \frac{f(x_{t})}{f'(x_{t})}$$ The update step for problem (2) is: $$ x_{t+1} = x_{t} - \frac{g'(x_{t})}{g''(x_{t})}$$

As written, the update step for problem (2) involves a second derivative while the update step for problem (1) involves only a first derivative, but these are exactly the same update step if $f = g'$.
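To see this concretely, here is a minimal numerical sketch (not part of the original argument; the test function $g(x) = x^4 - 3x^2$ and the starting point are arbitrary choices):

```python
# Newton root-finding on f = g' vs. Newton minimization of g.
# g(x) = x^4 - 3x^2 is an arbitrary test function, so g'(x) = 4x^3 - 6x.

def g1(x):  # g'(x); this is the f of root-finding problem (1)
    return 4 * x**3 - 6 * x

def g2(x):  # g''(x); this is f'(x), since f = g'
    return 12 * x**2 - 6

x = 2.0  # arbitrary starting guess
for _ in range(8):
    x_root = x - g1(x) / g2(x)  # problem (1): x - f(x)/f'(x), with f = g'
    x_min = x - g1(x) / g2(x)   # problem (2): x - g'(x)/g''(x)
    assert x_root == x_min      # identical formulas, identical iterates
    x = x_min

print(x, g1(x))  # x is near sqrt(3/2), a minimum of g, and g'(x) is near 0
```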

In the optimization context, the Newton update step can be interpreted as creating a quadratic approximation of $g$ around point $x_t$. For each step $t$, create a function $\hat{g}_t$ that's the step $t$ quadratic approximation of $g$:

$$\hat{g}_t(x) = g(x_t) + g'(x_t)(x - x_t) + \frac{1}{2}g''(x_t)(x - x_t)^2$$ then choose $x_{t+1}$ to minimize $\hat{g}_t$.

Approximating $g$ as a quadratic is equivalent to approximating $g'$ as a line.

Let $f = g'$. Then update step $t$ above is exactly the same as the canonical description of Newton's method: approximate $f$ as $\hat{f}_t (x) = f(x_t) + f'(x_t)(x - x_t)$ and set $x_{t+1}$ to be the value of $x$ where $\hat{f}_t$ crosses $0$.
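As a sanity check on the quadratic-model interpretation, here is a hedged sketch (same arbitrary test function as above) confirming that the Newton step is the minimizer of the model $\hat{g}_t$:

```python
import numpy as np

def g(x):  return x**4 - 3 * x**2    # arbitrary test function
def g1(x): return 4 * x**3 - 6 * x   # g'
def g2(x): return 12 * x**2 - 6      # g''

def g_hat(x, xt):
    """Quadratic (second-order Taylor) model of g around xt."""
    return g(xt) + g1(xt) * (x - xt) + 0.5 * g2(xt) * (x - xt)**2

xt = 2.0                       # current iterate (arbitrary)
x_next = xt - g1(xt) / g2(xt)  # Newton step

# No point on a dense grid beats the Newton step under the model:
grid = np.linspace(xt - 2.0, xt + 2.0, 100001)
assert g_hat(x_next, xt) <= g_hat(grid, xt).min() + 1e-9
```

Note that the step only minimizes $\hat{g}_t$ when $g''(x_t) > 0$, i.e. when the quadratic model is convex; otherwise it heads for a maximum or saddle point of the model.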


Newton's method for optimization approximates the curve with a parabola, i.e. a second-degree polynomial $$f(x)\approx a+b(x-x_t)+\frac c2(x-x_t)^2$$ around the current guess $x_t$. If you look at the derivatives, you get $f'(x_t)=b$ and $f''(x_t)=c$. You could argue that the parabola approximation itself is rooted in the Taylor expansion $$f(x)=f(0)+f'(0)x+\frac{f''(0)}{2!}x^2+\dots$$ That's all there is to it, really.
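For what it's worth, here is a tiny sketch checking the identification $f'(x_t)=b$, $f''(x_t)=c$ by finite differences (the constants $a$, $b$, $c$, $x_t$ are arbitrary illustrative values):

```python
# p(x) = a + b*(x - xt) + (c/2)*(x - xt)**2; verify p'(xt) = b, p''(xt) = c.
a, b, c, xt, h = 1.0, -2.0, 3.0, 0.5, 1e-5  # arbitrary values

def p(x):
    return a + b * (x - xt) + 0.5 * c * (x - xt)**2

d1 = (p(xt + h) - p(xt - h)) / (2 * h)           # central difference ~ p'(xt)
d2 = (p(xt + h) - 2 * p(xt) + p(xt - h)) / h**2  # second difference ~ p''(xt)
print(d1, d2)  # approximately -2.0 and 3.0, i.e. b and c
```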

Now, let's see how this helps with our problem:

$$x_{\min}=\operatorname*{argmin}_x f(x)$$ The minimum is where $f'(x_{\min})=0$. Using the parabola approximation, this condition becomes $$b+c(\hat x_{\min}-x_t)=0,$$ which gives $$\hat x_{\min}=x_t-b/c.$$ Substituting the derivatives $b=f'(x_t)$ and $c=f''(x_t)$, we get the next guess $$x_{t+1}=x_t-\frac{f'(x_t)}{f''(x_t)}$$
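A short iteration sketch of this update (the strictly convex test function $f(x) = (x-1)^4 + x^2$ is an arbitrary choice; since $f''(x) = 12(x-1)^2 + 2 > 0$ everywhere, each step targets a minimum of the local parabola):

```python
# Iterate x_{t+1} = x_t - f'(x_t)/f''(x_t) on an arbitrary convex test function.

def f1(x): return 4 * (x - 1)**3 + 2 * x  # f'
def f2(x): return 12 * (x - 1)**2 + 2     # f'', positive for all x

x = 3.0  # arbitrary starting guess
for t in range(10):
    x = x - f1(x) / f2(x)
    print(t, x, f1(x))  # f'(x) shrinks toward 0 as x approaches the minimizer
```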

  • But shouldn't we check if the second derivative is positive?
    – ado sar
    Commented Apr 5, 2023 at 8:18
