Why is the second derivative required for newton's method for back-propagation?

Question

I am troubled with why isn't the Newton's method used for backpropagation, instead, or in addition to Gradient Descent more widely.

I have seen this same question, and the widely accepted answer claims

Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative

I went and looked - According to Newton's Method from wikipedia,

Geometrically, (x1, 0) is the intersection of the x-axis and the tangent of the graph of f at (x0, f (x0)). The process is repeated until a sufficiently accurate value is reached

I really don't get where and why the second derivative should ever be calculated.

I also saw this similar question, and the accepted answer in short was:

the reason is that the cost functions mentioned might not have any zeroes at all, in which case Newton's method will fail to find the minima

This seems very similar to a similar problem of vanishing gradients in gradient descent, and probably would have about the same solutions, and still doesn't explain why the second derivative is required.

Please explain why is the calculation of the second derivative needed in order to calculate the Newton's method for back-propagation

Short answer is that you should be using newton's method to find roots of a derivative of a function en.wikipedia.org/wiki/Newton%27s_method_in_optimization# — ExabytE, Commented Nov 9, 2018 at 15:43
It's definitional; Newton's method uses second derivatives. If you don't use second derivatives, it's not Newton's method. That was the innovation of Newton's method - using the second derivatives to accelerate convergence for many problems. — jbowman, Commented Nov 9, 2018 at 16:02

Matthew Gunn · Accepted Answer · 2022-10-18 14:55:15Z

My guess at your confusion:

Newton's method is often used to solve two different (but related) problems:

Find $x$ such that $f(x) = 0$
Find $x$ to minimize $g(x)$

A connection between the two problems if $g'(x)=0$ solves the minimization problem

If $g$ is continuous and differentiable, a necessary condition for an optimum to unconstrained minimization problem (2) is that the derivative $g'(x) = 0$. Let $f(x) = g'(x)$. If the first order condition $g'(x) = 0$ is also a sufficient condition for an optimum (eg. $g$ is convex), then (1) and (2) are the same problem.

Applying Newton's method, the update step for problem (1) is: $$ x_{t+1} = x_{t} - \frac{f(x_{t})}{f'(x_{t})}$$ The update step for problem (2) is: $$ x_{t+1} = x_{t} - \frac{g'(x_{t})}{g''(x_{t})}$$

As written, the update step for problem (2) has a 2nd derivative while the update step for problem one (1) only has a first derivative, but these are exactly the same update step if $f = g'$.

In the optimization context, the Newton update step can be interpreted as creating a quadratic approximation of $g$ around point $x_t$. For each step $t$, create a function $\hat{g}_t$ that's the step $t$ quadratic approximation of $g$:

$$\hat{g}_t(x) = g(x_t) + g'(x_t)(x - x_t) + \frac{1}{2}g''(x_t)(x - x_t)^2$$ then choose $x_{t+1}$ to minimize $\hat{g}_t$.

Approximating $g$ as a quadratic is equivalent to approximating $g'$ as a line.

Let $f = g'$. Then update step $t$ above is exactly the same as the canonical description of Newton Method: approximate $f$ as $\hat{f}_t (x) = f(x_t) + f'(x_t)(x - x_t)$ and set $x_{t+1}$ to be the value of $x$ where $\hat{f}_t$ crosses $0$.

Aksakal · Accepted Answer · 2022-10-18 15:40:47Z

1

Newton method for optimization approximates the curve with parabola, or a second degree polynomial $$f(x)=a+b(x-x_t)+\frac c2(x-x_t)^2$$ around the current guess $x_t$. If you look at the derivatives, you get $f'(x_t)=b$ and $f''(x_t)=c$. You could argue that a parabola approximation itself is rooted in Taylor approximation $$f(x)=f(0)+f'(x)x+\frac{f''(x)} {2!}x^2+\dots$$ That's all to it, really.

Now, let's see how does this help in our problem:

$$x_{min}=\mathrm{argmin} f(x)$$ The minimum is where $f'(x_{min})=0$, i.e. $$f'(x_{min})=b+c(\hat x_{min}-x_t)=0$$ $$\hat x_{min}=x_t-b/c$$ Using the derivatives we get the next guess $$x_{t+1}=x_t-\frac{f'(x_t)}{f''(x_t)}$$

answered Oct 18, 2022 at 15:40

Aksakal

61.9k6 gold badges104 silver badges204 bronze badges

$\begingroup$ But shouldn't we check if the second derivative is positive? $\endgroup$
– ado sar
Commented Apr 5, 2023 at 8:18

Add a comment |

Stack Exchange Network

Why is the second derivative required for newton's method for back-propagation?

2 Answers 2

My guess at your confusion:

A connection between the two problems if $g'(x)=0$ solves the minimization problem

Not the answer you're looking for? Browse other questions tagged
machine-learning
optimization
gradient-descent
backpropagation
or ask your own question.

Linked

Hot Network Questions

Why is the second derivative required for newton's method for back-propagation?

2 Answers 2

My guess at your confusion:

A connection between the two problems if $g'(x)=0$ solves the minimization problem

Not the answer you're looking for? Browse other questions tagged machine-learningoptimizationgradient-descentbackpropagation or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
machine-learning
optimization
gradient-descent
backpropagation
or ask your own question.