I am puzzled about why Newton's method is not used more widely for backpropagation, instead of, or in addition to, gradient descent.
I have seen this same question, and the widely accepted answer claims:
Newton's method, a root finding algorithm, maximizes a function using knowledge of its second derivative
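To make that claim concrete, here is my own sketch (not taken from the linked answer) of what a Newton-style *optimization* step would look like: the root-finding update applied to the derivative f', which is where a second derivative f'' shows up. The example function is an illustrative choice of mine.

```python
# Sketch of a Newton optimization step: a root-finding step on f',
# so it needs both f' and f''. Not from the linked answer.

def newton_opt_step(df, d2f, x):
    """One update x - f'(x) / f''(x), i.e. Newton root-finding on f'."""
    return x - df(x) / d2f(x)

# Example: minimize f(x) = (x - 3)^2, so f'(x) = 2(x - 3), f''(x) = 2.
x = 0.0
for _ in range(5):
    x = newton_opt_step(lambda t: 2.0 * (t - 3.0), lambda t: 2.0, x)
# For a quadratic, a single step already lands on the minimizer x = 3.
```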
I went and looked it up. According to the Wikipedia article on Newton's method:
Geometrically, (x1, 0) is the intersection of the x-axis and the tangent of the graph of f at (x0, f(x0)). The process is repeated until a sufficiently accurate value is reached.
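The description above can be sketched as a minimal root finder; the function `f` and its derivative `df` below are illustrative choices of mine, not anything from the question or the Wikipedia article.

```python
# Minimal sketch of Newton's method as a root finder, following the
# geometric description quoted above: each step moves to where the
# tangent line at the current point crosses the x-axis.

def newton_root(f, df, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) until |f(x)| < tol."""
    x = x0
    for _ in range(max_iter):
        if abs(f(x)) < tol:
            break
        x = x - f(x) / df(x)
    return x

# Example: find the positive root of f(x) = x^2 - 2, i.e. sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Note that this iteration only ever uses f and its first derivative, which is exactly the source of my confusion.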
I really don't see where or why the second derivative would ever need to be calculated.
I also saw this similar question, and the accepted answer, in short, was:
the reason is that the cost functions mentioned might not have any zeroes at all, in which case Newton's method will fail to find the minima
This sounds much like the vanishing-gradient problem in gradient descent, which presumably has similar workarounds, and it still doesn't explain why the second derivative is required.
Please explain why the calculation of the second derivative is needed in order to apply Newton's method to back-propagation.