My question is very similar to this one and this one, but they haven't been answered.
Let $f \in C^2(\mathbb{R}^d, \mathbb{R})$ have compact sublevel sets and isolated critical points, and consider the gradient descent update $$ x_{k+1} = x_k-\alpha\nabla f(x_k) $$ for some fixed initial point $x_0$ and learning rate $\alpha$. If $\nabla f$ is globally $L$-Lipschitz, it is known that $x_k$ converges to a critical point of $f$ for any $0 < \alpha < 2/L$. Now suppose we drop the global Lipschitz assumption. The sublevel set $U_0 = \{ x : f(x) \leq f(x_0) \}$ is compact and $\nabla f \in C^1$, so we can define $L = \sup_{x \in U_0} \lVert \nabla^2 f(x) \rVert < \infty$ (in the spectral norm).
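To make the setup concrete, here is a numerical estimate of this $L$ on a toy problem; the choice $f(x, y) = x^4 + y^4$, the grid resolution, and the helper names are my own illustration, not part of the question. The gradient of this $f$ is not globally Lipschitz, yet $L$ is finite on every sublevel set:

```python
import numpy as np

# Toy example (my choice): f(x, y) = x^4 + y^4. Its Hessian
# diag(12x^2, 12y^2) is unbounded on R^2, so the gradient is not
# globally Lipschitz -- but it is bounded on any compact sublevel set.
def f(x):
    return np.sum(x**4, axis=-1)

def hess_spectral_norm(x):
    # ||diag(12 x_i^2)||_2 = 12 * max_i x_i^2
    return 12.0 * np.max(x**2, axis=-1)

def estimate_L(x0, n=401):
    """Grid estimate of L = sup over U0 of the Hessian spectral norm."""
    c = f(x0)
    r = c**0.25  # U0 = {f <= c} sits inside the box |x_i| <= c^(1/4)
    g = np.linspace(-r, r, n)
    X, Y = np.meshgrid(g, g)
    pts = np.stack([X.ravel(), Y.ravel()], axis=1)
    in_U0 = f(pts) <= c                       # keep only grid points in U0
    return float(np.max(hess_spectral_norm(pts[in_U0])))

x0 = np.array([1.0, 0.5])
L_est = estimate_L(x0)
L_exact = 12.0 * np.sqrt(f(x0))  # sup is attained at (c^(1/4), 0)
print(L_est, L_exact)
```

For this separable $f$ the supremum can be computed in closed form ($12\sqrt{f(x_0)}$), which makes the grid estimate easy to validate.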
I would like to prove (or disprove) that $x_k \in U_0$ for all $k$ whenever $0 < \alpha < 2/L$. This would imply that $x_k$ converges to a critical point, since $\nabla f$ is $L$-Lipschitz on $U_0$. The idea would be to prove $f(x_{k+1}) \leq f(x_k)$ and conclude by induction, using the Taylor expansion \begin{align*} f(x_{k+1}) &= f(x_k-\alpha \nabla f(x_k)) \\ &= f(x_k) - \alpha \lVert \nabla f(x_k) \rVert^2 + \frac{\alpha^2}{2}\nabla f(x_k)^T\nabla^2 f(x_k-t\alpha\nabla f(x_k))\nabla f(x_k) \end{align*} for some $t \in (0, 1)$. Now if we assume $x_k-t\alpha\nabla f(x_k) \in U_0$, we can conclude $$f(x_{k+1}) \leq f(x_k) - \alpha \lVert \nabla f(x_k) \rVert^2\left(1-\frac{\alpha L}{2}\right) \leq f(x_k)$$ for $\alpha < 2/L$, but this assumption is (almost) circular... Any ideas?
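As a quick sanity check (an experiment, not a proof), here is the iteration on the one-dimensional example $f(x) = x^4$, where $U_0 = [-x_0, x_0]$ and $L = \sup_{U_0} f'' = 12 x_0^2$ can be computed by hand; the particular $x_0$, step size, and iteration count are my own choices:

```python
# 1D sanity check with f(x) = x^4, whose second derivative 12 x^2 is
# unbounded globally but bounded by L = 12 x0^2 on U0 = [-x0, x0].
def f(x):
    return x**4

def grad(x):
    return 4 * x**3

x0 = 1.5
L = 12 * x0**2            # sup of f'' over U0
alpha = 0.95 * (2 / L)    # step size just under the 2/L threshold

x = x0
history = [x]
for _ in range(200):
    x = x - alpha * grad(x)
    history.append(x)

# Did every iterate stay in the initial sublevel set, and did f decrease?
stayed_in_U0 = all(f(xk) <= f(x0) + 1e-12 for xk in history)
monotone = all(f(history[k + 1]) <= f(history[k]) + 1e-12
               for k in range(len(history) - 1))
print(stayed_in_U0, monotone, history[-1])
```

In this one example both properties hold and the iterates creep toward the critical point at $0$ (slowly, since $f$ is flat there), but of course a single trajectory says nothing about the general claim.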