Gradient descent with constant learning rate for a convex function of one variable
Setup
This page includes a discussion of gradient descent with constant learning rate for a convex function of one variable.
It is similar to gradient descent with constant learning rate for a quadratic function of one variable.
Learning algorithm
Suppose <math>\alpha</math> is a positive real number. The gradient descent with constant learning rate <math>\alpha</math> is an iterative learning algorithm with update rule:

<math>x^{(n+1)} = x^{(n)} - \alpha f'(x^{(n)})</math>
Here, the superscripts denote the stages of iteration. Parenthesized superscripts do not denote exponents. We do not use subscripts, because we want to keep notation consistent with gradient descent for functions of multiple variables, and in that case, subscripts are reserved for coordinates and superscripts for iterates.
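To make the update rule concrete, here is a minimal Python sketch. The helper name and the illustrative convex function <math>f(x) = x^2 + x</math> (with derivative <math>2x + 1</math> and minimum at <math>x = -1/2</math>) are assumptions chosen for this example, not part of the discussion above.

```python
def gradient_descent_constant_rate(f_prime, x0, alpha, num_iterations):
    """Iterate x^(n+1) = x^(n) - alpha * f'(x^(n)) with a fixed learning rate alpha."""
    x = x0
    for _ in range(num_iterations):
        x = x - alpha * f_prime(x)
    return x

# Illustrative example: f(x) = x^2 + x, so f'(x) = 2x + 1 and the minimum is at x = -1/2.
approx_min = gradient_descent_constant_rate(f_prime=lambda x: 2 * x + 1,
                                            x0=3.0, alpha=0.1, num_iterations=100)
print(approx_min)  # close to -0.5
```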
Convergence properties based on the learning rate
Local convergence properties in the case of a unique minimum where the order of the zero of the derivative is one (assuming twice differentiability)
The behavior here is qualitatively similar to that for gradient descent with constant learning rate for a quadratic function of one variable, with a few differences. Denote by <math>x^*</math> the point of absolute minimum. We have <math>f'(x^*) = 0</math>, and our assumption is that <math>f''(x^*) \ne 0</math>. Note that, by convexity, we must have <math>f''(x^*) > 0</math>. The quadratic function that locally approximates the behavior of <math>f</math> is its degree two Taylor polynomial at <math>x^*</math>, explicitly given as:

<math>x \mapsto f(x^*) + f'(x^*)(x - x^*) + \frac{f''(x^*)}{2}(x - x^*)^2</math>
Since <math>f'(x^*) = 0</math>, we get:

<math>x \mapsto f(x^*) + \frac{f''(x^*)}{2}(x - x^*)^2</math>
In particular, if we write this quadratic as <math>ax^2 + bx + c</math>, then <math>a = \frac{f''(x^*)}{2}</math>, so the thresholds <math>\frac{1}{2a}</math> and <math>\frac{1}{a}</math> from the quadratic case become <math>\frac{1}{f''(x^*)}</math> and <math>\frac{2}{f''(x^*)}</math> respectively. We can now make cases based on <math>\alpha</math>.
Case on <math>\alpha</math> | Does the sequence converge to a minimum? How does it behave? | Order and rate of convergence on domain side | Order and rate of convergence on function value side | How the quadratic case is special
---|---|---|---|---
<math>\alpha = 0</math> | No. The sequence stays stuck at the initial point. | -- | -- | --
<math>0 < \alpha < \frac{1}{f''(x^*)}</math> | Yes, from sufficiently close to the point of minimum | linear convergence with convergence rate <math>1 - \alpha f''(x^*)</math>. The convergence pattern is eventually monotone. | Linear convergence with convergence rate <math>(1 - \alpha f''(x^*))^2</math> | In the quadratic case, the precise convergence rate is realized at every iteration.
<math>\alpha = \frac{1}{f''(x^*)}</math> | Yes, from sufficiently close to the point of minimum | The order of convergence is the order of zero at <math>x^*</math> of the first-degree residual at <math>x^*</math> of <math>f'</math> (i.e., <math>f'</math> minus its first-degree Taylor polynomial). This is also one less than the order of zero at <math>x^*</math> of <math>f</math> minus its second-degree Taylor polynomial. | | In the quadratic case, we'd reach the optimum after one iteration (the residual would be zero, so the order of convergence would work out to <math>\infty</math>, which is consistent with reaching the optimum after finitely many iterations).
<math>\frac{1}{f''(x^*)} < \alpha < \frac{2}{f''(x^*)}</math> | Yes, from sufficiently close to the point of minimum | linear convergence with convergence rate <math>\alpha f''(x^*) - 1</math>. The convergence pattern is eventually oscillatory, i.e., successive iterates are on opposite sides of the optimal value. | | In the quadratic case, the precise convergence rate is realized at every iteration.
<math>\alpha = \frac{2}{f''(x^*)}</math> | Depends on the third derivative | even if it converges, convergence is slower than linear | | In the quadratic case, it oscillates between two points that are equidistant from, and on opposite sides of, the optimum.
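The local rates in the table can be checked numerically. In the sketch below, the test function <math>f(x) = e^x + e^{-x}</math> (convex, with minimum at <math>x^* = 0</math> and <math>f''(0) = 2</math>), the learning rate, and the starting point are assumptions chosen for this illustration; the ratio of successive errors should approach <math>1 - \alpha f''(x^*)</math>.

```python
import math

def f_prime(x):
    # f(x) = e^x + e^(-x) is convex with minimum at x* = 0 and f''(0) = 2
    return math.exp(x) - math.exp(-x)

alpha = 0.2                      # satisfies 0 < alpha < 1/f''(x*) = 0.5
expected_rate = 1 - alpha * 2    # = 0.6
print("expected rate:", expected_rate)

x = 0.5                          # start reasonably close to the minimum
for n in range(20):
    x_next = x - alpha * f_prime(x)
    print(n, abs(x_next) / abs(x))   # ratio of successive errors; approaches 0.6
    x = x_next
```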
Example to illustrate divergence from sufficiently far away
Consider the following example function:
The function is convex, with second derivative:
The unique point of local and absolute minimum is . We have .
The rule used to compute the <math>(n+1)^{th}</math> iterate from the <math>n^{th}</math> iterate is:
Note that if , we will not converge from any . For also, we do not converge. For , we converge if we have:
Below is a proof sketch:
- For <math>x^{(n)} \ne 0</math>, the condition <math>\lvert x^{(n+1)}\rvert < \lvert x^{(n)}\rvert</math> is equivalent to the condition above.
- Therefore, if <math>x^{(0)}</math> violates the condition above, so does <math>x^{(1)}</math>, and by induction, so do all iterates. Therefore, the sequence cannot converge.
- On the other hand, if <math>x^{(0)}</math> satisfies the condition, so does <math>x^{(1)}</math>, and so do all future iterates, so the sequence is monotonically decreasing in magnitude. The sequence of absolute values must therefore converge to its greatest lower bound, and the only value it can converge to is 0.
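As a rough numerical illustration of this behavior, the sketch below uses an assumed convex function with unbounded second derivative, <math>f(x) = x^2 + x^4</math>; this function, the learning rate, and the starting points are assumptions chosen for the sketch, not necessarily the example above. The same learning rate converges from a nearby starting point but diverges from a faraway one.

```python
def grad(x):
    # assumed stand-in example: f(x) = x^2 + x^4, so f'(x) = 2x + 4x^3
    return 2 * x + 4 * x * x * x

def run(x0, alpha=0.25, steps=6):
    """Run a few gradient descent steps with constant learning rate and record the iterates."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - alpha * grad(x)
        trajectory.append(x)
    return trajectory

print(run(1.0))   # starts close enough: iterates shrink toward the minimum at 0
print(run(2.0))   # starts too far away: iterates blow up in magnitude
```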
Global convergence properties
As seen above, we can use the second derivative at the optimal point to deduce local convergence behavior. Note, however, that the domain of convergence depends on the choice of learning rate, and we are not guaranteed that there exists a learning rate that works globally. In fact, in the example above, there is no learning rate that works globally: for every choice of <math>\alpha</math>, gradient descent diverges from sufficiently far away.
Suppose we have a global upper bound on the second derivative, i.e., <math>\sup_x f''(x)</math> is finite. Note that <math>f''(x^*) \le \sup_x f''(x)</math>, with equality holding iff either the second derivative is globally constant (as happens in the case of a quadratic function) or it attains its global maximum at <math>x^*</math>. An example of a function of the latter kind is the function <math>f(x) = \sqrt{1 + x^2}</math>, whose second derivative <math>(1 + x^2)^{-3/2}</math> attains its maximum at the point of minimum <math>x^* = 0</math>.
Note that even in these cases, we may still end up choosing <math>\alpha</math> smaller than the ideal value because of knowledge problems: we don't know the exact best bound on <math>f''</math>, so we pick a safe bound that is guaranteed to work.
We have two related "knowledge problems" here:
- We may be unaware of where <math>x^*</math> is (even approximately). Therefore, we are not guaranteed that we are starting from sufficiently close to <math>x^*</math>, and therefore, we should use a learning rate based on <math>\sup_x f''(x)</math>.
- We are unaware of the approximate value of <math>f''(x^*)</math>. Therefore, the only guarantees we can make regarding convergence are based on <math>\sup_x f''(x)</math>.
Case on <math>\alpha</math> | Does the sequence converge to a minimum? How does it behave? | Convergence rate (eventual, based on <math>f''(x^*)</math>) | Convergence rate based on <math>\sup_x f''(x)</math>
---|---|---|---
<math>\alpha = 0</math> | No. The sequence stays stuck at the initial point. | -- | --
<math>0 < \alpha < \frac{1}{\sup_x f''(x)}</math> | Yes, for any starting point. | linear convergence with convergence rate <math>1 - \alpha f''(x^*)</math>. The convergence pattern is eventually monotone. Note, however, that it may take a very long time for this convergence rate to kick in. | No guaranteed upper bound on convergence rate (other than the trivial bound of 1) in terms of <math>\sup_x f''(x)</math>! This is because <math>f''(x^*)</math> could be substantially smaller than <math>\sup_x f''(x)</math>, and we need a lower bound on <math>f''(x^*)</math> in order to guarantee a convergence rate.
<math>\alpha = \frac{1}{\sup_x f''(x)}</math> | Yes, for any starting point. | linear convergence with convergence rate <math>1 - \alpha f''(x^*)</math> if <math>f''(x^*) < \sup_x f''(x)</math>, convergence of higher order if <math>f''(x^*) = \sup_x f''(x)</math>. | Same as above
<math>\frac{1}{\sup_x f''(x)} < \alpha < \frac{2}{\sup_x f''(x)}</math> | Yes, for any starting point. | linear convergence with convergence rate <math>\lvert 1 - \alpha f''(x^*)\rvert</math>. Note that if <math>\alpha < \frac{1}{f''(x^*)}</math>, the sign of <math>1 - \alpha f''(x^*)</math> is positive throughout, and we get eventually monotone convergence. On the other hand, if <math>\alpha > \frac{1}{f''(x^*)}</math>, then <math>x^{(n)} - x^*</math> switches sign midway, and the convergence pattern is eventually oscillatory. | Same as above
<math>\alpha \ge \frac{2}{\sup_x f''(x)}</math> | Potentially yes | |
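To connect the table with practice: if all we know is an upper bound on the second derivative, picking <math>\alpha</math> no larger than the reciprocal of that bound guarantees convergence from any starting point, though progress can be slow while the iterate is far away. Below is a minimal sketch, assuming the test function <math>f(x) = \sqrt{1 + x^2}</math> (second derivative at most 1, with the maximum attained at the minimum <math>x^* = 0</math>) and an arbitrary faraway starting point; since the second derivative attains its maximum at the minimum, the eventual convergence near <math>x^*</math> is in fact faster than linear.

```python
import math

def f_prime(x):
    # f(x) = sqrt(1 + x^2): convex, minimum at 0, f''(x) = (1 + x^2)^(-3/2) <= 1
    return x / math.sqrt(1 + x * x)

second_derivative_bound = 1.0
alpha = 1.0 / second_derivative_bound   # safe: alpha < 2 / sup f'', so convergence from anywhere

x = 100.0   # far from the minimum
for n in range(200):
    x = x - alpha * f_prime(x)

print(x)   # essentially 0; progress is slow while |x| is large, then very fast near x* = 0
```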