Gradient descent with constant learning rate for a convex function of one variable
Setup
This page includes a discussion of gradient descent with constant learning rate for a convex function of one variable.
It is similar to gradient descent with constant learning rate for a quadratic function of one variable.
Learning algorithm
Suppose <math>\alpha</math> is a positive real number. The gradient descent with constant learning rate <math>\alpha</math> is an iterative learning algorithm with update rule:

<math>x^{(n+1)} = x^{(n)} - \alpha f'(x^{(n)})</math>
Here, the superscripts denote the stages of iteration. Parenthesized superscripts do not denote exponents. We do not use subscripts, because we want to keep notation consistent with gradient descent for functions of multiple variables, and in that case, subscripts are reserved for coordinates and superscripts for iterates.
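To make the update rule concrete, here is a minimal Python sketch. The helper name and the illustrative convex function <math>f(x) = x^2 + x</math> (with derivative <math>2x + 1</math> and minimum at <math>x = -1/2</math>) are assumptions chosen for this example, not part of the discussion above.

```python
def gradient_descent_constant_rate(f_prime, x0, alpha, num_iterations):
    """Iterate x^(n+1) = x^(n) - alpha * f'(x^(n)) with a fixed learning rate alpha."""
    x = x0
    for _ in range(num_iterations):
        x = x - alpha * f_prime(x)
    return x

# Illustrative example: f(x) = x^2 + x, so f'(x) = 2x + 1 and the minimum is at x = -1/2.
approx_min = gradient_descent_constant_rate(f_prime=lambda x: 2 * x + 1,
                                            x0=3.0, alpha=0.1, num_iterations=100)
print(approx_min)  # close to -0.5
```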
Convergence properties based on the learning rate
Local convergence properties in the case of a unique minimum where the order of the zero of the derivative is one (assuming twice differentiability)
The behavior here is qualitatively similar to that for gradient descent with constant learning rate for a quadratic function of one variable, with a few differences. Denote by <math>x^*</math> the point of absolute minimum. We have <math>f'(x^*) = 0</math>, and our assumption is that <math>f''(x^*) \ne 0</math>. Note that, by convexity, we must have <math>f''(x^*) > 0</math>. The quadratic function that locally approximates the behavior of <math>f</math> is its degree two Taylor polynomial at <math>x^*</math>, explicitly given as:

<math>x \mapsto f(x^*) + f'(x^*)(x - x^*) + \frac{f''(x^*)}{2}(x - x^*)^2</math>
Since <math>f'(x^*) = 0</math>, we get:

<math>x \mapsto f(x^*) + \frac{f''(x^*)}{2}(x - x^*)^2</math>
In particular, if we write this quadratic as <math>ax^2 + bx + c</math>, then <math>a = \frac{f''(x^*)}{2}</math>, so the thresholds <math>\frac{1}{2a}</math> and <math>\frac{1}{a}</math> from the quadratic case become <math>\frac{1}{f''(x^*)}</math> and <math>\frac{2}{f''(x^*)}</math> respectively. We can now make cases based on <math>\alpha</math>.
Case on <math>\alpha</math> | Does the sequence converge to a minimum? How does it behave? | Order and rate of convergence on domain side | Order and rate of convergence on function value side | How the quadratic case is special
---|---|---|---|---
<math>\alpha = 0</math> | No. The sequence stays stuck at the initial point. | -- | -- | --
<math>0 < \alpha < \frac{1}{f''(x^*)}</math> | Yes, from sufficiently close to the point of minimum | linear convergence with convergence rate <math>1 - \alpha f''(x^*)</math>. The convergence pattern is eventually monotone. | Linear convergence with convergence rate <math>(1 - \alpha f''(x^*))^2</math> | In the quadratic case, the precise convergence rate is realized at every iteration.
<math>\alpha = \frac{1}{f''(x^*)}</math> | Yes, from sufficiently close to the point of minimum | The order of convergence is the order of zero at <math>x^*</math> of the first-degree residual at <math>x^*</math> of <math>f'</math> (i.e., <math>f'</math> minus its first-degree Taylor polynomial). This is also one less than the order of zero at <math>x^*</math> of <math>f</math> minus its second-degree Taylor polynomial. | | In the quadratic case, we'd reach the optimum after one iteration (the residual would be zero, so the order of convergence would work out to <math>\infty</math>, which is consistent with reaching the optimum after finitely many iterations).
<math>\frac{1}{f''(x^*)} < \alpha < \frac{2}{f''(x^*)}</math> | Yes, from sufficiently close to the point of minimum | linear convergence with convergence rate <math>\alpha f''(x^*) - 1</math>. The convergence pattern is eventually oscillatory, i.e., successive iterates are on opposite sides of the optimal value. | | In the quadratic case, the precise convergence rate is realized at every iteration.
<math>\alpha = \frac{2}{f''(x^*)}</math> | Depends on the third derivative | even if it converges, convergence is slower than linear | | In the quadratic case, it oscillates between two points that are equidistant from, and on opposite sides of, the optimum.
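The local rates in the table can be checked numerically. In the sketch below, the test function <math>f(x) = e^x + e^{-x}</math> (convex, with minimum at <math>x^* = 0</math> and <math>f''(0) = 2</math>), the learning rate, and the starting point are assumptions chosen for this illustration; the ratio of successive errors should approach <math>1 - \alpha f''(x^*)</math>.

```python
import math

def f_prime(x):
    # f(x) = e^x + e^(-x) is convex with minimum at x* = 0 and f''(0) = 2
    return math.exp(x) - math.exp(-x)

alpha = 0.2                      # satisfies 0 < alpha < 1/f''(x*) = 0.5
expected_rate = 1 - alpha * 2    # = 0.6
print("expected rate:", expected_rate)

x = 0.5                          # start reasonably close to the minimum
for n in range(20):
    x_next = x - alpha * f_prime(x)
    print(n, abs(x_next) / abs(x))   # ratio of successive errors; approaches 0.6
    x = x_next
```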
Example to illustrate divergence from sufficiently far away
Consider the following example function:
The function is convex, with second derivative:
The unique point of local and absolute minimum is . We have .
The rule used to compute the <math>(n+1)^{th}</math> iterate from the <math>n^{th}</math> iterate is:
Note that if , we will not converge from any . For also, we do not converge. For , we converge if we have:
Below is a proof sketch:
- For <math>x^{(n)} \ne 0</math>, the condition <math>\lvert x^{(n+1)}\rvert < \lvert x^{(n)}\rvert</math> is equivalent to the condition above.
- Therefore, if <math>x^{(0)}</math> violates the condition above, so does <math>x^{(1)}</math>, and by induction, so do all iterates. Therefore, the sequence cannot converge.
- On the other hand, if <math>x^{(0)}</math> satisfies the condition, so does <math>x^{(1)}</math>, and so do all future iterates, so the sequence is monotonically decreasing in magnitude. The sequence of absolute values must therefore converge to its greatest lower bound, and the only value it can converge to is 0.
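As a rough numerical illustration of this behavior, the sketch below uses an assumed convex function with unbounded second derivative, <math>f(x) = x^2 + x^4</math>; this function, the learning rate, and the starting points are assumptions chosen for the sketch, not necessarily the example above. The same learning rate converges from a nearby starting point but diverges from a faraway one.

```python
def grad(x):
    # assumed stand-in example: f(x) = x^2 + x^4, so f'(x) = 2x + 4x^3
    return 2 * x + 4 * x * x * x

def run(x0, alpha=0.25, steps=6):
    """Run a few gradient descent steps with constant learning rate and record the iterates."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - alpha * grad(x)
        trajectory.append(x)
    return trajectory

print(run(1.0))   # starts close enough: iterates shrink toward the minimum at 0
print(run(2.0))   # starts too far away: iterates blow up in magnitude
```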
Global convergence properties
As seen above, we can use the second derivative at the optimal point to deduce local convergence behavior. Note, however, that the domain of convergence depends on the choice of learning rate, and we are not guaranteed that there exists a learning rate that works globally. In fact, in the example above, there is no learning rate that works globally: for every choice of <math>\alpha</math>, gradient descent diverges from sufficiently far away.
Suppose we have a global upper bound on the second derivative, i.e., <math>\sup_x f''(x)</math> is finite. Note that <math>f''(x^*) \le \sup_x f''(x)</math>, with equality holding iff either the second derivative is globally constant (as happens in the case of a quadratic function) or it attains its global maximum at <math>x^*</math>. An example of a function of the latter kind is the function <math>f(x) = \sqrt{1 + x^2}</math>, whose second derivative <math>(1 + x^2)^{-3/2}</math> attains its maximum at the point of minimum <math>x^* = 0</math>.
Note that even in these cases, we may still end up choosing <math>\alpha</math> smaller than the ideal value because of knowledge problems: we don't know the exact best bound on <math>f''</math>, so we pick a safe bound that is guaranteed to work.
We have two related "knowledge problems" here:
- We may be unaware of where <math>x^*</math> is (even approximately). Therefore, we are not guaranteed that we are starting from sufficiently close to <math>x^*</math>, and therefore, we should use a learning rate based on <math>\sup_x f''(x)</math>.
- We are unaware of the approximate value of <math>f''(x^*)</math>. Therefore, the only guarantees we can make regarding convergence are based on <math>\sup_x f''(x)</math>.
Case on <math>\alpha</math> | Does the sequence converge to a minimum? How does it behave? | Convergence rate (eventual, based on <math>f''(x^*)</math>) | Convergence rate based on <math>\sup_x f''(x)</math>
---|---|---|---
<math>\alpha = 0</math> | No. The sequence stays stuck at the initial point. | -- | --
<math>0 < \alpha < \frac{1}{\sup_x f''(x)}</math> | Yes, for any starting point. | linear convergence with convergence rate <math>1 - \alpha f''(x^*)</math>. The convergence pattern is eventually monotone. Note, however, that it may take a very long time for this convergence rate to kick in. | No guaranteed upper bound on convergence rate (other than the trivial bound of 1) in terms of <math>\sup_x f''(x)</math>! This is because <math>f''(x^*)</math> could be substantially smaller than <math>\sup_x f''(x)</math>, and we need a lower bound on <math>f''(x^*)</math> in order to guarantee a convergence rate.
<math>\alpha = \frac{1}{\sup_x f''(x)}</math> | Yes, for any starting point. | linear convergence with convergence rate <math>1 - \alpha f''(x^*)</math> if <math>f''(x^*) < \sup_x f''(x)</math>, convergence of higher order if <math>f''(x^*) = \sup_x f''(x)</math>. | Same as above
<math>\frac{1}{\sup_x f''(x)} < \alpha < \frac{2}{\sup_x f''(x)}</math> | Yes, for any starting point. | linear convergence with convergence rate <math>\lvert 1 - \alpha f''(x^*)\rvert</math>. Note that if <math>\alpha < \frac{1}{f''(x^*)}</math>, the sign of <math>1 - \alpha f''(x^*)</math> is positive throughout, and we get eventually monotone convergence. On the other hand, if <math>\alpha > \frac{1}{f''(x^*)}</math>, then <math>x^{(n)} - x^*</math> switches sign midway, and the convergence pattern is eventually oscillatory. | Same as above
<math>\alpha \ge \frac{2}{\sup_x f''(x)}</math> | Potentially yes | |
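To connect the table with practice: if all we know is an upper bound on the second derivative, picking <math>\alpha</math> no larger than the reciprocal of that bound guarantees convergence from any starting point, though progress can be slow while the iterate is far away. Below is a minimal sketch, assuming the test function <math>f(x) = \sqrt{1 + x^2}</math> (second derivative at most 1, with the maximum attained at the minimum <math>x^* = 0</math>) and an arbitrary faraway starting point; since the second derivative attains its maximum at the minimum, the eventual convergence near <math>x^*</math> is in fact faster than linear.

```python
import math

def f_prime(x):
    # f(x) = sqrt(1 + x^2): convex, minimum at 0, f''(x) = (1 + x^2)^(-3/2) <= 1
    return x / math.sqrt(1 + x * x)

second_derivative_bound = 1.0
alpha = 1.0 / second_derivative_bound   # safe: alpha < 2 / sup f'', so convergence from anywhere

x = 100.0   # far from the minimum
for n in range(200):
    x = x - alpha * f_prime(x)

print(x)   # essentially 0; progress is slow while |x| is large, then very fast near x* = 0
```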