Gradient descent with constant learning rate

Definition

Gradient descent with constant learning rate is a first-order iterative optimization method and is the simplest and most standard implementation of gradient descent. In this method, the real number by which the (negative of the) gradient vector is multiplied to determine the step is held constant across iterations. This constant is termed the learning rate, and we will customarily denote it by \alpha.
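Explicitly, starting from an initial guess x_0, the iterates are given by x_{k+1} = x_k - \alpha \nabla f(x_k), where f is the function being minimized. The following is a minimal Python sketch of this iteration; the function names (gradient_descent_constant_lr, grad_f) and the gradient-norm stopping criterion are illustrative choices rather than part of the definition.

import numpy as np

def gradient_descent_constant_lr(grad_f, x0, alpha, max_iters=1000, tol=1e-8):
    """Minimize a function by gradient descent with a constant learning rate alpha.

    grad_f : callable returning the gradient vector at a point
    x0     : initial guess (array-like)
    alpha  : constant learning rate (the same multiplier is used at every iteration)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        # Stop once the gradient is numerically zero, i.e., at a critical point.
        if np.linalg.norm(g) < tol:
            break
        # The step is the gradient scaled by the same constant alpha in every iteration.
        x = x - alpha * g
    return x

# Example: minimize f(x, y) = x^2 + 3y^2, whose gradient is (2x, 6y).
x_min = gradient_descent_constant_lr(lambda v: np.array([2 * v[0], 6 * v[1]]),
                                     x0=[4.0, -2.0], alpha=0.1)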

Qualitative behavior based on type of function

Gradient descent with constant learning rate, although easy to implement, can converge painfully slowly on many problems. The analysis becomes progressively harder as the class of functions under consideration becomes more general. For more, see:

Function type | Gradient descent in one variable | Corresponding function type for multiple variables | Gradient descent in multiple variables
Quadratic function | gradient descent with constant learning rate for a quadratic function of one variable | quadratic function of multiple variables | gradient descent with constant learning rate for a quadratic function of multiple variables
Convex function | gradient descent with constant learning rate for a convex function of one variable | convex function of multiple variables | gradient descent with constant learning rate for a convex function of multiple variables
Arbitrary twice continuously differentiable function | (link pending) | arbitrary twice continuously differentiable function of multiple variables | (current page)

Qualitative cases based on learning rate

These cases apply even if we are in a convex region that contains the global minimum; the situation gets more complicated in the presence of nonconvexity. A numerical sketch illustrating the cases on a simple quadratic appears after the table.

Case on learning rate \alpha | What happens
Very large | We overshoot in each step and end up farther from the optimum than our original guess. There is no hope of convergence.
Somewhat large but not very large | We overshoot, but still come closer to the global minimum (on a line, this would be something like going from 5 to -4 when we are trying to converge to 0). Because the second derivative differs across directions, we may overshoot in some directions and undershoot in others. As long as we do not overshoot so much that we end up farther than we started, we still converge.
"Just right" | We converge the fastest. In each step we come close to doing as well as a line search could do for that step, so we reach the optimum quickly. Note that, even for quadratic functions, a "just right" value may not exist if the eigenvalues of the Hessian matrix are very different from each other, because the "just right" step sizes in different directions differ considerably.
Small | We converge, but slowly.
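
To make these cases concrete, consider the one-variable quadratic f(x) = x^2, for which the update x_{k+1} = x_k - \alpha f'(x_k) simplifies to x_{k+1} = (1 - 2\alpha) x_k. The iteration converges precisely when 0 < \alpha < 1, reaches the minimum in one step when \alpha = 1/2, overshoots but still converges when 1/2 < \alpha < 1, and diverges when \alpha > 1. The Python sketch below (the particular \alpha values, starting point, and iteration count are illustrative choices) prints a few iterates for one learning rate in each regime.

# Gradient descent on f(x) = x^2 (so f'(x) = 2x) starting from x = 5,
# run with several constant learning rates to show the four regimes.
def iterates(alpha, x0=5.0, steps=6):
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x  # constant-learning-rate update: x <- x - alpha * f'(x)
        xs.append(x)
    return xs

for alpha, label in [(1.2, "very large: diverges"),
                     (0.9, "somewhat large: overshoots but converges"),
                     (0.5, "just right: reaches the minimum in one step"),
                     (0.05, "small: converges slowly")]:
    print(label, [round(x, 4) for x in iterates(alpha)])

Note how \alpha = 0.9 reproduces the "going from 5 to -4" behavior described in the table, and how the "just right" value 1/2 is the reciprocal of the second derivative. In multiple dimensions the analogous "just right" value differs across eigendirections of the Hessian, which is why a single constant learning rate may have to be conservative when the eigenvalues are very different.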