- 1 Definition
- 2 Iterative step
- 3 Intuition behind choice of learning rate
- 4 Corresponding root-finding method
- 5 Types of gradient descent
- 6 Gradient descent in a typical machine learning context
- 7 Variants of gradient descent
Gradient descent is a general approach used in first-order iterative optimization algorithms whose goal is to find the (approximate) minimum of a function of multiple variables. The idea is that, at each stage of the iteration, we move in the direction of the negative of the gradient vector (or computational approximation to the gradient vector). The step size that we choose depends on the exact algorithm used, and typically involves some sort of line search.
Other names for gradient descent are steepest descent and method of steepest descent.
The corresponding method, if applied to finding maxima, would move in a positive direction along the gradient vector, and is called gradient ascent. The methods are computationally equivalent: gradient descent for finding the minimum of is equivalent to gradient ascent for finding the maximum of over the same domain.
Note that we use parenthesized superscripts to denote iterates and iteration-specific variables, and subscripts to denote coordinates. Parenthesized superscripts do not denote exponents. At one place below, we use a non-parenthesized superscript for an exponent, as is clear from context.
Suppose we are applying gradient descent to minimize a function . We will denote the input to as a vector . Given a guess , gradient descent computes the next guess as follows:
Note that the quantity (called the learning rate) needs to be specified, and the method of choosing this constant describes the type of gradient descent. For the simplest type of gradient descent, called gradient descent with constant learning rate, all the equal a constant and are independent of the current iterate.
One aspect that distinguishes gradient descent from some variants (such as parallel coordinate descent, sequential coordinate descent, and accelerated gradient methods) is that in every iteration, we move strictly along the direction of the gradient vector, i.e., is a scalar multiple of the gradient vector at .
Intuition behind choice of learning rate
Essentially, the learning rate should be chosen to be the reciprocal of the second derivative (often referred to as the curvature although it is not the same as curvature as used in differential geometry). There are, however, a number of subtleties. The cases are discussed below, in roughly increasing order of complexity:
|Case||Discussion of gradient descent with constant learning rate||Variation in second derivative when restricted to a line||Variation in second derivative between directions||Explanation|
|Quadratic function of one variable||Gradient descent with constant learning rate for a quadratic function of one variable||Constant||Constant (only one dimension, so no scope for variation)||The first derivative is a linear function with slope equal to the second derivative. We need to move by the value , where is the slope of the line. The ideal learning rate is therefore . Note that is constant.|
|Quadratic function of multiple variables||Gradient descent with constant learning rate for a quadratic function of multiple variables||Constant||Variable||The main challenge arises from the "elliptical geometry", i.e., the second derivative being radically different in different directions. We can get a mathematical handle of this by finding the eigenvalues of the Hessian. The ratio of the largest to the smallest eigenvalue, also known as the condition number, helps us understand the difficulty of the problem. Basically, if we are choosing a constant learning rate, it has to be chosen based on the largest eigenvalue (the -norm) but that would result in very slow convergence along the directions of the smaller eigenvalues.|
|Convex function of one variable||Gradient descent with constant learning rate for a convex function of one variable||Variable||Constant (only one dimension, so no scope for variation)||The details depend on whether there is a global bound on the second derivative. If there is such a bound, and we are aware of it in advance, the quality of convergence depends on how widely the second derivative at the optimum differs from the global bound. The quadratic case is extreme: these values coincide, so we can get one-step convergence if we know the second derivative.|
|Convex function of multiple variables||Gradient descent with constant learning rate for a convex function of multiple variables||Variable||Variable||Two sources of variation enter the picture: the elliptical geometry arising from different second derivatives in different directions, and the ratio of the second derivative at the optimum to the global bound on the second derivative.|
Corresponding root-finding method
Types of gradient descent
|Name||Verbal description||Is the rule stationary or dependent on number of iterations?||Mathematical formalism||Further comments|
|Gradient descent with constant learning rate (default meaning of gradient descent)||Here, the step size is a fixed multiple of the gradient vector. The multiple used is termed the "learning rate" of the algorithm. The choice of learning rate affects the convergence behavior of the gradient descent.||Stationary||for all . The constant is termed the learning rate of the algorithm.||Note that this need not converge instantaneously even in the case of a quadratic function of one variable, though that is arguably the simplest nontrivial functional form to optimize. For more, see gradient descent with constant learning rate for a quadratic function of one variable. In general, if we have a global bound on the "curvature" (understood technically as the spectral norm of the Hessian), then we can take a constant learning rate that is less than or equal to the reciprocal of this bound.|
|Gradient descent with decaying learning rate||Here, the learning rate decays.||Dependent on number of iterations||depends on and approaches 0 as .||A decaying learning rate results in substantially slower convergence, and is not useful when we have a global bound on the curvature, because in that case we could just choose a constant learning rate equal to the reciprocal of that global bound. A decaying learning rate is helpful in cases where the curvature is unbounded or where we don't know any bound on the curvature.|
|Gradient descent with exact line search||The step size in each step is determined using an exact line search along the line of the gradient vector through the current point.||Stationary||No explicit functional form for , since it depends on details about the function.||Note that this requires computation of the exact functional form restricted to the line as well as a procedure to optimize that function when restricted to that line. Neither of these may be available.|
|Gradient descent using Newton's method||Here, the step size in each step is determined by applying one iteration (or a fixed number of iterations) of Newton's method for optimization of a function of one variable when performing the line search to determine the step size.||Stationary||is the reciprocal of the second-order directional derivative of at along the direction of . Explicitly, if is the Hessian of at , and , then .|| Note that this coincides with exact line search when the function involved is a quadratic function of multiple variables, because the restriction to the line is a quadratic function, and Newton's method for optimization converges in one step for a quadratic function of one variable. Note that this requires the computation of the second derivative at the point.|
Note that this method is completely different from Newton's method for optimization of a function of multiple variables, which does not move along the gradient direction, but rather, adjusts it by the inverse of the Hessian matrix.
Gradient descent in a typical machine learning context
The mathematical form of gradient descent in machine learning problems is more specific: the function that we are trying to optimize is expressible as a sum, with all the additive components having the same functional form but with different parameters (note that the parameters referred to here are the feature values for examples, not the "parameter vector" -- which plays the role of the input to the function in this case). Explicitly, the function on which we are doing gradient descent has the form:
where is a vector storing the example, and is the number of examples. In such an implementation, gradient descent can be parallelized. We have that the gradient is a sum of the gradients of each of the additive components:
The computation of the summands can be carried out in parallel by different processors, each of which has access to the corresponding set of s.
Variants of gradient descent
Variants that do not require the framework of additive separability
|Variant||Coordinatization-dependent?||Is the rule stationary or dependent on the number of iterations?||Description|
|parallel coordinate descent||Yes||Stationary||Instead of using a scalar multiple of the gradient vector, we multiply different coordinates of the gradient vector by different quantities (we can think of it as a coordinate-wise vector product). The coefficient used in each coordinate is, roughly, inversely proportional to the second derivative in that coordinate. Adjustments need to be made to avoid "overstepping". The method is naturally parallelizable, because updates to all the coordinates can be made simultaneously.|
|sequential coordinate descent||Yes||Stationary||The idea is similar to parallel coordinate descent, but we cycle through the coordinates, and after updating each coordinate, we recompute the next coordinate of the gradient vector at the new point. The simplest version cycles through all the coordinates in a predetermined order. There are also "greedy" versions that choose the coordinate with the most gain each time. Sequential coordinate descent does not suffer the problem of overstepping, but it is less parallelizable, because each coordinate needs to be completed before work can begin on the next. There is also more computational overhead because of the need to compute a coordinate of a new gradient vector each time.|
|accelerated gradient methods||No||Dependent on the number of iterations||Nesterov's accelerated gradient method uses a momentum term to accelerate gradient descent. Roughly speaking, at each iteration, we are adding up momentum from all previous steps. Whereas plain gradient descent has linear convergence, accelerated gradient methods can have convergence as good as quadratic.|