Gradient descent
==Definition==
Gradient descent is a general approach used in first-order iterative optimization algorithms whose goal is to find the (approximate) minimum of a function of multiple variables. The idea is that, at each stage of the iteration, we move in the direction of the negative of the gradient vector (or computational approximation to the gradient vector). The step size that we choose depends on the exact algorithm used, and typically involves some sort of line search.
Other names for gradient descent are steepest descent and method of steepest descent.
The corresponding method, if applied to finding maxima, would move in a positive direction along the gradient vector, and is called gradient ascent. The two methods are computationally equivalent: since <math>\nabla(-f) = -\nabla f</math>, gradient descent for finding the minimum of <math>f</math> is equivalent to gradient ascent for finding the maximum of <math>-f</math> over the same domain.
==Iterative step==
Suppose we are applying gradient descent to minimize a function <math>f</math>. We will denote the input to <math>f</math> as a vector <math>\vec{x}</math>. Given a guess <math>\vec{x}_k</math>, gradient descent computes the next guess <math>\vec{x}_{k+1}</math> as follows:

<math>\vec{x}_{k+1} = \vec{x}_k - \alpha_k \nabla f(\vec{x}_k)</math>
Note that the quantity <math>\alpha_k</math> needs to be specified, and the method of choosing this constant describes the type of gradient descent.
One aspect that distinguishes gradient descent from some variants (such as [[parallel coordinate descent]], [[sequential coordinate descent]], and [[accelerated gradient method]]s) is that in every iteration, we move strictly along the direction of the gradient vector, i.e., <math>\vec{x}_k - \vec{x}_{k+1}</math> is a scalar multiple of the gradient vector at <math>\vec{x}_k</math>.
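The iterative step can be made concrete with a short sketch in Python (using NumPy). This is only an illustration, not code from any particular library: the names <code>gradient_descent</code>, <code>step_size_rule</code>, and the example function are all chosen for this sketch. The step-size rule is passed in as a function, reflecting the point that the choice of <math>\alpha_k</math> determines the type of gradient descent.

<syntaxhighlight lang="python">
import numpy as np

def gradient_descent(grad_f, x0, step_size_rule, num_iters=100):
    """Generic gradient descent: x_{k+1} = x_k - alpha_k * grad_f(x_k).

    grad_f: returns the gradient vector at a point.
    step_size_rule: maps (k, x, gradient) to the step size alpha_k;
    the choice of this rule determines the type of gradient descent.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad_f(x)
        alpha = step_size_rule(k, x, g)
        x = x - alpha * g  # move along the negative gradient direction
    return x

# Example: minimize f(x, y) = x^2 + 3y^2, whose gradient is (2x, 6y),
# using a constant learning rate of 0.1.
grad_f = lambda v: np.array([2.0 * v[0], 6.0 * v[1]])
approx_min = gradient_descent(grad_f, x0=[1.0, 1.0],
                              step_size_rule=lambda k, x, g: 0.1)
print(approx_min)  # close to the true minimizer (0, 0)
</syntaxhighlight>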
==Types of gradient descent==
{| class="wikitable"
! Name !! Verbal description !! Mathematical formalism !! Further comments
|-
| Gradient descent with constant learning rate (default meaning of gradient descent) || Here, the step size is a fixed multiple of the gradient vector. The multiple used is termed the "learning rate" of the algorithm. The choice of learning rate affects the convergence behavior of the gradient descent. || <math>\alpha_k = \alpha</math> for all <math>k</math>. The constant <math>\alpha</math> is termed the learning rate of the algorithm. || Note that this need not converge instantaneously even in the case of a quadratic function of one variable, though that is arguably the simplest nontrivial functional form to optimize. For more, see [[gradient descent with constant learning rate for a quadratic function of one variable]].
|-
| Gradient descent with exact line search || The step size in each step is determined using an exact line search along the line of the gradient vector through the current point. || No explicit functional form for <math>\alpha_k</math>, since it depends on details about the function. || Note that this requires computation of the exact functional form restricted to the line as well as a procedure to optimize that function when restricted to that line. Neither of these may be available.
|-
| Gradient descent using Newton's method || Here, the step size in each step is determined by applying one iteration (or a fixed number of iterations) of Newton's method for optimization of a function of one variable when performing the line search to determine the step size. || <math>\alpha_k</math> is the reciprocal of the second-order directional derivative of <math>f</math> at <math>\vec{x}_k</math> along the direction of <math>\nabla f(\vec{x}_k)</math>. Explicitly, if <math>H(\vec{x}_k)</math> is the Hessian of <math>f</math> at <math>\vec{x}_k</math>, and <math>\vec{g}_k = \nabla f(\vec{x}_k)</math>, then <math>\alpha_k = \frac{\vec{g}_k^T \vec{g}_k}{\vec{g}_k^T H(\vec{x}_k) \vec{g}_k}</math>. || Note that this coincides with exact line search when the function involved is a quadratic function of multiple variables, because the restriction to the line is a quadratic function, and Newton's method for optimization converges in one step for a quadratic function of one variable. Note that this requires the computation of the second derivative at the point. Note that this method is completely different from Newton's method for optimization of a function of multiple variables, which does not move along the gradient direction, but rather adjusts it by the inverse of the Hessian matrix.
|}
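To compare the step-size rules in the table, the following sketch applies each of them to a quadratic function <math>f(\vec{x}) = \frac{1}{2}\vec{x}^T A \vec{x} - \vec{b}^T \vec{x}</math>, whose gradient is <math>A\vec{x} - \vec{b}</math> and whose Hessian is the constant matrix <math>A</math>. It is an illustration only: the matrix <math>A</math>, the vector <math>\vec{b}</math>, the learning rate, and all function names are chosen for the example. As the table notes, exact line search and the Newton-based step size coincide for a quadratic function.

<syntaxhighlight lang="python">
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])   # Hessian of the quadratic (positive definite)
b = np.array([1.0, 2.0])

def grad_f(x):
    # Gradient of f(x) = 0.5 * x^T A x - b^T x
    return A @ x - b

def constant_rate(k, x, g):
    # Gradient descent with constant learning rate: alpha_k = alpha for all k.
    return 0.1

def exact_line_search(k, x, g):
    # For this quadratic, minimizing f(x - alpha * g) over alpha gives
    # the closed form alpha = (g . g) / (g^T A g).
    return (g @ g) / (g @ A @ g)

def newton_step_size(k, x, g):
    # One Newton iteration on alpha -> f(x - alpha * g), started at alpha = 0:
    # alpha = (g . g) / (g^T H(x) g), the reciprocal of the second-order
    # directional derivative along g.  Here H(x) = A, so this coincides
    # with exact line search.
    return (g @ g) / (g @ A @ g)

def run(step_rule, num_iters=50):
    x = np.zeros(2)
    for k in range(num_iters):
        g = grad_f(x)
        x = x - step_rule(k, x, g) * g
    return x

exact_minimizer = np.linalg.solve(A, b)   # equals (1/3, 2)
for rule in (constant_rate, exact_line_search, newton_step_size):
    print(rule.__name__, run(rule), exact_minimizer)
</syntaxhighlight>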