Nesterov's gradient acceleration

Definition

Nesterov's gradient acceleration refers to a general approach that can be used to modify a gradient descent-type method to improve its initial convergence.

The two-step iteration description

In this description, there are two intertwined sequences of iterates that constitute our guesses:

${\vec {x}}^{(0)},{\vec {x}}^{(1)},{\vec {x}}^{(2)},\dots$

${\vec {y}}^{(0)},{\vec {y}}^{(1)},{\vec {y}}^{(2)},\dots$

Explicitly, the sequences are intertwined as follows:

${\vec {x}}^{(0)},{\vec {y}}^{(0)},{\vec {x}}^{(1)},{\vec {y}}^{(1)},{\vec {x}}^{(2)},{\vec {y}}^{(2)},\dots$

We use parenthesized superscripts to denote the iteration stage, because the subscripts are reserved for use by the coordinates.

For the initialization, we set:

${\vec {x}}^{(0)}={\vec {y}}^{(0)}={\mbox{Initial guess}}$

The two-step iteration is as follows:

${\vec {x}}^{(k+1)}={\mbox{Gradient descent-type method applied to }}{\vec {y}}^{(k)}$
${\vec {y}}^{(k+1)}={\vec {x}}^{(k+1)}+m^{(k+1)}\left({\vec {x}}^{(k+1)}-{\vec {x}}^{(k)}\right)$

Note that the sign on the right side is not a typographical error: ${\vec {y}}^{(k+1)}$ is a non-convex combination of ${\vec {x}}^{(k)}$ and ${\vec {x}}^{(k+1)}$ . The value $m^{(k+1)}$ is a member of a predetermined sequence of real numbers in the interval $[0,1)$ that is independent of the specific problem. A typical expression that works is:

$m^{(k)}=(k-1)/(k+2)$

This would yield $m^{(k+1)}=k/(k+3)$ .

The description in terms of an approximating quadratic that is updated at each iteration

Fill this in later

Where do we expect Nesterov's gradient acceleration to be helpful?

In order to understand why Nesterov's gradient acceleration could be helpful, we need to first understand how the gradient descent-type method that we are using could be useful but limited in a manner that cannot be rectified by tuning the learning rate parameters.

The basic philosophy behind gradient descent with constant learning rate is that we should choose a learning rate that is just right. If the learning rate is too small, then gradient descent will converge, but very slowly: the steps are too conservative. If the learning rate is a little bigger than optimal, then gradient descent will converge, but in an oscillatory fashion, overshooting in each iteration. If the learning rate is too large, gradient descent diverges. Ideally, we'd like a learning rate that is approximately the reciprocal of the second derivative.

The problem arises when the second derivative is highly variable. In the one-dimensional case, this could happen for non-quadratic functions, particularly for functions without global bounds on their second derivative. In higher-dimensional settings, the problem could arise from elliptical geometry: the second derivatives are different in different directions, so that it is difficult to choose a learning rate that guarantees good convergence in all dimensions.

In these cases, we set a small learning rate, obtained as the reciprocal of an upper bound on the global curvature across dimensions. We cannot increase the learning rate, because doing so might lead us to diverge along some dimensions or from some starting points. At the same time, our learning rate is highly conservative in most settings. Thus, the problem cannot be remedied by tweaking the learning rate. This is the sort of situation where Nesterov-type acceleration helps.

Intuitively, Nesterov's acceleration is exploring points near the result we get from the gradient descent-type algorithm, choosing larger and larger exploratory steps (as governed by the increasing size of the momentum term). But the beauty of Nesterov's method is that despite this exploration, it does not diverge: even if a particular iteration overshoots and gets us far from the optimum, we'll get back close to the optimum in a few more iterations.