Gradient descent with constant learning rate for a quadratic function of multiple variables

Setup

This page includes a detailed analysis of gradient descent with constant learning rate for a quadratic function of multiple variables. It builds on the analysis at the page gradient descent with constant learning rate for a quadratic function of one variable.

Function

The function we are interested in is a function f: \R^n \to \R of the form:

f(\vec{x}) = \vec{x}^TA\vec{x} + \vec{b}^T\vec{x} + c

where A is an n \times n symmetric positive-definite matrix with entries a_{ij} and \vec{b} is the column vector with entries b_i. For part of this page, we will generalize somewhat to the case that A is a symmetric positive-semidefinite matrix.

Note that the symmetry condition involves no loss of generality: we could replace the matrix A with the symmetric matrix (A + A^T)/2 without changing the functional form. The positive-definite or positive-semidefinite condition guarantees that the function has the right sort of convexity. In the positive-definite case, we are guaranteed that there is a unique point of local minimum that is also the point of absolute minimum. The positive-semidefinite case is somewhat more complicated: we could either have no minimum, or we could have an affine space worth of points at which the function has a local and absolute minimum.

Learning algorithm

Suppose \alpha is a positive real number. Gradient descent with constant learning rate \alpha is an iterative algorithm that aims to find a point of local minimum of f (in the positive-definite case, the unique such point). The algorithm starts with a guess \vec{x}^{(0)} and updates according to the rule:

\vec{x}^{(k+1)} = \vec{x}^{(k)} - \alpha \nabla f(\vec{x}^{(k)})
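
The following is a minimal Python (NumPy) sketch of this update rule for the quadratic function above; the specific matrix A, vector \vec{b}, starting point, learning rate, and iteration count are illustrative choices, not taken from the original text. The gradient used is \nabla f(\vec{x}) = 2A\vec{x} + \vec{b}.

import numpy as np

def gradient_descent_quadratic(A, b, x0, alpha, num_iters):
    # Gradient descent with constant learning rate alpha on
    # f(x) = x^T A x + b^T x + c (the constant c does not affect the iterates).
    x = x0.astype(float)
    for _ in range(num_iters):
        grad = 2 * A @ x + b     # gradient of the quadratic function
        x = x - alpha * grad     # constant-learning-rate update
    return x

# Illustrative 2 x 2 symmetric positive-definite example
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([-1.0, 0.5])
x0 = np.array([5.0, -4.0])
x_exact = np.linalg.solve(2 * A, -b)   # exact minimizer, from 2Ax + b = 0
x_gd = gradient_descent_quadratic(A, b, x0, alpha=0.1, num_iters=200)
print(np.allclose(x_gd, x_exact))      # True for this (small enough) learning rate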

Convergence properties based on the learning rate: the case of a symmetric positive-definite matrix

We will carry out our analysis in terms of the eigenvalues of the Hessian matrix 2A. If we were instead working with the eigenvalues of A, we would have to double them to get the eigenvalues of 2A; since the bounds on the learning rate are expressed as constants divided by eigenvalues, restating them in terms of the eigenvalues of A would introduce additional factors of 2 in the denominators, or equivalently, remove the factor of 2 from the numerators.

Note that in the case n = 1 (where f(x) = ax^2 + bx + c), we have \sigma_{\max} = \sigma_{\min} = 2a and \kappa = 1 in the notation below.

Summary

Since 2A is a symmetric matrix, its singular values and eigenvalues coincide. Explicitly, we will denote by \sigma_{\max} the largest eigenvalue (and hence also the largest singular value) and by \sigma_{\min} the smallest eigenvalue (and hence also the smallest singular value). The condition number of the matrix 2A is the quotient:

\kappa = \frac{\sigma_{\max}}{\sigma_{\min}}

The minimum possible value of \kappa is 1, and this occurs when 2A is a scalar matrix. In general, the larger the value of \kappa, the worse the performance of gradient descent with constant learning rate. The following are true:

(In all cases, assume that we do not start at the point of local minimum.)

  • \alpha \ge \frac{2}{\sigma_{\min}}: Gradient descent does not converge.
  • \frac{2}{\sigma_{\max}} \le \alpha < \frac{2}{\sigma_{\min}}: It converges if we start at a point that is in an appropriate affine subspace, but does not converge for most starting points.
  • \alpha < \frac{2}{\sigma_{\max}}: It converges. On the domain side, the convergence is linear (i.e., the distance from the optimum decays exponentially with the number of iterations), and the worst-case convergence rate is \max \{ \left| \sigma_{\max}\alpha - 1 \right|, \left| \sigma_{\min}\alpha - 1 \right| \}. On the function value side, the convergence is linear with rate \max \{ \left|\sigma_\max \alpha - 1 \right|, \left| \sigma_\min \alpha - 1 \right| \}^2. As \kappa = \sigma_\max/\sigma_\min \to \infty, both rates approach 1, i.e., in the limit the convergence is slower than linear.
  • \alpha = \frac{2}{\sigma_{\max} + \sigma_{\min}}: It converges. On the domain side, the convergence is linear with worst-case rate \frac{\sigma_\max - \sigma_\min}{\sigma_\max + \sigma_\min} = \frac{\kappa - 1}{\kappa + 1}; this is the best (lowest) possible worst-case linear convergence rate over choices of \alpha. On the function value side, the convergence is linear with rate \left(\frac{\kappa - 1}{\kappa + 1} \right)^2. As \kappa \to \infty, both rates approach 1, i.e., in the limit the convergence is slower than linear.
  • \alpha = \frac{1}{\sigma_\max}: It converges. On the domain side, the convergence is linear with worst-case rate \frac{\sigma_\max - \sigma_\min}{\sigma_\max} = \frac{\kappa - 1}{\kappa}; this choice of \alpha may be used in practice when \sigma_\max is easy to compute (or bound) but \sigma_\min is not. On the function value side, the convergence is linear with rate \left( \frac{\kappa - 1}{\kappa}\right)^2. As \kappa \to \infty, both rates approach 1, i.e., in the limit the convergence is slower than linear.
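
As a quick sanity check of the rates listed above, the following Python snippet compares the worst-case per-step rate \max \{ |\alpha\sigma_\max - 1|, |\alpha\sigma_\min - 1| \} at the two specific choices of \alpha against the closed forms \frac{\kappa - 1}{\kappa + 1} and \frac{\kappa - 1}{\kappa}. The eigenvalue values are illustrative, not taken from the original text.

sigma_max, sigma_min = 10.0, 0.5      # illustrative eigenvalues of the Hessian 2A
kappa = sigma_max / sigma_min

def worst_case_rate(alpha):
    # Worst-case per-step contraction factor on the domain side
    return max(abs(alpha * sigma_max - 1), abs(alpha * sigma_min - 1))

alpha_best = 2 / (sigma_max + sigma_min)
alpha_safe = 1 / sigma_max

print(worst_case_rate(alpha_best), (kappa - 1) / (kappa + 1))   # both ~0.9048
print(worst_case_rate(alpha_safe), (kappa - 1) / kappa)         # both 0.95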

Detailed analysis after transforming the problem to a setting where the coordinates are simple

Suppose the eigenvalues of 2A are:

\sigma_\max = \sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \dots \ge \sigma_n = \sigma_\min

Via a coordinate change of the domain using an orthogonal matrix, we can transform A into a diagonal matrix with positive real entries. These entries are halves of the eigenvalues of 2A. Via a translation, we can get rid of the linear term, and finally, we can translate the function value by a constant (this does not affect the point of local minimum, though it affects the minimum value). We can therefore obtain the following simplified functional form:

f(\vec{x}) = \frac{1}{2} \sum_{i=1}^n \sigma_ix_i^2

The unique local and absolute minimum is at the zero vector.

Note that even though we know that our matrix can be transformed this way, we do not in general know how to bring it into this form -- if we did, we could directly solve the problem without using gradient descent (this is an alternate solution method). However, even though we may not know the explicit diagonalized form of the function, the fact that it does have such a form gives us information about how the gradient descent process converges.
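
As an illustration of the coordinate change, the following Python snippet uses an eigendecomposition of the Hessian 2A to verify that, in the rotated coordinates, the pure quadratic part of f takes the simplified form \frac{1}{2} \sum_i \sigma_i y_i^2 (the translation that removes the linear and constant terms is omitted). The matrix and test point are illustrative.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])             # illustrative symmetric positive-definite matrix
sigmas, Q = np.linalg.eigh(2 * A)      # eigenvalues and orthonormal eigenvectors of the Hessian 2A

x = np.array([1.5, -2.0])              # arbitrary test point
y = Q.T @ x                            # coordinates after the orthogonal change of variables
print(np.isclose(x @ A @ x, 0.5 * np.sum(sigmas * y**2)))   # True: x^T A x = (1/2) sum_i sigma_i y_i^2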

Starting with a point:

\vec{x} = (x_1,x_2,\dots,x_n)

we obtain that the gradient vector at the point is:

\nabla f(\vec{x}) = (\sigma_1x_1,\sigma_2x_2,\dots,\sigma_nx_n)

Therefore, after one step of gradient descent, the new coordinates are:

x_i^{\mbox{new}} = x_i - \alpha \sigma_ix_i

In other words, gradient descent proceeds independently in each coordinate, exactly as in gradient descent with constant learning rate for a quadratic function of one variable. In the special case that a particular coordinate is already zero, it stays zero.

In order to guarantee convergence to the point of local minimum (the zero vector), we need the learning rate to be less than 2/\sigma_i for each i for which the initial value of x_i is nonzero. Moreover, for the i^{th} direction, the convergence rate is |\alpha\sigma_i - 1|, and if that value is 1 or more (which happens if \alpha \ge 2/\sigma_i), then we do not obtain convergence to 0 in that coordinate.
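
Explicitly, unrolling this recursion gives the closed form for the i^{th} coordinate after k steps:

x_i^{(k)} = (1 - \alpha\sigma_i)^k x_i^{(0)}

so coordinate i converges to 0 if and only if |1 - \alpha\sigma_i| < 1 (or x_i^{(0)} = 0), which is precisely the condition \alpha < 2/\sigma_i.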

Worst-case convergence rate across all coordinates as a function of learning rate

The worst-case convergence rate (assuming all coordinates nonzero for our starting point) is therefore obtained as follows. Consider the expression:

\max_{1 \le i \le n} |\alpha\sigma_i - 1|

This is the maximum of the convergence rates over all coordinates.

If this quantity is 1 or more, then the worst-case situation is no convergence in at least one coordinate. If the quantity is less than 1, then it equals the worst-case convergence rate (or, the convergence rate in the slowest coordinate).

Let us look more closely at this expression. Consider the continuous function:

g(\alpha,\sigma) = |\alpha\sigma - 1|

It can be verified that, for fixed \sigma, this is convex in \alpha, and for fixed \alpha, it is convex in \sigma. Therefore, for a given \alpha, the maximum over \sigma \in [\sigma_\min, \sigma_\max] is attained at one of the extreme values of \sigma, i.e.:

\max_{1 \le i \le n} |\alpha\sigma_i - 1| = \max \{ |\alpha \sigma_\max - 1|, |\alpha \sigma_\min - 1| \}

We further need to cap this value at 1, i.e., if the maximum value above is greater than 1, then we do not have convergence in the worst case. We therefore obtain the following piecewise functional form for the worst-case convergence rate:

\left \lbrace \begin{array}{rl} 1 - \alpha\sigma_\min, & 0 < \alpha < \frac{2}{\sigma_\min + \sigma_\max} \\ \alpha\sigma_\max - 1, & \frac{2}{\sigma_\min + \sigma_\max} \le \alpha < \frac{2}{\sigma_\max} \\\end{array}\right.

Optimization of learning rate to get the best upper bound on the worst-case convergence rate on the domain side

Unless \kappa = 1 (i.e., all the eigenvalues are equal), there is no choice of learning rate that guarantees immediate or even finite-step convergence. Different learning rates work better from different starting points. If we only know \sigma_{\min} and \sigma_{\max}, then the value of \alpha that makes the worst-case convergence as fast as possible is:

\alpha = \frac{2}{\sigma_\max + \sigma_\min}

This can be computed by minimizing over \alpha the expression \max \{ \left| \sigma_{\max}\alpha - 1 \right|, \left| \sigma_{\min}\alpha - 1 \right| \}, described using a piecewise description above.
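
Explicitly, on the first piece the expression 1 - \alpha\sigma_\min is decreasing in \alpha, while on the second piece \alpha\sigma_\max - 1 is increasing in \alpha, so the minimum of the maximum is attained where the two pieces meet:

1 - \alpha\sigma_\min = \alpha\sigma_\max - 1 \implies \alpha = \frac{2}{\sigma_\max + \sigma_\min}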

The corresponding upper bound on convergence rate is (note that smaller upper bounds indicate faster convergence):

\frac{\sigma_\max - \sigma_\min}{\sigma_\max + \sigma_\min}

In terms of the condition number \kappa, this upper bound on convergence rate can be expressed as:

\frac{\kappa - 1}{\kappa + 1}

Note that this checks out in the case \kappa = 1: in this case, 2A is a scalar matrix, so all the eigenvalues are equal, and we can choose \alpha = 1/\sigma_\min = 1/\sigma_\max and converge in one step. The case n = 1 is a special case of \kappa = 1.

As a practical matter, gradient descent algorithms generally choose a learning rate of \frac{1}{\sigma_\max} (or a lower bound on that value) rather than the worst-case-optimal value above. Partly, this is because \sigma_\min is hard to compute. Partly, it is because, particularly when \sigma_\min is much smaller than \sigma_\max, 2/(\sigma_\min + \sigma_\max) comes perilously close to 2/\sigma_\max, the threshold beyond which we might end up diverging. In principle, the worst-case convergence rate from this choice is somewhat worse, but not significantly so. The convergence rate works out to:

\frac{\kappa - 1}{\kappa}
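
To see this, substitute \alpha = 1/\sigma_\max into the worst-case expression:

\max \{ |\sigma_\max/\sigma_\max - 1|, |\sigma_\min/\sigma_\max - 1| \} = \max \left\{ 0, 1 - \frac{1}{\kappa} \right\} = \frac{\kappa - 1}{\kappa}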

Optimization of learning rate to get the best upper bound on the worst-case convergence rate on the function value side

The same bounds apply, i.e., the best upper bound on the worst-case convergence rate occurs at the value:

\alpha = \frac{2}{\sigma_\min + \sigma_\max}

Moreover, we still have linear convergence on the function value side (i.e., exponential decay in terms of the number of iterations).

The only change is that the convergence rate is now the square of the corresponding domain-side convergence rate, i.e.:

\left(\frac{\kappa - 1}{\kappa + 1}\right)^2

Similarly, if we use the value \alpha = 1/\sigma_\max, we get the convergence rate:

\left(\frac{\kappa - 1}{\kappa}\right)^2
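
The reason the function value rates are the squares of the domain-side rates can be read off from the diagonalized form: each coordinate contracts by a factor of |1 - \alpha\sigma_i| per step, so each term \sigma_i x_i^2 in f contracts by the square of that factor:

f(\vec{x}^{(k+1)}) = \frac{1}{2} \sum_{i=1}^n \sigma_i \left( (1 - \alpha\sigma_i) x_i^{(k)} \right)^2 = \frac{1}{2} \sum_{i=1}^n (1 - \alpha\sigma_i)^2 \sigma_i \left( x_i^{(k)} \right)^2

Hence the worst-case per-step contraction factor for the function value is \max_i (1 - \alpha\sigma_i)^2, the square of the worst-case domain-side rate.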

Convergence properties based on the learning rate: the case of a symmetric positive-semidefinite matrix

As before, our analysis is in terms of the eigenvalues of the Hessian matrix 2A. However, zero is one of the eigenvalues, i.e., we have \sigma_\min = 0. Thus, the condition number \kappa = \frac{\sigma_\max}{\sigma_\min} is \infty.

An equivalent formulation is that the matrix A is singular. As discussed on the quadratic function of multiple variables page, there are two possibilities:

  • The vector \vec{b} is not in the image of A: In this case, there is no global minimum, because the graph of the quadratic function contains a section that is a line with nonzero slope, which takes arbitrarily large positive and negative values.
  • The vector \vec{b} is in the image of A: In this case, we get an affine space worth of points at which the function takes its minimum value.

If we assume that we are in the second of these regimes, then we can show that the convergence behavior, viewed as convergence towards the affine subspace as a whole rather than to any specific point, mimics the case of the symmetric positive-definite matrix, provided we replace \sigma_\min by the smallest positive eigenvalue. All our moves occur perpendicular to the affine subspace that we are trying to converge to, since the gradient has no component along the directions in which the function is constant.
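
The following Python sketch illustrates this in a small example with a singular positive-semidefinite matrix; the specific matrix, vector, starting point, and learning rate are illustrative. The iterates converge to a point on the line of minimizers, while the component of the iterate along the null space of A never changes.

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])             # singular, positive-semidefinite; null space spanned by (1, -1)
b = np.array([-2.0, -2.0])             # lies in the image of A, so minimizers form the line x_1 + x_2 = 1

x = np.array([3.0, -5.0])
alpha = 0.2                            # below 2/sigma_max, where sigma_max = 4 is the largest eigenvalue of 2A
null_dir = np.array([1.0, -1.0]) / np.sqrt(2)

for _ in range(100):
    x = x - alpha * (2 * A @ x + b)    # gradient descent step

print(2 * A @ x + b)                   # ~(0, 0): x is (numerically) a minimizer
print(x @ null_dir)                    # equals the starting point's null-space component (unchanged)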

The interesting case: a very large but finite condition number

The situation of greatest practical interest in gradient descent is where the condition number is very large, but still finite. In principle, we still have linear convergence (i.e., the gap between the function value and the optimal value decays exponentially with the number of iterations). However, the convergence rate is extremely close to 1, so in practice the bounds we get are not very helpful.