Gradient descent with constant learning rate for a quadratic function of multiple variables
This page includes a detailed analysis of gradient descent with constant learning rate for a quadratic function of multiple variables. It builds on the analysis at the page gradient descent with constant learning rate for a quadratic function of one variable.
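The function we are interested in is a function of the form:
$$f(\vec{x}) = \vec{x}^T A \vec{x} + \vec{b}^T \vec{x} + c$$
where $A$ is an $n \times n$ symmetric positive-definite matrix with entries $a_{ij}$, $\vec{b}$ is the column vector with entries $b_i$, and $c$ is a real constant. For part of this page, we will generalize somewhat to the case that $A$ is a symmetric positive-semidefinite matrix.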
Note that we impose the condition of symmetry because we could replace the matrix $A$ with the symmetric matrix $(A + A^T)/2$ without changing the functional form. The positive-definite or positive-semidefinite condition is to guarantee that the function has the right sort of convexity. In the positive-definite case, we are guaranteed that there is a unique point of local minimum that is also the point of absolute minimum. The positive-semidefinite case is somewhat more complicated: we could either have no minimum, or we could have an affine space worth of points at which the function has a local and absolute minimum.
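The symmetrization remark can be checked numerically. The following is a minimal sketch in plain Python; the helper names `quadratic_form` and `symmetrize`, the matrix `M`, and the test point `x` are hypothetical choices made for illustration only.

```python
def quadratic_form(M, x):
    """Return x^T M x for a square matrix M given as a list of rows."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def symmetrize(M):
    """Return (M + M^T) / 2."""
    n = len(M)
    return [[(M[i][j] + M[j][i]) / 2.0 for j in range(n)] for i in range(n)]


# Hypothetical non-symmetric matrix and test point.
M = [[2.0, 1.0], [0.0, 2.0]]
x = [2.0, -1.0]

value_original = quadratic_form(M, x)
value_symmetrized = quadratic_form(symmetrize(M), x)
print(value_original, value_symmetrized)
```

The two printed values agree because the antisymmetric part of a matrix contributes nothing to the quadratic form.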
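Suppose $\alpha$ is a positive real number. Gradient descent with constant learning rate $\alpha$ is an iterative algorithm that aims to find a (the) point of local minimum for $f$. The algorithm starts with a guess $\vec{x}^{(0)}$ and updates according to the rule:
$$\vec{x}^{(k+1)} = \vec{x}^{(k)} - \alpha \nabla f\left(\vec{x}^{(k)}\right) = \vec{x}^{(k)} - \alpha\left(2A\vec{x}^{(k)} + \vec{b}\right)$$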
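The update rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names, the example matrix `A`, the vector `b`, the learning rate, and the step count are all hypothetical. The gradient of $\vec{x}^T A \vec{x} + \vec{b}^T \vec{x} + c$ for symmetric $A$ is $2A\vec{x} + \vec{b}$.

```python
def gradient(A, b, x):
    """Gradient of f(x) = x^T A x + b^T x + c for symmetric A, i.e. 2Ax + b."""
    n = len(x)
    return [2.0 * sum(A[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]


def gradient_descent(A, b, x0, alpha, steps):
    """Apply the update x <- x - alpha * (2Ax + b) a fixed number of times."""
    x = list(x0)
    for _ in range(steps):
        g = gradient(A, b, x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical example: A has eigenvalues 1 and 3, so the Hessian 2A has
# eigenvalues 2 and 6. The vector b is chosen so that 2Ax + b = 0 at (1, -1).
A = [[2.0, 1.0], [1.0, 2.0]]
b = [-2.0, 2.0]
x_final = gradient_descent(A, b, x0=[0.0, 0.0], alpha=0.25, steps=200)
print(x_final)
```

The learning rate 0.25 is below 2 divided by the largest Hessian eigenvalue (here 6), so the iterates approach the minimizer $(1, -1)$.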
Convergence properties based on the learning rate: the case of a symmetric positive-definite matrix
We will carry out our analysis of eigenvalues in terms of the Hessian matrix $2A$. If we were instead dealing with the eigenvalues of $A$, we would have to double them to get the eigenvalues of $2A$. Since the bounds on the learning rate are described as reciprocals of the eigenvalues, we would therefore obtain additional factors of 2 in the denominators, or equivalently, we would remove the factor of 2 from the numerators.
Note that in the case $n = 1$, we have $\lambda_{\max} = \lambda_{\min}$ and $\kappa = 1$ in the notation below.
Since $2A$ is a symmetric positive-definite matrix, its singular values and eigenvalues coincide. Explicitly, we will denote by $\lambda_{\max}$ the largest eigenvalue (and hence also the largest singular value) and by $\lambda_{\min}$ the smallest eigenvalue (and hence also the smallest singular value). The condition number of the matrix is the quotient:
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$
The minimum possible value of $\kappa$ is 1, and this occurs when $A$ is a scalar matrix. In general, the larger the value of $\kappa$, the worse the performance of gradient descent with constant learning rate. The following are true:
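|Value of $\alpha$||Conclusion about the convergence of gradient descent with constant learning rate (assume that we do not start at the point of local minimum)||Convergence on domain side||Convergence on function value side||Limit of convergence rate as $\kappa \to \infty$|
|$\alpha > 2/\lambda_{\max}$||It does not converge||--||--||--|
|$\alpha = 2/\lambda_{\max}$||It converges if we start at a point that is in an appropriate affine subspace, but does not converge for most starting points.||--||--||--|
|$0 < \alpha < 2/\lambda_{\max}$||It converges||Linear convergence, and the worst-case convergence rate is $\max\{\lvert 1 - \alpha\lambda_{\max}\rvert, \lvert 1 - \alpha\lambda_{\min}\rvert\}$ (note that linear convergence means that the distance from the optimum decays exponentially with the number of iterations).||Linear convergence, rate $\left(\max\{\lvert 1 - \alpha\lambda_{\max}\rvert, \lvert 1 - \alpha\lambda_{\min}\rvert\}\right)^2$||1, i.e., in the limit, the convergence is slower than linear.|
|$\alpha = 2/(\lambda_{\max} + \lambda_{\min})$||It converges||Linear convergence, and the worst-case convergence rate is $(\kappa - 1)/(\kappa + 1)$ where $\kappa = \lambda_{\max}/\lambda_{\min}$. This is the best (lowest) possible linear convergence rate over choices of $\alpha$.||Linear convergence, rate $\left((\kappa - 1)/(\kappa + 1)\right)^2$||1, i.e., in the limit, the convergence is slower than linear.|
|$\alpha = 1/\lambda_{\max}$||It converges||Linear convergence, and the worst-case convergence rate is $1 - 1/\kappa$ where $\kappa = \lambda_{\max}/\lambda_{\min}$. This choice of $\alpha$ may be used in practice in cases where $\lambda_{\max}$ is easy to compute (or bound) but $\lambda_{\min}$ is not.||Linear convergence, rate $\left(1 - 1/\kappa\right)^2$||1, i.e., in the limit, the convergence is slower than linear.|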
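The table can be read as a small decision procedure. The sketch below encodes it in plain Python, assuming the learning rate is positive and the eigenvalues of the Hessian satisfy $0 < \lambda_{\min} \le \lambda_{\max}$; the function name `table_row` and the returned strings are hypothetical.

```python
def table_row(alpha, lam_min, lam_max):
    """Return (conclusion, domain_rate, function_value_rate) for a learning rate.

    lam_min and lam_max are the smallest and largest eigenvalues of the
    Hessian, assumed to satisfy 0 < lam_min <= lam_max, and alpha > 0.
    """
    threshold = 2.0 / lam_max
    if alpha > threshold:
        return ("does not converge", None, None)
    if alpha == threshold:
        # Exact equality is fragile in floating point; this branch is only
        # meant to mirror the boundary row of the table.
        return ("converges only from special starting points", None, None)
    rate = max(abs(1.0 - alpha * lam_max), abs(1.0 - alpha * lam_min))
    return ("converges", rate, rate ** 2)


# Hypothetical eigenvalues 1 and 9, so the condition number is 9.
print(table_row(2.0 / (9.0 + 1.0), 1.0, 9.0))  # learning rate 2 / (lam_max + lam_min)
print(table_row(1.0 / 9.0, 1.0, 9.0))          # learning rate 1 / lam_max
print(table_row(0.3, 1.0, 9.0))                # above 2 / lam_max
```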
Detailed analysis after transforming the problem to a setting where the coordinates are simple
Suppose the eigenvalues of $2A$ are:
$$\lambda_1, \lambda_2, \dots, \lambda_n$$
Via a coordinate change of the domain using an orthogonal matrix, we can transform $A$ into a diagonal matrix with positive real entries. These entries are halves of the eigenvalues of $2A$. Via a translation, we can get rid of the linear term, and finally, we can translate the function value by a constant (this does not affect the point of local minimum, though it affects the minimum value). We can therefore obtain the following simplified functional form:
$$f(\vec{x}) = \sum_{i=1}^n \frac{\lambda_i}{2} x_i^2$$
The unique local and absolute minimum is at the zero vector.
Note that even though we know that our matrix can be transformed this way, we do not in general know how to bring it into this form -- if we did, we could directly solve the problem without using gradient descent (this is an alternate solution method). However, even though we may not know the explicit diagonalized form of the function, the fact that it does have such a form gives us information about how the gradient descent process converges.
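Starting with a point:
$$\vec{x} = (x_1, x_2, \dots, x_n)$$
we obtain that the gradient vector at the point is:
$$\nabla f(\vec{x}) = (\lambda_1 x_1, \lambda_2 x_2, \dots, \lambda_n x_n)$$
Therefore, after one step of gradient descent, the new coordinates are:
$$\left((1 - \alpha\lambda_1)x_1, (1 - \alpha\lambda_2)x_2, \dots, (1 - \alpha\lambda_n)x_n\right)$$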
In other words, gradient descent proceeds independently in each coordinate, exactly as in gradient descent with constant learning rate for a quadratic function of one variable. In the special case that a particular coordinate is already zero, it stays zero.
In order to guarantee convergence to the point of local minimum (the zero vector) we need to make sure that the learning rate $\alpha$ satisfies the condition of being less than $2/\lambda_i$ for each $i$ where the initial value of $x_i$ is nonzero. Moreover, for the $i^{\text{th}}$ direction, the convergence rate is $|1 - \alpha\lambda_i|$, and if that value is 1 or more (that happens if $\alpha \ge 2/\lambda_i$) then we do not obtain convergence to 0 in that coordinate.
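The coordinate-wise behaviour can be illustrated with a short sketch in the diagonalized coordinates, where each step multiplies coordinate $i$ by $1 - \alpha\lambda_i$, so after $k$ steps it equals $(1 - \alpha\lambda_i)^k$ times its starting value. The eigenvalues, starting point, and learning rate below are hypothetical.

```python
def step(x, eigenvalues, alpha):
    """One gradient step for f(x) = sum_i (lambda_i / 2) * x_i^2."""
    return [xi - alpha * lam * xi for xi, lam in zip(x, eigenvalues)]


def closed_form(x0, eigenvalues, alpha, k):
    """Coordinates after k steps: (1 - alpha * lambda_i)^k * x_i^(0)."""
    return [(1.0 - alpha * lam) ** k * xi for xi, lam in zip(x0, eigenvalues)]


eigenvalues = [1.0, 4.0, 10.0]   # hypothetical Hessian eigenvalues
x0 = [5.0, 0.0, -2.0]            # the middle coordinate starts at zero
alpha = 0.15                     # below 2 / 10, so every coordinate contracts

x = list(x0)
for _ in range(20):
    x = step(x, eigenvalues, alpha)

print(x)
print(closed_form(x0, eigenvalues, alpha, 20))
```

The coordinate that starts at zero stays at zero, and the others shrink at their own rates, independently of one another.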
Worst-case convergence rate across all coordinates as a function of learning rate
The worst-case convergence rate (assuming all coordinates nonzero for our starting point) is therefore obtained as follows. Consider the expression:
$$\max_{1 \le i \le n} |1 - \alpha\lambda_i|$$
This is taking the maximum of the convergence rates in all coordinates.
If this quantity is 1 or more, then the worst-case situation is no convergence in at least one coordinate. If the quantity is less than 1, then it equals the worst-case convergence rate (or, the convergence rate in the slowest coordinate).
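Let us look more closely at this expression. Consider the continuous function:
$$(\alpha, \lambda) \mapsto |1 - \alpha\lambda|$$
It can be verified that this is convex in $\alpha$ for fixed $\lambda$, and convex in $\lambda$ for fixed $\alpha$. Therefore, for a given $\alpha$, the maximum over $\lambda \in [\lambda_{\min}, \lambda_{\max}]$ is attained at the extreme values of $\lambda$, i.e.:
$$\max_{1 \le i \le n} |1 - \alpha\lambda_i| = \max\{|1 - \alpha\lambda_{\max}|, |1 - \alpha\lambda_{\min}|\}$$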
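We further need to cap this value at 1, i.e., if the maximum value above is 1 or more, then we do not have convergence in the worst case. We therefore obtain the following piecewise functional form for the worst-case convergence rate:
$$\text{worst-case convergence rate} = \begin{cases} 1 - \alpha\lambda_{\min}, & 0 < \alpha \le \dfrac{2}{\lambda_{\max} + \lambda_{\min}} \\ \alpha\lambda_{\max} - 1, & \dfrac{2}{\lambda_{\max} + \lambda_{\min}} \le \alpha < \dfrac{2}{\lambda_{\max}} \\ 1 \text{ (no convergence in the worst case)}, & \alpha \ge \dfrac{2}{\lambda_{\max}} \end{cases}$$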
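As a sanity check, the piecewise description can be compared against the capped maximum of $|1 - \alpha\lambda|$ at the two extreme eigenvalues. The sketch below assumes $\alpha > 0$ and $0 < \lambda_{\min} \le \lambda_{\max}$; the function names and the sample eigenvalues are hypothetical.

```python
def worst_case_rate(alpha, lam_min, lam_max):
    """Piecewise worst-case rate for alpha > 0 and 0 < lam_min <= lam_max."""
    alpha_opt = 2.0 / (lam_max + lam_min)
    if alpha <= alpha_opt:
        return 1.0 - alpha * lam_min      # slowest coordinate: lam_min
    if alpha < 2.0 / lam_max:
        return alpha * lam_max - 1.0      # slowest coordinate: lam_max
    return 1.0                            # capped: no convergence in the worst case


def max_of_extremes(alpha, lam_min, lam_max):
    """Uncapped maximum of |1 - alpha * lambda| at the two extreme eigenvalues."""
    return max(abs(1.0 - alpha * lam_max), abs(1.0 - alpha * lam_min))


# Hypothetical eigenvalues 1 and 9; scan a few learning rates.
for alpha in (0.05, 0.1, 0.2, 0.21, 0.25):
    print(alpha,
          worst_case_rate(alpha, 1.0, 9.0),
          min(1.0, max_of_extremes(alpha, 1.0, 9.0)))
```

The two computed columns should agree up to rounding, with the smallest value occurring at $\alpha = 2/(\lambda_{\max} + \lambda_{\min})$.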
Optimization of learning rate to get the best upper bound on the worst-case convergence rate on the domain side
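Unless $\lambda_{\max} = \lambda_{\min}$, there is no choice of learning rate that guarantees immediate or even finite-step convergence. Different learning rates work better from different starting points. If we only know $\lambda_{\max}$ and $\lambda_{\min}$, then the value of $\alpha$ that makes the worst-case convergence as fast as possible is:
$$\alpha = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$
This can be computed by minimizing over $\alpha$ the expression $\max\{|1 - \alpha\lambda_{\max}|, |1 - \alpha\lambda_{\min}|\}$, described using a piecewise description above.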
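The corresponding upper bound on convergence rate is (note that smaller upper bounds indicate faster convergence):
$$\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}$$
In terms of the condition number $\kappa$, this upper bound on convergence rate can be expressed as:
$$\frac{\kappa - 1}{\kappa + 1}$$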
Note that this checks out in the case $\kappa = 1$: in this case, $A$ is a scalar matrix, so all the eigenvalues are equal, and we can choose $\alpha = 1/\lambda_{\max}$ and converge in one step. The case $n = 1$ is a special case of $\kappa = 1$.
As a practical matter, gradient descent algorithms generally choose a learning rate of $1/\lambda_{\max}$ (or a lower bound on that) rather than the above minimum regret value. Partly, this is because $\lambda_{\min}$ is hard to compute. Partly, it is because, particularly in the case that $\lambda_{\min}$ is a lot less than $\lambda_{\max}$, $2/(\lambda_{\max} + \lambda_{\min})$ comes perilously close to $2/\lambda_{\max}$, at which we might end up diverging. Note that in principle, the worst-case convergence rate from $\alpha = 1/\lambda_{\max}$ would be somewhat worse, but not significantly so. The convergence rate would work out to:
$$1 - \frac{\lambda_{\min}}{\lambda_{\max}} = 1 - \frac{1}{\kappa}$$
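To get a feel for the trade-off, the two worst-case rates can be tabulated as functions of the condition number, together with how close the optimal learning rate sits to the divergence threshold. This is a small illustrative sketch; the function names and the sampled condition numbers are hypothetical.

```python
def rate_at_optimal_alpha(kappa):
    """Worst-case rate with learning rate 2 / (lam_max + lam_min)."""
    return (kappa - 1.0) / (kappa + 1.0)


def rate_at_inverse_lam_max(kappa):
    """Worst-case rate with learning rate 1 / lam_max."""
    return 1.0 - 1.0 / kappa


def optimal_alpha_as_fraction_of_threshold(kappa):
    """Ratio of 2 / (lam_max + lam_min) to the threshold 2 / lam_max."""
    return kappa / (kappa + 1.0)


for kappa in (1.0, 10.0, 1000.0):
    print(kappa,
          rate_at_optimal_alpha(kappa),
          rate_at_inverse_lam_max(kappa),
          optimal_alpha_as_fraction_of_threshold(kappa))
```

For large condition numbers both rates are close to 1, while the optimal learning rate is within a factor $\kappa/(\kappa + 1)$ of the value at which convergence fails.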
Optimization of learning rate to get the best upper bound on the worst-case convergence on the function value side
The same bounds apply, i.e., the best upper bound on the worst-case convergence rate occurs at the value:
$$\alpha = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$
Moreover, we still have linear convergence on the function value side (i.e., exponential decay in terms of the number of iterations).
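The only change is that the convergence rate is now the square of the previous convergence rate, i.e.:
$$\left(\frac{\kappa - 1}{\kappa + 1}\right)^2$$
Similarly, if we use the value $\alpha = 1/\lambda_{\max}$, we get the convergence rate:
$$\left(1 - \frac{1}{\kappa}\right)^2$$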
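The squaring can be observed numerically in the diagonalized coordinates, where the function is a sum of terms $(\lambda_i/2)x_i^2$ with minimum value 0. In the hypothetical two-eigenvalue example below, the learning rate $2/(\lambda_{\max} + \lambda_{\min})$ scales both coordinates by factors of the same absolute value, so the function value shrinks by the squared domain-side rate at every step.

```python
def f(x, eigenvalues):
    """f(x) = sum_i (lambda_i / 2) * x_i^2, with minimum value 0 at the origin."""
    return sum(0.5 * lam * xi * xi for xi, lam in zip(x, eigenvalues))


def step(x, eigenvalues, alpha):
    """One gradient step: coordinate i is multiplied by (1 - alpha * lambda_i)."""
    return [(1.0 - alpha * lam) * xi for xi, lam in zip(x, eigenvalues)]


eigenvalues = [1.0, 9.0]                 # hypothetical: condition number 9
alpha = 2.0 / (9.0 + 1.0)                # learning rate 2 / (lam_max + lam_min)
domain_rate = (9.0 - 1.0) / (9.0 + 1.0)  # (kappa - 1) / (kappa + 1)

x = [3.0, -1.0]
ratios = []
for _ in range(10):
    x_next = step(x, eigenvalues, alpha)
    ratios.append(f(x_next, eigenvalues) / f(x, eigenvalues))
    x = x_next

print(ratios, domain_rate ** 2)
```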
Convergence properties based on the learning rate: the case of a symmetric positive-semidefinite matrix
As before, our analysis is in terms of the eigenvalues of the Hessian matrix $2A$. However, zero is one of the eigenvalues, i.e., we have $\lambda_{\min} = 0$. Thus, the condition number is $\infty$.
An equivalent formulation is that the matrix $A$ is singular. As discussed on the quadratic function of multiple variables page, there are two possibilities:
- The vector $\vec{b}$ is not in the image of $A$: In this case, there is no global minimum, because the graph of the quadratic function contains a section that is a line with nonzero slope, which can take arbitrarily large positive and negative values.
- The vector $\vec{b}$ is in the image of $A$: In this case, we get an affine space worth of points at which the function takes its minimum value.
If we assume that we are in the second of these regimes, then we can show that the convergence behavior, viewed as convergence behavior towards the affine subspace as a whole rather than to any specific point, mimics the case of the symmetric positive-definite matrix, provided we replace $\lambda_{\min}$ by the smallest positive eigenvalue. What will happen is that all our moves will occur perpendicular to the affine subspace that we are trying to converge to.
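A small example of the second regime can be sketched as follows. The singular matrix `A`, the vector `b` (chosen to lie in the image of `A`), the starting point, and the learning rate are hypothetical. Here the Hessian has eigenvalues 0 and 4, and the minimum is attained on the line $x_1 + x_2 = 1$.

```python
def gradient(A, b, x):
    """Gradient of f(x) = x^T A x + b^T x + c for symmetric A, i.e. 2Ax + b."""
    n = len(x)
    return [2.0 * sum(A[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]


# Hypothetical singular example: A has eigenvalues 0 and 2, and b = A(-2, 0).
A = [[1.0, 1.0], [1.0, 1.0]]
b = [-2.0, -2.0]
alpha = 0.2            # below 2 / 4, using the only positive Hessian eigenvalue
x0 = [3.0, 0.0]

x = list(x0)
for _ in range(100):
    g = gradient(A, b, x)
    x = [xi - alpha * gi for xi, gi in zip(x, g)]

print(x)                 # a point on the line x_1 + x_2 = 1
print(x[0] - x[1])       # component along the line, unchanged from the start
```

Every gradient is a multiple of $(1, 1)$, which is perpendicular to the line of minimizers, so the difference $x_1 - x_2$ is preserved while $x_1 + x_2$ converges to 1.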
The interesting case: a very large but finite condition number
The situation of the greatest practical interest in gradient descent is where the condition number is very large, but still finite. In principle, we still have linear convergence (i.e., the distance between the function value and the optimal value decays exponentially with the number of iterations). However, the convergence rate is extremely close to 1, so convergence is extremely slow and, in practice, the bounds we get are unhelpful to us.
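To see how slow this can be, one can estimate how many iterations a given linear rate needs in order to shrink the distance from the optimum by a chosen factor. The sketch below assumes the learning rate $1/\lambda_{\max}$, for which the worst-case domain-side rate is $1 - 1/\kappa$; the function names, tolerance, and sample condition numbers are hypothetical.

```python
import math


def iterations_needed(rate, tolerance):
    """Estimate the smallest k with rate**k <= tolerance.

    Assumes 0 < rate < 1 and 0 < tolerance < 1. The result comes from
    floating-point logarithms, so treat it as an estimate.
    """
    return math.ceil(math.log(tolerance) / math.log(rate))


def rate_with_inverse_lam_max(kappa):
    """Worst-case domain-side rate for learning rate 1 / lam_max, kappa > 1."""
    return 1.0 - 1.0 / kappa


for kappa in (10.0, 1e3, 1e6):
    print(kappa, iterations_needed(rate_with_inverse_lam_max(kappa), 1e-6))
```

The iteration count grows roughly in proportion to the condition number, which is why a bound that is linear in principle can still be of little practical use.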