Gradient descent with constant learning rate for a quadratic function of multiple variables


Setup

This page includes a detailed analysis of gradient descent with constant learning rate for a quadratic function of multiple variables. It builds on the analysis at the page gradient descent with constant learning rate for a quadratic function of one variable.

Function

The function we are interested in is a function of the form:

$$f(\vec{x}) = \vec{x}^T A \vec{x} + \vec{b} \cdot \vec{x} + c$$

where $A$ is an $n \times n$ symmetric positive-definite matrix with entries $a_{ij}$, $\vec{b}$ is the column vector with entries $b_i$, and $c$ is a real constant. For part of this page, we will generalize somewhat to the case that $A$ is a symmetric positive-semidefinite matrix.

Note that we can impose the condition of symmetry without loss of generality, because we could replace the matrix $A$ with the symmetric matrix $(A + A^T)/2$ without changing the functional form. The positive-definite or positive-semidefinite condition is there to guarantee that the function has the right sort of convexity. In the positive-definite case, we are guaranteed that there is a unique point of local minimum that is also the point of absolute minimum. The positive-semidefinite case is somewhat more complicated: we could either have no minimum, or we could have an affine space worth of points at which the function has a local and absolute minimum.
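
As a quick sanity check of the symmetrization remark, here is a minimal NumPy sketch (the matrix and vector are arbitrary illustrative values, not part of the original page):

```python
import numpy as np

# Replacing A by its symmetric part (A + A^T)/2 leaves the
# quadratic form x^T A x unchanged.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))              # not necessarily symmetric
A_sym = (A + A.T) / 2                        # symmetric part of A
x = rng.standard_normal(3)

print(np.isclose(x @ A @ x, x @ A_sym @ x))  # True
```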

Learning algorithm

Suppose $\alpha$ is a positive real number. Gradient descent with constant learning rate $\alpha$ is an iterative algorithm that aims to find a (the) point of local minimum for $f$. The algorithm starts with a guess $\vec{x}_0$ and updates according to the rule:

$$\vec{x}_{k+1} = \vec{x}_k - \alpha \nabla f(\vec{x}_k) = \vec{x}_k - \alpha(2A\vec{x}_k + \vec{b})$$
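To make the update rule concrete, here is a minimal NumPy sketch of the iteration for the quadratic form above (the matrix, vector, learning rate, and starting point are illustrative choices, not part of the original page):

```python
import numpy as np

def gradient_descent_quadratic(A, b, alpha, x0, num_steps=100):
    """Gradient descent with constant learning rate alpha on
    f(x) = x^T A x + b . x + c, whose gradient is 2 A x + b (A symmetric)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        grad = 2 * A @ x + b    # gradient of the quadratic
        x = x - alpha * grad    # constant-learning-rate update
    return x

# Example with a symmetric positive-definite A; the exact minimizer
# solves 2 A x + b = 0.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = np.linalg.solve(2 * A, -b)
x_gd = gradient_descent_quadratic(A, b, alpha=0.2, x0=[5.0, 5.0])
print(np.allclose(x_gd, x_star, atol=1e-6))  # True for this learning rate
```
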
Convergence properties based on the learning rate

We will carry out our analysis of eigenvalues in terms of the Hessian matrix $H = 2A$. If we were instead dealing with the eigenvalues of $A$, we would have to double them to get the eigenvalues of $H = 2A$. Since the bounds on the learning rate are described as reciprocals of the eigenvalues, we would therefore obtain additional factors of 2 in the denominators, or equivalently, we would remove the factor of 2 from the numerators.

Note that in the case $n = 1$ (the one-variable function $f(x) = ax^2 + bx + c$), we have $\lambda_{\max} = 2a$ and $\lambda_{\min} = 2a$ in the notation below.

Summary for a symmetric positive-definite matrix

Since the Hessian $H = 2A$ is a symmetric positive-definite matrix, its singular values and eigenvalues coincide. Explicitly, we will denote by $\lambda_{\max}$ the largest eigenvalue of $H$ (and hence also the largest singular value) and by $\lambda_{\min}$ the smallest eigenvalue of $H$ (and hence also the smallest singular value). The condition number $\kappa$ of the matrix is the quotient:

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$

The minimum possible value of $\kappa$ is 1, and this occurs when $A$ is a scalar matrix (a scalar multiple of the identity matrix). In general, the larger the value of $\kappa$, the worse the performance of gradient descent with constant learning rate. The following are true:

Value of $\alpha$ | Conclusion about the convergence of gradient descent with constant learning rate (assume that we do not start at the point of local minimum) | Convergence rate
$\alpha > 2/\lambda_{\max}$ | It does not converge | --
$\alpha = 2/\lambda_{\max}$ | It converges if we start at a point that is in an appropriate affine subspace, but does not converge for most starting points. | --
$0 < \alpha < 2/\lambda_{\max}$ | It converges | Linear convergence, and the worst-case convergence rate is $\max\{|1 - \alpha\lambda_{\max}|, |1 - \alpha\lambda_{\min}|\}$
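
The thresholds in the table can be checked numerically. The sketch below uses an illustrative matrix and assumes the conventions above (eigenvalues taken from the Hessian $H = 2A$):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # symmetric positive-definite (example values)
H = 2 * A                         # Hessian of f(x) = x^T A x + b . x + c
eigvals = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
lam_min, lam_max = eigvals.min(), eigvals.max()
kappa = lam_max / lam_min         # condition number

for alpha in (0.1, 2 / lam_max, 1.0):
    if alpha < 2 / lam_max:
        rate = max(abs(1 - alpha * lam_max), abs(1 - alpha * lam_min))
        print(f"alpha={alpha:.3f}: converges, worst-case rate {rate:.3f}")
    elif alpha == 2 / lam_max:
        print(f"alpha={alpha:.3f}: converges only from a special affine subspace")
    else:
        print(f"alpha={alpha:.3f}: does not converge in general")
```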

Detailed analysis after transforming the problem to a setting where the coordinates are simple

Suppose the eigenvalues of the Hessian matrix $H = 2A$ are:

$$\lambda_1, \lambda_2, \dots, \lambda_n$$

all of which are positive because $A$ is positive-definite. Via a coordinate change of the domain using an orthogonal matrix, we can transform $A$ into a diagonal matrix with positive real entries. These entries are halves of the eigenvalues of the Hessian, i.e., they are $\lambda_1/2, \lambda_2/2, \dots, \lambda_n/2$. Via a translation, we can get rid of the linear term, and finally, we can translate the function value by a constant (this does not affect the point of local minimum, though it affects the minimum value). We can therefore obtain the following simplified functional form:

$$f(x_1, x_2, \dots, x_n) = \frac{\lambda_1}{2}x_1^2 + \frac{\lambda_2}{2}x_2^2 + \dots + \frac{\lambda_n}{2}x_n^2$$
The unique local and absolute minimum is at the zero vector.

Note that even though we know that our matrix $A$ can be transformed this way, we do not in general know how to bring it into this form -- if we did, we could directly solve the problem without using gradient descent (this is an alternate solution method). However, even though we may not know the explicit diagonalized form of the function, the fact that it does have such a form gives us information about how the gradient descent process converges.
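
The following sketch verifies numerically that an orthogonal eigendecomposition puts the quadratic part into the diagonal form above (the matrix and test point are illustrative):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
d, Q = np.linalg.eigh(A)   # A = Q @ diag(d) @ Q.T with Q orthogonal
lam = 2 * d                # eigenvalues of the Hessian H = 2A

x = np.array([1.5, -0.7])  # arbitrary test point
y = Q.T @ x                # coordinates after the orthogonal change of basis

# x^T A x equals sum_i (lambda_i / 2) * y_i^2 in the new coordinates.
print(np.isclose(x @ A @ x, np.sum((lam / 2) * y**2)))  # True
```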

Starting with a point:

$$(x_1, x_2, \dots, x_n)$$

we obtain that the gradient vector at the point is:

$$(\lambda_1 x_1, \lambda_2 x_2, \dots, \lambda_n x_n)$$

Therefore, after one step of gradient descent with learning rate $\alpha$, the new coordinates are:

$$((1 - \alpha\lambda_1)x_1, (1 - \alpha\lambda_2)x_2, \dots, (1 - \alpha\lambda_n)x_n)$$

In other words, gradient descent proceeds independently in each coordinate, in exactly the way it would for gradient descent with constant learning rate for a quadratic function of one variable. In the special case that a particular coordinate is already zero, it stays zero.

In order to guarantee convergence to the point of local minimum (the zero vector), we need to make sure that the learning rate $\alpha$ satisfies $\alpha < 2/\lambda_i$ for each $i$ such that the initial value of $x_i$ is nonzero. Moreover, for the $i^{\text{th}}$ direction, the convergence rate is $|1 - \alpha\lambda_i|$, and if that value is 1 or more (that happens if $\alpha \ge 2/\lambda_i$), then we do not obtain convergence to 0 in that coordinate.
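
A small sketch of this per-coordinate behavior in the diagonalized setting (the eigenvalues, learning rate, and starting point are illustrative):

```python
import numpy as np

lam = np.array([4.0, 1.0, 0.5])   # eigenvalues of the Hessian (illustrative)
alpha = 0.4                       # less than 2 / lambda_i for every coordinate
x = np.array([1.0, 1.0, 1.0])     # starting point with all coordinates nonzero

for _ in range(50):
    x = (1 - alpha * lam) * x     # each step scales coordinate i by (1 - alpha * lambda_i)

print(np.abs(1 - alpha * lam))    # per-coordinate rates [0.6, 0.6, 0.8], all < 1
print(np.max(np.abs(x)) < 1e-3)   # True: every coordinate has converged toward 0
```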

Worst-case convergence rate across all coordinates as a function of learning rate

The worst-case convergence rate (assuming all coordinates of our starting point are nonzero) is therefore obtained as follows. Consider the expression:

$$\max_{1 \le i \le n} |1 - \alpha\lambda_i|$$

This is the maximum of the convergence rates over all the coordinates.

If this quantity is 1 or more, then the worst-case situation is no convergence in at least one coordinate. If the quantity is less than 1, then it equals the worst-case convergence rate (or, the convergence rate in the slowest coordinate).

Let us look more closely at this expression. Consider the continuous function:

$$F(\alpha, \lambda) = |1 - \alpha\lambda|$$

It can be verified that this is convex in $\alpha$ as well as in $\lambda$. Therefore, for a given $\alpha$, the maximum over $\lambda$ in the interval $[\lambda_{\min}, \lambda_{\max}]$ is attained at the extreme values of $\lambda$, i.e., the expression above equals:

$$\max\{|1 - \alpha\lambda_{\min}|, |1 - \alpha\lambda_{\max}|\}$$

We further need to cap this value at 1, i.e., if the maximum value above is 1 or more, then we do not have convergence in the worst case. We therefore obtain the following piecewise functional form for the worst-case convergence rate as a function of the learning rate $\alpha$:

$$\text{Worst-case convergence rate} = \begin{cases} 1 - \alpha\lambda_{\min}, & 0 < \alpha \le \dfrac{2}{\lambda_{\max} + \lambda_{\min}} \\ \alpha\lambda_{\max} - 1, & \dfrac{2}{\lambda_{\max} + \lambda_{\min}} \le \alpha < \dfrac{2}{\lambda_{\max}} \\ \text{no convergence}, & \alpha \ge \dfrac{2}{\lambda_{\max}} \end{cases}$$
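
Here is a sketch of this piecewise worst-case rate as a function of the learning rate, using illustrative values of $\lambda_{\min}$ and $\lambda_{\max}$:

```python
# Worst-case convergence rate as a function of alpha, capped at 1.
lam_min, lam_max = 0.5, 4.0       # illustrative Hessian eigenvalues

def worst_case_rate(alpha):
    rate = max(abs(1 - alpha * lam_max), abs(1 - alpha * lam_min))
    return rate if rate < 1 else None   # None: no convergence in the worst case

breakpoint_alpha = 2 / (lam_max + lam_min)   # where the two linear pieces meet
for alpha in (0.1, breakpoint_alpha, 0.45, 2 / lam_max, 0.6):
    print(f"alpha={alpha:.4f}  worst-case rate: {worst_case_rate(alpha)}")
# The rate is 1 - alpha*lam_min on the first piece, alpha*lam_max - 1 on the
# second, and None (no convergence) once alpha >= 2 / lam_max.
```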

Optimization of learning rate to get the best upper bound on the worst-case convergence rate on the domain side

Unless $\lambda_{\max} = \lambda_{\min}$ (i.e., $A$ is a scalar matrix), there is no choice of learning rate that guarantees immediate or even finite-step convergence. Different learning rates work better from different starting points. If we only know $\lambda_{\max}$ and $\lambda_{\min}$, then the value of $\alpha$ that makes the worst-case convergence as fast as possible is:

$$\alpha = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$

This can be computed by minimizing over $\alpha$ the expression $\max\{|1 - \alpha\lambda_{\max}|, |1 - \alpha\lambda_{\min}|\}$, described by the piecewise expression above.

The corresponding upper bound on the convergence rate is (note that smaller upper bounds indicate faster convergence):

$$\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}$$

In terms of the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$, this upper bound on the convergence rate can be expressed as:

$$\frac{\kappa - 1}{\kappa + 1}$$

Note that this checks out in the case $\kappa = 1$: in this case, $A$ is a scalar matrix, so all the eigenvalues are equal to a common value $\lambda$, and we can choose $\alpha = 2/(\lambda_{\max} + \lambda_{\min}) = 1/\lambda$ and converge in one step. The case $n = 1$ is a special case of $\kappa = 1$.
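
A quick numerical check of the optimal learning rate and the resulting bound, with illustrative eigenvalues:

```python
import math

lam_min, lam_max = 0.5, 4.0                  # illustrative Hessian eigenvalues
kappa = lam_max / lam_min                    # condition number

alpha_star = 2 / (lam_max + lam_min)         # learning rate balancing both extremes
rate_star = max(abs(1 - alpha_star * lam_max),
                abs(1 - alpha_star * lam_min))

print(math.isclose(rate_star, (lam_max - lam_min) / (lam_max + lam_min)))  # True
print(math.isclose(rate_star, (kappa - 1) / (kappa + 1)))                  # True
```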

Optimization of learning rate to get the best upper bound on the worst-case convergence on the function value side

Fill this in later