Approximate Newton's method for optimization of a function of multiple variables


Definition

The approximate Newton's method (or approximate Newton descent, or quasi-Newton method) for optimization of a function of multiple variables is a variation of Newton's method for optimization of a function of multiple variables in which we do not necessarily use the exact inverse Hessian at each iteration, but instead use a matrix intended to approximate or bound it.

Note that the term quasi-Newton method may be used to describe this type of method, but that term is often restricted in meaning to a narrower subset of approximate Newton's methods.

Iterative step

Note that we use parenthesized superscripts to denote iterates and iteration-specific variables, and subscripts to denote coordinates. Parenthesized superscripts do not denote exponents.

Suppose we are applying the approximate Newton's method to minimize a function $f$ of $n$ variables. We will denote the input to $f$ as a vector $\vec{x} = (x_1, x_2, \dots, x_n)$. Given a guess $\vec{x}^{(k)}$, the next guess $\vec{x}^{(k+1)}$ is computed as follows:

$$\vec{x}^{(k+1)} = \vec{x}^{(k)} - H^{(k)} \nabla f(\vec{x}^{(k)})$$

Here, $H^{(k)}$ is intended as a stand-in (an approximation or a bound of some sort) for the inverse of the Hessian matrix of $f$. If we are using the usual Newton's method for optimization of a function of multiple variables, then $H^{(k)}$ would be the inverse of the Hessian $\nabla^2 f(\vec{x}^{(k)})$ at the current guess. However, in order to guarantee global convergence, $H^{(k)}$ may instead be chosen in terms of global bounds on the Hessian rather than the Hessian at a particular point.
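
To make the iterative step concrete, here is a minimal sketch in Python (assuming NumPy is available). The names `approximate_newton_step`, `f_grad`, and `approx_inv_hessian` are illustrative placeholders, not names from the article: `approx_inv_hessian` stands for whatever rule a particular method uses to produce the matrix $H^{(k)}$.

```python
import numpy as np

def approximate_newton_step(x, f_grad, approx_inv_hessian):
    """One step of the update x^(k+1) = x^(k) - H^(k) * grad f(x^(k)).

    x                  : current guess x^(k), a 1-D NumPy array
    f_grad             : callable returning the gradient of f at a point
    approx_inv_hessian : callable returning H^(k), the stand-in for the
                         inverse Hessian (it may or may not depend on x)
    """
    H = approx_inv_hessian(x)          # approximation/bound for the inverse Hessian
    return x - H @ f_grad(x)

def approximate_newton_descent(x0, f_grad, approx_inv_hessian, num_iterations=100):
    """Iterate the step starting from the initial guess x^(0)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iterations):
        x = approximate_newton_step(x, f_grad, approx_inv_hessian)
    return x

# Illustrative use: f(x) = x1^2 + 2*x2^2, whose Hessian is diag(2, 4).
# Choosing the constant matrix diag(1/2, 1/4) as the approximate inverse Hessian
# gives the "constant approximate inverse Hessian" variant listed below.
grad = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
H_const = np.diag([0.5, 0.25])
x_min = approximate_newton_descent([3.0, -2.0], grad, lambda x: H_const)
```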


Types of approximate Newton's method

| Method | Nature of the matrix $H^{(k)}$ | Is $H^{(k)}$ sparse? | Does $H^{(k)}$ depend on the current value of $\vec{x}^{(k)}$? | Is the rule for choosing $H^{(k)}$ stationary, or dependent on the number of iterations? |
| --- | --- | --- | --- | --- |
| gradient descent with constant learning rate | The scalar matrix $\alpha I$, where $\alpha$ is the constant learning rate | Yes | No | Stationary |
| gradient descent with decaying learning rate | A scalar matrix that changes with each iteration | Yes | No | Dependent on the number of iterations |
| gradient descent using Newton's method | A scalar matrix that changes with each iteration | Yes | Yes | Stationary |
| gradient descent with exact line search | A scalar matrix that changes with each iteration | Yes | Yes | Stationary |
| parallel coordinate descent with constant learning rate | A diagonal matrix with positive entries | Yes | No | Stationary |
| sequential coordinate descent (of various types) | A matrix with exactly one nonzero entry, located on the diagonal; the location of the entry changes with the iteration | Yes | Depends on the type of variant used | Depends on the type of variant used |
| Newton's method for optimization of a function of multiple variables | A symmetric positive-definite matrix | No (not guaranteed) | Yes | Stationary |
| Approximate Newton's method with constant approximate inverse Hessian | A symmetric positive-definite matrix that remains constant across iterations | Not guaranteed in general, though some methods of this type do guarantee sparsity: gradient descent with constant learning rate and parallel coordinate descent with constant learning rate are special cases where the "approximate inverse Hessian" is a scalar matrix and a diagonal matrix respectively | No | Stationary |
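
As a rough illustration of how the rows above are all instances of the same update rule, the self-contained sketch below compares three choices of $H^{(k)}$ on an illustrative quadratic test function: a scalar matrix (gradient descent with constant learning rate), a constant diagonal matrix (parallel coordinate descent with constant learning rate), and the exact inverse Hessian (Newton's method). The test function, learning rate, and iteration count are arbitrary choices for the example, not values from the article.

```python
import numpy as np

# Illustrative quadratic: f(x) = x1^2 + 2*x2^2 + x1*x2, minimized at (0, 0).
A = np.array([[2.0, 1.0],
              [1.0, 4.0]])              # Hessian of f (constant, since f is quadratic)
grad = lambda x: A @ x                  # gradient of f

x0 = np.array([3.0, -2.0])

# 1. Gradient descent with constant learning rate alpha:
#    H^(k) is the scalar matrix alpha * I at every iteration.
alpha = 0.1
H_scalar = lambda x: alpha * np.eye(2)

# 2. Parallel coordinate descent with constant learning rate:
#    H^(k) is a fixed diagonal matrix with positive entries
#    (here, the reciprocals of the Hessian's diagonal entries).
H_diag = lambda x: np.diag(1.0 / np.diag(A))

# 3. Newton's method: H^(k) is the exact inverse Hessian at x^(k)
#    (constant here only because f is quadratic).
H_newton = lambda x: np.linalg.inv(A)

for name, H in [("constant learning rate", H_scalar),
                ("diagonal (parallel coordinate descent)", H_diag),
                ("exact inverse Hessian (Newton)", H_newton)]:
    x = x0.copy()
    for _ in range(50):
        x = x - H(x) @ grad(x)          # the common update rule x - H^(k) grad f(x)
    print(name, x)                      # all three approach the minimizer (0, 0)
```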