# Parallel coordinate descent

## Definition

**Parallel coordinate descent** is a variant of gradient descent where we use a different learning rate in each coordinate. Explicitly, with ordinary gradient descent, we define each iterate by subtracting a *scalar* multiple of the gradient vector from the previous iterate:

Ordinary gradient descent: $x^{(k+1)} = x^{(k)} - \alpha \nabla f(x^{(k)})$

In parallel coordinate descent, we instead use a vector learning rate, i.e., a learning rate that could be different in each coordinate:

Parallel coordinate descent: for each coordinate $i$: $x_i^{(k+1)} = x_i^{(k)} - \alpha_i \dfrac{\partial f}{\partial x_i}\left(x^{(k)}\right)$

Alternatively, using coordinate-wise vector multiplication (denoted $\odot$), we can describe the above as:

$$x^{(k+1)} = x^{(k)} - \alpha \odot \nabla f(x^{(k)})$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ is the vector of learning rates.
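The following NumPy sketch illustrates the two update rules side by side. The quadratic objective, the matrix `A`, the helper names, and the particular learning rate values are illustrative assumptions only; they are not prescribed by the definition above.

```python
import numpy as np

def gradient_descent_step(x, grad, alpha):
    """Ordinary gradient descent: subtract a scalar multiple of the gradient."""
    return x - alpha * grad(x)

def parallel_cd_step(x, grad, alpha_vec):
    """Parallel coordinate descent: subtract the coordinate-wise (Hadamard)
    product of the learning rate vector and the gradient."""
    return x - alpha_vec * grad(x)  # NumPy's * is coordinate-wise

# Illustrative quadratic f(x) = x^T A x / 2 with an assumed matrix A.
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
print(gradient_descent_step(x, grad, 0.2))              # scalar learning rate
print(parallel_cd_step(x, grad, np.array([0.2, 0.4])))  # per-coordinate rates
```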

## Intuition behind choice of learning rate

For ordinary gradient descent, the learning rate is typically chosen to be the reciprocal of the second derivative, or the reciprocal of a suitable global bound on the second derivative. We may therefore expect that for parallel coordinate descent, we should choose, in each coordinate direction, the reciprocal of the pure second-order partial derivative of $f$ in that direction. In other words, we might expect that the learning rate vector $\alpha$ has coordinates:

$$\alpha_i = \left(\dfrac{\partial^2 f}{\partial x_i^2}\right)^{-1}$$

The problem with this naive choice is that if the coordinates are correlated, then we can *overstep*. To avoid overstepping, we can take the above expression and divide it by an upper bound on the amount of correlation between coordinates. *Fill this in later*
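To make the naive choice concrete, here is a small sketch that sets each coordinate's learning rate to the reciprocal of the corresponding diagonal second derivative. The divisor `beta`, standing in for the (as yet unspecified) upper bound on the correlation between coordinates, and the example matrix are assumptions for illustration only.

```python
import numpy as np

def naive_parallel_rates(hess_diag, beta=1.0):
    """Per-coordinate learning rates: reciprocals of the pure second-order
    partial derivatives, divided by `beta`, a stand-in for an upper bound
    on the coupling (correlation) between coordinates. beta = 1 recovers
    the naive rule above; beta > 1 shrinks every step to guard against
    overstepping."""
    return 1.0 / (beta * np.asarray(hess_diag))

# For the quadratic f(x) = x^T A x / 2, the pure second-order partial
# derivatives are the diagonal entries of A (assumed example matrix).
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
alpha_vec = naive_parallel_rates(np.diag(A), beta=2.0)
print(alpha_vec)  # [0.125, 0.25]
```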

## Types of parallel coordinate descent

Name | Verbal description | Is the rule stationary or dependent on number of iterations? | Mathematical formalism | Further comments |
---|---|---|---|---|
Parallel coordinate descent with constant learning rate (default meaning of parallel coordinate descent) | Here, the learning rate vector is constant and independent of the number of iterations. Note that it could still differ between the different coordinates -- that's the point of parallel coordinate descent. | Stationary | $x^{(k+1)} = x^{(k)} - \alpha \odot \nabla f(x^{(k)})$, where $\alpha$ is the learning rate vector | |
Parallel coordinate descent with exact line search | Here, we perform an exact line search in each coordinate; however, we still need to perform an overstep correction. | Stationary | | |
Parallel coordinate descent using Newton's method | Here, we use Newton's method for optimization of a function of one variable along each coordinate; however, we still need to perform an overstep correction. | Stationary | | A rough sketch appears below the table. |
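As a rough illustration of the Newton-style variant in the last row of the table, the sketch below takes a one-variable Newton step along each coordinate simultaneously. The divisor `beta`, standing in for the overstep correction, and the example quadratic are assumptions; the exact form of the correction is not specified above.

```python
import numpy as np

def parallel_newton_step(x, grad, hess_diag, beta=1.0):
    """One iteration of parallel coordinate descent using a one-variable
    Newton step in each coordinate: divide each partial derivative by the
    corresponding pure second-order partial derivative. The divisor `beta`
    is a placeholder for the overstep correction mentioned in the table."""
    return x - grad(x) / (beta * hess_diag(x))

# Illustrative quadratic f(x) = x^T A x / 2 (assumed example).
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: A @ x
hess_diag = lambda x: np.diag(A)

x = np.array([1.0, 1.0])
for _ in range(20):
    x = parallel_newton_step(x, grad, hess_diag, beta=2.0)
print(x)  # approaches the minimizer [0, 0]
```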