L1-regularized quadratic function of multiple variables

==Definition==


A <math>L^1</math>-'''regularized quadratic function''' of the variables <math>x_1,x_2,\dots,x_n</math> is a function of the form (satisfying the positive definiteness condition below):


<math>f(x_1,x_2,\dots,x_n) := \left(\sum_{i=1}^n \sum_{j=1}^n a_{ij} x_ix_j\right) + \left(\sum_{i=1}^n b_ix_i\right) + \lambda \sum_{i=1}^n |x_i| + c</math>
In vector form, if we denote by <math>\vec{x}</math> the column vector with coordinates <math>x_1,x_2,\dots,x_n</math>, then we can write the function as:

<math>\vec{x}^TA\vec{x} + \vec{b}^T\vec{x} + \lambda \|\vec{x}\|_1 + c</math>

where <math>A</math> is the <math>n \times n</math> matrix with entries <math>a_{ij}</math> and <math>\vec{b}</math> is the column vector with entries <math>b_i</math>.
Note that the matrix <math>A</math> is non-unique: if <math>A + A^T = F + F^T</math> then we could replace <math>A</math> by <math>F</math>. Therefore, we could choose to replace <math>A</math> by the matrix <math>(A + A^T)/2</math>. We will thus assume that <math>A</math> is a [[linear:symmetric matrix|symmetric matrix]].
'''We impose the further restriction that the matrix <math>A</math> be a [[linear:symmetric positive definite matrix|symmetric positive definite matrix]].'''
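As a quick sanity check, the summation form and the vector form evaluate identically. The sketch below uses a hypothetical small instance (the particular <math>A</math>, <math>\vec{b}</math>, <math>\lambda</math>, <math>c</math> values are made up for illustration, with <math>A</math> symmetric positive definite):

```python
import numpy as np

# Hypothetical instance with n = 2; A is symmetric positive definite.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
lam = 0.3   # the regularization coefficient lambda
c = 4.0

def f_summation(x):
    """Evaluate f via the double-sum definition."""
    n = len(x)
    quad = sum(A[i, j] * x[i] * x[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * x[i] for i in range(n))
    reg = lam * sum(abs(x[i]) for i in range(n))
    return quad + lin + reg + c

def f_vector(x):
    """Evaluate f via the vector form x^T A x + b^T x + lam*||x||_1 + c."""
    return x @ A @ x + b @ x + lam * np.sum(np.abs(x)) + c

x = np.array([0.7, -1.2])
assert np.isclose(f_summation(x), f_vector(x))
```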
==Related functions==
* [[Quadratic function]]
* [[Quadratic function of multiple variables]]
* [[L1-regularized quadratic function]]


==Key data==

{| class="wikitable"
|-
! Item !! Value
|-
| default domain || the whole of <math>\mathbb{R}^n</math>
|}

==Differentiation==

===Partial derivatives and gradient vector===

The partial derivative with respect to the variable <math>x_i</math>, and therefore also the <math>i^{th}</math> coordinate of the [[gradient vector]] (if it exists), is given as follows when <math>x_i \ne 0</math>:


<math>\frac{\partial f}{\partial x_i} = \left(\sum_{j=1}^n (a_{ij} + a_{ji})x_j\right) + b_i + \lambda \operatorname{sgn}(x_i)</math>
 
By the symmetry assumption, this becomes:
 
<math>\frac{\partial f}{\partial x_i} = \left(\sum_{j=1}^n 2a_{ij}x_j\right) + b_i + \lambda \operatorname{sgn}(x_i)</math>


The partial derivative is undefined when <math>x_i = 0</math>.
The gradient vector exists if and only if ''all the coordinates are nonzero''.


In vector notation, the gradient vector is as follows for all <math>\vec{x}</math> with all coordinates nonzero:


<math>\nabla f (\vec{x}) = 2A\vec{x} + \vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x})</math>


where <math>\overline{\operatorname{sgn}}</math> is the [[signum vector function]].
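The closed form for the gradient can be verified numerically. The sketch below (with a hypothetical 2-variable instance; the specific values of <math>A</math>, <math>\vec{b}</math>, <math>\lambda</math>, <math>c</math> are made up) compares it against central finite differences at a point with all coordinates nonzero:

```python
import numpy as np

# Hypothetical instance: A symmetric positive definite, as the page assumes.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
lam = 0.3
c = 4.0

def f(x):
    return x @ A @ x + b @ x + lam * np.sum(np.abs(x)) + c

def grad_f(x):
    # Closed form: 2*A*x + b + lambda*sgn(x), valid when no coordinate is 0.
    return 2 * A @ x + b + lam * np.sign(x)

# Compare with central finite differences at a point with nonzero coordinates.
x0 = np.array([0.7, -1.2])
h = 1e-6
numeric = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
                    for e in np.eye(2)])
assert np.allclose(grad_f(x0), numeric, atol=1e-5)
```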
===Hessian matrix===
The Hessian matrix of the function, defined wherever all the coordinates are nonzero, is the matrix <math>2A</math>.
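This too can be checked numerically; the short sketch below (same hypothetical instance as above, with made-up values) estimates the Hessian by central differences away from the coordinate hyperplanes:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
lam = 0.3
c = 4.0

def f(x):
    return x @ A @ x + b @ x + lam * np.sum(np.abs(x)) + c

def numerical_hessian(f, x, h=1e-4):
    """Central-difference estimate of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

# At a point with all coordinates nonzero, the Hessian is the constant 2A.
x0 = np.array([0.7, -1.2])
assert np.allclose(numerical_hessian(f, x0), 2 * A, atol=1e-4)
```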
==Optimization problem==
We know the following two facts about the function:
* The function is a [[strictly convex function]], and therefore it has a unique point of local minimum on the whole domain, which is also the point of absolute minimum. This point is either a point where the gradient vector is undefined, or the unique point where the gradient vector is defined and equal to zero.
* The function is a piecewise quadratic function, with the quadratic form associated with the function (namely, the quadratic term in the expression) the same in all pieces. The total number of pieces is <math>2^n</math>, namely the different orthants (the points with different sign combinations for the coordinates).
===Preliminaries===
Since <math>A</math> is a [[linear:symmetric positive definite matrix|symmetric positive definite matrix]], we can write <math>A</math> in the form:
<math>A = M^TM</math>
where <math>M</math> is an <math>n \times n</math> [[linear:invertible matrix|invertible matrix]].
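One concrete way to obtain such an <math>M</math> (an illustrative choice, not the only one) is the Cholesky factorization: for a symmetric positive definite <math>A</math> it gives <math>A = LL^T</math> with <math>L</math> lower-triangular and invertible, so <math>M = L^T</math> works. A minimal sketch with a made-up <math>A</math>:

```python
import numpy as np

# Hypothetical symmetric positive definite A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

L = np.linalg.cholesky(A)   # A = L @ L.T, L lower-triangular
M = L.T                     # so A = M.T @ M with M upper-triangular
assert np.allclose(M.T @ M, A)

# M is invertible: its diagonal entries (those of L) are strictly positive.
assert np.all(np.diag(M) > 0)
```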
For <math>\vec{x}</math> with all coordinates nonzero, we have <math>\lambda|x_i| = \lambda \operatorname{sgn}(x_i)x_i</math>, so on each orthant we can "complete the square" for this function:
<math>f(\vec{x}) = \left(M\vec{x} + \frac{1}{2}(M^T)^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))\right)^T\left(M\vec{x} + \frac{1}{2}(M^T)^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))\right) + \left(c - \frac{1}{4}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))^TA^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))\right)</math>
In other words:
<math>f(\vec{x}) = \left \| M\vec{x} + \frac{1}{2}(M^T)^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))\right \|^2 + \left(c - \frac{1}{4}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))^TA^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))\right)</math>
If the squared norm term equals zero, this happens at a unique point, and that point must satisfy:
<math>M\vec{x} + \frac{1}{2}(M^T)^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x})) = \vec{0}</math>
Simplifying using <math>A = M^TM</math>, we obtain that if we can find a solution to the equation below (note that <math>\vec{x}</math> appears on both sides, through its sign vector), then it is the unique point of local and absolute minimum:
<math>\vec{x} = -\frac{1}{2}A^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))</math>
Moreover, the value of the minimum is:
<math>c - \frac{1}{4}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))^TA^{-1}(\vec{b} + \lambda \overline{\operatorname{sgn}}(\vec{x}))</math>
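The fixed-point equation suggests a brute-force sketch: try each of the <math>2^n</math> sign patterns, solve the resulting linear system, and keep a solution whose signs are self-consistent. The code below is illustrative only, with made-up data, and assumes the minimizer has all coordinates nonzero; a complete solver would also examine points on the coordinate hyperplanes:

```python
import itertools
import numpy as np

# Hypothetical instance: A symmetric positive definite.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
lam = 0.3
c = 4.0

def f(x):
    return x @ A @ x + b @ x + lam * np.sum(np.abs(x)) + c

def minimize_by_orthant_search(A, b, lam):
    """For each sign pattern s in {-1,+1}^n, solve x = -(1/2) A^{-1} (b + lam*s)
    and return the solution whose signs agree with s, if one exists.

    Sketch only: this finds the minimizer when it has all coordinates
    nonzero, and ignores candidates on the coordinate hyperplanes."""
    n = len(b)
    for s in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(s)
        x = -0.5 * np.linalg.solve(A, b + lam * s)
        if np.all(np.sign(x) == s):
            return x
    return None

x_star = minimize_by_orthant_search(A, b, lam)
s = np.sign(x_star)

# Value predicted by the closed form c - (1/4)(b+lam*s)^T A^{-1} (b+lam*s):
v = b + lam * s
predicted = c - 0.25 * v @ np.linalg.solve(A, v)
assert np.isclose(f(x_star), predicted)

# Sanity check: random nearby points do not beat x_star.
rng = np.random.default_rng(0)
for _ in range(100):
    assert f(x_star + 0.1 * rng.standard_normal(2)) >= f(x_star) - 1e-12
```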

Latest revision as of 03:18, 12 May 2014
