Setup
This page gives a detailed discussion of Newton's method for optimization of a function of one variable, applied to the logistic log-loss function of one variable.
Function
Explicitly, the function is:

$$f(x) := -(p \ln(g(x)) + (1 - p)\ln(1 - g(x)))$$

where $p \in (0,1)$ is a fixed real number, $g$ is the logistic function, and $\ln$ denotes the natural logarithm. Explicitly, $g(x) = \frac{1}{1 + e^{-x}}$.

Note that $1 - g(x) = g(-x)$, so the above can be written as:

$$f(x) = -(p \ln(g(x)) + (1 - p)\ln(g(-x)))$$

(we avoid the extremes $p = 0$ and $p = 1$ because in the extreme case the optimum is at infinity).

More explicitly, $f$ is the function:

$$f(x) = p \ln(1 + e^{-x}) + (1 - p)\ln(1 + e^{x})$$

The optimal value that we want to converge to is:

$$x^* = \ln\left(\frac{p}{1 - p}\right)$$
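As a concrete reference point, here is a minimal Python sketch of these definitions (the helper names `logistic`, `log_loss`, and `optimum` are ours, introduced purely for illustration):

```python
import math

def logistic(x):
    """The logistic function g(x) = 1 / (1 + e^(-x)), split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def log_loss(x, p):
    """f(x) = p*ln(1 + e^(-x)) + (1 - p)*ln(1 + e^x); assumes moderate |x|."""
    return p * math.log1p(math.exp(-x)) + (1 - p) * math.log1p(math.exp(x))

def optimum(p):
    """The optimum x* = ln(p / (1 - p)), i.e., the logit of p, for 0 < p < 1."""
    return math.log(p / (1 - p))
```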
Learning algorithm
The iterative step is as follows:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

Explicitly, since $f'(x) = g(x) - p$ and $f''(x) = g(x)(1 - g(x))$, this works out to:

$$x_{n+1} = x_n - \frac{g(x_n) - p}{g(x_n)(1 - g(x_n))}$$
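A sketch of this update rule, reusing the `logistic` helper from the sketch above (the function names are again ours, not from the original page):

```python
def newton_step(x, p):
    """One Newton step x - f'(x)/f''(x), with f'(x) = g(x) - p and f''(x) = g(x)(1 - g(x))."""
    g = logistic(x)
    return x - (g - p) / (g * (1.0 - g))

def newton_optimize(x0, p, num_steps=10):
    """Iterate the Newton step a fixed number of times starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = newton_step(x, p)
    return x
```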
Convergence properties
Domain of convergence
Note that applying Newton's method for optimization on $f$ is equivalent to applying Newton's method for root-finding for a function of one variable to the derivative $f'(x) = g(x) - p$.

We know that Newton's method converges from close enough to the root if the function has the right kind of convexity. Specifically, it converges from any starting point at which both the function and its second derivative have the same sign. In particular, we see that Newton's method converges for any starting point in the interval with endpoints 0 and $\ln\left(\frac{p}{1 - p}\right)$ (this interval is $\left[\ln\left(\frac{p}{1 - p}\right), 0\right]$ if $p < 1/2$, and $\left[0, \ln\left(\frac{p}{1 - p}\right)\right]$ if $p > 1/2$): on this interval, $f'(x) = g(x) - p$ and $f'''(x) = g(x)(1 - g(x))(1 - 2g(x))$ have the same sign. Moreover, convergence from any starting point in this domain is monotone: each step moves in the same direction.
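As a quick numerical illustration of the monotone convergence (with an arbitrarily chosen $p = 0.9$, reusing `optimum` and `newton_step` from the sketches above):

```python
p = 0.9
x_star = optimum(p)   # ln(0.9 / 0.1) = ln 9 ≈ 2.1972
x = 0.5               # a starting point strictly between 0 and x*
for n in range(6):
    x = newton_step(x, p)
    print(n, x)       # the iterates increase monotonically toward x*
```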
Convergence rate
Case $p \neq 1/2$
Recall that Newton's method converges quadratically from sufficiently close to a root of multiplicity one. Note that this refers to Newton's method for root-finding, so the convergence rate is computed for the derivative $f'$. In particular, we obtain that, if $x^* = \ln\left(\frac{p}{1 - p}\right)$ is the point of optimum, Newton's method converges quadratically, and the quadratic convergence rate is:

$$\lim_{n \to \infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^2} = \frac{|f'''(x^*)|}{2 f''(x^*)} = \frac{|p(1 - p)(1 - 2p)|}{2 p(1 - p)} = \frac{|1 - 2p|}{2}$$

Recall that a smaller convergence rate means faster convergence. Note that since $0 < p < 1$, we have $|1 - 2p| < 1$, so the worst-case quadratic convergence rate is $1/2$ (approached as $p \to 0$ or $p \to 1$), while the convergence rate is even faster when $p$ is close to 1/2.
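A numerical sanity check of this rate, again with the arbitrary choice $p = 0.9$ (for which $|1 - 2p|/2 = 0.4$), reusing the helpers above:

```python
p = 0.9
x_star = optimum(p)
x = 1.0               # inside the domain of convergence
for _ in range(4):
    e_old = abs(x - x_star)
    x = newton_step(x, p)
    e_new = abs(x - x_star)
    if e_new > 0:
        print(e_new / e_old ** 2)   # ratios approach |1 - 2p|/2 = 0.4
```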
Case $p = 1/2$
Again, recall that Newton's method converges quadratically from sufficiently close to a root of multiplicity one, but this time, we have that $f'''(x^*) = 0$ (the factor $1 - 2p$ vanishes, and the optimum is $x^* = 0$). It turns out that $f''''(x^*) = -1/8 \neq 0$, so we get cubic convergence, with convergence rate:

$$\lim_{n \to \infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^3} = \frac{|f''''(x^*)|}{3 f''(x^*)} = \frac{1/8}{3 \cdot 1/4} = \frac{1}{6}$$

Indeed, at $p = 1/2$ the Newton update simplifies to $x_{n+1} = x_n - \sinh(x_n)$, and the Taylor expansion $x - \sinh(x) = -x^3/6 - x^5/120 - \cdots$ exhibits the cubic rate directly.
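And a corresponding numerical check for $p = 1/2$ (where $x^* = 0$), again reusing `newton_step` from above:

```python
p = 0.5               # optimum(0.5) = 0
x = 1.0               # close enough to 0 for the local analysis to apply
for _ in range(3):    # stop before floating-point rounding dominates
    e_old = abs(x)
    x = newton_step(x, p)
    if x != 0.0:
        print(abs(x) / e_old ** 3)   # ratios approach 1/6 ≈ 0.1667
```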