Setup
This page gives a detailed discussion of Newton's method for optimization of a function of one variable, applied to the logistic log-loss function of one variable.
Function
Explicitly, the function is:

$$f(x) := -(p \ln(g(x)) + (1 - p)\ln(1 - g(x)))$$

where $p \in (0,1)$ is a fixed real number, $g$ is the logistic function, and $\ln$ denotes the natural logarithm. Explicitly, $g(x) = \frac{1}{1 + e^{-x}}$.

Note that $1 - g(x) = g(-x)$, so the above can be written as:

$$f(x) = -(p \ln(g(x)) + (1 - p)\ln(g(-x)))$$

(we avoid the extremes $p = 0$ and $p = 1$ because in the extreme case the optimum is at infinity).

More explicitly, $f$ is the function:

$$f(x) = p \ln(1 + e^{-x}) + (1 - p)\ln(1 + e^{x})$$

The optimal value that we want to converge to is:

$$x^* = \ln\left(\frac{p}{1 - p}\right)$$
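As a concrete reference point, here is a minimal Python sketch of these definitions (the helper names `logistic`, `log_loss`, and `optimum` are ours, introduced purely for illustration):

```python
import math

def logistic(x):
    """The logistic function g(x) = 1 / (1 + e^(-x)), split by sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def log_loss(x, p):
    """f(x) = p*ln(1 + e^(-x)) + (1 - p)*ln(1 + e^x); assumes moderate |x|."""
    return p * math.log1p(math.exp(-x)) + (1 - p) * math.log1p(math.exp(x))

def optimum(p):
    """The optimum x* = ln(p / (1 - p)), i.e., the logit of p, for 0 < p < 1."""
    return math.log(p / (1 - p))
```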
Learning algorithm
The iterative step is as follows:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

Explicitly, since $f'(x) = g(x) - p$ and $f''(x) = g(x)(1 - g(x))$, this works out to:

$$x_{n+1} = x_n - \frac{g(x_n) - p}{g(x_n)(1 - g(x_n))}$$
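A sketch of this update rule, reusing the `logistic` helper from the sketch above (the function names are again ours, not from the original page):

```python
def newton_step(x, p):
    """One Newton step x - f'(x)/f''(x), with f'(x) = g(x) - p and f''(x) = g(x)(1 - g(x))."""
    g = logistic(x)
    return x - (g - p) / (g * (1.0 - g))

def newton_optimize(x0, p, num_steps=10):
    """Iterate the Newton step a fixed number of times starting from x0."""
    x = x0
    for _ in range(num_steps):
        x = newton_step(x, p)
    return x
```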
Convergence properties
Domain of convergence
Note that applying Newton's method for optimization on $f$ is equivalent to applying Newton's method for root-finding for a function of one variable to the derivative $f'(x) = g(x) - p$.

We know that Newton's method converges from close enough to the root if the function has the right kind of convexity. Specifically, it converges from any starting point at which both the function and its second derivative have the same sign. In particular, we see that Newton's method converges for any starting point in the interval with endpoints 0 and $\ln\left(\frac{p}{1 - p}\right)$ (this interval is $\left[\ln\left(\frac{p}{1 - p}\right), 0\right]$ if $p < 1/2$, and $\left[0, \ln\left(\frac{p}{1 - p}\right)\right]$ if $p > 1/2$): on this interval, $f'(x) = g(x) - p$ and $f'''(x) = g(x)(1 - g(x))(1 - 2g(x))$ have the same sign. Moreover, convergence from any starting point in this domain is monotone: each step moves in the same direction.
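As a quick numerical illustration of the monotone convergence (with an arbitrarily chosen $p = 0.9$, reusing `optimum` and `newton_step` from the sketches above):

```python
p = 0.9
x_star = optimum(p)   # ln(0.9 / 0.1) = ln 9 ≈ 2.1972
x = 0.5               # a starting point strictly between 0 and x*
for n in range(6):
    x = newton_step(x, p)
    print(n, x)       # the iterates increase monotonically toward x*
```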
Convergence rate
Case $p \neq 1/2$
Recall that Newton's method converges quadratically from sufficiently close to a root of multiplicity one. Note that this refers to Newton's method for root-finding, so the convergence rate is computed for the derivative $f'$. In particular, we obtain that, if $x^* = \ln\left(\frac{p}{1 - p}\right)$ is the point of optimum, Newton's method converges quadratically, and the quadratic convergence rate is:

$$\lim_{n \to \infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^2} = \frac{|f'''(x^*)|}{2 f''(x^*)} = \frac{|p(1 - p)(1 - 2p)|}{2 p(1 - p)} = \frac{|1 - 2p|}{2}$$

Recall that a smaller convergence rate means faster convergence. Note that since $0 < p < 1$, we have $|1 - 2p| < 1$, so the worst-case quadratic convergence rate is $1/2$ (approached as $p \to 0$ or $p \to 1$), while the convergence rate is even faster when $p$ is close to 1/2.
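A numerical sanity check of this rate, again with the arbitrary choice $p = 0.9$ (for which $|1 - 2p|/2 = 0.4$), reusing the helpers above:

```python
p = 0.9
x_star = optimum(p)
x = 1.0               # inside the domain of convergence
for _ in range(4):
    e_old = abs(x - x_star)
    x = newton_step(x, p)
    e_new = abs(x - x_star)
    if e_new > 0:
        print(e_new / e_old ** 2)   # ratios approach |1 - 2p|/2 = 0.4
```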
Case $p = 1/2$
Again, recall that Newton's method converges quadratically from sufficiently close to a root of multiplicity one, but this time, we have that $f'''(x^*) = 0$ (the factor $1 - 2p$ vanishes, and the optimum is $x^* = 0$). It turns out that $f''''(x^*) = -1/8 \neq 0$, so we get cubic convergence, with convergence rate:

$$\lim_{n \to \infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^3} = \frac{|f''''(x^*)|}{3 f''(x^*)} = \frac{1/8}{3 \cdot 1/4} = \frac{1}{6}$$

Indeed, at $p = 1/2$ the Newton update simplifies to $x_{n+1} = x_n - \sinh(x_n)$, and the Taylor expansion $x - \sinh(x) = -x^3/6 - x^5/120 - \cdots$ exhibits the cubic rate directly.
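And a corresponding numerical check for $p = 1/2$ (where $x^* = 0$), again reusing `newton_step` from above:

```python
p = 0.5               # optimum(0.5) = 0
x = 1.0               # close enough to 0 for the local analysis to apply
for _ in range(3):    # stop before floating-point rounding dominates
    e_old = abs(x)
    x = newton_step(x, p)
    if x != 0.0:
        print(abs(x) / e_old ** 3)   # ratios approach 1/6 ≈ 0.1667
```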