Logistic log-loss function of one variable

==Definition==

The '''logistic log-loss function of one variable''' is obtained by composing the logarithmic cost function with the [[logistic function]], and it is of importance in the analysis of logistic regression.

Explicitly, the function has the form:

<math>f(x) = -(p \ln(g(x)) + (1 - p) \ln(1 - g(x)))</math>

where <math>g</math> is the [[logistic function]] and <math>\ln</math> denotes the natural logarithm. Explicitly, <math>g(x) = \frac{1}{1 + e^{-x}}</math>.

Note that <math>1 - g(x) = g(-x)</math>, so the above can be written as:

<math>f(x) = -(p \ln(g(x)) + (1 - p) \ln(g(-x)))</math>

We restrict <math>p</math> to the interval <math>(0,1)</math>. Conceptually, <math>p</math> is the corresponding probability.

More explicitly, <math>f</math> is the function:

<math>f(x) = p \ln(1 + e^{-x}) + (1 - p) \ln(1 + e^x)</math>
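
For readers who want to experiment numerically, here is a short Python sketch (illustrative only; the Python names mirror the <math>g</math> and <math>f</math> of this article, and the implementation is ours, not part of the definition) that evaluates the logistic function and the log-loss in the explicit form above:

<syntaxhighlight lang="python">
import math

def g(x):
    """Logistic function: g(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def f(x, p):
    """Logistic log-loss in the explicit form p*ln(1 + e^(-x)) + (1 - p)*ln(1 + e^x)."""
    return p * math.log(1.0 + math.exp(-x)) + (1.0 - p) * math.log(1.0 + math.exp(x))

# At x = 0 we have g(0) = 1/2, so f(0) = ln 2 regardless of p.
print(g(0.0))       # 0.5
print(f(0.0, 0.3))  # ~0.6931, i.e., ln 2
</syntaxhighlight>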

==Key data==
{| class="sortable" border="1"
! Item !! Value
|-
| default [[domain]] || all of <math>\R</math>, i.e., all reals
|-
| range || <math>[m, \infty)</math> where <math>m</math> is the minimum value, given as <math>-(p \ln p + (1 - p) \ln (1 - p))</math>.
|-
| [[local maximum value]] and points of attainment || No local maximum values
|-
| [[local minimum value]] and points of attainment || Local minimum value <math>-(p \ln p + (1 - p) \ln (1 - p))</math>, attained at <math>x = \ln\left(\frac{p}{1 - p}\right)</math>.
|-
| [[derivative]] || <math>g(x) - p</math> where <math>g</math> is the [[logistic function]]
|-
| [[second derivative]] || <math>g(x)(1 - g(x)) = g(x)g(-x)</math>
|-
| [[third derivative]] || <math>g(x)g(-x)(g(-x) - g(x)) = g(x)(1 - g(x))(1 - 2g(x))</math>
|}

==Differentiation==

WHAT WE USE: [[chain rule for differentiation]], [[logistic function#First derivative]]

===First derivative===

We use that:

<math>g'(x) = g(x)(1 - g(x)) = g(x)g(-x)</math>

or equivalently:

<math>\frac{d}{dx}(\ln(g(x))) = 1 - g(x) = g(-x)</math>

Similarly:

<math>\frac{d}{dx}(\ln(g(-x))) = -g(x)</math>

Plugging these in, we get:

<math>f'(x) = -(p(1 - g(x)) + (1 - p)(-g(x)))</math>

This simplifies to:

<math>f'(x) = g(x) - p</math>
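
As a quick numerical sanity check of this formula, one can compare a central finite difference of <math>f</math> against <math>g(x) - p</math>. This is a sketch using the same illustrative definitions as earlier:

<syntaxhighlight lang="python">
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # logistic function

def f(x, p):
    return p * math.log(1.0 + math.exp(-x)) + (1.0 - p) * math.log(1.0 + math.exp(x))

# Central finite difference of f versus the closed form f'(x) = g(x) - p.
p, h = 0.3, 1e-6
for x in (-2.0, 0.5, 3.0):
    numeric = (f(x + h, p) - f(x - h, p)) / (2.0 * h)
    print(abs(numeric - (g(x) - p)))     # each difference should be ~1e-9 or smaller
</syntaxhighlight>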

===Second derivative===

Using the first derivative and the expression for <math>g'</math>, we obtain:

<math>f''(x) = g(x)(1 - g(x)) = g(x)g(-x)</math>

Note that the second derivative is independent of <math>p</math>.
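
Numerically, the independence from <math>p</math> shows up directly: a second central difference of <math>f</math> gives (approximately) the same value for different <math>p</math>, matching <math>g(x)g(-x)</math>. A sketch, again with the illustrative definitions used above:

<syntaxhighlight lang="python">
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # logistic function

def f(x, p):
    return p * math.log(1.0 + math.exp(-x)) + (1.0 - p) * math.log(1.0 + math.exp(x))

# Second central difference of f for two values of p; the p-dependence drops out,
# and the value matches g(x)*g(-x) (always positive, consistent with convexity).
x, h = 1.5, 1e-4
for p in (0.2, 0.8):
    numeric = (f(x + h, p) - 2.0 * f(x, p) + f(x - h, p)) / (h * h)
    print(numeric, g(x) * g(-x))
</syntaxhighlight>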

==Points and intervals of interest==

===Critical points===

The expression for the derivative is:

<math>f'(x) = g(x) - p</math>

This is always defined, and is zero iff <math>g(x) = p</math>. This happens if and only if <math>x = g^{-1}(p)</math>, i.e.:

<math>x = g^{-1}(p) = \ln\left(\frac{p}{1 - p}\right)</math>

The value of <math>f</math> at this point is:

<math>f(x) = -(p \ln p + (1 - p) \ln (1 - p))</math>
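
To illustrate with a concrete value of <math>p</math> (a sketch with the same illustrative definitions as above), the following computes the critical point and checks that the value of <math>f</math> there equals <math>-(p \ln p + (1 - p) \ln (1 - p))</math> and that the derivative vanishes:

<syntaxhighlight lang="python">
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # logistic function

def f(x, p):
    return p * math.log(1.0 + math.exp(-x)) + (1.0 - p) * math.log(1.0 + math.exp(x))

p = 0.3
x_star = math.log(p / (1.0 - p))        # g^{-1}(p) = ln(p / (1 - p)), ~ -0.8473 here
min_value = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))   # ~ 0.6109 here

print(f(x_star, p), min_value)          # the two values agree
print(g(x_star) - p)                    # f'(x*) = 0, up to rounding
</syntaxhighlight>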

===Intervals of increase and decrease===

We have that <math>f'(x) < 0</math> for <math>g(x) < p</math> and <math>f'(x) > 0</math> for <math>g(x) > p</math>. Since <math>g</math> is increasing, we get that <math>f'(x) < 0</math> for <math>x < g^{-1}(p)</math> and <math>f'(x) > 0</math> for <math>x > g^{-1}(p)</math>. Thus, <math>f</math> is ''decreasing'' to the left of its unique critical point and ''increasing'' to the right of the critical point.

Therefore, the critical point <math>g^{-1}(p)</math> is the unique point of local and absolute minimum for <math>f</math>.

===Subtle point about how the result is independent of the arithmetic form of the logistic function===

The fact that the minimum occurs at <math>g^{-1}(p)</math> and has value <math>-(p \ln p + (1 - p) \ln (1 - p))</math> is actually independent of <math>g</math> being a logistic function. If we replaced <math>g</math> by any increasing function with range <math>(0,1)</math>, the minimum would still occur at <math>g^{-1}(p)</math>. The logistic function is particularly nice for other theoretical and practical reasons; for instance, the expression for the derivative is very simple for the logistic function, but can be complicated for other functions.

The general observation that the minimum occurs at <math>g^{-1}(p)</math> is based on the fact that logarithmic scoring is a proper scoring function.

===Intervals of concave up and concave down===

The second derivative is:

<math>f''(x) = g'(x) = g(x)g(-x) = g(x)(1 - g(x))</math>

This is always positive, so the function is a convex function and its graph is concave up everywhere. In particular, there are no points of inflection.

==Optimization methods==

We will denote the point of minimum, <math>g^{-1}(p)</math>, as <math>x^*</math>.

{| class="sortable" border="1"
! Method !! Domain of convergence !! Order of convergence (general case) !! Convergence rate (general case) !! Convergence rate (special cases)
|-
| [[Gradient descent with constant learning rate for a logistic log-loss function of one variable]], with learning rate <math>\alpha</math> || If <math>\alpha < \frac{1}{f''(x^*)} = \frac{1}{p(1 - p)}</math> with <math>x^*</math> the point of minimum, convergence occurs from any starting point, but can be quite slow. Since we don't know <math>x^*</math>, setting <math>\alpha = 4</math> is a safe choice. || Linear convergence || <math>|1 - \alpha f''(x^*)| = |1 - \alpha p(1 - p)|</math><br>If we set <math>\alpha = 4</math>, we get <math>|1 - 4p(1 - p)|</math>. || Case <math>\alpha = \frac{1}{p(1 - p)}</math> and <math>p \ne \frac{1}{2}</math>: quadratic convergence with convergence rate <math>\left| \frac{1}{2} - p \right|</math>.<br>Case that <math>p = \frac{1}{2}, \alpha = 4</math>: cubic convergence with convergence rate <math>\frac{1}{6} \left| \frac{g'''(0)}{g'(0)} \right| = \frac{1}{12}</math>
|-
| [[Newton's method for optimization of a logistic log-loss function of one variable]] || ''Not'' all reals. However, the domain includes the interval between 0 and <math>x^*</math>, and convergence is monotone on the interval. || Quadratic convergence. || <math>\left|\frac{1}{2} - p \right|</math> || Case that <math>p = \frac{1}{2}</math>: cubic convergence with convergence rate <math>\frac{1}{3} \left| \frac{g'''(0)}{g'(0)} \right| = \frac{1}{6}</math>
|}
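
Both methods in the table are easy to prototype. The sketch below (illustrative only; it reuses the earlier illustrative definition of <math>g</math>, the notation <math>x^*</math> for the point of minimum, <math>p = 0.3</math> as a sample value, and the learning rate <math>\alpha = 4</math> suggested above) prints the distance to <math>x^*</math> at each iteration, so the linear versus quadratic behaviour summarized in the table can be observed:

<syntaxhighlight lang="python">
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))   # logistic function

p = 0.3
alpha = 4.0                              # the "safe choice" of learning rate from the table
x_star = math.log(p / (1.0 - p))         # true point of minimum, used only to measure error

# Gradient descent with constant learning rate: x <- x - alpha * f'(x), with f'(x) = g(x) - p.
x = 2.0
print("gradient descent:")
for k in range(10):
    x -= alpha * (g(x) - p)
    print(k, abs(x - x_star))

# Newton's method for optimization: x <- x - f'(x) / f''(x), with f''(x) = g(x) * g(-x).
# Started between 0 and x_star, where the table says convergence is monotone.
x = 0.0
print("Newton's method:")
for k in range(6):
    x -= (g(x) - p) / (g(x) * g(-x))
    print(k, abs(x - x_star))
</syntaxhighlight>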