Logarithmic scoring rule is proper

Statement

The logarithmic scoring rule is a proper scoring rule. Explicitly:

Consider a random variable $X$ that can take $n$ distinct values $1, 2, \dots, n$. Suppose we estimate probabilities $p_{1}, p_{2}, \dots, p_{n}$ for these values respectively (with $p_{i} \in [0, 1]$ and $\sum_{i = 1}^{n} p_{i} = 1$). The logarithmic scoring rule works as follows: for every instance of the random variable $X$, we assign a score equal to the negative of the logarithm of the probability $p_{i}$ that we estimated for the value that occurred. With this sign convention the score is a cost, so lower is better. Explicitly, if the instances are $X_{1}, X_{2}, \dots, X_{m}$, the total score is:

$\sum_{j = 1}^{m} - \ln (p_{X_{j}})$

The claim is that, if the actual probabilities are $q_{1}, q_{2}, \dots, q_{n}$, then the assignment that minimizes the expected value of the score is $p_{1} = q_{1}, p_{2} = q_{2}, \dots, p_{n} = q_{n}$.
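The claim can be illustrated with a small simulation. The following is only a sketch under assumed values: the probabilities `q` and `p_other`, the sample size, and the helper name `total_log_score` are hypothetical choices made for illustration. It draws instances of $X$ according to `q` and compares the total score of the truthful estimate with that of a mismatched estimate.

```python
import math
import random

def total_log_score(p, instances):
    """Total score: sum of -ln(p[x]) over the observed instances (values 1..n)."""
    return sum(-math.log(p[x - 1]) for x in instances)

random.seed(0)  # fixed seed so the run is repeatable

q = [0.5, 0.3, 0.2]        # hypothetical actual probabilities
p_other = [0.2, 0.3, 0.5]  # hypothetical mismatched estimate
values = [1, 2, 3]

# Draw m = 20000 instances of X according to the actual probabilities q.
instances = random.choices(values, weights=q, k=20000)

score_truthful = total_log_score(q, instances)
score_other = total_log_score(p_other, instances)

# Average score per instance for each estimate; lower is better.
print(score_truthful / len(instances), score_other / len(instances))
```

A single simulation does not prove anything, but the truthful estimate should come out with the lower total, in line with the claim.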

Related facts

Logarithmic scoring rule is the only local proper scoring rule up to affine transformations in case of more than two classes

Proof

Reduction to one random instance

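Each instance has the same distribution as $X$. By linearity of expectation, the expected total score is therefore $m$ times the expected score for one instance (this step does not even require the instances to be independent of each other), so it suffices to show the result for one instance.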

Reduction to the case that all probabilities are strictly between zero and one

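We now show that, for an assignment that minimizes the expected score, if any particular $q_{i} = 0$, the corresponding $p_{i}$ must equal 0, and if any $q_{i} = 1$, the corresponding $p_{i}$ must equal 1.

Suppose $q_{i} = 0$. The value $i$ never occurs, so $p_{i}$ does not enter the expected score. If $0 < p_{i} < 1$, we can set $p_{i}$ to 0 and multiply the remaining probabilities by the factor $1/(1 - p_{i}) > 1$. This strictly decreases $- \ln (p_{k})$ for every value $k$ that can actually occur, and hence strictly decreases the expected score. If $p_{i} = 1$, every value that can actually occur has estimated probability 0 and the expected score is infinite. In either case the assignment is not a minimum, so a minimizing assignment has $p_{i} = 0$.

Suppose $q_{i} = 1$. Then $q_{k} = 0$ for all $k \neq i$, so by the previous paragraph $p_{k} = 0$ for all $k \neq i$, which forces $p_{i} = 1$.

We can therefore discard the values with $q_{i} = 0$ and assume that all the actual probabilities are nonzero. If only one value remains, the result is immediate; otherwise all the actual probabilities are also less than one.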

Proof for one instance and where all the actual probabilities are nonzero and less than one

The expected value for one instance is:

$\sum_{i = 1}^{n} - q_{i} \ln (p_{i})$

In words, we weight each score by the probability that that score is attained.
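The expected value above can be evaluated directly. As a rough sketch (the function names `expected_log_score` and `best_on_grid`, the choice $n = 3$, and the probabilities `q` are all assumptions made for illustration), the following evaluates it on a grid of candidate assignments with all $p_{i} > 0$ and reports which grid point gives the lowest value.

```python
import math

def expected_log_score(q, p):
    """Expected score for one instance: sum over i of -q_i * ln(p_i).

    Assumes every p_i is strictly positive.
    """
    return sum(-qi * math.log(pi) for qi, pi in zip(q, p))

def best_on_grid(q, steps=10):
    """Search p = (i, j, k) / steps with i, j, k >= 1 for the lowest expected score.

    Returns the integer triple (i, j, k) of the best grid point.
    """
    best = None
    for i in range(1, steps - 1):
        for j in range(1, steps - i):
            k = steps - i - j  # guarantees i + j + k == steps, with k >= 1
            p = (i / steps, j / steps, k / steps)
            score = expected_log_score(q, p)
            if best is None or score < best[0]:
                best = (score, (i, j, k))
    return best[1]

q = [0.5, 0.3, 0.2]  # hypothetical actual probabilities
print(best_on_grid(q))
```

For this choice of `q` the best grid point is `(5, 3, 2)`, that is, $p = q$. A grid search only checks finitely many assignments, so it illustrates the claim rather than proving it.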

We are constrained to lie on the hyperplane given by $\sum_{i = 1}^{n} p_{i} = 1$. We can therefore use Lagrange multipliers to find the optima. The gradient vector of the expected value function, taken with respect to $p_{1}, p_{2}, \dots, p_{n}$, is the vector with coordinates:

$(\frac{- q_{1}}{p_{1}}, \frac{- q_{2}}{p_{2}}, \dots, \frac{- q_{n}}{p_{n}})$

The normal vector to the hyperplane is given as the gradient vector of the function $\sum_{i = 1}^{n} p_{i}$, and is the vector:

$(1, 1, \dots, 1)$

By the theory of Lagrange multipliers, we have that at any local extreme value, there exists a value $\lambda$ such that:

$\frac{- q_{i}}{p_{i}} = \lambda \cdot 1$

for all $i$. In other words:

$q_{i} = (- \lambda) p_{i}$

for all $i$. Adding up, we get:

$\sum_{i = 1}^{n} q_{i} = (- \lambda) \sum_{i = 1}^{n} p_{i}$

We have that $\sum_{i = 1}^{n} q_{i} = 1$ as well (these are the actual probabilities, so they add up to 1), so we get:

$1 = (- \lambda) \cdot 1$

so $- \lambda = 1$. Plugging back, we get that the only point that could potentially be a point of local extremum satisfies $q_{i} = p_{i}$ for all $i$.
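The Lagrange condition can be checked numerically. In this minimal sketch (the helper name `gradient` and the example probabilities are assumptions, not part of the proof), the gradient of the expected value function is computed at $p = q$ and at a mismatched assignment.

```python
def gradient(q, p):
    """Gradient of sum_i -q_i * ln(p_i) with respect to p: entries -q_i / p_i."""
    return [-qi / pi for qi, pi in zip(q, p)]

q = [0.5, 0.3, 0.2]        # hypothetical actual probabilities
p_other = [0.2, 0.3, 0.5]  # hypothetical mismatched estimate

grad_at_q = gradient(q, q)
grad_at_other = gradient(q, p_other)

print(grad_at_q)      # all entries equal, so parallel to (1, 1, ..., 1)
print(grad_at_other)  # entries differ, so the Lagrange condition fails
```

At $p = q$ every entry equals $-1$, matching the multiplier value found above. At the mismatched assignment the entries differ, so the gradient is not a multiple of $(1, 1, \dots, 1)$.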

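We can now verify that this is indeed a point of minimum. The expected value function is strictly convex on the region where all $p_{i} > 0$: its Hessian matrix is diagonal with entries $\frac{q_{i}}{p_{i}^{2}}$, all of which are positive. A critical point of a strictly convex function subject to a linear constraint is the unique minimum on the constraint set, so $p_{i} = q_{i}$ for all $i$ gives the absolute minimum among assignments with all $p_{i} > 0$.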

We can also verify that the absolute minimum does not occur at the boundary: if $q_{i} \neq 0$ but we set $p_{i} = 0$, then our expected score is $\infty$, because there is a nonzero probability of paying an infinite cost, namely, in the case that the random variable takes the value $i$.
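The boundary behaviour can be made concrete with a short sketch. The function name `expected_log_score` and the example probabilities are hypothetical, and the handling of zeros is an assumed convention: a value with $q_{i} = 0$ contributes nothing, while a value with $q_{i} > 0$ and $p_{i} = 0$ contributes an infinite cost.

```python
import math

def expected_log_score(q, p):
    """Expected score for one instance, allowing zero probabilities.

    Convention assumed here: a value with q_i == 0 contributes nothing,
    and a value with q_i > 0 but p_i == 0 makes the expected score infinite.
    """
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0:
            continue  # this value never occurs, so it is never scored
        if pi == 0:
            return math.inf  # nonzero chance of paying an infinite cost
        total += -qi * math.log(pi)
    return total

q = [0.5, 0.3, 0.2]  # hypothetical actual probabilities

boundary = expected_log_score(q, [0.7, 0.3, 0.0])  # p_3 = 0 although q_3 > 0
interior = expected_log_score(q, q)                # truthful assignment

print(boundary, interior)
```

The boundary assignment has infinite expected score, while the truthful assignment has a finite one, so the minimum cannot lie at such a boundary point.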