1
$\begingroup$

I am studying Machine Learning, and I came across a proof in section 1.61 in Bishop's textbook Pattern Recognition & Machine Learning which I couldn't quite understand. The claim is that $KL(p||q) = 0$ iff $p(x) = q(x)$, where $KL$ denotes the KL Divergence. The key idea is the strict convexity of $-\lg x$. I covered the proof of Jensen's inequality a while ago and I understand it, but I don't understand the last equivalence, equation 1.118 on page 56:

\begin{equation} KL(p||q) = - \int p(x) \ln \left\{ \frac{q(x)}{p(x)} \right\} dx \geq -\ln \int q(x) dx = 0 \end{equation} Bishop follows this inequality with "where we have used the fact that $−\ln x$ is a convex function, together with the normalization condition $\int q(x) dx = 1$. In fact, $−\ln x$ is a strictly convex function, so the equality will hold if, and only if, $q(x) = p(x)$ for all $x$."

Why is that? I get that the fact that the function is strict means that on the interval $[a,b]$, for any $a < x < b$, $f(x) < c(x)$, where $c(x)$ denotes the point on the chord from $a$ to $b$ at $x$, and $f(x) = c(x)$ only at the points $x=a$ and $x=b$. It's probably very obvious, but I don't see the equivalence.... I get that the integral of the PDF $q(x)$ integrates to 1 and thus the right-hand side of inequality 1.118 is consequently zero. And I certainly understand that if $q(x) = p(x)$, then you have equality. But why does equality imply that $q(x) = p(x)$?

For additional context, note that the claim is found in the section on Relative Entropy and Mutual Information, where it is used to show that the mutual information between variables $x$ and $y$ is zero if and only if the KL Divergence is zero, that is if $KL(p(x,y)||p(x)p(y)) = 0$.

$\endgroup$
0

1 Answer 1

1
$\begingroup$

The condition for Jensen's inequality to be an equality is that the concave function is affine or the random variable is a.s. constant. (see also: here, here for proof details).

Your Jensen is an equality and the function ($\ln$) isn't affine, so the random variable $q(X)/p(X)$ must be a.s. constant. Since $q$ and $p$ are mass functions, the constant must be 1.

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .