I am studying machine learning, and I came across a proof in section 1.6.1 of Bishop's textbook *Pattern Recognition and Machine Learning* which I couldn't quite understand. The claim is that $KL(p||q) = 0$ iff $p(x) = q(x)$, where $KL$ denotes the KL divergence. The key idea is the strict convexity of $-\ln x$. I covered the proof of Jensen's inequality a while ago and I understand it, but I don't understand the last equivalence, equation 1.118 on page 56:
\begin{equation} KL(p||q) = - \int p(x) \ln \left\{ \frac{q(x)}{p(x)} \right\} dx \geq -\ln \int q(x) dx = 0 \end{equation} Bishop follows this inequality with "where we have used the fact that $−\ln x$ is a convex function, together with the normalization condition $\int q(x) dx = 1$. In fact, $−\ln x$ is a strictly convex function, so the equality will hold if, and only if, $q(x) = p(x)$ for all $x$."
Why is that? I understand that strict convexity means that on an interval $[a,b]$, for any $a < x < b$ we have $f(x) < c(x)$, where $c(x)$ denotes the point at $x$ on the chord from $(a, f(a))$ to $(b, f(b))$, with $f(x) = c(x)$ only at the endpoints $x = a$ and $x = b$. It's probably very obvious, but I don't see the equivalence. I also see that the PDF $q(x)$ integrates to 1, so the right-hand side of inequality 1.118 is zero. And I certainly understand that if $q(x) = p(x)$, then equality holds. But why does equality imply that $q(x) = p(x)$?
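To make the claim concrete, here is a small numerical check (not part of Bishop's argument; the discrete distributions `p` and `q` below are my own made-up examples) showing that the discrete KL divergence is zero when the distributions coincide and strictly positive when they differ:

```python
import math

def kl(p, q):
    """Discrete KL divergence KL(p||q) = sum_x p(x) * ln(p(x)/q(x)).

    Terms with p(x) = 0 contribute nothing, by the usual convention
    0 * ln(0) = 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # an arbitrary PMF
q = [0.25, 0.25, 0.5] # a different PMF on the same support

print(kl(p, p))  # 0.0 -- equality exactly when q == p
print(kl(p, q))  # strictly positive since q != p
```

Of course this only illustrates the "if" direction and one instance of the "only if" direction; the question is about why equality *forces* $q(x) = p(x)$ everywhere.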
For additional context, the claim appears in the section on Relative Entropy and Mutual Information, where it is used to show that the mutual information between variables $x$ and $y$, $I(x, y) = KL(p(x,y)\,||\,p(x)p(y))$, is zero if and only if $x$ and $y$ are independent.