19
$\begingroup$

This question gives a quantitative definition of cross entropy, in terms of its formula.

I'm looking for a more notional definition. Wikipedia says:

In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p.

I have emphasised the part that is giving me trouble. I would like a nice definition that doesn't require a separate (pre-existing) understanding of Entropy.

$\endgroup$
3
  • 1
    $\begingroup$ You are asking for a definition of cross-entropy that, at the same time, will define entropy itself. And intuitively so... If you have trouble understanding the concept of Entropy itself, it would be a good idea to first understand the basic concept and then any one of its extensions. $\endgroup$ Commented Jan 1, 2014 at 15:44
  • 1
$\begingroup$ Personally I had a basic understanding of Entropy (though it has been almost 12 months since I applied it). But a quantitative expression of Entropy should fit in one short paragraph, and cross entropy should only take one more. So I feel a good answer can include both, so that the reader doesn't need to refer elsewhere to understand it. $\endgroup$ Commented Jan 1, 2014 at 16:54
  • $\begingroup$ See the related posts: stats.stackexchange.com/questions/66186/… and stats.stackexchange.com/questions/188903/… $\endgroup$ Commented Sep 1, 2019 at 11:00

1 Answer

28
$\begingroup$

To encode an event occurring with probability $p$ you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").

So with the optimal encoding, the average length of an encoded message is $$ \sum_i p_i \log_2(\tfrac{1}{p_i}), $$ that is, the Shannon entropy of the original probability distribution.

However, if for probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is $$ \sum_i p_i \,\text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}), $$ which is the cross entropy; it is never smaller than $\sum_i p_i \log_2(\tfrac{1}{p_i})$.
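For concreteness, here is a minimal Python sketch of both formulas (the function names entropy and cross_entropy are just illustrative); terms with $p_i = 0$ are skipped, following the convention $0 \cdot \log_2(\tfrac{1}{0}) = 0$:

    import math

    def cross_entropy(p, q):
        # Average number of bits per event when events drawn from p are
        # encoded with a code that is optimal for q: sum_i p_i * log2(1/q_i).
        # Terms with p_i == 0 contribute nothing to the sum.
        return sum(p_i * math.log2(1 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

    def entropy(p):
        # Shannon entropy is the cross entropy of a distribution with itself.
        return cross_entropy(p, p)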

As an example, consider an alphabet of four letters (A, B, C, D), where A and B occur with equal frequency and C and D do not appear at all. So the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.

Then if we want to encode it optimally, we encode A as 0 and B as 1, so we get one bit of encoded message per letter. (And this is exactly the Shannon entropy of our probability distribution.)

But if we have the same probability distribution $P$ and encode it according to a distribution $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$ in which all letters are equally probable, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
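Plugging the example into the sketch above reproduces both numbers:

    P = [0.5, 0.5, 0.0, 0.0]        # A and B equally likely; C and D never occur
    Q = [0.25, 0.25, 0.25, 0.25]    # code designed for four equally likely letters

    print(entropy(P))               # 1.0 bit per letter  (A -> 0, B -> 1)
    print(cross_entropy(P, Q))      # 2.0 bits per letter (A -> 00, B -> 01, ...)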

$\endgroup$
2
  • $\begingroup$ Nice explanation, thanks. However, the wikipedia definition is sum_i[p_i * log(q_i)]. Your use of 1/q_i gives the number of possible states, hence log_2 converts that into the number of bits required to encode a single symbol, but the wikipedia page is describing something subtly different. $\endgroup$
    – redcalx
    Commented Jun 26, 2015 at 20:14
  • 6
    $\begingroup$ @locster In Wikipedia it has the minus sign before the sum, which is equivalent to having $1/q_i$, as $\log(1/q_i)=-\log(q_i)$. $\endgroup$ Commented Jun 27, 2015 at 9:26
