
The formula for Shannon entropy is as follows,

$$\text{Entropy}(S) = - \sum_i p_i \log_2 p_i $$

Thus, a fair six-sided die should have the entropy

$$- \sum_{i=1}^6 \dfrac{1}{6} \log_2 \dfrac{1}{6} = \log_2 (6) = 2.5849...$$

However, the entropy should also correspond to the average number of questions you have to ask in order to know the outcome (as explained in this guide under the heading Information Theory).

Now, I tried to construct a decision tree describing the average number of questions we have to ask in order to know the outcome of a die, and this seems to be the optimal one:

[Image: decision tree for the die]

Looking at the average number of questions in the image, there are 3 questions in 4/6 of the cases and 2 questions in 2/6 of the cases. Thus the entropy should be:

$$\dfrac{4}{6} \times 3 + \dfrac{2}{6} \times 2 = 2.6666...$$

So, obviously the result for the entropy isn't the same in the two calculations. How come?
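(Both numbers are easy to reproduce numerically; here is a quick Python sanity check of the two calculations above.)

```python
import math

# Shannon entropy of a fair six-sided die: -sum p_i log2 p_i
entropy = -sum((1 / 6) * math.log2(1 / 6) for _ in range(6))

# Average number of questions in the decision tree:
# 3 questions in 4/6 of the cases, 2 questions in 2/6 of the cases
avg_questions = (4 / 6) * 3 + (2 / 6) * 2

print(entropy)        # ≈ 2.585
print(avg_questions)  # ≈ 2.667
```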

  • mm8511 (Sep 14, 2018): Take the base-6 entropy: $H_{6}(X)=1$, which is the same as the maximum depth of a 6-ary decision tree that decides your outcome. The discrepancy comes from the base of the log: you are trying to encode 6 different outcomes in base 2, so you have to use bit-strings of length at least 3. There is clearly some waste here, since there are $2^3 = 8$ unique “encodings” on 3 bits, but you only need to encode 6 elements. In general, the entropy bound gets tighter as the number of elements to encode grows, as the first answer already mentions.

4 Answers


To recover entropy, you have to consider a sequence of dice throws, and ask how many questions per roll you need in an optimal strategy, in the limit that the number of rolls goes to infinity. Note that each question can cover all the rolls, for example for two rolls, you could ask at some point: “Are the results in $\{16,21,22,23\}$?” (where the first digit denotes the first throw, and the second digit denotes the second throw).

I'm too lazy to do it for 36 possibilities, so here is a simpler example: consider a die for which each roll gives one of only three results, each with equal probability. Then the entropy is $\log_2 3 \approx 1.58496$.

For one toss, the optimal strategy is simply to ask “was it $1$?” followed by “was it $2$?”, which on average takes $5/3 \approx 1.67$ questions.

For two tosses, an optimal strategy would be to first ask “was it one of $\{11,12,13,21\}$?” (where the first digit gives the result of the first toss, and the second digit the result of the second toss). If the answer is “yes”, then use two questions to single out one of the four results. Otherwise, ask “was the first toss a $2$?”, if yes then it was one of $22$ or $23$, and one question is sufficient to determine that. In the remaining case you know the first toss was $3$ and know nothing about the second, so you employ the one-toss strategy to determine the second toss.

This strategy needs on average $29/9 = 3.2222$ questions, or $29/18 \approx 1.6111$ questions per toss, which is already much better, and indeed only $1.65\,\%$ worse than the value given by the entropy.
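(The $29/9$ can be checked with exact arithmetic; a small Python sketch of the strategy's case analysis:)

```python
from fractions import Fraction as F

# Case analysis of the two-toss strategy above (9 equally likely outcomes):
#  - {11,12,13,21}: 1 question + 2 follow-ups       -> 3 questions, prob 4/9
#  - {22,23}:       2 questions + 1 follow-up       -> 3 questions, prob 2/9
#  - {31,32,33}:    2 questions + one-toss strategy -> 2 + 5/3 on average, prob 3/9
avg = F(4, 9) * 3 + F(2, 9) * 3 + F(3, 9) * (2 + F(5, 3))
print(avg)      # 29/9
print(avg / 2)  # 29/18, i.e. about 1.611 questions per toss
```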

Note that the average number of questions of the single-toss optimal strategy can differ dramatically from the entropy. For this, consider the toss of a biased coin. Its entropy can be made arbitrarily low by making the coin sufficiently biased. But obviously there's no way to get the result of a coin toss with fewer than one question.
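(To see the gap numerically, here is a small sketch computing the entropy of a biased coin for increasing bias; the entropy drops toward $0$ while a single toss always still costs one question.)

```python
import math

def coin_entropy(p):
    """Shannon entropy (in bits) of a coin with heads probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy shrinks with bias, yet one toss always needs 1 question.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, coin_entropy(p))
```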

  • MSIS (Jul 11, 2020): Is it accurate to say that the entropy of a process is the expected length of a branch in a decision tree before the branch closes? In the simplest cases, like flipping a coin, the tree closes after a single branch. Or is it the sum of the weighted lengths of the paths from the root to closing?
  • celtschk (Jul 11, 2020): @MSIS: In the generality you state it, it is definitely not accurate. For example, for a die you might make a suboptimal decision tree of the form “is it 1? if not, is it 2? if not, is it 3? …”. For an $n$-sided die, that decision tree has an expected depth of about $(n+1)/2$, which is larger than the entropy $\log_2 n$. You need an optimal decision tree. Note, however, that even the expected depth of an optimal decision tree for a single event may be off by up to $1$; see the last paragraph of my answer. To do better, you need to consider repeated experiments, as described in the answer.
---

In your setting, the Shannon entropy is “just” a lower bound on the average number of questions of any decision tree (including optimal ones); the two don't have to coincide. To get closer to what the Shannon entropy is, imagine an optimal decision tree identifying the outcomes of throwing a die $N$ times for some large $N$ (assuming independence). The larger $N$ is, the smaller (yet nonnegative) the difference between the “averaged” (i.e. divided by $N$) number of questions of this “compound” decision tree and the Shannon entropy of the die. (This resembles the idea behind arithmetic coding.)
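(One way to see this convergence concretely is to build an optimal binary decision tree, i.e. a Huffman tree, over blocks of $N$ throws; the sketch below uses only the Python standard library.)

```python
import heapq

def huffman_avg_length(probs):
    """Average depth of an optimal binary decision tree (Huffman) for probs."""
    heap = [(p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    next_id = len(probs)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2  # each merge adds one question for its whole subtree
        heapq.heappush(heap, (p1 + p2, next_id))
        next_id += 1
    return total

for N in (1, 2, 3):
    probs = [(1 / 6) ** N] * 6 ** N   # N independent fair-die throws
    print(N, huffman_avg_length(probs) / N)
```

For $N = 1$ this reproduces the $8/3 \approx 2.667$ of the question's tree; for $N = 2$ and $N = 3$ the per-throw average drops to roughly $2.611$ and $2.605$, approaching $\log_2 6 \approx 2.585$.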

---

If you have $1$ die, there are $6$ possible outcomes. Label them 0 through 5 and express the label as a binary number. This takes $\lceil\log_2{6}\rceil = 3$ bits. You can always determine the die's value with 3 questions, just ask about each bit in turn.

If you have $10$ dice, then there are $6^{10}$ possible outcomes. Label them 0 through $6^{10}-1$ and express as a binary number. This takes $\lceil\log_2{6^{10}}\rceil = \lceil10\log_2{6}\rceil = 26$ bits. You can always determine the 10 dice with 26 questions, just ask about each bit in turn. The average is 26 questions / 10 dice = 2.6.

If you have $100$ dice, then there are $6^{100}$ possible outcomes. Label them 0 through $6^{100}-1$ and express as a binary number. This takes $\lceil\log_2{6^{100}}\rceil = \lceil100\log_2{6}\rceil = 259$ bits. You can always determine the 100 dice with 259 questions, just ask about each bit in turn. The average is 259 questions / 100 dice = 2.59.

If you have $1000$ dice, then there are $6^{1000}$ possible outcomes. Label them 0 through $6^{1000}-1$ and express as a binary number. This takes $\lceil\log_2{6^{1000}}\rceil = \lceil1000\log_2{6}\rceil = 2585$ bits. You can always determine the 1000 dice with 2585 questions, just ask about each bit in turn. The average is 2585 questions / 1000 dice = 2.585.

Each order of magnitude gets you one more digit, converging toward the Shannon entropy.
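(The pattern above can be reproduced in a couple of lines of Python:)

```python
import math

# Questions per die when encoding N dice jointly with ceil(N * log2(6)) bits
for N in (1, 10, 100, 1000):
    bits = math.ceil(N * math.log2(6))
    print(N, bits, bits / N)   # 3, 26, 259 and 2585 bits respectively
```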

On the other hand, with the decision tree in your example, you do not converge toward splitting the outcome space in half with each question. The first question, “is $d_1 \in \{1,2,3\}$?”, does, but then there is waste: you need two further questions to separate the 3 remaining outcomes. The second question (given a yes to the first) could instead be “is either $d_1 = 1$, or $d_1 = 2$ and $d_2 \in \{1,2,3\}$?”, which does split the outcome space in half when there are multiple dice. You are then forced to ask 3 questions to pin down the first die, but you have gained information about the following dice. The strategy of enumerating and encoding the outcomes as above is just an extension of this idea; it doesn't pay off for a small number of dice, but it does for many.

---

There is nothing wrong with what you did. In the book “Elements of Information Theory” (Cover and Thomas), there is a proof that the average number of questions needed lies between $H(X)$ and $H(X)+1$, which agrees with what you found. So, in terms of questions, the entropy gives you an accuracy to within $1$ question. The following argument is from “Elements of Information Theory”:

Proof that $H(X) \leq L < H(X) + 1$

If $L$ is the average number of questions (in the book it is referred to as the expected description length), it can be written as $$L = \sum_i p_i l_i$$ subject to the constraint that each $l_i$ is an integer, because $l_i$ is the number of questions asked to arrive at the $i$-th outcome. You also have the Kraft inequality $$\sum_i D^{-l_i} \leq 1$$ where $D$ is the size of your alphabet ($D = 2$ for yes/no questions). Furthermore, the optimal number of questions is found via the $D$-adic probability distribution closest to the distribution of $X$ in relative entropy, that is, the $D$-adic $r$ with $$r_i = \frac{D^{-l_i}}{\sum_j D^{-l_j}}$$ that minimizes $$L - H(X) = D(p \Vert r) - \log_D \Big( \sum_i D^{-l_i} \Big) \geq 0.$$ The choice $l_i = \log_D \frac{1}{p_i}$ would give $L = H$. Since $\log_D \frac{1}{p_i}$ is not necessarily an integer, you can take $$l_i = \left\lceil \log_D \frac{1}{p_i} \right\rceil.$$ These lengths satisfy the Kraft–McMillan inequality, since $$\sum_i D^{-\lceil \log_D \frac{1}{p_i} \rceil} \leq \sum_i D^{-\log_D \frac{1}{p_i}} = \sum_i p_i = 1.$$ The optimal $l_i$ are therefore bounded by $$\log_D \frac{1}{p_i} \leq l_i < \log_D \frac{1}{p_i} + 1$$ which gives you

$$H(X) \leq L < H(X) + 1.$$ You computed $L \simeq 2.667$ and $H(X) \simeq 2.585$, which indeed satisfy this bound.
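(For the fair die the bound is easy to verify directly; a Python sketch using the ceiling lengths $l_i = \lceil \log_D \frac{1}{p_i} \rceil$ with $D = 2$:)

```python
import math

probs = [1 / 6] * 6                                   # fair six-sided die
H = -sum(p * math.log2(p) for p in probs)             # Shannon entropy
lengths = [math.ceil(math.log2(1 / p)) for p in probs]
L = sum(p * l for p, l in zip(probs, lengths))

assert sum(2 ** -l for l in lengths) <= 1   # Kraft inequality holds
assert H <= L < H + 1                       # the H <= L < H + 1 bound
print(H, L)  # H ≈ 2.585, L ≈ 3
```

Note that these ceiling lengths give $L = 3$ here, while the optimal tree of the question achieves $8/3 \approx 2.667$; both lie within $[H, H+1)$.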

  • MSIS (Aug 7, 2020): Nice explanation. But I am a bit confused: if we flip a coin, it seems one question is enough: “Is it heads?” A single yes/no discloses everything we need to know. What am I missing?
  • (Aug 7, 2020): Yes, but we are interested in a six-sided die. @MSIS
