23

I thought that the concept of typical set was pretty intuitive: a sequence of length $n$ would belong to the typical set $A_\epsilon ^{(n)}$ if the probability of the sequence coming out was high. So, any sequence that was likely would be in $A_\epsilon ^{(n)}$. (I'm avoiding the formal definition related to entropy because I'm trying to understand it qualitatively.)

However, I've read that, in general, the most likely sequence doesn't belong to the typical set. This confused me big time.

Is there an intuitive definition of typical set? Or is it just a mathematical tool that doesn't have much to do with common sense?


5 Answers

21

I know you have explicitly asked for an intuitive explanation and to leave out the formal definition, but I think the two are closely related, so let me recall the definition of the typical set:

If $X_1, X_2, \dots$ are i.i.d. random variables drawn from $p(x)$, then the typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $(x_1,x_2,\dots,x_n) \in \mathcal{X}^n$ with the property $$2^{-n(H(X)+\epsilon)}\le p(x_1,x_2,\dots,x_n) \le 2^{-n(H(X)-\epsilon)} \tag{1}$$ This means that, for a fixed $\epsilon$, the typical set is composed of all the sequences whose probability is close to $2^{-nH(X)}$. So in order for a sequence to belong to the typical set, it just has to have probability close to $2^{-nH(X)}$; the most probable sequence, however, usually does not. To understand why, let me rewrite Equation (1) by applying $\log_2$ to it.

$$H(X)-\epsilon\le \frac{1}{n}\log_2\left(\frac{1}{p(x_1,x_2,...,x_n)}\right) \le H(X)+\epsilon \tag{2}$$

Now the typical set definition is more directly related to the concept of entropy or, stated another way, the average information of the random variable. The middle term can be thought of as the sample entropy of the sequence, so the typical set is made up of all the sequences that give us an amount of information close to the average information of the random variable $X$. The most probable sequence usually gives us less information than the average. Remember that the lower the probability of an outcome, the more information it gives us. To understand why, let me give an example:

Let's suppose that you live in a city whose weather is highly likely to be sunny and warm, between 24°C and 26°C. You may watch the weather report every morning, but you wouldn't care much about it: it is always sunny and warm. But what if one day the weather man/woman tells you that today will be rainy and cold? That is a game changer. You will have to wear different clothes, take an umbrella, and do other things that you usually don't, so the weather report has given you some really important information.

To sum up, the intuitive definition of the typical set is that it consists of sequences that give us an amount of information close to the expected information of the source (the random variable).
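To make this concrete, here is a minimal sketch (my own illustration with assumed parameters, not part of the original answer): a Bernoulli source with $p(1)=0.9$, a membership test based on Equation (2), and a check that the single most probable sequence fails the test while sequences actually drawn from the source pass it most of the time.

```python
import math
import random

# Illustrative sketch (assumed parameters): Bernoulli source with P(X=1) = 0.9,
# sequence length n, tolerance epsilon from the definition of the typical set.
p, n, eps = 0.9, 2000, 0.05
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # H(X) ~= 0.469 bits/symbol

def sample_entropy(seq):
    """The middle term of equation (2): -(1/n) * log2 p(x1, ..., xn)."""
    log_p = sum(math.log2(p) if x == 1 else math.log2(1 - p) for x in seq)
    return -log_p / len(seq)

def is_typical(seq):
    """Equation (2): typical iff the sample entropy is within eps of H(X)."""
    return abs(sample_entropy(seq) - H) <= eps

# The single most probable sequence (all ones) is NOT typical: its sample entropy
# is -log2(0.9) ~= 0.15 bits/symbol, far below H(X) - eps ~= 0.42.
print(is_typical([1] * n))        # False

# A sequence drawn from the source, however, is typical with high probability.
random.seed(0)
draws = [[1 if random.random() < p else 0 for _ in range(n)] for _ in range(1000)]
print(sum(is_typical(s) for s in draws) / 1000)   # ~0.98 here; tends to 1 as n grows
```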

2
  • 1
    ... or rather $$H(X)-\epsilon\le \frac{1}{n}\log_2\left(\frac{1}{p(x_1,x_2,\dots,x_n)}\right) \le H(X)+\epsilon \tag{2}$$...
    – Cbhihe
    Commented Jan 9, 2018 at 13:16
  • 1
    OK, but what's the purpose of the typical set defined in this way, then? Previously I thought we created the notion of a typical set to get an intuition about THE SMALLEST subset of sequences we need to take to make sure we "cover" a fraction $(1-\epsilon)$ of the cases. In this way, taking the most probable sequences is an obvious choice. What am I missing? Commented Feb 14, 2018 at 17:02
20
+50

Diegobatt's answer does a good job of explaining intuitively what the typical set is. This answer will address the OP's other question, echoed by @tomwesolowski: why would you define the typical set in a way that can exclude the most probable elements?

The short answer is that the typical set is primarily a theoretical tool. It was defined to help prove something, and this definition is the most convenient one for the proof. It is a good example of how theoretical needs can sometimes trump intuitive preferences in mathematics.

The typical set was defined by the father of information theory, Claude Shannon. He wanted to determine how efficiently one could possibly encode a stream of symbols from a fixed alphabet, assuming each symbol is an i.i.d. random sample from some distribution. His key insights were that:

  1. There is an easily identifiable, relatively small set of "typical" sequences that show up disproportionately often in the stream.
  2. Assigning this "typical set" of sequences the shortest encodings yields an optimally efficient encoding (asymptotically, as the stream's output grows arbitrarily long).

The typical set Shannon discovered is composed precisely of the sequences whose self-information, or "surprising-ness", is about the same as the self-information expected, on average, for the stream's source distribution. Such sequences are "typical" in the sense that their information is about average, but this definition implicitly excludes those sequences which have significantly less information than average. These less-informative sequences are also the most probable ones.

As the OP notes, this is not intuitively appealing! On its face, the typical set sounds like it should contain all the most probable sequences up to some threshold. That would better represent what is typically seen in the stream.

But Shannon did not want the most "typical" possible typical set; he wanted one that made it easy to prove the result he wanted to prove. The typical set defined by Shannon is guaranteed to exist, it is guaranteed to be small, and it is guaranteed to be about as small as any other set you might propose, as this answer points out. Adding the most likely elements makes the set more likely, which is good, but it also makes the set bigger, which is bad. The net effect on the soundness of the proof isn't immediately clear; it could make the proof more complicated* to finish. Best not to fix what isn't broken.
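Those guarantees are easy to check numerically. The sketch below (my own illustration with assumed parameters, not part of the original answer) computes the exact size and probability of the typical set for a Bernoulli source with $p(1)=0.9$: the set captures almost all of the probability, yet indexing it takes only about $H(X)$ bits per symbol, far less than the 1 bit per symbol needed to index all $2^n$ binary sequences.

```python
import math

# Illustrative sketch (assumed parameters): exact size and total probability of the
# typical set for a Bernoulli source with P(X=1) = 0.9. Every length-n sequence with
# exactly k ones has probability p^k (1-p)^(n-k), so both quantities can be computed
# exactly by summing over the number of ones k.
p, n, eps = 0.9, 1000, 0.08
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # ~0.469 bits/symbol

size, prob = 0, 0.0
for k in range(n + 1):
    sample_H = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
    if abs(sample_H - H) <= eps:                 # typicality condition, equation (2)
        count = math.comb(n, k)                  # number of sequences with k ones
        size += count
        # accumulate the class probability in the log domain to avoid float overflow
        prob += 2 ** (math.log2(count) + k * math.log2(p) + (n - k) * math.log2(1 - p))

print(f"H(X)                  = {H:.3f} bits/symbol")
print(f"log2|typical set| / n = {math.log2(size) / n:.3f}   # within eps of H(X), well below 1")
print(f"P(typical set)        = {prob:.3f}   # close to 1")
```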

If you have different objectives than Shannon, your preferred concept of typicality might be different as well. For example, in Huffman coding, the most probable symbols (or symbol sequences) get the shortest codes. In a certain technical sense, Huffman coding is the optimal solution to Shannon's original problem, and it better captures our intuition about typicality. On the other hand, Shannon's definition of typicality is more convenient for proving things.
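For contrast, here is a minimal Huffman-coding sketch (my own illustration with made-up symbol probabilities, not part of the original answer), just to show the point above: the most probable symbols end up with the shortest codewords.

```python
import heapq
import itertools

def huffman_code(probs):
    """Build a Huffman code. probs: dict symbol -> probability; returns symbol -> bits."""
    tie = itertools.count()                      # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)       # repeatedly merge the two least
        p1, _, code1 = heapq.heappop(heap)       # probable subtrees...
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))   # ...into one subtree
    return heap[0][2]

# Made-up source distribution: 'a' is the most probable symbol.
probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
for symbol, code in sorted(huffman_code(probs).items()):
    print(symbol, probs[symbol], code)           # 'a' gets the shortest codeword
```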

*Given a typical set, you could add the most probable element and drop the least probable element. This would keep the set the same size and increase its probability, so the proof should still work with this new set. You could repeat this as many times as you wanted, until the typical set contains all the most probable elements. But this step has no direct value to the proof, so why bother?

3
  • 1
    Excellent reasoning, and kudos on a job well done addressing the gap between intuition and definition. I would say this discrepancy happens due to a language shortcoming from everyday life where typical and average usually mean the same thing, but in terms of statistics, the typical (in the sense of probability, i.e. mode) is not necessarily the same as the average, i.e. expected value.
    – Emil
    Commented Feb 19, 2018 at 22:37
  • 1
    One question though: when you say the definition excludes those sequences which have "significantly less information than average", shouldn't that be "significantly less or more", since the lower and upper bounds are respectively $H(X)-\varepsilon$ and $H(X)+\varepsilon$?
    – Emil
    Commented Feb 20, 2018 at 9:34
  • 1
    @Emil, I assume the author said it this way because we all agreed that sequences having more information (the less probable ones) shouldn't be contained in the typical set. Commented Feb 20, 2018 at 12:23
2

The idea of a typical set implicitly treats the outcome sequences as multisets, i.e., it assumes you just care about the histogram of each sequence; e.g., you consider all 10-toss coin sequences with 7 heads and 3 tails as equivalent.

Imagine you have a very biased coin, say $p(H) = 0.9$. This is just the binomial distribution. The most probable 100-toss sequence is 100 heads, but there is only one such sequence. There are exponentially many more sequences that contain 10 tails, but each of them is individually much less probable. The greatest number of sequences is obtained with half heads and half tails, but these are individually even less probable. So there is a tension between the probability of individual sequences and the number of equivalent sequences in a class. The total probability of a class is maximized when the frequencies in the sequences match the underlying probabilities.

The important result is that for sufficiently long sequences, almost all sampled sequences will have frequencies arbitrarily close to the expected ones, i.e., the distribution becomes extremely peaked as the length of the sequences considered increases.

For example, observing $10^5$-toss sequences of the $P(H)=0.9$ coin, you will get sequences with $10^4 \pm 300$ tails 99% of the time, since the standard deviation of the number of tails in a sequence is approximately 100. The probability of all heads is negligible, despite that being the most probable specific sequence.

The typical set is a more general, information-theoretically defined version of this idea.
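A small numerical illustration of that tension (my own sketch, reusing the $P(H)=0.9$, 100-toss example from above): for a few tail counts, compare the probability of one specific sequence, the number of such sequences, and the probability of the whole class.

```python
import math

# P(H) = 0.9 coin, n = 100 tosses (the example above). For a few tail counts t,
# compare one specific sequence's probability, the number of sequences with t tails,
# and the total probability of that class of sequences.
n, p_tail = 100, 0.1
print(" t   per-sequence prob    number of sequences   class prob")
for t in (0, 10, 50):
    per_seq = (1 - p_tail) ** (n - t) * p_tail ** t   # one specific sequence with t tails
    count = math.comb(n, t)                           # how many such sequences exist
    print(f"{t:3d}   {per_seq:17.3e}   {float(count):19.3e}   {count * per_seq:10.3e}")
# Roughly: the single all-heads sequence is the most probable individual sequence
# (~2.7e-05), but the t = 10 class dominates in total probability (~0.13), while the
# t = 50 class, despite containing ~1e29 sequences, is negligible overall (~5e-24).
```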

1

According to Theorem 6.3 in these lecture notes, whether we take the subset of sequences with the highest probability or those with probability close to $2^{-nH(X)}$ (the typical set), we have to take approximately $2^{nH(X)}$ sequences to make sure that the chosen subset contains a randomly drawn sequence with high probability. We usually take the typical set elements because we can bound its size more easily.
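A rough numerical version of that comparison (my own sketch with assumed parameters; the cited lecture notes are not reproduced here): for a Bernoulli source, build the smallest set capturing probability $1-\delta$ by greedily adding whole classes of the most probable sequences, and look at how many bits per symbol are needed to index it.

```python
import math

# Sketch (assumed parameters): Bernoulli source with P(X=1) = 0.9. Sequences with more
# ones are more probable, so the smallest set covering probability 1 - delta is built by
# adding classes of sequences in order of decreasing per-sequence probability
# (k = n, n-1, ... ones) until the target coverage is reached.
p, n, delta = 0.9, 500, 0.01
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # ~0.469 bits/symbol

size, covered = 0, 0.0
for k in range(n, -1, -1):
    size += math.comb(n, k)                                # sequences with k ones
    covered += math.comb(n, k) * p**k * (1 - p)**(n - k)   # their total probability
    if covered >= 1 - delta:
        break

print(f"H(X)                          = {H:.3f} bits/symbol")
print(f"log2|smallest covering set|/n = {math.log2(size) / n:.3f}")
# The second number is already much closer to H(X) than to the 1 bit/symbol needed to
# index all 2^n sequences, and it approaches H(X) as n grows -- the same growth rate
# as the size of the typical set.
```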

2
  • 1
    Could you explain how this addresses the request for "intuitive definition of typical set"?
    – whuber
    Commented Feb 14, 2018 at 22:54
    I'm not sure, but it was meant to address the "However, I've read that, in general, the most likely sequence doesn't belong to the typical set. This confused me big time." part of the question :) Commented Feb 17, 2018 at 10:56
0

Consider a biased coin that has a 60% probability of landing heads. If this coin is flipped 100 times, the most likely sequence would be 100 heads, but that isn't "typical". Typical sequences would have about 60 heads.

