Diegobatt's answer does a good job of explaining intuitively what the typical set is. This answer will address the OP's other question, echoed by @tomwesolowski: why would you define the typical set in a way that can exclude the most probable elements?
The short answer is that the typical set is primarily a theoretical tool. It was defined to help prove something, and this definition is the most convenient one for the proof. It is a good example of how theoretical needs can sometimes trump intuitive preferences in mathematics.
The typical set was defined by the father of information theory, Claude Shannon. He wanted to determine how efficiently one could possibly encode a stream of symbols from a fixed alphabet, assuming each symbol is an i.i.d. random sample from some distribution. His key insights were that:
- There is an easily identifiable, relatively small set of "typical" sequences that accounts for nearly all of the probability in the stream.
- Assigning this "typical set" of sequences the shortest encodings yields an optimally efficient encoding (asymptotically, as the stream's output grows arbitrarily long).
The typical set Shannon defined consists precisely of the sequences whose self-information, or "surprisal", is close to the average self-information of the stream's source distribution, i.e. its entropy. Such sequences are "typical" in the sense that their information content is about average, but this definition implicitly excludes the sequences that carry significantly less information than average, and those less-informative sequences are also the most probable ones.
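To pin that down, here is the standard textbook formulation of the (weakly) typical set for an i.i.d. source with entropy $H(X)$ and a tolerance $\epsilon > 0$ (my notation, following Cover & Thomas rather than anything in the question):

$$A_\epsilon^{(n)} = \left\{ (x_1, \ldots, x_n) \;:\; \left| -\tfrac{1}{n} \log_2 p(x_1, \ldots, x_n) - H(X) \right| \le \epsilon \right\}.$$

For $n$ large enough, the asymptotic equipartition property gives

$$P\!\left(A_\epsilon^{(n)}\right) > 1 - \epsilon \qquad \text{and} \qquad (1 - \epsilon)\, 2^{\,n(H(X) - \epsilon)} \;\le\; \left|A_\epsilon^{(n)}\right| \;\le\; 2^{\,n(H(X) + \epsilon)},$$

so describing a typical sequence takes roughly $n H(X)$ bits, which is exactly the coding rate Shannon was after. Notice that nothing in this definition rewards being highly probable; only having surprisal close to $H(X)$ counts.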
As the OP notes, this is not intuitively appealing! On its face, the typical set sounds like it should contain all the most probable sequences up to some threshold. That would better represent what is typically seen in the stream.
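To see the exclusion concretely, here is a small numerical sketch (the Bernoulli(0.9) source, $n = 100$, and $\epsilon = 0.05$ are parameters I've picked purely for illustration):

```python
import math

# Bernoulli source: each symbol is 1 with probability p, 0 with probability 1 - p.
p, n, eps = 0.9, 100, 0.05   # illustrative parameters, not from the original answer

# Entropy of the source in bits per symbol -- the average surprisal.
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def per_symbol_info(num_ones: int) -> float:
    """Self-information of a length-n sequence with `num_ones` ones, divided by n."""
    log2_prob = num_ones * math.log2(p) + (n - num_ones) * math.log2(1 - p)
    return -log2_prob / n

def is_typical(num_ones: int) -> bool:
    """Weak typicality: per-symbol self-information within eps of the entropy."""
    return abs(per_symbol_info(num_ones) - H) <= eps

print(f"entropy H                 = {H:.3f} bits/symbol")
print(f"all-ones (most probable)  = {per_symbol_info(n):.3f} bits/symbol, "
      f"typical: {is_typical(n)}")    # far below H -> excluded
print(f"90 ones (average-looking) = {per_symbol_info(90):.3f} bits/symbol, "
      f"typical: {is_typical(90)}")   # close to H -> included
```

The all-ones sequence is the single most probable outcome of length 100, yet its per-symbol surprisal (about 0.15 bits) sits far below the entropy (about 0.47 bits), so it fails the typicality test: exactly the counterintuitive exclusion described above.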
But Shannon did not want the most "typical" possible typical set; he wanted one that made it easy to prove the result he wanted to prove. The typical set he defined is guaranteed to exist, guaranteed to be small, and guaranteed to be about as small as any other high-probability set you might propose, as this answer points out. Adding the most likely elements makes the set more likely, which is good, but it also makes the set bigger, which is bad. The net effect on the proof isn't immediately clear; it could make the argument more complicated* to finish. Best not to fix what isn't broken.
If you have different objectives from Shannon's, your preferred concept of typicality might be different as well. For example, in Huffman coding, the most probable symbols (or symbol sequences) get the shortest codes. In a certain technical sense, Huffman coding is the optimal solution to Shannon's original problem, and it better captures our intuition about typicality. On the other hand, Shannon's definition of typicality is more convenient for proving things.
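As a contrast with Shannon's definition, here is a minimal Huffman-coding sketch (the `huffman_code` helper and the example frequencies are made up for this answer, not taken from any library), showing that the shortest codeword goes to the most probable symbol:

```python
import heapq

def huffman_code(freqs: dict[str, float]) -> dict[str, str]:
    """Build a Huffman code: returns {symbol: binary codeword}.
    More probable symbols end up with shorter codewords."""
    # Heap entries: (subtree probability, tie-breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, codes0 = heapq.heappop(heap)   # the two least probable subtrees
        f1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

# A skewed source: 'a' is by far the most probable symbol.
print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# Expected shape: a -> 1 bit, b -> 2 bits, c and d -> 3 bits each
# (the exact 0/1 patterns depend on tie-breaking).
```

With these weights the average codeword length is 0.5·1 + 0.25·2 + 0.15·3 + 0.10·3 = 1.75 bits per symbol, against an entropy of roughly 1.74 bits, so the code is near-optimal, and the shortest word goes to the most probable symbol rather than to a "typical" one.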
*Given a typical set, you could add the most probable element and drop the least probable element. This would keep the set the same size and increase its probability, so the proof should still work with this new set. You could repeat this as many times as you wanted, until the typical set contains all the most probable elements. But this step has no direct value to the proof, so why bother?