
I have 927 unique sequences of the numbers 1, 2 and 3, all of which sum to 12 and represent every possible one-octave scale on the piano, with the numbers representing the intervals between notes in half-steps (i.e., adjacent keys). For example, the Major Scale is { 2,2,1,2,2,2,1 }, while the hexatonic (6-note) Blues Scale is { 3, 2, 1, 1, 3, 2 }. It can help mentally to add a zero at the beginning to represent the root note.
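For concreteness, here is a tiny sketch of that representation (the helper name is mine, purely illustrative), mapping an interval sequence to semitone offsets from the root:

```python
from itertools import accumulate

def intervals_to_pitches(intervals):
    """Map an interval sequence to semitone offsets from the root (root = 0)."""
    return [0] + list(accumulate(intervals))

# The Major Scale and the hexatonic Blues Scale mentioned above:
print(intervals_to_pitches([2, 2, 1, 2, 2, 2, 1]))  # [0, 2, 4, 5, 7, 9, 11, 12]
print(intervals_to_pitches([3, 2, 1, 1, 3, 2]))     # [0, 3, 5, 6, 7, 10, 12]
```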

I'm trying to decide how to categorize them, and believe Shannon entropy is a natural place to begin, with one problem: the standard Shannon entropy formula returns the same figure regardless of the order of the sequence (as we would expect), whereas this is a case where the order is supremely relevant. (Music theory and information theory have a lot in common, though the connection appears to be largely unexplored.)

Let's take a simple example. Both of these are valid scales (the first is the octatonic, or half-whole diminished, scale; the second has no common name, to my knowledge):

{ 1,2,1,2,1,2,1,2 }
{ 2,2,1,2,1,1,1,2 }

Since there are equal numbers of 1s and 2s in each set, we don't need a calculator to see that the Shannon Entropy for each is 4:

$$-\sum_{i=1}^8 \frac{1}{2} \log_{2}\frac{1}{2}$$
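A quick sketch of the point (whether you report bits per symbol or a total over the eight positions, the figure is identical for both orderings, since only the symbol frequencies enter):

```python
from collections import Counter
from math import log2

def symbol_entropy(seq):
    """Shannon entropy (bits/symbol) of the empirical symbol frequencies -- order-blind."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

print(symbol_entropy([1, 2, 1, 2, 1, 2, 1, 2]))  # 1.0
print(symbol_entropy([2, 2, 1, 2, 1, 1, 1, 2]))  # 1.0 -- same value, despite the different order
```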

However, I believe (I may be very wrong here) that there is significantly more information contained in the second sequence, given that the first can be communicated as "{1,2}, four times" while the second is considerably more arbitrary. (I believe this is debated in bioinformatics, where one can locate long spans of simple repeating nucleotide sequences that some consider "junk DNA," though others dispute that adjective.)

How does one account for order when measuring surprisal? If I were to take, say, The Great Gatsby and sort the ~261,000 characters (as in letters, etc., not fictional people!) in ASCII order, I don't think it would contain the same amount of information as the original text.

One strategy I tried was a rolling measure of probability, where the algorithm is initially unaware of the total frequency of each number but accumulates the odds as it goes. (I suppose this is like measuring the accumulating surprisal of a die you cannot see and don't know how many sides it has.) This did not produce very intuitive or useful results.
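To make the idea concrete, here is a minimal sketch of one way such a rolling measure can be implemented; the add-one (Laplace) smoothing over an assumed alphabet is just one choice for handling symbols not yet seen:

```python
from math import log2

def rolling_surprisal(seq, alphabet=(1, 2, 3)):
    """Total -log2 p(x_i), where p is re-estimated from the prefix seen so far
    (add-one smoothing over the assumed alphabet)."""
    counts = {s: 0 for s in alphabet}
    bits = 0.0
    for x in seq:
        n = sum(counts.values())
        p = (counts[x] + 1) / (n + len(alphabet))  # running, smoothed estimate
        bits += -log2(p)
        counts[x] += 1
    return bits
```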

Another thought I had was to divide each sequence into "words" that are repeating sub-sequences of at least two digits, then measure the entropy of each word and ... I'm not sure what to do with them. A sum of $H[\{1,2\}]$ four times still adds to 4. I'm a bit out of my depth here, both in reducing sequences to the most efficient subsequences and in knowing what to do with them.
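One way to formalize the "{1,2}, four times" description is to look for the shortest block whose repetition reproduces the whole sequence; a rough sketch (my own, nothing standard):

```python
def smallest_repeating_block(seq):
    """Return the shortest prefix whose repetition reproduces the whole sequence."""
    n = len(seq)
    for p in range(1, n + 1):
        if n % p == 0 and seq == seq[:p] * (n // p):
            return seq[:p]

print(smallest_repeating_block([1, 2, 1, 2, 1, 2, 1, 2]))  # [1, 2]
print(smallest_repeating_block([2, 2, 1, 2, 1, 1, 1, 2]))  # the whole sequence (no shorter period)
```

Describing a scale as (block, repeat count) is already a crude compression, which is roughly the description-length framing.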

I've read Shannon's 1948 paper twice, including the discussion of the relative frequencies of letters in a given language, which seems relevant but turns out not to be useful here: every possible ordered combination is present, and the intervals are added independently of those preceding them, with the edge case that the running sum cannot exceed 12.

Which is all to say, how does one quantify the entropy of an ordered sequence, when the order is of extreme relevance but each item is independent of the others? Below I'll include how I generated the 927 sequences for anyone interested, but it's ancillary to the question, I think.

***

First, I wrote a simple recursive algorithm that begins with an empty sequence and calls itself three times after appending 1, 2 or 3, keeping any branch whose sum reaches exactly 12 and tossing any that go over. I think it's intuitive, but a minimal sketch is below.
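```python
def scales(target=12, steps=(1, 2, 3)):
    """Generate every interval sequence over `steps` that sums exactly to `target`."""
    results = []

    def extend(prefix, remaining):
        if remaining == 0:
            results.append(prefix)
            return
        for step in steps:
            if step <= remaining:  # toss any branch that would overshoot the octave
                extend(prefix + [step], remaining - step)

    extend([], target)
    return results

print(len(scales()))  # 927
```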

To check myself, I generated all the combinations of [1,2,3] that add to 12 (there are 19) using a coin-change algorithm, and then calculated the distinct permutations of each of them (treating identical digits as indistinguishable). The permutation counts also sum to 927, ranging from 1 for the twelve 1s of the Chromatic Scale or the four 3s of a sparse diminished scale, up to 140 possible scales for the intervals { 1, 1, 1, 2, 2, 2, 3 }, one ordering of which is the Harmonic Minor Scale.
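That cross-check can be reproduced directly from the multiset counts, since the number of distinct orderings of each combination is a multinomial coefficient; a sketch (names are mine):

```python
from math import factorial

# Each multiset is (a, b, c) = (number of 1s, number of 2s, number of 3s) with a + 2b + 3c = 12.
multisets = [(12 - 2 * b - 3 * c, b, c)
             for c in range(5)                       # at most four 3s
             for b in range((12 - 3 * c) // 2 + 1)]  # the 1s fill whatever remains

def distinct_orderings(a, b, c):
    """Permutations of the multiset, treating identical intervals as indistinguishable."""
    return factorial(a + b + c) // (factorial(a) * factorial(b) * factorial(c))

print(len(multisets))                                  # 19
print(sum(distinct_orderings(*m) for m in multisets))  # 927
print(distinct_orderings(3, 3, 1))                     # 140 orderings of { 1, 1, 1, 2, 2, 2, 3 }
```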

**One more thought**

Regarding the above example of four 1s and four 2s: If I were flipping an unbiased coin eight times and coding Heads as "1" and Tails as "2", I would be more surprised, in the everyday sense of the word, to get perfectly alternating results than to get the second sequence with its small clusters. So it's possible that "surprisal" is the wrong frame of mind here, as opposed to information.

  • Maybe I am missing something here: in your example setup, you have three possibilities, i.e. intervals 1, 2 or 3. Then for your sequence 1,2,1,2,1,2,1,2 you have a Shannon entropy calculation where each probability $p_i$ is 0.5. Should each $p_i$ not be 1/3? (Commented May 1 at 11:28)

2 Answers


You're starting to think of Kolmogorov complexity, which is an (in general uncomputable) measure of how hard it is to describe the sequence. It depends entirely on what is allowed to be used for the description (computer programs, in the formal definition).
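Kolmogorov complexity itself can't be computed, but compressed length is a common rough stand-in; purely as an illustration (for eight-symbol sequences the compressor's fixed overhead swamps everything, so the sketch repeats the sequence to give the compressor something to work with, and any comparison is only suggestive):

```python
import zlib

def compressed_len(intervals, copies=1):
    """Bytes used by zlib for the digit string -- a crude proxy for description length."""
    data = ("".join(map(str, intervals)) * copies).encode()
    return len(zlib.compress(data, 9))

a = [1, 2, 1, 2, 1, 2, 1, 2]
b = [2, 2, 1, 2, 1, 1, 1, 2]
# Compare the two; with enough repetition the period-2 scale tends to compress at least as well.
print(compressed_len(a, 64), compressed_len(b, 64))
```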

Shannon entropy fundamentally describes how much information per character there is when a stream with the specified probability distribution arrives. Serial correlations and so on are specifically not accounted for (you may of course treat blocks of intervals as single characters, and then your second sequence does have higher entropy).
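To make that parenthetical concrete: re-read each scale as overlapping pairs of intervals and compute the entropy of that pair distribution (this choice of blocking is just one option):

```python
from collections import Counter
from math import log2

def block_entropy(seq, k=2):
    """Entropy (bits) of the empirical distribution of overlapping length-k blocks."""
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in Counter(blocks).values())

print(block_entropy([1, 2, 1, 2, 1, 2, 1, 2]))  # ~0.99 bits: only (1,2) and (2,1) occur
print(block_entropy([2, 2, 1, 2, 1, 1, 1, 2]))  # ~1.95 bits: four distinct pairs occur
```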

Another point: surprisal in the sense of "the intervals form an unexpected mathematical sequence" is drastically different from surprisal in the sense of "this is a dissonant interval relative to the previous ones", but you're into atonality, so that may not matter.

This is also heavy math, but in the direction of the lattice of rationals (abstract algebra), not information theory.

  • Thank you! This is enormously interesting and extremely helpful in understanding Shannon entropy; I suspected I was being fast-and-loose with "surprisal." I'm thinking that source coding (in the sense of data compression) might be an avenue to explore here. (Commented Jan 3, 2021 at 21:24)
***

Very interesting question! I've also been thinking about entropy in sequences, but with regard to language. I'm surprised I haven't been able to find much about the entropy of sequences (still looking), but I think I've been able to work some parts of it out.

As the first commenter touched on, Shannon's entropy $H(X)$, applied symbol by symbol, assumes that $X$ is i.i.d. (independent and identically distributed). That means there's no correlation between outcomes of $X$, and the probability distribution of $X$ remains the same for each trial.

This works great for coin flips, but not when outcomes are correlated. For that, we need the concept of conditional entropy. We can use the conditional entropy between two correlated random variables, $X$ and $Y$, to calculate their joint entropy, $H(X,Y)$:

$$H(X,Y) = H(X) + H(Y|X)$$

or alternatively,

$$H(X,Y) = H(Y) + H(X|Y)$$

For sequences, not only are random variables correlated, but they're also ordered. This order contains much of the sequence's information; as you said, sorting the characters in The Great Gatsby into ASCII order would change the information in the novel. To calculate the entropy in a sequence, its items must be generated in this order. Changing this order changes the entropy, because it changes the sequence.

For a sequence of outcomes generated by a sequence of random variables $X_0,\dots,X_{n-1}$, its entropy would then be

$$H(X_0,\dots,X_{n-1}) = H(X_0) + \sum_{i=1}^{n-1} H(X_i|X_0,\dots,X_{i-1})$$

or alternatively, backwards:

$$H(X_{n-1},\dots,X_0) = H(X_{n-1}) + \sum_{i=0}^{n-2} H(X_i|X_{i+1},\dots,X_{n-1})$$

(In practice, this can be hard to estimate for longer sequences if there aren't very many of them. If most of the correlations are very local, we can assume that each conditional entropy is conditioned only on the last $k$ outcomes, for some constant positive integer $k$; when $k=1$, we get the Markov assumption.)
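As a sketch of what the order-$k$ truncation looks like in practice, here is an empirical plug-in estimate over the 927 scales, pooling all positions into one conditional distribution per context (the pooling, and the bundled generator, are my own choices for the illustration):

```python
from collections import Counter, defaultdict
from math import log2

def scales(target=12, steps=(1, 2, 3)):
    """All interval sequences over `steps` summing to `target` (927 of them for 12)."""
    out = []
    def extend(prefix, remaining):
        if remaining == 0:
            out.append(prefix)
        else:
            for s in steps:
                if s <= remaining:
                    extend(prefix + (s,), remaining - s)
    extend((), target)
    return out

def order_k_conditional_entropy(corpus, k):
    """Empirical H(next interval | previous k intervals), pooled over all positions."""
    contexts = defaultdict(Counter)
    for seq in corpus:
        for i in range(k, len(seq)):
            contexts[seq[i - k:i]][seq[i]] += 1
    total = sum(sum(c.values()) for c in contexts.values())
    h = 0.0
    for counts in contexts.values():
        m = sum(counts.values())
        h_ctx = -sum((c / m) * log2(c / m) for c in counts.values())
        h += (m / total) * h_ctx
    return h

corpus = scales()
for k in (0, 1, 2):
    print(k, round(order_k_conditional_entropy(corpus, k), 3))
```

The $k=0$ value is just the marginal entropy of the pooled interval distribution; larger $k$ shows how much the immediate context narrows down the next interval.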

Since items in a sequence are generated in order, one after another, it also makes sense to talk about the entropy of a particular sequence:

$$H(\{X_0,\dots,X_{n-1}\}=\{x_0,\dots,x_{n-1}\})$$ $$= H(X_0) + \sum_{i=1}^{n-1} H(X_i|X_0=x_0,\dots,X_{i-1}=x_{i-1})$$

To make this more concrete, we can imagine a situation where one person has to indicate one of the scales you generated to another. First, person A randomly picks one of the 927 scales and gives it to person B. Then B has to send a message to person C, telling C which scale A picked. B and C can agree on a protocol beforehand, but can't communicate after B gets the scale from A. The entropy of a scale would then be the amount of information or uncertainty contained in the message that describes it.

So for the first scale, we only have to know the first two intervals to know the rest. The message "12" would be enough to specify it; the transmitter and receiver could agree beforehand that, for example, if the intervals didn't add up to 12, the receiver would just repeat the written intervals until they did. Since $H(X_0) = -\log_2 \frac{1}{3}$, and since $H(X_1|X_0)$ is the same, the joint entropy $H(\{X_0,\dots,X_7\}=\{1,2,1,2,1,2,1,2\})$ is then $-2\log_2 \frac{1}{3} \approx 3.17$ bits, or $0.396$ bits/interval.

For the second scale, however, there's no pattern that can shorten the message, so all intervals are independent and must be specified. With the exception of the final 2, which otherwise could only have been a 1, all the other intervals could have been 1, 2, or 3. So $H(\{X_0,\dots,X_7\}=\{2,2,1,2,1,1,1,2\})$ would then be $-7\log_2 \frac{1}{3} - \log_2 \frac{1}{2} \approx 12.09$ bits, or $1.51$ bits/interval.
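A quick numeric check of those two figures (this just redoes the arithmetic above):

```python
from math import log2

first  = 2 * log2(3)             # two free intervals, three choices each
second = 7 * log2(3) + log2(2)   # seven free intervals, plus one final binary choice
print(round(first, 2),  round(first / 8, 3))   # 3.17 bits, 0.396 bits/interval
print(round(second, 2), round(second / 8, 2))  # 12.09 bits, 1.51 bits/interval
```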
