You are currently browsing the tag archive for the ‘Shannon entropy’ tag.
[This post is dedicated to Luca Trevisan, who recently passed away due to cancer. Though far from his most significant contribution to the field, I would like to mention that, as with most of my other blog posts on this site, this page was written with the assistance of Luca’s LaTeX to WordPress converter. Mathematically, his work and insight on pseudorandomness in particular have greatly informed how I myself think about the concept. – T.]
Recently, Timothy Gowers, Ben Green, Freddie Manners, and I were able to establish the following theorem:
Theorem 1 (Marton's conjecture) Let $A \subset {\bf F}_2^n$ be non-empty with $|A+A| \leq K|A|$. Then there exists a subgroup $H$ of ${\bf F}_2^n$ with $|H| \leq |A|$ such that $A$ is covered by at most $2K^C$ translates of $H$, for some absolute constant $C$.
We established this result with an explicit value of $C$, although it has since been improved by Jyun-Jie Liao.
Our proof was written in order to optimize the constant as much as possible; similarly for the more detailed blueprint of the proof that was prepared in order to formalize the result in Lean. I have been asked a few times whether it is possible to present a streamlined and more conceptual version of the proof in which one does not try to establish an explicit constant
, but just to show that the result holds for some constant
. This is what I will attempt to do in this post, though some of the more routine steps will be outsourced to the aforementioned blueprint.
The key concept here is that of the entropic Ruzsa distance $d[X; Y]$ between two random variables $X, Y$ taking values in ${\bf F}_2^n$, defined as
$$ d[X; Y] := {\bf H}(X' - Y') - \frac{1}{2} {\bf H}(X) - \frac{1}{2} {\bf H}(Y),$$
where $X', Y'$ are independent copies of $X, Y$ respectively, and ${\bf H}(X)$ denotes the Shannon entropy of $X$. This distance is symmetric and non-negative, and obeys the triangle inequality
$$ d[X; Z] \leq d[X; Y] + d[Y; Z]$$
for any random variables $X, Y, Z$; see the blueprint for a proof.
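As a quick concrete illustration of this definition (my own toy code, not part of the formal argument; all helper names are mine), here is a self-contained sketch that computes the entropic Ruzsa distance over ${\bf F}_2^n$, where subtraction and addition coincide and are implemented as bitwise XOR, and spot-checks non-negativity, symmetry, and the triangle inequality on random distributions over ${\bf F}_2^3$.

```python
import math, random

def entropy(p):
    """Shannon entropy (in nats) of a distribution given as {outcome: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def xor_convolve(p, q):
    """Distribution of X' - Y' = X' + Y' (bitwise XOR) for independent X' ~ p, Y' ~ q."""
    r = {}
    for x, px in p.items():
        for y, qy in q.items():
            r[x ^ y] = r.get(x ^ y, 0.0) + px * qy
    return r

def ruzsa_dist(p, q):
    """Entropic Ruzsa distance d[X;Y]; it only depends on the two marginal laws."""
    return entropy(xor_convolve(p, q)) - 0.5 * entropy(p) - 0.5 * entropy(q)

def random_dist(n):
    """A random probability distribution on F_2^n (elements encoded as integers)."""
    w = [random.random() for _ in range(2 ** n)]
    s = sum(w)
    return {x: w[x] / s for x in range(2 ** n)}

random.seed(0)
for _ in range(100):
    X, Y, Z = (random_dist(3) for _ in range(3))
    assert ruzsa_dist(X, Y) >= -1e-9                         # non-negativity
    assert abs(ruzsa_dist(X, Y) - ruzsa_dist(Y, X)) < 1e-9   # symmetry
    assert ruzsa_dist(X, Z) <= ruzsa_dist(X, Y) + ruzsa_dist(Y, Z) + 1e-9  # triangle inequality
```

The above theorem then follows from an entropic analogue: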
Theorem 2 (Entropic Marton's conjecture) Let $X$ be a ${\bf F}_2^n$-valued random variable with $d[X; X] \leq \log K$. Then there exists a uniform random variable $U_H$ on a subgroup $H$ of ${\bf F}_2^n$ such that
$$ d[X; U_H] \leq C \log K$$
for some absolute constant $C$.
We were able to establish Theorem 2 with an explicit constant, which implies Theorem 1 by fairly standard additive combinatorics manipulations (such as the Ruzsa covering lemma); see the blueprint for details.
The key proposition needed to establish Theorem 2 is the following distance decrement property:
Proposition 3 (Distance decrement) If
are
-valued random variables, then one can find
-valued random variables
such that
and
for some absolute constants
.
Indeed, suppose this proposition held. Starting with both equal to
and iterating, one can then find sequences of random variables
with
,
and
In particular, from the triangle inequality and geometric series
By weak compactness, some subsequence of the ,
converge to some limiting random variables
, and by some simple continuity properties of entropic Ruzsa distance, we conclude that
and
Theorem 2 then follows from the “100% inverse theorem” for entropic Ruzsa distance; see the blueprint for details.
To prove Proposition 3, we can reformulate it as follows:
Proposition 4 (Lack of distance decrement implies vanishing) If
are
-valued random variables, with the property that
for all
-valued random variables
and some sufficiently small absolute constant
, then one can derive a contradiction.
Indeed, we may assume from the above proposition that
for some , which will imply Proposition 3 with
.
The entire game is now to use Shannon entropy inequalities and “entropic Ruzsa calculus” to deduce a contradiction from (1) for small enough. This we will do below the fold, but before doing so, let us first make some adjustments to (1) that will make it more useful for our purposes. Firstly, because conditional entropic Ruzsa distance (see blueprint for definitions) is an average of unconditional entropic Ruzsa distance, we can automatically upgrade (1) to the conditional version
for any random variables that are possibly coupled with
respectively. In particular, if we define a “relevant” random variable
(conditioned with respect to some auxiliary data
) to be a random variable for which
or equivalently (by the triangle inequality)
then we have the useful lower bound
whenever and
are relevant conditioning on
respectively. This is quite a useful bound, since the laws of “entropic Ruzsa calculus” will tell us, roughly speaking, that virtually any random variable that we can create from taking various sums of copies of
and conditioning against other sums, will be relevant. (Informally: the space of relevant random variables is
-separated with respect to the entropic Ruzsa distance.)
— 1. Main argument —
Now we derive more and more consequences of (2) – at some point crucially using the hypothesis that we are in characteristic two – before we reach a contradiction.
Right now, our hypothesis (2) only supplies lower bounds on entropic distances. The crucial ingredient that allows us to proceed is what we call the fibring identity, which lets us convert these lower bounds into useful upper bounds as well, which in fact match up very nicely when is small. Informally, the fibring identity captures the intuitive fact that the doubling constant of a set
should be at least as large as the doubling constant of the image
of that set under a homomorphism, times the doubling constant of a typical fiber
of that homomorphism; and furthermore, one should only be close to equality if the fibers “line up” in some sense.
Here is the fibring identity:
Proposition 5 (Fibring identity) Let $\pi: G \rightarrow H$ be a homomorphism. Then for any independent $G$-valued random variables $X, Y$, one has
$$ d[X; Y] = d[\pi(X); \pi(Y)] + d[X | \pi(X); Y | \pi(Y)] + {\bf I}(X - Y : (\pi(X), \pi(Y)) \,|\, \pi(X) - \pi(Y)).$$
The proof is of course in the blueprint, but given that it is a central pillar of the argument, I reproduce it here.
Proof: Expanding out the definition of Ruzsa distance, and using the conditional entropy chain rule
and
it suffices to establish the identity
But from the chain rule again we have
and from the definition of conditional mutual information (using the fact that is determined both by
and by
) one has
giving the claim.
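Since the fibring identity does all the heavy lifting below, here is an independent numerical sanity check of it (my own toy code, with helper names of my choosing, not taken from the blueprint) in the smallest interesting case: $G = {\bf F}_2^2$ with $\pi$ the projection to the first coordinate, addition implemented as bitwise XOR, and the conditional Ruzsa distance computed as the average of unconditional distances over the fibres.

```python
import math, random
from collections import defaultdict

def H(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def push(p, f):
    """Law of f(X) when X ~ p."""
    m = defaultdict(float)
    for x, v in p.items():
        m[f(x)] += v
    return m

def marginal(joint, idx):
    m = defaultdict(float)
    for k, v in joint.items():
        m[tuple(k[i] for i in idx)] += v
    return m

def ruzsa(p, q):
    """d[X;Y] for independent X ~ p, Y ~ q, with XOR as the group operation."""
    s = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            s[x ^ y] += px * qy
    return H(s) - 0.5 * H(p) - 0.5 * H(q)

def cond_ruzsa(p, q, pi):
    """d[X|pi(X); Y|pi(Y)]: average of the distance over pairs of fibres of pi."""
    total = 0.0
    for hx in set(map(pi, p)):
        px = {x: v for x, v in p.items() if pi(x) == hx}
        wx = sum(px.values()); px = {x: v / wx for x, v in px.items()}
        for hy in set(map(pi, q)):
            qy = {y: v for y, v in q.items() if pi(y) == hy}
            wy = sum(qy.values()); qy = {y: v / wy for y, v in qy.items()}
            total += wx * wy * ruzsa(px, qy)
    return total

def rand_dist():
    w = {x: random.random() for x in range(4)}   # elements of F_2^2 encoded as 0..3
    s = sum(w.values())
    return {x: v / s for x, v in w.items()}

pi = lambda g: g >> 1                            # projection to the first coordinate
random.seed(0)
for _ in range(50):
    p, q = rand_dist(), rand_dist()
    # joint law of (X+Y, (pi(X), pi(Y)), pi(X)+pi(Y)) for independent X ~ p, Y ~ q
    joint = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            joint[(x ^ y, (pi(x), pi(y)), pi(x) ^ pi(y))] += px * qy
    mi = (H(marginal(joint, (0, 2))) + H(marginal(joint, (1, 2)))
          - H(joint) - H(marginal(joint, (2,))))
    lhs = ruzsa(p, q)
    rhs = ruzsa(push(p, pi), push(q, pi)) + cond_ruzsa(p, q, pi) + mi
    assert abs(lhs - rhs) < 1e-9                 # the fibring identity holds exactly
```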
We will only care about the characteristic $2$ setting here, so we will now assume that all groups involved are $2$-torsion, so that we can replace all subtractions with additions. If we specialize the fibring identity to the case where
,
,
is the addition map
, and
,
are pairs of independent random variables in
, we obtain the following corollary:
Corollary 6 Let
be independent
-valued random variables. Then we have the identity
This is a useful and flexible identity, especially when combined with (2). For instance, we can discard the conditional mutual information term as being non-negative, to obtain the inequality
If we let be independent copies of
respectively (note the swap in the last two variables!) we obtain
From entropic Ruzsa calculus, one can check that ,
, and
are all relevant random variables, so from (2) we now obtain both upper and lower bounds for
:
A pleasant upshot of this is that we now get to work in the symmetric case without loss of generality. Indeed, if we set
, we now have from (2) that
whenever are relevant, which by entropic Ruzsa calculus is equivalent to asking that
Now we use the fibring identity again, relabeling as
and requiring
to be independent copies of
. We conclude that
As before, the random variables ,
,
,
are all relevant, so from (3) we have
We could now also match these lower bounds with upper bounds, but the more important takeaway from this analysis is a really good bound on the conditional mutual information:
By the data processing inequality, we can discard some of the randomness here, and conclude
Let us introduce the random variables
then we have
Intuitively, this means that and
are very nearly independent given
. For sake of argument, let us assume that they are actually independent; one can achieve something resembling this by invoking the entropic Balog-Szemerédi-Gowers theorem, established in the blueprint, after conceding some losses of
in the entropy, but we skip over the details for this blog post. The key point now is that because we are in characteristic $2$,
has the same form as
or
:
In particular, by permutation symmetry, we have
and so by the definition of conditional Ruzsa distance we have a massive distance decrement
contradicting (1) as desired. (In reality, we end up decreasing the distance not all the way to zero, but instead to due to losses in the Balog-Szemerédi-Gowers theorem, but this is still enough to reach a contradiction.)
Remark 7 A similar argument works in the
-torsion case for general
. Instead of decrementing the entropic Ruzsa distance, one instead decrements a “multidistance”
for independent
. By an iterated version of the fibring identity, one can first reduce again to the symmetric case where the random variables are all copies of the same variable
. If one then takes
,
to be an array of
copies of
, one can get to the point where the row sums
and the column sums
have small conditional mutual information with respect to the double sum
. If we then set
and
, the data processing inequality again shows that
and
are nearly independent given
. The
-torsion now crucially intervenes as before to ensure that
has the same form as
or
, leading to a contradiction as before. See this previous blog post for more discussion.
Let $X$ be a non-empty finite set. If $\mathbf{X}$ is a random variable taking values in $X$, the Shannon entropy $\mathbf{H}(\mathbf{X})$ of $\mathbf{X}$ is defined as
$$ \mathbf{H}(\mathbf{X}) := \sum_{x \in X} \mathbf{P}(\mathbf{X} = x) \log \frac{1}{\mathbf{P}(\mathbf{X} = x)}.$$
Lemma 1 (Gibbs variational formula) Let $f: X \rightarrow {\bf R}$ be a function. Then
$$ \log \sum_{x \in X} \exp(f(x)) = \sup_{\mathbf{Y}} \mathbf{E} f(\mathbf{Y}) + \mathbf{H}(\mathbf{Y}), \ \ \ \ (1)$$
where the supremum ranges over all random variables $\mathbf{Y}$ taking values in $X$.
Proof: Note that shifting $f$ by a constant affects both sides of (1) the same way, so we may normalize $\sum_{x \in X} \exp(f(x)) = 1$. Then $\exp(f)$ is now the probability distribution of some random variable $\mathbf{Y}_f$, and the inequality can be rewritten as the claim that
$$ \mathbf{E} f(\mathbf{Y}) + \mathbf{H}(\mathbf{Y}) \leq 0$$
for all $X$-valued random variables $\mathbf{Y}$, with equality when $\mathbf{Y} = \mathbf{Y}_f$; this is the non-negativity of the Kullback-Leibler divergence of $\mathbf{Y}$ from $\mathbf{Y}_f$, which follows from Jensen's inequality.
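For readers who like to see such identities in action, here is a quick numerical check of my own (an arbitrary test function $f$ on a six-point set; nothing here comes from the post itself): the Gibbs distribution proportional to $\exp(f)$ attains the supremum in (1), while other distributions fall below it.

```python
import math, random

random.seed(0)
X = range(6)
f = {x: random.uniform(-2, 2) for x in X}

lhs = math.log(sum(math.exp(f[x]) for x in X))

def score(p):
    """E f(Y) + H(Y) for a distribution p on X."""
    return (sum(p[x] * f[x] for x in X)
            - sum(p[x] * math.log(p[x]) for x in X if p[x] > 0))

gibbs = {x: math.exp(f[x]) / math.exp(lhs) for x in X}
assert abs(score(gibbs) - lhs) < 1e-9     # the Gibbs distribution attains equality
for _ in range(1000):                     # every other distribution lies below
    w = [random.random() for _ in X]
    p = {x: w[x] / sum(w) for x in X}
    assert score(p) <= lhs + 1e-9
```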
In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.
Theorem 2 (Generalized Cauchy-Schwarz inequality) Let, let
be finite non-empty sets, and let
be functions for each
. Let
and
be positive functions for each
. Then
where
is the quantity
where
is the set of all tuples
such that
for
.
Thus for instance, the identity is trivial for . When
, the inequality reads
We now prove this inequality. We write and
for some functions
and
. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes
Lemma 3 (Conditional expectation computation) Letbe an
-valued random variable. Then there exists a
-valued random variable
, where each
has the same distribution as
, and
Proof: We induct on . When
we just take
. Now suppose that
, and the claim has already been proven for
, thus one has already obtained a tuple
with each
having the same distribution as
, and
With a little more effort, one can replace by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.
Tim Gowers, Ben Green, Freddie Manners, and I have just uploaded to the arXiv our paper “On a conjecture of Marton“. This paper establishes a version of the notorious Polynomial Freiman–Ruzsa conjecture (first proposed by Katalin Marton):
Theorem 1 (Polynomial Freiman–Ruzsa conjecture) Let $A \subset {\bf F}_2^n$ be such that $|A+A| \leq K|A|$. Then $A$ can be covered by at most $2K^{12}$ translates of a subspace $H$ of ${\bf F}_2^n$ of cardinality at most $|A|$.
The previous best known result towards this conjecture was by Konyagin (as communicated in this paper of Sanders), who obtained a similar result but with replaced by
for any
(assuming that say
to avoid some degeneracies as
approaches
, which is not the difficult case of the conjecture). The conjecture (with
replaced by an unspecified constant
) has a number of equivalent forms; see this survey of Green, and these papers of Lovett and of Green and myself for some examples; in particular, as discussed in the latter two references, the constants in the inverse
theorem are now polynomial in nature (although we did not try to optimize the constant).
The exponent here was the product of a large number of optimizations to the argument (our original exponent here was closer to
), but can be improved even further with additional effort (our current argument, for instance, allows one to replace it with
, but we decided to state our result using integer exponents instead).
In this paper we will focus exclusively on the characteristic $2$ case (so we will be cavalier in identifying addition and subtraction), but in a followup paper we will establish similar results in other finite characteristics.
Much of the prior progress on this sort of result has proceeded via Fourier analysis. Perhaps surprisingly, our approach uses no Fourier analysis whatsoever, being conducted instead entirely in “physical space”. Broadly speaking, it follows a natural strategy, which is to induct on the doubling constant . Indeed, suppose for instance that one could show that every set
of doubling constant
was “commensurate” in some sense to a set
of doubling constant at most
. One measure of commensurability, for instance, might be the Ruzsa distance
, which one might hope to control by
. Then one could iterate this procedure until doubling constant dropped below say
, at which point the conjecture is known to hold (there is an elementary argument that if
has doubling constant less than
, then
is in fact a subspace of
). One can then use several applications of the Ruzsa triangle inequality
There are a number of possible ways to try to “improve” a set of not too large doubling by replacing it with a commensurate set of better doubling. We note two particular potential improvements:
- (i) Replacing
with
. For instance, if
was a random subset (of density
) of a large subspace
of
, then replacing
with
usually drops the doubling constant from
down to nearly
(under reasonable choices of parameters).
- (ii) Replacing
with
for a “typical”
. For instance, if
was the union of
random cosets of a subspace
of large codimension, then replacing
with
again usually drops the doubling constant from
down to nearly
.
Unfortunately, there are sets where neither of the above two operations (i), (ii) significantly improves the doubling constant. For instance, if
is a random density
subset of
random translates of a medium-sized subspace
, one can check that the doubling constant stays close to
if one applies either operation (i) or operation (ii). But in this case these operations don’t actually worsen the doubling constant much either, and by applying some combination of (i) and (ii) (either intersecting
with a translate, or taking a sumset of
with itself) one can start lowering the doubling constant again.
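To make the two operations concrete, here is a toy experiment of my own (small, arbitrarily chosen parameters in ${\bf F}_2^8$, with XOR as addition; nothing here is taken from the paper): for a random half-density subset of a subspace, passing to the sumset as in (i) brings the doubling constant down to essentially $1$, while for a union of a few random cosets of a small subspace it is the intersection with a translate as in (ii) that does the job.

```python
import random

def sumset(A, B):
    return {a ^ b for a in A for b in B}

def doubling(A):
    return len(sumset(A, A)) / len(A)

random.seed(2)
n = 8

# (i) A = random half-density subset of the subspace V of F_2^8 spanned by the top 6 bits
V = [x << 2 for x in range(2 ** 6)]
A = set(random.sample(V, 2 ** 5))
print(doubling(A), doubling(sumset(A, A)))   # roughly 2, then essentially 1

# (ii) A = union of 8 random cosets of the small subspace W = {0,1,2,3}
W = set(range(4))
cosets = random.sample(range(2 ** n), 8)
A = {c ^ w for c in cosets for w in W}
x = random.choice(sorted(sumset(A, A)))      # a "typical" element of A + A
B = A & {a ^ x for a in A}                   # A intersected with a translate of itself
print(doubling(A), doubling(B))              # noticeably larger than 1, then close to 1
```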
This begins to suggest a potential strategy: show that at least one of the operations (i) or (ii) will improve the doubling constant, or at least not worsen it too much; and in the latter case, perform some more complicated operation to locate the desired doubling constant improvement.
A sign that this strategy might have a chance of working is provided by the following heuristic argument. If has doubling constant
, then the Cartesian product
has doubling constant
. On the other hand, by using the projection map
defined by
, we see that
projects to
, with fibres
being essentially a copy of
. So, morally,
also behaves like a “skew product” of
and the fibres
, which suggests (non-rigorously) that the doubling constant
of
is also something like the doubling constant of
, times the doubling constant of a typical fibre
. This would imply that at least one of
and
would have doubling constant at most
, and thus that at least one of operations (i), (ii) would not worsen the doubling constant.
Unfortunately, this argument does not seem to be easily made rigorous using the traditional doubling constant; even the significantly weaker statement that has doubling constant at most
is false (see comments for more discussion). However, it turns out (as discussed in this recent paper of myself with Green and Manners) that things are much better. Here, the analogue of a subset
in
is a random variable
taking values in
, and the analogue of the (logarithmic) doubling constant
is the entropic doubling constant
, where
are independent copies of
. If
is a random variable in some additive group
and
is a homomorphism, one then has what we call the fibring inequality
Applying this inequality with replaced by two independent copies
of itself, and using the addition map
for
, we obtain in particular that
A version of this endgame conclusion is in fact valid in any characteristic. But in characteristic $2$, we can take advantage of the identity
To deal with the situation where the conditional mutual information is small but not completely zero, we have to use an entropic version of the Balog-Szemerédi-Gowers lemma, but fortunately this was already worked out in an old paper of mine (although in order to optimise the final constant, we ended up using a slight variant of that lemma).
I am planning to formalize this paper in the Lean4 language. Further discussion of this project will take place on this Zulip stream, and the project itself will be held at this Github repository.
A family of sets for some
is a sunflower if there is a core set
contained in each of the
such that the petal sets
are disjoint. If
, let
denote the smallest natural number with the property that any family of
distinct sets of cardinality at most
contains
distinct elements
that form a sunflower. The celebrated Erdös-Rado theorem asserts that
is finite; in fact Erdös and Rado gave the bounds
Rao’s argument used the Shannon noiseless coding theorem. It turns out that the argument can be arranged in the very slightly different language of Shannon entropy, and I would like to present it here. The argument proceeds by locating the core and petals of the sunflower separately (this strategy is also followed in Alweiss-Lovett-Wu-Zhang). In both cases the following definition will be key. In this post all random variables, such as random sets, will be understood to be discrete random variables taking values in a finite range. We always use boldface symbols to denote random variables, and non-boldface for deterministic quantities.
Definition 1 (Spread set) Let. A random set
is said to be
-spread if one has
for all sets
. A family
of sets is said to be
-spread if
is non-empty and the random variable
is
-spread, where
is drawn uniformly from
.
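As a concrete (and entirely optional) illustration of this definition, here is a small brute-force checker of my own for the $R$-spread condition of a finite family; the example family of all $2$-element subsets of a $6$-element set turns out to be $3$-spread but not $4$-spread. All names and parameters here are mine.

```python
from itertools import combinations

def is_spread(family, R):
    """Check that, for A drawn uniformly from the family, P(S <= A) <= R^(-|S|)
    for every non-empty set S (it suffices to test subsets of family members)."""
    family = [frozenset(A) for A in family]
    n = len(family)
    candidates = {S for A in family for k in range(1, len(A) + 1)
                  for S in combinations(sorted(A), k)}
    for S in candidates:
        S = set(S)
        prob = sum(1 for A in family if S <= A) / n
        if prob > R ** (-len(S)) + 1e-12:
            return False
    return True

family = list(combinations(range(6), 2))     # all 2-element subsets of a 6-element set
print(is_spread(family, 3.0), is_spread(family, 4.0))   # True, False
```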
The core can then be selected greedily in such a way that the remainder of a family becomes spread:
Lemma 2 (Locating the core) Letbe a family of subsets of a finite set
, each of cardinality at most
, and let
. Then there exists a “core” set
of cardinality at most
such that the set
has cardinality at least
, and such that the family
is
-spread. Furthermore, if
and the
are distinct, then
.
Proof: We may assume is non-empty, as the claim is trivial otherwise. For any
, define the quantity
Let be the set (3). Since
,
is non-empty. It remains to check that the family
is
-spread. But for any
and
drawn uniformly at random from
one has
In view of the above lemma, the bound (2) will then follow from
Proposition 3 (Locating the petals) Letbe natural numbers, and suppose that
for a sufficiently large constant
. Let
be a finite family of subsets of a finite set
, each of cardinality at most
which is
-spread. Then there exist
such that
is disjoint.
Indeed, to prove (2), we assume that is a family of sets of cardinality greater than
for some
; by discarding redundant elements and sets we may assume that
is finite and that all the
are contained in a common finite set
. Apply Lemma 2 to find a set
of cardinality
such that the family
is
-spread. By Proposition 3 we can find
such that
are disjoint; since these sets have cardinality
, this implies that the
are distinct. Hence
form a sunflower as required.
Remark 4 Proposition 3 is easy to prove if we strengthen the condition onto
. In this case, we have
for every
, hence by the union bound we see that for any
with
there exists
such that
is disjoint from the set
, which has cardinality at most
. Iterating this, we obtain the conclusion of Proposition 3 in this case. This recovers a bound of the form
, and by pursuing this idea a little further one can recover the original upper bound (1) of Erdös and Rado.
It remains to prove Proposition 3. In fact we can locate the petals one at a time, placing each petal inside a random set.
Proposition 5 (Locating a single petal) Let the notation and hypotheses be as in Proposition 3. Letbe a random subset of
, such that each
lies in
with an independent probability of
. Then with probability greater than
,
contains one of the
.
To see that Proposition 5 implies Proposition 3, we randomly partition into
by placing each
into one of the
,
chosen uniformly and independently at random. By Proposition 5 and the union bound, we see that with positive probability, it is simultaneously true for all
that each
contains one of the
. Selecting one such
for each
, we obtain the required disjoint petals.
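The following is a quick Monte Carlo illustration (toy parameters of my own choosing; it proves nothing) of the flavour of Proposition 5: for a well-spread family, a random subset of the ambient set of moderate density already contains one of the members with probability close to one.

```python
import random
from itertools import combinations

random.seed(0)
X = range(12)
family = list(combinations(X, 3))   # all 3-element subsets: a well-spread family
p = 0.5
trials, hits = 2000, 0
for _ in range(trials):
    W = {x for x in X if random.random() < p}   # each element kept independently
    if any(set(A) <= W for A in family):
        hits += 1
print(hits / trials)                # close to 1 for this family and density
```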
We will prove Proposition 5 by gradually increasing the density of the random set and arranging the sets to get quickly absorbed by this random set. The key iteration step is
Proposition 6 (Refinement inequality) Letand
. Let
be a random subset of a finite set
which is
-spread, and let
be a random subset of
independent of
, such that each
lies in
with an independent probability of
. Then there exists another
-spread random subset
of
whose support is contained in the support of
, such that
and
Note that a direct application of the first moment method gives only the bound
One can iterate the above proposition, repeatedly replacing with
(noting that this preserves the
-spread nature of
) to conclude
Corollary 7 (Iterated refinement inequality) Let,
, and
. Let
be a random subset of a finite set
which is
-spread, and let
be a random subset of
independent of
, such that each
lies in
with an independent probability of
. Then there exists another random subset
of
with support contained in the support of
, such that
Now we can prove Proposition 5. Let be chosen shortly. Applying Corollary 7 with
drawn uniformly at random from the
, and setting
, or equivalently
, we have
It remains to establish Proposition 6. This is the difficult step, and requires a clever way to find the variant of
that has better containment properties in
than
does. The main trick is to make a conditional copy
of
that is conditionally independent of
subject to the constraint
. The point here is that this constraint implies the inclusions
In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.
The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form
of the prime number theorem (where we take the limit ). One can interpret this estimate probabilistically as
where is a random variable drawn uniformly from the natural numbers up to
, and
denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum
After dividing by , this is essentially
With probabilistic intuition, one may expect the random variables to be approximately independent (there is no obvious relationship between the number of prime factors of
, and of
), and so the above average would be expected to be approximately equal to
which by (2) is equal to . Thus we are led to the prediction
The asymptotic (3) is widely believed (it is a special case of the Chowla conjecture, which we will discuss in later notes); while there has been recent progress towards establishing it rigorously, it remains open for now.
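As a purely empirical aside (my own toy computation, which of course proves nothing), one can watch the averages of $\lambda(n)\lambda(n+1)$ shrink numerically, where $\lambda(n) = (-1)^{\Omega(n)}$ is the Liouville function; the sieve below is a standard smallest-prime-factor construction.

```python
def liouville_up_to(N):
    """lam[n] = (-1)^Omega(n) for 1 <= n <= N, via a smallest-prime-factor sieve."""
    spf = list(range(N + 1))
    for i in range(2, int(N ** 0.5) + 1):
        if spf[i] == i:                      # i is prime
            for j in range(i * i, N + 1, i):
                if spf[j] == j:
                    spf[j] = i
    lam = [0, 1] + [0] * (N - 1)
    for n in range(2, N + 1):
        lam[n] = -lam[n // spf[n]]
    return lam

N = 10 ** 6
lam = liouville_up_to(N)
for x in (10 ** 4, 10 ** 5, 10 ** 6 - 1):
    s = sum(lam[n] * lam[n + 1] for n in range(1, x + 1))
    print(x, s / x)                          # the normalised sums are small
```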
How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “” measure of independence known as covariance, and the “
” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of
type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of
methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as
methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version
of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound , but the latter bound is much easier to establish than the former).
As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them on average, rather than uniformly. For instance, given two random variables and
of number-theoretic origin (such as the random variables
and
mentioned previously), it can often be extremely difficult to determine the extent to which
behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if
are a large collection of “independent” random variables, and
is a further random variable that is “not too large” in some sense, then
must necessarily be nearly independent (or conditionally independent) to many of the
, even if one cannot pinpoint precisely which of the
the variable
is independent with. In the case of the second moment method, this allows us to compute correlations such as
for “most”
. The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to say non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities
for “most”
thanks to tools such as the Pinsker inequality.
Given a random variable $X$ that takes on only finitely many values, we can define its Shannon entropy by the formula
$$ H(X) := \sum_x {\bf P}(X = x) \log \frac{1}{{\bf P}(X = x)}$$
with the convention that $0 \log \frac{1}{0} = 0$. (In some texts, one uses the logarithm to base $2$ rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables $X, Y$ taking on finitely many values, the joint variable $(X, Y)$ is also a random variable taking on finitely many values, and also has an entropy $H(X, Y)$. It obeys the Shannon inequalities
$$ H(X), H(Y) \leq H(X, Y) \leq H(X) + H(Y),$$
so we can define some further nonnegative quantities, the mutual information
$$ I(X : Y) := H(X) + H(Y) - H(X, Y)$$
and the conditional entropies
$$ H(X | Y) := H(X, Y) - H(Y), \quad H(Y | X) := H(X, Y) - H(X).$$
More generally, given three random variables $X, Y, Z$, one can define the conditional mutual information
$$ I(X : Y | Z) := H(X | Z) + H(Y | Z) - H(X, Y | Z)$$
and the last of the Shannon entropy inequalities asserts that this quantity is also non-negative.
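To keep these definitions grounded, here is a small self-contained sketch of my own that computes these quantities from an explicit joint distribution and spot-checks the two Shannon inequalities just mentioned on random joint laws; the helper names are mine.

```python
import math, random
from collections import defaultdict

def H(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marg(p, idx):
    """Marginal law of the selected coordinates of a joint distribution on tuples."""
    m = defaultdict(float)
    for k, v in p.items():
        m[tuple(k[i] for i in idx)] += v
    return m

def cond_mutual_info(p):
    """I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for a joint law p on triples."""
    return H(marg(p, (0, 2))) + H(marg(p, (1, 2))) - H(p) - H(marg(p, (2,)))

random.seed(0)
for _ in range(200):
    w = {(x, y, z): random.random() for x in range(3) for y in range(3) for z in range(3)}
    s = sum(w.values())
    p = {k: v / s for k, v in w.items()}
    assert cond_mutual_info(p) >= -1e-9      # non-negativity of conditional mutual information
    assert H(marg(p, (0, 1))) <= H(marg(p, (0,))) + H(marg(p, (1,))) + 1e-9  # subadditivity
```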
The mutual information is a measure of the extent to which
and
fail to be independent; indeed, it is not difficult to show that
vanishes if and only if
and
are independent. Similarly,
vanishes if and only if
and
are conditionally independent relative to
. At the other extreme,
is a measure of the extent to which
fails to depend on
; indeed, it is not difficult to show that
if and only if
is determined by
in the sense that there is a deterministic function
such that
. In a related vein, if
and
are equivalent in the sense that there are deterministic functional relationships
,
between the two variables, then
is interchangeable with
for the purposes of computing the above quantities, thus for instance
,
,
,
, etc..
One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables being considered come from restricting a single random (and uniformly distributed) boolean function
on a given finite domain
to some subset
of
:
In this case, has the law of a random uniformly distributed boolean function from
to
, and the entropy here can be easily computed to be
, where
denotes the cardinality of
. If
is the restriction of
to
, and
is the restriction of
to
, then the joint variable
is equivalent to the restriction of
to
. If one discards the normalisation factor
, one then obtains the following dictionary between entropy and the combinatorics of finite sets:
Random variables | Finite sets
Entropy | Cardinality
Joint variable | Union
Mutual information | Intersection cardinality
Conditional entropy | Set difference cardinality
Conditional mutual information | Cardinality of the intersection outside the conditioning set
Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality $H(X,Y) \leq H(X) + H(Y)$ becomes the union bound (the cardinality of a union is at most the sum of the cardinalities), and the definition of mutual information becomes the inclusion-exclusion formula for the cardinality of an intersection.
For a more advanced example, consider the data processing inequality that asserts that if are conditionally independent relative to
, then
. Specialising to sets, this now says that if
are disjoint outside of
, then
; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if
and
are not necessarily disjoint outside of
, then a consideration of Venn diagrams gives the more general inequality
and a further inspection of the diagram then reveals the more precise identity
Using the dictionary in the reverse direction, one is then led to conjecture the identity
which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set is preserved by unions:
Indeed, one has the union bound
Applying the dictionary in the reverse direction, one might now conjecture that if was independent of
and
was independent of
, then
should also be independent of
, and furthermore that
but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take to be independent, uniformly distributed random elements of the finite field
of two elements, and take
to be the sum of these two field elements. One can easily check that each of
and
is separately independent of
, but the joint variable
determines
and thus is not independent of
.
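Here is a short verification of this counterexample (my own toy code): with independent uniform bits and their sum in ${\bf F}_2$, both pairwise mutual informations with the sum vanish, while the joint variable has mutual information $\log 2$ with it.

```python
import math
from collections import defaultdict

def H(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marg(p, idx):
    m = defaultdict(float)
    for k, v in p.items():
        m[tuple(k[i] for i in idx)] += v
    return m

# joint law of (X, Y, Z) with X, Y independent uniform bits and Z = X xor Y
p = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

I_XZ  = H(marg(p, (0,))) + H(marg(p, (2,))) - H(marg(p, (0, 2)))
I_YZ  = H(marg(p, (1,))) + H(marg(p, (2,))) - H(marg(p, (1, 2)))
I_XYZ = H(marg(p, (0, 1))) + H(marg(p, (2,))) - H(p)

print(I_XZ, I_YZ, I_XYZ)   # approximately 0, 0, log 2: pairwise but not jointly independent
```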
From the inclusion-exclusion identities
one can check that (1) is equivalent to the trivial lower bound . The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection
. (Even the double intersection
only exists information theoretically in a “virtual” sense; the mutual information
allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)
However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities associated to just two variables
are those that are also necessarily obeyed by their combinatorial analogues
. (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)
One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let be some finite-dimensional vector space over a finite field
, and let
be a random linear functional on
, selected uniformly among all such functions. Every subspace
of
then gives rise to a random variable
formed by restricting
to
. This random variable is also distributed uniformly amongst all linear functions on
, and its entropy can be easily computed to be
. Given two random variables
formed by restricting
to
respectively, the joint random variable
determines the random linear function
on the union
on the two spaces, and thus by linearity on the Minkowski sum
as well; thus
is equivalent to the restriction of
to
. In particular,
. This implies that
and also
, where
is the quotient map. After discarding the normalising constant
, this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:
Random variables |
Subspaces |
Entropy |
Dimension |
Joint variable |
Sum |
Mutual information |
Dimension of intersection |
Conditional entropy |
Dimension of projection |
Conditional mutual information |
|
The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking to be the vector space
over the finite field
of two elements, and only considering those subspaces
that are coordinate subspaces
associated to various subsets
of
.
As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of ). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).
The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces such that
and
are separately transverse to
, but their sum
is not; for instance, in a two-dimensional vector space
, one can take
to be the one-dimensional subspaces spanned by
,
, and
respectively. Note that this is essentially the same counterexample from before (which took
to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces
(as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables
(as well as the corresponding quantities on the above table).
However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality
for any subspaces . This is easiest to see when the three terms on the right-hand side vanish; then
are transverse, which implies that
; similarly
. But
and
are transverse, and this clearly implies that
and
are themselves transverse. To prove the general case of Ingleton’s inequality, one can define
and use
(and similarly for
instead of
) to reduce to establishing the inequality
which can be rearranged using (and similarly for
instead of
) and
as
but this is clear since .
Returning to the entropy setting, the analogue
of (3) is true (exercise!), but the analogue
of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then are conditionally independent relative to
, and relative to
, and
and
are independent, and the claim (4) would then be asserting that
and
are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take
to be independent uniform variables from
, and take
and
to be (say)
and
respectively (thus
are the indicators of the events
and
respectively). Once one conditions on either
or
, one of
has positive conditional entropy and the other has zero entropy, and so
are conditionally independent relative to either
or
; also,
or
are independent of each other. But
and
are not independent of each other (they cannot be simultaneously equal to
). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces
has a well-defined intersection
that is also a subspace, whereas for arbitrary random variables
, there does not necessarily exist the analogue of an intersection, namely a “common information” random variable
that has the entropy of
and is determined either by
or by
.
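Since the specific choice of events is elided in the archived text above, here is one concrete instance of this phenomenon (my own choice, which may differ in its details from the original): with $A, B$ independent uniform bits, the indicators of the events $A = B = 1$ and $A = B = 0$ are conditionally independent given either bit, and the two bits are independent, yet the two indicators are correlated, so the entropy analogue (4) of Ingleton's inequality fails.

```python
import math
from collections import defaultdict

def H(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marg(p, idx):
    m = defaultdict(float)
    for k, v in p.items():
        m[tuple(k[i] for i in idx)] += v
    return m

def I(p, i, j, cond=()):
    """Conditional mutual information I(coordinate i : coordinate j | coordinates in cond)."""
    return (H(marg(p, (i,) + cond)) + H(marg(p, (j,) + cond))
            - H(marg(p, (i, j) + cond)) - H(marg(p, cond)))

# joint law of (A, B, C, D) with C = 1_{A=B=1} and D = 1_{A=B=0}
p = {(a, b, int(a == b == 1), int(a == b == 0)): 0.25 for a in (0, 1) for b in (0, 1)}

print(I(p, 2, 3, (0,)), I(p, 2, 3, (1,)), I(p, 0, 1))  # all three terms on the right vanish
print(I(p, 2, 3))                                      # but the left-hand side is positive
```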
I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality
One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that
whenever was sufficiently large depending on
, where
is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale
between
and
, one can form certain random variables
. The random variable
is a sign pattern of the form
where
is a random number chosen from
to
(with logarithmic weighting). The random variable
was a tuple
of reductions of
to primes
comparable to
. Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of
, the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound
on the mutual information between and
. From translation invariance, this also gives the more general lower bound
for any , where
denotes the shifted sign pattern
. On the other hand, one had the entropy bounds
and from concatenating sign patterns one could see that is equivalent to the joint random variable
for any
. Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once
was allowed to become sufficiently large compared to
, but the bound was quite weak (coming ultimately from the unboundedness of
as the interval
of values of
under consideration becomes large), something of the order of
; the quantity
needs at various junctures to be less than a small power of
, so the relationship between
and
becomes essentially quadruple exponential in nature,
. The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate
of the mean entropy, in that this quantity decreased by
as
increased from
to
, basically by dividing
into
components
,
and observing from (6) each of these shares a bit of common information with the same variable
. This is relatively clear when one works in a set model, in which
is modeled by a set
of size
, and
is modeled by a set of the form
for various sets of size
(also there is some translation symmetry that maps
to a shift
while preserving all of the
).
However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables are basically jointly independent as
ranges over dyadic values that are much smaller than
, which in the set model corresponds to the
all being disjoint. One can then establish a variant
of (6), which in the set model roughly speaking asserts that each claims a portion of the
of cardinality
that is not claimed by previous choices of
. This leads to a more efficient contradiction (relying on the unboundedness of
rather than
) that looks like it removes one order of exponential growth, thus the relationship between
and
is now
. Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form
for a small constant , which on iterating and using the boundedness of
gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges
greater than
or so, although at this range the theorem is not significantly simpler than the general case).
A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:
Theorem 1 (Plünnecke-Ruzsa inequality) Let $A, B_1, \ldots, B_m$ be finite non-empty subsets of an additive group $G$, such that $|A + B_i| \leq K_i |A|$ for all $1 \leq i \leq m$ and some scalars $K_1, \ldots, K_m$. Then there exists a subset $A'$ of $A$ such that $|A' + B_1 + \ldots + B_m| \leq K_1 \cdots K_m |A'|$.
The proof uses graph-theoretic techniques. Setting $B_1 = \ldots = B_m = A$, we obtain a useful corollary: if $A$ has small doubling in the sense that $|A+A| \leq K|A|$, then we have $|mA| \leq K^m |A|$ for all $m \geq 1$, where $mA = A + \ldots + A$ is the sum of $m$ copies of $A$.
In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as in
are replaced with discrete random variables
taking values in
, and (the logarithm of) cardinality
of a set
is replaced by Shannon entropy
of a random variable
. (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.
I recently discovered, however, that buried in a classic paper of Kaimanovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 1:
Theorem 2 (Entropy Plünnecke-Ruzsa inequality) Let $X, Y_1, \ldots, Y_m$ be independent random variables of finite entropy taking values in an additive group $G$, such that $H(X + Y_i) \leq H(X) + \log K_i$ for all $1 \leq i \leq m$ and some scalars $K_1, \ldots, K_m$. Then $H(X + Y_1 + \ldots + Y_m) \leq H(X) + \log(K_1 \cdots K_m)$.
In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set to a subset
, but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if
, then
for all
, where
are independent copies of
; this improves slightly over the analogous combinatorial inequality. Indeed, the function
is concave (this can be seen by using the
version of Theorem 2 (or (2) below) to show that the quantity
is decreasing in
).
Theorem 2 is actually a quick consequence of the submodularity inequality
$$ H(Z) + H(W) \leq H(X) + H(Y) \ \ \ \ (1)$$
in information theory, which is valid whenever $X, Y, Z, W$ are discrete random variables such that $X$ and $Y$ each determine $W$ (i.e. $W$ is a function of $X$, and also a function of $Y$), and $X$ and $Y$ jointly determine $Z$ (i.e. $Z$ is a function of $X$ and $Y$). To apply this, let $X, Y, Z$ be independent discrete random variables taking values in $G$. Observe that the pairs $(X+Z, Y)$ and $(Y+Z, X)$ each determine $X+Y+Z$, and jointly determine $(X, Y, Z)$. Applying (1) (with these two pairs playing the role of the first two variables there) we conclude that
$$ H(X+Y+Z) + H(X, Y, Z) \leq H(X+Z, Y) + H(Y+Z, X),$$
which after using the independence of $X, Y, Z$ simplifies to the sumset submodularity inequality
$$ H(X+Y+Z) + H(Z) \leq H(X+Z) + H(Y+Z) \ \ \ \ (2)$$
(this inequality was also recently observed by Madiman; it is the $m=2$ case of Theorem 2). As a corollary of this inequality, we see that if $H(X + Y_i) \leq H(X) + \log K_i$, then
$$ H(X + Y_1 + \ldots + Y_i) \leq H(X + Y_1 + \ldots + Y_{i-1}) + \log K_i,$$
and Theorem 2 follows by telescoping series.
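Here is a quick numerical spot-check of the sumset submodularity inequality (2) (my own toy code, with arbitrary integer-valued distributions), which is the inequality driving the telescoping step:

```python
import math, random
from collections import defaultdict

def H(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def add(p, q):
    """Law of X + Y for independent X ~ p, Y ~ q."""
    r = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            r[x + y] += px * qy
    return r

def rand_dist(k):
    w = {x: random.random() for x in range(k)}
    s = sum(w.values())
    return {x: v / s for x, v in w.items()}

random.seed(0)
for _ in range(200):
    X, Y, Z = (rand_dist(5) for _ in range(3))
    assert H(add(add(X, Y), Z)) + H(Z) <= H(add(X, Z)) + H(add(Y, Z)) + 1e-9
```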
The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of than
; see this paper and this paper respectively.
It turns out to be a favourable week or two for me to finally finish a number of papers that had been at a nearly completed stage for a while. I have just uploaded to the arXiv my article “Sumset and inverse sumset theorems for Shannon entropy“, submitted to Combinatorics, Probability, and Computing. This paper evolved from a “deleted scene” in my book with Van Vu entitled “Entropy sumset estimates“. In those notes, we developed analogues of the standard Plünnecke-Ruzsa sumset estimates (which relate quantities such as the cardinalities of the sum and difference sets of two finite sets
in an additive group
to each other), to the entropy setting, in which the finite sets
are replaced instead with discrete random variables
taking values in that group G, and the (logarithm of the) cardinality |A| is replaced with the Shannon entropy
This quantity measures the information content of X; for instance, if , then it will take k bits on the average to store the value of X (thus a string of n independent copies of X will require about nk bits of storage in the asymptotic limit
). The relationship between entropy and cardinality is that if X is the uniform distribution on a finite non-empty set A, then
. If instead X is non-uniformly distributed on A, one has
, thanks to Jensen’s inequality.
It turns out that many estimates on sumsets have entropy analogues, which resemble the “logarithm” of the sumset estimates. For instance, the trivial bounds
have the entropy analogue
whenever X, Y are independent discrete random variables in an additive group; this is not difficult to deduce from standard entropy inequalities. Slightly more non-trivially, the sum set estimate
established by Ruzsa, has an entropy analogue
,
and similarly for a number of other standard sumset inequalities in the literature (e.g. the Ruzsa triangle inequality, the Plünnecke-Ruzsa inequality, and the Balog-Szemerédi-Gowers theorem, though the entropy analogue of the latter requires a little bit of care to state). These inequalities can actually be deduced fairly easily from elementary arithmetic identities, together with standard entropy inequalities, most notably the submodularity inequality
whenever X,Y,Z,W are discrete random variables such that X and Y each determine W separately (thus for some deterministic functions f, g) and X and Y determine Z jointly (thus
for some deterministic function f). For instance, if X,Y,Z are independent discrete random variables in an additive group G, then
and
each determine
separately, and determine
jointly, leading to the inequality
which soon leads to the entropy Ruzsa triangle inequality
which is an analogue of the combinatorial Ruzsa triangle inequality
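For concreteness, here is a small numerical spot-check of my own of the entropy Ruzsa triangle inequality in the form $H(X - Z) + H(Y) \leq H(X - Y) + H(Y - Z)$ for independent integer-valued $X, Y, Z$ (the displayed form is elided in this archive, so this is the standard statement rather than a transcription):

```python
import math, random
from collections import defaultdict

def H(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def diff(p, q):
    """Law of X - Y for independent X ~ p, Y ~ q."""
    r = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            r[x - y] += px * qy
    return r

def rand_dist(k):
    w = {x: random.random() for x in range(k)}
    s = sum(w.values())
    return {x: v / s for x, v in w.items()}

random.seed(0)
for _ in range(200):
    X, Y, Z = (rand_dist(5) for _ in range(3))
    assert H(diff(X, Z)) + H(Y) <= H(diff(X, Y)) + H(diff(Y, Z)) + 1e-9
```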
All of this was already in the unpublished notes with Van, though I include it in this paper in order to place it in the literature. The main novelty of the paper, though, is to consider the entropy analogue of Freiman’s theorem, which classifies those sets A for which . Here, the analogous problem is to classify the random variables
such that
, where
are independent copies of X. Let us say that X has small doubling if this is the case.
For instance, the uniform distribution U on a finite subgroup H of G has small doubling (in fact in this case). In a similar spirit, the uniform distribution on a (generalised) arithmetic progression P also has small doubling, as does the uniform distribution on a coset progression H+P. Also, if X has small doubling, and Y has bounded entropy, then X+Y also has small doubling, even if Y and X are not independent. The main theorem is that these are the only cases:
Theorem 1. (Informal statement) X has small doubling if and only if X = U + Y for some uniform distribution U on a coset progression (of bounded rank), and Y has bounded entropy.
For instance, suppose that X was the uniform distribution on a dense subset A of a finite group G. Then Theorem 1 asserts that X is close in a “transport metric” sense to the uniform distribution U on G, in the sense that it is possible to rearrange or transport the probability distribution of X to the probability distribution of U (or vice versa) by shifting each component of the mass of X by an amount Y which has bounded entropy (which basically means that it primarily ranges inside a set of bounded cardinality). The way one shows this is by randomly translating the mass of X around by a few random shifts to approximately uniformise the distribution, and then deal with the residual fluctuation in the distribution by hand. Theorem 1 as a whole is established by using the Freiman theorem in the combinatorial setting combined with various elementary convexity and entropy inequality arguments to reduce matters to the above model case when X is supported inside a finite group G and has near-maximal entropy.
I also show a variant of the above statement: if X, Y are independent and , then we have
(i.e. X has the same distribution as Y+Z for some Z of bounded entropy (not necessarily independent of X or Y). Thus if two random variables are additively related to each other, then they can be additively transported to each other by using a bounded amount of entropy.
In the last part of the paper I relate these discrete entropies to their continuous counterparts
where X is now a continuous random variable on the real line with density function . There are a number of sum set inequalities known in this setting, for instance
,
for independent copies of a finite entropy random variable X, with equality if and only if X is a Gaussian. Using this inequality and Theorem 1, I show a discrete version, namely that
,
whenever and
are independent copies of a random variable in
(or any other torsion-free abelian group) whose entropy is sufficiently large depending on
. This is somewhat analogous to the classical sumset inequality
though notice that we have a gain of just rather than
here, the point being that there is a Gaussian counterexample in the entropy setting which does not have a combinatorial analogue (except perhaps in the high-dimensional limit). The main idea is to use Theorem 1 to trap most of X inside a coset progression, at which point one can use Fourier-analytic additive combinatorial tools to show that the distribution
is “smooth” in some non-trivial direction r, which can then be used to approximate the discrete distribution by a continuous one.
I also conjecture more generally that the entropy monotonicity inequalities established by Artstein, Barthe, Ball, and Naor in the continuous case also hold in the above sense in the discrete case, though my method of proof breaks down because I no longer can assume small doubling.