Tag Archive

You are currently browsing the tag archive for the ‘Shannon entropy’ tag.

An abridged proof of Marton’s conjecture

22 June, 2024 in expository, math.CO | Tags: Ben Green, Freddie Manners, Luca Trevisan, Polynomial Freiman-Ruzsa conjecture, Shannon entropy, Timothy Gowers | by Terence Tao | 18 comments

[This post is dedicated to Luca Trevisan, who recently passed away due to cancer. Though far from his most significant contribution to the field, I would like to mention that, as with most of my other blog posts on this site, this page was written with the assistance of Luca’s LaTeX to WordPress converter. Mathematically, his work and insight on pseudorandomness in particular have greatly informed how I myself think about the concept. – T.]

Recently, Timothy Gowers, Ben Green, Freddie Manners, and I were able to establish the following theorem:

Theorem 1 (Marton’s conjecture) Let ${A \subset {\bf F}_2^n}$ be non-empty with ${|A+A| \leq K|A|}$ . Then there exists a subgroup ${H}$ of ${{\bf F}_2^n}$ with ${|H| \leq |A|}$ such that ${A}$ is covered by at most ${2K^C}$ translates of ${H}$ , for some absolute constant ${C}$ .

We established this result with ${C=12}$ , although it has since been improved to ${C=9}$ by Jyun-Jie Liao.

Our proof was written in order to optimize the constant ${C}$ as much as possible; similarly for the more detailed blueprint of the proof that was prepared in order to formalize the result in Lean. I have been asked a few times whether it is possible to present a streamlined and more conceptual version of the proof in which one does not try to establish an explicit constant ${C}$ , but just to show that the result holds for some constant ${C}$ . This is what I will attempt to do in this post, though some of the more routine steps will be outsourced to the aforementioned blueprint.

The key concept here is that of the entropic Ruzsa distance ${d[X;Y]}$ between two random variables ${X,Y}$ taking values ${{\bf F}_2^n}$ , defined as

$\displaystyle d[X;Y] := {\mathbf H}[X'+Y'] - \frac{1}{2} {\mathbf H}[X] - \frac{1}{2} {\mathbf H}[Y]$

where ${X',Y'}$ are independent copies of ${X,Y}$ , and ${{\mathbf H}[X]}$ denotes the Shannon entropy of ${X}$ . This distance is symmetric and non-negative, and obeys the triangle inequality

$\displaystyle d[X;Z] \leq d[X;Y] + d[Y;Z]$

for any random variables ${X,Y,Z}$ ; see the blueprint for a proof. The above theorem then follows from an entropic analogue:

Theorem 2 (Entropic Marton’s conjecture) Let ${X}$ be a ${{\bf F}_2^n}$ -valued random variable with ${d[X;X] \leq \log K}$ . Then there exists a uniform random variable ${U_H}$ on a subgroup ${H}$ of ${{\bf F}_2^n}$ such that ${d[X; U_H] \leq C \log K}$ for some absolute constant ${C}$ .

We were able to establish Theorem 2 with ${C=11}$ , which implies Theorem 1 with ${C=12}$ by fairly standard additive combinatorics manipulations (such as the Ruzsa covering lemma); see the blueprint for details.

The key proposition needed to establish Theorem 2 is the following distance decrement property:

Proposition 3 (Distance decrement) If ${X,Y}$ are ${{\bf F}_2^n}$ -valued random variables, then one can find ${{\bf F}_2^n}$ -valued random variables ${X',Y'}$ such that
$\displaystyle d[X';Y'] \leq (1-\eta) d[X;Y]$
and
$\displaystyle d[X;X'], d[Y,Y'] \leq C d[X;Y]$
for some absolute constants ${C, \eta > 0}$ .

Indeed, suppose this proposition held. Starting with ${X,Y}$ both equal to ${X}$ and iterating, one can then find sequences of random variables ${X_n, Y_n}$ with ${X_0=Y_0=X}$ ,

$\displaystyle d[X_n;Y_n] \leq (1-\eta)^n d[X;X],$

and

$\displaystyle d[X_{n+1};X_n], d[Y_{n+1};Y_n] \leq C (1-\eta)^n d[X;X].$

In particular, from the triangle inequality and geometric series

$\displaystyle d[X_n;X], d[Y_n;X] \leq \frac{C}{\eta} d[X;X].$

By weak compactness, some subsequence of the ${X_n}$ , ${Y_n}$ converge to some limiting random variables ${X_\infty, Y_\infty}$ , and by some simple continuity properties of entropic Ruzsa distance, we conclude that

$\displaystyle d[X_\infty;Y_\infty] = 0$

and

$\displaystyle d[X_\infty;X], d[Y_\infty;X] \leq \frac{C}{\eta} d[X;X].$

Theorem 2 then follows from the “100% inverse theorem” for entropic Ruzsa distance; see the blueprint for details.

To prove Proposition 3, we can reformulate it as follows:

Proposition 4 (Lack of distance decrement implies vanishing) If ${X,Y}$ are ${{\bf F}_2^n}$ -valued random variables, with the property that
$\displaystyle d[X';Y'] > d[X;Y] - \eta ( d[X;Y] + d[X';X] + d[Y',Y] ) \ \ \ \ \ (1)$
for all ${{\bf F}_2^n}$ -valued random variables ${X',Y'}$ and some sufficiently small absolute constant ${\eta > 0}$ , then one can derive a contradiction.

Indeed, we may assume from the above proposition that

$\displaystyle d[X';Y'] \leq d[X;Y] - \eta ( d[X; Y] + d[X';X] + d[Y',Y] )$

for some ${X',Y'}$ , which will imply Proposition 3 with ${C = 1/\eta}$ .

The entire game is now to use Shannon entropy inequalities and “entropic Ruzsa calculus” to deduce a contradiction from (1) for ${\eta}$ small enough. This we will do below the fold, but before doing so, let us first make some adjustments to (1) that will make it more useful for our purposes. Firstly, because conditional entropic Ruzsa distance (see blueprint for definitions) is an average of unconditional entropic Ruzsa distance, we can automatically upgrade (1) to the conditional version

$\displaystyle d[X'|Z;Y'|W] \geq d[X;Y] - \eta ( d[X;Y] + d[X'|Z;X] + d[Y'|W;Y] )$

for any random variables ${Z,W}$ that are possibly coupled with ${X',Y'}$ respectively. In particular, if we define a “relevant” random variable ${X'}$ (conditioned with respect to some auxiliary data ${Z}$ ) to be a random variable for which

$\displaystyle d[X'|Z;X] = O( d[X;Y] )$

or equivalently (by the triangle inequality)

$\displaystyle d[X'|Z;Y] = O( d[X;Y] )$

then we have the useful lower bound

$\displaystyle d[X'|Z;Y'|W] \geq (1-O(\eta)) d[X;Y] \ \ \ \ \ (2)$

whenever ${X'}$ and ${Y'}$ are relevant conditioning on ${Z, W}$ respectively. This is quite a useful bound, since the laws of “entropic Ruzsa calculus” will tell us, roughly speaking, that virtually any random variable that we can create from taking various sums of copies of ${X,Y}$ and conditioning against other sums, will be relevant. (Informally: the space of relevant random variables is ${(1-O(\eta))d[X;Y]}$ -separated with respect to the entropic Ruzsa distance.)

— 1. Main argument —

Now we derive more and more consequences of (2) – at some point crucially using the hypothesis that we are in characteristic two – before we reach a contradiction.

Right now, our hypothesis (2) only supplies lower bounds on entropic distances. The crucial ingredient that allows us to proceed is what we call the fibring identity, which lets us convert these lower bounds into useful upper bounds as well, which in fact match up very nicely when ${\eta}$ is small. Informally, the fibring identity captures the intuitive fact that the doubling constant of a set ${A}$ should be at least as large as the doubling constant of the image ${\pi(A)}$ of that set under a homomorphism, times the doubling constant of a typical fiber ${A \cap \pi^{-1}(\{z\})}$ of that homomorphism; and furthermore, one should only be close to equality if the fibers “line up” in some sense.

Here is the fibring identity:

Proposition 5 (Fibring identity) Let ${\pi: G \rightarrow H}$ be a homomorphism. Then for any independent ${G}$ -valued random variables ${X, Y}$ , one has
$\displaystyle d[X;Y] = d[\pi(X); \pi(Y)] + d[X|\pi(X); Y|\pi(Y)]$

$\displaystyle + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].$

The proof is of course in the blueprint, but given that it is a central pillar of the argument, I reproduce it here.

Proof: Expanding out the definition of Ruzsa distance, and using the conditional entropy chain rule

$\displaystyle {\mathbf H}[X] = {\mathbf H}[\pi(X)] + {\mathbf H}[X|\pi(X)]$

and

$\displaystyle {\mathbf H}[Y] = {\mathbf H}[\pi(Y)] + {\mathbf H}[Y|\pi(Y)],$

it suffices to establish the identity

$\displaystyle {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X), \pi(Y)]$

$\displaystyle + I[X-Y : \pi(X),\pi(Y) | \pi(X)-(Y) ].$

But from the chain rule again we have

$\displaystyle {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X)-\pi(Y)]$

and from the definition of conditional mutual information (using the fact that ${\pi(X)-\pi(Y)}$ is determined both by ${X-Y}$ and by ${(\pi(X),\pi(Y))}$ ) one has

$\displaystyle {\mathbf H}[X - Y|\pi(X)-\pi(Y)] = {\mathbf H}[X - Y|\pi(X), \pi(Y)]$

$\displaystyle + I[X-Y : \pi(X),\pi(Y) | \pi(X)-(Y) ]$

giving the claim. $\Box$

We will only care about the characteristic ${2}$ setting here, so we will now assume that all groups involved are ${2}$ -torsion, so that we can replace all subtractions with additions. If we specialize the fibring identity to the case where ${G = {\bf F}_2^n \times {\bf F}_2^n}$ , ${H = {\bf F}_2^n}$ , ${\pi: G \rightarrow H}$ is the addition map ${\pi(x,y) = x+y}$ , and ${X = (X_1, X_2)}$ , ${Y = (Y_1, Y_2)}$ are pairs of independent random variables in ${{\bf F}_2^n}$ , we obtain the following corollary:

Corollary 6 Let ${X_1,X_2,Y_1,Y_2}$ be independent ${{\bf F}_2^n}$ -valued random variables. Then we have the identity
$\displaystyle d[X_1;Y_1] + d[X_2;Y_2] = d[X_1+X_2;Y_1+Y_2]$

$\displaystyle + d[X_1|X_1+X_2;Y_1|Y_1+Y_2]$

$\displaystyle + I[(X_1+Y_1, X_2+Y_2) : (X_1+X_2,Y_1+Y_2) | X_1+X_2+Y_1+Y_2 ].$

This is a useful and flexible identity, especially when combined with (2). For instance, we can discard the conditional mutual information term as being non-negative, to obtain the inequality

$\displaystyle d[X_1;Y_1] + d[X_2;Y_2] \geq d[X_1+X_2;Y_1+Y_2]$

$\displaystyle + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].$

If we let ${X_1, Y_1, X_2, Y_2}$ be independent copies of ${X, Y, Y, X}$ respectively (note the swap in the last two variables!) we obtain

$\displaystyle 2 d[X;Y] \geq d[X+Y;X+Y] + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].$

From entropic Ruzsa calculus, one can check that ${X+Y}$ , ${X_1|X_1+X_2}$ , and ${Y_1|Y_1+Y_2}$ are all relevant random variables, so from (2) we now obtain both upper and lower bounds for ${d[X+Y;X+Y]}$ :

$\displaystyle d[X+Y; X+Y] = (1 + O(\eta)) d[X;Y].$

A pleasant upshot of this is that we now get to work in the symmetric case ${X=Y}$ without loss of generality. Indeed, if we set ${X^* := X+Y}$ , we now have from (2) that

$\displaystyle d[X'|Z; Y'|W] \geq (1-O(\eta)) d[X^*;X^*] \ \ \ \ \ (3)$

whenever ${X'|Z, Y'|W}$ are relevant, which by entropic Ruzsa calculus is equivalent to asking that

$\displaystyle d[X'|Z; X^*], d[Y'|W; X^*] = O(d[X^*;X^*]).$

Now we use the fibring identity again, relabeling ${Y_1,Y_2}$ as ${X_3,X_4}$ and requiring ${X_1,X_2,X_3,X_4}$ to be independent copies of ${X^*}$ . We conclude that

$\displaystyle 2d[X^*; X^*] = d[X_1+X_2;X_3+Y_4] + d[X_1|X_1+X_2;X_3|X_1+X_4]$

$\displaystyle + I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ].$

As before, the random variables ${X_1+X_2}$ , ${X_3+X_4}$ , ${X_1|X_1+X_2}$ , ${X_3|X_3+X_4}$ are all relevant, so from (3) we have

$\displaystyle d[X_1+X_2;X_3+X_4], d[X_1|X_1+X_2;X_3|X_1+X_4]$

$\displaystyle \geq (1-O(\eta)) d[X^*;X^*].$

We could now also match these lower bounds with upper bounds, but the more important takeaway from this analysis is a really good bound on the conditional mutual information:

$\displaystyle I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ]$

$\displaystyle = O(\eta) d[X^*;X^*].$

By the data processing inequality, we can discard some of the randomness here, and conclude

$\displaystyle I[X_1+X_3 : X_1+X_2 | X_1+X_2+X_3+X_4 ] = O(\eta) d[X^*;X^*].$

Let us introduce the random variables

$\displaystyle S := X_1+X_2+X_3+X_4; U := X_1+X_2; V = X_1 + X_3$

then we have

$\displaystyle I[U : V | S] = O(\eta) d[X^*;X^*].$

Intuitively, this means that ${U}$ and ${V}$ are very nearly independent given ${S}$ . For sake of argument, let us assume that they are actually independent; one can achieve something resembling this by invoking the entropic Balog-Szemerédi-Gowers theorem, established in the blueprint, after conceding some losses of ${O(\eta) d[X^*,X^*]}$ in the entropy, but we skip over the details for this blog post. The key point now is that because we are in characteristic ${2}$ , ${U+V}$ has the same form as ${U}$ or ${V}$ :

$\displaystyle U + V = X_2 + X_3.$

In particular, by permutation symmetry, we have

$\displaystyle {\mathbf H}[U+V|S] ={\mathbf H}[U|S] ={\mathbf H}[V|S],$

and so by the definition of conditional Ruzsa distance we have a massive distance decrement

$\displaystyle {\bf E}_s d[U|S=s; V|S=s] = 0,$

(where ${s}$ is drawn from the distribution of ${S}$ ), contradicting (1) as desired. (In reality, we end up decreasing the distance not all the way to zero, but instead to ${O(\eta d[X^*,X^*])}$ due to losses in the Balog-Szemerédi-Gowers theorem, but this is still enough to reach a contradiction. The quantity ${{\bf E}_s d[U|S=s; V|S=s]}$ is very similar to ${d[U|S; V|S]}$ , but is slightly different; the latter quantity is ${{\bf E}_{s,s'}d[U|S=s; V|S=s']}$ .)

Remark 7 A similar argument works in the ${m}$ -torsion case for general ${m}$ . Instead of decrementing the entropic Ruzsa distance, one instead decrements a “multidistance”
$\displaystyle {\mathbf H}[X_1 + \dots + X_m] - \frac{1}{m} ({\mathbf H}[X_1] + \dots + {\mathbf H}[X_m])$
for independent ${X_1,\dots,X_m}$ . By an iterated version of the fibring identity, one can first reduce again to the symmetric case where the random variables are all copies of the same variable ${X^*}$ . If one then takes ${X_{i,j}}$ , ${i,j=1,\dots,m}$ to be an array of ${m^2}$ copies of ${X^*}$ , one can get to the point where the row sums ${\sum_i X_{i,j}}$ and the column sums ${\sum_j X_{i,j}}$ have small conditional mutual information with respect to the double sum ${S := \sum_i \sum_j X_{i,j}}$ . If we then set ${U := \sum_i \sum_j j X_{i,j}}$ and ${V := \sum_i \sum_j i X_{i,j}}$ , the data processing inequality again shows that ${U}$ and ${V}$ are nearly independent given ${S}$ . The ${m}$ -torsion now crucially intervenes as before to ensure that ${U+V = \sum_i \sum_j (i+j) X_{i,j}}$ has the same form as ${U}$ or ${V}$ , leading to a contradiction as before. See this previous blog post for more discussion.

Marton’s conjecture in abelian groups with bounded torsion

4 April, 2024 in math.CO, paper | Tags: additive combinatorics, Ben Green, Freddie Manners, Polynomial Freiman-Ruzsa conjecture, Shannon entropy, Timothy Gowers | by Terence Tao | 14 comments

Theorem 1 (Marton’s conjecture) Let ${A \subset {\bf F}_2^n}$ be non-empty with ${|A+A| \leq K|A|}$ . Then there exists a subgroup ${H}$ of ${{\bf F}_2^n}$ with ${|H| \leq |A|}$ such that ${A}$ is covered by at most ${2K^C}$ translates of ${H}$ , for some absolute constant ${C}$ .

We established this result with ${C=12}$ , although it has since been improved to ${C=9}$ by Jyun-Jie Liao.
Our proof was written in order to optimize the constant ${C}$ as much as possible; similarly for the more detailed blueprint of the proof that was prepared in order to formalize the result in Lean. I have been asked a few times whether it is possible to present a streamlined and more conceptual version of the proof in which one does not try to establish an explicit constant ${C}$ , but just to show that the result holds for some constant ${C}$ . This is what I will attempt to do in this post, though some of the more routine steps will be outsourced to the aforementioned blueprint.
The key concept here is that of the entropic Ruzsa distance ${d[X;Y]}$ between two random variables ${X,Y}$ taking values ${{\bf F}_2^n}$ , defined as

$\displaystyle d[X;Y] := {\mathbf H}[X'+Y'] - \frac{1}{2} {\mathbf H}[X] - \frac{1}{2} {\mathbf H}[Y]$

where ${X',Y'}$ are independent copies of ${X,Y}$ , and ${{\mathbf H}[X]}$ denotes the Shannon entropy of ${X}$ . This distance is symmetric and non-negative, and obeys the triangle inequality

$\displaystyle d[X;Z] \leq d[X;Y] + d[Y;Z]$

for any random variables ${X,Y,Z}$ ; see the blueprint for a proof. The above theorem then follows from an entropic analogue:

Theorem 2 (Entropic Marton’s conjecture) Let ${X}$ be a ${{\bf F}_2^n}$ -valued random variable with ${d[X;X] \leq \log K}$ . Then there exists a uniform random variable ${U_H}$ on a subgroup ${H}$ of ${{\bf F}_2^n}$ such that ${d[X; U_H] \leq C \log K}$ for some absolute constant ${C}$ .

We were able to establish Theorem 2 with ${C=11}$ , which implies Theorem 1 with ${C=12}$ by fairly standard additive combinatorics manipulations; see the blueprint for details.
The key proposition needed to establish Theorem 2 is the following distance decrement property:

Proposition 3 (Distance decrement) If ${X,Y}$ are ${{\bf F}_2^n}$ -valued random variables, then one can find ${{\bf F}_2^n}$ -valued random variables ${X',Y'}$ such that

$\displaystyle d[X';Y'] \leq (1-\eta) d[X;Y]$

and

$\displaystyle d[X;X'], d[Y,Y'] \leq C d[X;Y]$

for some absolute constants ${C, \eta > 0}$ .

Indeed, suppose this proposition held. Starting with ${X,Y}$ both equal to ${X}$ and iterating, one can then find sequences of random variables ${X_n, Y_n}$ with ${X_0=Y_0=X}$ ,

$\displaystyle d[X_n;Y_n] \leq (1-\eta)^n d[X;X],$

and

$\displaystyle d[X_{n+1};X_n], d[Y_{n+1};Y_n] \leq C (1-\eta)^n d[X;X].$

In particular, from the triangle inequality and geometric series

$\displaystyle d[X_n;X], d[Y_n;X] \leq \frac{C}{\eta} d[X;X].$

$\displaystyle d[X_\infty;Y_\infty] = 0$

and

$\displaystyle d[X_\infty;X], d[Y_\infty;X] \leq \frac{C}{\eta} d[X;X].$

Theorem 2 then follows from the “100% inverse theorem” for entropic Ruzsa distance; see the blueprint for details.
To prove Proposition 3, we can reformulate it as follows:

Proposition 4 (Lack of distance decrement implies vanishing) If ${X,Y}$ are ${{\bf F}_2^n}$ -valued random variables, with the property that

$\displaystyle d[X';Y'] > d[X;Y] - \eta ( d[X;Y] + d[X';X] + d[Y',Y] ) \ \ \ \ \ (1)$

for all ${{\bf F}_2^n}$ -valued random variables ${X',Y'}$ and some sufficiently small absolute constant ${\eta > 0}$ , then one can derive a contradiction.

Indeed, we may assume from the above proposition that

$\displaystyle d[X';Y'] \leq d[X;Y] - \eta ( d[X; Y] + d[X';X] + d[Y',Y] )$

for some ${X',Y'}$ , which will imply Proposition 3 with ${C = 1/\eta}$ .
The entire game is now to use Shannon entropy inequalities and “entropic Ruzsa calculus” to deduce a contradiction from (1) for ${\eta}$ small enough. This we will do below the fold, but before doing so, let us first make some adjustments to (1) that will make it more useful for our purposes. Firstly, because conditional entropic Ruzsa distance (see blueprint for definitions) is an average of unconditional entropic Ruzsa distance, we can automatically upgrade (1) to the conditional version

$\displaystyle d[X'|Z;Y'|W] \geq d[X;Y] - \eta ( d[X;Y] + d[X'|Z;X] + d[Y'|W;Y] )$

$\displaystyle d[X'|Z;X] = O( d[X;Y] )$

or equivalently (by the triangle inequality)

$\displaystyle d[X'|Z;Y] = O( d[X;Y] )$

then we have the useful lower bound

$\displaystyle d[X'|Z;Y'|W] \geq (1-O(\eta)) d[X;Y] \ \ \ \ \ (2)$

— 1. Main argument —

Now we derive more and more consequences of (2) – at some point crucially using the hypothesis that we are in characteristic two – before we reach a contradiction.
Right now, our hypothesis (2) only supplies lower bounds on entropic distances. The crucial ingredient that allows us to proceed is what we call the fibring identity, which lets us convert these lower bounds into useful upper bounds as well, which in fact match up very nicely when ${\eta}$ is small. Informally, the fibring identity captures the intuitive fact that the doubling constant of a set ${A}$ should be at least as large as the doubling constant of the image ${\pi(A)}$ of that set under a homomorphism, times the doubling constant of a typical fiber ${A \cap \pi^{-1}(\{z\})}$ of that homomorphism; and furthermore, one should only be close to equality if the fibers “line up” in some sense.
Here is the fibring identity:

Proposition 5 (Fibring identity) Let ${\pi: G \rightarrow H}$ be a homomorphism. Then for any independent ${G}$ -valued random variables ${X, Y}$ , one has

$\displaystyle d[X;Y] = d[\pi(X); \pi(Y)] + d[X|\pi(X); Y|\pi(Y)]$

$\displaystyle + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].$

The proof is of course in the blueprint, but given that it is a central pillar of the argumnt, I reproduce it here.
Proof: Expanding out the definition of Ruzsa distance, and using the conditional entropy chain rule

$\displaystyle {\mathbf H}[X] = {\mathbf H}[\pi(X)] + {\mathbf H}[X|\pi(X)]$

and

$\displaystyle {\mathbf H}[Y] = {\mathbf H}[\pi(Y)] + {\mathbf H}[Y|\pi(Y)],$

it suffices to establish the identity

$\displaystyle {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X), \pi(Y)]$

$\displaystyle + I[X-Y : \pi(X),\pi(Y) | \pi(X)-(Y) ].$

But from the chain rule again we have

$\displaystyle {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X)-\pi(Y)]$

and from the definition of conditional mutual information (using the fact that ${\pi(X)-\pi(Y)}$ is determined both by ${X-Y}$ and by ${(\pi(X),\pi(Y))}$ ) one has

$\displaystyle {\mathbf H}[X - Y|\pi(X)-\pi(Y)] = {\mathbf H}[X - Y|\pi(X), \pi(Y)]$

$\displaystyle + I[X-Y : \pi(X),\pi(Y) | \pi(X)-(Y) ]$

giving the claim. $\Box$
We will only care about the characteristic ${2}$ setting here, so we will now assume that all groups involved are ${2}$ -torsion, so that we can replace all subtractions with additions. If we specialize the fibring identity to the case where ${G = {\bf F}_2^n \times {\bf F}_2^n}$ , ${H = {\bf F}_2^n}$ , ${\pi: G \rightarrow H}$ is the addition map ${\pi(x,y) = x+y}$ , and ${X = (X_1, X_2)}$ , ${Y = (Y_1, Y_2)}$ are pairs of independent random variables in ${{\bf F}_2^n}$ , we obtain the following corollary:

Corollary 6 Let ${X_1,X_2,Y_1,Y_2}$ be independent ${{\bf F}_2^n}$ -valued random variables. Then we have the identity

$\displaystyle d[X_1;Y_1] + d[X_2;Y_2] = d[X_1+X_2;Y_1+Y_2]$

$\displaystyle + d[X_1|X_1+X_2;Y_1|Y_1+Y_2]$

$\displaystyle + I[(X_1+Y_1, X_2+Y_2) : (X_1+X_2,Y_1+Y_2) | X_1+X_2+Y_1+Y_2 ].$

This is a useful and flexible identity, especially when combined with (2). For instance, we can discard the conditional mutual information term as being non-negative, to obtain the inequality

$\displaystyle d[X_1;Y_1] + d[X_2;Y_2] \geq d[X_1+X_2;Y_1+Y_2]$

$\displaystyle + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].$

If we let ${X_1, Y_1, X_2, Y_2}$ be independent copies of ${X, Y, Y, X}$ respectively (note the swap in the last two variables!) we obtain

$\displaystyle 2 d[X;Y] \geq d[X+Y;X+Y] + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].$

$\displaystyle d[X+Y; X+Y] = (1 + O(\eta)) d[X;Y].$

A pleasant upshot of this is that we now get to work in the symmetric case ${X=Y}$ without loss of generality. Indeed, if we set ${X^* := X+Y}$ , we now have from (2) that

$\displaystyle d[X'|Z; Y'|W] \geq (1-O(\eta)) d[X^*;X^*] \ \ \ \ \ (3)$

whenever ${X'|Z, Y'|W}$ are relevant, which by entropic Ruzsa calculus is equivalent to asking that

$\displaystyle d[X'|Z; X^*], d[Y'|W; X^*] = O(d[X^*;X^*]).$

Now we use the fibring identity again, relabeling ${Y_1,Y_2}$ as ${X_3,X_4}$ and requiring ${X_1,X_2,X_3,X_4}$ to be independent copies of ${X^*}$ . We conclude that

$\displaystyle 2d[X^*; X^*] = d[X_1+X_2;X_3+Y_4] + d[X_1|X_1+X_2;X_3|X_1+X_4]$

$\displaystyle + I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ].$

As before, the random variables ${X_1+X_2}$ , ${X_3+X_4}$ , ${X_1|X_1+X_2}$ , ${X_3|X_3+X_4}$ are all relevant, so from (3) we have

$\displaystyle d[X_1+X_2;X_3+X_4], d[X_1|X_1+X_2;X_3|X_1+X_4]$

$\displaystyle \geq (1-O(\eta)) d[X^*;X^*].$

We could now also match these lower bounds with upper bounds, but the more important takeaway from this analysis is a really good bound on the conditional mutual information:

$\displaystyle I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ]$

$\displaystyle = O(\eta) d[X^*;X^*].$

By the data processing inequality, we can discard some of the randomness here, and conclude

$\displaystyle I[X_1+X_3 : X_1+X_2 | X_1+X_2+X_3+X_4 ] = O(\eta) d[X^*;X^*].$

Let us introduce the random variables

$\displaystyle Z := X_1+X_2+X_3+X_4; U := X_1+X_2; V = X_1 + X_3$

then we have

$\displaystyle I[U : V | Z] = O(\eta) d[X^*;X^*].$

Intuitively, this means that ${U}$ and ${V}$ are very nearly independent given ${Z}$ . For sake of argument, let us assume that they are actually independent; one can achieve something resembling this by invoking the entropic Balog-Szemerédi-Gowers theorem, established in the blueprint, after conceding some losses of ${O(\eta) d[X^*,X^*]}$ in the entropy, but we skip over the details for this blog post. The key point now is that because we are in characteristic ${2}$ , ${U+V}$ has the same form as ${U}$ or ${V}$ :

$\displaystyle U + V = X_2 + X_3.$

In particular, by permutation symmetry, we have

$\displaystyle {\mathbf H}[U+V|S] ={\mathbf H}[U|S] ={\mathbf H}[V|S],$

and so by the definition of conditional Ruzsa distance we have a massive distance decrement

$\displaystyle d[U|S; V|S] = 0,$

contradicting (1) as desired. (In reality, we end up decreasing the distance not all the way to zero, but instead to ${O(\eta d[X^*,X^*])}$ due to losses in the Balog-Szemerédi-Gowers theorem, but this is still enough to reach a contradiction.)

Remark 7 A similar argument works in the ${m}$ -torsion case for general ${m}$ . Instead of decrementing the entropic Ruzsa distance, one instead decrements a “multidistance”

$\displaystyle {\mathbf H}[X_1 + \dots + X_m] - \frac{1}{m} ({\mathbf H}[X_1] + \dots + {\mathbf H}[X_m])$

for independent ${X_1,\dots,X_m}$ . By an iterated version of the fibring identity, one can first reduce again to the symmetric case where the random variables are all copies of the same variable ${X^*}$ . If one then takes ${X_{i,j}}$ , ${i,j=1,\dots,m}$ to be an array of ${m^2}$ copies of ${X^*}$ , one can get to the point where the row sums ${\sum_i X_{i,j}}$ and the column sums ${\sum_j X_{i,j}}$ have small conditional mutual information with respect to the double sum ${S := \sum_i \sum_j X_{i,j}}$ . If we then set ${U := \sum_i \sum_j j X_{i,j}}$ and ${V := \sum_i \sum_j i X_{i,j}}$ , the data processing inequality again shows that ${U}$ and ${V}$ are nearly independent given ${S}$ . The ${m}$ -torsion now crucially intervenes as before to ensure that ${U+V = \sum_i \sum_j (i+j) X_{i,j}}$ has the same form as ${U}$ or ${V}$ , leading to a contradiction as before. See this previous blog post for more discussion.

A generalized Cauchy-Schwarz inequality via the Gibbs variational formula

10 December, 2023 in expository, math.CA, math.PR | Tags: Anthony Carbery, Cauchy-Schwarz, Gibbs variational formula, Shannon entropy | by Terence Tao | 30 comments

Let ${S}$ be a non-empty finite set. If ${X}$ is a random variable taking values in ${S}$ , the Shannon entropy ${H[X]}$ of ${X}$ is defined as

$\displaystyle H[X] = -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].$

There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let ${f: S \rightarrow {\bf R}}$ be a function. Then
$\displaystyle \log \sum_{s \in S} \exp(f(s)) = \sup_X {\bf E} f(X) + {\bf H}[X]. \ \ \ \ \ (1)$

Proof: Note that shifting ${f}$ by a constant affects both sides of (1) the same way, so we may normalize ${\sum_{s \in S} \exp(f(s)) = 1}$ . Then ${\exp(f(s))}$ is now the probability distribution of some random variable ${Y}$ , and the inequality can be rewritten as

$\displaystyle 0 = \sup_X \sum_{s \in S} {\bf P}[X = s] \log {\bf P}[Y = s] -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].$

But this is precisely the Gibbs inequality. (The expression inside the supremum can also be written as ${-D_{KL}(X||Y)}$ , where ${D_{KL}}$ denotes Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions ${x \mapsto e^x}$ and ${y \mapsto y \log y - y}$ .) $\Box$

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.

Theorem 2 (Generalized Cauchy-Schwarz inequality) Let ${n \geq 0}$ , let ${S, T_1,\dots,T_n}$ be finite non-empty sets, and let ${\pi_i: S \rightarrow T_i}$ be functions for each ${i=1,\dots,n}$ . Let ${K: S \rightarrow {\bf R}^+}$ and ${f_i: T_i \rightarrow {\bf R}^+}$ be positive functions for each ${i=1,\dots,n}$ . Then
$\displaystyle \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n (\sum_{t_i \in T_i} f_i(t_i)^{n+1})^{1/(n+1)}$
where ${Q}$ is the quantity
$\displaystyle Q := (\sum_{(s_0,\dots,s_n) \in \Omega_n} K(s_0) \dots K(s_n))^{1/(n+1)}$
where ${\Omega_n}$ is the set of all tuples ${(s_0,\dots,s_n) \in S^{n+1}}$ such that ${\pi_i(s_{i-1}) = \pi_i(s_i)}$ for ${i=1,\dots,n}$ .

Thus for instance, the identity is trivial for ${n=0}$ . When ${n=1}$ , the inequality reads

$\displaystyle \sum_{s \in S} K(s) f_1(\pi_1(s)) \leq (\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}$

$\displaystyle ( \sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2},$

which is easily proven by Cauchy-Schwarz, while for ${n=2}$ the inequality reads

$\displaystyle \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s))$

$\displaystyle \leq (\sum_{s_0,s_1, s_2 \in S: \pi_1(s_0)=\pi_1(s_1); \pi_2(s_1)=\pi_2(s_2)} K(s_0) K(s_1) K(s_2))^{1/3}$

$\displaystyle (\sum_{t_1 \in T_1} f_1(t_1)^3)^{1/3} (\sum_{t_2 \in T_2} f_2(t_2)^3)^{1/3}$

which can also be proven by elementary means. However even for ${n=3}$ , the existing proofs require the “tensor power trick” in order to reduce to the case when the ${f_i}$ are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).

We now prove this inequality. We write ${K(s) = \exp(k(s))}$ and ${f_i(t_i) = \exp(g_i(t_i))}$ for some functions ${k: S \rightarrow {\bf R}}$ and ${g_i: T_i \rightarrow {\bf R}}$ . If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

$\displaystyle \sup_X {\bf E} k(X) + \sum_{i=1}^n g_i(\pi_i(X)) + {\bf H}[X]$

$\displaystyle \leq \frac{1}{n+1} \sup_{(X_0,\dots,X_n)} {\bf E} k(X_0)+\dots+k(X_n) + {\bf H}[X_0,\dots,X_n]$

$\displaystyle + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} (n+1) {\bf E} g_i(Y_i) + {\bf H}[Y_i]$

where ${X}$ ranges over random variables taking values in ${S}$ , ${X_0,\dots,X_n}$ range over tuples of random variables taking values in ${\Omega_n}$ , and ${Y_i}$ range over random variables taking values in ${T_i}$ . Comparing the suprema, the claim now reduces to

Lemma 3 (Conditional expectation computation) Let ${X}$ be an ${S}$ -valued random variable. Then there exists a ${\Omega_n}$ -valued random variable ${(X_0,\dots,X_n)}$ , where each ${X_i}$ has the same distribution as ${X}$ , and
$\displaystyle {\bf H}[X_0,\dots,X_n] = (n+1) {\bf H}[X]$

$\displaystyle - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_n(X)].$

Proof: We induct on ${n}$ . When ${n=0}$ we just take ${X_0 = X}$ . Now suppose that ${n \geq 1}$ , and the claim has already been proven for ${n-1}$ , thus one has already obtained a tuple ${(X_0,\dots,X_{n-1}) \in \Omega_{n-1}}$ with each ${X_0,\dots,X_{n-1}}$ having the same distribution as ${X}$ , and

$\displaystyle {\bf H}[X_0,\dots,X_{n-1}] = n {\bf H}[X] - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_{n-1}(X)].$

By hypothesis, ${\pi_n(X_{n-1})}$ has the same distribution as ${\pi_n(X)}$ . For each value ${t_n}$ attained by ${\pi_n(X)}$ , we can take conditionally independent copies of ${(X_0,\dots,X_{n-1})}$ and ${X}$ conditioned to the events ${\pi_n(X_{n-1}) = t_n}$ and ${\pi_n(X) = t_n}$ respectively, and then concatenate them to form a tuple ${(X_0,\dots,X_n)}$ in ${\Omega_n}$ , with ${X_n}$ a further copy of ${X}$ that is conditionally independent of ${(X_0,\dots,X_{n-1})}$ relative to ${\pi_n(X_{n-1}) = \pi_n(X)}$ . One can the use the entropy chain rule to compute

$\displaystyle {\bf H}[X_0,\dots,X_n] = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_n| \pi_n(X_n)]$

$\displaystyle = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_n)] + {\bf H}[X_n| \pi_n(X_n)]$

$\displaystyle = {\bf H}[\pi_n(X)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_{n-1})] + {\bf H}[X_n| \pi_n(X_n)]$

$\displaystyle = {\bf H}[\pi_n(X)] + ({\bf H}[X_0,\dots,X_{n-1}] - {\bf H}[\pi_n(X_{n-1})])$

$\displaystyle + ({\bf H}[X_n] - {\bf H}[\pi_n(X_n)])$

$\displaystyle ={\bf H}[X_0,\dots,X_{n-1}] + {\bf H}[X_n] - {\bf H}[\pi_n(X_n)]$

and the claim now follows from the induction hypothesis. $\Box$

With a little more effort, one can replace ${S}$ by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.

On a conjecture of Marton

13 November, 2023 in math.CO, paper | Tags: additive combinatorics, Ben Green, Freddie Manners, Freiman's theorem, Polynomial Freiman-Ruzsa conjecture, Shannon entropy | by Terence Tao | 34 comments

Tim Gowers, Ben Green, Freddie Manners, and I have just uploaded to the arXiv our paper “On a conjecture of Marton“. This paper establishes a version of the notorious Polynomial Freiman–Ruzsa conjecture (first proposed by Katalin Marton):

Theorem 1 (Polynomial Freiman–Ruzsa conjecture) Let ${A \subset {\bf F}_2^n}$ be such that ${|A+A| \leq K|A|}$ . Then ${A}$ can be covered by at most ${2K^{12}}$ translates of a subspace ${H}$ of ${{\bf F}_2^n}$ of cardinality at most ${A}$ .

The previous best known result towards this conjecture was by Konyagin (as communicated in this paper of Sanders), who obtained a similar result but with ${2K^{12}}$ replaced by ${\exp(O_\varepsilon(\log^{3+\varepsilon} K))}$ for any ${\varepsilon>0}$ (assuming that say ${K \geq 3/2}$ to avoid some degeneracies as ${K}$ approaches ${1}$ , which is not the difficult case of the conjecture). The conjecture (with ${12}$ replaced by an unspecified constant ${C}$ ) has a number of equivalent forms; see this survey of Green, and these papers of Lovett and of Green and myself for some examples; in particular, as discussed in the latter two references, the constants in the inverse ${U^3({\bf F}_2^n)}$ theorem are now polynomial in nature (although we did not try to optimize the constant).

The exponent ${12}$ here was the product of a large number of optimizations to the argument (our original exponent here was closer to ${1000}$ ), but can be improved even further with additional effort (our current argument, for instance, allows one to replace it with ${7+\sqrt{17} = 11.123\dots}$ , but we decided to state our result using integer exponents instead).

In this paper we will focus exclusively on the characteristic ${2}$ case (so we will be cavalier in identifying addition and subtraction), but in a followup paper we will establish similar results in other finite characteristics.

Much of the prior progress on this sort of result has proceeded via Fourier analysis. Perhaps surprisingly, our approach uses no Fourier analysis whatsoever, being conducted instead entirely in “physical space”. Broadly speaking, it follows a natural strategy, which is to induct on the doubling constant ${|A+A|/|A|}$ . Indeed, suppose for instance that one could show that every set ${A}$ of doubling constant ${K}$ was “commensurate” in some sense to a set ${A'}$ of doubling constant at most ${K^{0.99}}$ . One measure of commensurability, for instance, might be the Ruzsa distance ${\log \frac{|A+A'|}{|A|^{1/2} |A'|^{1/2}}}$ , which one might hope to control by ${O(\log K)}$ . Then one could iterate this procedure until doubling constant dropped below say ${3/2}$ , at which point the conjecture is known to hold (there is an elementary argument that if ${A}$ has doubling constant less than ${3/2}$ , then ${A+A}$ is in fact a subspace of ${{\bf F}_2^n}$ ). One can then use several applications of the Ruzsa triangle inequality

$\displaystyle \log \frac{|A+C|}{|A|^{1/2} |C|^{1/2}} \leq \log \frac{|A+B|}{|A|^{1/2} |B|^{1/2}} + \log \frac{|B+C|}{|B|^{1/2} |C|^{1/2}}$

to conclude (the fact that we reduce ${K}$ to ${K^{0.99}}$ means that the various Ruzsa distances that need to be summed are controlled by a convergent geometric series).

There are a number of possible ways to try to “improve” a set ${A}$ of not too large doubling by replacing it with a commensurate set of better doubling. We note two particular potential improvements:

(i) Replacing ${A}$ with ${A+A}$ . For instance, if ${A}$ was a random subset (of density ${1/K}$ ) of a large subspace ${H}$ of ${{\bf F}_2^n}$ , then replacing ${A}$ with ${A+A}$ usually drops the doubling constant from ${K}$ down to nearly ${1}$ (under reasonable choices of parameters).
(ii) Replacing ${A}$ with ${A \cap (A+h)}$ for a “typical” ${h \in A+A}$ . For instance, if ${A}$ was the union of ${K}$ random cosets of a subspace ${H}$ of large codimension, then replacing ${A}$ with ${A \cap (A+h)}$ again usually drops the doubling constant from ${K}$ down to nearly ${1}$ .

Unfortunately, there are sets ${A}$ where neither of the above two operations (i), (ii) significantly improves the doubling constant. For instance, if ${A}$ is a random density ${1/\sqrt{K}}$ subset of ${\sqrt{K}}$ random translates of a medium-sized subspace ${H}$ , one can check that the doubling constant stays close to ${K}$ if one applies either operation (i) or operation (ii). But in this case these operations don’t actually worsen the doubling constant much either, and by applying some combination of (i) and (ii) (either intersecting ${A+A}$ with a translate, or taking a sumset of ${A \cap (A+h)}$ with itself) one can start lowering the doubling constant again.

This begins to suggest a potential strategy: show that at least one of the operations (i) or (ii) will improve the doubling constant, or at least not worsen it too much; and in the latter case, perform some more complicated operation to locate the desired doubling constant improvement.

A sign that this strategy might have a chance of working is provided by the following heuristic argument. If ${A}$ has doubling constant ${K}$ , then the Cartesian product ${A \times A}$ has doubling constant ${K^2}$ . On the other hand, by using the projection map ${\pi: {\bf F}_2^n \times {\bf F}_2^n \rightarrow {\bf F}_2^n}$ defined by ${\pi(x,y) := x+y}$ , we see that ${A \times A}$ projects to ${\pi(A \times A) = A+A}$ , with fibres ${\pi^{-1}(\{h\})}$ being essentially a copy of ${A \cap (A+h)}$ . So, morally, ${A \times A}$ also behaves like a “skew product” of ${A+A}$ and the fibres ${A \cap (A+h)}$ , which suggests (non-rigorously) that the doubling constant ${K^2}$ of ${A \times A}$ is also something like the doubling constant of ${A + A}$ , times the doubling constant of a typical fibre ${A \cap (A+h)}$ . This would imply that at least one of ${A +A}$ and ${A \cap (A+h)}$ would have doubling constant at most ${K}$ , and thus that at least one of operations (i), (ii) would not worsen the doubling constant.

Unfortunately, this argument does not seem to be easily made rigorous using the traditional doubling constant; even the significantly weaker statement that ${A+A}$ has doubling constant at most ${K^2}$ is false (see comments for more discussion). However, it turns out (as discussed in this recent paper of myself with Green and Manners) that things are much better. Here, the analogue of a subset ${A}$ in ${{\bf F}_2^n}$ is a random variable ${X}$ taking values in ${{\bf F}_2^n}$ , and the analogue of the (logarithmic) doubling constant ${\log \frac{|A+A|}{|A|}}$ is the entropic doubling constant ${d(X;X) := {\bf H}(X_1+X_2)-{\bf H}(X)}$ , where ${X_1,X_2}$ are independent copies of ${X}$ . If ${X}$ is a random variable in some additive group ${G}$ and ${\pi: G \rightarrow H}$ is a homomorphism, one then has what we call the fibring inequality

$\displaystyle d(X;X) \geq d(\pi(X);\pi(X)) + d(X|\pi(X); X|\pi(X)),$

where the conditional doubling constant ${d(X|\pi(X); X|\pi(X))}$ is defined as

$\displaystyle d(X|\pi(X); X|\pi(X)) = {\bf H}(X_1 + X_2 | \pi(X_1), \pi(X_2)) - {\bf H}( X | \pi(X) ).$

Indeed, from the chain rule for Shannon entropy one has

$\displaystyle {\bf H}(X) = {\bf H}(\pi(X)) + {\bf H}(X|\pi(X))$

and

$\displaystyle {\bf H}(X_1+X_2) = {\bf H}(\pi(X_1) + \pi(X_2)) + {\bf H}(X_1 + X_2|\pi(X_1) + \pi(X_2))$

while from the non-negativity of conditional mutual information one has

$\displaystyle {\bf H}(X_1 + X_2|\pi(X_1) + \pi(X_2)) \geq {\bf H}(X_1 + X_2|\pi(X_1), \pi(X_2))$

and it is an easy matter to combine all these identities and inequalities to obtain the claim.

Applying this inequality with ${X}$ replaced by two independent copies ${(X_1,X_2)}$ of itself, and using the addition map ${(x,y) \mapsto x+y}$ for ${\pi}$ , we obtain in particular that

$\displaystyle 2 d(X;X) \geq d(X_1+X_2; X_1+X_2) + d(X_1,X_2|X_1+X_2; X_1,X_2|X_1+X_2)$

or (since ${X_2}$ is determined by ${X_1}$ once one fixes ${X_1+X_2}$ )

$\displaystyle 2 d(X;X) \geq d(X_1+X_2; X_1+X_2) + d(X_1|X_1+X_2; X_1|X_1+X_2).$

So if ${d(X;X) = \log K}$ , then at least one of ${d(X_1+X_2; X_1+X_2)}$ or ${d(X_1|X_1+X_2; X_1|X_1+X_2)}$ will be less than or equal to ${\log K}$ . This is the entropy analogue of at least one of (i) or (ii) improving, or at least not degrading the doubling constant, although there are some minor technicalities involving how one deals with the conditioning to ${X_1+X_2}$ in the second term ${d(X_1|X_1+X_2; X_1|X_1+X_2)}$ that we will gloss over here (one can pigeonhole the instances of ${X_1}$ to different events ${X_1+X_2=x}$ , ${X_1+X_2=x'}$ , and “depolarise” the induction hypothesis to deal with distances ${d(X;Y)}$ between pairs of random variables ${X,Y}$ that do not necessarily have the same distribution). Furthermore, we can even calculate the defect in the above inequality: a careful inspection of the above argument eventually reveals that

$\displaystyle 2 d(X;X) = d(X_1+X_2; X_1+X_2) + d(X_1|X_1+X_2; X_1|X_1+X_2)$

$\displaystyle + {\bf I}( X_1 + X_2 : X_1 + X_3 | X_1 + X_2 + X_3 + X_4)$

where we now take four independent copies ${X_1,X_2,X_3,X_4}$ . This leads (modulo some technicalities) to the following interesting conclusion: if neither (i) nor (ii) leads to an improvement in the entropic doubling constant, then ${X_1+X_2}$ and ${X_2+X_3}$ are conditionally independent relative to ${X_1+X_2+X_3+X_4}$ . This situation (or an approximation to this situation) is what we refer to in the paper as the “endgame”.

A version of this endgame conclusion is in fact valid in any characteristic. But in characteristic ${2}$ , we can take advantage of the identity

$\displaystyle (X_1+X_2) + (X_2+X_3) = X_1 + X_3.$

Conditioning on ${X_1+X_2+X_3+X_4}$ , and using symmetry we now conclude that if we are in the endgame exactly (so that the mutual information is zero), then the independent sum of two copies of ${(X_1+X_2|X_1+X_2+X_3+X_4)}$ has exactly the same distribution; in particular, the entropic doubling constant here is zero, which is certainly a reduction in the doubling constant.

To deal with the situation where the conditional mutual information is small but not completely zero, we have to use an entropic version of the Balog-Szemeredi-Gowers lemma, but fortunately this was already worked out in an old paper of mine (although in order to optimise the final constant, we ended up using a slight variant of that lemma).

I am planning to formalize this paper in the Lean4 language. Further discussion of this project will take place on this Zulip stream, and the project itself will be held at this Github repository.

The sunflower lemma via Shannon entropy

20 July, 2020 in expository, math.CO, math.IT | Tags: Shannon entropy, sunflowers | by Terence Tao | 31 comments

A family ${A_1,\dots,A_r}$ of sets for some ${r \geq 1}$ is a sunflower if there is a core set ${A_0}$ contained in each of the ${A_i}$ such that the petal sets ${A_i \backslash A_0, i=1,\dots,r}$ are disjoint. If ${k,r \geq 1}$ , let ${\mathrm{Sun}(k,r)}$ denote the smallest natural number with the property that any family of ${\mathrm{Sun}(k,r)}$ distinct sets of cardinality at most ${k}$ contains ${r}$ distinct elements ${A_1,\dots,A_r}$ that form a sunflower. The celebrated Erdös-Rado theorem asserts that ${\mathrm{Sun}(k,r)}$ is finite; in fact Erdös and Rado gave the bounds

$\displaystyle (r-1)^k \leq \mathrm{Sun}(k,r) \leq (r-1)^k k! + 1. \ \ \ \ \ (1)$

The sunflower conjecture asserts in fact that the upper bound can be improved to ${\mathrm{Sun}(k,r) \leq O(1)^k r^k}$ . This remains open at present despite much effort (including a Polymath project); after a long series of improvements to the upper bound, the best general bound known currently is

$\displaystyle \mathrm{Sun}(k,r) \leq O( r \log(kr) )^k \ \ \ \ \ (2)$

for all ${k,r \geq 2}$ , established in 2019 by Rao (building upon a recent breakthrough a month previously of Alweiss, Lovett, Wu, and Zhang). Here we remove the easy cases ${k=1}$ or ${r=1}$ in order to make the logarithmic factor ${\log(kr)}$ a little cleaner.

Rao’s argument used the Shannon noiseless coding theorem. It turns out that the argument can be arranged in the very slightly different language of Shannon entropy, and I would like to present it here. The argument proceeds by locating the core and petals of the sunflower separately (this strategy is also followed in Alweiss-Lovett-Wu-Zhang). In both cases the following definition will be key. In this post all random variables, such as random sets, will be understood to be discrete random variables taking values in a finite range. We always use boldface symbols to denote random variables, and non-boldface for deterministic quantities.

Definition 1 (Spread set) Let ${R > 1}$ . A random set ${{\bf A}}$ is said to be ${R}$ -spread if one has
$\displaystyle {\mathbb P}( S \subset {\bf A}) \leq R^{-|S|}$
for all sets ${S}$ . A family ${(A_i)_{i \in I}}$ of sets is said to be ${R}$ -spread if ${I}$ is non-empty and the random variable ${A_{\bf i}}$ is ${R}$ -spread, where ${{\bf i}}$ is drawn uniformly from ${I}$ .

The core can then be selected greedily in such a way that the remainder of a family becomes spread:

Lemma 2 (Locating the core) Let ${(A_i)_{i \in I}}$ be a family of subsets of a finite set ${X}$ , each of cardinality at most ${k}$ , and let ${R > 1}$ . Then there exists a “core” set ${S_0}$ of cardinality at most ${k}$ such that the set
$\displaystyle J := \{ i \in I: S_0 \subset A_i \} \ \ \ \ \ (3)$
has cardinality at least ${R^{-|S_0|} |I|}$ , and such that the family ${(A_j \backslash S_0)_{j \in J}}$ is ${R}$ -spread. Furthermore, if ${|I| > R^k}$ and the ${A_i}$ are distinct, then ${|S_0| < k}$ .

Proof: We may assume ${I}$ is non-empty, as the claim is trivial otherwise. For any ${S \subset X}$ , define the quantity

$\displaystyle Q(S) := R^{|S|} |\{ i \in I: S \subset A_i\}|,$

and let ${S_0}$ be a subset of ${X}$ that maximizes ${Q(S_0)}$ . Since ${Q(\emptyset) = |I| > 0}$ and ${Q(S)=0}$ when ${|S| >k}$ , we see that ${0 \leq |S_0| \leq K}$ . If the ${A_i}$ are distinct and ${|I| > R^k}$ , then we also have ${Q(S) \leq R^k < |I| = Q(\emptyset)}$ when ${|S|=k}$ , thus in this case we have ${|S_0| < k}$ .

Let ${J}$ be the set (3). Since ${Q(S_0) \geq Q(\emptyset)>0}$ , ${J}$ is non-empty. It remains to check that the family ${(A_j \backslash S_0)_{j \in J}}$ is ${R}$ -spread. But for any ${S \subset X}$ and ${{\bf j}}$ drawn uniformly at random from ${J}$ one has

$\displaystyle {\mathbb P}( S \subset A_{\bf j} \backslash S_0 ) = \frac{|\{ i \in I: S_0 \cup S \subset A_i\}|}{|\{ i \in I: S_0 \subset A_i\}|} = R^{|S_0|-|S_0 \cup S|} \frac{Q(S)}{Q(S_0)}.$

Observe that ${Q(S) \leq Q(S_0)}$ , and the probability is only non-empty when ${S, S_0}$ are disjoint, so that ${|S_0|-|S_0 \cup S| = - |S|}$ . The claim follows. $\Box$

In view of the above lemma, the bound (2) will then follow from

Proposition 3 (Locating the petals) Let ${r, k \geq 2}$ be natural numbers, and suppose that ${R \geq C r \log(kr)}$ for a sufficiently large constant ${C}$ . Let ${(A_i)_{i \in I}}$ be a finite family of subsets of a finite set ${X}$ , each of cardinality at most ${k}$ which is ${R}$ -spread. Then there exist ${i_1,\dots,i_r \in I}$ such that ${A_{i_1},\dots,A_{i_r}}$ is disjoint.

Indeed, to prove (2), we assume that ${(A_i)_{i \in I}}$ is a family of sets of cardinality greater than ${R^k}$ for some ${R \geq Cr \log(kr)}$ ; by discarding redundant elements and sets we may assume that ${I}$ is finite and that all the ${A_i}$ are contained in a common finite set ${X}$ . Apply Lemma 2 to find a set ${S_0 \subset X}$ of cardinality ${|S_0| < k}$ such that the family ${(A_j \backslash S_0)_{j \in J}}$ is ${R}$ -spread. By Proposition 3 we can find ${j_1,\dots,j_r \in J}$ such that ${A_{j_1} \backslash S_0,\dots,A_{j_r} \backslash S_0}$ are disjoint; since these sets have cardinality ${k - |S_0| > 0}$ , this implies that the ${j_1,\dots,j_r}$ are distinct. Hence ${A_{j_1},\dots,A_{j_r}}$ form a sunflower as required.

Remark 4 Proposition 3 is easy to prove if we strengthen the condition on ${R}$ to ${R > k(r-1)}$ . In this case, we have ${\mathop{\bf P}_{i \in I}( x \in A_i) < 1/k(r-1)}$ for every ${x \in X}$ , hence by the union bound we see that for any ${i_1,\dots,i_j \in I}$ with ${j \leq r-1}$ there exists ${i_{j+1} \in I}$ such that ${A_{i_{j+1}}}$ is disjoint from the set ${A_{i_1} \cup \dots \cup A_{i_j}}$ , which has cardinality at most ${k(r-1)}$ . Iterating this, we obtain the conclusion of Proposition 3 in this case. This recovers a bound of the form ${\mathrm{Sun}(k,r) \leq (k(r-1))^k+1}$ , and by pursuing this idea a little further one can recover the original upper bound (1) of Erdös and Rado.

It remains to prove Proposition 3. In fact we can locate the petals one at a time, placing each petal inside a random set.

Proposition 5 (Locating a single petal) Let the notation and hypotheses be as in Proposition 3. Let ${{\bf V}}$ be a random subset of ${X}$ , such that each ${x \in X}$ lies in ${{\bf V}}$ with an independent probability of ${1/r}$ . Then with probability greater than ${1-1/r}$ , ${{\bf V}}$ contains one of the ${A_i}$ .

To see that Proposition 5 implies Proposition 3, we randomly partition ${X}$ into ${{\bf V}_1 \cup \dots \cup {\bf V}_r}$ by placing each ${x \in X}$ into one of the ${{\bf V}_j}$ , ${j=1,\dots,r}$ chosen uniformly and independently at random. By Proposition 5 and the union bound, we see that with positive probability, it is simultaneously true for all ${j=1,\dots,r}$ that each ${{\bf V}_j}$ contains one of the ${A_i}$ . Selecting one such ${A_i}$ for each ${{\bf V}_j}$ , we obtain the required disjoint petals.

We will prove Proposition 5 by gradually increasing the density of the random set and arranging the sets ${A_i}$ to get quickly absorbed by this random set. The key iteration step is

Proposition 6 (Refinement inequality) Let ${R > 1}$ and ${0 < \delta < 1}$ . Let ${{\bf A}}$ be a random subset of a finite set ${X}$ which is ${R}$ -spread, and let ${{\bf V}}$ be a random subset of ${X}$ independent of ${{\bf A}}$ , such that each ${x \in X}$ lies in ${{\bf V}}$ with an independent probability of ${\delta}$ . Then there exists another ${R}$ -spread random subset ${{\bf A}'}$ of ${X}$ whose support is contained in the support of ${{\bf A}}$ , such that ${{\bf A}' \backslash {\bf V} \subset {\bf A}}$ and
$\displaystyle {\mathbb E} |{\bf A}' \backslash {\bf V}| \leq \frac{5}{\log(R\delta)} {\mathbb E} |{\bf A}|.$

Note that a direct application of the first moment method gives only the bound

$\displaystyle {\mathbb E} |{\bf A} \backslash {\bf V}| \leq (1-\delta) {\mathbb E} |{\bf A}|,$

but the point is that by switching from ${{\bf A}}$ to an equivalent ${{\bf A}'}$ we can replace the ${1-\delta}$ factor by a quantity significantly smaller than ${1}$ .

One can iterate the above proposition, repeatedly replacing ${{\bf A}, X}$ with ${{\bf A}' \backslash {\bf V}, X \backslash {\bf V}}$ (noting that this preserves the ${R}$ -spread nature of ${{\bf A}}$ ) to conclude

Corollary 7 (Iterated refinement inequality) Let ${R > 1}$ , ${0 < \delta < 1}$ , and ${m \geq 1}$ . Let ${{\bf A}}$ be a random subset of a finite set ${X}$ which is ${R}$ -spread, and let ${{\bf V}}$ be a random subset of ${X}$ independent of ${{\bf A}}$ , such that each ${x \in X}$ lies in ${{\bf V}}$ with an independent probability of ${1-(1-\delta)^m}$ . Then there exists another random subset ${{\bf A}'}$ of ${X}$ with support contained in the support of ${{\bf A}}$ , such that
$\displaystyle {\mathbb E} |{\bf A}' \backslash {\bf V}| \leq (\frac{5}{\log(R\delta)})^m {\mathbb E} |{\bf A}|.$

Now we can prove Proposition 5. Let ${m}$ be chosen shortly. Applying Corollary 7 with ${{\bf A}}$ drawn uniformly at random from the ${(A_i)_{i \in I}}$ , and setting ${1-(1-\delta)^m = 1/r}$ , or equivalently ${\delta = 1 - (1 - 1/r)^{1/m}}$ , we have

$\displaystyle {\mathbb E} |{\bf A}' \backslash {\bf V}| \leq (\frac{5}{\log(R\delta)})^m k.$

In particular, if we set ${m = \lceil \log kr \rceil}$ , so that ${\delta \sim \frac{1}{r \log kr}}$ , then by choice of ${R}$ we have ${\frac{5}{\log(R\delta)} < \frac{1}{2}}$ , hence

$\displaystyle {\mathbb E} |{\bf A}' \backslash {\bf V}| < \frac{1}{r}.$

In particular with probability at least ${1 - \frac{1}{r}}$ , there must exist ${A_i}$ such that ${|A_i \backslash {\bf V}| = 0}$ , giving the proposition.

It remains to establish Proposition 6. This is the difficult step, and requires a clever way to find the variant ${{\bf A}'}$ of ${{\bf A}}$ that has better containment properties in ${{\bf V}}$ than ${{\bf A}}$ does. The main trick is to make a conditional copy ${({\bf A}', {\bf V}')}$ of ${({\bf A}, {\bf V})}$ that is conditionally independent of ${({\bf A}, {\bf V})}$ subject to the constraint ${{\bf A} \cup {\bf V} = {\bf A}' \cup {\bf V}'}$ . The point here is that this constrant implies the inclusions

$\displaystyle {\bf A}' \backslash {\bf V} \subset {\bf A} \cap {\bf A}' \subset {\bf A} \ \ \ \ \ (4)$

and

$\displaystyle {\bf A}' \backslash {\bf A} \subset {\bf V}. \ \ \ \ \ (5)$

Because of the ${R}$ -spread hypothesis, it is hard for ${{\bf A}}$ to contain any fixed large set. If we could apply this observation in the contrapositive to ${{\bf A} \cap {\bf A}'}$ we could hope to get a good upper bound on the size of ${{\bf A} \cap {\bf A}'}$ and hence on ${{\bf A} \backslash {\bf V}}$ thanks to (4). One can also hope to improve such an upper bound by also employing (5), since it is also hard for the random set ${{\bf V}}$ to contain a fixed large set. There are however difficulties with implementing this approach due to the fact that the random sets ${{\bf A} \cap {\bf A}', {\bf A}' \backslash {\bf A}}$ are coupled with ${{\bf A}, {\bf V}}$ in a moderately complicated fashion. In Rao’s argument a somewhat complicated encoding scheme was created to give information-theoretic control on these random variables; below the fold we accomplish a similar effect by using Shannon entropy inequalities in place of explicit encoding. A certain amount of information-theoretic sleight of hand is required to decouple certain random variables to the extent that the Shannon inequalities can be effectively applied. The argument bears some resemblance to the “entropy compression method” discussed in this previous blog post; there may be a way to more explicitly express the argument below in terms of that method. (There is also some kinship with the method of dependent random choice, which is used for instance to establish the Balog-Szemerédi-Gowers lemma, and was also translated into information theoretic language in these unpublished notes of Van Vu and myself.)

Read the rest of this entry »

254A, Notes 9 – second moment and entropy methods

12 November, 2019 in 254A - analytic prime number theory, math.IT, math.NT | Tags: correlation, entropy decrement argument, multiplicative functions, second moment method, Shannon entropy | by Terence Tao | 33 comments

In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.

The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

$\displaystyle \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)$

of the prime number theorem (where we take the limit ${x \rightarrow \infty}$ ). One can interpret this estimate probabilistically as

$\displaystyle {\mathbb E} \mu(\mathbf{n}) = o(1) \ \ \ \ \ (2)$

where ${\mathbf{n} = \mathbf{n}_{\leq x}}$ is a random variable drawn uniformly from the natural numbers up to ${x}$ , and ${{\mathbb E}}$ denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+1).$

After dividing by ${x}$ , this is essentially

$\displaystyle {\mathbb E} \mu(\mathbf{n}) \mu(\mathbf{n}+1).$

With probabilistic intuition, one may expect the random variables ${\mu(\mathbf{n}), \mu(\mathbf{n}+1)}$ to be approximately independent (there is no obvious relationship between the number of prime factors of ${\mathbf{n}}$ , and of ${\mathbf{n}+1}$ ), and so the above average would be expected to be approximately equal to

$\displaystyle ({\mathbb E} \mu(\mathbf{n})) ({\mathbb E} \mu(\mathbf{n}+1))$

which by (2) is equal to ${o(1)}$ . Thus we are led to the prediction

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+1) = o(x). \ \ \ \ \ (3)$

The asymptotic (3) is widely believed (it is a special case of the Chowla conjecture, which we will discuss in later notes; while there has been recent progress towards establishing it rigorously, it remains open for now.
How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “ ${L^2}$ ” measure of independence known as covariance, and the “ ${L \log L}$ ” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of ${L^2}$ type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of ${L \log L}$ methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as ${L^2}$ methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

$\displaystyle \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = o(\log x) \ \ \ \ \ (4)$

of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound ${\sum_{n \leq x} \frac{\mu(n)}{n} = o(\log x)}$ , but the latter bound is much easier to establish than the former).
As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them on average, rather than uniformly. For instance, given two random variables ${\mathbf{X}}$ and ${\mathbf{Y}}$ of number-theoretic origin (such as the random variables ${\mu(\mathbf{n})}$ and ${\mu(\mathbf{n}+1)}$ mentioned previously), it can often be extremely difficult to determine the extent to which ${\mathbf{X},\mathbf{Y}}$ behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if ${\mathbf{Y}_1,\dots,\mathbf{Y}_k}$ are a large collection of “independent” random variables, and ${\mathbf{X}}$ is a further random variable that is “not too large” in some sense, then ${\mathbf{X}}$ must necessarily be nearly independent (or conditionally independent) to many of the ${\mathbf{Y}_i}$ , even if one cannot pinpoint precisely which of the ${\mathbf{Y}_i}$ the variable ${\mathbf{X}}$ is independent with. In the case of the second moment method, this allows us to compute correlations such as ${{\mathbb E} {\mathbf X} \mathbf{Y}_i}$ for “most” ${i}$ . The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to say non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities ${{\mathbb E} F( {\mathbf X}, \mathbf{Y}_i )}$ for “most” ${i}$ thanks to tools such as the Pinsker inequality.

Read the rest of this entry »

Special cases of Shannon entropy

1 March, 2017 in expository, math.IT, math.NT | Tags: Liouville function, randomness, Shannon entropy | by Terence Tao | 29 comments

Given a random variable ${X}$ that takes on only finitely many values, we can define its Shannon entropy by the formula

$\displaystyle H(X) := \sum_x \mathbf{P}(X=x) \log \frac{1}{\mathbf{P}(X=x)}$

with the convention that ${0 \log \frac{1}{0} = 0}$ . (In some texts, one uses the logarithm to base ${2}$ rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables ${X,Y}$ taking on finitely many values, the joint variable ${(X,Y)}$ is also a random variable taking on finitely many values, and also has an entropy ${H(X,Y)}$ . It obeys the Shannon inequalities

$\displaystyle H(X), H(Y) \leq H(X,Y) \leq H(X) + H(Y)$

so we can define some further nonnegative quantities, the mutual information

$\displaystyle I(X:Y) := H(X) + H(Y) - H(X,Y)$

and the conditional entropies

$\displaystyle H(X|Y) := H(X,Y) - H(Y); \quad H(Y|X) := H(X,Y) - H(X).$

More generally, given three random variables ${X,Y,Z}$ , one can define the conditional mutual information

$\displaystyle I(X:Y|Z) := H(X|Z) + H(Y|Z) - H(X,Y|Z)$

and the final of the Shannon entropy inequalities asserts that this quantity is also non-negative.

The mutual information ${I(X:Y)}$ is a measure of the extent to which ${X}$ and ${Y}$ fail to be independent; indeed, it is not difficult to show that ${I(X:Y)}$ vanishes if and only if ${X}$ and ${Y}$ are independent. Similarly, ${I(X:Y|Z)}$ vanishes if and only if ${X}$ and ${Y}$ are conditionally independent relative to ${Z}$ . At the other extreme, ${H(X|Y)}$ is a measure of the extent to which ${X}$ fails to depend on ${Y}$ ; indeed, it is not difficult to show that ${H(X|Y)=0}$ if and only if ${X}$ is determined by ${Y}$ in the sense that there is a deterministic function ${f}$ such that ${X = f(Y)}$ . In a related vein, if ${X}$ and ${X'}$ are equivalent in the sense that there are deterministic functional relationships ${X = f(X')}$ , ${X' = g(X)}$ between the two variables, then ${X}$ is interchangeable with ${X'}$ for the purposes of computing the above quantities, thus for instance ${H(X) = H(X')}$ , ${H(X,Y) = H(X',Y)}$ , ${I(X:Y) = I(X':Y)}$ , ${I(X:Y|Z) = I(X':Y|Z)}$ , etc..

One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables ${X}$ being considered come from restricting a single random (and uniformly distributed) boolean function ${F: \Omega \rightarrow \{0,1\}}$ on a given finite domain ${\Omega}$ to some subset ${A}$ of ${\Omega}$ :

$\displaystyle X = F \downharpoonright_A.$

In this case, ${X}$ has the law of a random uniformly distributed boolean function from ${A}$ to ${\{0,1\}}$ , and the entropy here can be easily computed to be ${|A| \log 2}$ , where ${|A|}$ denotes the cardinality of ${A}$ . If ${X}$ is the restriction of ${F}$ to ${A}$ , and ${Y}$ is the restriction of ${F}$ to ${B}$ , then the joint variable ${(X,Y)}$ is equivalent to the restriction of ${F}$ to ${A \cup B}$ . If one discards the normalisation factor ${\log 2}$ , one then obtains the following dictionary between entropy and the combinatorics of finite sets:

Random variables ${X,Y,Z}$	Finite sets ${A,B,C}$
Entropy ${H(X)}$	Cardinality ${\|A\|}$
Joint variable ${(X,Y)}$	Union ${A \cup B}$
Mutual information ${I(X:Y)}$	Intersection cardinality ${\|A \cap B\|}$
Conditional entropy ${H(X\|Y)}$	Set difference cardinality ${\|A \backslash B\|}$
Conditional mutual information ${I(X:Y\|Z)}$	${\|(A \cap B) \backslash C\|}$
${X, Y}$ independent	${A, B}$ disjoint
${X}$ determined by ${Y}$	${A}$ a subset of ${B}$
${X,Y}$ conditionally independent relative to ${Z}$	${A \cap B \subset C}$

Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality ${H(X,Y) \leq H(X)+H(Y)}$ becomes the union bound ${|A \cup B| \leq |A| + |B|}$ , and the definition of mutual information becomes the inclusion-exclusion formula

$\displaystyle |A \cap B| = |A| + |B| - |A \cup B|.$

For a more advanced example, consider the data processing inequality that asserts that if ${X, Z}$ are conditionally independent relative to ${Y}$ , then ${I(X:Z) \leq I(X:Y)}$ . Specialising to sets, this now says that if ${A, C}$ are disjoint outside of ${B}$ , then ${|A \cap C| \leq |A \cap B|}$ ; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if ${A}$ and ${C}$ are not necessarily disjoint outside of ${B}$ , then a consideration of Venn diagrams gives the more general inequality

$\displaystyle |A \cap C| \leq |A \cap B| + |(A \cap C) \backslash B|$

and a further inspection of the diagram then reveals the more precise identity

$\displaystyle |A \cap C| + |(A \cap B) \backslash C| = |A \cap B| + |(A \cap C) \backslash B|.$

Using the dictionary in the reverse direction, one is then led to conjecture the identity

$\displaystyle I( X : Z ) + I( X : Y | Z ) = I( X : Y ) + I( X : Z | Y )$

which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.

On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set ${C}$ is preserved by unions:

$\displaystyle A \cap C = B \cap C = \emptyset \implies (A \cup B) \cap C = \emptyset.$

Indeed, one has the union bound

$\displaystyle |(A \cup B) \cap C| \leq |A \cap C| + |B \cap C|. \ \ \ \ \ (1)$

Applying the dictionary in the reverse direction, one might now conjecture that if ${X}$ was independent of ${Z}$ and ${Y}$ was independent of ${Z}$ , then ${(X,Y)}$ should also be independent of ${Z}$ , and furthermore that

$\displaystyle I(X,Y:Z) \leq I(X:Z) + I(Y:Z)$

but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take ${X, Y \in {\bf F}_2}$ to be independent, uniformly distributed random elements of the finite field ${{\bf F}_2}$ of two elements, and take ${Z := X+Y}$ to be the sum of these two field elements. One can easily check that each of ${X}$ and ${Y}$ is separately independent of ${Z}$ , but the joint variable ${(X,Y)}$ determines ${Z}$ and thus is not independent of ${Z}$ .

From the inclusion-exclusion identities

$\displaystyle |A \cap C| = |A| + |C| - |A \cup C|$

$\displaystyle |B \cap C| = |B| + |C| - |B \cup C|$

$\displaystyle |(A \cup B) \cap C| = |A \cup B| + |C| - |A \cup B \cup C|$

$\displaystyle |A \cap B \cap C| = |A| + |B| + |C| - |A \cup B| - |B \cup C| - |A \cup C|$

$\displaystyle + |A \cup B \cup C|$

one can check that (1) is equivalent to the trivial lower bound ${|A \cap B \cap C| \geq 0}$ . The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection ${A \cap B \cap C}$ . (Even the double intersection ${A \cap B}$ only exists information theoretically in a “virtual” sense; the mutual information ${I(X:Y)}$ allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)

However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities ${H(X), H(Y), H(X,Y), I(X:Y), H(X|Y), H(Y|X)}$ associated to just two variables ${X,Y}$ are those that are also necessarily obeyed by their combinatorial analogues ${|A|, |B|, |A \cup B|, |A \cap B|, |A \backslash B|, |B \backslash A|}$ . (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)

One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let ${S}$ be some finite-dimensional vector space over a finite field ${{\mathbf F}}$ , and let ${f: S \rightarrow {\mathbf F}}$ be a random linear functional on ${S}$ , selected uniformly among all such functions. Every subspace ${U}$ of ${S}$ then gives rise to a random variable ${X = X_U: U \rightarrow {\mathbf F}}$ formed by restricting ${f}$ to ${U}$ . This random variable is also distributed uniformly amongst all linear functions on ${U}$ , and its entropy can be easily computed to be ${\mathrm{dim}(U) \log |\mathbf{F}|}$ . Given two random variables ${X, Y}$ formed by restricting ${f}$ to ${U, V}$ respectively, the joint random variable ${(X,Y)}$ determines the random linear function ${f}$ on the union ${U \cup V}$ on the two spaces, and thus by linearity on the Minkowski sum ${U+V}$ as well; thus ${(X,Y)}$ is equivalent to the restriction of ${f}$ to ${U+V}$ . In particular, ${H(X,Y) = \mathrm{dim}(U+V) \log |\mathbf{F}|}$ . This implies that ${I(X:Y) = \mathrm{dim}(U \cap V) \log |\mathbf{F}|}$ and also ${H(X|Y) = \mathrm{dim}(\pi_V(U)) \log |\mathbf{F}|}$ , where ${\pi_V: S \rightarrow S/V}$ is the quotient map. After discarding the normalising constant ${\log |\mathbf{F}|}$ , this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:

Random variables ${X,Y,Z}$	Subspaces ${U,V,W}$
Entropy ${H(X)}$	Dimension ${\mathrm{dim}(U)}$
Joint variable ${(X,Y)}$	Sum ${U+V}$
Mutual information ${I(X:Y)}$	Dimension of intersection ${\mathrm{dim}(U \cap V)}$
Conditional entropy ${H(X\|Y)}$	Dimension of projection ${\mathrm{dim}(\pi_V(U))}$
Conditional mutual information ${I(X:Y\|Z)}$	${\mathrm{dim}(\pi_W(U) \cap \pi_W(V))}$
${X, Y}$ independent	${U, V}$ transverse ( ${U \cap V = \{0\}}$ )
${X}$ determined by ${Y}$	${U}$ a subspace of ${V}$
${X,Y}$ conditionally independent relative to ${Z}$	${\pi_W(U)}$ , ${\pi_W(V)}$ transverse.

The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking ${S}$ to be the vector space ${\mathbf{F}_2^\Omega}$ over the finite field ${\mathbf{F}_2}$ of two elements, and only considering those subspaces ${U}$ that are coordinate subspaces ${U = {\bf F}_2^A}$ associated to various subsets ${A}$ of ${\Omega}$ .

As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of ${\log |\mathbf{F}|}$ ). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).

The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces ${U,V,W}$ such that ${U}$ and ${V}$ are separately transverse to ${W}$ , but their sum ${U+V}$ is not; for instance, in a two-dimensional vector space ${{\bf F}^2}$ , one can take ${U,V,W}$ to be the one-dimensional subspaces spanned by ${(0,1)}$ , ${(1,0)}$ , and ${(1,1)}$ respectively. Note that this is essentially the same counterexample from before (which took ${{\bf F}}$ to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces ${U,V,W}$ (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables ${X,Y,Z}$ (as well as the corresponding quantities on the above table).

However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality

$\displaystyle \mathrm{dim}(U \cap V) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V)) + \mathrm{dim}(\pi_X(U) \cap \pi_X(V)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (2)$

for any subspaces ${U,V,W,X}$ . This is easiest to see when the three terms on the right-hand side vanish; then ${\pi_W(U), \pi_W(V)}$ are transverse, which implies that ${U\cap V \subset W}$ ; similarly ${U \cap V \subset X}$ . But ${W}$ and ${X}$ are transverse, and this clearly implies that ${U}$ and ${V}$ are themselves transverse. To prove the general case of Ingleton’s inequality, one can define ${Y := U \cap V}$ and use ${\mathrm{dim}(\pi_W(Y)) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V))}$ (and similarly for ${X}$ instead of ${W}$ ) to reduce to establishing the inequality

$\displaystyle \mathrm{dim}(Y) \leq \mathrm{dim}(\pi_W(Y)) + \mathrm{dim}(\pi_X(Y)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (3)$

which can be rearranged using ${\mathrm{dim}(\pi_W(Y)) = \mathrm{dim}(Y) - \mathrm{dim}(W) + \mathrm{dim}(\pi_Y(W))}$ (and similarly for ${X}$ instead of ${W}$ ) and ${\mathrm{dim}(W \cap X) = \mathrm{dim}(W) + \mathrm{dim}(X) - \mathrm{dim}(W + X)}$ as

$\displaystyle \mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W)) + \mathrm{dim}(\pi_Y(X)) + \mathrm{dim}(Y)$

but this is clear since ${\mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W) + \pi_Y(X)) + \mathrm{dim}(Y)}$ .

Returning to the entropy setting, the analogue

$\displaystyle H( V ) \leq H( V | Z ) + H(V | W ) + I(Z:W)$

of (3) is true (exercise!), but the analogue

$\displaystyle I(X:Y) \leq I(X:Y|Z) + I(X:Y|W) + I(Z:W) \ \ \ \ \ (4)$

of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then ${X,Y}$ are conditionally independent relative to ${Z}$ , and relative to ${W}$ , and ${Z}$ and ${W}$ are independent, and the claim (4) would then be asserting that ${X}$ and ${Y}$ are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take ${Z,W}$ to be independent uniform variables from ${\mathbf{F}_2}$ , and take ${X}$ and ${Y}$ to be (say) ${ZW}$ and ${(1-Z)(1-W)}$ respectively (thus ${X, Y}$ are the indicators of the events ${Z=W=1}$ and ${Z=W=0}$ respectively). Once one conditions on either ${Z}$ or ${W}$ , one of ${X,Y}$ has positive conditional entropy and the other has zero entropy, and so ${X, Y}$ are conditionally independent relative to either ${Z}$ or ${W}$ ; also, ${Z}$ or ${W}$ are independent of each other. But ${X}$ and ${Y}$ are not independent of each other (they cannot be simultaneously equal to ${1}$ ). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces ${U, V}$ has a well-defined intersection ${U \cap V}$ that is also a subspace, whereas for arbitrary random variables ${X, Y}$ , there does not necessarily exist the analogue of an intersection, namely a “common information” random variable ${V}$ that has the entropy of ${I(X:Y)}$ and is determined either by ${X}$ or by ${Y}$ .

I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality

$\displaystyle I(X:Y) \leq 2 I(X:Y|Z) + I(X:Z|Y) + I(Y:Z|X)$

$\displaystyle + I(X:Y|W) + I(Z:W).$

One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that

$\displaystyle |\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n}| \leq \varepsilon \log x \ \ \ \ \ (5)$

whenever ${x}$ was sufficiently large depending on ${\varepsilon>0}$ , where ${\lambda}$ is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale ${H}$ between ${1}$ and ${x}$ , one can form certain random variables ${X_H, Y_H}$ . The random variable ${X_H}$ is a sign pattern of the form ${(\lambda(n+1),\dots,\lambda(n+H))}$ where ${n}$ is a random number chosen from ${1}$ to ${x}$ (with logarithmic weighting). The random variable ${Y_H}$ was tuple ${(n \hbox{ mod } p)_{p \sim \varepsilon^2 H}}$ of reductions of ${n}$ to primes ${p}$ comparable to ${\varepsilon^2 H}$ . Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of ${\lambda}$ , the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound

$\displaystyle I( X_H : Y_H ) \gg \varepsilon^7 \frac{H}{\log H}$

on the mutual information between ${X_H}$ and ${Y_H}$ . From translation invariance, this also gives the more general lower bound

$\displaystyle I( X_{H_0,H} : Y_H ) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (6)$

for any ${H_0}$ , where ${X_{H_0,H}}$ denotes the shifted sign pattern ${(\lambda(n+H_0+1),\dots,\lambda(n+H_0+H))}$ . On the other hand, one had the entropy bounds

$\displaystyle H( X_{H_0,H} ), H(Y_H) \ll H$

and from concatenating sign patterns one could see that ${X_{H_0,H+H'}}$ is equivalent to the joint random variable ${(X_{H_0,H}, X_{H_0+H,H'})}$ for any ${H_0,H,H'}$ . Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once ${H}$ was allowed to become sufficiently large compared to ${\varepsilon}$ , but the bound was quite weak (coming ultimately from the unboundedness of ${\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}$ as the interval ${[H_-,H_+]}$ of values of ${H}$ under consideration becomes large), something of the order of ${H \sim \exp\exp\exp(\varepsilon^{-7})}$ ; the quantity ${H}$ needs at various junctures to be less than a small power of ${\log x}$ , so the relationship between ${x}$ and ${\varepsilon}$ becomes essentially quadruple exponential in nature, ${x \sim \exp\exp\exp\exp(\varepsilon^{-7})}$ . The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate ${H(X_{kH})/kH}$ of the mean entropy, in that this quantity decreased by ${\gg \frac{\varepsilon^7}{\log H}}$ as ${k}$ increased from ${1}$ to ${\log H}$ , basically by dividing ${X_{kH}}$ into ${k}$ components ${X_{jH, H}}$ , ${j=0,\dots,k-1}$ and observing from (6) each of these shares a bit of common information with the same variable ${Y_H}$ . This is relatively clear when one works in a set model, in which ${Y_H}$ is modeled by a set ${B_H}$ of size ${O(H)}$ , and ${X_{H_0,H}}$ is modeled by a set of the form

$\displaystyle X_{H_0,H} = \bigcup_{H_0 < h \leq H_0+H} A_h$

for various sets ${A_h}$ of size ${O(1)}$ (also there is some translation symmetry that maps ${A_h}$ to a shift ${A_{h+1}}$ while preserving all of the ${B_H}$ ).

However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables ${Y_H}$ are basically jointly independent as ${H}$ ranges over dyadic values that are much smaller than ${\log x}$ , which in the set model corresponds to the ${B_H}$ all being disjoint. One can then establish a variant

$\displaystyle I( X_{H_0,H} : Y_H | (Y_{H'})_{H' < H}) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (7)$

of (6), which in the set model roughly speaking asserts that each ${B_H}$ claims a portion of the ${\bigcup_{H_0 < h \leq H_0+H} A_h}$ of cardinality ${\gg \varepsilon^7 \frac{H}{\log H}}$ that is not claimed by previous choices of ${B_H}$ . This leads to a more efficient contradiction (relying on the unboundedness of ${\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j}}$ rather than ${\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}$ ) that looks like it removes one order of exponential growth, thus the relationship between ${x}$ and ${\varepsilon}$ is now ${x \sim \exp\exp\exp(\varepsilon^{-7})}$ . Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form

$\displaystyle \frac{1}{2H} H(X_{2H} | (Y_{H'})_{H' \leq 2H}) \leq \frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H}) - \frac{c \varepsilon^7}{\log H}$

for a small constant ${c>0}$ , which on iterating and using the boundedness of ${\frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H})}$ gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges ${H}$ greater than ${\exp( (\log\log x)^{\varepsilon^{7}} )}$ or so, although at this range the theorem is not significantly simpler than the general case).

An entropy Plünnecke-Ruzsa inequality

27 October, 2009 in expository, math.CO, math.IT | Tags: additive combinatorics, Plunnecke-Ruzsa inequality, Shannon entropy | by Terence Tao | 11 comments

A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:

Theorem 1 (Plünnecke-Ruzsa inequality) Let ${A, B_1, \ldots, B_m}$ be finite non-empty subsets of an additive group ${G}$ , such that ${|A+B_i| \leq K_i |A|}$ for all ${1 \leq i \leq m}$ and some scalars ${K_1,\ldots,K_m \geq 1}$ . Then there exists a subset ${A'}$ of ${A}$ such that ${|A' + B_1 + \ldots + B_m| \leq K_1 \ldots K_m |A'|}$ .

The proof uses graph-theoretic techniques. Setting ${A=B_1=\ldots=B_m}$ , we obtain a useful corollary: if ${A}$ has small doubling in the sense that ${|A+A| \leq K|A|}$ , then we have ${|mA| \leq K^m |A|}$ for all ${m \geq 1}$ , where ${mA = A + \ldots + A}$ is the sum of ${m}$ copies of ${A}$ .

In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as ${A}$ in ${G}$ are replaced with discrete random variables ${X}$ taking values in ${G}$ , and (the logarithm of) cardinality ${|A|}$ of a set ${A}$ is replaced by Shannon entropy ${{\Bbb H}(X)}$ of a random variable ${X}$ . (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.

I recently discovered, however, that buried in a classic paper of Kaimonovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 1:

Theorem 2 (Entropy Plünnecke-Ruzsa inequality) Let ${X, Y_1, \ldots, Y_m}$ be independent random variables of finite entropy taking values in an additive group ${G}$ , such that ${{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i}$ for all ${1 \leq i \leq m}$ and some scalars ${K_1,\ldots,K_m \geq 1}$ . Then ${{\Bbb H}(X+Y_1+\ldots+Y_m) \leq {\Bbb H}(X) + \log K_1 \ldots K_m}$ .

In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set ${A}$ to a subset ${A'}$ , but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if ${{\Bbb H}(X_1+X_2) \leq {\Bbb H}(X) + \log K}$ , then ${{\Bbb H}(X_1+\ldots+X_m) \leq {\Bbb H}(X) + (m-1) \log K}$ for all ${m \geq 1}$ , where ${X_1,\ldots,X_m}$ are independent copies of ${X}$ ; this improves slightly over the analogous combinatorial inequality. Indeed, the function ${m \mapsto {\Bbb H}(X_1+\ldots+X_m)}$ is concave (this can be seen by using the ${m=2}$ version of Theorem 2 (or (2) below) to show that the quantity ${{\Bbb H}(X_1+\ldots+X_{m+1})-{\Bbb H}(X_1+\ldots+X_m)}$ is decreasing in ${m}$ ).

Theorem 2 is actually a quick consequence of the submodularity inequality

$\displaystyle {\Bbb H}(W) + {\Bbb H}(X) \leq {\Bbb H}(Y) + {\Bbb H}(Z) \ \ \ \ \ (1)$

in information theory, which is valid whenever ${X,Y,Z,W}$ are discrete random variables such that ${Y}$ and ${Z}$ each determine ${X}$ (i.e. ${X}$ is a function of ${Y}$ , and also a function of ${Z}$ ), and ${Y}$ and ${Z}$ jointly determine ${W}$ (i.e ${W}$ is a function of ${Y}$ and ${Z}$ ). To apply this, let ${X, Y, Z}$ be independent discrete random variables taking values in ${G}$ . Observe that the pairs ${(X,Y+Z)}$ and ${(X+Y,Z)}$ each determine ${X+Y+Z}$ , and jointly determine ${(X,Y,Z)}$ . Applying (1) we conclude that

$\displaystyle {\Bbb H}(X,Y,Z) + {\Bbb H}(X+Y+Z) \leq {\Bbb H}(X,Y+Z) + {\Bbb H}(X+Y,Z)$

which after using the independence of ${X,Y,Z}$ simplifies to the sumset submodularity inequality

$\displaystyle {\Bbb H}(X+Y+Z) + {\Bbb H}(Y) \leq {\Bbb H}(X+Y) + {\Bbb H}(Y+Z) \ \ \ \ \ (2)$

(this inequality was also recently observed by Madiman; it is the ${m=2}$ case of Theorem 2). As a corollary of this inequality, we see that if ${{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i}$ , then

$\displaystyle {\Bbb H}(X+Y_1+\ldots+Y_i) \leq {\Bbb H}(X+Y_1+\ldots+Y_{i-1}) + \log K_i,$

and Theorem 2 follows by telescoping series.

The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of ${X,Y_1,\ldots,Y_m}$ than ${X+Y_i}$ ; see this paper and this paper respectively.

Read the rest of this entry »

Sumset and inverse sumset theorems for Shannon entropy

25 June, 2009 in math.CO, math.PR, paper | Tags: Freiman's theorem, Shannon entropy, sum set estimates | by Terence Tao | 11 comments

It turns out to be a favourable week or two for me to finally finish a number of papers that had been at a nearly completed stage for a while. I have just uploaded to the arXiv my article “Sumset and inverse sumset theorems for Shannon entropy“, submitted to Combinatorics, Probability, and Computing. This paper evolved from a “deleted scene” in my book with Van Vu entitled “Entropy sumset estimates“. In those notes, we developed analogues of the standard Plünnecke-Ruzsa sumset estimates (which relate quantities such as the cardinalities $|A+B|, |A-B|$ of the sum and difference sets of two finite sets $A, B$ in an additive group $G$ to each other), to the entropy setting, in which the finite sets $A \subset G$ are replaced instead with discrete random variables $X$ taking values in that group G, and the (logarithm of the) cardinality |A| is replaced with the Shannon entropy

${\textbf H}(X) := \sum_{x \in G} {\Bbb P}(x \in X) \log \frac{1}{{\Bbb P}(x \in X)}.$

This quantity measures the information content of X; for instance, if ${\textbf H}(X) = k \log 2$ , then it will take k bits on the average to store the value of X (thus a string of n independent copies of X will require about nk bits of storage in the asymptotic limit $n \to \infty$ ). The relationship between entropy and cardinality is that if X is the uniform distribution on a finite non-empty set A, then ${\textbf H}(X) = \log |A|$ . If instead X is non-uniformly distributed on A, one has $0 < {\textbf H}(X) < \log |A|$ , thanks to Jensen’s inequality.

It turns out that many estimates on sumsets have entropy analogues, which resemble the “logarithm” of the sumset estimates. For instance, the trivial bounds

$|A|, |B| \leq |A+B| \leq |A| |B|$

have the entropy analogue

${\textbf H}(X), {\textbf H}(Y) \leq {\textbf H}(X+Y) \leq {\textbf H}(X) + {\textbf H}(Y)$

whenever X, Y are independent discrete random variables in an additive group; this is not difficult to deduce from standard entropy inequalities. Slightly more non-trivially, the sum set estimate

$|A+B| \leq \frac{|A-B|^3}{|A| |B|}$

established by Ruzsa, has an entropy analogue

${\textbf H}(X+Y) \leq 3 {\textbf H}(X-Y) - {\textbf H}(X) - {\textbf H}(Y)$ ,

and similarly for a number of other standard sumset inequalities in the literature (e.g. the Rusza triangle inequality, the Plünnecke-Rusza inequality, and the Balog-Szemeredi-Gowers theorem, though the entropy analogue of the latter requires a little bit of care to state). These inequalities can actually be deduced fairly easily from elementary arithmetic identities, together with standard entropy inequalities, most notably the submodularity inequality

${\textbf H}(Z) + {\textbf H}(W) \leq {\textbf H}(X) + {\textbf H}(Y)$

whenever X,Y,Z,W are discrete random variables such that X and Y each determine W separately (thus $W = f(X) = g(Y)$ for some deterministic functions f, g) and X and Y determine Z jointly (thus $Z = h(X,Y)$ for some deterministic function f). For instance, if X,Y,Z are independent discrete random variables in an additive group G, then $(X-Y,Y-Z)$ and $(X,Z)$ each determine $X-Z$ separately, and determine $X,Y,Z$ jointly, leading to the inequality

${\textbf H}(X,Y,Z) + {\textbf H}(X-Z) \leq {\textbf H}(X-Y,Y-Z) + {\textbf H}(X,Z)$

which soon leads to the entropy Rusza triangle inequality

${\textbf H}(X-Z) \leq {\textbf H}(X-Y) + {\textbf H}(Y-Z) - {\textbf H}(Y)$

which is an analogue of the combinatorial Ruzsa triangle inequality

$|A-C| \leq \frac{|A-B| |B-C|}{|B|}.$

All of this was already in the unpublished notes with Van, though I include it in this paper in order to place it in the literature. The main novelty of the paper, though, is to consider the entropy analogue of Freiman’s theorem, which classifies those sets A for which $|A+A| = O(|A|)$ . Here, the analogous problem is to classify the random variables $X$ such that ${\textbf H}(X_1+X_2) = {\textbf H}(X) + O(1)$ , where $X_1,X_2$ are independent copies of X. Let us say that X has small doubling if this is the case.

For instance, the uniform distribution U on a finite subgroup H of G has small doubling (in fact ${\textbf H}(U_1+U_2)={\textbf H}(U) = \log |H|$ in this case). In a similar spirit, the uniform distribution on a (generalised) arithmetic progression P also has small doubling, as does the uniform distribution on a coset progression H+P. Also, if X has small doubling, and Y has bounded entropy, then X+Y also has small doubling, even if Y and X are not independent. The main theorem is that these are the only cases:

Theorem 1. (Informal statement) X has small doubling if and only if $X = U + Y$ for some uniform distribution U on a coset progression (of bounded rank), and Y has bounded entropy.

For instance, suppose that X was the uniform distribution on a dense subset A of a finite group G. Then Theorem 1 asserts that X is close in a “transport metric” sense to the uniform distribution U on G, in the sense that it is possible to rearrange or transport the probability distribution of X to the probability distribution of U (or vice versa) by shifting each component of the mass of X by an amount Y which has bounded entropy (which basically means that it primarily ranges inside a set of bounded cardinality). The way one shows this is by randomly translating the mass of X around by a few random shifts to approximately uniformise the distribution, and then deal with the residual fluctuation in the distribution by hand. Theorem 1 as a whole is established by using the Freiman theorem in the combinatorial setting combined with various elementary convexity and entropy inequality arguments to reduce matters to the above model case when X is supported inside a finite group G and has near-maximal entropy.

I also show a variant of the above statement: if X, Y are independent and ${\textbf H}(X+Y) = {\textbf H}(X)+O(1) = {\textbf H}(Y)+O(1)$ , then we have $X \equiv Y+Z$ (i.e. X has the same distribution as Y+Z for some Z of bounded entropy (not necessarily independent of X or Y). Thus if two random variables are additively related to each other, then they can be additively transported to each other by using a bounded amount of entropy.

In the last part of the paper I relate these discrete entropies to their continuous counterparts

${\textbf H}_{\Bbb R}(X) := \int_{{\Bbb R}} p(x) \log \frac{1}{p(x)}\ dx,$

where X is now a continuous random variable on the real line with density function $p(x)\ dx$ . There are a number of sum set inequalities known in this setting, for instance

${\textbf H}_{\Bbb R}(X_1 + X_2) \geq {\textbf H}_{\Bbb R}(X) + \frac{1}{2} \log 2$ ,

for independent copies $X_1,X_2$ of a finite entropy random variable X, with equality if and only if X is a Gaussian. Using this inequality and Theorem 1, I show a discrete version, namely that

${\textbf H}(X_1 + X_2) \geq {\textbf H}(X) + \frac{1}{2} \log 2 - \varepsilon$ ,

whenever $\varepsilon> 0$ and $X_1,X_2$ are independent copies of a random variable in ${\Bbb Z}$ (or any other torsion-free abelian group) whose entropy is sufficiently large depending on $\varepsilon$ . This is somewhat analogous to the classical sumset inequality

$|A+A| \geq 2 |A| - 1$

though notice that we have a gain of just $\frac{1}{2} \log 2$ rather than $\log 2$ here, the point being that there is a Gaussian counterexample in the entropy setting which does not have a combinatorial analogue (except perhaps in the high-dimensional limit). The main idea is to use Theorem 1 to trap most of X inside a coset progression, at which point one can use Fourier-analytic additive combinatorial tools to show that the distribution $X_1+X_2$ is “smooth” in some non-trivial direction r, which can then be used to approximate the discrete distribution by a continuous one.

I also conjecture more generally that the entropy monotonicity inequalities established by Artstein, Barthe, Ball, and Naor in the continuous case also hold in the above sense in the discrete case, though my method of proof breaks down because I no longer can assume small doubling.

	Jianfeng Li on An introduction to measure…
	Jianfeng Li on An introduction to measure…
	Jianfeng Li on An introduction to measure…
	Jianfeng Li on An introduction to measure…
	Anonymous on 275A, Notes 3: The weak and st…
	Anonymous on 275A, Notes 3: The weak and st…
	Anonymous on Dense sets of natural numbers…
	Anonymous on Dense sets of natural numbers…
	Anonymous on 275A, Notes 3: The weak and st…
	Anonymous on Dense sets of natural numbers…
	Fabrice Imparato on Dense sets of natural numbers…
	Anonymous on A computation-outsourced discu…
	Anonymous on Dense sets of natural numbers…
	Anonymous on 245A, Notes 3: Integration on…
	Samuel German on Analysis I

Tag Archive

An abridged proof of Marton’s conjecture

Marton’s conjecture in abelian groups with bounded torsion

A generalized Cauchy-Schwarz inequality via the Gibbs variational formula

On a conjecture of Marton

The sunflower lemma via Shannon entropy

254A, Notes 9 – second moment and entropy methods

Special cases of Shannon entropy

An entropy Plünnecke-Ruzsa inequality

Sumset and inverse sumset theorems for Shannon entropy

Recent Comments

Articles by others

Diversions

Mathematics

Selected articles

Software

The sciences

Top Posts

Archives

Categories

The Polymath Blog

For commenters