
I’ve just uploaded to the arXiv my paper “Dense sets of natural numbers with unusually large least common multiples“. This short paper answers (in the negative) a somewhat obscure question of Erdös and Graham:

Problem 1 Is it true that if {A} is a set of natural numbers for which

\displaystyle  \frac{1}{\log\log x} \sum_{n \in A: n \leq x} \frac{1}{n} \ \ \ \ \ (1)

goes to infinity as {x \rightarrow \infty}, then the quantity

\displaystyle  \frac{1}{(\sum_{n \in A: n \leq x} \frac{1}{n})^2} \sum_{n,m \in A: n < m \leq x} \frac{1}{\mathrm{lcm}(n,m)} \ \ \ \ \ (2)

also goes to infinity as {x \rightarrow \infty}?

At first glance, this problem may seem rather arbitrary, but it can be motivated as follows. The hypothesis that (1) goes to infinity is a largeness condition on {A}; in view of Mertens’ theorem, it can be viewed as an assertion that {A} is denser than the set of primes. On the other hand, the conclusion that (2) grows is an assertion that {\frac{1}{\mathrm{lcm}(n,m)}} becomes significantly larger than {\frac{1}{nm}} on the average for large {n,m \in A}; that is to say, that many pairs of numbers in {A} share a common factor. Intuitively, the problem is then asking whether sets that are significantly denser than the primes must start having lots of common factors on average.

For sake of comparison, it is easy to see that if (1) goes to infinity, then at least one pair {(n,m)} of distinct elements in {A} must have a non-trivial common factor. For if this were not the case, then the elements of {A} would be pairwise coprime, so each prime {p} would have at most one multiple in {A}, and so could contribute at most {1/p} to the sum in (1); hence by Mertens’ theorem, and the fact that every natural number greater than one is divisible by at least one prime {p}, the quantity (1) would stay bounded, a contradiction.

It turns out, though, that the answer to the above problem is negative; one can find sets {A} that are denser than the primes, but for which (2) stays bounded, so that the least common multiples in the set are unusually large. It was a bit surprising to me that this question had not been resolved long ago (in fact, I was not able to find any prior literature on the problem beyond the original reference of Erdös and Graham); in contrast, another problem of Erdös and Graham concerning sets with unusually small least common multiples was extensively studied (and essentially solved) about twenty years ago, while the study of sets with unusually large greatest common divisor for many pairs in the set has recently become somewhat popular, due to their role in the proof of the Duffin-Schaeffer conjecture by Koukoulopoulos and Maynard.

To search for counterexamples, it is natural to look for numbers with relatively few prime factors, in order to reduce their common factors and increase their least common multiple. A particularly simple example, whose verification is on the level of an exercise in a graduate analytic number theory course, is the set of semiprimes (products of two primes), for which one can readily verify that (1) grows like {\log\log x} but (2) stays bounded. With a bit more effort, I was able to optimize the construction and uncover the true threshold for boundedness of (2), which was a little unexpected:

Theorem 2
  • (i) For any {C>0}, there exists a set of natural numbers {A} with

    \displaystyle  \sum_{n \in A: n \leq x} \frac{1}{n} = \exp( (C+o(1)) (\log\log x)^{1/2} \log\log\log x )

    for all large {x}, for which (2) stays bounded.
  • (ii) Conversely, if (2) stays bounded, then

    \displaystyle  \sum_{n \in A: n \leq x} \frac{1}{n} \ll \exp( O( (\log\log x)^{1/2} \log\log\log x ) )

    for all large {x}.
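Returning to the semiprime example mentioned above, here is a quick numerical sanity check of the quantities (1) and (2) for that set (a minimal sketch only; the cutoff {X} below is tiny and chosen purely so that the brute-force pair sum runs quickly):

```python
from math import gcd, log

def is_semiprime(n):
    """True if n is a product of exactly two primes (counted with multiplicity)."""
    count, m, d = 0, n, 2
    while d * d <= m:
        while m % d == 0:
            m //= d
            count += 1
            if count > 2:
                return False
        d += 1
    if m > 1:
        count += 1
    return count == 2

X = 3000
A = [n for n in range(4, X + 1) if is_semiprime(n)]

S = sum(1.0 / n for n in A)                   # inner sum of (1)
print("quantity (1):", S / log(log(X)))

T = sum(gcd(n, m) / (n * m)                   # 1/lcm(n,m) = gcd(n,m)/(n*m)
        for i, n in enumerate(A) for m in A[i + 1:])
print("quantity (2):", T / S ** 2)
```

Of course, at such a small cutoff {\log\log x} barely moves, so this is only a plausibility check rather than evidence for the asymptotic behavior.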

The proofs are not particularly long or deep, but I thought I would record here some of the process towards finding them. My first step was to try to simplify the condition that (2) stays bounded. In order to use probabilistic intuition, I first expressed this condition in probabilistic terms as

\displaystyle  \mathbb{E} \frac{\mathbf{n} \mathbf{m}}{\mathrm{lcm}(\mathbf{n}, \mathbf{m})} \ll 1

for large {x}, where {\mathbf{n}, \mathbf{m}} are independent random variables drawn from {\{ n \in A: n \leq x \}} with probability density function

\displaystyle  \mathbb{P} (\mathbf{n} = n) = \frac{1}{\sum_{m \in A: m \leq x} \frac{1}{m}} \frac{1}{n}.

The presence of the least common multiple in the denominator is annoying, but since {nm = \mathrm{gcd}(n,m) \mathrm{lcm}(n,m)}, one can easily flip the expression to the greatest common divisor:

\displaystyle  \mathbb{E} \mathrm{gcd}(\mathbf{n}, \mathbf{m}) \ll 1.

If the expression {\mathrm{gcd}(\mathbf{n}, \mathbf{m})} was a product of a function of {\mathbf{n}} and a function of {\mathbf{m}}, then by independence this expectation would decouple into simpler averages involving just one random variable instead of two. Of course, the greatest common divisor is not of this form, but there is a standard trick in analytic number theory to decouple the greatest common divisor, namely to use the classic Gauss identity {n = \sum_{d|n} \varphi(d)}, with {\varphi} the Euler totient function, to write

\displaystyle  \mathrm{gcd}(\mathbf{n}, \mathbf{m}) = \sum_{d | \mathbf{n}, \mathbf{m}} \varphi(d).
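(As a quick sanity check of this identity, here is a minimal sketch that verifies it numerically for small arguments, using the fact that the common divisors of {n} and {m} are exactly the divisors of {\mathrm{gcd}(n,m)}.)

```python
from math import gcd

def phi(d):
    """Euler totient function, computed by trial division."""
    result, m, p = d, d, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result

for n in range(1, 200):
    for m in range(1, 200):
        g = gcd(n, m)
        # sum of phi(d) over the common divisors d of n and m
        assert g == sum(phi(d) for d in range(1, g + 1) if g % d == 0)
print("identity verified for all n, m < 200")
```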

Inserting this formula and interchanging the sum and expectation, we can now express the condition as bounding a sum of squares:

\displaystyle  \sum_d \varphi(d) \mathbb{P}(d|\mathbf{n})^2 \ll 1.

Thus, the condition (2) is really an assertion to the effect that typical elements of {A} do not have many divisors. From experience in sieve theory, the probabilities {\mathbb{P}(d|\mathbf{n})} tend to behave multiplicatively in {d}, so the expression here heuristically behaves like an Euler product that looks something like

\displaystyle  \prod_p (1 + \varphi(p) \mathbb{P}(p|\mathbf{n})^2)

and so the condition (2) is morally something like

\displaystyle  \sum_p p \mathbb{P}(p|\mathbf{n})^2 \ll 1. \ \ \ \ \ (3)

Comparing this with Mertens’ theorems, this leads to the heuristic prediction that {\mathbb{P}(p|\mathbf{n})} (for a typical prime {p} much smaller than {x}) should decay somewhat like {\frac{1}{p (\log\log p)^{1/2}}} (ignoring for now factors of {\log\log\log p}). This can be compared to the example of the set of primes or semiprimes on one hand, where the probability is like {\frac{1}{p \log\log p}}, and the set of all natural numbers on the other hand, where the probability is like {\frac{1}{p}}. So the critical behavior should come from sets that are in some sense “halfway” between the primes and the natural numbers.

It is then natural to try a random construction, in which one sieves out the natural numbers by permitting each natural number {n} to survive with a probability resembling {\prod_{p|n} \frac{1}{(\log\log p)^{1/2}}}, in order to get the predicted behavior for {\mathbb{P}(p|\mathbf{n})}. After performing some standard calculations, I found that this construction could keep (2) bounded, but only with a density a little bit less than the one stated in the main theorem; after optimizing the parameters, I could only get something like

\displaystyle  \sum_{n \in A: n \leq x} \frac{1}{n} = \exp( (\log\log x)^{1/2} (\log\log\log x)^{-1/2-o(1)} ).

I was stuck on optimising the construction further, so I turned my attention to a positive result in the spirit of (ii) of the main theorem. On playing around with (3), I observed that one could use Cauchy-Schwarz and Mertens’ theorem to obtain the bound

\displaystyle  \sum_{p \leq x} \mathbb{P}(p|\mathbf{n}) \ll (\log\log x)^{1/2}

which was in line with the previous heuristic that {\mathbb{P}(p|\mathbf{n})} should behave like {\frac{1}{p (\log\log p)^{1/2}}}. The left-hand side had a simple interpretation: by linearity of expectation, it was the expected number {\mathbb{E} \omega(\mathbf{n})} of prime factors of {\mathbf{n}}. So the boundedness of (2) implied that a typical element of {A} only had about {(\log\log x)^{1/2}} prime factors, in contrast to the {\log\log x} predicted by the Hardy-Ramanujan law. Standard methods from the anatomy of integers can then be used to see how dense a set with that many prime factors could be, and this soon led to a short proof of part (ii) of the main theorem (I eventually found for instance that Jensen’s inequality could be used to create a particularly slick argument).
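For the record, the Cauchy-Schwarz step referred to above is simply

\displaystyle  \sum_{p \leq x} \mathbb{P}(p|\mathbf{n}) \leq \Big( \sum_{p \leq x} \frac{1}{p} \Big)^{1/2} \Big( \sum_{p \leq x} p \mathbb{P}(p|\mathbf{n})^2 \Big)^{1/2} \ll (\log\log x)^{1/2},

with Mertens’ theorem controlling the first factor and (3) controlling the second.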

It then remained to improve the lower bound construction to eliminate the {\log\log\log x} losses in the exponents. By deconstructing the proof of the upper bound, it became natural to consider something like the set of natural numbers {n} that had at most {(\log\log n)^{1/2}} prime factors. This construction actually worked for some scales {x} – namely those {x} for which {(\log\log x)^{1/2}} was a natural number – but there were some strange “discontinuities” in the analysis that prevented me from establishing the boundedness of (2) for arbitrary scales {x}. The basic problem was that increasing the number of permitted prime factors from one natural number threshold {k} to another {k+1} ended up increasing the density of the set by an unbounded factor (of the order of {k}, in practice), which heavily disrupted the task of trying to keep the ratio (2) bounded. Usually the resolution to these sorts of discontinuities is to use some sort of random “average” of two or more deterministic constructions – for instance, by taking some random union of some numbers with {k} prime factors and some numbers with {k+1} prime factors – but the numerology turned out to be somewhat unfavorable, allowing for some improvement in the lower bounds over my previous construction, but not enough to close the gap entirely. It was only after substantial trial and error that I was able to find a working deterministic construction, where at a given scale one collected either numbers with at most {k} prime factors, or numbers with {k+1} prime factors but with the largest prime factor in a specific range, in which I could finally get the numerator and denominator in (2) to be in balance for every {x}. But once the construction was written down, the verification of the required properties ended up being quite routine.

Many modern mathematical proofs are a combination of conceptual arguments and technical calculations. There is something of a tradeoff between the two: one can add more conceptual arguments to try to reduce the technical computations, or vice versa. (Among other things, this leads to a Berkson paradox-like phenomenon in which a negative correlation can be observed between the two aspects of a proof; see this recent Mastodon post of mine for more discussion.)

In a recent article, Heather Macbeth argues that the preferred balance between conceptual and computational arguments is quite different for a computer-assisted proof than it is for a purely human-readable proof. In the latter, there is a strong incentive to minimize the amount of calculation to the point where it can be checked by hand, even if this requires a certain amount of ad hoc rearrangement of cases, unmotivated parameter selection, or otherwise non-conceptual additions to the arguments in order to reduce the calculation. But in the former, once one is willing to outsource any tedious verification or optimization task to a computer, the incentives are reversed: freed from the need to arrange the argument to reduce the amount of calculation, one can now describe an argument by listing the main ingredients and then letting the computer figure out a suitable way to combine them to give the stated result. The two approaches can thus be viewed as complementary ways to describe a result, with neither necessarily being superior to the other.

In this post, I would like to illustrate this computation-outsourced approach with the topic of zero-density theorems for the Riemann zeta function, in which all computer verifiable calculations (as well as other routine but tedious arguments) are performed “off-stage”, with the intent of focusing only on the conceptual inputs to these theorems.

Zero-density theorems concern upper bounds for the quantity {N(\sigma,T)} for a given {1/2 \leq \sigma \leq 1} and large {T}, which is defined as the number of zeroes of the Riemann zeta function in the rectangle {\{ \beta+i\gamma: \sigma \leq \beta \leq 1; 0 \leq \gamma \leq T \}}. (There is also an important generalization of this quantity to {L}-functions, but for simplicity we will focus on the classical zeta function case here). Such quantities are important in analytic number theory for many reasons, one of which is through explicit formulae such as the Riemann-von Mangoldt explicit formula

\displaystyle  \sum_{n \leq x}^{\prime} \Lambda(n) = x - \sum_{\rho:\zeta(\rho)=0} \frac{x^\rho}{\rho} - \log(2\pi) - \frac{1}{2} \log(1-x^{-2}) \ \ \ \ \ (1)

relating the prime numbers to the zeroes of the zeta function (the “music of the primes”). The better bounds one has on {N(\sigma,T)}, the more control one has on the complicated term {\sum_{\rho:\zeta(\rho)=0} \frac{x^\rho}{\rho}} on the right-hand side.

Clearly {N(\sigma,T)} is non-increasing in {\sigma}. The Riemann-von Mangoldt formula, together with the functional equation, gives us the asymptotic

\displaystyle  N(1/2,T) \asymp T \log T

in the {\sigma=1/2} case, while the prime number theorem tells us that

\displaystyle  N(1,T) = 0. \ \ \ \ \ (2)

The various zero free regions for the zeta function can be viewed as slight improvements to (2); for instance, the classical zero-free region is equivalent to the assertion that {N(\sigma,T)} vanishes if {\sigma > 1 - c/\log T} for some small absolute constant {c>0}, and the Riemann hypothesis is equivalent to the assertion that {N(\sigma,T)=0} for all {\sigma>1/2}.

Experience has shown that the most important quantity to control here is the exponent {A(\sigma)}, defined as the least constant for which one has a bound of the form

\displaystyle  N(\sigma,T) \leq T^{A(\sigma)(1-\sigma)+o(1)}

as {T \rightarrow \infty}. Thus, for instance,

\displaystyle  A(1/2) = 2, \ \ \ \ \ (3)

{A(1) = 0}, and {A(\sigma)(1-\sigma)} is a non-increasing function of {\sigma}, so we obtain the trivial “von Mangoldt” zero density theorem

\displaystyle  A(\sigma) \leq \frac{1}{1-\sigma}.

Of particular interest is the supremal value {\|A\|_\infty := \sup_{1/2 \leq \sigma \leq 1} A(\sigma)} of {A}, which has to be at least {2} thanks to (3). The density hypothesis asserts that the maximum is in fact exactly {2}, or equivalently that

\displaystyle  A(\sigma) \leq 2, \ \ \ \ \ (4)

for all {1/2 \leq \sigma \leq 1}. This is of course implied by the Riemann hypothesis (which clearly implies that {A(\sigma)=0} for all {\sigma>1/2}), but is a more tractable hypothesis to work with; for instance, the hypothesis is already known to hold for {\sigma \geq 25/32 = 0.78125} by the work of Bourgain (building upon many previous authors). The quantity {\|A\|_\infty} directly impacts our understanding of the prime number theorem in short intervals; indeed, it is not difficult using (1) (as well as the Vinogradov-Korobov zero-free region) to establish a short interval prime number theorem

\displaystyle  \sum_{x \leq n \leq x + x^\theta} \Lambda(n) = (1+o(1)) x^\theta

for all {x \rightarrow \infty} if {1 - \frac{1}{\|A\|_\infty} < \theta < 1} is a fixed exponent, or for almost all {x \rightarrow \infty} if {1 - \frac{2}{\|A\|_\infty} < \theta < 1} is a fixed exponent. Until recently, the best upper bound for {\|A\|_\infty} was {12/5 = 2.4}, thanks to a 1972 result of Huxley; but this was recently lowered to {30/13=2.307\ldots} in a breakthrough work of Guth and Maynard.

In between the papers of Huxley and Guth-Maynard are dozens of additional improvements on {A(\sigma)}, though it is only the Guth-Maynard paper that actually lowered the supremum norm {\|A\|_\infty}. A summary of most of the state of the art before Guth-Maynard may be found in Table 2 of this recent paper of Trudgian and Yang; it is complicated, but it is easy enough to get a computer to illustrate it with a plot:
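As a minimal illustration of what such a script might look like, here is a sketch that plots only a handful of the classical bounds (the trivial bound {\frac{1}{1-\sigma}}, Ingham’s {\frac{3}{2-\sigma}}, and Huxley’s {\frac{3}{3\sigma-1}} for {\sigma \geq 3/4}, together with the density hypothesis level {2}), rather than the full Trudgian-Yang table:

```python
import numpy as np
import matplotlib.pyplot as plt

sigma = np.linspace(0.5, 0.99, 400)

trivial = 1.0 / (1.0 - sigma)                 # "von Mangoldt" bound A(sigma) <= 1/(1-sigma)
ingham = 3.0 / (2.0 - sigma)                  # Ingham: A(sigma) <= 3/(2-sigma)
huxley = np.where(sigma >= 0.75,              # Huxley: A(sigma) <= 3/(3*sigma-1) for sigma >= 3/4
                  3.0 / (3.0 * sigma - 1.0), np.inf)

best = np.minimum(np.minimum(trivial, ingham), huxley)

plt.plot(sigma, best, label="Ingham/Huxley")
plt.plot(sigma, np.clip(trivial, 0, 4), "--", label="trivial bound")
plt.axhline(2.0, linestyle=":", label="density hypothesis")
plt.xlabel(r"$\sigma$")
plt.ylabel(r"upper bound for $A(\sigma)$")
plt.ylim(0, 4)
plt.legend()
plt.show()
```

The actual figure in the post is generated from the complete table of exponents, and also records the situation under the Lindelöf hypothesis.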

(For an explanation of what is going on under the assumption of the Lindelöf hypothesis, see below the fold.) The full plot represents the combined effort of nearly a dozen papers, each one of which claims one or more components of the depicted piecewise smooth curve, and is written in the “human-readable” style mentioned above, where the argument is arranged to reduce the amount of tedious computation to human-verifiable levels, even if this comes at the cost of obscuring the conceptual ideas. (For an animation of how this bound improved over time, see here.) Below the fold, I will try to describe (in sketch form) some of the standard ingredients that go into these papers, in particular the routine reduction of deriving zero density estimates from large value theorems for Dirichlet series. We will not attempt to rewrite the entire literature of zero-density estimates in this fashion, but focus on some illustrative special cases.


The Salem prize was established in 1968 and named in honor of Raphaël Salem (1898-1963), a mathematician famous notably for his deep study of the links between Fourier series and number theory and for pioneering applications of probabilistic methods to these fields. It was not awarded from 2019-2022, due to both the COVID pandemic and the death of Jean Bourgain, who had been almost single-handedly administering the prize, but is now active again, being administered by Akshay Venkatesh and the IAS. I chair the scientific committee for this prize, whose other members are Guy David and Mikhail Sodin. Last year, the prize was awarded to Sarah Peluse and Julian Sahasrabudhe.

Nominations for the 2024 Salem Prize are now open until September 1st. Nominations should include a CV of the nominee and a nomination letter explaining the significance of the nominee’s work. Supplementary documentation, such as supporting letters of recommendation or key publications, can additionally be provided, but is not required.

Nominees may be individuals from any country or institution. Preference will be given to nominees who have received their PhD in the last ten years, although this rule may be relaxed if there are mitigating personal circumstances, or if there have been few Salem prize winners in recent years.  Self-nominations will not be considered, nor are past Prize winners or Scientific Committee members eligible.

The prize does not come with a direct monetary award, but winners will be invited to visit the IAS and to give a lecture associated with the award of the prize.

See also the previous year’s announcement of the Salem prize nomination period.

[This post is dedicated to Luca Trevisan, who recently passed away due to cancer. Though far from his most significant contribution to the field, I would like to mention that, as with most of my other blog posts on this site, this page was written with the assistance of Luca’s LaTeX to WordPress converter. Mathematically, his work and insight on pseudorandomness in particular have greatly informed how I myself think about the concept. – T.]

Recently, Timothy Gowers, Ben Green, Freddie Manners, and I were able to establish the following theorem:

Theorem 1 (Marton’s conjecture) Let {A \subset {\bf F}_2^n} be non-empty with {|A+A| \leq K|A|}. Then there exists a subgroup {H} of {{\bf F}_2^n} with {|H| \leq |A|} such that {A} is covered by at most {2K^C} translates of {H}, for some absolute constant {C}.

We established this result with {C=12}, although it has since been improved to {C=9} by Jyun-Jie Liao.

Our proof was written in order to optimize the constant {C} as much as possible; similarly for the more detailed blueprint of the proof that was prepared in order to formalize the result in Lean. I have been asked a few times whether it is possible to present a streamlined and more conceptual version of the proof in which one does not try to establish an explicit constant {C}, but just to show that the result holds for some constant {C}. This is what I will attempt to do in this post, though some of the more routine steps will be outsourced to the aforementioned blueprint.

The key concept here is that of the entropic Ruzsa distance {d[X;Y]} between two random variables {X,Y} taking values in {{\bf F}_2^n}, defined as

\displaystyle  d[X;Y] := {\mathbf H}[X'+Y'] - \frac{1}{2} {\mathbf H}[X] - \frac{1}{2} {\mathbf H}[Y]

where {X',Y'} are independent copies of {X,Y}, and {{\mathbf H}[X]} denotes the Shannon entropy of {X}. This distance is symmetric and non-negative, and obeys the triangle inequality

\displaystyle  d[X;Z] \leq d[X;Y] + d[Y;Z]

for any random variables {X,Y,Z}; see the blueprint for a proof. The above theorem then follows from an entropic analogue:

Theorem 2 (Entropic Marton’s conjecture) Let {X} be a {{\bf F}_2^n}-valued random variable with {d[X;X] \leq \log K}. Then there exists a uniform random variable {U_H} on a subgroup {H} of {{\bf F}_2^n} such that {d[X; U_H] \leq C \log K} for some absolute constant {C}.

We were able to establish Theorem 2 with {C=11}, which implies Theorem 1 with {C=12} by fairly standard additive combinatorics manipulations (such as the Ruzsa covering lemma); see the blueprint for details.
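Before turning to the key proposition, here is a minimal numerical sketch of the entropic Ruzsa distance defined above (the distributions below are arbitrary small examples chosen purely for illustration; in {{\bf F}_2^n} the sum {X'+Y'} is just the bitwise XOR of independent samples):

```python
import itertools
from math import log2

def H(dist):
    """Shannon entropy (in bits) of a distribution {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def sum_of_copies(dX, dY):
    """Distribution of X' + Y' (bitwise XOR) for independent copies X' ~ dX, Y' ~ dY."""
    out = {}
    for x, px in dX.items():
        for y, py in dY.items():
            z = tuple(a ^ b for a, b in zip(x, y))
            out[z] = out.get(z, 0.0) + px * py
    return out

def ruzsa_dist(dX, dY):
    """Entropic Ruzsa distance d[X;Y] = H[X'+Y'] - H[X]/2 - H[Y]/2."""
    return H(sum_of_copies(dX, dY)) - 0.5 * H(dX) - 0.5 * H(dY)

# Example in F_2^2: X uniform on the subgroup {00, 01}, Y uniform on all of F_2^2.
dX = {(0, 0): 0.5, (0, 1): 0.5}
dY = {v: 0.25 for v in itertools.product((0, 1), repeat=2)}

print(ruzsa_dist(dX, dX))  # 0.0: a uniform distribution on a subgroup is at distance 0 from itself
print(ruzsa_dist(dX, dY))  # 0.5
print(ruzsa_dist(dY, dY))  # 0.0
```

In particular {d[U_H;U_H]=0} whenever {U_H} is uniform on a subgroup {H}, which is the kind of behavior that Theorem 2 seeks to approximate.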

The key proposition needed to establish Theorem 2 is the following distance decrement property:

Proposition 3 (Distance decrement) If {X,Y} are {{\bf F}_2^n}-valued random variables, then one can find {{\bf F}_2^n}-valued random variables {X',Y'} such that

\displaystyle  d[X';Y'] \leq (1-\eta) d[X;Y]

and

\displaystyle  d[X;X'], d[Y;Y'] \leq C d[X;Y]

for some absolute constants {C, \eta > 0}.

Indeed, suppose this proposition held. Starting with {X,Y} both equal to {X} and iterating, one can then find sequences of random variables {X_n, Y_n} with {X_0=Y_0=X},

\displaystyle  d[X_n;Y_n] \leq (1-\eta)^n d[X;X],

and

\displaystyle  d[X_{n+1};X_n], d[Y_{n+1};Y_n] \leq C (1-\eta)^n d[X;X].

In particular, from the triangle inequality and geometric series

\displaystyle  d[X_n;X], d[Y_n;X] \leq \frac{C}{\eta} d[X;X].

By weak compactness, some subsequence of the {X_n}, {Y_n} converge to some limiting random variables {X_\infty, Y_\infty}, and by some simple continuity properties of entropic Ruzsa distance, we conclude that

\displaystyle  d[X_\infty;Y_\infty] = 0

and

\displaystyle  d[X_\infty;X], d[Y_\infty;X] \leq \frac{C}{\eta} d[X;X].

Theorem 2 then follows from the “100% inverse theorem” for entropic Ruzsa distance; see the blueprint for details.

To prove Proposition 3, we can reformulate it as follows:

Proposition 4 (Lack of distance decrement implies vanishing) If {X,Y} are {{\bf F}_2^n}-valued random variables, with the property that

\displaystyle  d[X';Y'] > d[X;Y] - \eta ( d[X;Y] + d[X';X] + d[Y';Y] ) \ \ \ \ \ (1)

for all {{\bf F}_2^n}-valued random variables {X',Y'} and some sufficiently small absolute constant {\eta > 0}, then one can derive a contradiction.

Indeed, we may assume from the above proposition that

\displaystyle  d[X';Y'] \leq d[X;Y] - \eta ( d[X; Y] + d[X';X] + d[Y';Y] )

for some {X',Y'}, which will imply Proposition 3 with {C = 1/\eta}.

The entire game is now to use Shannon entropy inequalities and “entropic Ruzsa calculus” to deduce a contradiction from (1) for {\eta} small enough. This we will do below the fold, but before doing so, let us first make some adjustments to (1) that will make it more useful for our purposes. Firstly, because conditional entropic Ruzsa distance (see blueprint for definitions) is an average of unconditional entropic Ruzsa distance, we can automatically upgrade (1) to the conditional version

\displaystyle  d[X'|Z;Y'|W] \geq d[X;Y] - \eta ( d[X;Y] + d[X'|Z;X] + d[Y'|W;Y] )

for any random variables {Z,W} that are possibly coupled with {X',Y'} respectively. In particular, if we define a “relevant” random variable {X'} (conditioned with respect to some auxiliary data {Z}) to be a random variable for which

\displaystyle  d[X'|Z;X] = O( d[X;Y] )

or equivalently (by the triangle inequality)

\displaystyle  d[X'|Z;Y] = O( d[X;Y] )

then we have the useful lower bound

\displaystyle  d[X'|Z;Y'|W] \geq (1-O(\eta)) d[X;Y] \ \ \ \ \ (2)

whenever {X'} and {Y'} are relevant conditioning on {Z, W} respectively. This is quite a useful bound, since the laws of “entropic Ruzsa calculus” will tell us, roughly speaking, that virtually any random variable that we can create from taking various sums of copies of {X,Y} and conditioning against other sums, will be relevant. (Informally: the space of relevant random variables is {(1-O(\eta))d[X;Y]}-separated with respect to the entropic Ruzsa distance.)

— 1. Main argument —

Now we derive more and more consequences of (2) – at some point crucially using the hypothesis that we are in characteristic two – before we reach a contradiction.

Right now, our hypothesis (2) only supplies lower bounds on entropic distances. The crucial ingredient that allows us to proceed is what we call the fibring identity, which lets us convert these lower bounds into useful upper bounds as well, which in fact match up very nicely when {\eta} is small. Informally, the fibring identity captures the intuitive fact that the doubling constant of a set {A} should be at least as large as the doubling constant of the image {\pi(A)} of that set under a homomorphism, times the doubling constant of a typical fiber {A \cap \pi^{-1}(\{z\})} of that homomorphism; and furthermore, one should only be close to equality if the fibers “line up” in some sense.

Here is the fibring identity:

Proposition 5 (Fibring identity) Let {\pi: G \rightarrow H} be a homomorphism. Then for any independent {G}-valued random variables {X, Y}, one has

\displaystyle  d[X;Y] = d[\pi(X); \pi(Y)] + d[X|\pi(X); Y|\pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].

The proof is of course in the blueprint, but given that it is a central pillar of the argument, I reproduce it here.

Proof: Expanding out the definition of Ruzsa distance, and using the conditional entropy chain rule

\displaystyle  {\mathbf H}[X] = {\mathbf H}[\pi(X)] + {\mathbf H}[X|\pi(X)]

and

\displaystyle  {\mathbf H}[Y] = {\mathbf H}[\pi(Y)] + {\mathbf H}[Y|\pi(Y)],

it suffices to establish the identity

\displaystyle  {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X), \pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].

But from the chain rule again we have

\displaystyle  {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X)-\pi(Y)]

and from the definition of conditional mutual information (using the fact that {\pi(X)-\pi(Y)} is determined both by {X-Y} and by {(\pi(X),\pi(Y))}) one has

\displaystyle  {\mathbf H}[X - Y|\pi(X)-\pi(Y)] = {\mathbf H}[X - Y|\pi(X), \pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ]

giving the claim. \Box
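Since this identity is the central pillar of the argument, here is a minimal numerical sanity check of Proposition 5 (the group, the homomorphism, and the distributions below are arbitrary choices made purely for illustration, and all the entropies are computed by brute force from the definitions; note that in {{\bf F}_2^n} subtraction coincides with addition):

```python
import random
from math import log2, isclose

def H(d):
    """Shannon entropy (bits) of a distribution {outcome: probability}."""
    return -sum(p * log2(p) for p in d.values() if p > 0)

def push(dX, dY, f):
    """Distribution of f(X, Y) for independent X ~ dX, Y ~ dY."""
    out = {}
    for x, px in dX.items():
        for y, py in dY.items():
            k = f(x, y)
            out[k] = out.get(k, 0.0) + px * py
    return out

def image(d, f):
    """Pushforward of a distribution under the map f."""
    out = {}
    for x, p in d.items():
        out[f(x)] = out.get(f(x), 0.0) + p
    return out

def marg(d, idx):
    """Marginal of a joint distribution on tuples onto the coordinates in idx."""
    return image(d, lambda k: tuple(k[i] for i in idx))

def rand_dist(outcomes):
    w = [random.random() for _ in outcomes]
    return {o: wi / sum(w) for o, wi in zip(outcomes, w)}

# G = F_2^2 (pairs of bits), H = F_2, pi = projection to the first coordinate.
G = [(a, b) for a in (0, 1) for b in (0, 1)]
add = lambda x, y: tuple(a ^ b for a, b in zip(x, y))
pi = lambda x: x[0]

dX, dY = rand_dist(G), rand_dist(G)

# d[X;Y] = H[X+Y] - H[X]/2 - H[Y]/2
lhs = H(push(dX, dY, add)) - 0.5 * H(dX) - 0.5 * H(dY)

# d[pi(X); pi(Y)]
dpX, dpY = image(dX, pi), image(dY, pi)
term1 = H(push(dpX, dpY, lambda a, b: a ^ b)) - 0.5 * H(dpX) - 0.5 * H(dpY)

# d[X|pi(X); Y|pi(Y)] = H[X+Y | pi(X), pi(Y)] - H[X|pi(X)]/2 - H[Y|pi(Y)]/2
J = push(dX, dY, lambda x, y: (add(x, y), pi(x), pi(y)))
term2 = (H(J) - H(marg(J, (1, 2)))
         - 0.5 * (H(dX) - H(dpX)) - 0.5 * (H(dY) - H(dpY)))

# I[X+Y : (pi(X), pi(Y)) | pi(X)+pi(Y)] = H[A,C] + H[B,C] - H[A,B,C] - H[C]
K = push(dX, dY, lambda x, y: (add(x, y), (pi(x), pi(y)), pi(x) ^ pi(y)))
term3 = H(marg(K, (0, 2))) + H(marg(K, (1, 2))) - H(K) - H(marg(K, (2,)))

assert isclose(lhs, term1 + term2 + term3, abs_tol=1e-9)
print("fibring identity verified:", lhs, "=", term1, "+", term2, "+", term3)
```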

We will only care about the characteristic {2} setting here, so we will now assume that all groups involved are {2}-torsion, so that we can replace all subtractions with additions. If we specialize the fibring identity to the case where {G = {\bf F}_2^n \times {\bf F}_2^n}, {H = {\bf F}_2^n}, {\pi: G \rightarrow H} is the addition map {\pi(x,y) = x+y}, and {X = (X_1, X_2)}, {Y = (Y_1, Y_2)} are pairs of independent random variables in {{\bf F}_2^n}, we obtain the following corollary:

Corollary 6 Let {X_1,X_2,Y_1,Y_2} be independent {{\bf F}_2^n}-valued random variables. Then we have the identity

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] = d[X_1+X_2;Y_1+Y_2]

\displaystyle  + d[X_1|X_1+X_2;Y_1|Y_1+Y_2]

\displaystyle  + I[(X_1+Y_1, X_2+Y_2) : (X_1+X_2,Y_1+Y_2) | X_1+X_2+Y_1+Y_2 ].

This is a useful and flexible identity, especially when combined with (2). For instance, we can discard the conditional mutual information term as being non-negative, to obtain the inequality

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] \geq d[X_1+X_2;Y_1+Y_2]

\displaystyle  + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].

If we let {X_1, Y_1, X_2, Y_2} be independent copies of {X, Y, Y, X} respectively (note the swap in the last two variables!) we obtain

\displaystyle  2 d[X;Y] \geq d[X+Y;X+Y] + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].

From entropic Ruzsa calculus, one can check that {X+Y}, {X_1|X_1+X_2}, and {Y_1|Y_1+Y_2} are all relevant random variables, so from (2) we now obtain both upper and lower bounds for {d[X+Y;X+Y]}:

\displaystyle  d[X+Y; X+Y] = (1 + O(\eta)) d[X;Y].

A pleasant upshot of this is that we now get to work in the symmetric case {X=Y} without loss of generality. Indeed, if we set {X^* := X+Y}, we now have from (2) that

\displaystyle  d[X'|Z; Y'|W] \geq (1-O(\eta)) d[X^*;X^*] \ \ \ \ \ (3)

whenever {X'|Z, Y'|W} are relevant, which by entropic Ruzsa calculus is equivalent to asking that

\displaystyle  d[X'|Z; X^*], d[Y'|W; X^*] = O(d[X^*;X^*]).

Now we use the fibring identity again, relabeling {Y_1,Y_2} as {X_3,X_4} and requiring {X_1,X_2,X_3,X_4} to be independent copies of {X^*}. We conclude that

\displaystyle  2d[X^*; X^*] = d[X_1+X_2;X_3+X_4] + d[X_1|X_1+X_2;X_3|X_3+X_4]

\displaystyle  + I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ].

As before, the random variables {X_1+X_2}, {X_3+X_4}, {X_1|X_1+X_2}, {X_3|X_3+X_4} are all relevant, so from (3) we have

\displaystyle  d[X_1+X_2;X_3+X_4], d[X_1|X_1+X_2;X_3|X_3+X_4]

\displaystyle  \geq (1-O(\eta)) d[X^*;X^*].

We could now also match these lower bounds with upper bounds, but the more important takeaway from this analysis is a really good bound on the conditional mutual information:

\displaystyle  I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ]

\displaystyle = O(\eta) d[X^*;X^*].

By the data processing inequality, we can discard some of the randomness here, and conclude

\displaystyle  I[X_1+X_3 : X_1+X_2 | X_1+X_2+X_3+X_4 ] = O(\eta) d[X^*;X^*].

Let us introduce the random variables

\displaystyle  S := X_1+X_2+X_3+X_4; U := X_1+X_2; V := X_1 + X_3

then we have

\displaystyle  I[U : V | S] = O(\eta) d[X^*;X^*].

Intuitively, this means that {U} and {V} are very nearly independent given {S}. For sake of argument, let us assume that they are actually independent; one can achieve something resembling this by invoking the entropic Balog-Szemerédi-Gowers theorem, established in the blueprint, after conceding some losses of {O(\eta) d[X^*,X^*]} in the entropy, but we skip over the details for this blog post. The key point now is that because we are in characteristic {2}, {U+V} has the same form as {U} or {V}:

\displaystyle  U + V = (X_1 + X_2) + (X_1 + X_3) = X_2 + X_3.

In particular, by permutation symmetry, we have

\displaystyle  {\mathbf H}[U+V|S] ={\mathbf H}[U|S] ={\mathbf H}[V|S],

and so by the definition of conditional Ruzsa distance we have a massive distance decrement

\displaystyle  {\bf E}_s d[U|S=s; V|S=s] = 0,

(where {s} is drawn from the distribution of {S}), contradicting (1) as desired. (In reality, we end up decreasing the distance not all the way to zero, but instead to {O(\eta d[X^*,X^*])} due to losses in the Balog-Szemerédi-Gowers theorem, but this is still enough to reach a contradiction. The quantity {{\bf E}_s d[U|S=s; V|S=s]} is very similar to {d[U|S; V|S]}, but is slightly different; the latter quantity is {{\bf E}_{s,s'}d[U|S=s; V|S=s']}.)

Remark 7 A similar argument works in the {m}-torsion case for general {m}. Instead of decrementing the entropic Ruzsa distance, one instead decrements a “multidistance”

\displaystyle  {\mathbf H}[X_1 + \dots + X_m] - \frac{1}{m} ({\mathbf H}[X_1] + \dots + {\mathbf H}[X_m])

for independent {X_1,\dots,X_m}. By an iterated version of the fibring identity, one can first reduce again to the symmetric case where the random variables are all copies of the same variable {X^*}. If one then takes {X_{i,j}}, {i,j=1,\dots,m} to be an array of {m^2} copies of {X^*}, one can get to the point where the row sums {\sum_i X_{i,j}} and the column sums {\sum_j X_{i,j}} have small conditional mutual information with respect to the double sum {S := \sum_i \sum_j X_{i,j}}. If we then set {U := \sum_i \sum_j j X_{i,j}} and {V := \sum_i \sum_j i X_{i,j}}, the data processing inequality again shows that {U} and {V} are nearly independent given {S}. The {m}-torsion now crucially intervenes as before to ensure that {U+V = \sum_i \sum_j (i+j) X_{i,j}} has the same form as {U} or {V}, leading to a contradiction as before. See this previous blog post for more discussion.

I’ve just uploaded to the arXiv my paper “On product representations of squares“. This short paper answers (in the negative) a (somewhat obscure) question of Erdös. Namely, for any {k \geq 1}, let {F_k(N)} be the size of the largest subset {A} of {\{1,\dots,N\}} with the property that no {k} distinct elements of {A} multiply to a square. In a paper by Erdös, Sárközy, and Sós, the following asymptotics were shown for fixed {k}:

  • {F_1(N) = (1+o(1)) N}.
  • {F_2(N) = (\frac{6}{\pi^2} + o(1)) N}.
  • {F_3(N) = (1+o(1)) N}.
  • {F_{4k}(N) = (1+o(1)) \frac{N}{\log N}} for {k \geq 1}.
  • {F_{4k+2}(N) = (\frac{3}{2}+o(1)) \frac{N}{\log N}} for {k \geq 1}.
  • {(\log 2 + o(1)) N \leq F_{2k+1}(N) \leq N} for {k \geq 2}.
Thus the asymptotics for {F_k(N)} for odd {k \geq 5} were not completely settled. Erdös asked if one had {F_k(N) = (1-o(1)) N} for odd {k \geq 5}. The main result of this paper is that this is not the case; that is to say, there exists {c_k>0} such that any subset {A} of {\{1,\dots,N\}} of cardinality at least {(1-c_k) N} will contain {k} distinct elements that multiply to a square, if {N} is large enough. In fact, the argument works for all {k \geq 4}, although it is not new in the even case. I will also note that there are now quite sharp upper and lower bounds on {F_k} for even {k \geq 4}, using methods from graph theory: see this recent paper of Pach and Vizer for the latest results in this direction. Thanks to the results of Granville and Soundararajan, we know that the constant {c_k} cannot exceed the Hall-Montgomery constant

\displaystyle  1 - \log(1+\sqrt{e}) + 2 \int_1^{\sqrt{e}} \frac{\log t}{t+1}\ dt = 0.171500\dots

and I (very tentatively) conjecture that this is in fact the optimal value for this constant. This looks somewhat difficult, but a more feasible conjecture would be that the {c_k} asymptotically approach the Hall-Montgomery constant as {k \rightarrow \infty}, since the aforementioned result of Granville and Soundararajan morally corresponds to the {k=\infty} case.
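(As a quick sanity check of the displayed numerical value, one can evaluate the integral with, say, Simpson’s rule; this is only a sketch for verification purposes.)

```python
from math import log, sqrt, e

f = lambda t: log(t) / (t + 1)
a, b, N = 1.0, sqrt(e), 10000          # composite Simpson's rule on [1, sqrt(e)]
h = (b - a) / N
integral = (h / 3) * (f(a) + f(b)
                      + 4 * sum(f(a + i * h) for i in range(1, N, 2))
                      + 2 * sum(f(a + i * h) for i in range(2, N, 2)))
print(1 - log(1 + sqrt(e)) + 2 * integral)   # prints approximately 0.171500...
```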

In the end, the argument turned out to be relatively simple; no advanced results from additive combinatorics, graph theory, or analytic number theory were required. I found it convenient to proceed via the probabilistic method (although the more combinatorial technique of double counting would also suffice here). The main idea is to generate a tuple {(\mathbf{n}_1,\dots,\mathbf{n}_k)} of distinct random natural numbers in {\{1,\dots,N\}} which multiply to a square, and which are reasonably uniformly distributed throughout {\{1,\dots,N\}}, in that each individual number {1 \leq n \leq N} is attained by one of the random variables {\mathbf{n}_i} with a probability of {O(1/N)}. If one can find such a distribution, then if the density of {A} is sufficiently close to {1}, it will happen with positive probability that each of the {\mathbf{n}_i} will lie in {A}, giving the claim.

When {k=3}, this strategy cannot work, as it contradicts the arguments of Erdös, Sárközy, and Sós. The reason can be explained as follows. The most natural way to generate a triple {(\mathbf{n}_1,\mathbf{n}_2,\mathbf{n}_3)} of random natural numbers in {\{1,\dots,N\}} which multiply to a square is to set

\displaystyle  \mathbf{n}_1 := \mathbf{d}_{12} \mathbf{d}_{13}, \mathbf{n}_2 := \mathbf{d}_{12} \mathbf{d}_{23}, \mathbf{n}_3 := \mathbf{d}_{13} \mathbf{d}_{23}

for some random natural numbers {\mathbf{d}_{12}, \mathbf{d}_{13}, \mathbf{d}_{23}}. But if one wants all these numbers to have magnitude {\asymp N}, one sees on taking logarithms that one would need

\displaystyle  \log \mathbf{d}_{12} + \log \mathbf{d}_{13}, \log \mathbf{d}_{12} + \log \mathbf{d}_{23}, \log \mathbf{d}_{13} + \log \mathbf{d}_{23} = \log N + O(1)

which by elementary linear algebra forces

\displaystyle  \log \mathbf{d}_{12}, \log \mathbf{d}_{13}, \log \mathbf{d}_{23} = \frac{1}{2} \log N + O(1),

so in particular each of the {\mathbf{n}_i} would have a factor comparable to {\sqrt{N}}. However, it follows from known results on the “multiplication table problem” (how many distinct integers are there in the {n \times n} multiplication table?) that most numbers up to {N} do not have a factor comparable to {\sqrt{N}}. (Quick proof: by the Hardy–Ramanujan law, a typical number of size {N} or of size {\sqrt{N}} has {(1+o(1)) \log\log N} prime factors, so a number of size {N} that split into two typical factors of size {\sqrt{N}} would have about {2\log\log N} prime factors; hence typically a number of size {N} will not factor into two factors of size {\sqrt{N}}.) So the above strategy cannot work for {k=3}.
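As an aside, the multiplication table phenomenon is easy to observe numerically; the following minimal sketch counts the proportion of distinct products in the {n \times n} multiplication table, which visibly decays as {n} grows:

```python
for n in (100, 300, 1000, 2000):
    products = {a * b for a in range(1, n + 1) for b in range(1, n + 1)}
    print(n, len(products) / n ** 2)
```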

However, the situation changes for larger {k}. For instance, for {k=4}, we can try the same strategy with the ansatz

\displaystyle \mathbf{n}_1 = \mathbf{d}_{12} \mathbf{d}_{13} \mathbf{d}_{14}; \quad \mathbf{n}_2 = \mathbf{d}_{12} \mathbf{d}_{23} \mathbf{d}_{24}; \quad \mathbf{n}_3 = \mathbf{d}_{13} \mathbf{d}_{23} \mathbf{d}_{34}; \quad \mathbf{n}_4 = \mathbf{d}_{14} \mathbf{d}_{24} \mathbf{d}_{34}.

Whereas before there were three (approximate) equations constraining three unknowns, now we would have four equations and six unknowns, and so we no longer have strong constraints on any of the {\mathbf{d}_{ij}}. So in principle we now have a chance to find a suitable random choice of the {\mathbf{d}_{ij}}. The most significant remaining obstacle is the Hardy–Ramanujan law: since the {\mathbf{n}_i} typically have {(1+o(1))\log\log N} prime factors, it is natural in this {k=4} case to choose each {\mathbf{d}_{ij}} to have {(\frac{1}{3}+o(1)) \log\log N} prime factors. As it turns out, if one does this (basically by requiring each prime {p \leq N^{\varepsilon^2}} to divide {\mathbf{d}_{ij}} with an independent probability of about {\frac{1}{3p}}, for some small {\varepsilon>0}, and then also adding in one large prime to bring the magnitude of the {\mathbf{n}_i} to be comparable to {N}), the calculations all work out, and one obtains the claimed result.
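To illustrate the shape of the ansatz, here is a toy sketch (the {\mathbf{d}_{ij}} below are drawn uniformly from a small range, rather than via the prime-by-prime construction just described); the point is simply that the product {\mathbf{n}_1 \mathbf{n}_2 \mathbf{n}_3 \mathbf{n}_4} is automatically a perfect square, since each {\mathbf{d}_{ij}} appears in exactly two of the {\mathbf{n}_i}:

```python
import random
from math import isqrt

def sample_tuple(max_d=50):
    """Toy version of the k=4 ansatz: draw the six d_ij at random and form the n_i."""
    d = {(i, j): random.randint(2, max_d) for i in range(1, 5) for j in range(i + 1, 5)}
    n = [1, 1, 1, 1]
    for (i, j), v in d.items():
        n[i - 1] *= v
        n[j - 1] *= v
    return n

for _ in range(5):
    n1, n2, n3, n4 = sample_tuple()
    prod = n1 * n2 * n3 * n4
    assert isqrt(prod) ** 2 == prod   # each d_ij occurs exactly twice in the product
    print(n1, n2, n3, n4)
```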

The purpose of this post is to report an erratum to the 2012 paper “An inverse theorem for the Gowers {U^{s+1}[N]}-norm” of Ben Green, myself, and Tamar Ziegler (previously discussed in this blog post). The main results of this paper have been superseded by stronger quantitative results, first in work of Manners (using somewhat different methods), and more recently in a remarkable paper of Leng, Sah, and Sawhney which combined the methods of our paper with several new innovations to obtain quite strong bounds (of quasipolynomial type); see also an alternate proof of our main results (again by quite different methods) by Candela and Szegedy. In the course of their work, Leng, Sah, and Sawhney discovered some fixable but nontrivial errors in our paper. These (rather technical) issues were already implicitly corrected in this followup work which supersedes our own paper, but for the sake of completeness we are also providing a formal erratum for our original paper, which can be found here. We thank Leng, Sah, and Sawhney for bringing these issues to our attention.

Excluding some minor (mostly typographical) issues which we also have reported in this erratum, the main issues stemmed from a conflation of two notions of a degree {s} filtration

\displaystyle  G = G_0 \geq G_1 \geq \dots \geq G_s \geq G_{s+1} = \{1\}

of a group {G}, which is a nested sequence of subgroups that obey the relation {[G_i,G_j] \leq G_{i+j}} for all {i,j}. The weaker notion (sometimes known as a prefiltration) permits the group {G_1} to be strictly smaller than {G_0}, while the stronger notion requires {G_0} and {G_1} to be equal. In practice, one can often move between the two concepts, as {G_1} is always normal in {G_0}, and a prefiltration behaves like a filtration on every coset of {G_1} (after applying a translation and perhaps also a conjugation). However, we did not clarify this issue sufficiently in the paper, and there are some places in the text where results that were only proven for filtrations were applied to prefiltrations. The erratum fixes these issues, mostly by clarifying that we work with filtrations throughout (which requires some decomposition into cosets in places where prefiltrations are generated). Similar adjustments need to be made for multidegree filtrations and degree-rank filtrations, which we also use heavily in our paper.

In most cases, fixing this issue only required minor changes to the text, but there is one place (Section 8) where there was a non-trivial problem: we used the claim that the final group {G_s} was central, which is true for filtrations, but not necessarily for prefiltrations. This fact (or more precisely, a multidegree variant of it) was used to claim a factorization for a certain product of nilcharacters, which is in fact not true as stated. In the erratum, a substitute factorization for a slightly different product of nilcharacters is provided, which is still sufficient to conclude the main result of this part of the paper (namely, a statistical linearization of a certain family of nilcharacters in the shift parameter {h}).

Again, we stress that these issues do not impact the paper of Leng, Sah, and Sawhney, as they adapted the methods in our paper in a fashion that avoids these errors.

A recent paper of Kra, Moreira, Richter, and Robertson established the following theorem, resolving a question of Erdös. Given a discrete amenable group {G = (G,+)}, and a subset {A} of {G}, we define the Banach density of {A} to be the quantity

\displaystyle  \sup_\Phi \limsup_{N \rightarrow \infty} |A \cap \Phi_N|/|\Phi_N|,

where the supremum is over all Følner sequences {\Phi = (\Phi_N)_{N=1}^\infty} of {G}. Given a set {B} in {G}, we define the restricted sumset {B \oplus B} to be the set of all sums {b_1+b_2} where {b_1, b_2} are distinct elements of {B}.

Theorem 1 Let {G} be a countably infinite abelian group with the index {[G:2G]} finite. Let {A} be a positive Banach density subset of {G}. Then there exists an infinite set {B \subset A} and {t \in G} such that {B \oplus B + t \subset A}.

Strictly speaking, the main result of Kra et al. only claims this theorem for the case of the integers {G={\bf Z}}, but as noted in the recent preprint of Charamaras and Mountakis, the argument in fact applies for all countable abelian {G} in which the subgroup {2G := \{ 2x: x \in G \}} has finite index. This condition is in fact necessary (as observed by forthcoming work of Ethan Acklesberg): if {2G} has infinite index, then one can find a subgroup {H_j} of {G} of index {2^j} for any {j \geq 1} that contains {2G} (or equivalently, {G/H_j} is {2}-torsion). If one lets {y_1,y_2,\dots} be an enumeration of {G}, one can then check that the set

\displaystyle  A := G \backslash \bigcup_{j=1}^\infty \big( (H_{j+1} + y_j) \backslash \{y_1,\dots,y_j\} \big)

has positive Banach density, but does not contain any set of the form {B \oplus B + t} for any {t} (indeed, from the pigeonhole principle and the {2}-torsion nature of {G/H_{j+1}} one can show that {B \oplus B + y_j} must intersect {H_{j+1} + y_j \backslash \{y_1,\dots,y_j\}} whenever {B} has cardinality larger than {j 2^{j+1}}). It is also necessary to work with restricted sums {B \oplus B} rather than full sums {B+B}: a counterexample to the latter is provided for instance by the example with {G = {\bf Z}} and {A := \bigcup_{j=1}^\infty [10^j, 1.1 \times 10^j]}. Finally, the presence of the shift {t} is also necessary, as can be seen by considering the example of {A} being the odd numbers in {G ={\bf Z}}, though in the case {G=2G} one can of course delete the shift {t} at the cost of giving up the containment {B \subset A}.

Theorem 1 resembles other theorems in density Ramsey theory, such as Szemerédi’s theorem, but with the notable difference that the pattern located in the dense set {A} is infinite rather than merely arbitrarily large but finite. As such, it does not seem that this theorem can be proven by purely finitary means. However, one can view this result as the conjunction of an infinite number of statements, each of which is a finitary density Ramsey theory statement. To see this, we need some more notation. Observe from Tychonoff’s theorem that the collection {2^G := \{ B: B \subset G \}} is a compact topological space (with the topology of pointwise convergence) (it is also metrizable since {G} is countable). Subsets {{\mathcal F}} of {2^G} can be thought of as properties of subsets of {G}; for instance, the property of a subset {B} of {G} being finite is of this form, as is the complementary property of being infinite. A property of subsets of {G} can then be said to be closed or open if it corresponds to a closed or open subset of {2^G}. Thus, a property is closed if and only if it is closed under pointwise limits, and a property is open if, whenever a set {B} has this property, then any other set {B'} that shares a sufficiently large (but finite) initial segment with {B} will also have this property. Since {2^G} is compact and Hausdorff, a property is closed if and only if it is compact.

The properties of being finite or infinite are neither closed nor open. Define a smallness property to be a closed (or compact) property of subsets of {G} that is only satisfied by finite sets; the complement to this is a largeness property, which is an open property of subsets of {G} that is satisfied by all infinite sets. (One could also choose to impose other axioms on these properties, for instance requiring a largeness property to be an upper set, but we will not do so here.) Examples of largeness properties for a subset {B} of {G} include:

  • {B} has at least {10} elements.
  • {B} is non-empty and has at least {b_1} elements, where {b_1} is the smallest element of {B}.
  • {B} is non-empty and has at least {b_{b_1}} elements, where {b_n} is the {n^{\mathrm{th}}} element of {B}.
  • {T} halts when given {B} as input, where {T} is a given Turing machine that halts whenever given an infinite set as input. (Note that this encompasses the preceding three examples as special cases, by selecting {T} appropriately.)
We will call a set obeying a largeness property {{\mathcal P}} a {{\mathcal P}}-large set.
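For instance, here is a minimal sketch of the second example above, viewed as a predicate on finite sets of natural numbers (the point being that whether a set has this property is determined by a finite initial segment of the set, which is what makes the property open):

```python
def is_large(B):
    """Second largeness property above: B is non-empty and has at least b_1
    elements, where b_1 is the smallest element of B."""
    B = sorted(B)
    return bool(B) and len(B) >= B[0]

print(is_large({3, 5, 9}))     # True: three elements, smallest element is 3
print(is_large({4, 10, 11}))   # False: only three elements, but smallest is 4
print(is_large(set()))         # False: B is empty
```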

Theorem 1 is then equivalent to the following “almost finitary” version (cf. this previous discussion of almost finitary versions of the infinite pigeonhole principle):

Theorem 2 (Almost finitary form of main theorem) Let {G} be a countably infinite abelian group with {[G:2G]} finite. Let {\Phi_n} be a Følner sequence in {G}, let {\delta>0}, and let {{\mathcal P}_t} be a largeness property for each {t \in G}. Then there exists {N} such that if {A \subset G} is such that {|A \cap \Phi_n| / |\Phi_n| \geq \delta} for all {n \leq N}, then there exists a shift {t \in G} such that {A} contains a {{\mathcal P}_t}-large set {B} with {B \oplus B + t \subset A}.

Proof of Theorem 2 assuming Theorem 1. Let {G, \Phi_n}, {\delta}, {{\mathcal P}_t} be as in Theorem 2. Suppose for contradiction that Theorem 2 failed. Then for each {N} we can find {A_N} with {|A_N \cap \Phi_n| / |\Phi_n| \geq \delta} for all {n \leq N}, such that there is no {t} and {{\mathcal P}_t}-large {B} such that {B, B \oplus B + t \subset A_N}. By compactness, a subsequence of the {A_N} converges pointwise to a set {A}, which then has Banach density at least {\delta}. By Theorem 1, there is an infinite set {B} and a {t} such that {B, B \oplus B + t \subset A}. By openness, we conclude that there exists a finite {{\mathcal P}_t}-large set {B'} contained in {B}, thus {B', B' \oplus B' + t \subset A}. This implies that {B', B' \oplus B' + t \subset A_N} for infinitely many {N}, a contradiction.

Proof of Theorem 1 assuming Theorem 2. Let {G, A} be as in Theorem 1. If the claim failed, then for each {t}, the property {{\mathcal P}_t} of being a set {B} for which {B, B \oplus B + t \subset A} would be a smallness property. By Theorem 2, we see that there is a {t} and a {B} obeying the complement of this property such that {B, B \oplus B + t \subset A}, a contradiction.

Remark 3 Define a relation {R} between {2^G} and {2^G \times G} by declaring {A\ R\ (B,t)} if {B \subset A} and {B \oplus B + t \subset A}. The key observation that makes the above equivalences work is that this relation is continuous in the sense that if {U} is an open subset of {2^G \times G}, then the inverse image

\displaystyle R^{-1} U := \{ A \in 2^G: A\ R\ (B,t) \hbox{ for some } (B,t) \in U \}

is also open. Indeed, if {A\ R\ (B,t)} for some {(B,t) \in U}, then {B} contains a finite set {B'} such that {(B',t) \in U}, and then any {A'} that contains both {B'} and {B' \oplus B' + t} lies in {R^{-1} U}.

For each specific largeness property, such as the examples listed previously, Theorem 2 can be viewed as a finitary assertion (at least if the property is “computable” in some sense), but if one quantifies over all largeness properties, then the theorem becomes infinitary. In the spirit of the Paris-Harrington theorem, I would in fact expect some cases of Theorem 2 to be undecidable statements of Peano arithmetic, although I do not have a rigorous proof of this assertion.

Despite the complicated finitary interpretation of this theorem, I was still interested in trying to write the proof of Theorem 1 in some sort of “pseudo-finitary” manner, in which one can see analogies with finitary arguments in additive combinatorics. The proof of Theorem 1 that I give below the fold is my attempt to achieve this, although to avoid a complete explosion of “epsilon management” I will still use at one juncture an ergodic theory reduction from the original paper of Kra et al. that relies on such infinitary tools as the ergodic decomposition, the ergodic theorem, and the spectral theorem. Also some of the steps will be a little sketchy, and assume some familiarity with additive combinatorics tools (such as the arithmetic regularity lemma).


This post contains two unrelated announcements. Firstly, I would like to promote a useful list of resources for AI in Mathematics, that was initiated by Talia Ringer (with the crowdsourced assistance of many others) during the National Academies workshop on “AI in mathematical reasoning” last year. This list is now accepting new contributions, updates, or corrections; please feel free to submit them directly to the list (which I am helping Talia to edit). Incidentally, next week there will be a second followup webinar to the aforementioned workshop, building on the topics covered there. (The first webinar may be found here.)

Secondly, I would like to advertise the erdosproblems.com website, launched recently by Thomas Bloom. This is intended to be a living repository of the many mathematical problems proposed in various venues by Paul Erdős, who was particularly noted for his influential posing of such problems. For a tour of the site and an explanation of its purpose, I can recommend Thomas’s recent talk on this topic at a conference last week in honor of Timothy Gowers.

Thomas is currently issuing a call for help to develop the erdosproblems.com website in a number of ways (quoting directly from that page):

  • You know Github and could set a suitable project up to allow people to contribute new problems (and corrections to old ones) to the database, and could help me maintain the Github project;
  • You know things about web design and have suggestions for how this website could look or perform better;
  • You know things about Python/Flask/HTML/SQL/whatever and want to help me code cool new features on the website;
  • You know about accessibility and have an idea how I can make this website more accessible (to any group of people);
  • You are a mathematician who has thought about some of the problems here and wants to write an expanded commentary for one of them, with lots of references, comparisons to other problems, and other miscellaneous insights (mathematician here is interpreted broadly, in that if you have thought about the problems on this site and are willing to write such a commentary you qualify);
  • You knew Erdős and have any memories or personal correspondence concerning a particular problem;
  • You have solved an Erdős problem and I’ll update the website accordingly (and apologies if you solved this problem some time ago);
  • You have spotted a mistake, typo, or duplicate problem, or anything else that has confused you and I’ll correct things;
  • You are a human being with an internet connection and want to volunteer a particular Erdős paper or problem list to go through and add new problems from (please let me know before you start, to avoid duplicate efforts);
  • You have any other ideas or suggestions – there are probably lots of things I haven’t thought of, both in ways this site can be made better, and also what else could be done from this project. Please get in touch with any ideas!

I for instance contributed a problem to the site (#587) that Erdős himself gave to me personally (this was the topic of a somewhat well known photo of Paul and myself, and which he communicated again to me shortly afterwards on a postcard; links to both images can be found by following the above link). As it turns out, this particular problem was essentially solved in 2010 by Nguyen and Vu.

(Incidentally, I also spoke at the same conference that Thomas spoke at, on my recent work with Gowers, Green, and Manners; here is the video of my talk, and here are my slides.)

[This post is dedicated to Luca Trevisan, who recently passed away due to cancer. Though far from his most significant contribution to the field, I would like to mention that, as with most of my other blog posts on this site, this page was written with the assistance of Luca’s LaTeX to WordPress converter. Mathematically, his work and insight on pseudorandomness in particular have greatly informed how I myself think about the concept. – T.]
Recently, Timothy Gowers, Ben Green, Freddie Manners, and I were able to establish the following theorem:

Theorem 1 (Marton’s conjecture) Let {A \subset {\bf F}_2^n} be non-empty with {|A+A| \leq K|A|}. Then there exists a subgroup {H} of {{\bf F}_2^n} with {|H| \leq |A|} such that {A} is covered by at most {2K^C} translates of {H}, for some absolute constant {C}.

We established this result with {C=12}, although it has since been improved to {C=9} by Jyun-Jie Liao.
Our proof was written in order to optimize the constant {C} as much as possible; similarly for the more detailed blueprint of the proof that was prepared in order to formalize the result in Lean. I have been asked a few times whether it is possible to present a streamlined and more conceptual version of the proof in which one does not try to establish an explicit constant {C}, but just to show that the result holds for some constant {C}. This is what I will attempt to do in this post, though some of the more routine steps will be outsourced to the aforementioned blueprint.
The key concept here is that of the entropic Ruzsa distance {d[X;Y]} between two random variables {X,Y} taking values {{\bf F}_2^n}, defined as

\displaystyle  d[X;Y] := {\mathbf H}[X'+Y'] - \frac{1}{2} {\mathbf H}[X] - \frac{1}{2} {\mathbf H}[Y]

where {X',Y'} are independent copies of {X,Y}, and {{\mathbf H}[X]} denotes the Shannon entropy of {X}. This distance is symmetric and non-negative, and obeys the triangle inequality

\displaystyle  d[X;Z] \leq d[X;Y] + d[Y;Z]

for any random variables {X,Y,Z}; see the blueprint for a proof. The above theorem then follows from an entropic analogue:

Theorem 2 (Entropic Marton’s conjecture) Let {X} be a {{\bf F}_2^n}-valued random variable with {d[X;X] \leq \log K}. Then there exists a uniform random variable {U_H} on a subgroup {H} of {{\bf F}_2^n} such that {d[X; U_H] \leq C \log K} for some absolute constant {C}.

We were able to establish Theorem 2 with {C=11}, which implies Theorem 1 with {C=12} by fairly standard additive combinatorics manipulations; see the blueprint for details.
The key proposition needed to establish Theorem 2 is the following distance decrement property:

Proposition 3 (Distance decrement) If {X,Y} are {{\bf F}_2^n}-valued random variables, then one can find {{\bf F}_2^n}-valued random variables {X',Y'} such that

\displaystyle  d[X';Y'] \leq (1-\eta) d[X;Y]

and

\displaystyle  d[X;X'], d[Y,Y'] \leq C d[X;Y]

for some absolute constants {C, \eta > 0}.

Indeed, suppose this proposition held. Starting with {X,Y} both equal to {X} and iterating, one can then find sequences of random variables {X_n, Y_n} with {X_0=Y_0=X},

\displaystyle  d[X_n;Y_n] \leq (1-\eta)^n d[X;X],

and

\displaystyle  d[X_{n+1};X_n], d[Y_{n+1};Y_n] \leq C (1-\eta)^n d[X;X].

In particular, from the triangle inequality and geometric series

\displaystyle  d[X_n;X], d[Y_n;X] \leq \frac{C}{\eta} d[X;X].
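Unpacking this step: since {X_0 = X}, the triangle inequality followed by summing the geometric series gives

\displaystyle  d[X_n;X] \leq \sum_{k=0}^{n-1} d[X_{k+1};X_k] \leq C d[X;X] \sum_{k=0}^{\infty} (1-\eta)^k = \frac{C}{\eta} d[X;X],

and similarly for {d[Y_n;X]}.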

By weak compactness, some subsequence of the {X_n}, {Y_n} converges to limiting random variables {X_\infty, Y_\infty}, and by some simple continuity properties of entropic Ruzsa distance, we conclude that

\displaystyle  d[X_\infty;Y_\infty] = 0

and

\displaystyle  d[X_\infty;X], d[Y_\infty;X] \leq \frac{C}{\eta} d[X;X].

Theorem 2 then follows from the “100% inverse theorem” for entropic Ruzsa distance; see the blueprint for details.
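(For orientation: roughly speaking, the 100% inverse theorem asserts that a random variable at zero entropic Ruzsa distance from another must be uniform on a coset of some subgroup {H} of {{\bf F}_2^n}; taking {U_H} to be the uniform distribution on {H}, one then has {d[X_\infty; U_H] = 0}, and hence by the triangle inequality

\displaystyle  d[X; U_H] \leq d[X;X_\infty] + d[X_\infty;U_H] \leq \frac{C}{\eta} d[X;X] \leq \frac{C}{\eta} \log K,

which is the conclusion of Theorem 2 with a different value of the constant {C}. See the blueprint for the precise formulation.)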
To prove Proposition 3, we can reformulate it as follows:

Proposition 4 (Lack of distance decrement implies vanishing) If {X,Y} are {{\bf F}_2^n}-valued random variables, with the property that

\displaystyle  d[X';Y'] > d[X;Y] - \eta ( d[X;Y] + d[X';X] + d[Y';Y] ) \ \ \ \ \ (1)

for all {{\bf F}_2^n}-valued random variables {X',Y'} and some sufficiently small absolute constant {\eta > 0}, then one can derive a contradiction.

Indeed, if Proposition 4 holds, then the hypothesis (1) must fail for some choice of {X',Y'}; that is to say, we can find {{\bf F}_2^n}-valued random variables {X',Y'} with

\displaystyle  d[X';Y'] \leq d[X;Y] - \eta ( d[X; Y] + d[X';X] + d[Y';Y] )

which gives Proposition 3 with {C = 1/\eta}.
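To spell out this last implication: since all three terms inside the parentheses are non-negative, the displayed inequality yields both

\displaystyle  d[X';Y'] \leq (1-\eta) d[X;Y]

and

\displaystyle  \eta ( d[X';X] + d[Y';Y] ) \leq d[X;Y] - d[X';Y'] \leq d[X;Y],

so that {d[X;X'], d[Y;Y'] \leq \frac{1}{\eta} d[X;Y]}; these are precisely the two conclusions of Proposition 3 with {C = 1/\eta}.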
The entire game is now to use Shannon entropy inequalities and “entropic Ruzsa calculus” to deduce a contradiction from (1) for {\eta} small enough. This we will do below the fold, but before doing so, let us first make some adjustments to (1) that will make it more useful for our purposes. Firstly, because conditional entropic Ruzsa distance (see blueprint for definitions) is an average of unconditional entropic Ruzsa distance, we can automatically upgrade (1) to the conditional version

\displaystyle  d[X'|Z;Y'|W] \geq d[X;Y] - \eta ( d[X;Y] + d[X'|Z;X] + d[Y'|W;Y] )

for any random variables {Z,W} that are possibly coupled with {X',Y'} respectively. In particular, if we define a “relevant” random variable {X'} (conditioned with respect to some auxiliary data {Z}) to be a random variable for which

\displaystyle  d[X'|Z;X] = O( d[X;Y] )

or equivalently (by the triangle inequality)

\displaystyle  d[X'|Z;Y] = O( d[X;Y] )

then we have the useful lower bound

\displaystyle  d[X'|Z;Y'|W] \geq (1-O(\eta)) d[X;Y] \ \ \ \ \ (2)

whenever {X'} and {Y'} are relevant after conditioning on {Z, W} respectively. This is quite a useful bound, since the laws of “entropic Ruzsa calculus” will tell us, roughly speaking, that virtually any random variable that we can create from taking various sums of copies of {X,Y} and conditioning against other sums, will be relevant. (Informally: the space of relevant random variables is {(1-O(\eta))d[X;Y]}-separated with respect to the entropic Ruzsa distance.)

— 1. Main argument —

Now we derive more and more consequences of (2) – at some point crucially using the hypothesis that we are in characteristic two – before we reach a contradiction.
Right now, our hypothesis (2) only supplies lower bounds on entropic distances. The crucial ingredient that allows us to proceed is what we call the fibring identity, which lets us convert these lower bounds into useful upper bounds as well, which in fact match up very nicely when {\eta} is small. Informally, the fibring identity captures the intuitive fact that the doubling constant of a set {A} should be at least as large as the doubling constant of the image {\pi(A)} of that set under a homomorphism, times the doubling constant of a typical fiber {A \cap \pi^{-1}(\{z\})} of that homomorphism; and furthermore, one should only be close to equality if the fibers “line up” in some sense.
Here is the fibring identity:

Proposition 5 (Fibring identity) Let {\pi: G \rightarrow H} be a homomorphism. Then for any independent {G}-valued random variables {X, Y}, one has

\displaystyle  d[X;Y] = d[\pi(X); \pi(Y)] + d[X|\pi(X); Y|\pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].

The proof is of course in the blueprint, but given that it is a central pillar of the argument, I reproduce it here.
Proof: Expanding out the definition of Ruzsa distance, and using the conditional entropy chain rule

\displaystyle  {\mathbf H}[X] = {\mathbf H}[\pi(X)] + {\mathbf H}[X|\pi(X)]

and

\displaystyle  {\mathbf H}[Y] = {\mathbf H}[\pi(Y)] + {\mathbf H}[Y|\pi(Y)],

it suffices to establish the identity

\displaystyle  {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X), \pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].

But from the chain rule again we have

\displaystyle  {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X)-\pi(Y)]

and from the definition of conditional mutual information (using the fact that {\pi(X)-\pi(Y)} is determined both by {X-Y} and by {(\pi(X),\pi(Y))}) one has

\displaystyle  {\mathbf H}[X - Y|\pi(X)-\pi(Y)] = {\mathbf H}[X - Y|\pi(X), \pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ]

giving the claim. \Box
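As a quick sanity check on the bookkeeping (again my own illustration, not taken from the paper or blueprint), one can verify the fibring identity numerically in the toy case {G = {\bf F}_2^2}, {H = {\bf F}_2}, {\pi(x_1,x_2) = x_1+x_2}, encoding group elements as two-bit integers so that addition is XOR. Since {X} and {Y} are independent, each (conditional) Ruzsa distance below can be computed directly from their joint distribution, without passing to fresh independent copies.

import math, random
from collections import defaultdict

def H(joint, f):
    """Entropy (in nats) of f(x, y), where (x, y) has the given joint distribution."""
    dist = defaultdict(float)
    for (x, y), p in joint.items():
        dist[f(x, y)] += p
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def cond_H(joint, f, g):
    """Conditional entropy H[f | g] = H[(f, g)] - H[g]."""
    return H(joint, lambda x, y: (f(x, y), g(x, y))) - H(joint, g)

def cond_I(joint, f, g, h):
    """Conditional mutual information I[f : g | h]."""
    return (cond_H(joint, f, h) + cond_H(joint, g, h)
            - cond_H(joint, lambda x, y: (f(x, y), g(x, y)), h))

def random_dist(size, rng):
    w = [rng.random() for _ in range(size)]
    s = sum(w)
    return [wi / s for wi in w]

pi = lambda g: (g & 1) ^ ((g >> 1) & 1)   # the addition homomorphism F_2^2 -> F_2

rng = random.Random(1)
for _ in range(100):
    pX, pY = random_dist(4, rng), random_dist(4, rng)
    # X and Y are independent, so their joint distribution is a product.
    joint = {(x, y): pX[x] * pY[y] for x in range(4) for y in range(4)}

    d_XY = (H(joint, lambda x, y: x ^ y)
            - 0.5 * H(joint, lambda x, y: x) - 0.5 * H(joint, lambda x, y: y))
    d_proj = (H(joint, lambda x, y: pi(x) ^ pi(y))
              - 0.5 * H(joint, lambda x, y: pi(x)) - 0.5 * H(joint, lambda x, y: pi(y)))
    # d[X|pi(X); Y|pi(Y)], using independence to average over (pi(X), pi(Y)) jointly.
    d_cond = (cond_H(joint, lambda x, y: x ^ y, lambda x, y: (pi(x), pi(y)))
              - 0.5 * cond_H(joint, lambda x, y: x, lambda x, y: pi(x))
              - 0.5 * cond_H(joint, lambda x, y: y, lambda x, y: pi(y)))
    mi = cond_I(joint,
                lambda x, y: x ^ y,                   # X - Y (= X + Y in characteristic 2)
                lambda x, y: (pi(x), pi(y)),          # (pi(X), pi(Y))
                lambda x, y: pi(x) ^ pi(y))           # pi(X) - pi(Y)
    assert abs(d_XY - (d_proj + d_cond + mi)) < 1e-9
print("fibring identity verified on 100 random examples")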
We will only care about the characteristic {2} setting here, so we will now assume that all groups involved are {2}-torsion, so that we can replace all subtractions with additions. If we specialize the fibring identity to the case where {G = {\bf F}_2^n \times {\bf F}_2^n}, {H = {\bf F}_2^n}, {\pi: G \rightarrow H} is the addition map {\pi(x,y) = x+y}, and {X = (X_1, X_2)}, {Y = (Y_1, Y_2)} are pairs of independent random variables in {{\bf F}_2^n}, we obtain the following corollary:

Corollary 6 Let {X_1,X_2,Y_1,Y_2} be independent {{\bf F}_2^n}-valued random variables. Then we have the identity

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] = d[X_1+X_2;Y_1+Y_2]

\displaystyle  + d[X_1|X_1+X_2;Y_1|Y_1+Y_2]

\displaystyle  + I[(X_1+Y_1, X_2+Y_2) : (X_1+X_2,Y_1+Y_2) | X_1+X_2+Y_1+Y_2 ].

This is a useful and flexible identity, especially when combined with (2). For instance, we can discard the conditional mutual information term as being non-negative, to obtain the inequality

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] \geq d[X_1+X_2;Y_1+Y_2]

\displaystyle  + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].

If we let {X_1, Y_1, X_2, Y_2} be independent copies of {X, Y, Y, X} respectively (note the swap in the last two variables!) we obtain

\displaystyle  2 d[X;Y] \geq d[X+Y;X+Y] + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].
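To spell out the bookkeeping in this specialization: the entropic Ruzsa distance depends only on the distributions of its two arguments, and {X_1+X_2} and {Y_1+Y_2} are both copies of {X+Y} (the sum of independent copies of {X} and {Y}), so

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] = d[X;Y] + d[Y;X] = 2 d[X;Y]

and

\displaystyle  d[X_1+X_2; Y_1+Y_2] = d[X+Y; X+Y].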

From entropic Ruzsa calculus, one can check that {X+Y}, {X_1|X_1+X_2}, and {Y_1|Y_1+Y_2} are all relevant random variables, so from (2) we now obtain both upper and lower bounds for {d[X+Y;X+Y]}:

\displaystyle  d[X+Y; X+Y] = (1 + O(\eta)) d[X;Y].
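In more detail: the lower bound {d[X+Y;X+Y] \geq (1-O(\eta)) d[X;Y]} comes directly from (2), while for the upper bound we apply (2) to the second term on the right-hand side of the preceding inequality to conclude

\displaystyle  d[X+Y;X+Y] \leq 2 d[X;Y] - d[X_1|X_1+X_2;Y_1|Y_1+Y_2] \leq (1+O(\eta)) d[X;Y].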

A pleasant upshot of this is that we now get to work in the symmetric case {X=Y} without loss of generality. Indeed, if we set {X^* := X+Y}, we now have from (2) that

\displaystyle  d[X'|Z; Y'|W] \geq (1-O(\eta)) d[X^*;X^*] \ \ \ \ \ (3)

whenever {X'|Z, Y'|W} are relevant, which by entropic Ruzsa calculus is equivalent to asking that

\displaystyle  d[X'|Z; X^*], d[Y'|W; X^*] = O(d[X^*;X^*]).

Now we use the fibring identity again, relabeling {Y_1,Y_2} as {X_3,X_4} and requiring {X_1,X_2,X_3,X_4} to be independent copies of {X^*}. We conclude that

\displaystyle  2d[X^*; X^*] = d[X_1+X_2;X_3+X_4] + d[X_1|X_1+X_2;X_3|X_3+X_4]

\displaystyle  + I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ].

As before, the random variables {X_1+X_2}, {X_3+X_4}, {X_1|X_1+X_2}, {X_3|X_3+X_4} are all relevant, so from (3) we have

\displaystyle  d[X_1+X_2;X_3+X_4], d[X_1|X_1+X_2;X_3|X_3+X_4]

\displaystyle  \geq (1-O(\eta)) d[X^*;X^*].

We could now also match these lower bounds with upper bounds, but the more important takeaway from this analysis is a really good bound on the conditional mutual information:

\displaystyle  I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ]

\displaystyle = O(\eta) d[X^*;X^*].
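Indeed, the identity expresses this conditional mutual information as {2d[X^*;X^*]} minus the two Ruzsa distances just bounded from below, so it is at most

\displaystyle  2d[X^*;X^*] - 2(1-O(\eta)) d[X^*;X^*] = O(\eta) d[X^*;X^*].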

By the data processing inequality (noting that {X_1+X_3} is a function of {(X_1+X_3, X_2+X_4)}, and {X_1+X_2} is a function of {(X_1+X_2,X_3+X_4)}), we can discard some of the randomness here, and conclude

\displaystyle  I[X_1+X_3 : X_1+X_2 | X_1+X_2+X_3+X_4 ] = O(\eta) d[X^*;X^*].

Let us introduce the random variables

\displaystyle  Z := X_1+X_2+X_3+X_4; U := X_1+X_2; V := X_1 + X_3

then we have

\displaystyle  I[U : V | Z] = O(\eta) d[X^*;X^*].

Intuitively, this means that {U} and {V} are very nearly independent given {Z}. For sake of argument, let us assume that they are actually independent; one can achieve something resembling this by invoking the entropic Balog-Szemerédi-Gowers theorem, established in the blueprint, after conceding some losses of {O(\eta) d[X^*;X^*]} in the entropy, but we skip over the details for this blog post. The key point now is that because we are in characteristic {2}, {U+V} has the same form as {U} or {V}:

\displaystyle  U + V = (X_1+X_2) + (X_1+X_3) = X_2 + X_3,

since {X_1 + X_1 = 0}.

In particular, by permutation symmetry, we have

\displaystyle  {\mathbf H}[U+V|Z] = {\mathbf H}[U|Z] = {\mathbf H}[V|Z],

and so by the definition of conditional Ruzsa distance we have a massive distance decrement

\displaystyle  d[U|Z; V|Z] = 0,

contradicting (1) as desired. (In reality, we end up decreasing the distance not all the way to zero, but instead to {O(\eta d[X^*;X^*])} due to losses in the Balog-Szemerédi-Gowers theorem, but this is still enough to reach a contradiction.)

Remark 7 A similar argument works in the {m}-torsion case for general {m}. Instead of decrementing the entropic Ruzsa distance, one instead decrements a “multidistance”

\displaystyle  {\mathbf H}[X_1 + \dots + X_m] - \frac{1}{m} ({\mathbf H}[X_1] + \dots + {\mathbf H}[X_m])

for independent {X_1,\dots,X_m}. By an iterated version of the fibring identity, one can first reduce again to the symmetric case where the random variables are all copies of the same variable {X^*}. If one then takes {X_{i,j}}, {i,j=1,\dots,m} to be an array of {m^2} copies of {X^*}, one can get to the point where the row sums {\sum_i X_{i,j}} and the column sums {\sum_j X_{i,j}} have small conditional mutual information with respect to the double sum {S := \sum_i \sum_j X_{i,j}}. If we then set {U := \sum_i \sum_j j X_{i,j}} and {V := \sum_i \sum_j i X_{i,j}}, the data processing inequality again shows that {U} and {V} are nearly independent given {S}. The {m}-torsion now crucially intervenes as before to ensure that {U+V = \sum_i \sum_j (i+j) X_{i,j}} has the same form as {U} or {V}, leading to a contradiction as before. See this previous blog post for more discussion.

The first progress prize competition for the AI Mathematical Olympiad has now launched. (Disclosure: I am on the advisory committee for the prize.) This is a competition in which contestants submit an AI model which, after the submissions deadline on June 27, will be tested (on a fixed computational resource, without internet access) on a set of 50 “private” test math problems, each of which has as its answer an integer between 0 and 999. Prior to the close of submissions, the models can be tested on 50 “public” test math problems (where the results of the model are public, but not the problems themselves), as well as 10 training problems that are available to all contestants. At the time of writing, the leaderboard shows that the best-performing model has solved 4 of the 50 questions (a standard benchmark, Gemma 7B, had previously solved 3 out of 50). A total of $2^{20} ($1.048 million) has been allocated for various prizes associated to this competition. More detailed rules can be found here.
