You are currently browsing the category archive for the ‘expository’ category.

Many modern mathematical proofs are a combination of conceptual arguments and technical calculations. There is something of a tradeoff between the two: one can add more conceptual arguments to try to reduce the technical computations, or vice versa. (Among other things, this leads to a Berkson paradox-like phenomenon in which a negative correlation can be observed between the two aspects of a proof; see this recent Mastodon post of mine for more discussion.)

In a recent article, Heather Macbeth argues that the preferred balance between conceptual and computational arguments is quite different for a computer-assisted proof than it is for a purely human-readable proof. In the latter, there is a strong incentive to minimize the amount of calculation to the point where it can be checked by hand, even if this requires a certain amount of ad hoc rearrangement of cases, unmotivated parameter selection, or otherwise non-conceptual additions to the arguments in order to reduce the calculation. But in the former, once one is willing to outsource any tedious verification or optimization task to a computer, the incentives are reversed: freed from the need to arrange the argument to reduce the amount of calculation, one can now describe an argument by listing the main ingredients and then letting the computer figure out a suitable way to combine them to give the stated result. The two approaches can thus be viewed as complementary ways to describe a result, with neither necessarily being superior to the other.

In this post, I would like to illustrate this computation-outsourced approach with the topic of zero-density theorems for the Riemann zeta function, in which all computer verifiable calculations (as well as other routine but tedious arguments) are performed “off-stage”, with the intent of focusing only on the conceptual inputs to these theorems.

Zero-density theorems concern upper bounds for the quantity {N(\sigma,T)} for a given {1/2 \leq \sigma \leq 1} and large {T}, which is defined as the number of zeroes of the Riemann zeta function in the rectangle {\{ \beta+i\gamma: \sigma \leq \beta \leq 1; 0 \leq \gamma \leq T \}}. (There is also an important generalization of this quantity to {L}-functions, but for simplicity we will focus on the classical zeta function case here). Such quantities are important in analytic number theory for many reasons, one of which is through explicit formulae such as the Riemann-von Mangoldt explicit formula

\displaystyle  \sum_{n \leq x}^{\prime} \Lambda(n) = x - \sum_{\rho:\zeta(\rho)=0} \frac{x^\rho}{\rho} - \log(2\pi) - \frac{1}{2} \log(1-x^{-2}) \ \ \ \ \ (1)

relating the prime numbers to the zeroes of the zeta function (the “music of the primes”). The better bounds one has on {N(\sigma,T)}, the more control one has on the complicated term {\sum_{\rho:\zeta(\rho)=0} \frac{x^\rho}{\rho}} on the right-hand side.

Clearly {N(\sigma,T)} is non-increasing in {\sigma}. The Riemann-von Mangoldt formula, together with the functional equation, gives us the asymptotic

\displaystyle  N(1/2,T) \asymp T \log T

in the {\sigma=1/2} case, while the prime number theorem tells us that

\displaystyle  N(1,T) = 0. \ \ \ \ \ (2)

The various zero free regions for the zeta function can be viewed as slight improvements to (2); for instance, the classical zero-free region is equivalent to the assertion that {N(\sigma,T)} vanishes if {\sigma > 1 - c/\log T} for some small absolute constant {c>0}, and the Riemann hypothesis is equivalent to the assertion that {N(\sigma,T)=0} for all {\sigma>1/2}.

Experience has shown that the most important quantity to control here is the exponent {A(\sigma)}, defined as the least constant for which one has an asymptotic

\displaystyle  N(\sigma,T) = T^{A(\sigma)(1-\sigma)+o(1)}

as {T \rightarrow \infty}. Thus, for instance,

\displaystyle  A(1/2) = 2, \ \ \ \ \ (3)

{A(1) = 0}, and {A(\sigma)(1-\sigma)} is a non-increasing function of {\sigma}, so we obtain the trivial “von Mangoldt” zero density theorem

\displaystyle  A(\sigma) \leq \frac{1}{1-\sigma}.

Of particular interest is the supremal value {\|A\|_\infty := \sup_{1/2 \leq \sigma \leq 1} A(\sigma)} of {A}, which has to be at least {2} thanks to (3). The density hypothesis asserts that the maximum is in fact exactly {2}, or equivalently that

\displaystyle  A(\sigma) \leq 2, \ \ \ \ \ (4)

for all {1/2 \leq \sigma \leq 1}. This is of course implied by the Riemann hypothesis (which clearly implies that {A(\sigma)=0} for all {\sigma>1/2}), but is a more tractable hypothesis to work with; for instance, the hypothesis is already known to hold for {\sigma \geq 25/32 = 0.78125} by the work of Bourgain (building upon many previous authors). The quantity {\|A\|_\infty} directly impacts our understanding of the prime number theorem in short intervals; indeed, it is not difficult using (1) (as well as the Vinogradov-Korobov zero-free region) to establish a short interval prime number theorem

\displaystyle  \sum_{x \leq n \leq x + x^\theta} \Lambda(n) = (1+o(1)) x^\theta

for all {x \rightarrow \infty} if {1 - \frac{1}{\|A\|_\infty} < \theta < 1} is a fixed exponent, or for almost all {x \rightarrow \infty} if {1 - \frac{2}{\|A\|_\infty} < \theta < 1} is a fixed exponent. Until recently, the best upper bound for {\|A\|_\infty} was {12/5 = 2.4}, thanks to a 1972 result of Huxley; but this was recently lowered to {30/13=2.307\ldots} in a breakthrough work of Guth and Maynard.

In between the papers of Huxley and Guth-Maynard are dozens of additional improvements on {A(\sigma)}, though it is only the Guth-Maynard paper that actually lowered the supremum norm {\|A\|_\infty}. A summary of most of the state of the art before Guth-Maynard may be found in Table 2 of this recent paper of Trudgian and Yang; it is complicated, but it is easy enough to get a computer to illustrate it with a plot:

(For an explanation of what is going on under the assumption of the Lindelöf hypothesis, see below the fold.) This plot represents the combined effort of nearly a dozen papers, each one of which claims one or more components of the depicted piecewise smooth curve, and is written in the “human-readable” style mentioned above, where the argument is arranged to reduce the amount of tedious computation to human-verifiable levels, even if this comes the cost of obscuring the conceptual ideas. (For an animation of how this bound improved over time, see here.) Below the fold, I will try to describe (in sketch form) some of the standard ingredients that go into these papers, in particular the routine reduction of deriving zero density estimates from large value theorems for Dirichlet series. We will not attempt to rewrite the entire literature of zero-density estimates in this fashion, but focus on some illustrative special cases.

Read the rest of this entry »

[This post is dedicated to Luca Trevisan, who recently passed away due to cancer. Though far from his most significant contribution to the field, I would like to mention that, as with most of my other blog posts on this site, this page was written with the assistance of Luca’s LaTeX to WordPress converter. Mathematically, his work and insight on pseudorandomness in particular have greatly informed how I myself think about the concept. – T.]

Recently, Timothy Gowers, Ben Green, Freddie Manners, and I were able to establish the following theorem:

Theorem 1 (Marton’s conjecture) Let {A \subset {\bf F}_2^n} be non-empty with {|A+A| \leq K|A|}. Then there exists a subgroup {H} of {{\bf F}_2^n} with {|H| \leq |A|} such that {A} is covered by at most {2K^C} translates of {H}, for some absolute constant {C}.

We established this result with {C=12}, although it has since been improved to {C=9} by Jyun-Jie Liao.

Our proof was written in order to optimize the constant {C} as much as possible; similarly for the more detailed blueprint of the proof that was prepared in order to formalize the result in Lean. I have been asked a few times whether it is possible to present a streamlined and more conceptual version of the proof in which one does not try to establish an explicit constant {C}, but just to show that the result holds for some constant {C}. This is what I will attempt to do in this post, though some of the more routine steps will be outsourced to the aforementioned blueprint.

The key concept here is that of the entropic Ruzsa distance {d[X;Y]} between two random variables {X,Y} taking values {{\bf F}_2^n}, defined as

\displaystyle  d[X;Y] := {\mathbf H}[X'+Y'] - \frac{1}{2} {\mathbf H}[X] - \frac{1}{2} {\mathbf H}[Y]

where {X',Y'} are independent copies of {X,Y}, and {{\mathbf H}[X]} denotes the Shannon entropy of {X}. This distance is symmetric and non-negative, and obeys the triangle inequality

\displaystyle  d[X;Z] \leq d[X;Y] + d[Y;Z]

for any random variables {X,Y,Z}; see the blueprint for a proof. The above theorem then follows from an entropic analogue:

Theorem 2 (Entropic Marton’s conjecture) Let {X} be a {{\bf F}_2^n}-valued random variable with {d[X;X] \leq \log K}. Then there exists a uniform random variable {U_H} on a subgroup {H} of {{\bf F}_2^n} such that {d[X; U_H] \leq C \log K} for some absolute constant {C}.

We were able to establish Theorem 2 with {C=11}, which implies Theorem 1 with {C=12} by fairly standard additive combinatorics manipulations (such as the Ruzsa covering lemma); see the blueprint for details.

The key proposition needed to establish Theorem 2 is the following distance decrement property:

Proposition 3 (Distance decrement) If {X,Y} are {{\bf F}_2^n}-valued random variables, then one can find {{\bf F}_2^n}-valued random variables {X',Y'} such that

\displaystyle  d[X';Y'] \leq (1-\eta) d[X;Y]

and

\displaystyle  d[X;X'], d[Y,Y'] \leq C d[X;Y]

for some absolute constants {C, \eta > 0}.

Indeed, suppose this proposition held. Starting with {X,Y} both equal to {X} and iterating, one can then find sequences of random variables {X_n, Y_n} with {X_0=Y_0=X},

\displaystyle  d[X_n;Y_n] \leq (1-\eta)^n d[X;X],

and

\displaystyle  d[X_{n+1};X_n], d[Y_{n+1};Y_n] \leq C (1-\eta)^n d[X;X].

In particular, from the triangle inequality and geometric series

\displaystyle  d[X_n;X], d[Y_n;X] \leq \frac{C}{\eta} d[X;X].

By weak compactness, some subsequence of the {X_n}, {Y_n} converge to some limiting random variables {X_\infty, Y_\infty}, and by some simple continuity properties of entropic Ruzsa distance, we conclude that

\displaystyle  d[X_\infty;Y_\infty] = 0

and

\displaystyle  d[X_\infty;X], d[Y_\infty;X] \leq \frac{C}{\eta} d[X;X].

Theorem 2 then follows from the “100% inverse theorem” for entropic Ruzsa distance; see the blueprint for details.

To prove Proposition 3, we can reformulate it as follows:

Proposition 4 (Lack of distance decrement implies vanishing) If {X,Y} are {{\bf F}_2^n}-valued random variables, with the property that

\displaystyle  d[X';Y'] > d[X;Y] - \eta ( d[X;Y] + d[X';X] + d[Y',Y] ) \ \ \ \ \ (1)

for all {{\bf F}_2^n}-valued random variables {X',Y'} and some sufficiently small absolute constant {\eta > 0}, then one can derive a contradiction.

Indeed, we may assume from the above proposition that

\displaystyle  d[X';Y'] \leq d[X;Y] - \eta ( d[X; Y] + d[X';X] + d[Y',Y] )

for some {X',Y'}, which will imply Proposition 3 with {C = 1/\eta}.

The entire game is now to use Shannon entropy inequalities and “entropic Ruzsa calculus” to deduce a contradiction from (1) for {\eta} small enough. This we will do below the fold, but before doing so, let us first make some adjustments to (1) that will make it more useful for our purposes. Firstly, because conditional entropic Ruzsa distance (see blueprint for definitions) is an average of unconditional entropic Ruzsa distance, we can automatically upgrade (1) to the conditional version

\displaystyle  d[X'|Z;Y'|W] \geq d[X;Y] - \eta ( d[X;Y] + d[X'|Z;X] + d[Y'|W;Y] )

for any random variables {Z,W} that are possibly coupled with {X',Y'} respectively. In particular, if we define a “relevant” random variable {X'} (conditioned with respect to some auxiliary data {Z}) to be a random variable for which

\displaystyle  d[X'|Z;X] = O( d[X;Y] )

or equivalently (by the triangle inequality)

\displaystyle  d[X'|Z;Y] = O( d[X;Y] )

then we have the useful lower bound

\displaystyle  d[X'|Z;Y'|W] \geq (1-O(\eta)) d[X;Y] \ \ \ \ \ (2)

whenever {X'} and {Y'} are relevant conditioning on {Z, W} respectively. This is quite a useful bound, since the laws of “entropic Ruzsa calculus” will tell us, roughly speaking, that virtually any random variable that we can create from taking various sums of copies of {X,Y} and conditioning against other sums, will be relevant. (Informally: the space of relevant random variables is {(1-O(\eta))d[X;Y]}-separated with respect to the entropic Ruzsa distance.)

— 1. Main argument —

Now we derive more and more consequences of (2) – at some point crucially using the hypothesis that we are in characteristic two – before we reach a contradiction.

Right now, our hypothesis (2) only supplies lower bounds on entropic distances. The crucial ingredient that allows us to proceed is what we call the fibring identity, which lets us convert these lower bounds into useful upper bounds as well, which in fact match up very nicely when {\eta} is small. Informally, the fibring identity captures the intuitive fact that the doubling constant of a set {A} should be at least as large as the doubling constant of the image {\pi(A)} of that set under a homomorphism, times the doubling constant of a typical fiber {A \cap \pi^{-1}(\{z\})} of that homomorphism; and furthermore, one should only be close to equality if the fibers “line up” in some sense.

Here is the fibring identity:

Proposition 5 (Fibring identity) Let {\pi: G \rightarrow H} be a homomorphism. Then for any independent {G}-valued random variables {X, Y}, one has

\displaystyle  d[X;Y] = d[\pi(X); \pi(Y)] + d[X|\pi(X); Y|\pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-\pi(Y) ].

The proof is of course in the blueprint, but given that it is a central pillar of the argument, I reproduce it here.

Proof: Expanding out the definition of Ruzsa distance, and using the conditional entropy chain rule

\displaystyle  {\mathbf H}[X] = {\mathbf H}[\pi(X)] + {\mathbf H}[X|\pi(X)]

and

\displaystyle  {\mathbf H}[Y] = {\mathbf H}[\pi(Y)] + {\mathbf H}[Y|\pi(Y)],

it suffices to establish the identity

\displaystyle  {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X), \pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-(Y) ].

But from the chain rule again we have

\displaystyle  {\mathbf H}[X-Y] = {\mathbf H}[\pi(X)-\pi(Y)] + {\mathbf H}[X - Y|\pi(X)-\pi(Y)]

and from the definition of conditional mutual information (using the fact that {\pi(X)-\pi(Y)} is determined both by {X-Y} and by {(\pi(X),\pi(Y))}) one has

\displaystyle  {\mathbf H}[X - Y|\pi(X)-\pi(Y)] = {\mathbf H}[X - Y|\pi(X), \pi(Y)]

\displaystyle  + I[X-Y : \pi(X),\pi(Y) | \pi(X)-(Y) ]

giving the claim. \Box

We will only care about the characteristic {2} setting here, so we will now assume that all groups involved are {2}-torsion, so that we can replace all subtractions with additions. If we specialize the fibring identity to the case where {G = {\bf F}_2^n \times {\bf F}_2^n}, {H = {\bf F}_2^n}, {\pi: G \rightarrow H} is the addition map {\pi(x,y) = x+y}, and {X = (X_1, X_2)}, {Y = (Y_1, Y_2)} are pairs of independent random variables in {{\bf F}_2^n}, we obtain the following corollary:

Corollary 6 Let {X_1,X_2,Y_1,Y_2} be independent {{\bf F}_2^n}-valued random variables. Then we have the identity

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] = d[X_1+X_2;Y_1+Y_2]

\displaystyle  + d[X_1|X_1+X_2;Y_1|Y_1+Y_2]

\displaystyle  + I[(X_1+Y_1, X_2+Y_2) : (X_1+X_2,Y_1+Y_2) | X_1+X_2+Y_1+Y_2 ].

This is a useful and flexible identity, especially when combined with (2). For instance, we can discard the conditional mutual information term as being non-negative, to obtain the inequality

\displaystyle  d[X_1;Y_1] + d[X_2;Y_2] \geq d[X_1+X_2;Y_1+Y_2]

\displaystyle  + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].

If we let {X_1, Y_1, X_2, Y_2} be independent copies of {X, Y, Y, X} respectively (note the swap in the last two variables!) we obtain

\displaystyle  2 d[X;Y] \geq d[X+Y;X+Y] + d[X_1|X_1+X_2;Y_1|Y_1+Y_2].

From entropic Ruzsa calculus, one can check that {X+Y}, {X_1|X_1+X_2}, and {Y_1|Y_1+Y_2} are all relevant random variables, so from (2) we now obtain both upper and lower bounds for {d[X+Y;X+Y]}:

\displaystyle  d[X+Y; X+Y] = (1 + O(\eta)) d[X;Y].

A pleasant upshot of this is that we now get to work in the symmetric case {X=Y} without loss of generality. Indeed, if we set {X^* := X+Y}, we now have from (2) that

\displaystyle  d[X'|Z; Y'|W] \geq (1-O(\eta)) d[X^*;X^*] \ \ \ \ \ (3)

whenever {X'|Z, Y'|W} are relevant, which by entropic Ruzsa calculus is equivalent to asking that

\displaystyle  d[X'|Z; X^*], d[Y'|W; X^*] = O(d[X^*;X^*]).

Now we use the fibring identity again, relabeling {Y_1,Y_2} as {X_3,X_4} and requiring {X_1,X_2,X_3,X_4} to be independent copies of {X^*}. We conclude that

\displaystyle  2d[X^*; X^*] = d[X_1+X_2;X_3+Y_4] + d[X_1|X_1+X_2;X_3|X_1+X_4]

\displaystyle  + I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ].

As before, the random variables {X_1+X_2}, {X_3+X_4}, {X_1|X_1+X_2}, {X_3|X_3+X_4} are all relevant, so from (3) we have

\displaystyle  d[X_1+X_2;X_3+X_4], d[X_1|X_1+X_2;X_3|X_1+X_4]

\displaystyle  \geq (1-O(\eta)) d[X^*;X^*].

We could now also match these lower bounds with upper bounds, but the more important takeaway from this analysis is a really good bound on the conditional mutual information:

\displaystyle  I[(X_1+X_3, X_2+X_4) : (X_1+X_2,X_3+X_4) | X_1+X_2+X_3+X_4 ]

\displaystyle = O(\eta) d[X^*;X^*].

By the data processing inequality, we can discard some of the randomness here, and conclude

\displaystyle  I[X_1+X_3 : X_1+X_2 | X_1+X_2+X_3+X_4 ] = O(\eta) d[X^*;X^*].

Let us introduce the random variables

\displaystyle  S := X_1+X_2+X_3+X_4; U := X_1+X_2; V = X_1 + X_3

then we have

\displaystyle  I[U : V | S] = O(\eta) d[X^*;X^*].

Intuitively, this means that {U} and {V} are very nearly independent given {S}. For sake of argument, let us assume that they are actually independent; one can achieve something resembling this by invoking the entropic Balog-Szemerédi-Gowers theorem, established in the blueprint, after conceding some losses of {O(\eta) d[X^*,X^*]} in the entropy, but we skip over the details for this blog post. The key point now is that because we are in characteristic {2}, {U+V} has the same form as {U} or {V}:

\displaystyle  U + V = X_2 + X_3.

In particular, by permutation symmetry, we have

\displaystyle  {\mathbf H}[U+V|S] ={\mathbf H}[U|S] ={\mathbf H}[V|S],

and so by the definition of conditional Ruzsa distance we have a massive distance decrement

\displaystyle  {\bf E}_s d[U|S=s; V|S=s] = 0,

(where {s} is drawn from the distribution of {S}), contradicting (1) as desired. (In reality, we end up decreasing the distance not all the way to zero, but instead to {O(\eta d[X^*,X^*])} due to losses in the Balog-Szemerédi-Gowers theorem, but this is still enough to reach a contradiction. The quantity {{\bf E}_s d[U|S=s; V|S=s]} is very similar to {d[U|S; V|S]}, but is slightly different; the latter quantity is {{\bf E}_{s,s'}d[U|S=s; V|S=s']}.)

Remark 7 A similar argument works in the {m}-torsion case for general {m}. Instead of decrementing the entropic Ruzsa distance, one instead decrements a “multidistance”

\displaystyle  {\mathbf H}[X_1 + \dots + X_m] - \frac{1}{m} ({\mathbf H}[X_1] + \dots + {\mathbf H}[X_m])

for independent {X_1,\dots,X_m}. By an iterated version of the fibring identity, one can first reduce again to the symmetric case where the random variables are all copies of the same variable {X^*}. If one then takes {X_{i,j}}, {i,j=1,\dots,m} to be an array of {m^2} copies of {X^*}, one can get to the point where the row sums {\sum_i X_{i,j}} and the column sums {\sum_j X_{i,j}} have small conditional mutual information with respect to the double sum {S := \sum_i \sum_j X_{i,j}}. If we then set {U := \sum_i \sum_j j X_{i,j}} and {V := \sum_i \sum_j i X_{i,j}}, the data processing inequality again shows that {U} and {V} are nearly independent given {S}. The {m}-torsion now crucially intervenes as before to ensure that {U+V = \sum_i \sum_j (i+j) X_{i,j}} has the same form as {U} or {V}, leading to a contradiction as before. See this previous blog post for more discussion.

A recent paper of Kra, Moreira, Richter, and Robertson established the following theorem, resolving a question of Erdös. Given a discrete amenable group {G = (G,+)}, and a subset {A} of {G}, we define the Banach density of {A} to be the quantity

\displaystyle  \sup_\Phi \limsup_{N \rightarrow \infty} |A \cap \Phi_N|/|\Phi_N|,

where the supremum is over all Følner sequences {\Phi = (\Phi_N)_{N=1}^\infty} of {G}. Given a set {B} in {G}, we define the restricted sumset {B \oplus B} to be the set of all pairs {b_1+b_2} where {b_1, b_2} are distinct elements of {B}.

Theorem 1 Let {G} be a countably infinite abelian group with the index {[G:2G]} finite. Let {A} be a positive Banach density subset of {G}. Then there exists an infinite set {B \subset A} and {t \in G} such that {B \oplus B + t \subset A}.

Strictly speaking, the main result of Kra et al. only claims this theorem for the case of the integers {G={\bf Z}}, but as noted in the recent preprint of Charamaras and Mountakis, the argument in fact applies for all countable abelian {G} in which the subgroup {2G := \{ 2x: x \in G \}} has finite index. This condition is in fact necessary (as observed by forthcoming work of Ethan Acklesberg): if {2G} has infinite index, then one can find a subgroup {H_j} of {G} of index {2^j} for any {j \geq 1} that contains {2G} (or equivalently, {G/H_j} is {2}-torsion). If one lets {y_1,y_2,\dots} be an enumeration of {G}, and one can then check that the set

\displaystyle  A := G \backslash \bigcup_{j=1}^\infty (H_{j+1} + y_j) \backslash \{y_1,\dots,y_j\}

has positive Banach density, but does not contain any set of the form {B \oplus B + t} for any {t} (indeed, from the pigeonhole principle and the {2}-torsion nature of {G/H_{j+1}} one can show that {B \oplus B + y_j} must intersect {H_{j+1} + y_j \backslash \{y_1,\dots,y_j\}} whenever {B} has cardinality larger than {j 2^{j+1}}). It is also necessary to work with restricted sums {B \oplus B} rather than full sums {B+B}: a counterexample to the latter is provided for instance by the example with {G = {\bf Z}} and {A := \bigcup_{j=1}^\infty [10^j, 1.1 \times 10^j]}. Finally, the presence of the shift {t} is also necessary, as can be seen by considering the example of {A} being the odd numbers in {G ={\bf Z}}, though in the case {G=2G} one can of course delete the shift {t} at the cost of giving up the containment {B \subset A}.

Theorem 1 resembles other theorems in density Ramsey theory, such as Szemerédi’s theorem, but with the notable difference that the pattern located in the dense set {A} is infinite rather than merely arbitrarily large but finite. As such, it does not seem that this theorem can be proven by purely finitary means. However, one can view this result as the conjunction of an infinite number of statements, each of which is a finitary density Ramsey theory statement. To see this, we need some more notation. Observe from Tychonoff’s theorem that the collection {2^G := \{ B: B \subset G \}} is a compact topological space (with the topology of pointwise convergence) (it is also metrizable since {G} is countable). Subsets {{\mathcal F}} of {2^G} can be thought of as properties of subsets of {G}; for instance, the property a subset {B} of {G} of being finite is of this form, as is the complementary property of being infinite. A property of subsets of {G} can then be said to be closed or open if it corresponds to a closed or open subset of {2^G}. Thus, a property is closed and only if if it is closed under pointwise limits, and a property is open if, whenever a set {B} has this property, then any other set {B'} that shares a sufficiently large (but finite) initial segment with {B} will also have this property. Since {2^G} is compact and Hausdorff, a property is closed if and only if it is compact.

The properties of being finite or infinite are neither closed nor open. Define a smallness property to be a closed (or compact) property of subsets of {G} that is only satisfied by finite sets; the complement to this is a largeness property, which is an open property of subsets of {G} that is satisfied by all infinite sets. (One could also choose to impose other axioms on these properties, for instance requiring a largeness property to be an upper set, but we will not do so here.) Examples of largeness properties for a subset {B} of {G} include:

  • {B} has at least {10} elements.
  • {B} is non-empty and has at least {b_1} elements, where {b_1} is the smallest element of {B}.
  • {B} is non-empty and has at least {b_{b_1}} elements, where {b_n} is the {n^{\mathrm{th}}} element of {B}.
  • {T} halts when given {B} as input, where {T} is a given Turing machine that halts whenever given an infinite set as input. (Note that this encompasses the preceding three examples as special cases, by selecting {T} appropriately.)
We will call a set obeying a largeness property {{\mathcal P}} an {{\mathcal P}}-large set.

Theorem 1 is then equivalent to the following “almost finitary” version (cf. this previous discussion of almost finitary versions of the infinite pigeonhole principle):

Theorem 2 (Almost finitary form of main theorem) Let {G} be a countably infinite abelian group with {[G:2G]} finite. Let {\Phi_n} be a Følner sequence in {G}, let {\delta>0}, and let {{\mathcal P}_t} be a largeness property for each {t \in G}. Then there exists {N} such that if {A \subset G} is such that {|A \cap \Phi_n| / |\Phi_n| \geq \delta} for all {n \leq N}, then there exists a shift {t \in G} and {A} contains a {{\mathcal P}_t}-large set {B} such that {B \oplus B + t \subset A}.

Proof of Theorem 2 assuming Theorem 1. Let {G, \Phi_n}, {\delta}, {{\mathcal P}_t} be as in Theorem 2. Suppose for contradiction that Theorem 2 failed, then for each {N} we can find {A_N} with {|A_N \cap \Phi_n| / |\Phi_n| \geq \delta} for all {n \leq N}, such that there is no {t} and {{\mathcal P}_t}-large {B} such that {B, B \oplus B + t \subset A_N}. By compactness, a subsequence of the {A_N} converges pointwise to a set {A}, which then has Banach density at least {\delta}. By Theorem 1, there is an infinite set {B} and a {t} such that {B, B \oplus B + t \subset A}. By openness, we conclude that there exists a finite {{\mathcal P}_t}-large set {B'} contained in {B}, thus {B', B' \oplus B' + t \subset A}. This implies that {B', B' \oplus B' + t \subset A_N} for infinitely many {N}, a contradiction.

Proof of Theorem 1 assuming Theorem 2. Let {G, A} be as in Theorem 1. If the claim failed, then for each {t}, the property {{\mathcal P}_t} of being a set {B} for which {B, B \oplus B + t \subset A} would be a smallness property. By Theorem 2, we see that there is a {t} and a {B} obeying the complement of this property such that {B, B \oplus B + t \subset A}, a contradiction.

Remark 3 Define a relation {R} between {2^G} and {2^G \times G} by declaring {A\ R\ (B,t)} if {B \subset A} and {B \oplus B + t \subset A}. The key observation that makes the above equivalences work is that this relation is continuous in the sense that if {U} is an open subset of {2^G \times G}, then the inverse image

\displaystyle R^{-1} U := \{ A \in 2^G: A\ R\ (B,t) \hbox{ for some } (B,t) \in U \}

is also open. Indeed, if {A\ R\ (B,t)} for some {(B,t) \in U}, then {B} contains a finite set {B'} such that {(B',t) \in U}, and then any {A'} that contains both {B'} and {B' \oplus B' + t} lies in {R^{-1} U}.

For each specific largeness property, such as the examples listed previously, Theorem 2 can be viewed as a finitary assertion (at least if the property is “computable” in some sense), but if one quantifies over all largeness properties, then the theorem becomes infinitary. In the spirit of the Paris-Harrington theorem, I would in fact expect some cases of Theorem 2 to undecidable statements of Peano arithmetic, although I do not have a rigorous proof of this assertion.

Despite the complicated finitary interpretation of this theorem, I was still interested in trying to write the proof of Theorem 1 in some sort of “pseudo-finitary” manner, in which one can see analogies with finitary arguments in additive combinatorics. The proof of Theorem 1 that I give below the fold is my attempt to achieve this, although to avoid a complete explosion of “epsilon management” I will still use at one juncture an ergodic theory reduction from the original paper of Kra et al. that relies on such infinitary tools as the ergodic decomposition, the ergodic theory, and the spectral theorem. Also some of the steps will be a little sketchy, and assume some familiarity with additive combinatorics tools (such as the arithmetic regularity lemma).

Read the rest of this entry »

Let {S} be a non-empty finite set. If {X} is a random variable taking values in {S}, the Shannon entropy {H[X]} of {X} is defined as

\displaystyle H[X] = -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let {f: S \rightarrow {\bf R}} be a function. Then

\displaystyle  \log \sum_{s \in S} \exp(f(s)) = \sup_X {\bf E} f(X) + {\bf H}[X]. \ \ \ \ \ (1)

Proof: Note that shifting {f} by a constant affects both sides of (1) the same way, so we may normalize {\sum_{s \in S} \exp(f(s)) = 1}. Then {\exp(f(s))} is now the probability distribution of some random variable {Y}, and the inequality can be rewritten as

\displaystyle  0 = \sup_X \sum_{s \in S} {\bf P}[X = s] \log {\bf P}[Y = s] -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

But this is precisely the Gibbs inequality. (The expression inside the supremum can also be written as {-D_{KL}(X||Y)}, where {D_{KL}} denotes Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions {x \mapsto e^x} and {y \mapsto y \log y - y}.) \Box

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.

Theorem 2 (Generalized Cauchy-Schwarz inequality) Let {n \geq 0}, let {S, T_1,\dots,T_n} be finite non-empty sets, and let {\pi_i: S \rightarrow T_i} be functions for each {i=1,\dots,n}. Let {K: S \rightarrow {\bf R}^+} and {f_i: T_i \rightarrow {\bf R}^+} be positive functions for each {i=1,\dots,n}. Then

\displaystyle  \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n (\sum_{t_i \in T_i} f_i(t_i)^{n+1})^{1/(n+1)}

where {Q} is the quantity

\displaystyle  Q := (\sum_{(s_0,\dots,s_n) \in \Omega_n} K(s_0) \dots K(s_n))^{1/(n+1)}

where {\Omega_n} is the set of all tuples {(s_0,\dots,s_n) \in S^{n+1}} such that {\pi_i(s_{i-1}) = \pi_i(s_i)} for {i=1,\dots,n}.

Thus for instance, the identity is trivial for {n=0}. When {n=1}, the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) \leq (\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}

\displaystyle  ( \sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2},

which is easily proven by Cauchy-Schwarz, while for {n=2} the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s))

\displaystyle  \leq (\sum_{s_0,s_1, s_2 \in S: \pi_1(s_0)=\pi_1(s_1); \pi_2(s_1)=\pi_2(s_2)} K(s_0) K(s_1) K(s_2))^{1/3}

\displaystyle (\sum_{t_1 \in T_1} f_1(t_1)^3)^{1/3} (\sum_{t_2 \in T_2} f_2(t_2)^3)^{1/3}

which can also be proven by elementary means. However even for {n=3}, the existing proofs require the “tensor power trick” in order to reduce to the case when the {f_i} are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).

We now prove this inequality. We write {K(s) = \exp(k(s))} and {f_i(t_i) = \exp(g_i(t_i))} for some functions {k: S \rightarrow {\bf R}} and {g_i: T_i \rightarrow {\bf R}}. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

\displaystyle  \sup_X {\bf E} k(X) + \sum_{i=1}^n g_i(\pi_i(X)) + {\bf H}[X]

\displaystyle  \leq \frac{1}{n+1} \sup_{(X_0,\dots,X_n)} {\bf E} k(X_0)+\dots+k(X_n) + {\bf H}[X_0,\dots,X_n]

\displaystyle  + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} (n+1) {\bf E} g_i(Y_i) + {\bf H}[Y_i]

where {X} ranges over random variables taking values in {S}, {X_0,\dots,X_n} range over tuples of random variables taking values in {\Omega_n}, and {Y_i} range over random variables taking values in {T_i}. Comparing the suprema, the claim now reduces to

Lemma 3 (Conditional expectation computation) Let {X} be an {S}-valued random variable. Then there exists a {\Omega_n}-valued random variable {(X_0,\dots,X_n)}, where each {X_i} has the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_n] = (n+1) {\bf H}[X]

\displaystyle - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_n(X)].

Proof: We induct on {n}. When {n=0} we just take {X_0 = X}. Now suppose that {n \geq 1}, and the claim has already been proven for {n-1}, thus one has already obtained a tuple {(X_0,\dots,X_{n-1}) \in \Omega_{n-1}} with each {X_0,\dots,X_{n-1}} having the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_{n-1}] = n {\bf H}[X] - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_{n-1}(X)].

By hypothesis, {\pi_n(X_{n-1})} has the same distribution as {\pi_n(X)}. For each value {t_n} attained by {\pi_n(X)}, we can take conditionally independent copies of {(X_0,\dots,X_{n-1})} and {X} conditioned to the events {\pi_n(X_{n-1}) = t_n} and {\pi_n(X) = t_n} respectively, and then concatenate them to form a tuple {(X_0,\dots,X_n)} in {\Omega_n}, with {X_n} a further copy of {X} that is conditionally independent of {(X_0,\dots,X_{n-1})} relative to {\pi_n(X_{n-1}) = \pi_n(X)}. One can the use the entropy chain rule to compute

\displaystyle  {\bf H}[X_0,\dots,X_n] = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_n)] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_{n-1})] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + ({\bf H}[X_0,\dots,X_{n-1}] - {\bf H}[\pi_n(X_{n-1})])

\displaystyle + ({\bf H}[X_n] - {\bf H}[\pi_n(X_n)])

\displaystyle  ={\bf H}[X_0,\dots,X_{n-1}] + {\bf H}[X_n] - {\bf H}[\pi_n(X_n)]

and the claim now follows from the induction hypothesis. \Box

With a little more effort, one can replace {S} by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.

In my previous post, I walked through the task of formally deducing one lemma from another in Lean 4. The deduction was deliberately chosen to be short and only showcased a small number of Lean tactics. Here I would like to walk through the process I used for a slightly longer proof I worked out recently, after seeing the following challenge from Damek Davis: to formalize (in a civilized fashion) the proof of the following lemma:

Lemma. Let \{a_k\} and \{D_k\} be sequences of real numbers indexed by natural numbers k=0,1,\dots, with a_k non-increasing and D_k non-negative. Suppose also that a_k \leq D_k - D_{k+1} for all k \geq 0. Then a_k \leq \frac{D_0}{k+1} for all k.

Here I tried to draw upon the lessons I had learned from the PFR formalization project, and to first set up a human readable proof of the lemma before starting the Lean formalization – a lower-case “blueprint” rather than the fancier Blueprint used in the PFR project. The main idea of the proof here is to use the telescoping series identity

\displaystyle \sum_{i=0}^k D_i - D_{i+1} = D_0 - D_{k+1}.

Since D_{k+1} is non-negative, and a_i \leq D_i - D_{i+1} by hypothesis, we have

\displaystyle \sum_{i=0}^k a_i \leq D_0

but by the monotone hypothesis on a_i the left-hand side is at least (k+1) a_k, giving the claim.

This is already a human-readable proof, but in order to formalize it more easily in Lean, I decided to rewrite it as a chain of inequalities, starting at a_k and ending at D_0 / (k+1). With a little bit of pen and paper effort, I obtained

a_k = (k+1) a_k / (k+1)

(by field identities)

= (\sum_{i=0}^k a_k) / (k+1)

(by the formula for summing a constant)

\leq (\sum_{i=0}^k a_i) / (k+1)

(by the monotone hypothesis)

\leq (\sum_{i=0}^k D_i - D_{i+1}) / (k+1)

(by the hypothesis a_i \leq D_i - D_{i+1}

= (D_0 - D_{k+1}) / (k+1)

(by telescoping series)

\leq D_0 / (k+1)

(by the non-negativity of D_{k+1}).

I decided that this was a good enough blueprint for me to work with. The next step is to formalize the statement of the lemma in Lean. For this quick project, it was convenient to use the online Lean playground, rather than my local IDE, so the screenshots will look a little different from those in the previous post. (If you like, you can follow this tour in that playground, by clicking on the screenshots of the Lean code.) I start by importing Lean’s math library, and starting an example of a statement to state and prove:

Now we have to declare the hypotheses and variables. The main variables here are the sequences a_k and D_k, which in Lean are best modeled by functions a, D from the natural numbers ℕ to the reals ℝ. (One can choose to “hardwire” the non-negativity hypothesis into the D_k by making D take values in the nonnegative reals {\bf R}^+ (denoted NNReal in Lean), but this turns out to be inconvenient, because the laws of algebra and summation that we will need are clunkier on the non-negative reals (which are not even a group) than on the reals (which are a field). So we add in the variables:

Now we add in the hypotheses, which in Lean convention are usually given names starting with h. This is fairly straightforward; the one thing is that the property of being monotone decreasing already has a name in Lean’s Mathlib, namely Antitone, and it is generally a good idea to use the Mathlib provided terminology (because that library contains a lot of useful lemmas about such terms).

One thing to note here is that Lean is quite good at filling in implied ranges of variables. Because a and D have the natural numbers ℕ as their domain, the dummy variable k in these hypotheses is automatically being quantified over ℕ. We could have made this quantification explicit if we so chose, for instance using ∀ k : ℕ, 0 ≤ D k instead of ∀ k, 0 ≤ D k, but it is not necessary to do so. Also note that Lean does not require parentheses when applying functions: we write D k here rather than D(k) (which in fact does not compile in Lean unless one puts a space between the D and the parentheses). This is slightly different from standard mathematical notation, but is not too difficult to get used to.

This looks like the end of the hypotheses, so we could now add a colon to move to the conclusion, and then add that conclusion:

This is a perfectly fine Lean statement. But it turns out that when proving a universally quantified statement such as ∀ k, a k ≤ D 0 / (k + 1), the first step is almost always to open up the quantifier to introduce the variable k (using the Lean command intro k). Because of this, it is slightly more efficient to hide the universal quantifier by placing the variable k in the hypotheses, rather than in the quantifier (in which case we have to now specify that it is a natural number, as Lean can no longer deduce this from context):

At this point Lean is complaining of an unexpected end of input: the example has been stated, but not proved. We will temporarily mollify Lean by adding a sorry as the purported proof:

Now Lean is content, other than giving a warning (as indicated by the yellow squiggle under the example) that the proof contains a sorry.

It is now time to follow the blueprint. The Lean tactic for proving an inequality via chains of other inequalities is known as calc. We use the blueprint to fill in the calc that we want, leaving the justifications of each step as “sorry”s for now:

Here, we “open“ed the Finset namespace in order to easily access Finset‘s range function, with range k basically being the finite set of natural numbers \{0,\dots,k-1\}, and also “open“ed the BigOperators namespace to access the familiar ∑ notation for (finite) summation, in order to make the steps in the Lean code resemble the blueprint as much as possible. One could avoid opening these namespaces, but then expressions such as ∑ i in range (k+1), a i would instead have to be written as something like Finset.sum (Finset.range (k+1)) (fun i ↦ a i), which looks a lot less like like standard mathematical writing. The proof structure here may remind some readers of the “two column proofs” that are somewhat popular in American high school geometry classes.

Now we have six sorries to fill. Navigating to the first sorry, Lean tells us the ambient hypotheses, and the goal that we need to prove to fill that sorry:

The ⊢ symbol here is Lean’s marker for the goal. The uparrows ↑ are coercion symbols, indicating that the natural number k has to be converted to a real number in order to interact via arithmetic operations with other real numbers such as a k, but we can ignore these coercions for this tour (for this proof, it turns out Lean will basically manage them automatically without need for any explicit intervention by a human).

The goal here is a self-evident algebraic identity; it involves division, so one has to check that the denominator is non-zero, but this is self-evident. In Lean, a convenient way to establish algebraic identities is to use the tactic field_simp to clear denominators, and then ring to verify any identity that is valid for commutative rings. This works, and clears the first sorry:

field_simp, by the way, is smart enough to deduce on its own that the denominator k+1 here is manifestly non-zero (and in fact positive); no human intervention is required to point this out. Similarly for other “clearing denominator” steps that we will encounter in the other parts of the proof.

Now we navigate to the next `sorry`. Lean tells us the hypotheses and goals:

We can reduce the goal by canceling out the common denominator ↑k+1. Here we can use the handy Lean tactic congr, which tries to match two sides of an equality goal as much as possible, and leave any remaining discrepancies between the two sides as further goals to be proven. Applying congr, the goal reduces to

Here one might imagine that this is something that one can prove by induction. But this particular sort of identity – summing a constant over a finite set – is already covered by Mathlib. Indeed, searching for Finset, sum, and const soon leads us to the Finset.sum_const lemma here. But there is an even more convenient path to take here, which is to apply the powerful tactic simp, which tries to simplify the goal as much as possible using all the “simp lemmas” Mathlib has to offer (of which Finset.sum_const is an example, but there are thousands of others). As it turns out, simp completely kills off this identity, without any further human intervention:

Now we move on to the next sorry, and look at our goal:

congr doesn’t work here because we have an inequality instead of an equality, but there is a powerful relative gcongr of congr that is perfectly suited for inequalities. It can also open up sums, products, and integrals, reducing global inequalities between such quantities into pointwise inequalities. If we invoke gcongr with i hi (where we tell gcongr to use i for the variable opened up, and hi for the constraint this variable will satisfy), we arrive at a greatly simplified goal (and a new ambient variable and hypothesis):

Now we need to use the monotonicity hypothesis on a, which we have named ha here. Looking at the documentation for Antitone, one finds a lemma that looks applicable here:

One can apply this lemma in this case by writing apply Antitone.imp ha, but because ha is already of type Antitone, we can abbreviate this to apply ha.imp. (Actually, as indicated in the documentation, due to the way Antitone is defined, we can even just use apply ha here.) This reduces the goal nicely:

The goal is now very close to the hypothesis hi. One could now look up the documentation for Finset.range to see how to unpack hi, but as before simp can do this for us. Invoking simp at hi, we obtain

Now the goal and hypothesis are very close indeed. Here we can just close the goal using the linarith tactic used in the previous tour:

The next sorry can be resolved by similar methods, using the hypothesis hD applied at the variable i:

Now for the penultimate sorry. As in a previous step, we can use congr to remove the denominator, leaving us in this state:

This is a telescoping series identity. One could try to prove it by induction, or one could try to see if this identity is already in Mathlib. Searching for Finset, sum, and sub will locate the right tool (as the fifth hit), but a simpler way to proceed here is to use the exact? tactic we saw in the previous tour:

A brief check of the documentation for sum_range_sub' confirms that this is what we want. Actually we can just use apply sum_range_sub' here, as the apply tactic is smart enough to fill in the missing arguments:

One last sorry to go. As before, we use gcongr to cancel denominators, leaving us with

This looks easy, because the hypothesis hpos will tell us that D (k+1) is nonnegative; specifically, the instance hpos (k+1) of that hypothesis will state exactly this. The linarith tactic will then resolve this goal once it is told about this particular instance:

We now have a complete proof – no more yellow squiggly line in the example. There are two warnings though – there are two variables i and hi introduced in the proof that Lean’s “linter” has noticed are not actually used in the proof. So we can rename them with underscores to tell Lean that we are okay with them not being used:

This is a perfectly fine proof, but upon noticing that many of the steps are similar to each other, one can do a bit of “code golf” as in the previous tour to compactify the proof a bit:

With enough familiarity with the Lean language, this proof actually tracks quite closely with (an optimized version of) the human blueprint.

This concludes the tour of a lengthier Lean proving exercise. I am finding the pre-planning step of the proof (using an informal “blueprint” to break the proof down into extremely granular pieces) to make the formalization process significantly easier than in the past (when I often adopted a sequential process of writing one line of code at a time without first sketching out a skeleton of the argument). (The proof here took only about 15 minutes to create initially, although for this blog post I had to recreate it with screenshots and supporting links, which took significantly more time.) I believe that a realistic near-term goal for AI is to be able to fill in automatically a significant fraction of the sorts of atomic “sorry“s of the size one saw in this proof, allowing one to convert a blueprint to a formal Lean proof even more rapidly.

One final remark: in this tour I filled in the “sorry“s in the order in which they appeared, but there is actually no requirement that one does this, and once one has used a blueprint to atomize a proof into self-contained smaller pieces, one can fill them in in any order. Importantly for a group project, these micro-tasks can be parallelized, with different contributors claiming whichever “sorry” they feel they are qualified to solve, and working independently of each other. (And, because Lean can automatically verify if their proof is correct, there is no need to have a pre-existing bond of trust with these contributors in order to accept their contributions.) Furthermore, because the specification of a “sorry” someone can make a meaningful contribution to the proof by working on an extremely localized component of it without needing the mathematical expertise to understand the global argument. This is not particularly important in this simple case, where the entire lemma is not too hard to understand to a trained mathematician, but can become quite relevant for complex formalization projects.

Since the release of my preprint with Tim, Ben, and Freddie proving the Polynomial Freiman-Ruzsa (PFR) conjecture over {\mathbb F}_2, I (together with Yael Dillies and Bhavik Mehta) have started a collaborative project to formalize this argument in the proof assistant language Lean4. It has been less than a week since the project was launched, but it is proceeding quite well, with a significant fraction of the paper already either fully or partially formalized. The project has been greatly assisted by the Blueprint tool of Patrick Massot, which allows one to write a human-readable “blueprint” of the proof that is linked to the Lean formalization; similar blueprints have been used for other projects, such as Scholze’s liquid tensor experiment. For the PFR project, the blueprint can be found here. One feature of the blueprint that I find particularly appealing is the dependency graph that is automatically generated from the blueprint, and can provide a rough snapshot of how far along the formalization has advanced. For PFR, the latest state of the dependency graph can be found here. At the current time of writing, the graph looks like this:

The color coding of the various bubbles (for lemmas) and rectangles (for definitions) is explained in the legend to the dependency graph, but roughly speaking the green bubbles/rectangles represent lemmas or definitions that have been fully formalized, and the blue ones represent lemmas or definitions which are ready to be formalized (their statements, but not proofs, have already been formalized, as well as those of all prerequisite lemmas and proofs). The goal is to get all the bubbles leading up to and including the “pfr” bubble at the bottom colored in green.

In this post I would like to give a quick “tour” of the project, to give a sense of how it operates. If one clicks on the “pfr” bubble at the bottom of the dependency graph, we get the following:

Here, Blueprint is displaying a human-readable form of the PFR statement. This is coming from the corresponding portion of the blueprint, which also comes with a human-readable proof of this statement that relies on other statements in the project:

(I have cropped out the second half of the proof here, as it is not relevant to the discussion.)

Observe that the “pfr” bubble is white, but has a green border. This means that the statement of PFR has been formalized in Lean, but not the proof; and the proof itself is not ready to be formalized, because some of the prerequisites (in particular, “entropy-pfr” (Theorem 6.16)) do not even have their statements formalized yet. If we click on the “Lean” link below the description of PFR in the dependency graph, we are lead to the (auto-generated) Lean documentation for this assertion:

This is what a typical theorem in Lean looks like (after a procedure known as “pretty printing”). There are a number of hypotheses stated before the colon, for instance that G is a finite elementary abelian group of order 2 (this is how we have chosen to formalize the finite field vector spaces {\bf F}_2^n), that A is a non-empty subset of G (the hypothesis that A is non-empty was not stated in the LaTeX version of the conjecture, but we realized it was necessary in the formalization, and will update the LaTeX blueprint shortly to reflect this) with the cardinality of A+A less than K times the cardinality of A, and the statement after the colon is the conclusion: that A can be contained in the sum c+H of a subgroup H of G and a set c of cardinality at most 2K^{12}.

The astute reader may notice that the above theorem seems to be missing one or two details, for instance it does not explicitly assert that H is a subgroup. This is because the “pretty printing” suppresses some of the information in the actual statement of the theorem, which can be seen by clicking on the “Source” link:

Here we see that H is required to have the “type” of an additive subgroup of G. (Lean’s language revolves very strongly around types, but for this tour we will not go into detail into what a type is exactly.) The prominent “sorry” at the bottom of this theorem asserts that a proof is not yet provided for this theorem, but the intention of course is to replace this “sorry” with an actual proof eventually.

Filling in this “sorry” is too hard to do right now, so let’s look for a simpler task to accomplish instead. Here is a simple intermediate lemma “ruzsa-nonneg” that shows up in the proof:

The expression d[X; Y] refers to something called the entropic Ruzsa distance between X and Y, which is something that is defined elsewhere in the project, but for the current discussion it is not important to know its precise definition, other than that it is a real number. The bubble is blue with a green border, which means that the statement has been formalized, and the proof is ready to be formalized also. The blueprint dependency graph indicates that this lemma can be deduced from just one preceding lemma, called “ruzsa-diff“:

ruzsa-diff” is also blue and bordered in green, so it has the same current status as “ruzsa-nonneg“: the statement is formalized, and the proof is ready to be formalized also, but the proof has not been written in Lean yet. The quantity H[X], by the way, refers to the Shannon entropy of X, defined elsewhere in the project, but for this discussion we do not need to know its definition, other than to know that it is a real number.

Looking at Lemma 3.11 and Lemma 3.13 it is clear how the former will imply the latter: the quantity |H[X] - H[Y]| is clearly non-negative! (There is a factor of 2 present in Lemma 3.11, but it can be easily canceled out.) So it should be an easy task to fill in the proof of Lemma 3.13 assuming Lemma 3.11, even if we still don’t know how to prove Lemma 3.11 yet. Let’s first look at the Lean code for each lemma. Lemma 3.11 is formalized as follows:

Again we have a “sorry” to indicate that this lemma does not currently have a proof. The Lean notation (as well as the name of the lemma) differs a little from the LaTeX version for technical reasons that we will not go into here. (Also, the variables X, \mu, Y, \mu' are introduced at an earlier stage in the Lean file; again, we will ignore this point for the ensuing discussion.) Meanwhile, Lemma 3.13 is currently formalized as

OK, I’m now going to try to fill in the latter “sorry”. In my local copy of the PFR github repository, I open up the relevant Lean file in my editor (Visual Studio Code, with the lean4 extension) and navigate to the “sorry” of “rdist_nonneg”. The accompanying “Lean infoview” then shows the current state of the Lean proof:

Here we see a number of ambient hypotheses (e.g., that G is an additive commutative group, that X is a map from \Omega to G, and so forth; many of these hypotheses are not actually relevant for this particular lemma), and at the bottom we see the goal we wish to prove.

OK, so now I’ll try to prove the claim. This is accomplished by applying a series of “tactics” to transform the goal and/or hypotheses. The first step I’ll do is to put in the factor of 2 that is needed to apply Lemma 3.11. This I will do with the “suffices” tactic, writing in the proof

I now have two goals (and two “sorries”): one to show that 0 \leq 2 d[X;Y] implies 0 \leq d[X,Y], and the other to show that 0 \leq 2 d[X;Y]. (The yellow squiggly underline indicates that this lemma has not been fully proven yet due to the presence of “sorry”s. The dot “.” is a syntactic marker that is useful to separate the two goals from each other, but you can ignore it for this tour.) The Lean tactic “suffices” corresponds, roughly speaking, to the phrase “It suffices to show that …” (or more precisely, “It suffices to show that … . To see this, … . It remains to verify the claim …”) in Mathematical English. For my own education, I wrote a “Lean phrasebook” of further correspondences between lines of Lean code and sentences or phrases in Mathematical English, which can be found here.

Let’s fill in the first “sorry”. The tactic state now looks like this (cropping out some irrelevant hypotheses):

Here I can use a handy tactic “linarith“, which solves any goal that can be derived by linear arithmetic from existing hypotheses:

This works, and now the tactic state reports no goals left to prove on this branch, so we move on to the remaining sorry, in which the goal is now to prove 0 \leq 2 d[X;Y]:

Here we will try to invoke Lemma 3.11. I add the following lines of code:

The Lean tactic “have” roughly corresponds to the Mathematical English phrase “We have the statement…” or “We claim the statement…”; like “suffices”, it splits a goal into two subgoals, though in the reversed order to “suffices”.

I again have two subgoals, one to prove the bound |H[X]-H[Y]| \leq 2 d[X;Y] (which I will call “h”), and then to deduce the previous goal 0 \leq 2 d[X;Y] from h. For the first, I know I should invoke the lemma “diff_ent_le_rdist” that is encoding Lemma 3.11. One way to do this is to try the tactic “exact?”, which will automatically search to see if the goal can already be deduced immediately from an existing lemma. It reports:

So I try this (by clicking on the suggested code, which automatically pastes it into the right location), and it works, leaving me with the final “sorry”:

The lean tactic “exact” corresponds, roughly speaking, to the Mathematical English phrase “But this is exactly …”.

At this point I should mention that I also have the Github Copilot extension to Visual Studio Code installed. This is an AI which acts as an advanced autocomplete that can suggest possible lines of code as one types. In this case, it offered a suggestion which was almost correct (the second line is what we need, whereas the first is not necessary, and in fact does not even compile in Lean):

In any event, “exact?” worked in this case, so I can ignore the suggestion of Copilot this time (it has been very useful in other cases though). I apply the “exact?” tactic a second time and follow its suggestion to establish the matching bound 0 \leq |H[X] - H[Y]|:

(One can find documention for the “abs_nonneg” method here. Copilot, by the way, was also able to resolve this step, albeit with a slightly different syntax; there are also several other search engines available to locate this method as well, such as Moogle. One of the main purposes of the Lean naming conventions for lemmas, by the way, is to facilitate the location of methods such as “abs_nonneg”, which is easier figure out how to search for than a method named (say) “Lemma 1.2.1”.) To fill in the final “sorry”, I try “exact?” one last time, to figure out how to combine h and h' to give the desired goal, and it works!

Note that all the squiggly underlines have disappeared, indicating that Lean has accepted this as a valid proof. The documentation for “ge_trans” may be found here. The reader may observe that this method uses the \geq relation rather than the \leq relation, but in Lean the assertions X \geq Y and Y \leq X are “definitionally equal“, allowing tactics such as “exact” to use them interchangeably. “exact le_trans h’ h” would also have worked in this instance.

It is possible to compactify this proof quite a bit by cutting out several intermediate steps (a procedure sometimes known as “code golf“):

And now the proof is done! In the end, it was literally a “one-line proof”, which makes sense given how close Lemma 3.11 and Lemma 3.13 were to each other.

The current version of Blueprint does not automatically verify the proof (even though it does compile in Lean), so we have to manually update the blueprint as well. The LaTeX for Lemma 3.13 currently looks like this:

I add the “\leanok” macro to the proof, to flag that the proof has now been formalized:

I then push everything back up to the master Github repository. The blueprint will take quite some time (about half an hour) to rebuild, but eventually it does, and the dependency graph (which Blueprint has for some reason decided to rearrange a bit) now shows “ruzsa-nonneg” in green:

And so the formalization of PFR moves a little bit closer to completion. (Of course, this was a particularly easy lemma to formalize, that I chose to illustrate the process; one can imagine that most other lemmas will take a bit more work.) Note that while “ruzsa-nonneg” is now colored in green, we don’t yet have a full proof of this result, because the lemma “ruzsa-diff” that it relies on is not green. Nevertheless, the proof is locally complete at this point; hopefully at some point in the future, the predecessor results will also be locally proven, at which point this result will be completely proven. Note how this blueprint structure allows one to work on different parts of the proof asynchronously; it is not necessary to wait for earlier stages of the argument to be fully formalized to start working on later stages, although I anticipate a small amount of interaction between different components as we iron out any bugs or slight inaccuracies in the blueprint. (For instance, I am suspecting that we may need to add some measurability hypotheses on the random variables X, Y in the above two lemmas to make them completely true, but this is something that should emerge organically as the formalization process continues.)

That concludes the brief tour! If you are interested in learning more about the project, you can follow the Zulip chat stream; you can also download Lean and work on the PFR project yourself, using a local copy of the Github repository and sending pull requests to the master copy if you have managed to fill in one or more of the “sorry”s in the current version (but if you plan to work on anything more large scale than filling in a small lemma, it is good to announce your intention on the Zulip chat to avoid duplication of effort) . (One key advantage of working with a project based around a proof assistant language such as Lean is that it makes large-scale mathematical collaboration possible without necessarily having a pre-established level of trust amongst the collaborators; my fellow repository maintainers and I have already approved several pull requests from contributors that had not previously met, as the code was verified to be correct and we could see that it advanced the project. Conversely, as the above example should hopefully demonstrate, it is possible for a contributor to work on one small corner of the project without necessarily needing to understand all the mathematics that goes into the project as a whole.)

If one just wants to experiment with Lean without going to the effort of downloading it, you can playing try the “Natural Number Game” for a gentle introduction to the language, or the Lean4 playground for an online Lean server. Further resources to learn Lean4 may be found here.

A common task in analysis is to obtain bounds on sums

\displaystyle  \sum_{n \in A} f(n)

or integrals

\displaystyle  \int_A f(x)\ dx

where {A} is some simple region (such as an interval) in one or more dimensions, and {f} is an explicit (and elementary) non-negative expression involving one or more variables (such as {n} or {x}, and possibly also some additional parameters. Often, one would be content with an order of magnitude upper bound such as

\displaystyle  \sum_{n \in A} f(n) \ll X

or

\displaystyle  \int_A f(x)\ dx \ll X

where we use {X \ll Y} (or {Y \gg X} or {X = O(Y)}) to denote the bound {|X| \leq CY} for some constant {C}; sometimes one wishes to also obtain the matching lower bound, thus obtaining

\displaystyle  \sum_{n \in A} f(n) \asymp X

or

\displaystyle  \int_A f(x)\ dx \asymp X

where {X \asymp Y} is synonymous with {X \ll Y \ll X}. Finally, one may wish to obtain a more precise bound, such as

\displaystyle  \sum_{n \in A} f(n) = (1+o(1)) X

where {o(1)} is a quantity that goes to zero as the parameters of the problem go to infinity (or some other limit). (For a deeper dive into asymptotic notation in general, see this previous blog post.)

Here are some typical examples of such estimation problems, drawn from recent questions on MathOverflow:

  • (i) (From this question) If {d,p \geq 1} and {a>d/p}, is the expression

    \displaystyle  \sum_{j \in {\bf Z}} 2^{(\frac{d}{p}+1-a)j} \int_0^\infty e^{-2^j s} \frac{s^a}{1+s^{2a}}\ ds

    finite?
  • (ii) (From this question) If {h,m \geq 1}, how can one show that

    \displaystyle  \sum_{d=0}^\infty \frac{2d+1}{2h^2 (1 + \frac{d(d+1)}{h^2}) (1 + \frac{d(d+1)}{h^2m^2})^2} \ll 1 + \log(m^2)?

  • (iii) (From this question) Can one show that

    \displaystyle  \sum_{k=1}^{n-1} \frac{k^{2n-4k-3}(n^2-2nk+2k^2)}{(n-k)^{2n-4k-1}} = (c+o(1)) \sqrt{n}

    as {n \rightarrow \infty} for an explicit constant {c}, and what is this constant?

Compared to other estimation tasks, such as that of controlling oscillatory integrals, exponential sums, singular integrals, or expressions involving one or more unknown functions (that are only known to lie in some function spaces, such as an {L^p} space), high-dimensional geometry (or alternatively, large numbers of random variables), or number-theoretic structures (such as the primes), estimation of sums or integrals of non-negative elementary expressions is a relatively straightforward task, and can be accomplished by a variety of methods. The art of obtaining such estimates is typically not explicitly taught in textbooks, other than through some examples and exercises; it is typically picked up by analysts (or those working in adjacent areas, such as PDE, combinatorics, or theoretical computer science) as graduate students, while they work through their thesis or their first few papers in the subject.

Somewhat in the spirit of this previous post on analysis problem solving strategies, I am going to try here to collect some general principles and techniques that I have found useful for these sorts of problems. As with the previous post, I hope this will be something of a living document, and encourage others to add their own tips or suggestions in the comments.

Read the rest of this entry »

As someone who had a relatively light graduate education in algebra, the import of Yoneda’s lemma in category theory has always eluded me somewhat; the statement and proof are simple enough, but definitely have the “abstract nonsense” flavor that one often ascribes to this part of mathematics, and I struggled to connect it to the more grounded forms of intuition, such as those based on concrete examples, that I was more comfortable with. There is a popular MathOverflow post devoted to this question, with many answers that were helpful to me, but I still felt vaguely dissatisfied. However, recently when pondering the very concrete concept of a polynomial, I managed to accidentally stumble upon a special case of Yoneda’s lemma in action, which clarified this lemma conceptually for me. In the end it was a very simple observation (and would be extremely pedestrian to anyone who works in an algebraic field of mathematics), but as I found this helpful to a non-algebraist such as myself, and I thought I would share it here in case others similarly find it helpful.

In algebra we see a distinction between a polynomial form (also known as a formal polynomial), and a polynomial function, although this distinction is often elided in more concrete applications. A polynomial form in, say, one variable with integer coefficients, is a formal expression {P} of the form

\displaystyle  P = a_d {\mathrm n}^d + \dots + a_1 {\mathrm n} + a_0 \ \ \ \ \ (1)

where {a_0,\dots,a_d} are coefficients in the integers, and {{\mathrm n}} is an indeterminate: a symbol that is often intended to be interpreted as an integer, real number, complex number, or element of some more general ring {R}, but is for now a purely formal object. The collection of such polynomial forms is denoted {{\bf Z}[{\mathrm n}]}, and is a commutative ring.

A polynomial form {P} can be interpreted in any ring {R} (even non-commutative ones) to create a polynomial function {P_R : R \rightarrow R}, defined by the formula

\displaystyle  P_R(n) := a_d n^d + \dots + a_1 n + a_0 \ \ \ \ \ (2)

for any {n \in R}. This definition (2) looks so similar to the definition (1) that we usually abuse notation and conflate {P} with {P_R}. This conflation is supported by the identity theorem for polynomials, that asserts that if two polynomial forms {P, Q} agree at an infinite number of (say) complex numbers, thus {P_{\bf C}(z) = Q_{\bf C}(z)} for infinitely many {z}, then they agree {P=Q} as polynomial forms (i.e., their coefficients match). But this conflation is sometimes dangerous, particularly when working in finite characteristic. For instance:

  • (i) The linear forms {{\mathrm n}} and {-{\mathrm n}} are distinct as polynomial forms, but agree when interpreted in the ring {{\bf Z}/2{\bf Z}}, since {n = -n} for all {n \in {\bf Z}/2{\bf Z}}.
  • (ii) Similarly, if {p} is a prime, then the degree one form {{\mathrm n}} and the degree {p} form {{\mathrm n}^p} are distinct as polynomial forms (and in particular have distinct degrees), but agree when interpreted in the ring {{\bf Z}/p{\bf Z}}, thanks to Fermat’s little theorem.
  • (iii) The polynomial form {{\mathrm n}^2+1} has no roots when interpreted in the reals {{\bf R}}, but has roots when interpreted in the complex numbers {{\bf C}}. Similarly, the linear form {2{\mathrm n}-1} has no roots when interpreted in the integers {{\bf Z}}, but has roots when interpreted in the rationals {{\bf Q}}.

The above examples show that if one only interprets polynomial forms in a specific ring {R}, then some information about the polynomial could be lost (and some features of the polynomial, such as roots, may be “invisible” to that interpretation). But this turns out not to be the case if one considers interpretations in all rings simultaneously, as we shall now discuss.

If {R, S} are two different rings, then the polynomial functions {P_R: R \rightarrow R} and {P_S: S \rightarrow S} arising from interpreting a polynomial form {P} in these two rings are, strictly speaking, different functions. However, they are often closely related to each other. For instance, if {R} is a subring of {S}, then {P_R} agrees with the restriction of {P_S} to {R}. More generally, if there is a ring homomorphism {\phi: R \rightarrow S} from {R} to {S}, then {P_R} and {P_S} are intertwined by the relation

\displaystyle  \phi \circ P_R = P_S \circ \phi, \ \ \ \ \ (3)

which basically asserts that ring homomorphism respect polynomial operations. Note that the previous observation corresponded to the case when {\phi} was an inclusion homomorphism. Another example comes from the complex conjugation automorphism {z \mapsto \overline{z}} on the complex numbers, in which case (3) asserts the identity

\displaystyle  \overline{P_{\bf C}(z)} = P_{\bf C}(\overline{z})

for any polynomial function {P_{\bf C}} on the complex numbers, and any complex number {z}.

What was surprising to me (as someone who had not internalized the Yoneda lemma) was that the converse statement was true: if one had a function {F_R: R \rightarrow R} associated to every ring {R} that obeyed the intertwining relation

\displaystyle  \phi \circ F_R = F_S \circ \phi \ \ \ \ \ (4)

for every ring homomorphism {\phi: R \rightarrow S}, then there was a unique polynomial form {P \in {\bf Z}[\mathrm{n}]} such that {F_R = P_R} for all rings {R}. This seemed surprising to me because the functions {F} were a priori arbitrary functions, and as an analyst I would not expect them to have polynomial structure. But the fact that (4) holds for all rings {R,S} and all homomorphisms {\phi} is in fact rather powerful. As an analyst, I am tempted to proceed by first working with the ring {{\bf C}} of complex numbers and taking advantage of the aforementioned identity theorem, but this turns out to be tricky because {{\bf C}} does not “talk” to all the other rings {R} enough, in the sense that there are not always as many ring homomorphisms from {{\bf C}} to {R} as one would like. But there is in fact a more elementary argument that takes advantage of a particularly relevant (and “talkative”) ring to the theory of polynomials, namely the ring {{\bf Z}[\mathrm{n}]} of polynomials themselves. Given any other ring {R}, and any element {n} of that ring, there is a unique ring homomorphism {\phi_{R,n}: {\bf Z}[\mathrm{n}] \rightarrow R} from {{\bf Z}[\mathrm{n}]} to {R} that maps {\mathrm{n}} to {n}, namely the evaluation map

\displaystyle  \phi_{R,n} \colon a_d {\mathrm n}^d + \dots + a_1 {\mathrm n} + a_0 \mapsto a_d n^d + \dots + a_1 n + a_0

that sends a polynomial form to its evaluation at {n}. Applying (4) to this ring homomorphism, and specializing to the element {\mathrm{n}} of {{\bf Z}[\mathrm{n}]}, we conclude that

\displaystyle  \phi_{R,n}( F_{{\bf Z}[\mathrm{n}]}(\mathrm{n}) ) = F_R( n )

for any ring {R} and any {n \in R}. If we then define {P \in {\bf Z}[\mathrm{n}]} to be the formal polynomial

\displaystyle  P := F_{{\bf Z}[\mathrm{n}]}(\mathrm{n}),

then this identity can be rewritten as

\displaystyle  F_R = P_R

and so we have indeed shown that the family {F_R} arises from a polynomial form {P}. Conversely, from the identity

\displaystyle  P = P_{{\bf Z}[\mathrm{n}]}(\mathrm{n})

valid for any polynomial form {P}, we see that two polynomial forms {P,Q} can only generate the same polynomial functions {P_R, Q_R} for all rings {R} if they are identical as polynomial forms. So the polynomial form {P} associated to the family {F_R} is unique.

We have thus created an identification of form and function: polynomial forms {P} are in one-to-one correspondence with families of functions {F_R} obeying the intertwining relation (4). But this identification can be interpreted as a special case of the Yoneda lemma, as follows. There are two categories in play here: the category {\mathbf{Ring}} of rings (where the morphisms are ring homomorphisms), and the category {\mathrm{Set}} of sets (where the morphisms are arbitrary functions). There is an obvious forgetful functor {\mathrm{Forget}: \mathbf{Ring} \rightarrow \mathbf{Set}} between these two categories that takes a ring and removes all of the algebraic structure, leaving behind just the underlying set. A collection {F_R: R \rightarrow R} of functions (i.e., {\mathbf{Set}}-morphisms) for each {R} in {\mathbf{Ring}} that obeys the intertwining relation (4) is precisely the same thing as a natural transformation from the forgetful functor {\mathrm{Forget}} to itself. So we have identified formal polynomials in {{\bf Z}[\mathbf{n}]} as a set with natural endomorphisms of the forgetful functor:

\displaystyle  \mathrm{Forget}({\bf Z}[\mathbf{n}]) \equiv \mathrm{Hom}( \mathrm{Forget}, \mathrm{Forget} ). \ \ \ \ \ (5)

Informally: polynomial forms are precisely those operations on rings that are respected by ring homomorphisms.

What does this have to do with Yoneda’s lemma? Well, remember that every element {n} of a ring {R} came with an evaluation homomorphism {\phi_{R,n}: {\bf Z}[\mathrm{n}] \rightarrow R}. Conversely, every homomorphism from {{\bf Z}[\mathrm{n}]} to {R} will be of the form {\phi_{R,n}} for a unique {n} – indeed, {n} will just be the image of {\mathrm{n}} under this homomorphism. So the evaluation homomorphism provides a one-to-one correspondence between elements of {R}, and ring homomorphisms in {\mathrm{Hom}({\bf Z}[\mathrm{n}], R)}. This correspondence is at the level of sets, so this gives the identification

\displaystyle  \mathrm{Forget} \equiv \mathrm{Hom}({\bf Z}[\mathrm{n}], -).

Thus our identification can be written as

\displaystyle  \mathrm{Forget}({\bf Z}[\mathbf{n}]) \equiv \mathrm{Hom}( \mathrm{Hom}({\bf Z}[\mathrm{n}], -), \mathrm{Forget} )

which is now clearly a special case of the Yoneda lemma

\displaystyle  F(A) \equiv \mathrm{Hom}( \mathrm{Hom}(A, -), F )

that applies to any functor {F: {\mathcal C} \rightarrow \mathbf{Set}} from a (locally small) category {{\mathcal C}} and any object {A} in {{\mathcal C}}. And indeed if one inspects the standard proof of this lemma, it is essentially the same argument as the argument we used above to establish the identification (5). More generally, it seems to me that the Yoneda lemma is often used to identify “formal” objects with their “functional” interpretations, as long as one simultaneously considers interpretations across an entire category (such as the category of rings), as opposed to just a single interpretation in a single object of the category in which there may be some loss of information due to the peculiarities of that specific object. Grothendieck’s “functor of points” interpretation of a scheme, discussed in this previous blog post, is one typical example of this.

The “epsilon-delta” nature of analysis can be daunting and unintuitive to students, as the heavy reliance on inequalities rather than equalities. But it occurred to me recently that one might be able to leverage the intuition one already has from “deals” – of the type one often sees advertised by corporations – to get at least some informal understanding of these concepts.

Take for instance the concept of an upper bound {X \leq A} or a lower bound {X \geq B} on some quantity {X}. From an economic perspective, one could think of the upper bound as an assertion that {X} can be “bought” for {A} units of currency, and the lower bound can similarly be viewed as an assertion that {X} can be “sold” for {B} units of currency. Thus for instance, a system of inequalities and equations like

\displaystyle  2 \leq Y \leq 5

\displaystyle  X+Y \leq 7

\displaystyle  X+Y+Z = 10

\displaystyle  Y+Z \leq 6

could be viewed as analogous to a currency rate exchange board, of the type one sees for instance in airports:

Currency We buy at We sell at
{Y} {2} {5}
{X+Y} {7}
{X+Y+Z} {10} {10}
{Y+Z} {6}

Someone with an eye for spotting “deals” might now realize that one can actually buy {Y} for {3} units of currency rather than {5}, by purchasing one copy each of {X+Y} and {Y+Z} for {7+6=13} units of currency, then selling off {X+Y+Z} to recover {10} units of currency back. In more traditional mathematical language, one can improve the upper bound {Y \leq 5} to {Y \leq 3} by taking the appropriate linear combination of the inequalities {X+Y \leq 7}, {Y+Z \leq 6}, and {X+Y+Z=10}. More generally, this way of thinking is useful when faced with a linear programming situation (and of course linear programming is a key foundation for operations research), although this analogy begins to break down when one wants to use inequalities in a more non-linear fashion.

Asymptotic estimates such as {X = O(Y)} (also often written {X \lesssim Y} or {X \ll Y}) can be viewed as some sort of liquid market in which {Y} can be used to purchase {X}, though depending on market rates, one may need a large number of units of {Y} in order to buy a single unit of {X}. An asymptotic estimate like {X=o(Y)} represents an economic situation in which {Y} is so much more highly desired than {X} that, if one is a patient enough haggler, one can eventually convince someone to give up a unit of {X} for even just a tiny amount of {Y}.

When it comes to the basic analysis concepts of convergence and continuity, one can similarly view these concepts as various economic transactions involving the buying and selling of accuracy. One could for instance imagine the following hypothetical range of products in which one would need to spend more money to obtain higher accuracy to measure weight in grams:

Object Accuracy Price
Low-end kitchen scale {\pm 1} gram {\$ 5}
High-end bathroom scale {\pm 0.1} grams {\$ 15}
Low-end lab scale {\pm 0.01} grams {\$ 50}
High-end lab scale {\pm 0.001} grams {\$ 250}

The concept of convergence {x_n \rightarrow x} of a sequence {x_1,x_2,x_3,\dots} to a limit {x} could then be viewed as somewhat analogous to a rewards program, of the type offered for instance by airlines, in which various tiers of perks are offered when one hits a certain level of “currency” (e.g., frequent flyer miles). For instance, the convergence of the sequence {x_n := 2 + \frac{1}{\sqrt{n}}} to its limit {x := 2} offers the following accuracy “perks” depending on one’s level {n} in the sequence:

Status Accuracy benefit Eligibility
Basic status {|x_n - x| \leq 1} {n \geq 1}
Bronze status {|x_n - x| \leq 0.1} {n \geq 10^2}
Silver status {|x_n - x| \leq 0.01} {n \geq 10^4}
Gold status {|x_n - x| \leq 0.001} {n \geq 10^6}
{\dots} {\dots} {\dots}

With this conceptual model, convergence means that any status level of accuracy can be unlocked if one’s number {n} of “points earned” is high enough.

In a similar vein, continuity becomes analogous to a conversion program, in which accuracy benefits from one company can be traded in for new accuracy benefits in another company. For instance, the continuity of the function {f(x) = 2 + \sqrt{x}} at the point {x_0=0} can be viewed in terms of the following conversion chart:

Accuracy benefit of {x} to trade in Accuracy benefit of {f(x)} obtained
{|x - x_0| \leq 1} {|f(x) - f(x_0)| \leq 1}
{|x - x_0| \leq 0.01} {|f(x) - f(x_0)| \leq 0.1}
{|x - x_0| \leq 0.0001} {|f(x) - f(x_0)| \leq 0.01}
{\dots} {\dots}

Again, the point is that one can purchase any desired level of accuracy of {f(x)} provided one trades in a suitably high level of accuracy of {x}.

At present, the above conversion chart is only available at the single location {x_0}. The concept of uniform continuity can then be viewed as an advertising copy that “offer prices are valid in all store locations”. In a similar vein, the concept of equicontinuity for a class {{\mathcal F}} of functions is a guarantee that “offer applies to all functions {f} in the class {{\mathcal F}}, without any price discrimination. The combined notion of uniform equicontinuity is then of course the claim that the offer is valid in all locations and for all functions.

In a similar vein, differentiability can be viewed as a deal in which one can trade in accuracy of the input for approximately linear behavior of the output; to oversimplify slightly, smoothness can similarly be viewed as a deal in which one trades in accuracy of the input for high-accuracy polynomial approximability of the output. Measurability of a set or function can be viewed as a deal in which one trades in a level of resolution for an accurate approximation of that set or function at the given resolution. And so forth.

Perhaps readers can propose some other examples of mathematical concepts being re-interpreted as some sort of economic transaction?

This post is an unofficial sequel to one of my first blog posts from 2007, which was entitled “Quantum mechanics and Tomb Raider“.

One of the oldest and most famous allegories is Plato’s allegory of the cave. This allegory centers around a group of people chained to a wall in a cave that cannot see themselves or each other, but only the two-dimensional shadows of themselves cast on the wall in front of them by some light source they cannot directly see. Because of this, they identify reality with this two-dimensional representation, and have significant conceptual difficulties in trying to view themselves (or the world as a whole) as three-dimensional, until they are freed from the cave and able to venture into the sunlight.

There is a similar conceptual difficulty when trying to understand Einstein’s theory of special relativity (and more so for general relativity, but let us focus on special relativity for now). We are very much accustomed to thinking of reality as a three-dimensional space endowed with a Euclidean geometry that we traverse through in time, but in order to have the clearest view of the universe of special relativity it is better to think of reality instead as a four-dimensional spacetime that is endowed instead with a Minkowski geometry, which mathematically is similar to a (four-dimensional) Euclidean space but with a crucial change of sign in the underlying metric. Indeed, whereas the distance {ds} between two points in Euclidean space {{\bf R}^3} is given by the three-dimensional Pythagorean theorem

\displaystyle  ds^2 = dx^2 + dy^2 + dz^2

under some standard Cartesian coordinate system {(x,y,z)} of that space, and the distance {ds} in a four-dimensional Euclidean space {{\bf R}^4} would be similarly given by

\displaystyle  ds^2 = dx^2 + dy^2 + dz^2 + du^2

under a standard four-dimensional Cartesian coordinate system {(x,y,z,u)}, the spacetime interval {ds} in Minkowski space is given by

\displaystyle  ds^2 = dx^2 + dy^2 + dz^2 - c^2 dt^2

(though in many texts the opposite sign convention {ds^2 = -dx^2 -dy^2 - dz^2 + c^2dt^2} is preferred) in spacetime coordinates {(x,y,z,t)}, where {c} is the speed of light. The geometry of Minkowski space is then quite similar algebraically to the geometry of Euclidean space (with the sign change replacing the traditional trigonometric functions {\sin, \cos, \tan}, etc. by their hyperbolic counterparts {\sinh, \cosh, \tanh}, and with various factors involving “{c}” inserted in the formulae), but also has some qualitative differences to Euclidean space, most notably a causality structure connected to light cones that has no obvious counterpart in Euclidean space.

That said, the analogy between Minkowski space and four-dimensional Euclidean space is strong enough that it serves as a useful conceptual aid when first learning special relativity; for instance the excellent introductory text “Spacetime physics” by Taylor and Wheeler very much adopts this view. On the other hand, this analogy doesn’t directly address the conceptual problem mentioned earlier of viewing reality as a four-dimensional spacetime in the first place, rather than as a three-dimensional space that objects move around in as time progresses. Of course, part of the issue is that we aren’t good at directly visualizing four dimensions in the first place. This latter problem can at least be easily addressed by removing one or two spatial dimensions from this framework – and indeed many relativity texts start with the simplified setting of only having one spatial dimension, so that spacetime becomes two-dimensional and can be depicted with relative ease by spacetime diagrams – but still there is conceptual resistance to the idea of treating time as another spatial dimension, since we clearly cannot “move around” in time as freely as we can in space, nor do we seem able to easily “rotate” between the spatial and temporal axes, the way that we can between the three coordinate axes of Euclidean space.

With this in mind, I thought it might be worth attempting a Plato-type allegory to reconcile the spatial and spacetime views of reality, in a way that can be used to describe (analogues of) some of the less intuitive features of relativity, such as time dilation, length contraction, and the relativity of simultaneity. I have (somewhat whimsically) decided to place this allegory in a Tolkienesque fantasy world (similarly to how my previous allegory to describe quantum mechanics was phrased in a world based on the computer game “Tomb Raider”). This is something of an experiment, and (like any other analogy) the allegory will not be able to perfectly capture every aspect of the phenomenon it is trying to represent, so any feedback to improve the allegory would be appreciated.

Read the rest of this entry »

Archives