You are currently browsing the tag archive for the ‘randomness’ tag.
In the modern theory of higher order Fourier analysis, a key role is played by the Gowers uniformity norms for
. For finitely supported functions
, one can define the (non-normalised) Gowers norm
by the formula
The significance of the Gowers norms is that they control other multilinear forms that show up in additive combinatorics. Given any polynomials and functions
, we define the multilinear form
-
and
have true complexity
;
-
has true complexity
;
-
has true complexity
;
- The form
(which among other things could be used to count twin primes) has infinite true complexity (which is quite unfortunate for applications).
Gowers and Wolf formulated a conjecture on what this complexity should be, at least for linear polynomials ; Ben Green and I thought we had resolved this conjecture back in 2010, though it turned out there was a subtle gap in our arguments and we were only able to resolve the conjecture in a partial range of cases. However, the full conjecture was recently resolved by Daniel Altman.
The (semi-)norm is so weak that it barely controls any averages at all. For instance the average
Because of this, I propose inserting an additional norm in the Gowers uniformity norm hierarchy between the and
norms, which I will call the
(or “profinite
“) norm:
The norm recently appeared implicitly in work of Peluse and Prendiville, who showed that the form
had true complexity
in this notation (with polynomially strong bounds). [Actually, strictly speaking this control was only shown for the third function
; for the first two functions
one needs to localize the
norm to intervals of length
. But I will ignore this technical point to keep the exposition simple.] The weaker claim that
has true complexity
is substantially easier to prove (one can apply the circle method together with Gauss sum estimates).
The well known inverse theorem for the norm tells us that if a
-bounded function
has
norm at least
for some
, then there is a Fourier phase
such that
For one has a trivial inverse theorem; by definition, the
norm of
is at least
if and only if
For one has the intermediate situation in which the frequency
is not taken to be zero, but is instead major arc. Indeed, suppose that
is
-bounded with
, thus
Here is a diagram showing some of the control relationships between various Gowers norms, multilinear forms, and duals of classes of functions (where each class of functions
induces a dual norm):
Here I have included the three classes of functions that one can choose from for the inverse theorem, namely degree two nilsequences, bracket quadratic phases, and local quadratic phases, as well as the more narrow class of globally quadratic phases.
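Returning to the inverse theorem discussed above, one can see it in action numerically. The following sketch is my own illustration (not from the post), using one standard normalisation of the Gowers norm on ${\bf Z}/N{\bf Z}$, namely $\|f\|_{U^2}^4 = \sum_\xi |\hat f(\xi)|^4$ where $\hat f(\xi) = \frac{1}{N} \sum_x f(x) e(-x\xi/N)$: a Fourier phase has $U^2$ norm equal to one and its dominant frequency is recovered by the Fourier transform, while a random sign pattern has small $U^2$ norm.

```python
import numpy as np

def gowers_u2_norm(f):
    """U^2 norm on Z/NZ via the identity ||f||_{U^2}^4 = sum_xi |fhat(xi)|^4,
    where fhat(xi) = (1/N) sum_x f(x) e(-x xi / N)."""
    N = len(f)
    fhat = np.fft.fft(f) / N
    return float((np.abs(fhat) ** 4).sum()) ** 0.25

N = 97
x = np.arange(N)

# A "structured" function: the Fourier phase e(3x/N); its U^2 norm is 1.
phase = np.exp(2j * np.pi * 3 * x / N)
# A "random" function: random signs; its U^2 norm is small (roughly N^(-1/4)).
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=N)

print(gowers_u2_norm(phase))   # close to 1
print(gowers_u2_norm(signs))   # much smaller

# Inverse theorem in action: the dominant frequency of the phase is 3.
print(int(np.argmax(np.abs(np.fft.fft(phase)))))  # 3
```

This matches the dichotomy in the text: large $U^2$ norm forces a large Fourier coefficient, i.e. correlation with a Fourier phase.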
The Gowers norms have counterparts for measure-preserving systems, known as Host-Kra seminorms. The
norm can be defined for
as
Given a function on the natural numbers taking values in
, one can invoke the Furstenberg correspondence principle to locate a measure preserving system
– a probability space
together with a measure-preserving shift
(or equivalently, a measure-preserving
-action on
) – together with a measurable function (or “observable”)
that has essentially the same statistics as
in the sense that
for any integers . In particular, one has
whenever the limit on the right-hand side exists. We will refer to the system together with the designated function
as a Furstenberg limit of the sequence
. These Furstenberg limits capture some, but not all, of the asymptotic behaviour of
; roughly speaking, they control the typical “local” behaviour of
, involving correlations such as
in the regime where
are much smaller than
. However, the control on error terms here is usually only qualitative at best, and one usually does not obtain non-trivial control on correlations in which the
are allowed to grow at some significant rate with
(e.g. like some power
of
).
The correspondence principle is discussed in these previous blog posts. One way to establish the principle is by introducing a Banach limit that extends the usual limit functional on the subspace of $\ell^\infty({\bf N})$ consisting of convergent sequences while still having operator norm one. Such functionals cannot be constructed explicitly, but can be proven to exist (non-constructively and non-uniquely) using the Hahn-Banach theorem; one can also use a non-principal ultrafilter here if desired. One can then seek to construct a system
and a measurable function
for which one has the statistics
for all . One can explicitly construct such a system as follows. One can take
to be the Cantor space
with the product $\sigma$-algebra and the shift
with the function being the coordinate function at zero:
(so in particular for any
). The only thing remaining is to construct the invariant measure
. In order to be consistent with (2), one must have
for any distinct integers and signs
. One can check that this defines a premeasure on the Boolean algebra of
defined by cylinder sets, and the existence of
then follows from the Hahn-Kolmogorov extension theorem (or the closely related Kolmogorov extension theorem). One can then check that the correspondence (2) holds, and that
is translation-invariant; the latter comes from the translation invariance of the (Banach-)Césaro averaging operation
. A variant of this construction shows that the Furstenberg limit is unique up to equivalence if and only if all the limits appearing in (1) actually exist.
One can obtain a slightly tighter correspondence by using a smoother average than the Césaro average. For instance, one can use the logarithmic Césaro averages in place of the Césaro average
, thus one replaces (2) by
Whenever the Césaro average of a bounded sequence exists, then the logarithmic Césaro average exists and is equal to the Césaro average. Thus, a Furstenberg limit constructed using logarithmic Banach-Césaro averaging still obeys (1) for all
when the right-hand side limit exists, but also obeys the more general assertion
whenever the limit of the right-hand side exists.
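The relationship between the two averaging schemes is easy to see numerically. The following is an illustration of my own (not from the post): for a sequence alternating in sign on dyadic blocks, the Césaro averages oscillate and fail to converge, while the logarithmic averages tend to zero.

```python
import math

def cesaro(f, N):
    """Cesaro average (1/N) sum_{n<=N} f(n)."""
    return sum(f(n) for n in range(1, N + 1)) / N

def log_avg(f, N):
    """Logarithmic average (1/log N) sum_{n<=N} f(n)/n."""
    return sum(f(n) / n for n in range(1, N + 1)) / math.log(N)

# f alternates sign on the dyadic blocks [2^k, 2^{k+1}).
f = lambda n: (-1) ** (n.bit_length() - 1)

# Cesaro averages along N = 2^K oscillate between roughly +1/3 and -1/3 ...
print([round(cesaro(f, 2**K), 3) for K in (10, 11, 12)])
# ... while logarithmic averages tend to 0.
print([round(log_avg(f, 2**K), 3) for K in (10, 11, 12)])
```

This is consistent with the implication stated in the text: existence of the Césaro average forces the logarithmic average to exist and agree with it, but not conversely.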
In a recent paper of Frantzikinakis, the Furstenberg limits of the Liouville function (with logarithmic averaging) were studied. Some (but not all) of the known facts and conjectures about the Liouville function can be interpreted in the Furstenberg limit. For instance, in a recent breakthrough result of Matomaki and Radziwill (discussed previously here), it was shown that the Liouville function exhibited cancellation on short intervals in the sense that
In terms of Furstenberg limits of the Liouville function, this assertion is equivalent to the assertion that
for all Furstenberg limits of Liouville (including those without logarithmic averaging). Invoking the mean ergodic theorem (discussed in this previous post), this assertion is in turn equivalent to the observable
that corresponds to the Liouville function being orthogonal to the invariant factor
of
; equivalently, the first Gowers-Host-Kra seminorm
of
(as defined for instance in this previous post) vanishes. The Chowla conjecture, which asserts that
$$\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n \leq N} \lambda(n+h_1) \cdots \lambda(n+h_k) = 0$$
for all distinct integers $h_1,\dots,h_k$, is equivalent to the assertion that all the Furstenberg limits of Liouville are equivalent to the Bernoulli system (
with the product measure arising from the uniform distribution on
, with the shift
and observable
as before). Similarly, the logarithmically averaged Chowla conjecture
$$\lim_{x \rightarrow \infty} \frac{1}{\log x} \sum_{n \leq x} \frac{\lambda(n+h_1) \cdots \lambda(n+h_k)}{n} = 0$$
is equivalent to the assertion that all the Furstenberg limits of Liouville with logarithmic averaging are equivalent to the Bernoulli system. Recently, I was able to prove the two-point version
$$\lim_{x \rightarrow \infty} \frac{1}{\log x} \sum_{n \leq x} \frac{\lambda(n) \lambda(n+h)}{n} = 0$$
of the logarithmically averaged Chowla conjecture, for any non-zero integer $h$; this is equivalent to the perfect strong mixing property
$$\int_X f(x) f(T^h x)\ d\mu(x) = 0$$
for any Furstenberg limit $(X, \mu, T)$ (with observable $f$) of Liouville with logarithmic averaging, and any $h \neq 0$.
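These correlation statements can be probed empirically. The following sketch is my own (not from the post): it computes the Liouville function by a smallest-prime-factor sieve and evaluates both the plain mean of $\lambda$ and the logarithmically averaged two-point correlation at shift $h = 1$; both come out small compared to the trivial bound of $1$.

```python
import math

def liouville_up_to(N):
    """lambda(n) = (-1)^Omega(n) for all n <= N, by trial sieving."""
    rem = list(range(N + 1))   # remaining unfactored part of each n
    lam = [1] * (N + 1)
    for p in range(2, N + 1):
        if rem[p] == p:        # p is prime: no smaller prime divided it
            for m in range(p, N + 1, p):
                while rem[m] % p == 0:
                    rem[m] //= p
                    lam[m] = -lam[m]   # one sign flip per prime factor
    return lam

N = 10**5
lam = liouville_up_to(N + 1)

mean = sum(lam[1:N + 1]) / N                   # plain average of lambda
corr = sum(lam[n] * lam[n + 1] / n             # logarithmically averaged
           for n in range(1, N + 1)) / math.log(N)  # two-point correlation
print(mean, corr)   # both small compared to the trivial bound 1
```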
The situation is more delicate with regards to the Sarnak conjecture, which is equivalent to the assertion that
for any zero-entropy sequence (see this previous blog post for more discussion). Morally speaking, this conjecture should be equivalent to the assertion that any Furstenberg limit of Liouville is disjoint from any zero entropy system, but I was not able to formally establish an implication in either direction due to some technical issues regarding the fact that the Furstenberg limit does not directly control long-range correlations, only short-range ones. (There are however ergodic theoretic interpretations of the Sarnak conjecture that involve the notion of generic points; see this paper of El Abdalaoui, Lemańczyk, and de la Rue.) But the situation is currently better with the logarithmically averaged Sarnak conjecture
as I was able to show that this conjecture was equivalent to the logarithmically averaged Chowla conjecture, and hence to all Furstenberg limits of Liouville with logarithmic averaging being Bernoulli; I also showed the conjecture was equivalent to local Gowers uniformity of the Liouville function, which is in turn equivalent to the function having all Gowers-Host-Kra seminorms vanishing in every Furstenberg limit with logarithmic averaging. In this recent paper of Frantzikinakis, this analysis was taken further, showing that the logarithmically averaged Chowla and Sarnak conjectures were in fact equivalent to the much milder seeming assertion that all Furstenberg limits with logarithmic averaging were ergodic.
Actually, the logarithmically averaged Furstenberg limits have more structure than just a ${\bf Z}$-action on a measure preserving system
with a single observable
. Let
denote the semigroup of affine maps
on the integers with
and
positive. Also, let
denote the profinite integers (the inverse limit of the cyclic groups
). Observe that
acts on
by taking the inverse limit of the obvious actions of
on
.
Proposition 1 (Enriched logarithmically averaged Furstenberg limit of Liouville) Let
be a Banach limit. Then there exists a probability space
with an action
of the affine semigroup
, as well as measurable functions
and
, with the following properties:
- (i) (Affine Furstenberg limit) For any
, and any congruence class
, one has
- (ii) (Equivariance of
) For any
, one has
for
-almost every
.
- (iii) (Multiplicativity at fixed primes) For any prime
, one has
for
-almost every
, where
is the dilation map
.
- (iv) (Measure pushforward) If
is of the form
and
is the set
, then the pushforward
of
by
is equal to
, that is to say one has
for every measurable
.
Note that can be viewed as the subgroup of
consisting of the translations
. If one only keeps the
-portion of the
action and forgets the rest (as well as the function
) then the action becomes measure-preserving, and we recover an ordinary Furstenberg limit with logarithmic averaging. However, the additional structure here can be quite useful; for instance, one can transfer the proof of (3) to this setting, which we sketch below the fold, after proving the proposition.
The observable , roughly speaking, means that points
in the Furstenberg limit
constructed by this proposition are still “virtual integers” in the sense that one can meaningfully compute the residue class of
modulo any natural number modulus
, by first applying
and then reducing mod
. The action of
means that one can also meaningfully multiply
by any natural number, and translate it by any integer. As with other applications of the correspondence principle, the main advantage of moving to this more “virtual” setting is that one now acquires a probability measure
, so that the tools of ergodic theory can be readily applied.
Given a random variable $X$ that takes on only finitely many values, we can define its Shannon entropy by the formula
$${\bf H}(X) := \sum_x {\bf P}(X = x) \log \frac{1}{{\bf P}(X = x)},$$
with the convention that $0 \log \frac{1}{0} = 0$. (In some texts, one uses the logarithm to base $2$ rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables $X, Y$ taking on finitely many values, the joint variable $(X,Y)$ is also a random variable taking on finitely many values, and also has an entropy ${\bf H}(X,Y)$. It obeys the Shannon inequalities
$${\bf H}(X), {\bf H}(Y) \leq {\bf H}(X,Y) \leq {\bf H}(X) + {\bf H}(Y),$$
so we can define some further nonnegative quantities, the mutual information
$${\bf I}(X : Y) := {\bf H}(X) + {\bf H}(Y) - {\bf H}(X,Y)$$
and the conditional entropies
$${\bf H}(X|Y) := {\bf H}(X,Y) - {\bf H}(Y); \quad {\bf H}(Y|X) := {\bf H}(X,Y) - {\bf H}(X).$$
More generally, given three random variables $X, Y, Z$, one can define the conditional mutual information
$${\bf I}(X : Y | Z) := {\bf H}(X|Z) + {\bf H}(Y|Z) - {\bf H}(X,Y|Z)$$
and the final of the Shannon entropy inequalities asserts that this quantity is also non-negative.
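These quantities are easy to experiment with numerically. The following sketch is my own illustration (not from the post): it computes entropies and mutual information from an explicit small joint distribution, and checks the Shannon inequalities on it.

```python
import math
from collections import Counter

def H(dist):
    """Shannon entropy (natural log) of a distribution {outcome: probability}."""
    return sum(p * math.log(1 / p) for p in dist.values() if p > 0)

def marg(joint, idx):
    """Marginal distribution of the coordinates listed in idx."""
    out = Counter()
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

# A small arbitrary joint distribution of a pair (X, Y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
HX, HY, HXY = H(marg(joint, (0,))), H(marg(joint, (1,))), H(joint)

# Shannon inequalities: max(H(X), H(Y)) <= H(X,Y) <= H(X) + H(Y),
# so mutual information and conditional entropy are nonnegative.
assert max(HX, HY) <= HXY <= HX + HY
print(round(HX + HY - HXY, 4))   # I(X:Y), nonnegative
print(round(HXY - HY, 4))        # H(X|Y), nonnegative
```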
The mutual information ${\bf I}(X:Y)$ is a measure of the extent to which $X$ and $Y$ fail to be independent; indeed, it is not difficult to show that ${\bf I}(X:Y)$ vanishes if and only if $X$ and $Y$ are independent. Similarly, ${\bf I}(X:Y|Z)$ vanishes if and only if $X$ and $Y$ are conditionally independent relative to $Z$. At the other extreme, the conditional entropy ${\bf H}(X|Y)$ is a measure of the extent to which $X$ fails to depend on $Y$; indeed, it is not difficult to show that ${\bf H}(X|Y) = 0$ if and only if $X$ is determined by $Y$ in the sense that there is a deterministic function $f$ such that $X = f(Y)$. In a related vein, if $X$ and $X'$ are equivalent in the sense that there are deterministic functional relationships $X = f(X')$, $X' = g(X)$ between the two variables, then $X$ is interchangeable with $X'$ for the purposes of computing the above quantities, thus for instance ${\bf H}(X) = {\bf H}(X')$, ${\bf H}(X,Y) = {\bf H}(X',Y)$, ${\bf I}(X:Y) = {\bf I}(X':Y)$, ${\bf I}(X:Y|Z) = {\bf I}(X':Y|Z)$, etc.
One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables being considered come from restricting a single random (and uniformly distributed) boolean function $F: \Omega \rightarrow \{0,1\}$ on a given finite domain $\Omega$ to some subset $E$ of $\Omega$:
$$X = X_E := F|_E.$$
In this case, $X_E$ has the law of a random uniformly distributed boolean function from $E$ to $\{0,1\}$, and the entropy here can be easily computed to be $|E| \log 2$, where $|E|$ denotes the cardinality of $E$. If $X_{E_1}$ is the restriction of $F$ to $E_1$, and $X_{E_2}$ is the restriction of $F$ to $E_2$, then the joint variable $(X_{E_1}, X_{E_2})$ is equivalent to the restriction of $F$ to $E_1 \cup E_2$. If one discards the normalisation factor $\log 2$, one then obtains the following dictionary between entropy and the combinatorics of finite sets:
Random variables $X_E$ | Finite sets $E$
Entropy ${\bf H}(X_E)$ | Cardinality $|E|$
Joint variable $(X_{E_1}, X_{E_2})$ | Union $E_1 \cup E_2$
Mutual information ${\bf I}(X_{E_1} : X_{E_2})$ | Intersection cardinality $|E_1 \cap E_2|$
Conditional entropy ${\bf H}(X_{E_1} | X_{E_2})$ | Set difference cardinality $|E_1 \backslash E_2|$
Conditional mutual information ${\bf I}(X_{E_1} : X_{E_2} | X_{E_3})$ | $|E_1 \cap E_2 \backslash E_3|$
Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality ${\bf H}(X,Y) \leq {\bf H}(X) + {\bf H}(Y)$ becomes the union bound $|E_1 \cup E_2| \leq |E_1| + |E_2|$, and the definition of mutual information becomes the inclusion-exclusion formula
$$|E_1 \cap E_2| = |E_1| + |E_2| - |E_1 \cup E_2|.$$
For a more advanced example, consider the data processing inequality that asserts that if $X, Z$ are conditionally independent relative to $Y$, then ${\bf I}(X:Z) \leq {\bf I}(X:Y)$. Specialising to sets, this now says that if $E_1, E_3$ are disjoint outside of $E_2$, then $|E_1 \cap E_3| \leq |E_1 \cap E_2|$; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if $E_1$ and $E_3$ are not necessarily disjoint outside of $E_2$, then a consideration of Venn diagrams gives the more general inequality
$$|E_1 \cap E_3| \leq |E_1 \cap E_2| + |E_1 \cap E_3 \backslash E_2|$$
and a further inspection of the diagram then reveals the more precise identity
$$|E_1 \cap E_3| = |E_1 \cap E_2| - |E_1 \cap E_2 \backslash E_3| + |E_1 \cap E_3 \backslash E_2|.$$
Using the dictionary in the reverse direction, one is then led to conjecture the identity
$${\bf I}(X : Z) = {\bf I}(X : Y) - {\bf I}(X : Y | Z) + {\bf I}(X : Z | Y)$$
which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
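The data processing inequality, and the identity that implies it, can be checked on a concrete Markov chain; the following is an illustration of my own, not from the original post.

```python
import math
from collections import Counter

def H(dist):
    """Shannon entropy (natural log) of {outcome: probability}."""
    return sum(p * math.log(1 / p) for p in dist.values() if p > 0)

def marg(joint, idx):
    out = Counter()
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def I(joint, a, b, c=()):
    """I(X_a : X_b | X_c) = H(a,c) + H(b,c) - H(a,b,c) - H(c)."""
    return (H(marg(joint, a + c)) + H(marg(joint, b + c))
            - H(marg(joint, a + b + c)) - H(marg(joint, c)))

# Markov chain X -> Y -> Z: each step flips the bit with probability 0.2,
# so X and Z are conditionally independent relative to Y.
flip = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.8}
joint = {}
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            joint[(x, y, z)] = 0.5 * flip[(x, y)] * flip[(y, z)]

assert abs(I(joint, (0,), (2,), (1,))) < 1e-12     # I(X:Z|Y) = 0
ixz, ixy = I(joint, (0,), (2,)), I(joint, (0,), (1,))
assert ixz <= ixy + 1e-12                          # data processing inequality
# The identity behind it: I(X:Z) = I(X:Y) - I(X:Y|Z) + I(X:Z|Y).
assert abs(ixz - (ixy - I(joint, (0,), (1,), (2,))
                  + I(joint, (0,), (2,), (1,)))) < 1e-12
print(ixy, ixz)   # information can only decrease along the chain
```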
On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set is preserved by unions: if $E_1 \cap E_3 = E_2 \cap E_3 = \emptyset$, then $(E_1 \cup E_2) \cap E_3 = \emptyset$. Indeed, one has the union bound
$$|(E_1 \cup E_2) \cap E_3| \leq |E_1 \cap E_3| + |E_2 \cap E_3|. \quad (1)$$
Applying the dictionary in the reverse direction, one might now conjecture that if $X$ was independent of $Z$ and $Y$ was independent of $Z$, then $(X,Y)$ should also be independent of $Z$, and furthermore that
$${\bf I}((X,Y) : Z) \leq {\bf I}(X : Z) + {\bf I}(Y : Z);$$
but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take $X, Y$ to be independent, uniformly distributed random elements of the finite field ${\bf F}_2$ of two elements, and take $Z := X + Y$ to be the sum of these two field elements. One can easily check that each of $X$ and $Y$ is separately independent of $Z$, but the joint variable $(X,Y)$ determines $Z$ and thus is not independent of $Z$.
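This counterexample can be verified directly (an illustration of mine, not in the original):

```python
import math
from collections import Counter

def H(dist):
    return sum(p * math.log(1 / p) for p in dist.values() if p > 0)

def marg(joint, idx):
    out = Counter()
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def I(joint, a, b):
    """Mutual information I(X_a : X_b) = H(a) + H(b) - H(a,b)."""
    return H(marg(joint, a)) + H(marg(joint, b)) - H(marg(joint, a + b))

# Coordinates (X, Y, Z): X, Y fair coin flips, Z = X + Y in F_2.
joint = {(x, y, (x + y) % 2): 0.25 for x in (0, 1) for y in (0, 1)}

print(I(joint, (0,), (2,)))     # essentially 0: X is independent of Z
print(I(joint, (1,), (2,)))     # essentially 0: Y is independent of Z
print(I(joint, (0, 1), (2,)))   # log 2 > 0: (X, Y) determines Z
```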
From the inclusion-exclusion identities one can check that (1) is equivalent to the trivial lower bound $|E_1 \cap E_2 \cap E_3| \geq 0$. The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection $E_1 \cap E_2 \cap E_3$. (Even the double intersection $E_1 \cap E_2$ only exists information theoretically in a “virtual” sense; the mutual information ${\bf I}(X_{E_1} : X_{E_2})$ allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)
However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities associated to just two variables
are those that are also necessarily obeyed by their combinatorial analogues
. (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)
One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let $V$ be some finite-dimensional vector space over a finite field ${\bf F}$, and let $f: V \rightarrow {\bf F}$ be a random linear functional on $V$, selected uniformly among all such functions. Every subspace $W$ of $V$ then gives rise to a random variable $X_W := f|_W$ formed by restricting $f$ to $W$. This random variable is also distributed uniformly amongst all linear functions on $W$, and its entropy can be easily computed to be $\dim(W) \log |{\bf F}|$. Given two random variables $X_{W_1}, X_{W_2}$ formed by restricting $f$ to $W_1, W_2$ respectively, the joint random variable $(X_{W_1}, X_{W_2})$ determines the random linear function $f$ on the union $W_1 \cup W_2$ of the two spaces, and thus by linearity on the Minkowski sum $W_1 + W_2$ as well; thus $(X_{W_1}, X_{W_2})$ is equivalent to the restriction of $f$ to $W_1 + W_2$. In particular, ${\bf H}(X_{W_1}, X_{W_2}) = \dim(W_1 + W_2) \log |{\bf F}|$. This implies that ${\bf I}(X_{W_1} : X_{W_2}) = \dim(W_1 \cap W_2) \log |{\bf F}|$ and also ${\bf H}(X_{W_1} | X_{W_2}) = \dim(\pi_{W_2}(W_1)) \log |{\bf F}|$, where $\pi_{W_2}: V \rightarrow V/W_2$ is the quotient map. After discarding the normalising constant $\log |{\bf F}|$, this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:
Random variables $X_W$ | Subspaces $W$
Entropy ${\bf H}(X_W)$ | Dimension $\dim(W)$
Joint variable $(X_{W_1}, X_{W_2})$ | Sum $W_1 + W_2$
Mutual information ${\bf I}(X_{W_1} : X_{W_2})$ | Dimension of intersection $\dim(W_1 \cap W_2)$
Conditional entropy ${\bf H}(X_{W_1} | X_{W_2})$ | Dimension of projection $\dim(\pi_{W_2}(W_1))$
Conditional mutual information ${\bf I}(X_{W_1} : X_{W_2} | X_{W_3})$ | $\dim(\pi_{W_3}(W_1) \cap \pi_{W_3}(W_2))$
The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking $V$ to be the vector space ${\bf F}_2^\Omega$ over the finite field ${\bf F}_2$ of two elements, and only considering those subspaces $W$ that are coordinate subspaces ${\bf F}_2^E$ associated to various subsets $E$ of $\Omega$.
As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of $\log |{\bf F}|$). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Łoś's theorem on ultraproducts, as done in this previous blog post).
The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces $W_1, W_2, W_3$ such that $W_1$ and $W_2$ are separately transverse to $W_3$, but their sum $W_1 + W_2$ is not; for instance, in a two-dimensional vector space $V = {\bf F}^2$, one can take $W_1, W_2, W_3$ to be the one-dimensional subspaces spanned by $(1,0)$, $(0,1)$, and $(1,1)$ respectively. Note that this is essentially the same counterexample from before (which took ${\bf F}$ to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces $W_1, W_2, W_3$ (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables $X_1, X_2, X_3$ (as well as the corresponding quantities on the above table).
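The three-subspace example can be checked by direct rank computations over ${\bf F}_2$; this is a sketch of my own, with a small Gaussian-elimination helper that treats $0/1$ vectors as integer bitmasks.

```python
def rank_gf2(vectors):
    """Rank over F_2 of a list of vectors encoded as integer bitmasks."""
    basis = {}                      # highest set bit -> basis vector
    for v in vectors:
        while v:
            h = v.bit_length() - 1
            if h in basis:
                v ^= basis[h]       # eliminate the leading bit
            else:
                basis[h] = v
                break
    return len(basis)

def dim_sum(*spaces):
    """dim(W_1 + ... + W_k), each space given by a list of spanning vectors."""
    return rank_gf2([v for W in spaces for v in W])

def dim_cap(A, B):
    """dim(A ∩ B) = dim A + dim B - dim(A + B)."""
    return dim_sum(A) + dim_sum(B) - dim_sum(A, B)

# In F_2^2, encode (1,0) as 0b10, (0,1) as 0b01, (1,1) as 0b11.
W1, W2, W3 = [0b10], [0b01], [0b11]

# W1 and W2 are each transverse to W3 ...
print(dim_cap(W1, W3), dim_cap(W2, W3))   # 0 0
# ... but W1 + W2 is all of F_2^2, so it meets W3 in a line:
print(dim_cap(W1 + W2, W3))               # 1
```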
However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality
$$\dim(W_1 \cap W_2) \leq \dim(\pi_{W_3}(W_1) \cap \pi_{W_3}(W_2)) + \dim(\pi_{W_4}(W_1) \cap \pi_{W_4}(W_2)) + \dim(W_3 \cap W_4)$$
for any subspaces $W_1, W_2, W_3, W_4$. This is easiest to see when the three terms on the right-hand side vanish; then $\pi_{W_3}(W_1)$ and $\pi_{W_3}(W_2)$ are transverse, which implies that $W_1 \cap W_2 \subseteq W_3$; similarly $W_1 \cap W_2 \subseteq W_4$. But $W_3$ and $W_4$ are transverse, and this clearly implies that $W_1$ and $W_2$ are themselves transverse. To prove the general case of Ingleton's inequality, one can define
and use
(and similarly for
instead of
) to reduce to establishing the inequality
which can be rearranged using (and similarly for
instead of
) and
as
but this is clear since .
Returning to the entropy setting, the analogue of (3) is true (exercise!), but the analogue
$${\bf I}(X_1 : X_2) \leq {\bf I}(X_1 : X_2 | X_3) + {\bf I}(X_1 : X_2 | X_4) + {\bf I}(X_3 : X_4) \quad (4)$$
of Ingleton's inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then $X_1, X_2$ are conditionally independent relative to $X_3$, and relative to $X_4$, and $X_3$ and $X_4$ are independent, and the claim (4) would then be asserting that $X_1$ and $X_2$ are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take $X_3, X_4$ to be independent uniform variables from ${\bf F}_2 = \{0,1\}$, and take $X_1$ and $X_2$ to be (say) $X_3 X_4$ and $(1-X_3)(1-X_4)$ respectively (thus $X_1, X_2$ are the indicators of the events $X_3 = X_4 = 1$ and $X_3 = X_4 = 0$ respectively). Once one conditions on either $X_3$ or $X_4$, one of $X_1, X_2$ has positive conditional entropy and the other has zero entropy, and so $X_1, X_2$ are conditionally independent relative to either $X_3$ or $X_4$; also, $X_3$ and $X_4$ are independent of each other. But $X_1$ and $X_2$ are not independent of each other (they cannot be simultaneously equal to $1$). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces $W_1, W_2$ has a well-defined intersection $W_1 \cap W_2$ that is also a subspace, whereas for arbitrary random variables $X_1, X_2$, there does not necessarily exist the analogue of an intersection, namely a “common information” random variable $Z$ that has the entropy of ${\bf I}(X_1 : X_2)$ and is determined either by $X_1$ or by $X_2$.
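The failure of the entropy analogue of Ingleton's inequality on this indicator counterexample can be verified numerically; this is an illustration of my own, not from the original post.

```python
import math
from collections import Counter

def H(dist):
    return sum(p * math.log(1 / p) for p in dist.values() if p > 0)

def marg(joint, idx):
    out = Counter()
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def I(joint, a, b, c=()):
    """Conditional mutual information I(X_a : X_b | X_c)."""
    return (H(marg(joint, a + c)) + H(marg(joint, b + c))
            - H(marg(joint, a + b + c)) - H(marg(joint, c)))

# Coordinates (X1, X2, X3, X4): X3, X4 fair coins,
# X1 = indicator of X3 = X4 = 1, X2 = indicator of X3 = X4 = 0.
joint = Counter()
for x3 in (0, 1):
    for x4 in (0, 1):
        joint[(x3 * x4, (1 - x3) * (1 - x4), x3, x4)] += 0.25

lhs = I(joint, (0,), (1,))                        # I(X1:X2), positive
rhs = (I(joint, (0,), (1,), (2,))                 # I(X1:X2|X3) = 0
       + I(joint, (0,), (1,), (3,))               # I(X1:X2|X4) = 0
       + I(joint, (2,), (3,)))                    # I(X3:X4)    = 0
print(lhs, rhs)   # the left-hand side exceeds the (vanishing) right-hand side
```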
I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality
One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that
whenever was sufficiently large depending on
, where
is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale
between
and
, one can form certain random variables
. The random variable
is a sign pattern of the form
where
is a random number chosen from
to
(with logarithmic weighting). The random variable
was the tuple
of reductions of
to primes
comparable to
. Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of
, the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound
on the mutual information between and
. From translation invariance, this also gives the more general lower bound
for any , where
denotes the shifted sign pattern
. On the other hand, one had the entropy bounds
and from concatenating sign patterns one could see that is equivalent to the joint random variable
for any
. Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once
was allowed to become sufficiently large compared to
, but the bound was quite weak (coming ultimately from the unboundedness of
as the interval
of values of
under consideration becomes large), something of the order of
; the quantity
needs at various junctures to be less than a small power of
, so the relationship between
and
becomes essentially quadruple exponential in nature,
. The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate
of the mean entropy, in that this quantity decreased by
as
increased from
to
, basically by dividing
into
components
,
and observing from (6) each of these shares a bit of common information with the same variable
. This is relatively clear when one works in a set model, in which
is modeled by a set
of size
, and
is modeled by a set of the form
for various sets of size
(also there is some translation symmetry that maps
to a shift
while preserving all of the
).
However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables are basically jointly independent as
ranges over dyadic values that are much smaller than
, which in the set model corresponds to the
all being disjoint. One can then establish a variant
of (6), which in the set model roughly speaking asserts that each claims a portion of the
of cardinality
that is not claimed by previous choices of
. This leads to a more efficient contradiction (relying on the unboundedness of
rather than
) that looks like it removes one order of exponential growth, thus the relationship between
and
is now
. Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form
for a small constant , which on iterating and using the boundedness of
gives the claim. (A modification of this analysis, at least at the level of back-of-the-envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges
greater than
or so, although at this range the theorem is not significantly simpler than the general case).
Klaus Roth, who made fundamental contributions to analytic number theory, died this Tuesday, aged 90.
I never met or communicated with Roth personally, but was certainly influenced by his work; he wrote relatively few papers, but they tended to have outsized impact. For instance, he was one of the key people (together with Bombieri) to work on simplifying and generalising the large sieve, taking it from the technically formidable original formulation of Linnik and Rényi to the clean and general almost orthogonality principle that we have today (discussed for instance in these lecture notes of mine). The paper of Roth that had the most impact on my own personal work was his three-page paper proving what is now known as Roth’s theorem on arithmetic progressions:
Theorem 1 (Roth’s theorem on arithmetic progressions) Let
be a set of natural numbers of positive upper density (thus
). Then
contains infinitely many arithmetic progressions
of length three (with
non-zero of course).
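As a toy illustration of the theorem's content (my own code, and of course nothing like Roth's Fourier-analytic method), a brute-force search finds many three-term progressions in a dense set, while a very sparse set such as the powers of two contains none:

```python
def three_term_aps(A):
    """All progressions (a, a+r, a+2r) with r > 0 contained in the set A."""
    S, M = set(A), max(A)
    return [(a, a + r, a + 2 * r)
            for a in sorted(S)
            for r in range(1, (M - a) // 2 + 1)
            if a + r in S and a + 2 * r in S]

# A set of density 2/3 (the non-multiples of 3) contains many 3-term APs ...
A = [n for n in range(1, 101) if n % 3 != 0]
print(len(three_term_aps(A)))   # a large count

# ... while the sparse set of powers of 2 contains none:
# 2^i + 2^k = 2^(j+1) has no solutions with i < j < k.
B = [2**k for k in range(20)]
print(three_term_aps(B))        # []
```

Of course, Roth's theorem is about sets of positive density; sparse sets like the powers of two are not covered by it, which is why they can avoid progressions entirely.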
At the heart of Roth’s elegant argument was the following (surprising at the time) dichotomy: if had some moderately large density within some arithmetic progression
, either one could use Fourier-analytic methods to detect the presence of an arithmetic progression of length three inside
, or else one could locate a long subprogression
of
on which
had increased density. Iterating this dichotomy by an argument now known as the density increment argument, one eventually obtains Roth’s theorem, no matter which side of the dichotomy actually holds. This argument (and the many descendants of it), based on various “dichotomies between structure and randomness”, became essential in many other results of this type, most famously perhaps in Szemerédi’s proof of his celebrated theorem on arithmetic progressions that generalised Roth’s theorem to progressions of arbitrary length. More recently, my recent work on the Chowla and Elliott conjectures that was a crucial component of the solution of the Erdös discrepancy problem, relies on an entropy decrement argument which was directly inspired by the density increment argument of Roth.
The Erdös discrepancy problem also is connected with another well known theorem of Roth:
Theorem 2 (Roth’s discrepancy theorem for arithmetic progressions) Let
be a sequence in
. Then there exists an arithmetic progression
in
with
positive such that
for an absolute constant
.
In fact, Roth proved a stronger estimate regarding mean square discrepancy, which I am not writing down here; as with the Roth theorem in arithmetic progressions, his proof was short and Fourier-analytic in nature (although non-Fourier-analytic proofs have since been found, for instance the semidefinite programming proof of Lovasz). The exponent $1/4$ is known to be sharp (a result of Matousek and Spencer).
As a particular corollary of the above theorem, for an infinite sequence $f(1), f(2), \dots$ of signs, the sums $\sum_{j=0}^{k-1} f(a+jr)$ are unbounded in $a, r, k$. The Erdös discrepancy problem asks whether the same statement holds when $a$ is restricted to be zero. (Roth also established discrepancy theorems for other sets, such as rectangles, which will not be discussed here.)
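The discrepancy bound is easy to probe empirically; the following exhaustive search over small progressions is my own sketch (the normalisation is illustrative, not Roth's).

```python
import random

def ap_discrepancy(signs):
    """max over a, r >= 1, k of |sum_{j<k} signs[a + j*r]| inside [1, n]."""
    n = len(signs) - 1              # signs[1..n]; signs[0] unused
    best = 0
    for r in range(1, n + 1):
        for a in range(1, n + 1):
            s = 0
            for m in range(a, n + 1, r):   # running sums along the AP
                s += signs[m]
                best = max(best, abs(s))
    return best

random.seed(1)
n = 200
signs = [0] + [random.choice([-1, 1]) for _ in range(n)]
disc = ap_discrepancy(signs)
print(disc, n ** 0.25)   # the discrepancy comfortably exceeds n^(1/4)
```

Even random signs exhibit discrepancy well above $n^{1/4}$ (indeed roughly of order $\sqrt{n}$ along the trivial progression $r = 1$); the content of Roth's theorem is that no clever choice of signs can do better than $n^{1/4}$.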
Finally, one has to mention Roth’s most famous result, cited for instance in his Fields medal citation:
Theorem 3 (Roth’s theorem on Diophantine approximation) Let
be an irrational algebraic number. Then for any
there is a quantity
such that
From the Dirichlet approximation theorem (or from the theory of continued fractions) we know that the exponent $2+\epsilon$ in the denominator cannot be reduced to $2$ or below. A classical and easy theorem of Liouville gives the claim with the exponent $2+\epsilon$ replaced by the degree of the algebraic number $\alpha$; work of Thue and Siegel reduced this exponent, but Roth was the one who obtained the near-optimal result. An important point is that the constant $c$
is ineffective – it is a major open problem in Diophantine approximation to produce any bound significantly stronger than Liouville’s theorem with effective constants. This is because the proof of Roth’s theorem does not exclude any single rational
from being close to
, but instead very ingeniously shows that one cannot have two different rationals
,
that are unusually close to
, even when the denominators
are very different in size. (I refer to this sort of argument as a “dueling conspiracies” argument; they are strangely prevalent throughout analytic number theory.)
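One can watch this exponent phenomenon numerically for the quadratic irrational $\sqrt{2}$ (which is already covered by the classical theory, since its degree is $2$); this is a numerical sketch of my own. The quantity $q^2 |\alpha - p/q|$, minimised over the best numerator $p$ for each $q$, stays bounded away from zero:

```python
import math

alpha = math.sqrt(2)

# For each q, the best rational approximation p/q has p = round(alpha * q);
# track the smallest value of q^2 * |alpha - p/q| over a range of q.
worst = min((q * q * abs(alpha - round(alpha * q) / q), q)
            for q in range(1, 10001))
print(worst)   # stays above roughly 1/3: no sub-quadratic approximation
```

This is consistent with the theory of continued fractions, which shows that $q^2 |\sqrt{2} - p/q|$ tends to $1/(2\sqrt{2}) \approx 0.354$ along the convergents and never gets much smaller.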
The most fundamental unsolved problem in complexity theory is undoubtedly the P=NP problem, which asks (roughly speaking) whether a problem which can be solved by a non-deterministic polynomial-time (NP) algorithm, can also be solved by a deterministic polynomial-time (P) algorithm. The general belief is that $P \neq NP$, i.e. there exist problems which can be solved by non-deterministic polynomial-time algorithms but not by deterministic polynomial-time algorithms.
One reason why the question is so difficult to resolve is that a certain generalisation of this question has an affirmative answer in some cases, and a negative answer in other cases. More precisely, if we give all the algorithms access to an oracle, then for one choice $A$ of this oracle, all the problems that are solvable by non-deterministic polynomial-time algorithms that call $A$ (the class $NP^A$) can also be solved by a deterministic polynomial-time algorithm that calls $A$ (the class $P^A$), thus $P^A = NP^A$; but for another choice $B$ of this oracle, there exist problems solvable by non-deterministic polynomial-time algorithms that call $B$, which cannot be solved by a deterministic polynomial-time algorithm that calls $B$, thus $P^B \neq NP^B$. One particular consequence of this result (which is due to Baker, Gill, and Solovay) is that there cannot be any relativisable proof of either $P = NP$ or $P \neq NP$, where “relativisable” means that the proof would also work without any changes in the presence of an oracle.
The Baker-Gill-Solovay result was quite surprising, but the idea of the proof turns out to be rather simple. To get an oracle A such that P^A = NP^A, one basically sets A to be a powerful simulator that can simulate non-deterministic machines (and, furthermore, can also simulate itself); it turns out that any PSPACE-complete oracle would suffice for this task. To get an oracle B for which P^B ≠ NP^B, one has to be a bit sneakier, setting B to be a query device for a sparse set of random (or high-complexity) strings, which are too complex to be guessed at by any deterministic polynomial-time algorithm.
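The heart of the second construction can be seen in a toy Monte Carlo experiment (this is only an illustration of the counting intuition, not the actual diagonalization; the parameter choices here are my own):

```python
import random

random.seed(0)

# Toy illustration of the "sneaky" oracle B: the oracle accepts exactly one
# secret n-bit string.  A non-deterministic machine can guess the secret in a
# single step, but a deterministic machine limited to polynomially many oracle
# queries almost never stumbles on it among the 2^n candidates.
def deterministic_search_succeeds(n: int, budget: int) -> bool:
    secret = random.getrandbits(n)                    # the lone accepted string
    queries = random.sample(range(2 ** n), budget)    # any fixed poly-size query plan
    return secret in queries

n, budget = 20, 400        # budget ~ n^2 queries versus 2^20 candidate strings
trials = 1000
hits = sum(deterministic_search_succeeds(n, budget) for _ in range(trials))
print(hits, "successes in", trials, "trials")   # success rate ~ budget / 2^n
```

Since the secret is independent of any fixed query plan, the success probability per trial is exactly budget/2^n, which vanishes faster than any polynomial advantage can recover.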
Unfortunately, the simple idea of the proof can be obscured by various technical details (e.g. using Turing machines to define P and NP precisely), which require a certain amount of time to properly absorb. To help myself try to understand this result better, I have decided to give a sort of “allegory” of the proof, based around a (rather contrived) story about various students trying to pass a multiple choice test, which avoids all the technical details but still conveys the basic ideas of the argument. This allegory was primarily for my own benefit, but I thought it might also be of interest to some readers here (and also has some tangential relation to the proto-polymath project of deterministically finding primes), so I reproduce it below.
This week I am in Bremen, where the 50th International Mathematical Olympiad is being held. A number of former Olympians (Béla Bollobás, Tim Gowers, László Lovász, Stas Smirnov, Jean-Christophe Yoccoz, and myself) were invited to give a short talk (20 minutes in length) at the celebratory event for this anniversary. I chose to talk on a topic I have spoken about several times before, on “Structure and randomness in the prime numbers“. Given the time constraints, there was a limit as to how much substance I could put into the talk; but I tried to describe, in very general terms, what we know about the primes, and what we suspect to be true, but cannot yet establish. As I have mentioned in previous talks, the key problem is that we suspect the distribution of the primes to obey no significant patterns (other than “local” structure, such as having a strong tendency to be odd (which is local information at the 2 place), or obeying the prime number theorem (which is local information at the infinity place)), but we still do not have fully satisfactory tools for establishing the absence of a pattern. (This is in contrast with many types of Olympiad problems, where the key to solving a problem often lies in discovering the right pattern or structure in the problem to exploit.)
The PDF of the talk is here; I decided to try out the Beamer LaTeX package for a change.
A remarkable phenomenon in probability theory is that of universality – that many seemingly unrelated probability distributions, which ostensibly involve large numbers of unknown parameters, can end up converging to a universal law that may only depend on a small handful of parameters. One of the most famous examples of the universality phenomenon is the central limit theorem; another rich source of examples comes from random matrix theory, which is one of the areas of my own research.
Analogous universality phenomena also show up in empirical distributions – the distributions of a statistic X from a large population of “real-world” objects. Examples include Benford’s law, Zipf’s law, and the Pareto distribution (of which the Pareto principle or 80-20 law is a special case). These laws govern the asymptotic distribution of many statistics X which
- (i) take values as positive numbers;
- (ii) range over many different orders of magnitude;
- (iii) arise from a complicated combination of largely independent factors (with different samples of X arising from different independent factors); and
- (iv) have not been artificially rounded, truncated, or otherwise constrained in size.
Examples here include the population of countries or cities, the frequency of occurrence of words in a language, the mass of astronomical objects, or the net worth of individuals or corporations. The laws are then as follows:
- Benford’s law: For d = 1, …, 9, the proportion of X whose first digit is d is approximately log_10((d+1)/d). Thus, for instance, X should have a first digit of 1 about 30% of the time, but a first digit of 9 only about 5% of the time.
- Zipf’s law: The k-th largest value of X should obey an approximate power law, i.e. it should be approximately C/k^α for the first few k = 1, 2, 3, … and some parameters C, α > 0. In many cases, α is close to 1.
- Pareto distribution: The proportion of X with at least k digits (before the decimal point), where k is above the median number of digits, should obey an approximate exponential law, i.e. be approximately of the form c/10^{k/α} for some c, α > 0. Again, in many cases α is close to 1.
Benford’s law and the Pareto distribution are stated here for base 10, which is what we are most familiar with, but the laws hold for any base (after replacing all the occurrences of 10 in the above laws with the new base, of course). The laws tend to break down if the hypotheses (i)-(iv) are dropped. For instance, if the statistic X concentrates around its mean (as opposed to being spread over many orders of magnitude), then the normal distribution tends to be a much better model (as indicated by such results as the central limit theorem). If instead the various samples of the statistic are highly correlated with each other, then other laws can arise (for instance, the eigenvalues of a random matrix, as well as many empirically observed matrices, are correlated with each other, with the behaviour of the largest eigenvalues being governed by laws such as the Tracy-Widom law rather than Zipf’s law, and the bulk distribution being governed by laws such as the semicircular law rather than the normal or Pareto distributions).
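Hypothesis (iii) can be checked numerically: a statistic built as a product of many independent factors has a base-10 logarithm that is a long sum, so log_10(X) mod 1 equidistributes, which is exactly Benford's law for the first digit. Here is a quick sketch (the particular factor distribution is an arbitrary choice of mine, not anything from the data set):

```python
import math, random

random.seed(0)

# A toy statistic formed as a product of many largely independent factors:
# its logarithm is a long sum of independent terms, so log10(X) mod 1 is
# nearly uniform, and the leading digit follows Benford's law.
def sample() -> float:
    x = 1.0
    for _ in range(30):                    # 30 independent multiplicative factors
        x *= random.uniform(0.5, 2.0)      # the factor law is an arbitrary choice here
    return x

N = 100_000
counts = [0] * 10
for _ in range(N):
    counts[int(f"{sample():e}"[0])] += 1   # leading significant digit

benford = [0.0] + [math.log10((d + 1) / d) for d in range(1, 10)]
print(counts[1] / N, benford[1])           # both close to 0.301
```

By contrast, replacing the product by a sum of the same factors concentrates X around its mean, and the first-digit distribution then deviates wildly from Benford.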
To illustrate these laws, let us take as a data set the populations of 235 countries and regions of the world in 2007 (using the CIA world factbook); I have put the raw data here. This is a relatively small sample (cf. my previous post), but is already enough to discern these laws in action. For instance, here is how the data set tracks with Benford’s law (rounded to three significant figures):
| First digit | Countries | Number | Benford prediction |
| --- | --- | --- | --- |
| 1 | Angola, Anguilla, Aruba, Bangladesh, Belgium, Botswana, Brazil, Burkina Faso, Cambodia, Cameroon, Chad, Chile, China, Christmas Island, Cook Islands, Cuba, Czech Republic, Ecuador, Estonia, Gabon, (The) Gambia, Greece, Guam, Guatemala, Guinea-Bissau, India, Japan, Kazakhstan, Kiribati, Malawi, Mali, Mauritius, Mexico, (Federated States of) Micronesia, Nauru, Netherlands, Niger, Nigeria, Niue, Pakistan, Portugal, Russia, Rwanda, Saint Lucia, Saint Vincent and the Grenadines, Senegal, Serbia, Swaziland, Syria, Timor-Leste (East-Timor), Tokelau, Tonga, Trinidad and Tobago, Tunisia, Tuvalu, (U.S.) Virgin Islands, Wallis and Futuna, Zambia, Zimbabwe | 59 (25.1%) | 71 (30.1%) |
| 2 | Armenia, Australia, Barbados, British Virgin Islands, Cote d’Ivoire, French Polynesia, Ghana, Gibraltar, Indonesia, Iraq, Jamaica, (North) Korea, Kosovo, Kuwait, Latvia, Lesotho, Macedonia, Madagascar, Malaysia, Mayotte, Mongolia, Mozambique, Namibia, Nepal, Netherlands Antilles, New Caledonia, Norfolk Island, Palau, Peru, Romania, Saint Martin, Samoa, San Marino, Sao Tome and Principe, Saudi Arabia, Slovenia, Sri Lanka, Svalbard, Taiwan, Turks and Caicos Islands, Uzbekistan, Vanuatu, Venezuela, Yemen | 44 (18.7%) | 41 (17.6%) |
| 3 | Afghanistan, Albania, Algeria, (The) Bahamas, Belize, Brunei, Canada, (Rep. of the) Congo, Falkland Islands (Islas Malvinas), Iceland, Kenya, Lebanon, Liberia, Liechtenstein, Lithuania, Maldives, Mauritania, Monaco, Morocco, Oman, (Occupied) Palestinian Territory, Panama, Poland, Puerto Rico, Saint Kitts and Nevis, Uganda, United States of America, Uruguay, Western Sahara | 29 (12.3%) | 29 (12.5%) |
| 4 | Argentina, Bosnia and Herzegovina, Burma (Myanmar), Cape Verde, Cayman Islands, Central African Republic, Colombia, Costa Rica, Croatia, Faroe Islands, Georgia, Ireland, (South) Korea, Luxembourg, Malta, Moldova, New Zealand, Norway, Pitcairn Islands, Singapore, South Africa, Spain, Sudan, Suriname, Tanzania, Ukraine, United Arab Emirates | 27 (11.5%) | 22 (9.69%) |
| 5 | (Macao SAR) China, Cocos Islands, Denmark, Djibouti, Eritrea, Finland, Greenland, Italy, Kyrgyzstan, Montserrat, Nicaragua, Papua New Guinea, Slovakia, Solomon Islands, Togo, Turkmenistan | 16 (6.81%) | 19 (7.92%) |
| 6 | American Samoa, Bermuda, Bhutan, (Dem. Rep. of the) Congo, Equatorial Guinea, France, Guernsey, Iran, Jordan, Laos, Libya, Marshall Islands, Montenegro, Paraguay, Sierra Leone, Thailand, United Kingdom | 17 (7.23%) | 16 (6.69%) |
| 7 | Bahrain, Bulgaria, (Hong Kong SAR) China, Comoros, Cyprus, Dominica, El Salvador, Guyana, Honduras, Israel, (Isle of) Man, Saint Barthelemy, Saint Helena, Saint Pierre and Miquelon, Switzerland, Tajikistan, Turkey | 17 (7.23%) | 14 (5.80%) |
| 8 | Andorra, Antigua and Barbuda, Austria, Azerbaijan, Benin, Burundi, Egypt, Ethiopia, Germany, Haiti, Holy See (Vatican City), Northern Mariana Islands, Qatar, Seychelles, Vietnam | 15 (6.38%) | 12 (5.12%) |
| 9 | Belarus, Bolivia, Dominican Republic, Fiji, Grenada, Guinea, Hungary, Jersey, Philippines, Somalia, Sweden | 11 (4.68%) | 11 (4.58%) |
Here is how the same data tracks Zipf’s law for the first twenty values of k, with the parameters C = 1.28 × 10^9 and α = 1.03 (selected by log-linear regression), again rounding to three significant figures:
| k | Country | Population | Zipf prediction | Deviation from prediction |
| --- | --- | --- | --- | --- |
| 1 | China | 1,330,000,000 | 1,280,000,000 | +3.91% |
| 2 | India | 1,150,000,000 | 626,000,000 | +83.7% |
| 3 | USA | 304,000,000 | 412,000,000 | -26.2% |
| 4 | Indonesia | 238,000,000 | 307,000,000 | -22.5% |
| 5 | Brazil | 196,000,000 | 244,000,000 | -19.7% |
| 6 | Pakistan | 173,000,000 | 202,000,000 | -14.4% |
| 7 | Bangladesh | 154,000,000 | 172,000,000 | -10.5% |
| 8 | Nigeria | 146,000,000 | 150,000,000 | -2.67% |
| 9 | Russia | 141,000,000 | 133,000,000 | +6.02% |
| 10 | Japan | 128,000,000 | 120,000,000 | +6.67% |
| 11 | Mexico | 110,000,000 | 108,000,000 | +1.85% |
| 12 | Philippines | 96,100,000 | 98,900,000 | -2.83% |
| 13 | Vietnam | 86,100,000 | 91,100,000 | -5.49% |
| 14 | Ethiopia | 82,600,000 | 84,400,000 | -2.13% |
| 15 | Germany | 82,400,000 | 78,600,000 | +4.83% |
| 16 | Egypt | 81,700,000 | 73,500,000 | +11.2% |
| 17 | Turkey | 71,900,000 | 69,100,000 | +4.05% |
| 18 | Congo | 66,500,000 | 65,100,000 | +2.15% |
| 19 | Iran | 65,900,000 | 61,600,000 | +6.98% |
| 20 | Thailand | 65,500,000 | 58,400,000 | +12.2% |
As one sees, Zipf’s law is not particularly precise at the extreme edge of the statistics (when k is very small), but becomes reasonably accurate (given the small sample size, and given that we are fitting twenty data points using only two parameters) for moderate sizes of k.
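The log-linear regression behind those parameters is a one-liner of least squares: fit log(population) = log C − α log k against the twenty populations in the table. A minimal sketch:

```python
import math

# Populations of the top twenty countries, taken from the table above.
pops = [1_330_000_000, 1_150_000_000, 304_000_000, 238_000_000, 196_000_000,
        173_000_000, 154_000_000, 146_000_000, 141_000_000, 128_000_000,
        110_000_000, 96_100_000, 86_100_000, 82_600_000, 82_400_000,
        81_700_000, 71_900_000, 66_500_000, 65_900_000, 65_500_000]

# Ordinary least squares for log(p_k) = log C - alpha * log k.
xs = [math.log(k) for k in range(1, 21)]
ys = [math.log(p) for p in pops]
mx, my = sum(xs) / 20, sum(ys) / 20
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
alpha = -slope
C = math.exp(my - slope * mx)
print(round(alpha, 2), f"{C:.3g}")   # alpha ≈ 1.03, C ≈ 1.28e9
```

Running this recovers α ≈ 1.03 and C ≈ 1.28 × 10^9, matching the fit used for the Zipf predictions above.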
This data set has too few scales in base 10 to illustrate the Pareto distribution effectively – over half of the country populations are either seven or eight digits in that base. But if we instead work in base 2, then country populations range over a decent number of scales, and we begin to see the law emerge, where k is now the number of digits in binary, and the best-fit parameters c, α are again selected by log-linear regression:
| k | Countries whose population has at least k binary digits | Number | Pareto prediction |
| --- | --- | --- | --- |
| 31 | China, India | 2 | 1 |
| 30 | ” | 2 | 2 |
| 29 | ”, United States of America | 3 | 5 |
| 28 | ”, Indonesia, Brazil, Pakistan, Bangladesh, Nigeria, Russia | 9 | 8 |
| 27 | ”, Japan, Mexico, Philippines, Vietnam, Ethiopia, Germany, Egypt, Turkey | 17 | 15 |
| 26 | ”, (Dem. Rep. of the) Congo, Iran, Thailand, France, United Kingdom, Italy, South Africa, (South) Korea, Burma (Myanmar), Ukraine, Colombia, Spain, Argentina, Sudan, Tanzania, Poland, Kenya, Morocco, Algeria | 36 | 27 |
| 25 | ”, Canada, Afghanistan, Uganda, Nepal, Peru, Iraq, Saudi Arabia, Uzbekistan, Venezuela, Malaysia, (North) Korea, Ghana, Yemen, Taiwan, Romania, Mozambique, Sri Lanka, Australia, Cote d’Ivoire, Madagascar, Syria, Cameroon | 58 | 49 |
| 24 | ”, Netherlands, Chile, Kazakhstan, Burkina Faso, Cambodia, Malawi, Ecuador, Niger, Guatemala, Senegal, Angola, Mali, Zambia, Cuba, Zimbabwe, Greece, Portugal, Belgium, Tunisia, Czech Republic, Rwanda, Serbia, Chad, Hungary, Guinea, Belarus, Somalia, Dominican Republic, Bolivia, Sweden, Haiti, Burundi, Benin | 91 | 88 |
| 23 | ”, Austria, Azerbaijan, Honduras, Switzerland, Bulgaria, Tajikistan, Israel, El Salvador, (Hong Kong SAR) China, Paraguay, Laos, Sierra Leone, Jordan, Libya, Papua New Guinea, Togo, Nicaragua, Eritrea, Denmark, Slovakia, Kyrgyzstan, Finland, Turkmenistan, Norway, Georgia, United Arab Emirates, Singapore, Bosnia and Herzegovina, Croatia, Central African Republic, Moldova, Costa Rica | 123 | 159 |
Thus, with each new scale, the number of countries introduced increases by a factor of a little less than 2, on the average. This approximate doubling of countries with each new scale begins to falter at about the population 2^23 (i.e. at around 8 million), for the simple reason that one has begun to run out of countries. (Note that the median-population country in this set, Singapore, has a population with 23 binary digits.)
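Counting "binary digits" of a population is simply taking its bit length, which is how a table like the one above can be assembled mechanically. A small sketch (the Singapore figure below is an assumed round 2007 value, not a number from the table):

```python
# The number of binary digits of a population is just its bit length.
pops = {
    "China": 1_330_000_000,        # from the Zipf table above
    "India": 1_150_000_000,
    "United States": 304_000_000,
    "Singapore": 4_600_000,        # assumed approximate 2007 figure
}
digits = {name: p.bit_length() for name, p in pops.items()}
print(digits)   # China and India land in the 31-digit row, the USA in the 29-digit row
```

Tallying `digits` over all 235 countries, and then cumulating the counts from the largest k downwards, reproduces the "Number" column of the Pareto table.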
These laws are not merely interesting statistical curiosities; for instance, Benford’s law is often used to help detect fraudulent statistics (such as those arising from accounting fraud), as many such statistics are invented by choosing digits at random, and will therefore deviate significantly from Benford’s law. (This is nicely discussed in Robert Matthews’ New Scientist article “The power of one“; this article can also be found on the web at a number of other places.) In a somewhat analogous spirit, Zipf’s law and the Pareto distribution can be used to mathematically test various models of real-world systems (e.g. formation of astronomical objects, accumulation of wealth, population growth of countries, etc.), without necessarily having to fit all the parameters of that model with the actual data.
Being empirically observed phenomena rather than abstract mathematical facts, Benford’s law, Zipf’s law, and the Pareto distribution cannot be “proved” the same way a mathematical theorem can be proved. However, one can still support these laws mathematically in a number of ways, for instance by showing how these laws are compatible with each other, and with other plausible hypotheses on the source of the data. In this post I would like to describe a number of ways (both technical and non-technical) in which one can do this; these arguments do not fully explain these laws (in particular, the empirical fact that the exponent α in Zipf’s law or the Pareto distribution is often close to 1 is still quite a mysterious phenomenon), and do not always have the same universal range of applicability as these laws seem to have, but I hope that they do demonstrate that these laws are not completely arbitrary, and ought to have a satisfactory basis of mathematical support. Read the rest of this entry »
One further paper in this stream: László Erdős, José Ramírez, Benjamin Schlein, Van Vu, Horng-Tzer Yau, and myself have just uploaded to the arXiv the paper “Bulk universality for Wigner hermitian matrices with subexponential decay“, submitted to Mathematical Research Letters. (Incidentally, this is the first six-author paper I have been involved in, not counting the polymath projects of course, though I have had a number of five-author papers.)
This short paper (9 pages) combines the machinery from two recent papers on the universality conjecture for the eigenvalue spacings in the bulk for Wigner random matrices (see my earlier blog post for more discussion). On the one hand, the paper of Erdős-Ramírez-Schlein-Yau established this conjecture under the additional hypothesis that the distribution of the individual entries obeyed some smoothness and exponential decay conditions. Meanwhile, the paper of Van Vu and myself (which I discussed in my earlier blog post) established the conjecture under a somewhat different set of hypotheses, namely that the distribution of the individual entries obeyed some moment conditions (in particular, the third moment had to vanish), a support condition (the entries had to have real part supported in at least three points), and an exponential decay condition.
After comparing our results, the six of us realised that our methods could in fact be combined rather easily to obtain a stronger result, establishing the universality conjecture assuming only an exponential decay (or more precisely, sub-exponential decay) bound on the coefficients; thus all regularity, moment, and support conditions have been eliminated. (There is one catch, namely that we can no longer control a single spacing λ_{i+1} − λ_i for a single fixed i, but must now average over all i before recovering the universality. This is an annoying technical issue but it may be resolvable in the future with further refinements to the method.)
I can describe the main idea behind the unified approach here. One can arrange the Wigner matrices in a hierarchy, from most structured to least structured:
- The most structured (or special) ensemble is the Gaussian Unitary Ensemble (GUE), in which the coefficients are gaussian. Here, one has very explicit and tractable formulae for the eigenvalue distributions, gap spacing, etc.
- The next most structured ensemble of Wigner matrices are the Gaussian-divisible or Johansson matrices, which are matrices H of the form
, where
is another Wigner matrix, V is a GUE matrix independent of
, and
is a fixed parameter independent of n. Here, one still has quite explicit (though not quite as tractable) formulae for the joint eigenvalue distribution and related statistics. Note that the limiting case t=1 is GUE.
- After this, one has the Ornstein-Uhlenbeck-evolved matrices, which are also of the form
, but now
decays at a power rate with n, rather than being comparable to 1. Explicit formulae still exist for these matrices, but extracting universality out of this is hard work (and occupies the bulk of the paper of Erdős-Ramírez-Schlein-Yau).
- Finally, one has arbitrary Wigner matrices, which can be viewed as the t=0 limit of the above Ornstein-Uhlenbeck process.
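One reason the interpolation H = (1−t)^{1/2} Ĥ + t^{1/2} V is so convenient is that it preserves the variance of each matrix entry for every value of t, so the whole hierarchy sits inside one normalisation. A quick numerical sanity check at the level of a single (real, Gaussian stand-in) entry:

```python
import math, random

random.seed(0)

# One real entry of H = (1-t)^{1/2} Hhat + t^{1/2} V: mixing a unit-variance
# Wigner entry with an independent Gaussian (GUE) entry this way keeps the
# entry variance equal to 1 for every t, and t = 1 is pure GUE.
def entry(t: float) -> float:
    hhat = random.gauss(0, 1)   # stand-in for a unit-variance Wigner entry
    v = random.gauss(0, 1)      # independent Gaussian entry
    return math.sqrt(1 - t) * hhat + math.sqrt(t) * v

for t in (0.01, 0.3, 1.0):
    samples = [entry(t) for _ in range(200_000)]
    var = sum(x * x for x in samples) / len(samples)
    print(t, round(var, 2))     # variance stays ≈ 1 for every t
```

(Using a Gaussian for the Ĥ entry is of course only for the simulation; the point of the hierarchy is precisely that Ĥ may be an arbitrary Wigner matrix.)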
The arguments in the paper of Erdős-Ramírez-Schlein-Yau can be summarised as follows (I assume subexponential decay throughout this discussion):
- (Structured case) The universality conjecture is true for Ornstein-Uhlenbeck-evolved matrices with t = n^{−1+ε} for any fixed ε > 0. (An intermediate range of t was treated in an earlier paper of Erdős-Ramírez-Schlein-Yau, while the case where t is comparable to 1 was treated by Johansson.)
- (Matching) Every Wigner matrix with suitable smoothness conditions can be “matched” with an Ornstein-Uhlenbeck-evolved matrix, in the sense that the eigenvalue statistics for the two matrices are asymptotically identical. (This is relatively easy due to the fact that t can be taken arbitrarily close to zero.)
- Combining 1. and 2. one obtains universality for all Wigner matrices obeying suitable smoothness conditions.
The arguments in the paper of Van and myself can be summarised as follows:
- (Structured case) The universality conjecture is true for Johansson matrices, by the paper of Johansson.
- (Matching) Every Wigner matrix with some moment and support conditions can be “matched” with a Johansson matrix, in the sense that the first four moments of the entries agree, and hence (by the Lindeberg strategy in our paper) have asymptotically identical statistics.
- Combining 1. and 2. one obtains universality for all Wigner matrices obeying suitable moment and support conditions.
What we realised is that by combining the hard part 1. of the paper of Erdős-Ramírez-Schlein-Yau with the hard part 2. of the paper of Van and myself, we can remove all regularity, moment, and support conditions. Roughly speaking, the unified argument proceeds as follows:
- (Structured case) By the arguments of Erdős-Ramírez-Schlein-Yau, the universality conjecture is true for Ornstein-Uhlenbeck-evolved matrices with t = n^{−1+ε} for any fixed ε > 0.
- (Matching) Every Wigner matrix can be “matched” with an Ornstein-Uhlenbeck-evolved matrix with t = n^{−1+ε} (say), in the sense that the first four moments of the entries almost agree, which is enough (by the arguments of Van and myself) to show that these two matrices have asymptotically identical statistics on the average.
- Combining 1. and 2. one obtains universality for the averaged statistics for all Wigner matrices.
The averaging should be removable, but this would require better convergence results to the semicircular law than are currently known (except with additional hypotheses, such as vanishing third moment). The subexponential decay should also be relaxed to the finiteness of some fixed moment, but we did not pursue this direction in order to keep the paper short.
One of the most important topological concepts in analysis is that of compactness (as discussed for instance in my Companion article on this topic). There are various flavours of this concept, but let us focus on sequential compactness: a subset E of a topological space X is sequentially compact if every sequence in E has a convergent subsequence whose limit is also in E. This property allows one to do many things with the set E. For instance, it allows one to maximise a functional on E:
Proposition 1. (Existence of extremisers) Let E be a non-empty sequentially compact subset of a topological space X, and let F: E → R be a continuous function. Then the supremum sup_{x ∈ E} F(x) is attained at at least one point x* ∈ E, thus F(x) ≤ F(x*) for all x ∈ E. (In particular, this supremum is finite.) Similarly for the infimum.
Proof. Let L be the supremum L := sup_{x ∈ E} F(x). By the definition of supremum (and the axiom of (countable) choice), one can find a sequence x^{(n)} in E such that F(x^{(n)}) → L. By compactness, we can refine this sequence to a subsequence (which, by abuse of notation, we shall continue to call x^{(n)}) such that x^{(n)} converges to a limit x in E. Since we still have F(x^{(n)}) → L, and F is continuous at x, we conclude that F(x) = L, and the claim for the supremum follows. The claim for the infimum is similar.
Remark 1. An inspection of the argument shows that one can relax the continuity hypothesis on F somewhat: to attain the supremum, it suffices that F be upper semicontinuous, and to attain the infimum, it suffices that F be lower semicontinuous.
We thus see that sequential compactness is useful, among other things, for ensuring the existence of extremisers. In finite-dimensional spaces (such as vector spaces), compact sets are plentiful; indeed, the Heine-Borel theorem asserts that every closed and bounded set is compact. However, once one moves to infinite-dimensional spaces, such as function spaces, then the Heine-Borel theorem fails quite dramatically; most of the closed and bounded sets one encounters in a topological vector space are non-compact, if one insists on using a reasonably “strong” topology. This causes a difficulty in (among other things) calculus of variations, which is often concerned with finding extremisers to a functional on a subset E of an infinite-dimensional function space X.
In recent decades, mathematicians have found a number of ways to get around this difficulty. One of them is to weaken the topology to recover compactness, taking advantage of such results as the Banach-Alaoglu theorem (or its sequential counterpart). Of course, there is a tradeoff: weakening the topology makes compactness easier to attain, but makes the continuity of F harder to establish. Nevertheless, if F enjoys enough “smoothing” or “cancellation” properties, one can hope to obtain continuity in the weak topology, allowing one to do things such as locate extremisers. (The phenomenon that cancellation can lead to continuity in the weak topology is sometimes referred to as compensated compactness.)
Another option is to abandon trying to make all sequences have convergent subsequences, and settle just for extremising sequences to have convergent subsequences, as this would still be enough to retain Proposition 1. Pursuing this line of thought leads to the Palais-Smale condition, which is a substitute for compactness in some calculus of variations situations.
But in many situations, one cannot weaken the topology to the point where the domain E becomes compact, without destroying the continuity (or semi-continuity) of F, though one can often at least find an intermediate topology (or metric) in which F is continuous, but for which E is still not quite compact. Thus one can find sequences x^{(n)} in E which do not have any subsequences that converge to a constant element x ∈ E, even in this intermediate metric. (As we shall see shortly, one major cause of this failure of compactness is the existence of a non-trivial action of a non-compact group G on E; such a group action can cause compensated compactness or the Palais-Smale condition to fail also.) Because of this, it is a priori conceivable that a continuous function F need not attain its supremum or infimum.
Nevertheless, even though a sequence x^{(n)} does not have any subsequences that converge to a constant x, it may have a subsequence (which we also call x^{(n)}) which converges to some non-constant sequence y^{(n)} (in the sense that the distance d(x^{(n)}, y^{(n)}) between the subsequence and the new sequence goes to zero in this intermediate metric), where the approximating sequence y^{(n)} is of a very structured form (e.g. “concentrating” to a point, or “travelling” off to infinity, or a superposition of several concentrating or travelling profiles of this form). This weaker form of compactness, in which superpositions of a certain type of profile completely describe all the failures (or defects) of compactness, is known as concentration compactness, and the decomposition of the subsequence into such structured profiles is known as the profile decomposition. In many applications, it is a sufficiently good substitute for compactness that one can still do things like locate extremisers for functionals F – though one often has to make some additional assumptions on F to compensate for the more complicated nature of the compactness. This phenomenon was systematically studied by P.L. Lions in the 80s, and found great application in calculus of variations and nonlinear elliptic PDE. More recently, concentration compactness has been a crucial and powerful tool in the non-perturbative analysis of nonlinear dispersive PDE, in particular being used to locate “minimal energy blowup solutions” or “minimal mass blowup solutions” for such a PDE (analogously to how one can use calculus of variations to find minimal energy solutions to a nonlinear elliptic equation); see for instance this recent survey by Killip and Visan.
In typical applications, the concentration compactness phenomenon is exploited in moderately sophisticated function spaces (such as Sobolev spaces or Strichartz spaces), with the failure of traditional compactness being connected to a moderately complicated group G of symmetries (e.g. the group generated by translations and dilations). Because of this, concentration compactness can appear to be a rather complicated and technical concept when it is first encountered. In this note, I would like to illustrate concentration compactness in a simple toy setting, namely in the space ℓ^1(Z) of absolutely summable doubly infinite sequences, with the uniform (ℓ^∞) metric playing the role of the intermediate metric, and the translation group Z playing the role of the symmetry group G. This toy setting is significantly simpler than any model that one would actually use in practice [for instance, in most applications X is a Hilbert space], but hopefully it serves to illuminate this useful concept in a less technical fashion.
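The simplest failure of compactness in this toy setting can be computed by hand: the "travelling" profiles given by shifted delta sequences all have ℓ^1 norm 1, yet any two distinct ones sit at uniform distance 1 from each other, so no subsequence converges in the ℓ^∞ metric; translating them back to the origin, however, recovers a single fixed profile. A minimal sketch:

```python
# Travelling profiles in l^1(Z): the shifted deltas x^(n) = delta_n all have
# l^1 norm 1, and any two distinct ones are at uniform (l^infty) distance 1,
# so the sequence has no subsequence converging to a fixed element -- yet
# undoing the translation recovers the single profile delta_0.
def delta(shift: int) -> dict:
    return {shift: 1.0}                     # sparse model of a sequence on Z

def linf_dist(x: dict, y: dict) -> float:
    return max(abs(x.get(k, 0.0) - y.get(k, 0.0)) for k in set(x) | set(y))

def translate(x: dict, shift: int) -> dict:
    return {k + shift: v for k, v in x.items()}

print(linf_dist(delta(3), delta(7)))                  # 1.0: no uniform convergence
print(linf_dist(translate(delta(7), -7), delta(0)))   # 0.0: compactness restored modulo translation
```

This is exactly the role the symmetry group Z plays: compactness fails for the raw sequence, but holds once one is allowed to renormalise each term by a translation.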