$\begingroup$

Let $p_1, p_2 \sim U([0, 1]^n)$ with $n \in \mathbb{N}$ be two points sampled independently and uniformly at random from the $n$-dimensional unit hypercube.

How is the distance $d(p_1, p_2) = \sqrt{\sum_{i=1}^n { \left (p_1^{(i)} - p_2^{(i)} \right )}^2}$ distributed?

Similar questions

There are questions on math.SE which cover the average distance question for $n=1$ and $n =2$ (the one for $n=2$ also explains how to )

  • One dimension: $\frac{1}{3}$
  • Two dimensions: about $0.521\ldots$ (this makes me guess that the distribution is nothing "standard", since otherwise the average-distance question would be easy to answer. Can one make a statement like "the distribution of the distance of two points is 'almost' ..."?)
$\endgroup$
  • $\begingroup$ Well, as all $\left(p_1^{(i)}-p_2^{(i)}\right)^2$ are similarly distributed and independent, we are approaching the central limit theorem, which means the squared distance is distributed "almost" normally. $\endgroup$ Commented Oct 20, 2016 at 7:27
  • $\begingroup$ Cf. this MathWorld page. $\endgroup$ Commented Oct 20, 2016 at 7:28
  • $\begingroup$ Empirically, for large $n$ the distance appears to be close to normally distributed, with mean slightly below $0.41\sqrt{n}$ and standard deviation slightly below $0.25$. This is the so-called curse of dimensionality. $\endgroup$
    – Henry
    Commented Oct 20, 2016 at 7:55
  • $\begingroup$ Perhaps better to say the second moment of the distribution of the distance is $\dfrac{n}{6}$, the mean $\mu$ is $\sqrt{\dfrac{n}{6}-\sigma^2}$ and the variance $\sigma^2$ is not less than $\dfrac{1}{18}\approx 0.0555$ and not more than $\dfrac{1}{16}=0.0625$, tending to $\dfrac{7}{120}\approx 0.0583$ as $n$ increases. $\endgroup$
    – Henry
    Commented Oct 20, 2016 at 21:37

1 Answer

$\begingroup$

As Parcly Taxel pointed out, MathWorld has a page on Hypercube Line Picking, with many references.

MathWorld gives a table of the mean distance for hypercubes up to $n=8$ dimensions. For reasons I will explain below, a good approximation of the mean for large $n$ is $\sqrt{\dfrac{n}{6}-\dfrac{7}{120}}$, and it is not bad for small $n$ either:

n   MathWorld mean  sqrt(n/6-1/16)  sqrt(n/6-7/120) sqrt(n/6-1/18)
1   0.333333333       0.3227          0.3291          0.3333
2   0.521405433       0.5204          0.5244          0.5270
3   0.661707182       0.6614          0.6646          0.6667
4   0.777665654       0.7773          0.7800          0.7817
5   0.878530915       0.8780          0.8803          0.8819
6   0.968942083       0.9682          0.9704          0.9718
7   1.051583873       1.0508          1.0528          1.0541
8   1.128165340       1.1273          1.1292          1.1304
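The three approximation columns can be reproduced with a few lines of Python. This is a minimal sketch; the function name `approx_mean` and the `corrections` dictionary are illustrative and not part of the original answer.

```python
import math

def approx_mean(n, c):
    """Approximate mean distance sqrt(n/6 - c) in the n-dimensional unit hypercube."""
    return math.sqrt(n / 6 - c)

# Candidate values of the correction term c, matching the table columns.
corrections = {"1/16": 1 / 16, "7/120": 7 / 120, "1/18": 1 / 18}

print("n  " + "  ".join(f"{name:>7}" for name in corrections))
for n in range(1, 9):
    row = "  ".join(f"{approx_mean(n, c):7.4f}" for c in corrections.values())
    print(f"{n}  {row}")
```

The MathWorld column is not recomputed here, since the exact means come from closed forms or numerical integration listed on that page.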

For $n=1$, you have a triangular distribution for the distance with density $f_{d_1}(x)=2-2x$ for $0 \lt x \le 1$, giving a mean of $\frac13$, a variance of $\frac1{18}$ and a second moment of $\frac1{6}$. The square of the distance has density $f_{d_1^2}(x)=\frac{1}{\sqrt{x}}-1$ for $0 \lt x\le 1$, giving a mean of $\frac16$, a variance of $\frac7{180}$ and a second moment of $\frac1{15}$.
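For reference, these one-dimensional moments follow directly from the triangular density $2-2x$ on $(0,1]$, writing $d_1$ for the one-dimensional distance:

$$\begin{aligned}
E[d_1] &= \int_0^1 x\,(2-2x)\,dx = 1 - \tfrac23 = \tfrac13,\\
E[d_1^2] &= \int_0^1 x^2\,(2-2x)\,dx = \tfrac23 - \tfrac12 = \tfrac16,\\
\operatorname{Var}(d_1) &= \tfrac16 - \tfrac19 = \tfrac1{18},\\
E[d_1^4] &= \int_0^1 x^4\,(2-2x)\,dx = \tfrac25 - \tfrac13 = \tfrac1{15},\\
\operatorname{Var}(d_1^2) &= \tfrac1{15} - \tfrac1{36} = \tfrac7{180}.
\end{aligned}$$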

It gets more complicated for higher dimensions, but (as Ivan Neretin says) the Central Limit Theorem tells us that the square of the distance is almost normally distributed for large $n$, with mean $\frac{n}{6}$ and variance $\frac{7n}{180}$. So we can say $$\dfrac{D_n^2 - \frac{n}{6}}{\sqrt{\frac{7n}{180}}} \ \xrightarrow{d}\ N(0,1)$$

Less obviously, the distance itself is also almost normally distributed for large $n$. In general we can say that if $X_1, \ldots, X_n$ are i.i.d. random variables with finite non-zero mean $\mu$ and variance $\sigma^2$, and $\displaystyle Y=\sum_{i=1}^n X_i$ and $Z=\sqrt{|Y|}$, then $\displaystyle \dfrac{Z - \sqrt{n |\mu|-\tfrac{\sigma^2}{4|\mu|}}}{\sqrt{\tfrac{\sigma^2}{4|\mu|}}}\ \xrightarrow{d}\ N(0,1)$ as $n$ increases. In this particular case $\mu=\frac{1}{6}$ and $\sigma^2 = \frac7{180}$ as statistics of the $1$-dimensional square of distance, so we can say $$\dfrac{D_n - \sqrt{\frac{n}{6}-\frac7{120}}}{\sqrt{\frac{7}{120}}} \ \xrightarrow{d}\ N(0,1)$$ suggesting an approximate mean for the distance of $\sqrt{\frac{n}{6}-\frac7{120}}$ and approximate variance of $\frac7{120}$ when $n$ is large.
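One heuristic way to see where these constants come from, offered as a sketch rather than a proof: take $\mu \gt 0$ and $Y \ge 0$ (as in the distance case), write $Y = n\mu + \sigma\sqrt{n}\,W$ with $W$ approximately standard normal, and expand the square root to first order:

$$Z = \sqrt{Y} = \sqrt{n\mu}\,\sqrt{1 + \frac{\sigma W}{\mu\sqrt{n}}} \;\approx\; \sqrt{n\mu} + \frac{\sigma}{2\sqrt{\mu}}\,W, \qquad\text{so}\qquad \operatorname{Var}(Z) \approx \frac{\sigma^2}{4\mu}.$$

Since $E[Z^2] = E[Y] = n\mu$ exactly, the mean then satisfies

$$E[Z]^2 = E[Z^2] - \operatorname{Var}(Z) \approx n\mu - \frac{\sigma^2}{4\mu},$$

which is the centring term $\sqrt{n\mu - \tfrac{\sigma^2}{4\mu}}$ used above.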

The actual densities are not simple analytically, but the following graph uses numerical convolution and integration to illustrate the densities for the distance when $n=1$ to $16$ and also shows in red the normal approximation when $n=16$.

[Figure: densities of the distance for $n=1$ to $16$, with the normal approximation for $n=16$ shown in red]

For large $n$ the variance of the distance stays close to $\frac{7}{120}$ making the standard deviation about $0.24$. For example, with an $n=2500$ dimensional unit hypercube, the distance can be anything from $0$ to $50$ but in the large majority of cases it will be between $20$ and $21$ and in all but a vanishingly tiny proportion of cases it will be between $19$ and $22$. In data analysis, this curse of dimensionality means there can be relatively little difference in the distances between different pairs of random samples.
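A quick Monte Carlo check of the approximate mean and variance can be sketched as follows. This is only an illustration under stated assumptions: the helper names `sample_distance` and `simulate`, the dimension $n=400$, the number of trials and the seed are all arbitrary choices made here, not taken from the answer, and the code uses only the Python standard library.

```python
import math
import random
import statistics

def sample_distance(n, rng):
    """Euclidean distance between two independent uniform points in [0, 1]^n."""
    return math.sqrt(sum((rng.random() - rng.random()) ** 2 for _ in range(n)))

def simulate(n, trials, seed=0):
    """Return the sample mean and sample variance of the distance over `trials` draws."""
    rng = random.Random(seed)
    d = [sample_distance(n, rng) for _ in range(trials)]
    return statistics.fmean(d), statistics.variance(d)

n, trials = 400, 2000  # illustrative values, kept small so pure Python runs quickly
mean, var = simulate(n, trials)
print(f"simulated mean     {mean:.4f}   approximation {math.sqrt(n / 6 - 7 / 120):.4f}")
print(f"simulated variance {var:.4f}   approximation {7 / 120:.4f}")
```

Raising `n` should leave the simulated variance close to $\frac{7}{120}$ while the mean grows like $\sqrt{n/6}$, which is the concentration effect described above.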

$\endgroup$
