
I have a list (let's call it $ \{L_N\} $) of $N$ random numbers $R\in(0,1)$, drawn from a uniform distribution. Next, I draw another random number $b$ from the same distribution, find the element of $ \{L_N\} $ closest to $b$, and record that distance.

Repeating this process many times, I can plot the distribution of the resulting distances.

When $N\to \infty$, what does this distribution approach?

When I simulate this in Mathematica, the distribution appears to approach an exponential shape. And if the list were only 1 element long, I believe the distance would exactly follow an exponential distribution.

Looking at the Wikipedia article on the exponential distribution, I can see that there is some discussion on the topic:

[Image: excerpt from the Wikipedia article on the exponential distribution, discussing the minimum of independent exponential random variables]

But I'm having trouble interpreting what they are saying. What is $k$ here? Is my case what they are describing, in the limit where $n\to \infty$?

EDIT: After a very helpful, intuitive answer by Bayequentist, I understand now that the behavior as $N \to \infty$ should approach a Dirac delta function. But I'd still like to understand why my data (which is like the minimum of a bunch of exponential distributions) appears to also be exponential. And is there a way that I can figure out what this distribution is exactly (for large but finite $N$)?

Here is a picture of what such a distribution looks like for large but finite $N$:

[Histogram of the simulated signed distances: a sharp, exponential-looking peak at 0]

EDIT2: Here's some Python code to simulate these distributions:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

numpoints = 10000
NBINS = 1000

# the list L_N and the query points b, all uniform on (0, 1)
randarray1 = np.random.random_sample((numpoints,))
randarray2 = np.random.random_sample((numpoints,))

dtbin = []

# for each query point, keep the signed difference of smallest magnitude
for i in range(len(randarray1)):
    dt = 10000000
    for j in range(len(randarray2)):
        delta = randarray1[i] - randarray2[j]
        if abs(delta) < abs(dt):
            dt = delta
    dtbin.append(dt)

plt.figure()
plt.hist(dtbin, bins=NBINS)
plt.show()
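
The nested loop above is quadratic in numpoints. For the larger $N$ values suggested in the comments, here is a vectorized sketch of the same computation (assuming NumPy; nearest_signed_distances is just an illustrative name, not a library function):

import numpy as np

def nearest_signed_distances(L, b):
    # signed distance from each query point in b to its nearest element of L
    L = np.sort(L)
    idx = np.searchsorted(L, b)           # first element of L that is >= b
    lo = np.clip(idx - 1, 0, len(L) - 1)  # candidate neighbour below
    hi = np.clip(idx, 0, len(L) - 1)      # candidate neighbour above
    d_lo = b - L[lo]
    d_hi = b - L[hi]
    return np.where(np.abs(d_lo) < np.abs(d_hi), d_lo, d_hi)

dtbin = nearest_signed_distances(np.random.random_sample(1_000_000),
                                 np.random.random_sample(10_000))

Sorting plus binary search makes each query logarithmic in $N$, so even $N = 10^6$ is fast.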
  • The exponential may be a reasonable approximation, but your minimal distance will certainly not be truly exponentially distributed: the exponential has unbounded support, but your distance is bounded between 0 and 1. (Commented Dec 15, 2019 at 6:23)
  • I'm pretty confused right now :/ Can you also share the code that you used to generate this picture? (Commented Dec 15, 2019 at 6:33)
  • Okay, I added some Python code that can produce those images. (Commented Dec 15, 2019 at 7:17)
  • That looks to me like it's going to converge to a Dirac delta. Have you tried bigger sample sizes (100k, 1m, 10m...)? Make sure the range of the x-axis is consistent. The only difference between my code and yours is that you allow the distances to be negative. If you take the absolute value of the distances, your plot will look like mine. (Commented Dec 15, 2019 at 7:49)
  • Also, the number of $b$'s (let's call it nSim) doesn't have to be as large as $N$. If you also make nSim = N = 1m, your code will take forever to run. You can try fixing nSim = 10k and observing how the distribution changes shape as $N$ goes from 10k to 100k. (Commented Dec 15, 2019 at 8:03)

5 Answers

Answer 1 (score 15)

If you had been looking for the distance to the next value above, and if you inserted an extra value at $1$ so this always had an answer, then using rotational symmetry the distribution of these distances $D$ would be the same as the distribution of the minimum of $n+1$ independent uniform random variables on $[0,1]$.

That would have $P(D \le d) = 1-(1-d)^{n+1}$ and so density $f(d)=(n+1)(1-d)^n$ when $0 \le d \le 1$. For large $n$ and small $d$ this density can be approximated by $f(d) \approx n e^{-nd}$ (since $(1-d)^n = e^{n\ln(1-d)} \approx e^{-nd}$), explaining the exponential shape you have spotted.

But your question is slightly more complicated, as you are interested in the signed distance to the nearest value above or below. As your Wikipedia link shows, the minimum of two i.i.d. exponential random variables with rate $\lambda$ is an exponential random variable with rate $2\lambda$. So you need to change the approximation to the density to reflect both the doubled rate and the possibility of negative values of $d$. The approximation then becomes a Laplace distribution with $$f(d) \approx n e^{-2n|d|},$$ remembering this holds for large $n$ and small $|d|$ (in particular, the true density is $0$ unless $-\frac12 \lt d \lt \frac12$). As $n$ increases, this concentrates almost all the density at $0$, consistent with Bayequentist's answer that the limit is a Dirac delta distribution.

With $n=10^6$ the approximation to the density would look like this, matching the shape of your simulated data.

[Plot: the Laplace approximation $f(d) = n e^{-2n|d|}$ for $n=10^6$, matching the simulated histogram]
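
This approximation is easy to check numerically; a minimal sketch assuming NumPy and Matplotlib:

import numpy as np
import matplotlib.pyplot as plt

n, n_sim = 10**6, 10**5
L = np.sort(np.random.random_sample(n))
b = np.random.random_sample(n_sim)

# signed distance from each b to the nearest element of L
idx = np.searchsorted(L, b)
lo, hi = np.clip(idx - 1, 0, n - 1), np.clip(idx, 0, n - 1)
d = np.where(np.abs(b - L[lo]) < np.abs(b - L[hi]), b - L[lo], b - L[hi])

x = np.linspace(d.min(), d.max(), 400)
plt.hist(d, bins=200, density=True)          # simulated signed distances
plt.plot(x, n * np.exp(-2 * n * np.abs(x)))  # f(d) = n exp(-2n|d|)
plt.show()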

  • So if you multiply the difference by $n$ (e.g. rescale by the appropriate amount so the expectation stays bounded away from 0 and infinity), then it does converge to a two-sided exponential distribution. – shalop, Dec 16, 2019 at 14:00
Answer 2 (score 8)

When $N \to \infty$, $L_N$ contains all real numbers in $(0,1)$. Thus, the distance from any number in $(0,1)$ to the closest number in $L_N$ will approach 0 as $N \to \infty$. The distribution of distances approaches the Dirac delta distribution as $N \to \infty$.

Here are some simulations: [histograms of the minimum distance, concentrating near 0 as $N$ grows]

Here's a code snippet:

n <- 100000
Ln <- runif(n)  # the list L_N: n uniform draws on (0,1)

nSim <- 10000
distances <- rep(0, nSim)
for (i in 1:nSim){
  b <- runif(1)                     # a fresh query point b
  distances[i] <- min(abs(Ln - b))  # distance to the nearest element of L_N
}
hist(distances, main="N=100000")
  • Thanks for the intuitive answer! I think this answers my question about what happens when $N \to \infty$, but I'm still hoping to have an understanding of why my data looks exponential. I'll upload an edit with a picture so you can see it. (Commented Dec 15, 2019 at 6:16)
  • Also, if my edit is a bit too much of a "moving target" for you, then I can rewrite that particular part as a separate question. I'll happily accept your answer if that's the case. (Commented Dec 15, 2019 at 6:21)
  • "When $N\to\infty$, $L_N$ contains all real numbers in $(0,1)$." That seems doubtful to me, since the set $\{L_1,L_2,L_3,\dots\}$ is countable (we get $N$ as "countable infinity"), while the set of all real numbers in $(0,1)$ is uncountable. So maybe it's better to say it contains a number arbitrarily close to any real number in $(0, 1)$ (i.e. a dense subset of $(0, 1)$)? – trolley813, Dec 16, 2019 at 7:44
  • That is correct: a subset of a countable set cannot be uncountable. This answer was never meant to be rigorous anyway; I was just trying to provide a useful and easy-to-understand intuition. Henry's answer is much more complete and rigorous than mine. (Commented Dec 16, 2019 at 8:07)
  • This was my thought when I first read the question: the answer is obviously 0. The average distance obviously gets smaller with each added element, and there's no obvious way it can reasonably approach anything but 0. – Cruncher, Dec 16, 2019 at 18:28
Answer 3 (score 1)

"Is there a way that I can figure out what this distribution is exactly (for large but finite N)?"

The difference of two standard Uniform random variables is Triangular(-1,0,1) with pdf $1-|x|$ defined on $(-1,1)$.

Distance is the absolute value of the difference, which has pdf, say, $f(x)$:

$$f(x) = 2(1-x), \qquad 0 < x < 1$$

Repeating the exercise $n$ times and taking the minimum distance is equivalent to finding the minimum ($1^{\text{st}}$) order statistic with respect to the parent pdf $f(x)$, which is given by:

$$g(x) = 2n(1-x)^{2n-1}, \qquad 0 < x < 1$$

where I am using the OrderStat function from the mathStatica package for Mathematica to automate the nitty-gritty details, and where the domain of support is $(0,1)$. The solution is a power-function distribution in $1-x$: with $a = 2n$, the pdf has the form $g(x) = a(1-x)^{a-1}$.

The following diagram compares a plot of the exact pdf of the minimum distance just derived, $g(x)$ (red dashed curve), to a Monte Carlo simulation (squiggly blue curve), when the sample size is $n=10$:

[Plot: the derived pdf $g(x)$ (red dashed) overlaid on a simulated histogram (blue), $n=10$]

Simulation: As you are using Mathematica, here is the code I am using for the data simulation (each of the 20000 trials draws one $b$ and a fresh list of 10 points):

  data = Table[Min[Abs[RandomReal[{0, 1}, 10] - RandomReal[]]], 20000];
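
For readers without mathStatica, a rough Python equivalent (a sketch assuming NumPy and Matplotlib) that overlays the derived $g(x)$ on the simulated minimum distances:

import numpy as np
import matplotlib.pyplot as plt

n, trials = 10, 20000
# each row of the simulation: one b against n fresh uniform points
d = np.min(np.abs(np.random.rand(trials, n) - np.random.rand(trials, 1)), axis=1)

x = np.linspace(0, 1, 400)
plt.hist(d, bins=100, density=True)          # Monte Carlo histogram
plt.plot(x, 2 * n * (1 - x) ** (2 * n - 1))  # g(x) = 2n(1-x)^(2n-1)
plt.show()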
Answer 4 (score 1)

For you to get a result larger than $d$, every number in your sample has to be more than $d$ away from $b$. The probability of that for any individual $x_0$ is just the probability mass outside the range $b \pm d$; call it $p_{\text{outside}}$. The probability that this happens for all $x_i$ in your sample is $(p_{\text{outside}})^N$. If the $x_i$ are chosen uniformly from the unit interval, then for $b$ more than $d$ from the boundary, $p_{\text{outside}} = 1-2d$, which gives $p_{\text{outside}}^N = (1-2d)^N$. For large $N$ and small $d$, this can be approximated by $e^{-2Nd}$, since $\ln(1-2d) \approx -2d$.
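
A quick numerical sanity check of that last approximation (a sketch assuming NumPy):

import numpy as np

N, d = 10**6, 10**-6
print((1 - 2 * d) ** N)    # exact (1-2d)^N for interior b
print(np.exp(-2 * N * d))  # the approximation; both print about 0.1353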

Answer 5 (score 0)

Imagine you draw the final number first and denote it $X$. This does not change the problem formulation at all. For each $X_i \in L_N$, $i=1,\dots,N$, we know that $Y_i := |X-X_i|$ has some distribution (you may or may not want to compute this) and that the $Y_i$ are i.i.d. given $X$. From Wikipedia, we know that the CDF of their minimum is $$F_{\min}(y) = 1 - [1-F_Y(y)]^N.$$

For any fixed $y$, we know $F_Y(y) > 0$ for any $y > 0$ and $F_Y(y) = 0$ otherwise. Take $N \to \infty$ and you get a CDF which is identically one for $y > 0$ and identically zero otherwise. This is a delta function centered at zero, as all the simulations above show. This holds for any $X \in (0,1)$, so the convergence always holds (albeit with varying convergence rates, perhaps).
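
Spelled out: for each fixed $y > 0$ we have $0 \le 1-F_Y(y) < 1$, so

$$\lim_{N\to\infty} F_{\min}(y) = 1 - \lim_{N\to\infty}\bigl[1-F_Y(y)\bigr]^N = 1 - 0 = 1.$$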

