1
$\begingroup$

In the article https://projecteuclid.org/download/pdf_1/euclid.aoms/1177704564 they describe procedure how to sample batch of size $n$ with nonuniform probabilities $p_i$, where probability of choosing $i$-th element is $np_i$ (assumption $np_i\leq 1$)(in the section (2.1)). I would be interested, how can you prove it because they don't provide any proof and also I am interested in probability of picking $i,j(i\ne j)$ in the same batch.

They procedure is following. Randomly order our given sequence, then label them $1,2,3,...,N$ and construct $\Pi_k = \sum_{i=1}^k np_i$ and $\Pi_0=0$, then pick $d$ uniformly from $[0,1)$ and your sample are elements for which inequality $$ \Pi_{j-1} \leq d+k < \Pi_{j} $$ holds for some $k = 1,2,...,n$.

$\endgroup$
2
  • $\begingroup$ There's a link to a 25-page paper on the URL. Are you saying that the paper provides no proof? $\endgroup$
    – saulspatz
    Commented Mar 23, 2018 at 12:15
  • $\begingroup$ This sampling is just part of subsection (2.1), but they don't provide any proof $\endgroup$
    – Ethan
    Commented Mar 23, 2018 at 12:43

1 Answer 1

2
$\begingroup$

Think of these as sticks of length $np_i$; order the sticks randomly and lay them down end-to-end to make a combined length $\sum_i np_i=n$; then choose a uniform random number $d \in [0,1)$ and choose the sticks which cover the points at distances $d, d+1, d+2, \ldots , d+n-1$ from the start; all these points have fractional part $d$

Let $f(x)$ be the fractional part of $x$, i.e. the distance to the nearest integer less than or equal to $x$

Suppose the start of stick $i$ when laid out with the others is at distance $a$ and its end is at distance $b$ from the start. Then

  • if there is no integer $m$ with $a \lt m \le b$ then the probability stick $i$ covers one of the points is $\mathbb P(f(a) \le d < f(b)) = f(b)-f(a)=b-a$

  • if there is an integer $m$ with $a \lt m \le b$ then the probability stick $i$ covers one of the points is $\mathbb P(0 \le d < f(b)) + \mathbb P(f(a) \le d < 1) =f(b) + 1-f(a)= (b-m)+(m-a)=b-a$

so in either case the probability stick $i$ covers one of these points and so is selected is $b-a$. But this is its length, which is $np_i$

$\endgroup$
2
  • $\begingroup$ I had similar intuition, but without proof. Thank you. Do you know how to extend this analysis for picking two at once, I mean $P(i \in S, j \in S)$, where $S$ is our sample with size $n$? $\endgroup$
    – Ethan
    Commented Mar 23, 2018 at 13:54
  • $\begingroup$ @Ethan: I doubt there is a simple expression using this algorithm $\endgroup$
    – Henry
    Commented Mar 23, 2018 at 14:46

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .