2
$\begingroup$

While calculating the probability that the same sequence of 50 words occurs twice in a file that contains 400 million random words (each word can have 65536 different states, chosen at random)...

First, I assumed that the 50-word sequence is a permutation with replacements thus it collectively has $65536^{50}$ states.

Next, I assumed that the file can contain (400000000 - 50) such sequences (not $\frac{400000000}{50}$ because the sequence can start at any word of the file).

Finally, I used the birthday formula to calculate the probability that the file does NOT contain 2 identical sequences: $$ \frac{65536^{50}!}{65536^{50*400000000}(65536^{50}-400000000)!} $$

However, one helpful comment in a related question suggested that this formula might not be applicable because these sequences can overlap and therefore are dependent on each other and not truly randomly distributed.

QUESTION: How to account for the overlapping sequences ?

$\endgroup$
2
  • 1
    $\begingroup$ The overlapping issue here will make a minimal difference to a number which is already so excessively small that there is little value in making the adjustment. $\endgroup$
    – Henry
    Commented Jun 24 at 23:59
  • $\begingroup$ @Henry: That is nice to know, but how to make the argument that the overlapping issue makes a minimal difference here ? $\endgroup$ Commented Jun 25 at 5:12

1 Answer 1

2
$\begingroup$

To make this more clear, I will generalize to sequences of $N$ words drawn from an bank with $A$ words, and you want to find the probability there exists a repeated subsequence of length $L$. In your case, $A=65536$, $L=50$, and $N=400\,000\,000$.

I cannot exactly compute the probability of two repeated sequences of length $L$ existing. However, I can prove the probability is at most $10^{-200}$. This is small enough such that, if you repeated this experiment every second until the sun died out, you could still be confident that none of the experiments had two repeated subsequences of length $L$.

For each $i,j\in \{1,2,\dots,N-L+1\}$, let $E_{i,j}$ be the event that the $i^\text{th}$ subsequence of length $L$ is equal to the $j^\text{th}$ subsequence of length $L$. If call the sequence of random words $W_1,W_2,\dots,W_N$, then $E_{i,j}$ is the event that $$ (W_i,W_{i+1},\dots,W_{i+L-1})=(W_{j},W_{j+1},\dots,W_{j+L-1}) $$

Claim: For all $i,j\in \{1,\dots,N-L+1\}$, with $i<j$, $P(E_{i,j})= A^{-L}$.

Note: This claim applies to overlapping sequences, as well as non-overlapping ones.

Proof: In order for $E_{i,j}$ to occur, we need $W_{i+k}=W_{j+k}$ for each $k\in \{0,1,\dots,L-1\}$.

  • The probability that $W_j=W_i$ is $1/A$. This is because, no matter word $W_i$ is, the probability that $W_j$ equals that same word is $1/A$.

  • Conditional on $\{W_j=W_i\}$ occurring, the probability that $W_{j+1}=W_{i+1}$ is also $1/A$. This follows by the same logic, together with the fact that $W_{j+1}$ is independent of $W_i$ and $W_j$, so conditioning on $\{W_j=W_i\}$ does not affect the probability distribution of $W_{j+1}$.

  • Continuing this argument, for all $k\in \{1,\dots,L-1\}$, the probability that $W_{j+k}=W_{i+k}$, conditional on $W_{j+m}=W_{i+m}$ for all $m\in \{1,\dots,k-1\}$, is equal to $1/A$.

Finally, to compute the probability that $\{W_{i+k}=W_{j+k}\}$ occurs for each $k$, we multiply together all of these conditional probabilities, resulting in $A^{-L}$.$\tag*{$\blacksquare$}$

To conclude, there are $\binom{N-L+1}2$ events of the form $E_{i,j}$, and each occurs with probability $A^{-L}$. To see two repeated words, we need at least one of the $E_{i,j}$ to occur. By the union bound, the probability of some $E_{i,j}$ occurring bounded above as follows: $$ P(\text{two repeated words of length $L$})\le \binom {N-L+1}2 A^{-L} $$ Plugging back in your specific numbers (see WA computation), you get the probability of this collision is at most $10^{-200}$.

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .