2
$\begingroup$

I have a dataset of 1296 unique codes which can be numbered 1 through 1296. If numbers are selected at random, one at a time, with replacement. On average, how many iterations will it take to select a number that has already been selected?

Experimentally, (looping through the list of 1296 codes and creating a subset of selected codes using Python) it averages out at 45.875 times (this number includes the duplicate) but I would like to verify it with a calculation so any help would be appreciated.

This question has some similarities but I am unable to perform a calculation based on the answer:

Question with similarities

$\endgroup$
4
  • $\begingroup$ This is an example of the generalized birthday problem where you have $1296$ "days" instead of $365$. The number that gives a $50\%$ chance of a match in $d$ "days" is about $\sqrt {2d\ln 2}$, which for you is about $42.39$ $\endgroup$ Commented Aug 19, 2017 at 1:15
  • 1
    $\begingroup$ @RossMillikan instead of asking for where it switches from being below a $50\%$ chance to being above a $50\%$ chance, isn't the OP asking for the expected number of draws until a match occurs? Will those two necessarily be the same? I'm getting a different result (fixed minor typo in equation) getting a result closer to $\approx 45.788$ $\endgroup$
    – JMoravitz
    Commented Aug 19, 2017 at 1:20
  • $\begingroup$ @JMoravitz: I'll reopen it. I don't know an easy approach to the expected number version, but would expect it to be close to the $50\%$ probability number. No, they won't be the same because there is the long tail to high numbers. $\endgroup$ Commented Aug 19, 2017 at 2:10
  • $\begingroup$ @RossMillikan well, the pdf is straightforward (and is included in one of your two links in some form) and we can apply the definition of expected value. I don't know of a clean way to simplify the sum by hand, but computers can calculate it easily enough for us. $\endgroup$
    – JMoravitz
    Commented Aug 19, 2017 at 2:12

2 Answers 2

3
$\begingroup$

It is impossible to have gotten a duplicate on the first draw. It is impossible to have not gotten a duplicate by the 1297'th draw by pigeon-hole principle.

To have gotten your first duplicate on the $k$'th draw, you need the first $k-1$ draws to all be distinct and the $k$'th to be a duplicate.

The first draw will always be distinct. The second will be distinct from the first with probability $\frac{1295}{1296}$. The third will be distinct from the first two with probability $\frac{1294}{1296}$ and so on... the $(n)$'th will be distinct from the earlier $n-1$ with probability $\frac{1296-n+1}{1296}$. Multiplying these, we get for $n$ draws to all be distinct, this will occur with probability $\frac{1296\frac{n}{~}}{1296^n}$ where $x\frac{n}{~}$ represents a falling factorial $x\frac{n}{~}=\underbrace{x(x-1)(x-2)\cdots (x-n+1)}_{n~\text{terms in the product}}=\frac{x!}{(x-n)!}$.

Next, supposing $k-1$ distinct values have all been taken, for the $k$'th to duplicate one of the earlier results, this will occur with probability $\frac{k-1}{1296}$

We have then the probability distribution function for $X$, the number of draws until the first duplicate:

$$Pr(X=k)=\frac{(k-1)1296\frac{k-1}{~}}{1296^k}$$

Applying the definition of expected value for a pdf: $E[X]=\sum\limits_{k\in\Delta} kPr(X=k)$ we have then the expected value is

$$\sum\limits_{k=2}^{1297}\frac{k(k-1)1296\frac{k-1}{~}}{1296^k}\approx 45.7889$$

wolfram link: http://www.wolframalpha.com/input/?i=sum+from+n%3D2+to+1297+of+n(n-1)(1296!%2F(1296-n%2B1)!)%2F1296%5En

$\endgroup$
4
  • $\begingroup$ As a minor aside, I was having difficulty with the link. Trying to write it as [linkname](actuallinkgoeshere) with parenthesis appearing in the link part of it was being chopped off, and replacing parenthesis with brackets was causing the calculation to time out. Tinyurl's aren't permenant... does anyone have a suggestion other than just hiding it with a spoiler tag like I have? $\endgroup$
    – JMoravitz
    Commented Aug 19, 2017 at 2:35
  • $\begingroup$ Why is the wolfram formula different from the expected value? $\endgroup$ Commented Sep 12, 2023 at 10:52
  • $\begingroup$ @OlivierLalonde This is a six year old post... I do not remember it very well at this point. Glancing over everything, it all appears correct. What do you mean that the formula in wolfram is different than expected value? Are you perhaps confused as to why I wrote $\frac{1296!}{(1296-n+1)!}$ instead of $1296\frac{n-1}{}$? Read the line where I introduce the $x\frac{n}{~}$ notation again. $\endgroup$
    – JMoravitz
    Commented Sep 12, 2023 at 12:18
  • $\begingroup$ Oops, missed that. Thanks. $\endgroup$ Commented Sep 14, 2023 at 17:27
2
$\begingroup$

With $n$ objects the expected time until the first repeat is exactly $$\mathbb{E}(T)=\int_0^\infty \left(1+{x\over n}\right)^ne^{-x}\,dx,$$ and approximately equal to $\sqrt{n\pi/2}.$

You can find a derivation of this formula at my answer here: Variance of time to find first duplicate

For $n=1296$ the exact formula gives $\mathbb{E}(T)\approx 45.78885405,$ while the approximation gives $\sqrt{1296\pi/2}\approx 45.11930893.$

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .