3
$\begingroup$

Given $m$ people and an $n$ possible "days of the year", what is the expected number of days which 2 or more people share as a birthday (if the distribution of birthdays is iid uniform over the possible days) ? (Alternative formulation: We use $n$ distinct colors to paint $m$ objects. Each object gets colored with a uniformly iid randomly chosen color. What is the number of colors that color at least two objects?)

Written more precisely, let $B_i$ be independent uniform random on $S = \{ 1, ..., n \}$ for $i = 1, ..., m$. What is the expected cardinality of the set of $d \in S$ where there exist $i,j$ not equal with $B_i = B_j = d$?

This sounds similar to birthday problem - expected number of collisions which asks for the number of collisions, but that question was about the number of people who share birthdays, not the number of days shared. The answer to that question is in the range $[0,m]$ while the answer to my question is in the range $[0,n]$.

Ultimately this is actually a question about hash collisions and prefix collisions in a prefix tree, but I deemed the birthday formulation the most familiar and most likely to be helpful to others searching for it in the future.

$\endgroup$
6
  • 1
    $\begingroup$ math.stackexchange.com/a/1366028/6460 may help. In terms of your variables it is $n - n \left(1-\frac1n\right)^m - m \left(1-\frac1n\right)^{m-1}$ $\endgroup$
    – Henry
    Commented Oct 4, 2021 at 13:10
  • $\begingroup$ @Henry: AFAICT those are both about counting people not counting days. $\endgroup$ Commented Oct 4, 2021 at 13:15
  • $\begingroup$ Look at the third point in my answer $\endgroup$
    – Henry
    Commented Oct 4, 2021 at 13:17
  • $\begingroup$ @Henry: Oh, I see the number of days embedded in that answer as an aside. Thanks. I'm not entirely sure it's right though. I tried working it out last night and came up with something similar but it was clearly wrong for trivially low values of $n$ and $m$. $\endgroup$ Commented Oct 4, 2021 at 13:17
  • $\begingroup$ It you have a small example where it is wrong then I would interested. It seems to be correct for 2 "days" and 3 "people" $\endgroup$
    – Henry
    Commented Oct 4, 2021 at 13:20

2 Answers 2

5
$\begingroup$

I gave the answer in a previous question without calculation. Here is the justification

  • The probability a day has no people is $\left(1-\frac1n\right)^m$ so the expected number of days with no people is $n\left(1-\frac1n\right)^m$
  • The probability a day has one person is ${m \choose 1}\frac1n\left(1-\frac1n\right)^{m-1}$ so the expected number of days with one person is $m\left(1-\frac1n\right)^{m-1}$
  • So the expected number of days with two or more people is $$n - n \left(1-\frac1n\right)^m - m \left(1-\frac1n\right)^{m-1}$$
$\endgroup$
2
  • $\begingroup$ Thanks! The expected number of days with exactly one person (the second bullet point) is actually interesting for my purposes, and has a critical point (for $m$) between $n-1$ and $n$, which is somewhat surprising (IOW you go from expecting more unique birthdays to fewer exactly when number of people crosses the number of days). $\endgroup$ Commented Oct 4, 2021 at 13:47
  • $\begingroup$ And thanks again. While the expression was unwieldy to do some of the things I wanted, it made it easy to look at numerical behavior without resorting to running simulations, and to make predictions that accurately predicted that what I wanted to try would work. (In particular, that prefix trees on 8-bit alphabet are a bad idea but that dropping to a 4-bit alphabet makes a huge difference.) $\endgroup$ Commented Oct 4, 2021 at 22:31
2
$\begingroup$

For each $d\in[n]\overset{\text{Def.}}=\{1,2,\dots, n\}$, consider the random variable $$X_d\overset{\text{Def.}}=\lvert\{i\in[m]: B_i = n\}\rvert.$$ $X_d$ is the number of people that have their birthday on day $d$.

Then you are looking for the expected value of the random variable $$C=\lvert\{d\in[n]: X_d\ge 2\}\rvert,$$ i.e. the expected value of the number of days on which two or more people have their birthday. I have named the random variable "$C$" for "collisions".

Now note that, using Iverson brackets, $$C=\lvert\{d\in[n]: X_d\ge 2\}\rvert = [X_1\ge 2]+[X_2\ge 2]+\dots+[X_n\ge 2].$$

In particular, $$\mathsf E(C) = \mathsf E([X_1\ge 2])+\dots+\mathsf E([X_n\ge 2]),$$ which, since the $X_d$ have the same distribution, namely $\text{Binomial}(\text{amount}=m,p=1/n)$ (exercise), is equal to $$n\mathsf P(X_1\ge 2) = n (1-\mathsf P(X_1=0)-\mathsf P(X_1=1))= n\left(1-(1-1/n)^m - \frac mn (1-1/n)^{m-1}\right).$$

Note. The $X_d$ are not independent, but looking further into them may reveal something about the distribution of $C$.

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .