We learned about the birthday problem paradox: If people are in a room with randomly distributed birthdays, very few people are needed for at least two people to have the same birthday.
As I understand, the formula just uses logic.
Let $Q(n)$ be the probability that $n$ people all have different birthdays ($n+1$ is used as a correction factor to avoid double counting)
$$ Q(n) = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \ldots \times \frac{365-n+1}{365} $$
And since $P(n) = 1 - Q(n)$, the probability that at least two people have the same birthday is:
$$ P(n) = 1 - Q(n) = 1 - \left( \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \ldots \times \frac{365-n+1}{365} \right) $$
Here is a simulation that shows the minimum number of people with random birthdays required to have at least one common birthday:
library(ggplot2)
num_simulations <- 1000
num_days <- 365
results <- numeric(num_simulations)
for (i in 1:num_simulations) {
birthdays <- integer(0)
while (TRUE) {
new_birthday <- sample(1:num_days, 1)
if (new_birthday %in% birthdays) {
results[i] <- length(birthdays) + 1
break
} else {
birthdays <- c(birthdays, new_birthday)
}
}
}
mean_trials <- mean(results)
ggplot(data.frame(Trials=results), aes(Trials)) +
geom_histogram(binwidth=1, color="black", fill="lightblue") +
ggtitle(paste("Distribution of Trials Until Match (Mean =", round(mean_trials, 2), ")")) +
xlab("Number of Trials Until Match") +
ylab("Frequency") +
theme_minimal()
And here is a simulation which compares simulated results to the theoretical results over different numbers of people (we can see that the simulations are identical to the theoretical number):
birthday_probability <- function(n) {
q <- 1
for (i in 1:n) {
q <- q * (365 - i + 1) / 365
}
return(1 - q)
}
simulate_and_percentage <- function(n, num_simulations) {
num_days <- 365
repeat_count <- 0
for (i in 1:num_simulations) {
birthdays <- sample(1:num_days, n, replace = TRUE)
if (length(birthdays) != length(unique(birthdays))) {
repeat_count <- repeat_count + 1
}
}
return(repeat_count / num_simulations)
}
n_values <- 2:100
num_simulations <- 1000
theoretical_probabilities <- sapply(n_values, birthday_probability)
simulation_percentages <- sapply(n_values, simulate_and_percentage, num_simulations = num_simulations)
data <- data.frame(
n_values = rep(n_values, 2),
probabilities = c(theoretical_probabilities, simulation_percentages),
type = rep(c("Theoretical", "Simulation"), each = length(n_values))
)
ggplot(data, aes(x = n_values, y = probabilities, color = type)) +
geom_line() +
ggtitle("Comparison of Theoretical vs Simulation Probabilities") +
xlab("Number of People (n)") +
ylab("Probability") +
ylim(0, 1) + # Ensure y-axis is limited between 0 and 1
theme_minimal()
I am wondering if its possible to slightly modify the objective for this problem. For example, if there are $n$ people:
- What is the probability of at least $k$ groups of people having the same the birthdays and how many people are likely to be in these groups?
For example: Suppose there are $n = 12$ people with the following birthdays:
- 1 person with Jan 4th
- 2 people with Feb 6th
- 2 people with March 8th
- 3 people with April 2nd
- 1 person with May 7th
- 1 person with June 15th
- 2 people with July 1st
Thus we have:
- 3 groups of size = 1
- 3 groups of size = 2
- 1 group of size = 3
What is the probability of observing this specific sequence? Or as another example, what is the probability of observing 1 group of size=9, 1 group of size=2 and 1 group of size =1?
Can we change the birthday problem formula for this?