$\begingroup$

I am trying to wrap my head around this problem.
A die is rolled 100 times. What is the probability of no face being appearing more than 20 times? My first thought was using the Binomial distribution P(x) = 1 - 6 cmf(100, 1/6, 20) but this is obviously wrong since we count some cases more than once. My second idea is to enumerate all the possible rolls x1 +x2+x3+x4+x5+x6 = 100, such that xi <= 20 and sum the multinomials but this seems too computationally intensive. Approximate solutions will also work for me.

$\endgroup$

3 Answers

$\begingroup$

This is a generalization of the famous Birthday Problem: given $n=100$ individuals who have random, uniformly distributed "birthdays" among a set of $d=6$ possibilities, what is the chance that no birthday is shared by more than $m=20$ individuals?

An exact calculation yields the answer $0.267\,747\,907\,805\,267$ (to double precision). I will sketch the theory and provide the code for general $n, m, d.$ The code's asymptotic timing is $O(n^2\log(d))$ which makes it suitable for very large numbers of birthdays $d$ and provides reasonable performance until $n$ is in the thousands. At that point, the Poisson approximation discussed at Extending the birthday paradox to more than 2 people ought to work well in most cases.


Explanation of the solution

The probability generating function (pgf) for the outcomes of $n$ independent rolls of a $d$-sided die is

$$d^{-n}f_n(x_1,x_2,\ldots,x_d) = d^{-n}(x_1+x_2+ \cdots + x_d)^n.$$

The coefficient of $x_1^{e_1}x_2^{e_2}\cdots x_d^{e_d}$ in the expansion of this multinomial gives the number of ways in which face $i$ can appear exactly $e_i$ times, $i=1, 2, \ldots, d.$
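For a small case, this coefficient interpretation can be checked numerically (the exponents $e = (2, 2, 1)$ below are an arbitrary illustrative choice, not from the problem at hand):

```r
# Check that the multinomial coefficient of x1^e1 * x2^e2 * x3^e3,
# divided by d^n, equals the probability of exactly those face counts.
n <- 5; d <- 3
e <- c(2, 2, 1)                            # e1 + e2 + e3 = n
coef <- factorial(n) / prod(factorial(e))  # coefficient in (x1+x2+x3)^n
stopifnot(all.equal(coef / d^n, dmultinom(e, size = n, prob = rep(1/d, d))))
```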

Limiting our interest to no more than $m$ appearances by any face is tantamount to evaluating $f_n$ modulo the ideal $\mathcal I$ generated by $x_1^{m+1}, x_2^{m+1}, \ldots, x_d^{m+1}.$ To perform this evaluation, use the Binomial Theorem recursively to obtain

$$\eqalign{ f_n(x_1, \ldots, x_d) &= ((x_1+\cdots+x_r) + (x_{r+1}+x_{r+2} + \cdots + x_{2r}))^n \\ &= \sum_{k=0}^n \binom{n}{k} (x_1+\cdots+x_r)^k (x_{r+1}+\cdots+x_{2r})^{n-k} \\ &= \sum_{k=0}^n \binom{n}{k} f_k(x_1, \ldots, x_r) f_{n-k}(x_{r+1}, \ldots, x_{2r}) }$$

when $d=2r$ is even. Writing $f_n^{(d)} = f_n(1,1,\ldots, 1)$ ($d$ terms), we have

$$f_n^{(2r)} = \sum_{k=0}^n \binom{n}{k} f_k^{(r)} f_{n-k}^{(r)}.\tag{a}$$

When $d=2r+1$ is odd, use an analogous decomposition

$$\eqalign{ f_n(x_1, \ldots, x_d) &= ((x_1+\cdots+x_{2r}) + x_{2r+1})^n \\ &= \sum_{k=0}^n \binom{n}{k} f_k(x_1, \ldots, x_{2r}) f_{n-k}(x_{2r+1}), }$$

giving

$$f_n^{(2r+1)} = \sum_{k=0}^n \binom{n}{k} f_k^{(2r)} f_{n-k}^{(1)}.\tag{b}$$

In both cases, we may also reduce everything modulo $\mathcal I$, which is easily carried out beginning with

$$f_n(x_j) \cong \left\{ \matrix{x_j^n & n \le m \\ 0 & n \gt m} \right. \mod \mathcal{I},$$

providing the starting values for the recursion,

$$f_n^{(1)} = \left\{ \matrix{1 & n \le m \\ 0 & n \gt m} \right.$$

What makes this efficient is that by splitting the $d$ variables into two equal-sized groups of $r$ variables each and setting all variable values to $1,$ we only have to evaluate everything once for one group and then combine the results. This requires computing up to $n+1$ terms, each of them needing $O(n)$ calculation for the combination. We don't even need a 2D array to store the $f_n^{(r)}$, because when computing $f_n^{(d)},$ only $f_n^{(r)}$ and $f_n^{(1)}$ are required.

The total number of steps is one less than the number of digits in the binary expansion of $d$ (which counts the splits into equal groups in formula $(a)$) plus the number of ones in the expansion (which counts all of the times an odd value is encountered, requiring the application of formula $(b)$). That's still just $O(\log(d))$ steps.
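To make the recursion concrete, here is a plain-probability sketch of formulas $(a)$ and $(b)$ together with the base case, working with raw counts in ordinary doubles rather than logarithms. This is adequate for $n=100,$ $d=6,$ where the counts stay below $6^{100}\approx 6.5\times 10^{77};$ the full code at the end of this post works in log space to avoid overflow in general.

```r
f1 <- function(n, m) c(rep(1, min(n, m) + 1), rep(0, max(n - m, 0)))  # f^(1) mod I
combine <- function(u, v) {        # (fg)_n = sum_k choose(n,k) f_k g_{n-k}
  n <- length(u) - 1
  sapply(0:n, function(j) sum(choose(j, 0:j) * u[1:(j+1)] * rev(v[1:(j+1)])))
}
fd <- function(n, m, d) {          # truncated counts f_0^(d), ..., f_n^(d)
  if (d == 1) return(f1(n, m))
  half <- fd(n, m, floor(d/2))     # evaluate one group of r variables once
  x <- combine(half, half)                      # formula (a)
  if (d %% 2 == 1) x <- combine(x, f1(n, m))    # formula (b)
  x
}
fd(100, 20, 6)[101] / 6^100        # probability for n = 100: about 0.2677479
```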

In R on a decade-old workstation the work was done in 0.007 seconds. The code is listed at the end of this post. It uses logarithms of the probabilities, rather than the probabilities themselves, to avoid possible overflows or accumulating too much underflow. This makes it possible to remove the $d^{-n}$ factor in the solution so we may compute the counts that underlie the probabilities.

Note that this procedure results in computing the whole sequence of probabilities $f_0, f_1, \ldots, f_n$ at once, which easily enables us to study how the chances change with $n$.


Applications

The distribution in the generalized Birthday Problem is computed by the function tmultinom.full. The only challenge lies in finding an upper bound for the number of people who must be present before the chance of an $m+1$-collision becomes too great. The following code does this by brute force, starting with small $n$ and doubling it until it's large enough. The whole calculation therefore takes $O(n^2\log(n)\log(d))$ time where $n$ is the solution. The entire distribution of probabilities for numbers of people up through $n$ is computed.

#
# The birthday problem: find the number of people where the chance of
# a collision of `m+1` birthdays first exceeds `alpha`.
#
birthday <- function(m=1, d=365, alpha=0.50) {
  n <- 8
  while((p <- tmultinom.full(n, m, d))[n] > alpha) n <- n * 2
  return(p)
}

As an example, the minimum number of people needed in a crowd to make it more likely than not that at least eight of them share a birthday is $798$, as found by the calculation birthday(7). It takes just a couple of seconds. Here is a plot of part of the output:

[Figure: plot of part of the output omitted]
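As an independent sanity check of the $m=1$ special case (the classic Birthday Problem), the product formula for the probability that $k$ people all have distinct birthdays can be computed directly, without any of the code in this post; a shared birthday first becomes more likely than not at $k=23$:

```r
# Classic birthday problem: P(k people all distinct) = prod_{i<k} (1 - i/365).
p.distinct <- cumprod(1 - (0:99) / 365)   # entry k is the probability for k people
min(which(1 - p.distinct > 0.5))          # smallest k with a likely collision: 23
```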


A special version of this problem is addressed at Extending the birthday paradox to more than 2 people, which concerns the case of a $365$-sided die that is rolled a very large number of times.


Code

# Compute the chance that in `n` independent rolls of a `d`-sided die, 
# no side appears more than `m` times.
#
tmultinom <- function(n, m, d, count=FALSE) tmultinom.full(n, m, d, count)[n+1]
#
# Compute the chances that in 0, 1, 2, ..., `n` independent rolls of a
# `d`-sided die, no side appears more than `m` times.
#
tmultinom.full <- function(n, m, d, count=FALSE) {
  if (n < 0) return(numeric(0))
  one <- rep(1.0, n+1); names(one) <- 0:n
  if (d <= 0 || m >= n) return(one)

  if(count) log.p <- 0 else log.p <- -log(d)
  f <- function(n, m, d) {                   # The recursive solution
    if (d==1) return(one)                    # Base case
    r <- floor(d/2)
    x <- double(f(n, m, r), m)               # Combine two equal values
    if (2*r < d) x <- combine(x, one, m)     # Treat odd `d`
    return(x)
  }
  one <- c(log.p*(0:m), rep(-Inf, n-m))      # Reduction modulo x^(m+1)
  double <- function(x, m) combine(x, x, m)
  combine <- function(x, y, m) {             # The Binomial Theorem
    z <- sapply(1:length(x), function(n) {   # Need all powers 0..n
      z <- x[1:n] + lchoose(n-1, 1:n-1) + y[n:1]
      z.max <- max(z)
      log(sum(exp(z - z.max), na.rm=TRUE)) + z.max
    })
    return(z)
  }
  x <- exp(f(n, m, d)); names(x) <- 0:n
  return(x)
}

The answer is obtained with

print(tmultinom(100,20,6), digits=15)

0.267747907805267

$\endgroup$
$\begingroup$

Random sampling method

I ran this code in R, replicating 100 die throws a million times:

y <- replicate(1000000, all(table(sample(1:6, size = 100, replace = TRUE)) <=20))

The expression inside the replicate function is TRUE if all the faces appear at most 20 times, so y is a vector of one million TRUE/FALSE values.

The number of TRUE values in y divided by one million approximates the probability you desire. In my case it was 266872/1000000, suggesting a probability of around 26.7%.
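A rough gauge of the Monte Carlo error, using a normal approximation to the sampling distribution of the estimated proportion (the counts below are the ones reported above):

```r
# Standard error of the simulated proportion (266872 successes in 1e6 runs).
p.hat <- 266872 / 1e6
se <- sqrt(p.hat * (1 - p.hat) / 1e6)     # about 0.00044
c(p.hat - 1.96 * se, p.hat + 1.96 * se)   # approximate 95% confidence interval
```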

$\endgroup$
$\begingroup$

Brute force calculation

This code takes a few seconds on my laptop:

total = 0
pb <- txtProgressBar(min = 0, max = 21^2, style = 3)
for (i in 0:20) {
  for (j in 0:20) {
    for (k in 0:20) { 
      for (l in 0:20) {
        for (m in 0:20) {
          n = 100 - sum(i, j, k, l, m)   # n >= 0, since the five loops sum to at most 100
          if (n <= 20) {
            total = total + dmultinom(c(i, j, k, l, m, n), 100, prob = rep(1/6, 6))
          }
        }
      }
    }
    setTxtProgressBar(pb, i*21 + j + 1)  # update progress bar (21 values of each of i, j)
  }
}
total
total

output: 0.2677479

Still, it might be interesting to find a more direct method if you wish to do many of these calculations, use larger values, or simply want a more elegant approach.

At least this computation gives a simply calculated, but valid, number against which to check other (more complicated) methods.
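One such more direct route (not part of the original answer, sketched here for comparison) uses exponential generating functions: the desired probability is the coefficient of $x^{100}$ in $\left(\sum_{k=0}^{20} x^k/k!\right)^6,$ multiplied by $100!/6^{100}.$ Truncating the series at degree 20 enforces the "no face more than 20 times" constraint:

```r
g <- c(1 / factorial(0:20), rep(0, 80))       # truncated exp series, degree 100
h <- Reduce(function(a, b) {                  # polynomial product mod x^101
  sapply(1:101, function(i) sum(a[1:i] * b[i:1]))
}, rep(list(g), 6))                           # raise to the 6th power
h[101] * exp(lfactorial(100) - 100 * log(6))  # about 0.2677479
```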

$\endgroup$
