39
$\begingroup$

A fair die is rolled 1,000 times. What is the probability of rolling the same number 5 times in a row? How do you solve this type of question for variable number of throws and number of repeats?

$\endgroup$
3
  • 9
    $\begingroup$ The question does not specifically state how many sides the die has. Sextus Empiricus's answer assumes a 6-sided die, which is the most commonly used die type. Is that what you wanted? $\endgroup$ Commented Oct 15, 2020 at 15:29
  • 7
    $\begingroup$ @LelandHepworth I am not so much into dungeons and dragons or other roll playing games. So I assumed a 6-sided dice, which is the standard dice in statistics. But the question is easily generalizable. The type of dice is irrelevant for explaining the solution to the question. $\endgroup$ Commented Oct 15, 2020 at 17:00
  • $\begingroup$ @LelandHepworth Exactly. My spontaneous thought was about the tetraeder-shaped dies Irving Finkel (one of the curators of the British Museum) uses to play the Royal Game of Ur (think about the epoch cuneiform and Mesopotamia). Have a look, e.g. here with Tom Scott. $\endgroup$
    – Buttonwood
    Commented Oct 17, 2020 at 20:04

2 Answers 2

68
$\begingroup$

Below we compute the probability in four ways:

Computation with Markov Chain          0.473981098314993
Computation with generating function   0.473981098314988
Estimation false method                0.536438013618686
Estimation correct method              0.473304632462677

The first two are exact methods and differ only a little (probably some round of errors), the third method is a naive estimation that does not give the correct number, the fourth method is better and gives a result that is very close to the exact method.

Computationally:

Markov Chain

You can model this computationally with a transition matrix

Say the column vector $X_{k,j} = \lbrace x_1,x_2,x_3,x_4,x_5 \rbrace_{j}$ is the probability to have $k$ of the same numbers in a row in the $j$-th dice roll. Then (when assuming a 6-sided dice)

$$X_{k,j} = M \cdot X_{k,j-1}$$ with

$$M = \begin{bmatrix} \frac{5}{6} & \frac{5}{6} & \frac{5}{6} & \frac{5}{6} & 0 \\ \frac{1}{6} & 0& 0 & 0 & 0 \\ 0& \frac{1}{6} & 0& 0 & 0 \\ 0 & 0& \frac{1}{6} & 0& 0 \\ 0&0 & 0& \frac{1}{6} & 1 \\ \end{bmatrix}$$

where this last entry $M_{5,5} = 1$ relates to 5 of the same in a row being an absorbing state where we 'stop' the experiment.

After the first roll you will be certainly in state 1 (there's certainly only 1 of the same number in a row).

$$X_{k,1} = \lbrace 1,0,0,0,0 \rbrace$$

After the $j$-th roll this will be multiplied with $M$ a $j-1$ times

$$X_{k,j} = M^{j-1} \lbrace 1,0,0,0,0 \rbrace$$

R-Code:

library(matrixcalc) ### allows us to use matrix.power

M <- matrix(c(5/6, 5/6, 5/6, 5/6, 0,
              1/6, 0  , 0  , 0  , 0,
              0,   1/6, 0  , 0  , 0,
              0,   0  , 1/6, 0  , 0,
              0,   0  , 0  , 1/6, 1),
            5, byrow = TRUE)

start <- c(1,0,0,0,0)
matrix.power(M,999) %*% start

The result is $$X_{k,1000} = \begin{bmatrix} 0.438631855\\ 0.073152468\\ 0.012199943\\ 0.002034635\\ \color{red}{0.473981098}\end{bmatrix}$$

and this last entry 0.473981098 is the probability to roll the same number 5 times in a row in 1000 rolls.

generating function

Our question is:

  • How to calculate the probability of rolling any number at least $k$ times in a row, out of $n$ tries?

This is equivalent to the question

  • How to calculate the probability of rolling the number 6 at least $k-1$ times in a row, out of $n-1$ tries?

You can see it as tracking whether the dice roll $m$ is the same number as the number of the dice roll $m-1$ (which has 1/6-th probabilty). And this needs to happen $k-1$ times in a row (in our case 4 times).

In this Q&A the alternative question is solved as a combinatorial problem: How many ways can we roll the dice $n$ times without the number '6' occuring $k$ or more times in a row.

This is found by finding all possible combinations of ways that we can combine the strings 'x', 'x6', 'x66', 'x666' (where 'x' is any number 1,2,3,4,5) into a string of length $n+1$ ($n+1$ instead of $n$ because in this way of constructing strings the first letter is always $x$ here). In this way we counted all possibilities to make a string of length $n$ but with only 1, 2, or 3 times a 6 in a row (and not 4 or more times).

Those combinations can be found by using an equivalent polynomial. This is very similar to the binomial coefficients which relate to the coefficients when we expand the power $(x+y)^n$, but it also relates to a combination.

The polynomial is

$$\begin{array}{rcl} P(x) &=& \sum_{k=0}^\infty (5x+5x^2+5x^3+5x^4)^k\\ &=& \frac{1}{1-(5x+5x^2+5x^3+5x^4)} \\ &=& \frac{1}{1-5\frac{x-x^5}{1-x}}\\ &=& \frac{1-x}{1-6x+5x^5} \end{array}$$

The coefficient of the $x^n$ relates to the number of ways to arrange the numbers 1,2,3,4,5,6 in a string of length $n-1$ without 4 or more 6's in a row. This coefficient can be found by a recursive relation. $$P(x) (1-6x+5x^5) = 1-x$$ which implies that the coefficients follow the relation

$$a_n - 6a_{n-1} + 5 a_{n-5} = 0$$

and the first coefficients can be computed manually

$$a_1,a_2,a_3,a_4,a_5,a_6,a_7 = 5,30,180,1080,6475,38825,232800$$

With this, you can compute $a_{1000}$ and $1-a_{1000}/6^{999}$ will be the probability to roll the same number 5 times in a row 5.

In the R-code below we compute this (and we include a division by 6 inside the recursion because the numbers $a_{1000}$ and $6^{999}$ are too large to compute directly). The result is $0.473981098314988$, the same as the computation with the Markov Chain.

x <- 6/5*c(5/6,30/6^2,180/6^3,1080/6^4,6475/6^5,38825/6^6,232800/6^7)
for (i in 1:1000) {
  t <- tail(x,5)
  x <- c(x,(6/6*t[5]-5/6^5*t[1]))   ### this adds a new number to the back of the vector x
}
1-x[1000]

Analytic/Estimate

Method 1: wrong

You might think, the probability to have in any set of 5 neighboring dices, 5 of the same numbers, is $\frac{1}{6^4} = \frac{1}{1296}$, and since there are 996 sets of 5 neighboring dices the probability to have in at least one of these sets 5 of the same dices is:

$$ 1-(1-\frac{1}{6^4})^{996} \approx 0.536$$

But this is wrong. The reason is that the 996 sets are overlapping and not independent.

Method 2: correct

A better way is to approximate the Markov chain that we computed above. After some time you will get that the occupation of the states, with 1,2,3,4 of the same number in a row, are more or less stable and the ratio's will be roughly $1/6,1/6^2,1/6^3,1/6^4$ (*). Thus the fraction of the time that we have 4 in a row is:

$$\text{frequency 4 in a row} = \frac{1/6^4}{1/6+1/6^2+1/6^3+1/6^4}$$

If we have these 4 in a row then we have a 1/6-th probability to finish the game. So the frequency of finishing the game is

$$\text{finish-rate} = \frac{1}{6} \text{frequency 4 in a row} = \frac{1}{1554}$$

and the probability to be finished after $k$ steps is approximately

$$P_k \approx 1-(1-\frac{1}{1554})^{k-4} \underbrace{\approx 0.47330}_{\text{if $k=1000$}}$$

much closer to the exact computation.


(*) The occupation in state $k$ during roll $j$ will relate to the occupation in state $k-1$ during roll $j-1$. We will have $x_{k,j} = \frac{1}{6} x_{k-1,j-1} \approx \frac{1}{6} x_{k-1,j}$. Note that this requires that you have $x_{k-1,j} \approx x_{k-1,j-1}$, which occurs when the finish-rate is small. If this is not the case, then you could apply a factor to compensate, but the assumption of relatively steady ratio's will be wrong as well.

Related problems

This latter related problem gives a different approximation based on expectation values and estimates the distribution as an overdispersed Poisson distribution. Giving an approximation $1- \exp \left(-(1000-5+1)\left(\frac{1}{6^4}\right) /1.2 \right)\approx 0.4729354$ which isn't bad either.

$\endgroup$
8
  • $\begingroup$ What code? R? $\endgroup$ Commented Oct 15, 2020 at 18:50
  • $\begingroup$ @PeterMortensen The code is in the grey block. But I guess that you are asking for the type of language. I have used the language that is most used on stats.stackexchange.com $\endgroup$ Commented Oct 15, 2020 at 18:53
  • $\begingroup$ @PeterMortensen It's R $\endgroup$
    – Dale C
    Commented Oct 15, 2020 at 20:38
  • $\begingroup$ "The result is 0.473981098" Pardon the semi-math-challenged question, but does that mean that the odds of it happening somewhere in those 1,000 rolls are just over 47%? Or had you already converted to percentage? (Which seems unlikely to me. In 1,000 rolls, I'd figure the odds were well over less than half a percent...but ~47% percent sounds reasonable.) $\endgroup$ Commented Oct 16, 2020 at 9:52
  • 3
    $\begingroup$ @T.J.Crowder it means $X_{5,1000} = 0.473981098$ it is the fraction of the cases that have ended in the state 5 after 1000 rolls. So yes in terms of percentages this would be ~47%. I will clarify this in the answer later. $\endgroup$ Commented Oct 16, 2020 at 10:02
0
$\begingroup$

I got a different result from the accepted answer and would like to know where I've gone wrong.

I assumed a fair, 6-sided die, and simulated 1000 runs of 1000 rolls each. When the result of a roll matches the results of the previous 4 rolls, a flag is set to TRUE. The mean of this flag column and the mean of the runs is then reported. I get ~0.07% as the probability of seeing 5 rolls in a row of the same number.

In R,

tibble(
  run = rep(seq(1:1000), each = 1000), 
  roll = rep(seq(1:1000), 1000), 
  x = sample(1:6, 1000000, replace = T)
  ) %>% 
group_by(run) %>% 
mutate(
  same_five = x == lag(x, 1) & x == lag(x, 2) & x == lag(x, 3) & x == lag(x, 4)
  ) %>% 
summarize(
  p_same_five = mean(same_five, na.rm = TRUE), .groups = "drop"
  ) %>% 
summarize(mean(p_same_five)) * 100

  mean(p_same_five)
1        0.07208702
$\endgroup$
3
  • 1
    $\begingroup$ It looks like what you are calculating is the frequency of rolling 5 in a row, which Sextus showed is about 0.06. However, the problem isn't asking for the frequency of 5 in a row, but rather how likely you are to see at 5 in a row at least once during 1000 rolls. So you should have a flag for each of your runs that you mark True if 5 in a row occurs anywhere in the run. $\endgroup$
    – Tyberius
    Commented Oct 17, 2020 at 1:32
  • $\begingroup$ It'll work with this line same_five = sum(x == lag(x, 1) & x == lag(x, 2) & x == lag(x, 3) & x == lag(x, 4), na.rm = TRUE)>0 then the same_five in the entire row will be TRUE if for any of the x you have a 5-in-a-row. Personally, I would not do this programming nested within tibble and dplyr. It makes it more difficult to debug and also more difficult for others to read. ` $\endgroup$ Commented Oct 17, 2020 at 20:25
  • $\begingroup$ Yeah Tyberius has it, I did do a simulation using run length encoding, andrewpwheeler.com/2020/10/17/simulating-runs-of-events. You can see it approximately conforms to Sextus's analytic estimates (0.475 for the R and python code). $\endgroup$
    – Andy W
    Commented Oct 18, 2020 at 14:06

Not the answer you're looking for? Browse other questions tagged or ask your own question.