4
$\begingroup$

Consider a given sequence of length $k$. I want to calculate the number of sequences of length $n$ that contains the given sequence as a subsequence. The alphabet used to generate the string consists of $|A|$ values.

For example, a sequence "120" is given. in this case, k=3. Consider the alphabet to be $A=\{0,1,2,3\}$ and $n=5$. In this case, two of the possible sequences are:

10230
12320

And the question becomes the total number of sequences of length 5 which contain "120" as a subsequence. The important part here is that the given subsequence is not necessarily contained in the sequence, which is clear by the provided examples.

I know this problem can be solved by using the principle of inclusion-exclusion. However, I was looking for a more straightforward and probably a closed-form equation for this problem.

Thank you in advance for your help.

$\endgroup$
2
  • $\begingroup$ Do you think that this number does depend only on k,n and |A| but not on the special structure of the sub-sequence? $\endgroup$
    – miracle173
    Commented Jan 14, 2020 at 16:12
  • $\begingroup$ @miracle173 This problem can be solved by the principle of inclusion and exclusion, which does not consider the structure of the subsequence. So I guess it is independent of the subsequence. However, I could be wrong. $\endgroup$
    – user741849
    Commented Jan 14, 2020 at 16:17

1 Answer 1

3
$\begingroup$

Let $T$ be your given sequence of length $k$, and $d$ the size of the alphabet. Let $A(k,n)$ be the set of sequences $S$ of length $n$ such that $T$ forms a subsequence of $S$.

It looks to me like $$|A(k,n)| = \sum_{j=0}^{n-k} {n \choose j+k} (d-1)^{n-k-j}$$

EDIT: This can be seen as follows. Let $p_k$ be the position of the last occurrence of $T_k$ in $S$, and for $1 \le j < k$ let $p_j$ be the last occurrence of $T_j$ before $p_{j+1}$. In order for $S \in A(k,n)$ we need all $p_j \ge 1$. Conversely, given a choice of $p_1, \ldots, p_k$ with $1 \le p_1 \le \ldots \le p_k \le n$, there are $ d^{p_1-1} (d-1)^{n-k-p_1}$ members of $A(k,n)$ where $S_{p_j} = T_j$, for $p_j < i < p_{j+1}$ (or $n$ in the case $j=k$) $T_i$ can be any symbol except $T_j$, and for $1 \le i < p_1$, $T_i$ can be any symbol. To get a sum just involving a power of $d-1$ rather than both $d$ and $d-1$, we also look at the occurrences of $T_1$ before $p_1$. The term ${n \choose j+k} (d-1)^{n-k-j}$ comes from the case where there are $j$ occurrences of $T_1$ in $S_{1..p_1-1}$.

$\endgroup$
3
  • 1
    $\begingroup$ It looks to me you are right because I verified the formula by checking all sequences for A(k,n,d): (A(5,7,4)=211; A(5,8,4)=1789; A(3,8,4)=21067; A(3,8,5)=79329; A(1,5,5)=2101; A(3,4,7)=25 $\endgroup$
    – miracle173
    Commented Jan 14, 2020 at 17:38
  • $\begingroup$ Thank you for your comment. I checked for some cases and it seems right. Mathematical reasoning is also fair and understanding. $\endgroup$
    – user741849
    Commented Jan 14, 2020 at 18:48
  • $\begingroup$ This proof for Levenshtein's formula bypasses the recursive argument--that's nice! But how do you get to it (accounting for occurrences of $T_1$) from the formula your proof yields, which I believe is $\sum_{p_1=1}^{n-k+1} q^j (q-1)^{n-k-p_1} \binom{n-p_1}{k-1}$? $\endgroup$ Commented Nov 30, 2022 at 10:04

You must log in to answer this question.