Longest Repeated (Scattered) Subsequence in a String

Question

Informal Problem Statement:

Given a string, e.g. $ACCABBAB$, we want to colour some letters red and some letters blue (and some not at all), such that reading only the red letters from left to right yields the same result as reading only the blue letters.

In the example we could colour them like this: $A\color{blue}{C}\color{red}{CAB}B\color{blue}{AB}$

Therefore, we say $CAB$ is a repeated subsequence of $ACCABBAB$. It is also a longest repeated subsequence (which is easy to check).

Can we compute longest repeated subsequences efficiently?

Formal Question:

Is it NP-hard to decide for a string and some $k$, whether a repeated subsequence of length $k$ exists in the string?

If so: Which problem can be reduced to this problem?
If not: What is an efficient algorithm? (obviously, this algorithm can then be used to compute a longest repeated subsequence)

Bonus Question:

Will their always be a repeated subsequence of length $n/2 - o(n)$ if the size of the alphabet is bounded by a constant?

(This is known to be true for binary alphabets.)

Edit 2: The negative answer to the Bonus Question is already known for alphabets of size at least $5$. In fact for alphabets of size $Σ$, there are strings with longest repeated subsequences of a length of merely $O(n · Σ^{-1/2})$. Random strings suffice to show this. The result already existed, but I overlooked it.

Edit: Note:

Some people mean "substring" when they say "subsequence". I don't. This is not the problem of finding a substrings twice.

Sekti, thank you. You are right: my first comment was erroneous; I've now deleted it. On the other hand, my remaining comment is talking about non-contiguous subsequences. If $k$ is fixed, there is a way to solve your problem (with non-contiguous subsequences, mandated to be non-overlapping) in $O(n^{2k+2})$ time or so. Each d.p. subproblem keeps track of the indices of all of the red letters and all of the blue letters chosen so far. This is probably uninteresting, because it doesn't tell us what happens when $k$ is part of the input. — D.W., Commented Jun 16, 2014 at 22:36
@BryceKille, I don't know; maybe it can. If you figure out how to do it I hope you'll write an answer! — D.W., Commented Apr 24, 2019 at 4:39

pcpthm · Accepted Answer · 2020-10-04 09:23:54Z

5

The special case of $k = n/2$ is the same problem as this CST.SE question How hard is unshuffling a string? asks.

Buss and Soltys proved NP-completeness of this problem [1] by reducing 3-Partition problem to this problem.

[1]: Buss, Sam, and Michael Soltys. "Unshuffling a square is NP-hard." Journal of Computer and System Sciences 80.4 (2014): 766-776.

answered Oct 4, 2020 at 9:23

pcpthm

2,8175 silver badges15 bronze badges

2

$\begingroup$ There seems to have been a preceding slide presentation. $\endgroup$
– greybeard
Commented Oct 4, 2020 at 9:32

Add a comment |

noplogist · Accepted Answer · 2015-12-12 00:02:27Z

-3

This can be solved in ~~polynomial time~~ by constructing a graph $G$ where each node represents a point $(i,j)$ in some repeated subsequence of $S$ such that $S[i]=S[j]$. Edge between nodes $u$ and $v$ means that $u$ can be continued by $v$ to form a repeated subsequence of length 2.

1. Find the nodes. This can be done in $O(n^2)$ time by building a sorted list of indices for each character $c$, and then enumerating the unique pairs. There are no more than $m=n^2$ nodes.

2. Find the edges. It takes $O(1)$ time to check if node $u$ can be continued by node $v$, so by considering all pairs $(u,v)$ this step takes $O(m^2)$ time.

3. Note that the longest path in $G$ may not be a valid repeated subsequence. Consider paths $ab$ and $bc$. If there exists an edge $ac$ then $abc$ is a valid repeated subsequence of length 3. Therefore it takes $O(m^4)$ time to find all repeated subsequences of length 3. In the general case it takes linear time to check whether two valid paths of length $n$ can be combined into a valid path of length $n+1$.

4. Iterate step 3 until no longer paths can be found.

edited Dec 12, 2015 at 0:02

answered Dec 11, 2015 at 22:47

noplogist

274 bronze badges

$\begingroup$ Hmm. Too quick. In step 3 the number of subsequences to consider gets bigger and bigger. So it's not polynomial. $\endgroup$
– noplogist
Commented Dec 11, 2015 at 23:07
1

$\begingroup$ Welcome to CS.SE, and thank you for trying to solve this problem! I'm afraid I don't understand your algorithm. What is Step 3? All I see in "3." are some declarative statements/observations, but I don't see a procedural specification of what the algorithm is supposed to do. Also, I don't see what it means to iterate Step 3, or the rationale for your claim that $O(nnm^2)$ time suffices. Your subsequent comment makes it sound like you no longer believe your answer is correct. If so, it might be better to delete the answer, to avoid confusion. $\endgroup$
– D.W. ♦
Commented Dec 11, 2015 at 23:22
$\begingroup$ You can always undelete or post a new answer if you figure out the answer later. $\endgroup$
– D.W. ♦
Commented Dec 11, 2015 at 23:22
$\begingroup$ D.W., thanks. Input to step 3 is all repeated subsequences of length n, and output is all repeated subsequences of length n+1. I believe the algorithm is correct but that it is not a polynomial time algorithm. I have now marked the claims that I consider incorrect. $\endgroup$
– noplogist
Commented Dec 11, 2015 at 23:40
1

$\begingroup$ Thank you. I understand. Unfortunately, with those revisions, I'm afraid this answer does not answer the question that was asked. The question was: Is this NP-hard? Is there an efficient algorithm? Showing that there exists an exponential-time algorithm does not help in answering either of those questions. Indeed, it's already trivial to see that there is an exponential-time algorithm for the problem, without invoking any fancy techniques. I do appreciate your attempt. $\endgroup$
– D.W. ♦
Commented Dec 12, 2015 at 2:11

Add a comment |

Stack Exchange Network

Longest Repeated (Scattered) Subsequence in a String

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
algorithms
complexity-theory
np-complete
strings
subsequences
or ask your own question.

Hot Network Questions

Longest Repeated (Scattered) Subsequence in a String

2 Answers 2

Not the answer you're looking for? Browse other questions tagged algorithmscomplexity-theorynp-completestringssubsequences or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
algorithms
complexity-theory
np-complete
strings
subsequences
or ask your own question.