1
$\begingroup$

I have a set $S$ with a total of 20000 items. I am also given a list $L$ of 0.5 million sets, with each set having 1-20 elements from the original set. I am given an integer $n$. Now I need a new set $S' \subseteq S$ containing $n$ items from $S$. $n$ is in the range of 1000-5000. The new set $S'$ generated should be such that a maximum number of sets from the list $L$ are a subset of $S'$. Is there an efficient algorithm that can give me the set $S'$? Brute force is probably factorial time complexity.

Background: this question is related to a set of items available in a store and the the list of orders that the store had historically. We are trying to reduce the inventory from 20000 to 1000-5000 so that most of the historical orders could have been fulfilled.

$\endgroup$
5
  • $\begingroup$ Does this answer your question? Is Minimum Coverage Problem NP-hard? $\endgroup$
    – D.W.
    Commented Apr 20 at 3:59
  • 1
    $\begingroup$ @Tarique, it looks like you might be a little bit new here, so I don't know how familiar you are with how the site works. Questions that aren't sufficiently clear or need additional work are put on hold to give you a chance to revise them before anyone answers; once they're edited, they can be considered for re-opening (and I have voted to re-open your question, and if you give it a little time I'm confident it will be re-opened). Our experience is that this helps avoid problems. $\endgroup$
    – D.W.
    Commented Apr 20 at 20:04
  • 1
    $\begingroup$ To learn more, see math.stackexchange.com/help/closed-questions and math.meta.stackexchange.com/a/37370/14578 and meta.stackexchange.com/q/19126/160917. To me, it looks like what happened in this case was that the question was put on hold because there were some ambiguities that were not resolved. Then you edited the question to resolve those ambiguities, but also added editorial remarks that don't belong in questions. I edited the question to remove those editorial remarks and focus the question, and it has now been re-opened. $\endgroup$
    – D.W.
    Commented Apr 20 at 20:04
  • 2
    $\begingroup$ @D.W Thank you for providing a detailed answer and re-opening the question. I understand that editorial remarks don't belong in the question but I only did it after multiple edits to get the question reopened was ignored. It is not at all helpful for a QnA site to close questions without providing definite reasons. We are all trying to learn here $\endgroup$
    – Tarique
    Commented Apr 21 at 1:18
  • 1
    $\begingroup$ Thank you for your kind words. There were reasons provided for the closure: I left comments which explained the ambiguities, and those preceded the closure. Those issues were not addressed until after the question was closed. However, I agree that the question could have been re-opened more promptly, once the ambiguities were addressed by your subsequent edit. $\endgroup$
    – D.W.
    Commented Apr 21 at 3:01

1 Answer 1

2
$\begingroup$

The problem is NP-hard, as proven in Is Minimum Coverage Problem NP-hard?. That answer points out that the problem is called the minimum k-union problem.

When you have a NP-hard problem, you have to find ways to deal with it. One approach is to use an approximation algorithm. In this case, there is a greedy algorithm that does not find the optimal answer but gets within some approximation factor. Follow the links I provided to find it.

Another approach is to use an off-the-shelf solver. In this case, your problem can be formulated as an instance of integer linear programming, and then you can apply any existing ILP solver. To formulate it that way, I suggest introducing a zero-or-one variable $x_i$ per item $i \in S$, and a zero-or-one variable $y_j$ per set $j \in L$. The intended meaning is that $x_i=1$ means that item $i$ is included in set $S'$, and $y_j=1$ means that set $j$ is a subset of $S'$. Add linear inequalities to capture the consistency requirements between these variables: specifically, $x_i \ge y_j$ for each $i,j$ such that item $i$ is contained in set $j$. The requirement that $|S'| \le n$ can be captured by $\sum_i x_i \le n$. Finally, your goal is to maximize the number of sets in $L$ that are a subset of $S'$, i.e., to maximize $\sum_j y_j$. This forms a system of linear inequalities that can be solved with an ILP solver. The running time still might ultimately be exponential, but it lets you take advantage of all of the non-trivial optimizations found in existing ILP solvers.

For your particular problem parameters, I suspect the problem is likely too large to be solved to optimality within any reasonable amount of time. However, many ILP solvers allow you to set a timeout and will produce the best solution found so far within the available time, which might be a reasonable approach to finding the best solution you can for your problem.

$\endgroup$

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .