My problem requires choosing a fixed number of vectors from a large set of vectors such that the sum of these vectors is close to some known target vector. That is, given known parameters:
$$ l, m, n \in \mathbb{N} $$
$$ \{a_{i,j} \in \mathbb{R} : i \in 1..m,\ j \in 1..l\} $$
$$ \{c_i \in \mathbb{R} : i \in 1..m\} $$
minimize
$$ \sum_{i=1}^m \left| c_i - \sum_{j=1}^l a_{i,j} x_j \right| $$
subject to
$$ \sum_{j=1}^l x_j = n $$
$$ x_j \in \{0,1\} ~\forall j $$
The typical problem size has $l \approx 1000 - 10000, m \approx 100, n \approx 2m$, though I may eventually want to explore larger numbers.
If the distributional properties of $a_{i,j}$ and $c_i$ are relevant, for the purposes of this problem, let's suppose the following:
- The vectors $[a_{1,j},a_{2,j},...,a_{m,j}]$ are i.i.d. realisations from some $m$-dimensional random distribution, with $-k < a_{i,j} < k~\forall i,j$ for some known $k$.
- The vector $[c_1,c_2,...,c_m]$ has been calculated as the sum of some set of $n$ vectors drawn from that same $m$-dimensional random distribution.
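For concreteness, here is how I generate synthetic instances with these properties (the uniform distribution is just one choice of bounded $m$-dimensional distribution; the function name and defaults are illustrative):

```python
import numpy as np

def make_instance(l=2000, m=30, n=100, k=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Columns a[:, j] are i.i.d. draws with entries bounded in (-k, k).
    a = rng.uniform(-k, k, size=(m, l))
    # c is the sum of n vectors from the same distribution; here I reuse
    # n of the actual columns, so a near-perfect solution is known to exist.
    chosen = rng.choice(l, size=n, replace=False)
    c = a[:, chosen].sum(axis=1)
    return a, c

a, c = make_instance()
print(a.shape, c.shape)  # (30, 2000) (30,)
```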
Formulating this as a MILP and solving via HiGHS works pretty well for smaller problem sizes, up to approx. $l = 2000, m = 30, n = 100$, but past that it's not scaling very well.
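To make the formulation concrete, this is the standard $\ell_1$ linearization I'm using, sketched here through SciPy's `milp` wrapper around HiGHS (the auxiliary variables $e_i$ stand in for $|c_i - (Ax)_i|$; names are illustrative):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_l1(a, c, n, time_limit=60.0):
    m, l = a.shape
    # Decision vector: [x (l binaries), e (m continuous slacks)].
    obj = np.concatenate([np.zeros(l), np.ones(m)])
    cons = [
        # A x - e <= c  and  -A x - e <= -c  together encode e >= |A x - c|.
        LinearConstraint(np.hstack([a, -np.eye(m)]), ub=c),
        LinearConstraint(np.hstack([-a, -np.eye(m)]), ub=-c),
        # Cardinality constraint: sum_j x_j = n.
        LinearConstraint(np.concatenate([np.ones(l), np.zeros(m)]).reshape(1, -1),
                         lb=n, ub=n),
    ]
    integrality = np.concatenate([np.ones(l), np.zeros(m)])
    bounds = Bounds(np.zeros(l + m),
                    np.concatenate([np.ones(l), np.full(m, np.inf)]))
    return milp(obj, constraints=cons, integrality=integrality,
                bounds=bounds, options={"time_limit": time_limit})
```

Since minimizing the slacks forces each $e_i$ down onto $|c_i - (Ax)_i|$ at the optimum, the MILP objective equals the original one.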
When I run it on larger problems, almost all the improvement comes in the first few minutes of run-time, and the solution I get after four hours is only marginally better than the one I had at three minutes. I don't need the absolute minimum but I need something pretty small, and for larger problems the solutions I'm getting aren't good enough.
There is some leeway to modify the problem:
The value of $l$ is not fixed: I can increase or decrease it, changing the number of $a_{i,j}$ parameters accordingly. However, if $l$ is too small there is no possible "good enough" solution, and making it too large seems to be counterproductive, giving a slightly worse solution for the same run-time. In theory, increasing $l$ can never worsen the global optimum, but I suspect the larger problem size makes that optimum harder to find - the reported MIP gap stays very close to 100%.
I am not too particular about the exact form of the objective function. If it makes things easier I'm happy to replace it with anything that gives a sensible norm for the vector $[(c_1- \sum_{j=1}^l a_{1,j} x_j),(c_2- \sum_{j=1}^l a_{2,j} x_j),...,(c_m- \sum_{j=1}^l a_{m,j} x_j)]$.
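For example, if minimizing the maximum deviation instead of the sum is easier for a solver, the $\ell_\infty$ version needs only a single auxiliary variable $t$ in place of $m$ absolute-value slacks (this is just one alternative I could imagine, not something I've tested):

$$ \min t \quad \text{s.t.} \quad -t \le c_i - \sum_{j=1}^l a_{i,j} x_j \le t ~\forall i, \quad \sum_{j=1}^l x_j = n, \quad x_j \in \{0,1\} ~\forall j $$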
Currently I'm using HiGHS as a solver because that's what I have at hand, but I'm open to other free options if they can be easily applied in Python. I might be able to get access to Gurobi but this would be a less convenient option.
I'm happy to do some experimentation here but the time I have available to work on this is limited, so I'm trying to identify the best options for exploration.
My questions:
Does this problem of minimizing $\|Ax - c\|_1$ over binary $x$ (with a cardinality constraint) have a standard name? It seems very similar to a knapsack problem, and I suppose it could be translated to a multidimensional knapsack problem where each element of $Ax - c$ contributes two constraints, but I'm not sure that's a helpful formulation. Not having a good term for this problem is getting in the way of searching for established methods.
Am I likely to get better results by changing approaches (e.g. transforming the problem, or switching to a different MILP solver, or to a constraint solver)? Or is this just inherently difficult to scale?