3
$\begingroup$

I have a sparse $60000\times10000$ matrix where each element is either a $1$ or $0$ as follows.

$$M=\begin{bmatrix}1 & 0 & 1 & \cdots & 1 \\1 & 1 & 0 & \cdots & 1 \\0 & 1 & 0 & \cdots & 1 \\\vdots & \vdots & \vdots & \ddots & \vdots \\0 & 0 & 0 & \cdots & 0 \end{bmatrix}$$

Each row represents a different point in time such that the first row is $t=1$ and the last row is $t=60000$. Each column in the matrix is a different combination of signals (ie. $1$s and $0$s). I want to choose five column vectors from $M$, $c_1, c_2, c_3, c_4, c_5$ and take the Hadamard (ie. element-wise) product:

$$s = c_1 \circ c_2 \circ c_3 \circ c_4 \circ c_5$$

The result is a new vector $s$ where $1$s only remain in a row if they were present in every vector $c_1, c_2, c_3, c_4, c_5$ for that row. I need an algorithm that finds the optimal strategy vector $s$ that is most similar to the optimal strategy vector $s^{*}$. The optimal strategy vector $s^{*}$ is static and is of the same dimensions as $s$.

The easiest solution is to test every combination possible; however, choosing $5$ columns from $10000$ results in $10^{20}$ different possible combinations. Note that the five vectors chosen from $M$ must not be distinct. As such, I want to find an algorithm that returns the five vectors $c_{1}$ through $c_{5}$ that yield the optimal strategy $s^{*}$.

I am not too sure on how to approach this problem but I believe it is a discrete optimisation problem where I want to maximise the similarity, $\operatorname{sim}(s,s^{*})$. I am defining the similarity using the Jaccard Index below as it rewards having $1$s in the correct place but does not penalize $0$s. This is good because I'd rather have a false negative than a false positive. The metric is bounded between $0$ and $1$ where $1$ denotes the maximum similarity. It is commonly used for 'binary vectors' and so seems ideal here as each element is either a $1$ or $0$.

$$\operatorname{sim}(s,s^{*})=\frac{s \cdot s^{*}}{\sum_{i=1}^{60000}\left (s_{i} + s_{i}^{*} \right ) - s \cdot s^{*}}$$

Note that the operation ($\cdot$) refers to the dot product and not the Hadamard product.

I posted a similar question before but it was removed for lack of context. I have added more context now but let me know if anything is unclear. I am asking this question to aid me in a personal project I am working on.

FYI, the algorithm will be run on a computer. Any help is much appreciated -- thanks!

$\endgroup$
6
  • $\begingroup$ Shouldn't $s\circ s=s$? Will $s^*$ also only have $0,1$ as entries? $\endgroup$
    – Sudix
    Commented Aug 2, 2021 at 15:59
  • $\begingroup$ You are right for the first question. I meant to write the sum of all elements in $s$. I'll fix that. $s^{*}$ is derived from elementwise operation of the other vectors (which exclusively have $0$s and $1$s) so it too only has $0$, $1$ entries. $\endgroup$ Commented Aug 2, 2021 at 16:12
  • $\begingroup$ The core part of your question lies within the chosen similarity measure (which, formally spoken, isn't a metric). If we had a classic metric (e.g. euclidian or taxicab metric) and were to minimize the distance of $s$ and $s^*$ there'd most likely exist results already, and with them algorithms. Therefore it's critical to make sure the measure you chose actually does what you want. Two points here are: 1. Is the non-linearity essential? 2. What do you do if $s^*=\vec{0}$? $\endgroup$
    – Sudix
    Commented Aug 2, 2021 at 16:55
  • 1
    $\begingroup$ For my purposes, $s^*\neq\vec{0}$. In regards to the metric @Sudix, I first saw it on the article link and upon further research now see that it is known as the Jaccard Index/Similarity (although in the paper they refer to it as the Tanimoto coefficient). This metric is regarded as better for 'binary vectors' (ie. only 1s and 0s). $\endgroup$ Commented Aug 2, 2021 at 18:00
  • $\begingroup$ Your dimensions seem to be a bit off. You start with a $100000 \times 60000$ matrix, then say it has $60000$ rows and that you are choosing five columns from $10000$ (not $100000$. $\endgroup$
    – prubin
    Commented Aug 3, 2021 at 19:02

2 Answers 2

1
$\begingroup$

Let $m:=100000, n:=60000$.

We'll show that just deciding whether or not there exists an $s_\text{OPT}$ with $s_\text{OPT}=s^*$ is NP-hard.

Let $s_\text{OPT} :=c_1 \circ c_2 \circ c_3 \circ c_4 \circ c_5$ denote the optimal solution, and $c_1,...,c_5$ a combination that produces this optimal solution.

If for any $i$ we have $(c_i)_x=0$, then $(s_\text{OPT})_x=0$. Therefore, if $s_\text{OPT}=s_*$, whenever we have $s^*_x=1$, then for all $i$ it must hold that $(c_i)_x=1$.

Let $M_{\downarrow i}$ denote the $i$-th column of $M$.

So if we look for each column $M_{\downarrow i}$ at the index set $S_i\subseteq \{1,..,n\}$, which tells us which cells of $M_{\downarrow i}$ are $0$, then the hadamard product of columns becomes the union of their index sets.

Formally:
For each column $M_{\downarrow i}$ define $S_i$ via the equivalence $\forall x \in \{1,...,n\}:\quad x\in S_i\Leftrightarrow (M_{\downarrow i})_x=0$.

Then the following holds: $$ (M_i\circ M_j )_x=0 \Leftrightarrow x\in (S_i \cup S_j) $$

Using this mapping of our row vectors to sets, we can now reduce this problem to a known one (technically I'd have to show the reduction the other way around, but the reduction in the other direction works similar):

Let $S_*$ be the set corresponding to $s^*$.

If we assume that $s_\text{OPT}=s_*$, then we only have to look at columns that have at least a one everywhere $s^*$ has a one, so we discard all sets $S_i$ where $S_i\supsetneq S_*$.

We now have an instance of Maximum k-Set Cover, which is a special case of the Set Cover Problem in which we want to find the biggest union of at most $k$ sets.

This problem however is in $W[2]$ when parameterized after $k$, so there can't be any efficient algorithm, see this link.

If you want an exact algorithm, your best chance is to look for an FPT that is only exponential in the size $\|s_\text{OPT}-s^*\|_1$.

Otherwise you'll probably have to live with a heuristic. While not exactly your problem, I think the ideas of heuristics for the Maximum Coverage Problem should be somewhat transferable.

$\endgroup$
8
  • $\begingroup$ Thanks for your response and suggestions @Sudix, I'll take a deeper look into the links you provided a bit later. I'm a bit confused about some of the things you wrote. First, why do we only need to look at rows that have a one everywhere $s^*$ has a one to find the exact solution? Wouldn't one need to be sure that other rows in $s_\text{OPT}$ don't also have ones? Also, you mentioned that the Set Cover Problem seeks to find the biggest union. Am I not trying to maximise the intersection here? $\endgroup$ Commented Aug 3, 2021 at 13:22
  • $\begingroup$ For your first point, you're right. I was a bit too inprecise there. For your second point: If we were looking at the ones, then we'd look at the maximum intersection. However, if we're looking at zeroes, then if there is just one component with a zero at index $x$, then there's a zero in the final Hadamard product as well, so we're looking at the union instead. $\endgroup$
    – Sudix
    Commented Aug 3, 2021 at 14:23
  • $\begingroup$ Having taken a better look at your answer now @Sudix, there are still some things I don't fully understand. First, why are you creating sets alongst each row. Should it not each column be a set as I want to choose 5 different columns? Also, here you take the Hadamard product of two different rows $(M_i\circ M_j )_x=0 \Leftrightarrow x\in (S_i \cup S_j)$. Should it not be of different columns? $\endgroup$ Commented Aug 5, 2021 at 9:04
  • $\begingroup$ Yes, you're right. Seems I misread "column" as "row" in your question $\endgroup$
    – Sudix
    Commented Aug 5, 2021 at 12:43
  • $\begingroup$ I'm having difficulties seeing how to apply the concepts from the maximum k-set cover problem as it will never be the case that $𝑠_{OPT}=𝑠^{∗}$ for me. I merely want to find the $𝑠_{OPT}$ most similar to $𝑠^{∗}$ as defined by the equation above (or by an easier one if needed, like $∥𝑠−𝑠^∗∥$). $\endgroup$ Commented Aug 8, 2021 at 23:05
0
$\begingroup$

You can formulate this as a binary linear program and exploit sparsity if you are willing to settle for a somewhat different objective function (your chosen similarity measure being nonlinear).

Let $O_i = \lbrace j : M_{ij} = 1\rbrace$ be the set of column indices where row $i$ has a one. Let $x_j\in \lbrace 0, 1 \rbrace$ be 1 if column $j$ is selected and 0 if not, and let $s_i \in [0,1]$ for each row index $i$. (You can make the $s_i$ binary or not. It's hard to say which way will work better with respect to solver time.) Now add the constraints\begin{align*} \sum_{j}x_{j} & =5\\ \sum_{j\in O_{i}}x_{j} & \le4+s_{i}\quad\forall i\\ \sum_{j\in O_{i}}x_{j} & \ge5s_{i}\quad\forall i. \end{align*} The first says that you have to select exactly five columns. The second says that if $s_i$ is to be 0, you cannot select more than four columns with value 1 in row $i$. The third says that if $s_i$ is to be 1, you must choose five columns that are all 1 in row $i$. Note that the number of constraints is approximately twice the number of rows, and the only dense constraint is the first one.

That brings us to the objective function. If you want to maximize the number of ones in $s^*$ that get matched, you can just maximize $$\sum_{j:s^*_j=1}s_j.$$ If you want to minimize the number of times $s^*$ is 0 but $s$ is 1 (false positive?), you can minimize $$\sum_{j:s^*_j=0} s_j.$$ To minimize the $\parallel s-s^*\parallel_1$, just minimize $$\sum_{j:s^*_j=0}s_j + \sum_{j:s^*_j=1}(1-s_j).$$

$\endgroup$
1
  • $\begingroup$ Would the final function to minimise not be $\sum_{j:s^*_j=0}s_j - \sum_{j:s^*_j=1}(s_j)$ instead since $\sum_{j:s^*_j=1}s_j$ is not bounded by 0 and 1 @prubin? Also, to clarify, is $s_i$ just whether row $s^{*}=0$ at row $i$? If so, then does it work to have your second and third constraints for ALL rows $i$ when there is no solution $s=s^{*}$ in my problem? $\endgroup$ Commented Aug 5, 2021 at 9:23

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .