2
$\begingroup$

Given a matrix $A$ I need to select a "representative" subset of its columns so that the each non-selected column is as close as possible to a selected one.

Formally:

Given $A \in \mathbb{R}^{d \times n}$ and $0<k<n$, find a subset of column indices $S=\{i_1,i_2,\ldots,i_k\}$, so that $$ \sum_{j \notin S}{\min_{r=1,\ldots,k}\angle(A_j, A_{i_r})} $$ is minimized.

If instead of taking the minimum angle between $A_{i_r}$ and a single column in $S$ we were looking at minimizing the projection of $A_{i_r}$ on $Span(S)$, it would have been a straightforward CSSP (Column Subset Selection Problem). As it is, it seems more of a coreset-like problem but I am failing to find a ready-made solution.

Even a decent approximation would be great.

$\endgroup$
3
  • $\begingroup$ Fixing a mistake in the formula. $\endgroup$ Commented Mar 16, 2021 at 11:05
  • $\begingroup$ Have you looked at CUR matrix decompositions or even information theoretic forward selection algorithms? $\endgroup$ Commented Mar 19, 2021 at 18:25
  • $\begingroup$ @Jean-LucBouchot Yes, CUR (or, rather, CX) was my first go-to but it does not seem to do the trick because, roughly speaking, it approximates the original matrix holistically by a submatrix, while I need to approximate each column separately. I am somewhat familiar with feature selections algs but usually they are heuristic; I was looking for an optimal solution if possible. Thanks anyway, these are certainly good suggestions. $\endgroup$ Commented Mar 19, 2021 at 18:31

1 Answer 1

3
$\begingroup$

You can solve this as a $k$-median (also known as $p$-median) problem. Each column is both a customer and a candidate facility location. Let $d_{c,f}$ denote the cost of assigning customer $c$ to facility $f$. Let binary decision variable $x_{c,f}$ indicate whether customer $c$ is assigned to facility $f$, and let binary decision variable $y_f$ indicate whether a facility is built at location $f$. The problem is to minimize $\sum_{c,f} d_{c,f} x_{c,f}$ subject to linear constraints \begin{align} \sum_f x_{c,f} &= 1 &&\text{for all $c$} \tag1 \\ x_{c,f} &\le y_f &&\text{for all $c$ and $f$} \tag2 \\ \sum_f y_f &= k \tag3 \\ \end{align} Constraint $(1)$ assigns each customer to exactly one facility. Constraint $(2)$ forces facility $f$ to be built if any customer is assigned to it. Constraint $(3)$ selects exactly $k$ facilities to be built.

$\endgroup$
1
  • $\begingroup$ Thanks, Rob! This is just what I needed and it works like a charm. $\endgroup$ Commented Mar 19, 2021 at 18:29

Not the answer you're looking for? Browse other questions tagged or ask your own question.