Sometimes I want to do an exact test by examining all possible combinations of the data to build an empirical distribution against which I can test my observed differences between means. To find the possible combinations I'd typically use the combn function. The choose function can show me how many possible combinations there are. It is very easy for the number of combinations to get so large that the result of the combn function cannot be stored, e.g. combn(28, 14) requires a 2.1 GB vector. So I tried writing an object that stepped through the same logic as the combn function in order to provide the values off an imaginary "stack" one at a time. However, this method (as I implemented it) is easily 50 times slower than combn at reasonable combination sizes, which leads me to think it will also be painfully slow for larger combination sizes.
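(As a rough check of that figure: combn(28, 14) returns an integer matrix with 14 rows and choose(28, 14) columns, so, assuming 4-byte integers, the size works out to roughly)

choose(28, 14) * 14 * 4 / 2^30   # 40,116,600 columns x 14 rows x 4 bytes: ~2.1 GiB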

Is there a better algorithm for doing this sort of thing than the one used in combn? Specifically, is there a way to generate and pull the Nth possible combination without calculating through all previous combinations?

  • Has anyone noticed that the number of questions that should be in StackOverflow R rocketed up here recently? – John, Aug 5, 2010 at 5:32
  • Why not do random sampling? – user88, Aug 5, 2010 at 7:21
  • @John: If you feel that way, discuss the issue at meta.stats.stackexchange.com/questions/248/…; no need to be snarky. – Aug 5, 2010 at 7:59
  • @mbq: Random sampling will quickly provide a reasonable approximation, especially with well-behaved data. However, I did specify that my goal was an exact test. – Aug 5, 2010 at 8:04
  • There are ways to mitigate this issue somewhat (each of which will solve the problem for one case or another), but if you then choose larger cases, the combinatorial explosion quickly catches up with you again. By contrast, sampling with replacement usually avoids the problems associated with the explosive growth in the number of combinations. – Glen_b, Aug 27, 2016 at 0:16

3 Answers


It sounds like the OP is describing an iterator. There is a package called iterators that seems promising; however, a major drawback is that one needs to generate the entire object up front in order to iterate over it. That does not help in the OP's case, where the whole point is to avoid generating the entire object up front.
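For example (a small sketch to illustrate the drawback; it assumes the iterators package is installed, and the object names are mine), the full matrix still has to exist before iter can walk over its columns:

library(iterators)

## the entire 5 x choose(25, 5) matrix must already be in memory
all_combos <- combn(25, 5)
it0 <- iter(all_combos, by = "column")

nextElem(it0)   # first combination:  1 2 3 4 5
nextElem(it0)   # second combination: 1 2 3 4 6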

There is a package RcppAlgos (I am the author) that provides flexible combinatorial iterators that allow one to traverse forwards and backwards, and even allow random access via [[. The underlying algorithms are written in C++ for maximum efficiency. Results are produced on the fly, which keeps memory usage in check, all while preserving the iterator's state.

library(RcppAlgos)
it <- comboIter(25, 5)

## Get the first iteration
it@nextIter()
#> [1] 1 2 3 4 5

## See the current state
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 1
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 53129
## Get the next 7 iterations
it@nextNIter(n = 7)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    6
#> [2,]    1    2    3    4    7
#> [3,]    1    2    3    4    8
#> [4,]    1    2    3    4    9
#> [5,]    1    2    3    4   10
#> [6,]    1    2    3    4   11
#> [7,]    1    2    3    4   12

## See the current state
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 8
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 53122
## Go back to the previous iteration
it@prevIter()
#> [1]  1  2  3  4 11

## Skip ahead to the 20000th iteration instantly
it[[20000]]
#> [1]  3  4  8  9 14

## See the current state. Notice we are at 20000
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 20000
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 33130
## Start iterating again
it@nextIter()
#> [1]  3  4  8  9 15

## Reset the iterator
it@startOver()

## Results are identical
identical(
    replicate(choose(25, 5), it@nextIter(), simplify = "array"),
    combn(25, 5)
)
#> [1] TRUE

These iterators are very efficient. Even with all of the communication back and forth between R and C++, iterating "one at a time" is still within roughly a factor of two of combn generating everything upfront.

library(microbenchmark)
options(digits = 4)
options(width = 90)

## helper function that resets the iterator and draws one result at a time
one_at_a_time <- function() {
    it@startOver()
    replicate(choose(25, 5), it@nextIter())
}

microbenchmark(
    RcppAlgos_single = one_at_a_time(),
    combn_all = combn(25, 5),
    unit = "relative"
)
#> Unit: relative
#>              expr   min    lq  mean median   uq   max neval
#>  RcppAlgos_single 2.138 2.106 2.094  2.126 2.09 1.658   100
#>         combn_all 1.000 1.000 1.000  1.000 1.00 1.000   100

Even better, when we shift from “one at a time” to just a few at a time via nextNIter, the speed up is substantial. Observe:

multiple_at_a_time <- function(n) {
    it@startOver()
    replicate(choose(25, 5) / n, it@nextNIter(n = n))
}

microbenchmark(
    RcppAlgos_chunks = multiple_at_a_time(30),
    combn_all = combn(25, 5),
    unit = "relative"
)
#> Unit: relative
#>              expr   min    lq  mean median    uq    max neval
#>  RcppAlgos_chunks 1.000 1.000 1.000  1.000 1.000 1.0000   100
#>         combn_all 8.057 7.981 7.017  7.977 7.923 0.9138   100

We can even evaluate each combination on the fly using the FUN parameter:

## the FUN.VALUE parameter is optional. If NULL (the default),
## when multiple results are requested via nextNIter, prevNIter,
## nextRemaining, and prevRemaining, a list will be returned.
## This parameter is modeled after the usage in vapply
it_fun <- comboIter(25, 5, FUN = cumprod, FUN.VALUE = cumprod(1:5))

it_fun@nextIter()
#> [1]   1   2   6  24 120

it_fun@nextNIter(n = 2)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    6   24  144
#> [2,]    1    2    6   24  168

## See the previous results in reverse order
it_fun@prevRemaining()
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    6   24  144
#> [2,]    1    2    6   24  120

Finally, the [[ method allows for random access of a single result or multiple results:

## As seen above
it[[20000]]
#> [1]  3  4  8  9 14

## Pass a random sample. N.B. In this case the state is unaffected.
## That is, it will remain whatever it was prior to passing the
## vector. In our case, it will still be on the 20000 index.
set.seed(42)
it[[sample(choose(25, 5), 10)]]
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    5    7   10   11   25
#>  [2,]    1   10   11   13   20
#>  [3,]    8   12   20   21   22
#>  [4,]    9   10   13   15   21
#>  [5,]    2    8   12   13   16
#>  [6,]    1    8   10   11   25
#>  [7,]    5   12   14   18   22
#>  [8,]    1   10   14   22   23
#>  [9,]    5    6   18   19   25
#> [10,]    8    9   14   15   17
## State unaffected
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 20000
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 33130

## Sample with replacement
it[[sample(choose(25, 5), 5, replace = TRUE)]]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    2    5    6   14   23
#> [2,]    6   10   18   21   23
#> [3,]    4    6   12   14   20
#> [4,]    7    8   13   19   20
#> [5,]    6   12   13   20   25

There are also combinatorial sampling functions in RcppAlgos. For our current case, we would call upon comboSample.

## Same as above:
##     set.seed(42)
##     it[[sample(choose(25, 5), 10)]]
comboSample(25, 5, seed = 42, n = 10)
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    5    7   10   11   25
#>  [2,]    1   10   11   13   20
#>  [3,]    8   12   20   21   22
#>  [4,]    9   10   13   15   21
#>  [5,]    2    8   12   13   16
#>  [6,]    1    8   10   11   25
#>  [7,]    5   12   14   18   22
#>  [8,]    1   10   14   22   23
#>  [9,]    5    6   18   19   25
#> [10,]    8    9   14   15   17

See my answer to "How to resample in R without repeating permutations?" for more info.


If you wish to trade processing speed for memory (which I think you do), I would suggest the following algorithm:

  • Set up a loop from 1 to N choose K, indexed by i
  • Treat each i as the index of a combinadic and decode it as such
  • Use that combination to compute your test statistic, store the result, and discard the combination
  • Repeat

This will give you all N Choose K possible combinations without having to create them explicitly. I have code to do this in R if you'd like it (you can email me at mark dot m period fredrickson at-symbol gmail dot com).
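The code itself is offered by email above; purely as an illustration, here is a minimal sketch of the decoding step (the function name and details are illustrative, not the author's code): unrank a 1-based index i directly into the i-th combination in lexicographic order.

## Illustrative sketch only: decode a 1-based index i into the i-th
## k-combination of 1:n in lexicographic order, without generating
## any of the earlier combinations.
nth_combination <- function(i, n, k) {
  stopifnot(i >= 1, i <= choose(n, k))
  combo <- integer(k)
  r <- i - 1                 # 0-based rank still to be accounted for
  x <- 1L                    # smallest candidate for the current position
  for (j in seq_len(k)) {
    repeat {
      ## number of combinations whose j-th element equals the current x
      block <- choose(n - x, k - j)
      if (r < block) break
      r <- r - block
      x <- x + 1L
    }
    combo[j] <- x
    x <- x + 1L
  }
  combo
}

nth_combination(20000, 25, 5)
## [1]  3  4  8  9 14
## sanity check: sapply(1:choose(6, 3), nth_combination, n = 6, k = 3)
## should reproduce combn(6, 3) column for column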

  • I'm accepting this answer because it solves what I think is the harder of the problems I was looking for a solution to: picking a particular combination out without calculating the preceding values. Unfortunately, it is still very slow. Perhaps, as mentioned here and elsewhere, a binary search would help speed things up. Perhaps the best approach is to have one thread generating the combinations stepwise, as in mbq's answer, and another thread reading them off and testing them. – Aug 12, 2010 at 7:15

Generating combinations is pretty easy; see, for instance, this. Write that code in R and then process each combination as it appears.
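The linked code is not reproduced here, but the idea runs roughly as follows (a sketch in R with illustrative names): keep only the current combination in memory and repeatedly advance it to its lexicographic successor.

## Sketch of a "next combination" step: advance a k-combination of 1:n
## to its lexicographic successor, or return NULL once the last
## combination has been reached.
next_combination <- function(combo, n) {
  k <- length(combo)
  ## find the rightmost position that can still be incremented
  j <- k
  while (j >= 1 && combo[j] == n - k + j) j <- j - 1
  if (j < 1) return(NULL)                  # that was the last combination
  combo[j] <- combo[j] + 1L
  if (j < k) combo[(j + 1):k] <- combo[j] + seq_len(k - j)
  combo
}

## process one combination at a time without ever storing them all
combo <- 1:3                               # first combination of 5 choose 3
while (!is.null(combo)) {
  print(combo)                             # compute the test statistic here
  combo <- next_combination(combo, 5)
}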

  • But will this cope with very large combinations? – Aug 5, 2010 at 10:40
  • @csgillespie Well, I believe so -- it works in situ, so only one combination is stored in memory at a time, and the results of the simulation can also be aggregated to eliminate the need to store them. It will of course take a terribly long time, but exhaustive searches usually do. For speed it could be written in C, but then so would the simulation part, which is probably far slower than the generator step anyway. – user88, Aug 5, 2010 at 11:11
  • That looks almost identical to how R's combn function is already doing things. I wrote a version of combn that takes combinations off the stack one at a time, and, as mbq says, because it only stores one combination in memory at a time it can handle very large combinations. The problem with doing it in R is that a step-by-step approach inside a function typically involves reading the state variables into the function, manipulating them, and then storing them back out to the global environment, which seems to slow everything way down. – Aug 5, 2010 at 15:34
