Sometimes I want to do an exact test by examining all possible combinations of the data to build an empirical distribution against which I can test my observed differences between means. To find the possible combinations I'd typically use the combn function. The choose function can show me how many possible combinations there are. It is very easy for the number of combinations to get so large that the result of the combn function cannot be stored, e.g. combn(28, 14) requires a 2.1 GB vector. So I tried writing an object that stepped through the same logic as the combn function in order to provide the values off an imaginary "stack" one at a time. However, this method (as I implemented it) is easily 50 times slower than combn at reasonable combination sizes, which leads me to think it will also be painfully slow for larger combination sizes.
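(As a rough check of that figure: combn(28, 14) returns an integer matrix with 14 rows and choose(28, 14) columns, so, assuming 4-byte integers, the size works out to roughly)

choose(28, 14) * 14 * 4 / 2^30   # 40,116,600 columns x 14 rows x 4 bytes: ~2.1 GiB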

Is there a better algorithm for doing this sort of thing than the one used in combn? Specifically, is there a way to generate and pull the Nth possible combination without calculating through all previous combinations?

  • Has anyone noticed that the number of questions that should be in StackOverflow R rocketed up here recently? – John, Aug 5, 2010 at 5:32
  • Why not do random sampling? – user88, Aug 5, 2010 at 7:21
  • @John: If you feel that way, discuss the issue at meta.stats.stackexchange.com/questions/248/…; no need to be snarky. – Aug 5, 2010 at 7:59
  • @mbq: Random sampling will quickly provide a reasonable approximation, especially with well-behaved data. However, I did specify that my goal was an exact test. – Aug 5, 2010 at 8:04
  • There are ways to mitigate this issue somewhat (each of which will solve the problem for one case or another), but if you then choose larger cases, the combinatorial explosion quickly catches up with you again. By contrast, sampling with replacement usually avoids the problems associated with the explosive growth in the number of combinations. – Glen_b, Aug 27, 2016 at 0:16

3 Answers


It sounds like the OP is describing an iterator. There is a package called iterators that seems promising; however, a major drawback is that one needs to generate the entire object up front in order to iterate over it. That does not help in the OP's case, where the whole point is to avoid generating the entire object up front.
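For example (a small sketch to illustrate the drawback; it assumes the iterators package is installed, and the object names are mine), the full matrix still has to exist before iter can walk over its columns:

library(iterators)

## the entire 5 x choose(25, 5) matrix must already be in memory
all_combos <- combn(25, 5)
it0 <- iter(all_combos, by = "column")

nextElem(it0)   # first combination:  1 2 3 4 5
nextElem(it0)   # second combination: 1 2 3 4 6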

There is a package RcppAlgos (I am the author) that provides flexible combinatorial iterators that allow one to traverse forwards and backwards, and even allow random access via [[. The underlying algorithms are written in C++ for maximum efficiency. Results are produced on the fly, which keeps memory usage in check, all while preserving the iterator's state.

library(RcppAlgos)
it <- comboIter(25, 5)

## Get the first iteration
it@nextIter()
#> [1] 1 2 3 4 5

## See the current state
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 1
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 53129
## Get the next 7 iterations
it@nextNIter(n = 7)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    4    6
#> [2,]    1    2    3    4    7
#> [3,]    1    2    3    4    8
#> [4,]    1    2    3    4    9
#> [5,]    1    2    3    4   10
#> [6,]    1    2    3    4   11
#> [7,]    1    2    3    4   12

## See the current state
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 8
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 53122
## Go back to the previous iteration
it@prevIter()
#> [1]  1  2  3  4 11

## Skip ahead to the 20000th iteration instantly
it[[20000]]
#> [1]  3  4  8  9 14

## See the current state. Notice we are at 20000
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 20000
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 33130
## Start iterating again
it@nextIter()
#> [1]  3  4  8  9 15

## Reset the iterator
it@startOver()

## Results are identical
identical(
    replicate(choose(25, 5), it@nextIter(), simplify = "array"),
    combn(25, 5)
)
#> [1] TRUE

These iterators are very efficient. Even with all of the communication back and forth between R and C++, iterating "one at a time" is still within roughly a factor of two of combn generating everything upfront.

library(microbenchmark)
options(digits = 4)
options(width = 90)

## helper function that resets the iterator and draws one result at a time
one_at_a_time <- function() {
    it@startOver()
    replicate(choose(25, 5), it@nextIter())
}

microbenchmark(
    RcppAlgos_single = one_at_a_time(),
    combn_all = combn(25, 5),
    unit = "relative"
)
#> Unit: relative
#>              expr   min    lq  mean median   uq   max neval
#>  RcppAlgos_single 2.138 2.106 2.094  2.126 2.09 1.658   100
#>         combn_all 1.000 1.000 1.000  1.000 1.00 1.000   100

Even better, when we shift from “one at a time” to just a few at a time via nextNIter, the speed up is substantial. Observe:

multiple_at_a_time <- function(n) {
    it@startOver()
    replicate(choose(25, 5) / n, it@nextNIter(n = n))
}

microbenchmark(
    RcppAlgos_chunks = multiple_at_a_time(30),
    combn_all = combn(25, 5),
    unit = "relative"
)
#> Unit: relative
#>              expr   min    lq  mean median    uq    max neval
#>  RcppAlgos_chunks 1.000 1.000 1.000  1.000 1.000 1.0000   100
#>         combn_all 8.057 7.981 7.017  7.977 7.923 0.9138   100

We can even evaluate each combination on the fly using the FUN parameter:

## the FUN.VALUE parameter is optional. If NULL (the default),
## when multiple results are requested via nextNIter, prevNIter,
## nextRemaining, and prevRemaining, a list will be returned.
## This parameter is modeled after the usage in vapply
it_fun <- comboIter(25, 5, FUN = cumprod, FUN.VALUE = cumprod(1:5))

it_fun@nextIter()
#> [1]   1   2   6  24 120

it_fun@nextNIter(n = 2)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    6   24  144
#> [2,]    1    2    6   24  168

## See the previous results in reverse order
it_fun@prevRemaining()
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    6   24  144
#> [2,]    1    2    6   24  120

Finally, the [[ method allows for random access of a single result or multiple results:

## As seen above
it[[20000]]
#> [1]  3  4  8  9 14

## Pass a random sample. N.B. In this case the state is unaffected.
## That is, it will remain whatever it was prior to passing the
## vector. In our case, it will still be on the 20000 index.
set.seed(42)
it[[sample(choose(25, 5), 10)]]
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    5    7   10   11   25
#>  [2,]    1   10   11   13   20
#>  [3,]    8   12   20   21   22
#>  [4,]    9   10   13   15   21
#>  [5,]    2    8   12   13   16
#>  [6,]    1    8   10   11   25
#>  [7,]    5   12   14   18   22
#>  [8,]    1   10   14   22   23
#>  [9,]    5    6   18   19   25
#> [10,]    8    9   14   15   17
## State unaffected
it@summary()
#> $description
#> [1] "Combinations of 25 choose 5"
#> 
#> $currentIndex
#> [1] 20000
#> 
#> $totalResults
#> [1] 53130
#> 
#> $totalRemaining
#> [1] 33130

## Sample with replacement
it[[sample(choose(25, 5), 5, replace = TRUE)]]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    2    5    6   14   23
#> [2,]    6   10   18   21   23
#> [3,]    4    6   12   14   20
#> [4,]    7    8   13   19   20
#> [5,]    6   12   13   20   25

There are also combinatorial sampling functions in RcppAlgos. For our current case, we would call upon comboSample.

## Same as above:
##     set.seed(42)
##     it[[sample(choose(25, 5), 10)]]
comboSample(25, 5, seed = 42, n = 10)
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    5    7   10   11   25
#>  [2,]    1   10   11   13   20
#>  [3,]    8   12   20   21   22
#>  [4,]    9   10   13   15   21
#>  [5,]    2    8   12   13   16
#>  [6,]    1    8   10   11   25
#>  [7,]    5   12   14   18   22
#>  [8,]    1   10   14   22   23
#>  [9,]    5    6   18   19   25
#> [10,]    8    9   14   15   17

See my answer to "How to resample in R without repeating permutations?" for more info.


If you wish to trade processing speed for memory (which I think you do), I would suggest the following algorithm:

  • Set up a loop from 1 to N choose K, indexed by i
  • Treat each i as the index of a combinadic and decode it as such
  • Use that combination to compute your test statistic, store the result, and discard the combination
  • Repeat

This will give you all N Choose K possible combinations without having to create them explicitly. I have code to do this in R if you'd like it (you can email me at mark dot m period fredrickson at-symbol gmail dot com).
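The code itself is offered by email above; purely as an illustration, here is a minimal sketch of the decoding step (the function name and details are illustrative, not the author's code): unrank a 1-based index i directly into the i-th combination in lexicographic order.

## Illustrative sketch only: decode a 1-based index i into the i-th
## k-combination of 1:n in lexicographic order, without generating
## any of the earlier combinations.
nth_combination <- function(i, n, k) {
  stopifnot(i >= 1, i <= choose(n, k))
  combo <- integer(k)
  r <- i - 1                 # 0-based rank still to be accounted for
  x <- 1L                    # smallest candidate for the current position
  for (j in seq_len(k)) {
    repeat {
      ## number of combinations whose j-th element equals the current x
      block <- choose(n - x, k - j)
      if (r < block) break
      r <- r - block
      x <- x + 1L
    }
    combo[j] <- x
    x <- x + 1L
  }
  combo
}

nth_combination(20000, 25, 5)
## [1]  3  4  8  9 14
## sanity check: sapply(1:choose(6, 3), nth_combination, n = 6, k = 3)
## should reproduce combn(6, 3) column for column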

  • I'm accepting this answer because it solves what I think is the harder of the problems I was looking for a solution to: picking a particular combination out without calculating the preceding values. Unfortunately, it is still very slow. Perhaps, as mentioned here and elsewhere, a binary search would help speed things up. Perhaps the best approach is to have one thread generating the combinations stepwise, as in mbq's answer, and another thread reading them off and testing them. – Aug 12, 2010 at 7:15

Generating combinations is pretty easy; see, for instance, this. Write that code in R and then process each combination as it appears.
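The linked code is not reproduced here, but the idea runs roughly as follows (a sketch in R with illustrative names): keep only the current combination in memory and repeatedly advance it to its lexicographic successor.

## Sketch of a "next combination" step: advance a k-combination of 1:n
## to its lexicographic successor, or return NULL once the last
## combination has been reached.
next_combination <- function(combo, n) {
  k <- length(combo)
  ## find the rightmost position that can still be incremented
  j <- k
  while (j >= 1 && combo[j] == n - k + j) j <- j - 1
  if (j < 1) return(NULL)                  # that was the last combination
  combo[j] <- combo[j] + 1L
  if (j < k) combo[(j + 1):k] <- combo[j] + seq_len(k - j)
  combo
}

## process one combination at a time without ever storing them all
combo <- 1:3                               # first combination of 5 choose 3
while (!is.null(combo)) {
  print(combo)                             # compute the test statistic here
  combo <- next_combination(combo, 5)
}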

  • But will this cope with very large combinations? – Aug 5, 2010 at 10:40
  • @csgillespie Well, I believe so -- it works in situ, so only one combination is stored in memory at a time, and the results of the simulation can also be aggregated to eliminate the need to store them. It will of course take a terribly long time, but exhaustive searches usually do. For speed it could be written in C, but then so would the simulation part, which is probably far slower than the generator step anyway. – user88, Aug 5, 2010 at 11:11
  • That looks almost identical to how R's combn function is already doing things. I wrote a version of combn that takes combinations off the stack one at a time, and, as mbq says, because it only stores one combination in memory at a time it can handle very large combinations. The problem with doing it in R is that a step-by-step approach inside a function typically involves reading the state variables into the function, manipulating them, and then storing them back out to the global environment, which seems to slow everything way down. – Aug 5, 2010 at 15:34
