Advanced Tupling

Question

I had asked this question before but I guess I did not make the question clear enough and I apologize for that. Here is the problem: Tuples gives me more data than I want. A very bright user gave me a solution to restrict the output using a mechanism of filtering results as they are selected by his algorithm. The problem was even though it was efficient, it could not render data very well even on a machine with a lot of memory. Memory was the bottleneck and nothing else. It also tuples from all the lists and puts them in any of the positions. So let me present the problem hopefully more clearly, with some actual data I am trying to analyze.

Suppose the following, I have the following lists:

a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 21, 22, 23, 25, 26, 28}

b = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 30, 31, 33, 37, 41}

c = {6, 10, 13, 14, 15, 16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 35, 37, 39}

d = {17, 19, 25, 30, 31, 33, 34, 35, 36, 38, 44}

e = {31, 41, 45, 47}

f = {23, 26, 31, 32, 33, 34, 35, 36, 38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 49, 50, 51, 52, 53}

What I am trying to achieve:

A tuple of length of 6 integer digits

The first digit can only be a tuple of list a, second digit can only be tupled from list b, and so on.

No number in any of the tupled outputs should be the same from any list, for example, {2,2,25,25,31,31} is not a desired combination and should be excluded.

Order does not matter in this problem and should be excluded, for example, {2, 1, 10, 9, 42, 41} would be considered the same as {1, 2, 9, 10, 41, 42} and would be uneeded.

Obviously we have a large set of possible combinations, and as previously mentioned, M! loves to use memory and is not at all processor dependent, so limiting the memory footprint would be a priority.

Notice that the lists are of different lengths.

For large numbers of combinations, M! has difficulty rendering such a potentially large dataset on even a highend consumer grade computer, it would probably be advantageous to write these out to a text file as they are generated so the data can be viewed with an external text editor like Ultraedit.

Hope this is a detailed enough explaination and someone can share their thoughts! Thanks in advance!

You might want to be more specific about what a "highend consumer grade computer" is, because it ought to be up to the task. E.g., Clear[unique]; SetAttributes[unique, Orderless]; DeleteDuplicates[ Select[(unique @@ # &) /@ Tuples[{a, b, c, d, e, f}], Length[#] == 6 &]] requires 3 GB RAM and 58 seconds on my two-year old low-end workstation. (It finds 5,266,239 tuples.) — whuber, Commented Mar 5, 2012 at 18:41
16GB RAM (Intel XMP 1800Mhz), Core i7 2.36Ghz, 24GB pagefile (unbounded), Hybrid SSD in RAID-0. Pretty beefy machine, no? Processor utilitzation never much goes over 25%, but memory would get gobbled with the last code I used when trying to render the output. It does well with memory during processing. But alas, it mixes the positions of each list as well, so it didn't quite work as I had wanted. — Sinistar, Commented Mar 5, 2012 at 18:49
Also, I am coming up with 3806711 combinations using this data, but with different code and it completed in a few hours, last set I threw at it was 15.2 million, and that took about two days (wasn't watching the whole time, ;) — Sinistar, Commented Mar 5, 2012 at 19:44
I'm afraid this code doesn't quite work as I was trying to get. See, none of the resulting tuples should have anything but 31, 41, 45, or 47 in the 5th position since the 5th list e should be the source for this position exclusive. Same for each of the other positions, they should only tuple with the source, a is the source list for the first output digits, b for the second digit, and c for the third, and so on. Any thoughts? This did however give me the same amount of tuples as you got and finished in under 10 minutes, but it's just intermingling the lists in the tupled output. — Sinistar, Commented Mar 5, 2012 at 19:54

whuber · Accepted Answer · 2012-03-06 01:01:58Z

Here's a moderately parallelizable solution. It constructs the tuples sequentially so that, at the end, you can be assured the first element is from the first list, the second from the second, etc.

newTuples[t_, x_] := Flatten[
    ParallelTable[Append[s, #] & /@ Complement[x, s], {s, t}], 1];
Timing[ts = Fold[newTuples, {{}}, {a, b, c, d, e, f}];]

Let's unwind this. Fold builds the output one step at a time. After processing a, the output is just a list of the elements of a, each as a singleton vector ("1-tuple"):

{{1}, {2}, {3}, {4}, {5}, <<...>> {26}, {28}}

To process b, newTuples will loop (via ParallelTable) over all the 1-tuples it has just created. For each such 1-tuple s, Complement obtains the elements of b that would not cause any duplication ("x" refers to b at this stage, and later to c, d, etc.). Each of these elements is systematically tacked on to the current 1-tuple (via Append) to create a set of unique new 2-tuples. For example, when working on the 1-tuple {2}, the first thing we do is remove 2 from b. Then each remaining element of b is tacked on to {2}, giving {{2,3}, {2,4}, <<...>>, {2,41}}.

Once ParallelTable has augmented each 1-tuple into a list of 2-tuples in this manner, Flatten restructures the table of lists of 2-tuples as a single list:

{{1,2},  {1,3},  {1,4}, <<...>>,  {1,41},
         {2,3},  {2,4}, <<...>>,  {2,41},
 {3,2},          {3,4}, <<...>>,  {3,41},
<<...>>
 {28,2}, {28,3}, {28,4}, <<...>>, {28,41}}

Fold then repeats this procedure, augmenting each 2-tuple with new elements of c, then augmenting the resulting 3-tuples with elements of d, etc, until it has processed all six lists.

When there is lots of duplication among lists, this sequential strategy will be more sparing of RAM (and CPU cycles) than a more direct method, because it avoids creating a large list from which the qualifying 6-tuples will be selected: at each stage, the list of intermediate tuples is as small as possible.

Timing was 37 seconds (3.33 GHz Xeon 3580) and RAM usage was approximately 1.5 GB. (It goes a little faster if you process the smaller lists first.) The output has 12,336,674 elements. Testing on short lists gave correct results. Try, for example,

Fold[newTuples, {{}}, {{0, 1}, {2, 3, 4}, {5}, {6, 7}, {6, 7}, {6, 7, 8}}]

It should produce 2*3*1*2*1*1 = 12 6-tuples and indeed it does.

Screenshot

BTW, before attempting such a procedure generally it's a good idea to estimate how large the output might be, as in

Times @@ (Length /@ {a, b, c, d, e, f})

The output, 15874232, is the number of tuples that would be generated with duplicates allowed. Much larger answers would indicate extreme caution is needed, lest Mathematica grab all your RAM in its effort to complete the calculation.

ParallelTableAppend shold be ParallelTable[Append, tried to edit but couldn't submit. I tried it and took on my machine 32 seconds (using AbsoluteTiming). Nice solution! — FJRA, Commented Mar 5, 2012 at 20:28
I see the sytax error pointed out above. I make that small change and it still crashes. I'll give the error in a moment. — Sinistar, Commented Mar 5, 2012 at 20:30
Hmm...it looks like a "[" was lost when I pasted the code into the reply: ParallelTableAppend was originally ParallelTable[Append. I have edited the reply and then pasted the reply code back into Mathematica as a check. When running it, make sure you have initialized a, b, ..., f. — whuber, Commented Mar 5, 2012 at 20:34
$RecursionLimit::reclim: Recursion depth of 256 exceeded. >> $RecursionLimit::reclim: Recursion depth of 256 exceeded. >> Throw::nocatch: Uncaught Throw[{SubKernels`Protected`kernelWrite,local[local],EvaluatePacket[Subscript[iid, 137][Hold[Table[(<<1>>&)/@Complement[<<2>>],{s,{<<1>>}}]]]]}] returned to top level. >> — Sinistar, Commented Mar 5, 2012 at 20:38
Nice! fast and elegant. @Sinistar, if you haven't done so already, running the code with a fresh kernel may help. — kglr, Commented Mar 5, 2012 at 20:59

rm -rf · Accepted Answer · 2012-03-07 20:05:36Z

Here is a very procedural approach. Could be shortened somewhat; not sure it would give a speed gain or otherwise be worth the trouble.

ll = {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17,
     18, 21, 22, 23, 25, 26, 28},
   {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
    20, 21, 22, 23, 25, 26, 27, 28, 30, 31, 33, 37, 41},
   {6, 10, 13, 14, 15, 16, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
     30, 32, 35, 37, 39},
   {17, 19, 25, 30, 31, 33, 34, 35, 36, 38, 44},
   {31, 41, 45, 47},
   {23, 26, 31, 32, 33, 34, 35, 36, 38, 39, 40, 41, 42, 43, 45, 46, 
    47, 48, 49, 50, 51, 52, 53}};

In[54]:= Clear[tups, tuples]

Create tuples of length 1 from the first sublist. For each i, Iterate over tuples of length i. Adjoin kth element of sublist i+1 to each (jth) such tuple, if it will provide a new valid tuple of length i+1. This is determined by the criteria specified in the query.

In[55]:= Timing[Map[(tups[#] = True) &, ll[[1]]];
 tuples[1] = Map[{#} &, ll[[1]]];
 Do[ituples = tuples[i];
  tuples[i + 1] = Reap[
     Do[If[MemberQ[ituples[[j]], ll[[i + 1, k]]], Continue[]];
      tup = Append[ituples[[j]], ll[[i + 1, k]]];
      stup = Sort[tup];
      If[! TrueQ[tups[stup]],
       tups[stup] = True;
       Sow[tup]],
      {j, 1, Length[tuples[i]]}, {k, 1, Length[ll[[i + 1]]]}]][[2, 1]],
  {i, 1, Length[ll] - 1}];]

Out[55]= {120.776, Null}

In[56]:= tuples[6] // Length

Out[56]= 3806711

This differs from an earlier response in that it does not allow more than one tuple having the same elements. That earlier one might be made to give the same results as this, and perhaps faster, using a post processing step to remove tuples that are duplicates according to the given criteria.

The size of the result can be made considerably shorter using packed arrays.

Timing[Map[(tups[#] = True) &, ll[[1]]];
 tuples[1] = Map[{#} &, ll[[1]]];
 Do[ituples = tuples[i];
  tuples[i + 1] = 
   Developer`ToPackedArray[
    Reap[Do[If[MemberQ[ituples[[j]], ll[[i + 1, k]]], Continue[]];
       tup = Append[ituples[[j]], ll[[i + 1, k]]];
       stup = Sort[tup];
       If[! TrueQ[tups[stup]], tups[stup] = True;
        Sow[tup]], {j, 1, Length[tuples[i]]}, {k, 1, 
        Length[ll[[i + 1]]]}]][[2, 1]]], {i, 1, Length[ll] - 1}];]

Out[124]= {68.34, Null}

In[127]:= Length[tuples[6]]
          ByteCount[tuples[6]]

Out[127]= 3806711   
Out[128]= 91361232

This is a fraction of the size of the non-packed result. That said, I do not know if it will behave better in terms of rendering in the front end of Mathematica. And I'm not intending to find out.

M! seems to have this severe problem rendering data. The original answer will finish processing in 38 secs, but thats with no output. If I try to otherwise print, or render all the data or when not suppressing the output clicking 'Show All' the machine chokes on it. Even in 16 bit unicode, with 15 million sets of lists of 6 integers and 5 commas, it should weigh in around 300MB. Why does M! have to use all the memory, all the swap file and still take literally days to process this? I cannot examine the data for any accuracy this way. This really needs fixed. Any M! developers listening? — Sinistar, Commented Mar 6, 2012 at 14:49
Sinistar, how do you propose to "examine" 90 million integers? You can inspect any part of the output by indexing into it; there are zillions of ways to graph it; you can efficiently export it into many different file formats. Rendering data as decimal text strings within a notebook is not what you need! — whuber, Commented Mar 6, 2012 at 15:42
Very painstakingly. Ever see The Matrix? Anyway, the, when you scroll, the lines begin to animate, and I can glimpse an errata in the data when the data is sorted. — Sinistar, Commented Mar 6, 2012 at 19:01
You can let Mathematica check every tuple automatically and give a list of those which are in error (which should be empty if the code works correctly). The check on correctness of each tuple would be something like Length[#]==6 && #==DeleteDuplicates[#] && Count[a,#[[1]]]>0 && ... && Count[f,#[[6]]]>0 &, to find the non-matching you'd reverse the condition and use Select. That's more reliable than glimpsing. — celtschk, Commented Mar 9, 2012 at 9:52

Stack Exchange Network

Advanced Tupling

2 Answers 2

Not the answer you're looking for? Browse other questions tagged
combinatorics
or ask your own question.

Linked

Hot Network Questions

Advanced Tupling

2 Answers 2

Not the answer you're looking for? Browse other questions tagged combinatorics or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
combinatorics
or ask your own question.