Python group by

Question

Assume that I have a set of data pairs where index 0 is the value and index 1 is the type:

input = [
          ('11013331', 'KAT'), 
          ('9085267',  'NOT'), 
          ('5238761',  'ETH'), 
          ('5349618',  'ETH'), 
          ('11788544', 'NOT'), 
          ('962142',   'ETH'), 
          ('7795297',  'ETH'), 
          ('7341464',  'ETH'), 
          ('9843236',  'KAT'), 
          ('5594916',  'ETH'), 
          ('1550003',  'ETH')
        ]

I want to group them by their type (the string at index 1) as such:

result = [ 
           { 
             'type': 'KAT', 
             'items': ['11013331', '9843236'] 
           },
           {
             'type': 'NOT', 
             'items': ['9085267', '11788544'] 
           },
           {
             'type': 'ETH', 
             'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'] 
           }
         ]

How can I achieve this in an efficient way?

dfrankow · Accepted Answer · 2023-06-22 10:24:29Z

Do it in 2 steps. First, create a dictionary.

>>> input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')]
>>> from collections import defaultdict
>>> res = defaultdict(list)
>>> for v, k in input: res[k].append(v)
...

Then, convert that dictionary into the expected format.

>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}]
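
The two steps above can also be wrapped into one reusable function. This is only a minimal sketch: the names `group_pairs` and `pairs` are illustrative and not part of the original answer. On Python 3.7+ the groups come out in order of first appearance.

```python
from collections import defaultdict

def group_pairs(pairs):
    """Group (value, type) pairs into a list of {'type', 'items'} dicts."""
    grouped = defaultdict(list)
    # Step 1: collect the values for each type in a single pass.
    for value, type_ in pairs:
        grouped[type_].append(value)
    # Step 2: reshape the dictionary into the requested list of dicts.
    return [{'type': t, 'items': items} for t, items in grouped.items()]

pairs = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('9843236', 'KAT')]
print(group_pairs(pairs))
```

Each input pair is touched once, so the cost grows linearly with the size of the input.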

It is also possible with itertools.groupby, but it requires the input to be sorted by the grouping key first.

>>> from itertools import groupby
>>> from operator import itemgetter
>>> sorted_input = sorted(input, key=itemgetter(1))
>>> groups = groupby(sorted_input, key=itemgetter(1))
>>> [{'type':k, 'items':[x[0] for x in v]} for k, v in groups]
[{'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}]
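
To see why the sort matters, here is a small sketch (with made-up sample data in `pairs`) of what groupby does when items of the same type are not adjacent: it starts a new group every time the key changes.

```python
from itertools import groupby
from operator import itemgetter

pairs = [('1', 'KAT'), ('2', 'NOT'), ('3', 'KAT')]

# Without sorting, 'KAT' shows up twice because its items are not adjacent.
unsorted_keys = [key for key, _ in groupby(pairs, key=itemgetter(1))]

# After sorting by the same key, each type appears exactly once.
sorted_pairs = sorted(pairs, key=itemgetter(1))
sorted_keys = [key for key, _ in groupby(sorted_pairs, key=itemgetter(1))]

print(unsorted_keys)  # ['KAT', 'NOT', 'KAT']
print(sorted_keys)    # ['KAT', 'NOT']
```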

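Note: before Python 3.7, a dict (and therefore the defaultdict above) is not guaranteed to keep keys in the order they were first inserted, and the groupby approach always returns keys in sorted order. Use an OrderedDict if you need to keep the original order.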

>>> from collections import OrderedDict
>>> res = OrderedDict()
>>> for v, k in input:
...   if k in res: res[k].append(v)
...   else: res[k] = [v]
... 
>>> [{'type':k, 'items':v} for k,v in res.items()]
[{'items': ['11013331', '9843236'], 'type': 'KAT'}, {'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}]

In Python 3.7 and later, a regular dict keeps insertion order.
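
As a rough sketch of what that means in practice, the OrderedDict version can be written with a plain dict and no imports, assuming Python 3.7 or later. The function name `group_in_first_seen_order` and the sample `pairs` are illustrative only.

```python
def group_in_first_seen_order(pairs):
    grouped = {}
    for value, type_ in pairs:
        # setdefault creates the empty list the first time a type is seen.
        grouped.setdefault(type_, []).append(value)
    # A plain dict iterates in insertion order on Python 3.7+.
    return [{'type': t, 'items': items} for t, items in grouped.items()]

pairs = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('11788544', 'NOT')]
result = group_in_first_seen_order(pairs)
print([entry['type'] for entry in result])  # ['KAT', 'NOT', 'ETH']
```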

How can this be done if the input tuple has one key and two or more values, like this: [('11013331', 'red', 'KAT'), ('9085267', 'blue', 'KAT')], where the last element of the tuple is the key and the first two are the values? The result should look like this: result = [{'type': 'KAT', 'items': [('11013331', 'red'), ('9085267', 'blue')]}] — user1144616, Commented Mar 6, 2012 at 18:52
Step 1 can be done without the import: d = {}; for v, k in input: d.setdefault(k, []).append(v) — ecoe, Commented Oct 19, 2018 at 21:10
I'm working on a MapReduce program in python, just wondering is there any way to group by values in a list without dealing with dictionaries or external library such as pandas? If not, then how can I get rid of items and type in my result? — Agent 0, Commented Nov 26, 2018 at 5:13

PaulMcG · Accepted Answer · 2018-03-19 16:08:18Z

Python's standard-library itertools module has a groupby function, but the elements must first be sorted so that those belonging to the same group are contiguous in the list:

from operator import itemgetter
sortkeyfn = itemgetter(1)
input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), 
 ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), 
 ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH')] 
input.sort(key=sortkeyfn)

Now input looks like:

[('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'),
 ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH'), ('11013331', 'KAT'),
 ('9843236', 'KAT'), ('9085267', 'NOT'), ('11788544', 'NOT')]

groupby returns a sequence of 2-tuples, of the form (key, values_iterator). What we want is to turn this into a list of dicts where the 'type' is the key, and 'items' is a list of the 0'th elements of the tuples returned by the values_iterator. Like this:

from itertools import groupby
result = []
for key,valuesiter in groupby(input, key=sortkeyfn):
    result.append(dict(type=key, items=list(v[0] for v in valuesiter)))

Now result contains your desired list of dicts, as stated in your question.

You might consider, though, just making a single dict out of this, keyed by type, and each value containing the list of values. In your current form, to find the values for a particular type, you'll have to iterate over the list to find the dict containing the matching 'type' key, and then get the 'items' element from it. If you use a single dict instead of a list of 1-item dicts, you can find the items for a particular type with a single keyed lookup into the master dict. Using groupby, this would look like:

result = {}
for key,valuesiter in groupby(input, key=sortkeyfn):
    result[key] = list(v[0] for v in valuesiter)

result now contains this dict (this is similar to the intermediate res defaultdict in @KennyTM's answer):

{'NOT': ['9085267', '11788544'], 
 'ETH': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 
 'KAT': ['11013331', '9843236']}
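
To make the lookup difference concrete, here is a small sketch comparing the two shapes. The variable names and the shortened sample data are illustrative.

```python
list_form = [
    {'type': 'KAT', 'items': ['11013331', '9843236']},
    {'type': 'NOT', 'items': ['9085267', '11788544']},
]
dict_form = {'KAT': ['11013331', '9843236'], 'NOT': ['9085267', '11788544']}

# List of dicts: scan until the entry with the matching 'type' is found.
not_items_from_list = next(d['items'] for d in list_form if d['type'] == 'NOT')

# Single dict: one keyed lookup.
not_items_from_dict = dict_form['NOT']

# Converting between the two shapes is a one-liner each way.
as_dict = {d['type']: d['items'] for d in list_form}
as_list = [{'type': k, 'items': v} for k, v in dict_form.items()]
```

If the list-of-dicts shape is needed only at the end (for example, for serialization), it can be cheaper to work with the single dict and convert once.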

If you want to reduce this to a one-liner, you can:

result = dict((key, list(v[0] for v in valuesiter))
              for key, valuesiter in groupby(input, key=sortkeyfn))

or using the newfangled dict-comprehension form:

result = {key:list(v[0] for v in valuesiter)
              for key,valuesiter in groupby(input, key=sortkeyfn)}

I'm working on a MapReduce program in python, just wondering is there any way to group by values in a list without dealing with dictionaries or external library such as pandas? If not, then how can I get rid of items and type in my result? — Agent 0, Commented Nov 26, 2018 at 5:13
@Kourosh - Post as a new question, but be sure to indicate what you mean by "get rid of items and type in my result", and "without dealing with dictionaries". — PaulMcG, Commented Nov 26, 2018 at 5:58
The first example (result = []) is a list of dicts whose values are lists. The second (result = {}) is a single dict whose values are lists. Smart, the way you improved the code. — Timo, Commented Aug 28, 2021 at 19:51

ronen · Accepted Answer · 2020-02-18 14:16:30Z

This answer is similar to @PaulMcG's answer but doesn't require sorting the input.

For those into functional programming, groupBy can be written in one line (not including imports!), and unlike itertools.groupby it doesn't require the input to be sorted:

from functools import reduce # import needed for python3; builtin in python2
from collections import defaultdict

def groupBy(key, seq):
    return reduce(lambda grp, val: grp[key(val)].append(val) or grp, seq, defaultdict(list))

(The reason for ... or grp in the lambda is that for this reduce() to work, the lambda needs to return its first argument; because list.append() always returns None the or will always return grp. I.e. it's a hack to get around python's restriction that a lambda can only evaluate a single expression.)
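
For readers who find the `or grp` trick hard to follow, here is a sketch that puts the reduce version next to an equivalent explicit loop. The names `group_by_reduce`, `group_by_loop`, and `second` are illustrative and not part of the original answer.

```python
from collections import defaultdict
from functools import reduce

def group_by_reduce(key, seq):
    # append() returns None, so `None or grp` evaluates to grp,
    # which becomes the accumulator for the next step.
    return reduce(lambda grp, val: grp[key(val)].append(val) or grp, seq, defaultdict(list))

def group_by_loop(key, seq):
    # The same logic written as an ordinary loop.
    grp = defaultdict(list)
    for val in seq:
        grp[key(val)].append(val)
    return grp

pairs = [('1', 'KAT'), ('2', 'NOT'), ('3', 'KAT')]
second = lambda pair: pair[1]
print(group_by_reduce(second, pairs) == group_by_loop(second, pairs))  # True
```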

This returns a dict whose keys are found by evaluating the given function and whose values are a list of the original items in the original order. For the OP's example, calling this as groupBy(lambda pair: pair[1], input) will return this dict:

{'KAT': [('11013331', 'KAT'), ('9843236', 'KAT')],
 'NOT': [('9085267', 'NOT'), ('11788544', 'NOT')],
 'ETH': [('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH')]}

And as per @PaulMcG's answer, the OP's requested format can be produced by wrapping that in a list comprehension. So this will do it:

result = [{'type': key, 'items': [pair[0] for pair in values]}
          for key, values in groupBy(lambda pair: pair[1], input).items()]

A lot less code, yet understandable. Also good because it doesn't reinvent the wheel. — devdanke, Commented Jul 11, 2020 at 16:26
This solution also works as groupBy, that is -- it can accept a callable as a key too. Very handy. — Orwellophile, Commented Apr 1, 2022 at 12:57

KyleMit · Accepted Answer · 2023-01-28 13:03:17Z

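I also like pandas' simple grouping. It is powerful, concise, and well suited to large data sets. Note that .groups maps each type to the row labels of its members rather than to the values themselves.

import pandas
result = pandas.DataFrame(input).groupby(1).groups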

answered Nov 2, 2016 at 5:06 by akiva; edited Jan 28, 2023 at 13:03 by KyleMit

Phrogz · Accepted Answer · 2016-10-28 21:39:20Z

The following function will quickly (no sorting required) group tuples of any length by a key having any index:

# given a sequence of tuples like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
# returns a dict grouping tuples by idx-th element - with idx=1 we have:
# if merge is True {'c':(3,6,88,4),     'a':(7,2,45,0)}
# if merge is False {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
def group_by(seqs,idx=0,merge=True):
    d = dict()
    for seq in seqs:
        k = seq[idx]
        v = d.get(k,tuple()) + (seq[:idx]+seq[idx+1:] if merge else (seq[:idx]+seq[idx+1:],))
        d.update({k:v})
    return d

In the case of your question, the index of key you want to group by is 1, therefore:

group_by(input,1)

gives

{'ETH': ('5238761','5349618','962142','7795297','7341464','5594916','1550003'),
 'KAT': ('11013331', '9843236'),
 'NOT': ('9085267', '11788544')}

which is not exactly the output you asked for, but might suit your needs just as well.
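
A list-based variant of the same idea is sketched below. It is not the answer's function: the name `group_rest_by_index` is hypothetical, it assumes a non-negative `idx`, and it collects each group's rows in a list so that the accumulated result is not copied on every row. It also covers rows with several value fields, as asked about in a comment on an earlier answer.

```python
def group_rest_by_index(rows, idx):
    """Group tuples by their idx-th field, keeping the remaining fields of each row."""
    grouped = {}
    for row in rows:
        # Everything except the key field (assumes idx >= 0).
        rest = row[:idx] + row[idx + 1:]
        # Appending to a list avoids rebuilding a growing tuple each time.
        grouped.setdefault(row[idx], []).append(rest)
    return grouped

rows = [('11013331', 'red', 'KAT'), ('9085267', 'blue', 'KAT'), ('5238761', 'red', 'ETH')]
print(group_rest_by_index(rows, 2))
```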

user4813927 · Accepted Answer · 2018-07-05 10:17:12Z

result = []
# Make a set of your "types":
input_set = set(tpl[1] for tpl in input)
# input_set is now {'ETH', 'KAT', 'NOT'}
# Iterate over the input_set
for type_ in input_set:
    # a dict to gather things:
    D = {}
    # filter all tuples from your input with the same type as type_
    tuples = filter(lambda tpl: tpl[1] == type_, input)
    # write them in the D:
    D["type"] = type_
    D["items"] = [tpl[0] for tpl in tuples]
    # append D to results:
    result.append(D)

# result (the order of the groups may vary, because sets are unordered):
# [{'items': ['9085267', '11788544'], 'type': 'NOT'}, {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}, {'items': ['11013331', '9843236'], 'type': 'KAT'}]

westandskif · Accepted Answer · 2021-11-12 10:22:55Z

You could use the convtools library, which dynamically generates ad hoc code for your exact task.

from convtools import conversion as c

# grouping by second elements of tuples;
# aggregate defines the schema of the expected output elements
converter = c.group_by(c.item(1)).aggregate({
    "type": c.item(1),
    "items": c.ReduceFuncs.Array(c.item(0)),
}).gen_converter()

# now you have a function which does what you asked,
# store it somewhere for further reuse
converter(input_data)

answered Jul 16, 2021 at 21:42; edited Nov 12, 2021 at 10:22

Kshitij Zutshi · Accepted Answer · 2022-04-18 02:05:13Z

The following snippet is another way to get the desired result:

res = []
dict1 = {}
for item in input:
    if item[1] not in dict1:
        # first time this type is seen: start a new list
        dict1[item[1]] = [item[0]]
    else:
        dict1[item[1]].append(item[0])
for k, v in dict1.items():
    res.append({"type": k, "items": v})

# res == [{'type': 'KAT', 'items': ['11013331', '9843236']}, {'type': 'NOT', 'items': ['9085267', '11788544']}, {'type': 'ETH', 'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003']}]

Denziloe · Accepted Answer · 2022-08-25 19:16:58Z

This is not very efficient, since it rescans the whole input once per group, but it is Pythonic. Basically, work out the distinct groups by taking the set of group values, and then, for each of these groups, collect the items that belong to it.

[
    {
        "type": group,
        "items": [item[0] for item in input if item[1] == group]
    }
    for group in {item[1] for item in input}
]
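
Because a set has no defined order, the groups above can come out in any order. If first-appearance order matters, one possible tweak (a sketch assuming Python 3.7+, with illustrative names and shortened sample data) is to collect the distinct types with dict.fromkeys instead of a set:

```python
pairs = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('9843236', 'KAT')]

# dict.fromkeys keeps one entry per type, in order of first appearance.
types_in_order = list(dict.fromkeys(type_ for _, type_ in pairs))

# Same comprehension as before, but iterating over the ordered types.
result = [
    {'type': group, 'items': [value for value, type_ in pairs if type_ == group]}
    for group in types_in_order
]
print(result)
```

This still rescans the input once per group, so it has the same cost profile as the set-based version.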

