
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. The data is also very large, so the solutions I found online are too slow and memory-inefficient.
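
For context, the loading step looks roughly like the sketch below; the glob pattern and the lines=True orientation are placeholders, not details from the question:

import glob
import pandas as pd

# placeholder paths; assume each file holds newline-delimited JSON records
files = glob.glob('data/*.json')
df = pd.concat((pd.read_json(f, lines=True) for f in files), ignore_index=True)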

Here is what my data looks like:

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
    'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
    'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

EDIT: After being marked as a duplicate, I would like to stress that in this question I was looking for an efficient method of exploding multiple columns. The accepted answer therefore explodes an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).


7 Answers


pandas >= 1.3

In more recent versions, pandas allows you to explode multiple columns at once using DataFrame.explode, provided the lists in each row have equal lengths. So you can simply do:

df.explode(['B', 'C', 'D', 'E']).reset_index(drop=True)

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
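
Note that the equal-length requirement is enforced: if any row's lists differ in length across these columns, the call raises a ValueError. A minimal pre-check sketch (the variable names here are mine, not part of the answer):

cols = ['B', 'C', 'D', 'E']
# one column of per-row lengths per list column; each row should hold a single distinct length
lengths = df[cols].apply(lambda s: s.str.len())
if not lengths.nunique(axis=1).eq(1).all():
    raise ValueError("some rows have mismatched list lengths")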

pandas >= 0.25

For slightly older versions, you can apply Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to first set the columns that must NOT be exploded as the index, then reset the index afterwards.
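
If several columns must be kept intact, the same pattern holds; a quick sketch with a hypothetical extra scalar column 'F':

# 'F' is invented here purely to show multiple index columns
df.assign(F='x').set_index(['A', 'F']).apply(pd.Series.explode).reset_index()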

Funnily enough, this happens to be faster than calling df.explode, according to my tests. YMMV.


explode methods are quite performant in general:

df2 = pd.concat([df] * 100, ignore_index=True)

%timeit df2.explode(['B', 'C', 'D', 'E']).reset_index(drop=True)

%timeit df2.set_index(['A']).apply(pd.Series.explode).reset_index()  # fastest

%%timeit
(df2.set_index('A')
    .apply(lambda x: x.apply(pd.Series).stack())
    .reset_index()
    .drop('level_1', axis=1))


2.59 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # DataFrame.explode
1.27 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   # set_index + apply(pd.Series.explode)
120 ms ± 9.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)     # apply + stack

Use set_index on A, then apply and stack the values of the remaining columns. All of this is condensed into a one-liner.

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', axis=1))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

import numpy as np
import pandas as pd


def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # lengths of the lists in the first list column
    # (all list columns are assumed to have matching lengths per row)
    lens = df[lst_cols[0]].str.len()

    # repeat the non-list columns once per list element and flatten the list columns
    res = pd.DataFrame({
        col: np.repeat(df[col].values, lens)
        for col in idx_cols
    }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})

    if (lens > 0).all():
        # no empty lists - just restore the original column order
        return res.loc[:, df.columns]

    # at least one list is empty: re-attach those rows and fill the gaps
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
    return (pd.concat([res, df.loc[lens == 0, idx_cols]])
              .fillna(fill_value)
              .loc[:, df.columns])

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

As of pandas 1.3.0 (What’s new in 1.3.0 (July 2, 2021)):

  • DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)

So now this operation is as simple as:

df.explode(['B', 'C', 'D', 'E'])
    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
2  x3  v5  c5  d5  e5
2  x3  v6  c6  d6  e6
3  x4  v7  c7  d7  e7
3  x4  v8  c8  d8  e8

Or, if you want a unique index:

df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

Building on @cs95's answer, we can use an if clause in the lambda function, instead of setting all the other columns as the index. This has the following advantages:

  • Preserves column order
  • Lets you specify columns either by the set you want to explode (x.name in [...]) or by the set you want to leave alone (x.name not in [...]); see the sketch after the output below.
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)

     A   B   C   D   E
0   x1  v1  c1  d1  e1
0   x1  v2  c2  d2  e2
1   x2  v3  c3  d3  e3
1   x2  v4  c4  d4  e4
2   x3  v5  c5  d5  e5
2   x3  v6  c6  d6  e6
3   x4  v7  c7  d7  e7
3   x4  v8  c8  d8  e8
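
For the inverse selection mentioned in the second bullet, a sketch that leaves 'A' untouched and explodes everything else:

df.apply(lambda x: x.explode() if x.name not in ['A'] else x)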

Gathering all of the responses on this and other threads, here is how I do it for comma-delimited rows:

from __future__ import annotations  # allows `str | Sequence[str]` on Python < 3.10

from collections.abc import Sequence
import pandas as pd
import numpy as np


def explode_by_delimiter(
    df: pd.DataFrame,
    columns: str | Sequence[str],
    delimiter: str = ",",
    reindex: bool = True
) -> pd.DataFrame:
    """Convert dataframe with columns separated by a delimiter into an
    ordinary dataframe. Requires pandas 1.3.0+."""
    if isinstance(columns, str):
        columns = [columns]

    col_dict = {
        col: df[col]
        .str.split(delimiter)
        # Without .fillna(), .explode() will fail on empty values
        .fillna({i: [np.nan] for i in df.index})
        for col in columns
    }
    df = df.assign(**col_dict).explode(columns)
    return df.reset_index(drop=True) if reindex else df
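
A hypothetical usage sketch (this DataFrame is invented for illustration and is not from the answer):

df = pd.DataFrame({'A': ['x1', 'x2'], 'B': ['v1,v2', 'v3,v4'], 'C': ['c1,c2', 'c3,c4']})
explode_by_delimiter(df, columns=['B', 'C'])
#     A   B   C
# 0  x1  v1  c1
# 1  x1  v2  c2
# 2  x2  v3  c3
# 3  x2  v4  c4
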
  • can you provide some syntax on how this can be used?
    – KahlilG
    Commented Jan 11, 2023 at 5:45

Here is my solution using the apply function. Main features/differences:

  1. Offers the option to target selected columns or all columns.
  2. Offers options for the values used to fill the 'missing' positions, via the fill_mode parameter ('external', 'internal', or 'trim'); the explanation would be long, so see the examples below and try changing the option to check the result.

Note: the 'trim' option was developed for my own needs and is out of scope for this question.

    import numpy as np
    import pandas as pd

    def lenx(x):
        # length of list-like cells, 1 for scalars
        return len(x) if isinstance(x, (list, tuple, np.ndarray, pd.Series)) else 1

    def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
        # positions of the columns selected via `cols` (fall back to all columns)
        jcols = [j for j, v in enumerate(row.index) if v in cols]
        if len(jcols) < 1:
            jcols = range(len(row.index))
        # length of every cell in the row
        Ls = [lenx(x) for x in row.values]
        if Ls[:-1] != Ls[1:]:
            # cell sizes differ: wrap scalars in lists, then pad or trim
            vals = [v if isinstance(v, list) else [v] for v in row.values]
            if fill_mode == 'external':
                vals = [[e] + [fill_value]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                        else e + [fill_value]*(max(Ls)-lenx(e))
                        for j, e in enumerate(vals)]
            elif fill_mode == 'internal':
                vals = [[e] + [e]*(max(Ls)-1) if (j not in jcols) and isinstance(row.values[j], list)
                        else e + [e[-1]]*(max(Ls)-lenx(e))
                        for j, e in enumerate(vals)]
            else:
                vals = [e[0:min(Ls)] for e in vals]
            row = pd.Series(vals, index=row.index.tolist())
        return row

Examples:

    df=pd.DataFrame({
        'a':[[1],2,3],
        'b':[[4,5,7],[5,4],4],
        'c':[[4,5],5,[6]]
    })
    print(df)
    df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
    df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
    df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
    df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'trim\', all columns\n', df4)

Output:

         a          b       c
    0  [1]  [4, 5, 7]  [4, 5]
    1    2     [5, 4]       5
    2    3          4     [6]
    
    fill_mode='external', all columns, fill_value = 'OK'
         a  b   c
    0   1  4   4
    0  OK  5   5
    0  OK  7  OK
    1   2  5   5
    1  OK  4  OK
    2   3  4   6
    
    fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
         a  b       c
    0   1  4  [4, 5]
    0  OK  5      OK
    0  OK  7      OK
    1   2  5       5
    1  OK  4      OK
    2   3  4       6
    
    fill_mode='internal', cols = ['a', 'b']
        a  b       c
    0  1  4  [4, 5]
    0  1  5  [4, 5]
    0  1  7  [4, 5]
    1  2  5       5
    1  2  4       5
    2  3  4       6
    
    fill_mode='trim', all columns
        a  b  c
    0  1  4  4
    1  2  5  5
    2  3  4  6

  • The line Ls = [lenx(x) for x in row.values] returns the error "lenx is not defined"
    – KahlilG
    Commented Jan 11, 2023 at 5:49
  • Sorry, I missed a customized function; I added it to the code
    – nphaibk
    Commented Jan 17, 2023 at 20:14
