Count groups of adjacent rows that meet condition in pandas dataframe - python

Question

I'm struggling to find an elegant solution to create a running total column in my dataframe. It should start the count if two criteria are met and reset any time they aren't.

If the user in the example frame below is the same as the row before and the 'Value Col' is 0, the running total should start and increase by one for every row until either the user changes or the Value Col is NOT 0.

This is being run on a very large dataset (30+ million rows), so I'm hoping there can be a solution using built in, optimised functions, but I can brute force it with .apply if that's the only option.

Example:

User	Value Col	Running total
One	2	0
One	0	1
One	0	2
One	0	3
One	1	0
One	3	0
One	0	1
One	0	2
Two	0	1
Two	0	2
Two	0	3
Two	3	0
Two	0	1
Two	0	2

Chrysophylaxs · Accepted Answer · 2023-11-02 14:37:51Z

There's a common trick for it in pandas: use cumsum on a boolean mask to create groups of consecutive rows. Then use a groupby + cumcount to label the values inside each group!

import pandas as pd

df = pd.read_clipboard() # Your df here

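# Every non-zero value starts a new group; the zeros that follow share its label
groups = df["Value Col"].ne(0).cumsum()

# Number the rows within each (User, group) pair, starting from 0
df["Running total"] = df.groupby(["User", groups]).cumcount()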

df:

   User  Value Col  Running total
0   One          2              0
1   One          0              1
2   One          0              2
3   One          0              3
4   One          1              0
5   One          3              0
6   One          0              1
7   One          0              2
8   Two          0              0 # <-- The count resets to 0 here,
9   Two          0              1 # <-- in your example we have 1, 2, 3 instead,
10  Two          0              2 # <-- is that a mistake? or intentional?
11  Two          3              0
12  Two          0              1
13  Two          0              2

thesydne · Accepted Answer · 2023-11-02 14:46:24Z

The following approach takes about 3 seconds for 30 million rows on my machine:

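# True where the row has the same user as the previous row and a zero value
mask = (df["User"] == df["User"].shift(1)) & (df["Value Col"] == 0)

running_total = 0  # module-level counter shared across calls

def compute_rt(row):
    # Increment on True, reset to 0 on False
    global running_total
    running_total = running_total + 1 if row else 0
    return running_total

df["Running total"] = mask.apply(compute_rt)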

When I tried other approaches that avoid the global variable, the run time increased.
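
For comparison, the same reset-on-False counter can be written without a global variable or a Python-level loop. This is a sketch using NumPy rather than the method above, and it has not been benchmarked here on 30 million rows. `run_lengths` is a hypothetical helper name, and the sample data is copied from the question.

```python
import numpy as np
import pandas as pd

def run_lengths(mask):
    """Count consecutive True values, resetting to 0 at every False."""
    m = np.asarray(mask, dtype=bool)
    idx = np.arange(1, len(m) + 1)
    # 1-based position of the most recent False at or before each row (0 if none yet)
    last_reset = np.maximum.accumulate(np.where(m, 0, idx))
    # On False rows the reset position is the row itself, so the difference is 0
    return idx - last_reset

# Sample data copied from the question
df = pd.DataFrame({
    "User": ["One"] * 8 + ["Two"] * 6,
    "Value Col": [2, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0],
})

# Same mask as in the answer above
mask = (df["User"] == df["User"].shift(1)) & (df["Value Col"] == 0)
df["Running total"] = run_lengths(mask)
print(df["Running total"].tolist())
# [0, 1, 2, 3, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```

As with the `apply` version, the first row of each user is 0 because the mask requires the previous row to have the same user.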

Shubham Sharma · Accepted Answer · 2023-11-02 15:02:13Z

Create a boolean mask that flags the rows where either the user differs from the previous row or the value is non-zero. Take the cumulative sum of that mask to label the blocks of rows, then group by those block labels and cumulatively count the zeros within each block.

m1 = df['User'] != df['User'].shift()
m2 = df['Value Col'].ne(0)
m = m1 | m2 # reset flag

df['Running total'] = (~m2).groupby(m.cumsum()).cumsum()

   User  Value Col  Running total
0   One          2              0
1   One          0              1
2   One          0              2
3   One          0              3
4   One          1              0
5   One          3              0
6   One          0              1
7   One          0              2
8   Two          0              1
9   Two          0              2
10  Two          0              3
11  Two          3              0
12  Two          0              1
13  Two          0              2

Panda Kim · Accepted Answer · 2023-11-02 15:05:16Z

Code

cond1 = df['Value Col'].ne(0)
grp = cond1.groupby(df['User']).cumsum().mask(cond1)
df["Running total"] = df.groupby([grp, 'User'])['Value Col'].cumcount().add(1).fillna(0).astype('int')

df:

    User    Value Col   Running total
0   One     2           0
1   One     0           1
2   One     0           2
3   One     0           3
4   One     1           0
5   One     3           0
6   One     0           1
7   One     0           2
8   Two     0           1
9   Two     0           2
10  Two     0           3
11  Two     3           0
12  Two     0           1
13  Two     0           2
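
Since the code above comes without an explanation, here is a sketch of what the intermediate `grp` key looks like on the question's sample data (copied by hand). `cond1` and `grp` are the names used in the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "User": ["One"] * 8 + ["Two"] * 6,
    "Value Col": [2, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0],
})

cond1 = df["Value Col"].ne(0)  # True on non-zero rows

# Per-user running count of non-zero rows, set to NaN on the non-zero rows themselves
grp = cond1.groupby(df["User"]).cumsum().mask(cond1)
print(grp.tolist())
# [nan, 1.0, 1.0, 1.0, nan, nan, 3.0, 3.0, 0.0, 0.0, 0.0, nan, 1.0, 1.0]
```

Each run of zeros shares one key per user, and the non-zero rows have a missing key. The final line of the answer then numbers the rows within each (`grp`, `User`) pair starting from 1, and it appears to rely on `cumcount` returning NaN for rows whose key is missing, which `fillna(0)` turns into 0. That NaN behaviour is an assumption worth checking on the pandas version in use.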

Example Code

import pandas as pd
data = {'User': ['One', 'One', 'One', 'One', 'One', 'One', 'One', 'One', 
                 'Two', 'Two', 'Two', 'Two', 'Two', 'Two'], 
        'Value Col': [2, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0]}
df = pd.DataFrame(data)
