1

I'm struggling to find an elegant solution to create a running total column in my dataframe. It should start the count if two criteria are met and reset any time they aren't.

If the user in the example frame below is the same as the row before and the 'Value Col' is 0, the running total should start and increase by one for every row until either the user changes or the Value Col is NOT 0.

This is being run on a very large dataset (30+ million rows), so I'm hoping there can be a solution using built in, optimised functions, but I can brute force it with .apply if that's the only option.

Example:

User Value Col Running total
One 2 0
One 0 1
One 0 2
One 0 3
One 1 0
One 3 0
One 0 1
One 0 2
Two 0 1
Two 0 2
Two 0 3
Two 3 0
Two 0 1
Two 0 2

4 Answers 4

3

There's a common trick for it in pandas: use cumsum on a boolean mask to create groups of consecutive rows. Then use a groupby + cumcount to label the values inside each group!

import pandas as pd

df = pd.read_clipboard() # Your df here

groups = df["Value Col"].ne(0).cumsum()

df["Running total"] = df.groupby(["User", groups]).cumcount()

df:

   User  Value Col  Running total
0   One          2              0
1   One          0              1
2   One          0              2
3   One          0              3
4   One          1              0
5   One          3              0
6   One          0              1
7   One          0              2
8   Two          0              0 # <-- The count resets to 0 here,
9   Two          0              1 # <-- in your example we have 1, 2, 3 instead,
10  Two          0              2 # <-- is that a mistake? or intentional?
11  Two          3              0
12  Two          0              1
13  Two          0              2
0
1

Following the next approach takes about 3 seconds for 30million rows on my machine:

mask = (df["User"] == df["User"].shift(1)) & (df["Value Col"] == 0)
running_total = 0
def compute_rt(row):
    global running_total
    running_total = running_total + 1 if row else 0
    return running_total

df["Running total"] = mask.apply(compute_rt)

If I try another approaches to avoid the global variable the time increases.

1

Create a boolean mask to flag the rows where either user in current row is different from the previous row or value is 0, then calculate cumulative sum on the mask to distinguish between different blocks of rows then group the dataframe by the blocks and cumulatively count the zeros per block

m1 = df['User'] != df['User'].shift()
m2 = df['Value Col'].ne(0)
m = m1 | m2 # reset flag

df['Running total'] = (~m2).groupby(m.cumsum()).cumsum()

   User  Value Col  Running total
0   One          2              0
1   One          0              1
2   One          0              2
3   One          0              3
4   One          1              0
5   One          3              0
6   One          0              1
7   One          0              2
8   Two          0              1
9   Two          0              2
10  Two          0              3
11  Two          3              0
12  Two          0              1
13  Two          0              2
1

Code

cond1 = df['Value Col'].ne(0)
grp = cond1.groupby(df['User']).cumsum().mask(cond1)
df["Running total"] = df.groupby([grp, 'User'])['Value Col'].cumcount().add(1).fillna(0).astype('int')

df:

    User    Value Col   Running total
0   One     2           0
1   One     0           1
2   One     0           2
3   One     0           3
4   One     1           0
5   One     3           0
6   One     0           1
7   One     0           2
8   Two     0           1
9   Two     0           2
10  Two     0           3
11  Two     3           0
12  Two     0           1
13  Two     0           2

Example Code

import pandas as pd
data = {'User': ['One', 'One', 'One', 'One', 'One', 'One', 'One', 'One', 
                 'Two', 'Two', 'Two', 'Two', 'Two', 'Two'], 
        'Value Col': [2, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0]}
df = pd.DataFrame(data)

Not the answer you're looking for? Browse other questions tagged or ask your own question.