Get minimum of each group based on hour criteria using pandas

Question

I have a requirement which I posted in SO. However, I have already written code and works fine. I request your review to know whether it is fool proof/elegant and efficient. Please find the details here in SO post Question in SO

my code

df['time_1']= pd.to_datetime(df['time_1'])
s=pd.to_timedelta(24,unit='h')-(df.time_1-df.time_1.dt.normalize())
df['tdiff'] = df.groupby(df.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
df['t_d'] = df['tdiff'].dt.total_seconds()/3600
df['hr'] = df['time_1'].dt.hour
df['date'] = df['time_1'].dt.date
df['day'] = pd.DatetimeIndex(df['time_1']).day

# here I get the freq and cumsum of each val for each day and each hour.Since sort = 'False', timeorder is retained as is

temp_1 = pd.DataFrame(df.groupby(['subject_id','date','hr','val'], sort=False)['t_d'].agg({'cumduration':sum,'freq':'count'}).reset_index())

# here i remove the `hour` component and sum the value duration in same day but different hours (for example `5` was in 12th hour and 13th hour. we sum them)

temp_2 = pd.DataFrame(temp_1.groupby(['subject_id','date','val'], sort=False)['cumduration'].agg({'sum_of_cumduration':sum,'freq':'count'}).reset_index())

# Later, I create a mask for `> 1` hr criteria  

mask = temp_2.groupby(['subject_id','date'])['sum_of_cumduration'].apply(lambda x: x > 1)
output_1 = pd.DataFrame(temp_2[mask].groupby(['subject_id','date'])['val'].min()).reset_index()

# I check for `< 1 ` hr records here 

output_2 = pd.DataFrame(temp_2[~mask].groupby(['subject_id','date'])['val'].min()).reset_index()

# I finally check for `subject_id` and `date` and then append


output = output_1.append(output_2[~output_2['subject_id'].isin(output_1['subject_id'])])

output

Welcome to Code Review! Your indendation seems off in the lines following comments. Could you verify that this code works as posted? — AlexV, Commented Aug 30, 2019 at 9:26
I pasted the code from here to jupyter and it works after fixing the alignment — The Great, Commented Aug 30, 2019 at 9:33

Maarten Fabré · Accepted Answer · 2019-08-30 09:40:49Z

leave the original data intact

Since df is the original data, adding columns to it can have a strange side effect in another part of the calculation. Best is not to touch the original data, and do all the calculations etc in separate series or copied DataFrames

pd.grouper

You can use pd.Grouper to group per day and hour

You can group the date per subject and hour with:

df.groupby(["subject_id", pd.Grouper(key="time_1", freq="h"), "val"])

Then you don't need all those extra columns "hr", "date" and "day" anymore. There is no reason for the column "t_d" either. since timedelta's can be counted, summed etc as well.

aggregation function

Apart from that, you can design a specific function to get the requested minimum from one such group.

If you supply it data grouped per subject and date, dropping rows with the same val and checking whether more than 1hr has passed is a lot easier

def day_grouper(group):
    """
    Returns the minimum in `val` that lasted more than 1 hour
    """
    # drop consecutive same values.
    # The first value is retained because it becomes NaN and `NaN != 0`
    group = group.loc[group["val"].diff() != 0]
    # drop those that take less than 1hr
    more_than_1h = group["time_1"].diff().shift(-1) > pd.Timedelta("1h")
    # Check whether the last value started before 23:00
    more_than_1h.iloc[-1] = group["time_1"].iloc[-1].time() < datetime.time(23)
    return group.loc[more_than_1h, "val"].min()

This can then be called like this:

(
    df.groupby(["subject_id", pd.Grouper(key="time_1", freq="d")])
    .apply(day_grouper)
    .rename("val")
    .reset_index()
)

The 2 last rows are to get the format you want

code formatting

Try to follow Pep-8.

£some of your lines are really long, and convention is a space after a comma.

I use black to format my source code, so I don't have to worry about long lines, spaces, etc any more

Thank you. Upvoted. Will incorporate your suggestions. Is my logic foolproof? — The Great, Commented Aug 30, 2019 at 10:00

Stack Exchange Network

Get minimum of each group based on hour criteria using pandas

1 Answer 1

leave the original data intact

pd.grouper

aggregation function

code formatting

Not the answer you're looking for? Browse other questions tagged
python
pandas
or ask your own question.

Hot Network Questions

Get minimum of each group based on hour criteria using pandas

1 Answer 1

leave the original data intact

pd.grouper

aggregation function

code formatting

Not the answer you're looking for? Browse other questions tagged pythonpandas or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
python
pandas
or ask your own question.