
I have a DataFrame and, for each row, I need to count how many consecutive columns (starting from and including the first) have the same sign as the first column, and then multiply that count by the sign of the first column.
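
To make the rule concrete, here is a small worked example on a single made-up row (not from my data):

import numpy as np

# Made-up row, purely illustrative: the signs are [+, +, -, +].
row = np.array([0.5, 0.2, -0.3, 0.4])
signs = np.sign(row)                # [ 1.,  1., -1.,  1.]

# The first column counts as 1; the second column matches its sign,
# the third does not, so the run stops there.
count = 1
for sign in signs[1:]:
    if sign == signs[0]:
        count += 1
    else:
        break

result = count * signs[0]           # 2 * (+1) = 2.0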

What I need to speed up is the calc_df function, which runs like this on my computer:

%timeit calc_df(df)
6.38 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The output of my code is:

        a_0       a_1       a_2       a_3       a_4       a_5       a_6       a_7       a_8       a_9
0  0.097627  0.430379  0.205527  0.089766 -0.152690  0.291788 -0.124826  0.783546  0.927326 -0.233117
1  0.583450  0.057790  0.136089  0.851193 -0.857928 -0.825741 -0.959563  0.665240  0.556314  0.740024
2  0.957237  0.598317 -0.077041  0.561058 -0.763451  0.279842 -0.713293  0.889338  0.043697 -0.170676
3 -0.470889  0.548467 -0.087699  0.136868 -0.962420  0.235271  0.224191  0.233868  0.887496  0.363641
4 -0.280984 -0.125936  0.395262 -0.879549  0.333533  0.341276 -0.579235 -0.742147 -0.369143 -0.272578

0    4.0
1    4.0
2    2.0
3   -1.0
4   -2.0

My code is as follows; the generate_data function generates demo data at the same scale as my actual data.

import numpy as np
import pandas as pd
from numba import njit

np.random.seed(0)

pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)


# This function generates demo data.
def generate_data() -> pd.DataFrame:
    col = [f'a_{x}' for x in range(10)]
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df


@njit
def calc_numba(s: np.ndarray) -> float:
    first = s[0]
    # Count the first column plus every consecutive column whose sign
    # matches it, stopping at the first mismatch.
    count = 1
    for sign in s[1:]:
        if sign == first:
            count += 1
        else:
            break
    # Give the count the sign of the first column.
    return count * first


def calc_series(s: pd.Series) -> float:
    return calc_numba(s.to_numpy())


def calc_df(df: pd.DataFrame) -> pd.DataFrame:
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df


def main() -> None:
    df = generate_data()
    print(df.head(5))
    df = calc_df(df)
    print(df['count'].head(5))
    return


if __name__ == '__main__':
    main()

2 Answers


Just avoiding switching between NumPy arrays and pandas DataFrames so much can grant you an easy ~4x speedup. As a baseline, your original calc_df on my machine:

def calc_df(df: pd.DataFrame) -> pd.DataFrame:
    df1 = np.sign(df)
    df['count'] = df1.apply(calc_series, axis=1)
    return df

%%timeit df = generate_data()
calc_df(df)

1.6 s ± 75.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With less casting from pandas to numpy:

def optimized_calc_df(df: pd.DataFrame) -> pd.DataFrame:
    array = np.sign(df.to_numpy())
    df['count'] = np.apply_along_axis(calc_numba, 1, array)
    return df

%%timeit odf = generate_data()
optimized_calc_df(odf)

415 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the vectorised pandas solution from the other answer, and a pure NumPy version of it, with the full setup for reference:

import numpy as np
import pandas as pd
from numba import njit

np.random.seed(0)

pd.set_option('display.max_columns', None)
pd.set_option('expand_frame_repr', False)


# This function generates demo data.
def generate_data() -> pd.DataFrame:
    col = [f'a_{x}' for x in range(10)]
    np.random.seed(0)
    df = pd.DataFrame(data=np.random.uniform(-1, 1, [280000, 10]), columns=col)
    return df

# This function generates the same demo data as a plain NumPy array.
def generate_data_array() -> np.array:
    np.random.seed(0)
    return np.random.uniform(-1, 1, [280000, 10])

%%timeit df = generate_data()
df1 = np.sign(df)
m = df1.eq(df1.iloc[:,0], axis=0).cummin(1)
out_df = m.sum(1)*df1.iloc[:,0]

76.6 ms ± 2.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit array = generate_data_array()
array2 = np.sign(array)
array3 = np.minimum.accumulate(np.equal(array2, np.expand_dims(array2[:,0], axis=1)), 1)
out_array = array3.sum(axis=1) * array2[:,0]

44.3 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
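
As an aside, np.minimum.accumulate over the boolean mask is just a row-wise cumulative AND, i.e. the NumPy counterpart of pandas' cummin. A tiny illustration of that behaviour:

import numpy as np

# True while the sign still matches the first column, False afterwards.
mask = np.array([True, True, False, True])

# Cumulative minimum over booleans behaves like a cumulative AND:
# once a False appears, everything after it stays False.
print(np.minimum.accumulate(mask))        # [ True  True False False]
print(np.minimum.accumulate(mask).sum())  # 2 -> length of the initial run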

Someone answered me this way:

You can use vectorised code here.

For example with a mask:

df1 = np.sign(df)
m = df1.eq(df1.iloc[:,0], axis=0).cummin(1)
out = df1.where(m).sum(1)

Output (first 5 rows):

0    4.0
1    4.0
2    2.0
3   -1.0
4   -2.0
dtype: float64

Time to run on the whole data:

269 ms ± 37.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Faster alternative:

df1 = np.sign(df)
m = df1.eq(df1.iloc[:,0], axis=0).cummin(1)
out = m.sum(1)*df1.iloc[:,0]

148 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And you can probably do even better with pure numpy (you have to write a cummin equivalent).
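
A rough sketch of what that pure-NumPy version could look like (np.minimum.accumulate plays the role of cummin; calc_numpy is just an illustrative name, and this is only a sketch against the demo data shape, not a tested drop-in implementation):

import numpy as np

def calc_numpy(values: np.ndarray) -> np.ndarray:
    signs = np.sign(values)
    # True while a column's sign still matches the first column's sign.
    same = signs == signs[:, :1]
    # Cumulative AND along each row: the run stops at the first mismatch.
    run = np.minimum.accumulate(same, axis=1)
    # Length of the run times the sign of the first column.
    return run.sum(axis=1) * signs[:, 0]

# Usage, assuming df is the DataFrame from the question:
# df['count'] = calc_numpy(df.to_numpy())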

