
I have a piece of code that calculates price sensitivity based on a product and its rating.

Below is the original data set with product type, reported year, customer rating, price per unit, and quantity. It contains two products, Product 1 and Product 2, each with price and quantity data for five consecutive years.

Input data:

data = pd.DataFrame(
    [['Product 1', 'Year 1', 'Good', 34, 7],
     ['Product 1', 'Year 2', 'Good', 22, 5],
     ['Product 1', 'Year 3', 'Good', 30, 2],
     ['Product 1', 'Year 4', 'Good', 50, 1],
     ['Product 1', 'Year 5', 'Good', 44, 103],
     ['Product 2', 'Year 1', 'Bad', 200, 12],
     ['Product 2', 'Year 2', 'Bad', 103, 50],
     ['Product 2', 'Year 3', 'Bad', 150, 192],
     ['Product 2', 'Year 4', 'Bad', 309, 20],
     ['Product 2', 'Year 5', 'Bad', 200, 12]],
    columns=['Product', 'Year', 'Rating', 'Price', 'Quantity'])

I then created two correlation matrices: a product matrix and a rating matrix.

Product correlation matrix:

from itertools import cycle

import numpy as np
import pandas as pd

# Start from a 10x10 grid of 0.5 (2 products x 5 years); column 0 is a dummy that gets dropped
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_name = pd.DataFrame.from_records([[0.5]*col]*row)
df_corr_name = df_corr_name.loc[:, df_corr_name.columns != 0]
df_corr_name

#CREATE NEW COLUMNS


#year
tenor_list = cycle(['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5'])
df_corr_name['Year'] = [next(tenor_list) for i in range(len(df_corr_name))]
df_corr_name.insert(0, 'Year', df_corr_name.pop('Year'))


#product
name_list = data['Product'].unique().tolist()
rep = 5
df_corr_name['Product'] = [ele for ele in name_list for i in range(rep)]
df_corr_name.insert(1, 'Product', df_corr_name.pop('Product'))

#rating
df_tcc_quality = data[['Product', 'Rating']].drop_duplicates()
quality_list = [list(i) for i in zip(df_tcc_quality['Product'], df_tcc_quality['Rating'])]
tcc_list_100 = df_corr_name['Product'].tolist()
L = []
for i in range(len(tcc_list_100)):
    for j in range(len(quality_list)):
        if tcc_list_100[i] == quality_list[j][0]:
            L.append(quality_list[j][1])
df_corr_name['Rating'] = L
df_corr_name.insert(2, 'Rating', df_corr_name.pop('Rating'))
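
As an aside, I suspect the nested loop above could be replaced by a dictionary lookup. A sketch using the frames already defined (rating_map is just an illustrative name):

rating_map = dict(zip(df_tcc_quality['Product'], df_tcc_quality['Rating']))
df_corr_name['Rating'] = df_corr_name['Product'].map(rating_map)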


#HEADERS
#Year
df_corr_name.loc[-1] = ['', '', ''] + [next(tenor_list) for i in range(len(df_corr_name))]
df_corr_name.iloc[-1] = df_corr_name.iloc[-1].astype(str)
df_corr_name.index = df_corr_name.index + 1 
df_corr_name = df_corr_name.sort_index()

#Name
df_corr_name.loc[-1] = ['', '', ''] + [ele for ele in name_list for i in range(rep)]
df_corr_name.iloc[-1] = df_corr_name.iloc[-1].astype(str)
df_corr_name.index = df_corr_name.index + 1  
df_corr_name = df_corr_name.sort_index()

#Quality
df_corr_name.loc[-1] = ['', '', ''] + L
df_corr_name.iloc[-1] = df_corr_name.iloc[-1].astype(str)
df_corr_name.index = df_corr_name.index + 1  
df_corr_name = df_corr_name.sort_index()


new_labels = pd.MultiIndex.from_arrays([df_corr_name.columns, df_corr_name.iloc[0], df_corr_name.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_name = df_corr_name.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)


#POPULATE CORRELATION
for i, j in df_corr_name.iterrows(): 
    i = df_corr_name.index.tolist()[0]
    while i <= len(df_corr_name.index):
        df_corr_name.iloc[i:i+5, i+3:i+8] = 1.0
        i += 5


for i, j in df_corr_name.iterrows():
    df_corr_name.iloc[i][i+1] = float(0)

The idea is that:

  • Values on the diagonal are 0
  • If a cell's row and column refer to the same product, its value is 1; otherwise 0.5 (a vectorized sketch of this rule follows below). Expected output:

[Image: Product correlation matrix should look like this]
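
In other words, the intended product matrix is equivalent to this loop-free sketch (labels and prod_corr are illustrative names; it assumes the same ordering as above, five years per product):

labels = np.repeat(data['Product'].unique(), 5)   # one product label per row/column
prod_corr = np.where(labels[:, None] == labels[None, :], 1.0, 0.5)
np.fill_diagonal(prod_corr, 0.0)                  # zeros on the diagonal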

Rating correlation matrix:

#Rating
# Start from a 10x10 grid of 1.0 (2 products x 5 years); column 0 is a dummy that gets dropped
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_quality = pd.DataFrame.from_records([[float(1)]*col]*row)
df_corr_quality = df_corr_quality.loc[:, df_corr_quality.columns != 0]
df_corr_quality

#CREATE NEW COLUMNS


#year
tenor_list = cycle(['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5'])
df_corr_quality['Year'] = [next(tenor_list) for i in range(len(df_corr_quality))]
df_corr_quality.insert(0, 'Year', df_corr_quality.pop('Year'))


#product
name_list = data['Product'].unique().tolist()
rep = 5
df_corr_quality['Product'] = [ele for ele in name_list for i in range(rep)]
df_corr_quality.insert(1, 'Product', df_corr_quality.pop('Product'))

#rating
df_tcc_quality = data[['Product', 'Rating']].drop_duplicates()
quality_list = [list(i) for i in zip(df_tcc_quality['Product'], df_tcc_quality['Rating'])]
tcc_list_100 = df_corr_quality['Product'].tolist()
L = []
for i in range(len(tcc_list_100)):
    for j in range(len(quality_list)):
        if tcc_list_100[i] == quality_list[j][0]:
            L.append(quality_list[j][1])
df_corr_quality['Rating'] = L
df_corr_quality.insert(2, 'Rating', df_corr_quality.pop('Rating'))


#HEADERS
#Year
df_corr_quality.loc[-1] = ['', '', ''] + [next(tenor_list) for i in range(len(df_corr_quality))]
df_corr_quality.iloc[-1] = df_corr_quality.iloc[-1].astype(str)
df_corr_quality.index = df_corr_quality.index + 1 
df_corr_quality = df_corr_quality.sort_index()

#Name
df_corr_quality.loc[-1] = ['', '', ''] + [ele for ele in name_list for i in range(rep)]
df_corr_quality.iloc[-1] = df_corr_quality.iloc[-1].astype(str)
df_corr_quality.index = df_corr_quality.index + 1  
df_corr_quality = df_corr_quality.sort_index()

#Quality
df_corr_quality.loc[-1] = ['', '', ''] + L
df_corr_quality.iloc[-1] = df_corr_quality.iloc[-1].astype(str)
df_corr_quality.index = df_corr_quality.index + 1  
df_corr_quality = df_corr_quality.sort_index()


new_labels = pd.MultiIndex.from_arrays([df_corr_quality.columns, df_corr_quality.iloc[0], df_corr_quality.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_quality = df_corr_quality.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)



#CHANGE CELL VALUE TO 0.8 IF ROW AND COLUMN RATINGS DIFFER
for i, j in df_corr_quality.iterrows():
    for k in range(3, len(df_corr_quality.columns)):
        if (df_corr_quality.columns[k][1] == 'Bad' and df_corr_quality.iloc[i,2] == 'Good') or (df_corr_quality.columns[k][1] == 'Good' and df_corr_quality.iloc[i,2] == 'Bad'):
            df_corr_quality.iloc[i][k-2] = 0.8


#POPULATE CORRELATION 0 AT DIAGONAL
for i, j in df_corr_quality.iterrows():
    df_corr_quality.iloc[i, i+3] = float(0)

The idea is that:

  • Diagonal values should be set to 0
  • In each cell, if both row and column have the same rating (i.e., both are "Good"), populate 1; otherwise 0.8 (i.e., if the row is "Good" and the column is "Bad", set 0.8). A vectorized sketch follows below. Output:

[Image: Expected rating correlation matrix]
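
As with the product matrix, the intended values are equivalent to this loop-free sketch (ratings and rating_corr are illustrative names; L is the per-row rating list built above):

ratings = np.array(L)                             # one rating label per row/column
rating_corr = np.where(ratings[:, None] == ratings[None, :], 1.0, 0.8)
np.fill_diagonal(rating_corr, 0.0)                # zeros on the diagonal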

Finally, I multiplied the "Price" column of the original data set with its transpose, and then with the element-wise product of these two matrices.

df_name = df_corr_name.iloc[:, 3:]        # drop the three label columns
df_quality = df_corr_quality.iloc[:, 3:]
df_pkl = df_name.to_numpy() * df_quality.to_numpy()

s = data[['Price']].to_numpy()            # column vector of prices, shape (10, 1)
v = df_pkl
t = np.multiply(s, s.transpose())         # pairwise products of prices, shape (10, 10)
u = np.multiply(t, v)
z = pd.DataFrame(u)
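
(Here s has shape (10, 1), so np.multiply(s, s.transpose()) broadcasts to the 10x10 outer product of the prices; np.outer(data['Price'], data['Price']) should be an equivalent spelling.)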

The final output is:

print(z)

[Image: Expected outputs]

The point is that if my data is limited to fewer than 1,000 rows, my code runs quite well. However, if I increase it to more than 10,000 rows, it seems to run endlessly: the running time exceeds 3 hours, and it eventually crashes. I'd assume the root cause is my loops in the matrix parts. Do you have better options to optimize mine?

  • Can you assume that for each year and for each product there will be exactly one series of data (i.e. one rating, one price, one quantity)? Commented Jul 8 at 16:34
  • Each product has 5 reported years; the rating is the same every year, but price and quantity vary each year.
    – Laura
    Commented Jul 8 at 17:41

2 Answers


I’d assume the root cause is my loops in the matrix parts.

Yes, looping is an anti-pattern in pandas, so iterrows should almost always be avoided.

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just build a numpy array.


Suggested approach:

Use vectorized pandas methods. Compute in long form and then reshape into your desired wide form:

  1. Merge into a long cross table
  2. Compute the weights and weighted prices
  3. Pivot into a wide cross table
  4. Mask the diagonal

Not only does this reduce your 100+ lines of code to ~15 lines:

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d['Product_w'] * d['Rating_w'],
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)

But it's also significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

And retains the dataframe labels:

>>> prices
Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0
  • Thank you, that works! Do you know a syntax that gives the sum of all the red values above?
    – Laura
    Commented Jul 9 at 7:49
  • Assuming this final dataframe is stored as weighted_prices, then weighted_prices.loc['Product 1', 'Product 1'].sum()
    – tdy
    Commented Jul 9 at 15:20
  • Thanks, I want to sum all values regardless of the year and product type. I found a solution here: weighted_prices.to_numpy().sum()
    – Laura
    Commented Jul 9 at 16:07

I don't understand why you're requesting review for this code. This feature branch is not yet ready to be merged down to main.

Please clean up the diagnostics, and re-submit as a new Question.


Pandas 2.2.2 output, under Python 3.12.4:

PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  .drop("index", axis=1)

FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_corr_name.iloc[i][i + 1] = float(0)
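
Applying the warning's advice to that line would mean a single indexing step, e.g. df_corr_name.iloc[i, i + 1] = 0.0 (a sketch keeping the same indices; I have not verified it against the rest of the pipeline).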

It would be helpful if Review Context or # comments described jargon such as "tenor" and "tcc quality". And apparently "pkl" is unrelated to pickle.

If there is some author you relied on when developing this code, then it would be very helpful to include a citation.

  • Thanks for your feedback. I'll add what libraries I used, explain the variables, and clarify what I did. Tenor, tcc, quality, and pkl are just variable names, nothing related to Python modules such as "pickle" as you mentioned.
    – Laura
    Commented Jul 8 at 17:44
  • I have resubmitted another question here: codereview.stackexchange.com/questions/292891/…
    – Laura
    Commented Jul 8 at 18:50
  • \$\begingroup\$ "resubmitted" Thank you. But I don't see any change in the symptom. Was the CoW diagnostic maybe not clear? It is a "gotcha!" that will trip people up often enough. I have seen my colleagues (A.) ignore it, and then (B.) struggle with incorrect output which stems from it. So I tend to view such a "warning" as a "fatal error", one which should be cleaned up immediately before moving on to implement new features. Are you using pandas 2.2.2? How best should we address the current diagnostics? \$\endgroup\$
    – J_H
    Commented Jul 8 at 19:01
  • Sorry, I don't think I get your points. I don't know which pandas version I installed; I simply typed pip install pandas to get it. I clarified the steps for compiling the two correlation matrices, and changed the variable names so as not to confuse readers. I also added all the Python libraries used in the code.
    – Laura
    Commented Jul 8 at 19:14
  • Typical way to check would be python --version and then pip freeze | grep pandas. Or after an import you can consult pd._version_meson.__version__. Are you telling me that your currently installed version of pandas does not output the diagnostic error messages I posted? We review "working code" on this site. It is important to first resolve those diagnostics before we continue on to the finer points.
    – J_H
    Commented Jul 8 at 19:17
