
I have a piece of code to calculate price sensitivity based on the product and its rating.

Below is the original data set with product type, reporting year, customer rating, price per unit, and quantity. It contains 2 products, Product 1 and Product 2, each with price and quantity information for 5 consecutive years.

Input data:

import numpy as np
import pandas as pd
from itertools import cycle

data = pd.DataFrame(
    [['Product 1', 'Year 1', 'Good', 34, 7],
     ['Product 1', 'Year 2', 'Good', 22, 5],
     ['Product 1', 'Year 3', 'Good', 30, 2],
     ['Product 1', 'Year 4', 'Good', 50, 1],
     ['Product 1', 'Year 5', 'Good', 44, 103],
     ['Product 2', 'Year 1', 'Bad', 200, 12],
     ['Product 2', 'Year 2', 'Bad', 103, 50],
     ['Product 2', 'Year 3', 'Bad', 150, 192],
     ['Product 2', 'Year 4', 'Bad', 309, 20],
     ['Product 2', 'Year 5', 'Bad', 200, 12]],
    columns=['Product', 'Year', 'Rating', 'Price', 'Quantity'])

I then created 2 correlation matrices: a product matrix and a rating matrix.

Product correlation matrix:

#start from a 10 x 11 frame of 0.5 (5 years x 2 products), then drop column 0 to keep 10 data columns
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_name = pd.DataFrame.from_records([[0.5]*col]*row)
df_corr_name = df_corr_name.loc[ : , df_corr_name.columns != 0]
df_corr_name

#CREATE NEW COLUMNS


#year
tenor_list = cycle(['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5'])
df_corr_name['Year'] = [next(tenor_list) for i in range(len(df_corr_name))]
df_corr_name.insert(0, 'Year', df_corr_name.pop('Year'))


#product
name_list = data['Product'].unique().tolist()
rep = 5
df_corr_name['Product'] = [ele for ele in name_list for i in range(rep)]
df_corr_name.insert(1, 'Product', df_corr_name.pop('Product'))

#rating
df_tcc_quality = data[['Product', 'Rating']].drop_duplicates()
quality_list = [list(i) for i in zip(df_tcc_quality['Product'], df_tcc_quality['Rating'])]
tcc_list_100 = df_corr_name['Product'].tolist()
L = []
for i in range(len(tcc_list_100)):
    for j in range(len(quality_list)):
        if tcc_list_100[i] == quality_list[j][0]:
            L.append(quality_list[j][1])
df_corr_name['Rating'] = L
df_corr_name.insert(2, 'Rating', df_corr_name.pop('Rating'))


#HEADERS
#Year
df_corr_name.loc[-1] = ['', '', ''] + [next(tenor_list) for i in range(len(df_corr_name))]
df_corr_name.iloc[-1] = df_corr_name.iloc[-1].astype(str)
df_corr_name.index = df_corr_name.index + 1 
df_corr_name = df_corr_name.sort_index()

#Name
df_corr_name.loc[-1] = ['', '', ''] + [ele for ele in name_list for i in range(rep)]
df_corr_name.iloc[-1] = df_corr_name.iloc[-1].astype(str)
df_corr_name.index = df_corr_name.index + 1  
df_corr_name = df_corr_name.sort_index()

#Quality
df_corr_name.loc[-1] = ['', '', ''] + L
df_corr_name.iloc[-1] = df_corr_name.iloc[-1].astype(str)
df_corr_name.index = df_corr_name.index + 1  
df_corr_name = df_corr_name.sort_index()


new_labels = pd.MultiIndex.from_arrays([df_corr_name.columns, df_corr_name.iloc[0], df_corr_name.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_name = df_corr_name.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)


#POPULATE CORRELATION: set each 5 x 5 same-product block to 1
for i, j in df_corr_name.iterrows(): 
    i = df_corr_name.index.tolist()[0]
    while i <= len(df_corr_name.index):
        df_corr_name.iloc[i:i+5, i+3:i+8] = 1.0
        i += 5


#set the diagonal to 0
for i, j in df_corr_name.iterrows():
    df_corr_name.iloc[i][i+1] = float(0)

The idea is that:

  • Values on the diagonal are 0
  • If a cell's row and column belong to the same product, its value is 1; otherwise it is 0.5; see the cut-down illustration below.
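
In place of the screenshot, here is a hand-written, cut-down illustration of the intended pattern (2 products × 2 years instead of 2 × 5; the names below are only for illustration):

    import pandas as pd

    labels = [('Product 1', 'Year 1'), ('Product 1', 'Year 2'),
              ('Product 2', 'Year 1'), ('Product 2', 'Year 2')]
    idx = pd.MultiIndex.from_tuples(labels, names=['Product', 'Year'])
    #0 on the diagonal, 1 within the same product, 0.5 across products
    expected = pd.DataFrame(
        [[0.0, 1.0, 0.5, 0.5],
         [1.0, 0.0, 0.5, 0.5],
         [0.5, 0.5, 0.0, 1.0],
         [0.5, 0.5, 1.0, 0.0]],
        index=idx, columns=idx)
    print(expected)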

Rating correlation matrix:

#Rating
#start from a 10 x 11 frame of 1.0, then drop column 0 to keep 10 data columns
row, col = 5 * len(data['Product'].unique().tolist()), 5 * len(data['Product'].unique().tolist()) + 1
df_corr_quality = pd.DataFrame.from_records([[float(1)]*col]*row)
df_corr_quality = df_corr_quality.loc[ : , df_corr_quality.columns != 0]
df_corr_quality

#CREATE NEW COLUMNS


#year
tenor_list = cycle(['Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5'])
df_corr_quality['Year'] = [next(tenor_list) for i in range(len(df_corr_quality))]
df_corr_quality.insert(0, 'Year', df_corr_quality.pop('Year'))


#product
name_list = data['Product'].unique().tolist()
rep = 5
df_corr_quality['Product'] = [ele for ele in name_list for i in range(rep)]
df_corr_quality.insert(1, 'Product', df_corr_quality.pop('Product'))

#rating
df_tcc_quality = data[['Product', 'Rating']].drop_duplicates()
quality_list = [list(i) for i in zip(df_tcc_quality['Product'], df_tcc_quality['Rating'])]
tcc_list_100 = df_corr_quality['Product'].tolist()
L = []
for i in range(len(tcc_list_100)):
    for j in range(len(quality_list)):
        if tcc_list_100[i] == quality_list[j][0]:
            L.append(quality_list[j][1])
df_corr_quality['Rating'] = L
df_corr_quality.insert(2, 'Rating', df_corr_quality.pop('Rating'))


#HEADERS
#Year
df_corr_quality.loc[-1] = ['', '', ''] + [next(tenor_list) for i in range(len(df_corr_quality))]
df_corr_quality.iloc[-1] = df_corr_quality.iloc[-1].astype(str)
df_corr_quality.index = df_corr_quality.index + 1 
df_corr_quality = df_corr_quality.sort_index()

#Name
df_corr_quality.loc[-1] = ['', '', ''] + [ele for ele in name_list for i in range(rep)]
df_corr_quality.iloc[-1] = df_corr_quality.iloc[-1].astype(str)
df_corr_quality.index = df_corr_quality.index + 1  
df_corr_quality = df_corr_quality.sort_index()

#Quality
df_corr_quality.loc[-1] = ['', '', ''] + L
df_corr_quality.iloc[-1] = df_corr_quality.iloc[-1].astype(str)
df_corr_quality.index = df_corr_quality.index + 1  
df_corr_quality = df_corr_quality.sort_index()


new_labels = pd.MultiIndex.from_arrays([df_corr_quality.columns, df_corr_quality.iloc[0], df_corr_quality.iloc[1]], names=['Year', 'Rating', 'Product'])
df_corr_quality = df_corr_quality.set_axis(new_labels, axis=1).iloc[3:].reset_index().drop('index', axis = 1)



#CHANGE CELL VALUE TO 0.8 IF THE ROW AND COLUMN RATINGS DIFFER
for i, j in df_corr_quality.iterrows():
    for k in range(3, len(df_corr_quality.columns)):
        if (df_corr_quality.columns[k][1] == 'Bad' and df_corr_quality.iloc[i,2] == 'Good') or (df_corr_quality.columns[k][1] == 'Good' and df_corr_quality.iloc[i,2] == 'Bad'):
            df_corr_quality.iloc[i][k-2] = 0.8


#POPULATE CORRELATION 0 AT DIAGONAL
            

for i, j in df_corr_quality.iterrows():
    df_corr_quality.iloc[i, i+3] = float(0)

The idea is that:

  • Diagonal values should be set to 0
  • If the row and the column have the same rating (i.e., both are "Good"), the cell is 1; otherwise it is 0.8 (e.g., row "Good" and column "Bad" gives 0.8); see the cut-down illustration below.
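
In place of the screenshot, a hand-written, cut-down illustration of the intended rating matrix (Product 1 is "Good", Product 2 is "Bad"; 2 years each instead of 5; the names below are only for illustration):

    import pandas as pd

    labels = [('Product 1', 'Good', 'Year 1'), ('Product 1', 'Good', 'Year 2'),
              ('Product 2', 'Bad', 'Year 1'), ('Product 2', 'Bad', 'Year 2')]
    idx = pd.MultiIndex.from_tuples(labels, names=['Product', 'Rating', 'Year'])
    #0 on the diagonal, 1 where the ratings match, 0.8 where they differ
    expected_rating = pd.DataFrame(
        [[0.0, 1.0, 0.8, 0.8],
         [1.0, 0.0, 0.8, 0.8],
         [0.8, 0.8, 0.0, 1.0],
         [0.8, 0.8, 1.0, 0.0]],
        index=idx, columns=idx)
    print(expected_rating)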

Finally, I multiplied the "Price" column of the original data set by its transpose (giving a price × price grid) and then by the element-wise product of these 2 matrices.

    #keep only the numeric correlation values (drop the 3 label columns)
    df_name = df_corr_name.iloc[:, 3:]
    df_quality = df_corr_quality.iloc[:, 3:]
    df_pkl = df_name.to_numpy() * df_quality.to_numpy()

    s = data[['Price']].to_numpy()       #price column as a (10, 1) array
    v = df_pkl
    t = np.multiply(s, s.transpose())    #outer product of prices, shape (10, 10)
    u = np.multiply(t, v)
    z = pd.DataFrame(u)
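
In other words, each cell of z is meant to be the two prices multiplied by the two correlations for that row/column pair (spelled out here only to clarify the intent):

    #intended meaning of cell (i, j) of z
    #z[i, j] = Price[i] * Price[j] * df_name[i, j] * df_quality[i, j]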

The point is, if my data is limited to fewer than 1,000 rows, my code runs quite well. However, once I increase it to more than 10,000 rows, it seems to loop endlessly: the running time exceeds 3 hours and it eventually crashes. I assume the root cause is the loops in the matrix parts. Do you have better options to optimize this?

Many thanks in advance.

  • Is there any numpy syntax to shorten the code and get rid of the loops?
    – Laura
    Commented Jul 8 at 8:50
  • You can find some ideas for performance improvement in this answer. Another idea is to use polars, which is built with more efficiency in mind. Commented Jul 8 at 9:15
  • I don't know how to replace the iterrows with the suggestions in the post... Do you have a solution?
    – Laura
    Commented Jul 8 at 9:41
  • It would help if you could point out the part which is taking the most time, after that one could investigate. Commented Jul 9 at 10:51
  • You really need a minimal reproducible example. Nobody's going to be able to parse all that code to find the bit that's maybe slow.
    – Daniel F
    Commented Jul 11 at 6:22

