Goal: Here is a sample of a dataset with the columns "ID", "PHASENAME", "C_DAYS", "Multi_Factor", "DAY_COUNTER", and "DAILY_LABOR_PERCENT". For every combination of "ID", "PHASENAME", and "C_DAYS", I want to extend "DAY_COUNTER" from whatever its last day is up to 100 days, and do the same for "DAILY_LABOR_PERCENT".

currently

ID PHASENAME C_DAYS Multi_Factor DAY_COUNTER DAILY_LABOR_%
BAR C 14.0 7.142857 1 1.0
BAR C 14.0 7.142857 2 5.0
BAR C 14.0 7.142857 13 4.0
BAR C 14.0 7.142857 14 0.0

need it to be

ID PHASENAME C_DAYS Multi_Factor DAY_COUNTER DAILY_LABOR_%
BAR C 14.0 7.142857 1 .14
BAR C 14.0 7.142857 2 some number
BAR C 14.0 7.142857 99 some number
BAR C 14.0 7.142857 100 0.0

In this sample dataset, this person worked for 14 days, and their labor percentage is recorded for each day. Instead of having 14 days of labor, I want 100 days of labor (formulas: "DAY_COUNTER" * "Multi_Factor" and "DAILY_LABOR_PERCENT" / "Multi_Factor").

Also, I'm worried that once I enlarge this dataset, hard-coding the range as 100 won't work.

Looking for

  1. A fix for the repeating labor values: when I apply my logic, I am not getting the true labor percentage.
  2. Suggestions on how to approach this with a larger dataset.

code

import pandas as pd
import numpy as np


data={
    "ID": [ "BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR"],
    "PHASENAME": [ "C","C","C","C","C","C","C","C","C","C","C","C","C","C"],
    "C_DAYS": [ 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
    "Multi_Factor": [7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857],
    "DAY_COUNTER": [1,2,3,4,5,6,7,8,9,10,11,12,13,14],
    "DAILY_LABOR_PERCENT": [1.0,5.0,9.0,11.0,10.0,9.0,9.0,9.0,8.0,10.0,8.0,7.0,4.0,0.0],
    }

df=pd.DataFrame(data)
df1=df.copy()


n = 100
first_day = df["DAY_COUNTER"].iloc[0]

rep = np.tile(df.values, (n // len(df) + 1, 1))
out = pd.DataFrame(rep, columns=df.columns).iloc[:n]

out["DAY_COUNTER"] = range(first_day, n + first_day)


def labor_factor(row):
    # Scale one row's labor percentage by its multiplication factor.
    return row['DAILY_LABOR_PERCENT'] / row['Multi_Factor']

# A single apply over the rows is enough; wrapping it in a loop over
# range(len(out)) only recomputes the same column len(out) times.
out['new_labor'] = out.apply(labor_factor, axis=1)

  • Please show the full traceback of the error.
    – Barmar
    Commented Jul 3 at 21:52
  • The value of row is not a row, it's the string BAR.
    – Barmar
    Commented Jul 3 at 21:54
  • @Barmar I’m not sure if I’m understanding. I thought that my math function could create a new value and the iteration would have created new rows for that value in the new dataset.
    – Ty Kendall
    Commented Jul 3 at 22:09
  • The problem is how you're passing the arguments to the function. .apply() isn't passing the entire row, it's just passing one value.
    – Barmar
    Commented Jul 3 at 22:11
  • I think you may be confusing how pandas.DataFrame.apply and pandas.Series.apply work. .iloc returns a Series.
    – Barmar
    Commented Jul 3 at 22:12

1 Answer

As a general rule, you want to avoid iterating over dataframe rows, as that's very inefficient.

IIUC, you can instead repeat your dataframe using something vectorized like numpy.tile until you have the desired number of rows, then adjust the values accordingly:

n = 100
first_day = df["DAY_COUNTER"].iloc[0]

rep = np.tile(df.values, (n // len(df) + 1, 1))
out = pd.DataFrame(rep, columns=df.columns).iloc[:n]

out["DAY_COUNTER"] = range(first_day, n + first_day)

     ID PHASENAME C_DAYS Multi_Factor  DAY_COUNTER DAILY_LABOR_PERCENT
0   BAR         C   14.0     7.142857            1                 1.0
1   BAR         C   14.0     7.142857            2                 5.0
2   BAR         C   14.0     7.142857            3                 9.0
3   BAR         C   14.0     7.142857            4                11.0
4   BAR         C   14.0     7.142857            5                10.0
..  ...       ...    ...          ...          ...                 ...
95  BAR         C   14.0     7.142857           96                 7.0
96  BAR         C   14.0     7.142857           97                 4.0
97  BAR         C   14.0     7.142857           98                 0.0
98  BAR         C   14.0     7.142857           99                 1.0
99  BAR         C   14.0     7.142857          100                 5.0
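
For a larger dataset with several ID / PHASENAME combinations, the same tiling idea could be applied once per group. The following is only a minimal sketch under assumptions: the helper name expand_group and the two-group sample data are hypothetical and not part of the question. It tiles row positions rather than df.values, which keeps the numeric columns numeric instead of turning everything into object dtype.

```python
import numpy as np
import pandas as pd


def expand_group(group: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Tile one group's rows until it has n rows, then renumber DAY_COUNTER.

    Hypothetical helper for illustration only.
    """
    reps = -(-n // len(group))  # ceiling division: enough repeats to reach n rows
    positions = np.tile(np.arange(len(group)), reps)[:n]
    out = group.iloc[positions].reset_index(drop=True)
    first_day = group["DAY_COUNTER"].iloc[0]
    out["DAY_COUNTER"] = np.arange(first_day, first_day + n)
    return out


# Hypothetical sample with two groups of different lengths.
sample = pd.DataFrame({
    "ID": ["BAR"] * 3 + ["FOO"] * 2,
    "PHASENAME": ["C"] * 3 + ["D"] * 2,
    "DAY_COUNTER": [1, 2, 3, 1, 2],
    "DAILY_LABOR_PERCENT": [20.0, 50.0, 30.0, 60.0, 40.0],
})

# Expand every (ID, PHASENAME) group independently and stack the results.
expanded = pd.concat(
    [expand_group(g, n=100) for _, g in sample.groupby(["ID", "PHASENAME"], sort=False)],
    ignore_index=True,
)
```

Each group ends up with exactly n rows regardless of how many rows it started with, so n does not have to match any particular group length.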
  • Where is the multiplication by Multi_Factor?
    – Barmar
    Commented Jul 3 at 22:41
  • @Barmar it's not needed to "get day_counter to 100".
    – e-motta
    Commented Jul 3 at 22:48
  • @e-motta thanks, for teaching me about numpy tile and for the refresher about for loops. However, now, I am stuck with expanding the "DAILY_LABOR_PERCENT". Right now, the labor is repeating. I will update my problem and query above.
    – Ty Kendall
    Commented Jul 3 at 23:45
  • @Barmar thanks, for teaching me about numpy tile and for the refresher about for loops. However, now, I am stuck with expanding the "DAILY_LABOR_PERCENT". Right now, the labor is repeating. I will update my problem and query above.
    – Ty Kendall
    Commented Jul 3 at 23:45
  • @TyKendall If you need to divide column 'DAILY_LABOR_PERCENT' proportionately by how much the dataframe was enlarged, you can do out["DAILY_LABOR_PERCENT"] = out["DAILY_LABOR_PERCENT"] / (n / len(df)). n / len(df) in this case is 7.142857, but this way you make it dynamic and don't need to store a repeated value in another column.
    – e-motta
    Commented Jul 4 at 0:01
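
The scaling suggested in the last comment can be sketched end to end as follows. This is an illustration under assumptions, not the answerer's exact code: it uses only the two relevant columns from the question's data and tiles row positions so the labor column stays float.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DAY_COUNTER": range(1, 15),
    "DAILY_LABOR_PERCENT": [1.0, 5.0, 9.0, 11.0, 10.0, 9.0, 9.0,
                            9.0, 8.0, 10.0, 8.0, 7.0, 4.0, 0.0],
})

n = 100
factor = n / len(df)  # 100 / 14, roughly 7.142857 (the Multi_Factor value)

# Repeat the 14 rows until there are at least n of them, then keep the first n.
positions = np.tile(np.arange(len(df)), int(np.ceil(factor)))[:n]
out = df.iloc[positions].reset_index(drop=True)

# Renumber the days 1..n and scale the labor percentage down by the same factor.
out["DAY_COUNTER"] = np.arange(1, n + 1)
out["DAILY_LABOR_PERCENT"] = out["DAILY_LABOR_PERCENT"] / factor
```

With this data the first day becomes 1.0 / 7.142857 = 0.14, matching the expected value in the question. Because 100 is not a multiple of 14, the last cycle is cut short, so the scaled column sums to roughly 98.84 rather than exactly 100, and the values still repeat in a 14-day cycle.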
