$\begingroup$

I am using pd.get_dummies on a categorical column to create dummy variables.

The training pipeline looks like this:

  1. Normalization
  2. Dummy variable creation
  3. Training

Now my training data has a column Months for which I have created dummy variables.

How should I handle this in the inference code? For example, the test data might not contain every month, so pd.get_dummies would not create all the columns the model expects.

Please guide me on this.

$\endgroup$

1 Answer

$\begingroup$

Your test frame needs to have the same columns as the training set, in the same order. You can achieve this as follows:

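This code creates sample data:

import pandas as pd

# Initialize a list of [name, score] rows
data = [['tom', 10], ['carol', 15], ['juli', 14], ['winston', 20], ['carol', 11], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Test_Score'])

# .loc slicing is label-based and inclusive: rows 0-2 go to train, rows 3-5 to test
train = df.loc[0:2]
test  = df.loc[3:]

# Encode each split separately, so the dummy columns can differ
train = pd.get_dummies(train)
test = pd.get_dummies(test)

print("test before")
print(test)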

This code adjusts the test set so that its columns match those of the training set:

# Find columns that exist in the training set but are missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Keep only the training columns, in the same order as in the training set
test = test[train.columns]

print("test after")
print(test)

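The code above identifies the columns that are present in the training set but missing from the test set (here Name_tom) and adds them to the test set filled with 0. The final reordering step also drops any test columns that never appeared in training (here Name_winston), so the test frame ends up with exactly the training columns.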
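If the training data is not available at inference time, one option is to save only the list of encoded column names during training and align against that list later. The following is a minimal sketch of that idea, not a definitive implementation: the Month and Sales columns and the columns_json variable are hypothetical, and the JSON string stands in for a file that would be written at training time and read back at inference time. It uses DataFrame.reindex, which does the add-missing, drop-extra and reorder steps in one call.

```python
import json

import pandas as pd

# --- Training time -------------------------------------------------------
# Hypothetical training data; only the encoded column names are kept.
train = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"], "Sales": [10, 15, 14]})
train_encoded = pd.get_dummies(train, columns=["Month"], dtype=int)

# Assumption: in a real pipeline this string would be written to a file
# and shipped alongside the trained model.
columns_json = json.dumps(list(train_encoded.columns))

# --- Inference time ------------------------------------------------------
# Only the saved column list is needed here, not the training data.
expected_columns = json.loads(columns_json)

# Hypothetical test data: "Jan" and "Mar" are absent, "Apr" was never seen.
test = pd.DataFrame({"Month": ["Feb", "Apr"], "Sales": [20, 11]})
test_encoded = pd.get_dummies(test, columns=["Month"], dtype=int)

# reindex adds the missing dummy columns filled with 0, drops columns
# that were not present in training, and restores the training order.
test_aligned = test_encoded.reindex(columns=expected_columns, fill_value=0)

print(test_aligned)
```

Note that a category unseen in training (Apr in this sketch) ends up as a row with all month dummies set to 0, so whether that is acceptable depends on the model.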


$\endgroup$
  • $\begingroup$ I may not have access to the training data while running the inference pipeline. $\endgroup$
    – Sociopath
    Commented Jul 1 at 4:07
  • $\begingroup$ Could you share part of your code? Are you using an sklearn estimator or their Pipeline object? $\endgroup$
    – nwaldo
    Commented Jul 1 at 19:22
