How to create consistent dummy variables in Inference code?

Question

I am using pd.get_dummies on a categorical column to create dummy variables.

The training pipeline looks something like this:

Normalization
Dummy variable Creation
Training

My training data has a column, Months, for which I have created dummy variables.

How should I handle this in the inference code? For example, the test data might not contain every month, so pd.get_dummies would not create all the columns the model expects.

Please guide me on this.

nwaldo · Accepted Answer · 2024-06-28 17:33:44Z

Your test frame needs to have the same columns as the training set. You can achieve this as follows.

This code creates sample data:

import pandas as pd

# Initialize a list of lists
data = [['tom', 10], ['carol', 15], ['juli', 14], ['winston', 20], ['carol', 11], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Test_Score'])

train = df.loc[0:2]
test  = df.loc[3:]

train = pd.get_dummies(train)
test = pd.get_dummies(test)

print("test before")
print(test)

This code adjusts the test set so that its columns match those of the training set:

# Find the columns that are in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Put the test columns in the same order as the training columns
# (this also drops any columns that appear only in the test set)
test = test[train.columns]

print("test after")
print(test)

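The code above identifies the columns that exist in the training set but are missing from the test set, and adds each of them to the test set filled with 0. The final selection, test[train.columns], then puts the columns in the training order and drops any dummy column that appears only in the test set (here Name_winston), since the model never saw it during training.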
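The same alignment can probably be written more compactly with DataFrame.reindex. This is a minimal sketch on the same sample data, not a replacement for the code above. It assumes dtype=int is passed to pd.get_dummies so that the added zero columns have the same dtype as the real dummy columns; the name test_aligned is only illustrative.

```python
import pandas as pd

data = [['tom', 10], ['carol', 15], ['juli', 14],
        ['winston', 20], ['carol', 11], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Test_Score'])

# dtype=int is an assumption of this sketch: it keeps the dummy columns
# numeric so they match the zero-filled columns added below.
train = pd.get_dummies(df.loc[0:2], dtype=int)
test = pd.get_dummies(df.loc[3:], dtype=int)

# reindex adds the columns missing from test (filled with 0), drops the
# columns that exist only in test, and uses the column order of train.
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(test_aligned)
```

After this step test_aligned has exactly the columns of train, in the same order, so it can be passed to a model fitted on train.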


I may not have access to Train data while running the inference code pipeline. — Sociopath, Commented Jul 1 at 4:07
could you share part of your code? are you using an sklearn estimator? or their pipeline object? — nwaldo, Commented Jul 1 at 19:22
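If the training data is not available at inference time, as the first comment describes, one option is to save only the list of dummy column names when training and reload it in the inference pipeline. The following is a rough sketch under assumptions: the Months and Sales columns, the file name dummy_columns.json, and its location in the temporary directory are all hypothetical, and JSON is just one convenient storage format.

```python
import json
import os
import tempfile

import pandas as pd

# --- Training time ---
train_raw = pd.DataFrame({'Months': ['Jan', 'Feb', 'Mar'],
                          'Sales': [10, 12, 9]})
train = pd.get_dummies(train_raw, dtype=int)

# Hypothetical path; in practice store this next to the trained model.
columns_path = os.path.join(tempfile.gettempdir(), 'dummy_columns.json')
with open(columns_path, 'w') as f:
    json.dump(list(train.columns), f)

# --- Inference time (training data not available) ---
with open(columns_path) as f:
    expected_columns = json.load(f)

new_raw = pd.DataFrame({'Months': ['Feb', 'Dec'], 'Sales': [7, 11]})
# Months seen in training but absent here become zero columns; 'Dec',
# which was never seen in training, is dropped by the reindex.
new = pd.get_dummies(new_raw, dtype=int).reindex(
    columns=expected_columns, fill_value=0)

print(new)
```

With this approach a row whose month was never seen in training ends up with zeros in every Months_ column, so it is worth checking whether that is acceptable for the model in question.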
