NULL imputation in training and testing data set [duplicate]

Question

I have a dataset that I'd like to use for classification. Some columns contain NULL values that I need to impute, and I plan to use either the median or the mean. Should I impute before splitting into train and test sets? Or should I split first, compute the median/mean on the training set only, and then use that training value to fill the NULLs in both the training and test sets?
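To make the second option concrete, here is a minimal sketch in plain Python. The column values and the helper names `fit_median` and `fill` are made up for illustration, and `None` stands in for NULL.

```python
from statistics import median

def fit_median(column):
    """Median of the non-missing values in a column (None marks a NULL)."""
    observed = [v for v in column if v is not None]
    return median(observed)

def fill(column, value):
    """Replace every None in the column with the given value."""
    return [value if v is None else v for v in column]

# Hypothetical feature column, already split into train and test.
train = [1.0, None, 3.0, 5.0]
test = [None, 10.0]

fill_value = fit_median(train)           # learned from the training rows only
train_imputed = fill(train, fill_value)
test_imputed = fill(test, fill_value)    # the test set reuses the training median
```

The test rows never contribute to `fill_value`; they only receive it.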

Is any of the variables you want to impute the dependent variable, or are they all independent variables? — Kodiologist, Commented Aug 11, 2017 at 17:21

Kodiologist · Accepted Answer · 2017-08-11 18:30:40Z

In a classification context, it's fine to impute values of the independent variables for all cases before the train–test split (so long as your imputation scheme ignores the dependent variable, as mean or median imputation would). The train–test split is only supposed to hide values of the dependent variable, not the independent variables.
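As a rough sketch of what "ignores the dependent variable" means here, assuming made-up rows of `(feature, label)` pairs with `None` marking a NULL feature: the fill value is computed from the feature column alone, and the split happens afterwards. The fixed cut point is purely illustrative; in practice the split would usually be random.

```python
from statistics import mean

# Hypothetical rows: (feature value, class label). None marks a NULL feature.
rows = [(2.0, "a"), (None, "b"), (4.0, "a"), (None, "b"), (6.0, "b")]

# The fill value comes from the feature column alone; labels are never read.
observed = [x for x, _ in rows if x is not None]
fill_value = mean(observed)

# Impute every row, leaving each label untouched.
imputed = [(fill_value if x is None else x, y) for x, y in rows]

# Split afterwards; a fixed cut is used here only to keep the example short.
train, test = imputed[:3], imputed[3:]
```

Because `fill_value` is a function of the features only, it would be the same whatever the labels were.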
