1
$\begingroup$

I have a dataset that I'd like to use for classification. Some columns contain NULL values that I need to impute, and I want to impute them with either the median or the mean. What I want to know is which order is correct: should I impute with the median/mean before splitting into train and test sets? Or should I split first, compute the median/mean on the training set, and then apply that same value to the test set?
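
To make the second option concrete, here is a minimal sketch in plain Python (standard library only). The data, the 70/30 split, and the helper names `fit_median` and `fill_missing` are made up for illustration, and `None` stands in for NULL.

```python
import random
from statistics import median


def fit_median(values):
    """Median of the non-missing entries (None marks a missing value)."""
    return median(v for v in values if v is not None)


def fill_missing(values, fill_value):
    """Replace each None with fill_value, leaving other entries unchanged."""
    return [fill_value if v is None else v for v in values]


# Hypothetical feature column with missing entries, made up for illustration.
feature = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 100.0, None, 10.0]

# Split first (70/30 is an arbitrary choice), using a fixed seed.
rng = random.Random(0)
indices = list(range(len(feature)))
rng.shuffle(indices)
cut = int(0.7 * len(indices))
train = [feature[i] for i in indices[:cut]]
test = [feature[i] for i in indices[cut:]]

# Learn the fill value from the training rows only ...
train_median = fit_median(train)

# ... and reuse that same value for both sets.
train_imputed = fill_missing(train, train_median)
test_imputed = fill_missing(test, train_median)
```

The first option would instead compute the median over `feature` as a whole and split afterwards.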

$\endgroup$
2
  • $\begingroup$ Is any of the variables you want to impute the dependent variable, or are they all independent variables? $\endgroup$ Commented Aug 11, 2017 at 17:21
  • $\begingroup$ They are the independent (feature) variables. $\endgroup$
    – HHH
    Commented Aug 11, 2017 at 18:28

1 Answer

0
$\begingroup$

In a classification context, it's fine to impute values of the independent variables for all cases before the train–test split (so long as your imputation scheme ignores the dependent variable, as mean or median imputation would). The train–test split is only supposed to hide values of the dependent variable, not the independent variables.
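
As a rough sketch of the order described above, assuming plain Python with `None` marking a missing value: the fill value is computed from the feature column alone, and the split happens afterwards. The rows, the 70/30 split and the variable names are invented for illustration.

```python
import random
from statistics import median

# Hypothetical rows of (feature, label); None marks a missing feature value.
rows = [(1.0, 0), (None, 1), (3.0, 0), (4.0, 1), (None, 0),
        (6.0, 1), (7.0, 0), (100.0, 1), (None, 1), (10.0, 0)]

# The fill value is computed from the feature alone, over all rows.
# The labels are never read at this step.
observed = [x for x, _ in rows if x is not None]
overall_median = median(observed)  # 6.0 for this data
imputed = [(overall_median if x is None else x, y) for x, y in rows]

# Only now are the rows divided into training and test sets.
rng = random.Random(0)
rng.shuffle(imputed)
cut = int(0.7 * len(imputed))
train, test = imputed[:cut], imputed[cut:]
```

Swapping `median` for `statistics.mean` gives mean imputation with the same structure.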

$\endgroup$
