
I have a data set with N ≈ 5000, and about half the cases are missing values on at least one important variable. The main analytic method will be Cox proportional hazards regression.

I plan to use multiple imputation. I will also be splitting the data into a training set and a test set.

Should I split the data and then impute separately, or impute and then split?

If it matters, I will be using PROC MI in SAS.

  • 50% missing values for a crucial variable? Ugh. Rather than impute, why not create a 'Missing' category for the variable? – RobertF (Apr 24, 2014)
  • No one variable has 50% missing, but about 50% of cases are missing on at least one. Also, the variables are continuous, so a 'Missing' category would mess things up. – Peter Flom (Apr 24, 2014)
  • Ah. I get nervous using imputation. I wonder about the merits of having a continuous variable with 50% of its values imputed vs. converting the continuous variable to categorical with a 'Missing' category plus enough bins to capture the behavior of the non-missing values. – RobertF (Apr 24, 2014)
  • I don't like binning continuous variables. – Peter Flom (Apr 24, 2014)

3 Answers


You should split before pre-processing or imputing.

The division between training and test set is an attempt to replicate the situation where you have past information and are building a model which you will test on future as-yet unknown information: the training set takes the place of the past and the test set takes the place of the future, so you only get to test your trained model once.

Keeping the past/future analogy in mind, this means that anything you do to pre-process or transform your data, such as imputing missing values, should be done on the training set alone. You can then record what you did to the training set and, if the test set also needs pre-processing or imputing, do it the same way on both sets, as sketched below.
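To make the "record what you did" step concrete, here is a minimal sketch in Python with scikit-learn (an assumption on my part; the question itself uses SAS PROC MI, and a simple mean imputer stands in for multiple imputation). The imputation parameters are estimated from the training set only and then reused, unchanged, on the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[rng.random(X.shape) < 0.2] = np.nan      # make ~20% of entries missing

# Split first: the test set plays the role of the as-yet-unseen future.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # means estimated on train only
X_test_imp = imputer.transform(X_test)        # the same training means reused
```

The same fit-on-train/apply-to-test discipline carries over to fancier imputation schemes: whatever the imputation model learns, it should learn from the training rows only.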

Added from comments: if you use the test data to affect the training data, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place.

  • When you say "you do it the same way on both sets", do you mean "use the same method to impute missing data in the test set, but NOT the same data"? – colorlace (May 5, 2018)
  • @colorlace Use the past/future analogy. You used the training set in the past and imputed some values. You now get the test set in the future and want to impute some of its values; you presumably will use the same method as before, applied to the test data (though you are free to incorporate what you learned from the training data). – Henry (May 5, 2018)
  • If you "are free to incorporate what you learned from the training data", then how is that different from just not splitting before imputing? – colorlace (May 7, 2018)
  • Are you suggesting you can inform your test-set imputations with training data, but you can't inform training imputations with test data? – colorlace (May 7, 2018)
  • @colorlace: that final point is precisely what I am saying: nothing you do with the training data should be informed by the test data (the analogy is that the future should not affect the past), but what you do with the test data can be informed by the training data (the analogy is that you can use the past to help predict the future). – Henry (May 8, 2018)

I think you had better split before you do imputation. For instance, suppose you impute missing values with the column mean. If you impute first on the combined train+validation data and split afterwards, then you have used the validation data before building your model, which is exactly how a data-leakage problem arises.

But you might ask: if I impute after splitting, won't it be tedious when I need to do cross-validation? My suggestion is to use an sklearn Pipeline. It really simplifies your code and reduces the chance of making a mistake (see the sketch below and the Pipeline documentation).
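A minimal sketch of that suggestion, under some assumptions of mine: synthetic data, mean imputation standing in for multiple imputation, and logistic regression standing in for the Cox model (plain scikit-learn has no Cox proportional hazards estimator). Because the imputer lives inside the Pipeline, cross-validation re-fits it on the training folds of each split, so no held-out fold leaks into the imputation:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.3] = np.nan        # ~30% of entries missing
y = rng.integers(0, 2, size=200)             # toy binary outcome

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # re-fit per training fold
    ("model", LogisticRegression()),
])

# The whole pipeline, imputation included, is fit within each fold,
# so the held-out fold never informs the imputation.
print(cross_val_score(pipe, X, y, cv=5))
```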


Just to add to the above, I would also favour splitting before imputing or any other type of pre-processing. Nothing you do with the training data should be informed by the test data (the analogy is that the future should not affect the past). You can then record what you did to your training set and, if your test set also needs pre-processing or imputing, do it the same way on both sets (the analogy is that you can use the past to help predict the future).

If you use the test data to affect the training data in any way, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place!

I think the caret package in R is very useful in this setting. In particular, I found this post extremely helpful: https://topepo.github.io/caret/model-training-and-tuning.html
