Jul 1, 2022 at 14:30 comment added Henry Imputing typically involves using more complete information to inform the replacement of missing values (means, nearby points, or whatever). If you do this before splitting, then you are in effect using the test data to adjust the training data, and should expect this to bias your model towards appearing to be a better predictor on unseen data than it really is.
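For concreteness, here is a minimal sketch of the order of operations Henry describes, assuming scikit-learn and a pandas DataFrame `df` with a target column `"y"` (both hypothetical names): split first, then learn the imputation values from the training rows only.

```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Hypothetical data: `df` is a DataFrame of predictors plus a target "y".
X = df.drop(columns="y").to_numpy()
y = df["y"].to_numpy()

# Split FIRST, so the test rows never influence anything learned below.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Then learn the imputation values (here: column means) from the training
# set alone, and apply those same values to the test set.
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)  # statistics computed on train only
X_test = imputer.transform(X_test)        # no test statistics leak in
```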
Jul 1, 2022 at 12:26 comment added ADF I know this is an old thread, but can someone post a link to a formal description of why we need to impute after splitting the data? It strikes me that if imputation takes place only over the predictors with no reference to outcomes, and if a split is truly random, the sequence here might not make any difference at all.
May 29, 2018 at 14:20 comment added Henry @colorlace - that only makes sense if you have more new test data you have not seen before, which you might then use to test the revised model. Once you incorporate the original test data into the model, it ceases to be test data and becomes training data, and so cannot be used for testing.
May 29, 2018 at 14:06 comment added colorlace @Henry I agree you ARE using test data to affect training data (which in turn affects the model). But in this case one could conceivably continue to incorporate all the future data into the imputation calculation - incrementally updating the training data (and thus the model) as we get more data about the distribution of the input vars.
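What colorlace proposes here might look like the following hypothetical sketch, reusing `X_train` and the scikit-learn imports from the sketch above: refit the imputer on every row seen so far before transforming each new batch. Per Henry's caveat in the reply just above, any batch folded in this way stops being untouched test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

seen = X_train.copy()                  # all predictor rows seen so far
imputer = SimpleImputer(strategy="mean")

def impute_new_batch(batch: np.ndarray) -> np.ndarray:
    """Fold `batch` into the pool of seen rows, then impute it."""
    global seen
    seen = np.vstack([seen, batch])    # incorporate the new predictors
    imputer.fit(seen)                  # recompute means over ALL rows seen
    return imputer.transform(batch)    # batch is no longer untouched data
```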
May 22, 2018 at 22:41 comment added Alexis @Henry, your last comment should be the first two sentences of your answer. :)
May 8, 2018 at 21:56 comment added Henry @colorlace - if you use the test data to affect the training data, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place.
May 8, 2018 at 19:10 comment added colorlace I see. So, if we're assuming that the predictor variables come from the same distribution, why not just, upon getting a new batch of data to predict on, recompute the imputations [using all available data]?
May 8, 2018 at 19:05 comment added Henry @colorlace: that final point is precisely what I am saying: nothing you do with the training data should be informed by the test data (the analogy is that the future should not affect the past), but what you do with the test data can be informed by the training data (the analogy is that you can use the past to help predict the future).
May 7, 2018 at 22:54 comment added colorlace Are you suggesting that you can inform your test-set imputations with training data, but you can't inform training imputations with test data?
May 7, 2018 at 22:51 comment added colorlace If you "are free to incorporate what you learned from the training data", then how is that different from just not splitting before imputing?
May 5, 2018 at 14:16 comment added Henry @colorlace Use the past/future analogy. You used the training set in the past, and imputed some values. You now get the test set in the future, and want to impute some of its values; you presumably will use the same method as before, applied to the test data (though you are free to incorporate what you learned from the training data).
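In practice, Henry's past/future rule is often enforced automatically by bundling the imputer with the model in a pipeline, so that even cross-validation refits the imputer on each training fold only. A hedged sketch, reusing `X` and `y` from the first code sketch above, with `LogisticRegression` chosen purely for illustration:

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
# Within each CV fold, the imputer is fitted on that fold's training part
# and only applied to the held-out part: training is never informed by test.
scores = cross_val_score(pipe, X, y, cv=5)
```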
May 5, 2018 at 4:22 comment added colorlace When you say "you do it the same way on both sets", do you mean: "use the same method to impute missing data in the test set, but NOT the same data"?