You should split before pre-processing or imputing.

The division between training and test sets is an attempt to replicate the situation where you have past information and are building a model which you will test on future, as-yet-unknown information: the training set takes the place of the past and the test set takes the place of the future, so you only get to test your trained model once.

Keeping the past/future analogy in mind, anything you do to pre-process or process your data, such as imputing missing values, should be done on the training set alone. You can then record what you did to your training set so that, if your test set also needs pre-processing or imputing, you do it the same way on both sets.
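As a minimal sketch of this idea (assuming scikit-learn's `train_test_split` and `SimpleImputer`, which are not part of the original answer), the key point is to fit the imputer on the training data only, and then reuse the already-fitted imputer on the test data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Toy data with a missing value (illustrative only)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Split first, before any imputation or other pre-processing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Fit the imputer on the training set alone...
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)

# ...then apply the same fitted imputer to the test set, so the test
# data never influences the imputation statistics
X_test_imputed = imputer.transform(X_test)
```

The same pattern (fit on the training set, transform both sets) applies to scaling, encoding, and any other pre-processing step.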

Added from comments: if you use the test data to affect the training data, then the test data is being used to build your model, so it ceases to be test data and will not provide a fair test of your model. You risk overfitting, and it was to discourage this that you separated out the test data in the first place.
