
I have a data set with N ≈ 9000 and about 50% missingness on at least one important variable. There are more than 50 continuous variables, and for each of them the values above the 95th percentile are drastically larger than the values below it. So I want to cap each variable at its respective 95th percentile. Should I do this before or after the train-test split?

My view is that I should do it after the split, but one concern is that the extreme values might not appear in the training set to begin with.

I'm working in Python.


1 Answer


Capping before the split would leak data from the test set into the training set: the 95th-percentile thresholds would be computed partly from test-set values.

You should compute the caps on the training set (after the split). Then you can apply those same train-derived thresholds to the test set.

The point is that the training procedure should be treated as if it doesn't have the benefit of future knowledge (the test data).
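Here's a minimal sketch of that workflow in pandas, assuming your data is in a DataFrame and your continuous variables are listed in `continuous_cols` (the data and column names here are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example data: stand-ins for your real DataFrame and
# your list of ~50 continuous columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.lognormal(size=(9000, 3)), columns=["x1", "x2", "x3"])
continuous_cols = ["x1", "x2", "x3"]

# Split first, so the caps never see the test data.
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Compute the 95th-percentile cap for each column on the training set only.
caps = train[continuous_cols].quantile(0.95)

# Apply the train-derived caps to both sets; clip(upper=...) leaves values
# below the threshold untouched and replaces values above it with the cap.
train[continuous_cols] = train[continuous_cols].clip(upper=caps, axis=1)
test[continuous_cols] = test[continuous_cols].clip(upper=caps, axis=1)
```

This also addresses the concern in the question: if a test-set value exceeds the training-set threshold, it is simply clipped down to that threshold, which is exactly what would happen to an extreme value arriving at prediction time. If you later move to cross-validation, wrapping the capping step in a scikit-learn `Pipeline` (e.g. via a custom transformer) preserves the same no-leakage guarantee inside each fold.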

