
I have a data set with N ≈ 9000 and about 50% missingness on at least one important variable. There are more than 50 continuous variables, and for each of them the values above the 95th percentile are drastically larger than the values below it. So I want to cap each variable at its respective 95th percentile. Should I do this before or after the train-test split?

My view is that I should do it after the split, but one concern is that the extreme values might not appear in the training set to begin with.

I'm working in Python.


1 Answer


Capping before the split would leak data from the test set into the training set: the 95th-percentile thresholds would be computed partly from test-set values.

You should compute the caps on the training set (after the split). Then you can apply those same train-derived thresholds to the test set.

The point is that the training procedure should be treated as if it doesn't have the benefit of future knowledge (the test data).
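Here's a minimal sketch of that workflow in pandas, assuming your data is in a DataFrame and your continuous variables are listed in `continuous_cols` (the data and column names here are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example data: stand-ins for your real DataFrame and
# your list of ~50 continuous columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.lognormal(size=(9000, 3)), columns=["x1", "x2", "x3"])
continuous_cols = ["x1", "x2", "x3"]

# Split first, so the caps never see the test data.
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Compute the 95th-percentile cap for each column on the training set only.
caps = train[continuous_cols].quantile(0.95)

# Apply the train-derived caps to both sets; clip(upper=...) leaves values
# below the threshold untouched and replaces values above it with the cap.
train[continuous_cols] = train[continuous_cols].clip(upper=caps, axis=1)
test[continuous_cols] = test[continuous_cols].clip(upper=caps, axis=1)
```

This also addresses the concern in the question: if a test-set value exceeds the training-set threshold, it is simply clipped down to that threshold, which is exactly what would happen to an extreme value arriving at prediction time. If you later move to cross-validation, wrapping the capping step in a scikit-learn `Pipeline` (e.g. via a custom transformer) preserves the same no-leakage guarantee inside each fold.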

