edited May 22, 2018 at 23:38

I think you'd better split before you do imputation. For instance, you may want to impute missing values with the column mean. In this case, if you impute first with the train+valid data set and split next, then you have used the validation data set before your model is built, which is how a data leakage problem comes into the picture.
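To make the leakage concrete, here is a minimal sketch using NumPy. The toy column, the 4/2 split, and all variable names are assumptions chosen for illustration, not part of the original answer. It computes the mean from the training rows only and reuses it for the validation rows, then computes the mean over all rows for contrast.

```python
import numpy as np

# Toy feature column with missing values (hypothetical data).
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])

# Split first: the first four rows are training, the last two are validation.
X_train, X_valid = X[:4], X[4:]

# Learn the imputation statistic from the training rows only.
train_mean = np.nanmean(X_train, axis=0)

# Fill missing entries in both splits with the training mean.
X_train_imp = np.where(np.isnan(X_train), train_mean, X_train)
X_valid_imp = np.where(np.isnan(X_valid), train_mean, X_valid)

# For contrast: a mean computed over all rows also depends on the
# validation rows, so imputing with it would leak validation information.
leaky_mean = np.nanmean(X, axis=0)

print("train-only mean:", train_mean[0])
print("mean over train+valid:", leaky_mean[0])
```

In this toy case the two means differ a lot because the validation split contains an extreme value, so training rows imputed with the all-rows mean would carry information from data the model is not supposed to have seen.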

But you might ask: if I impute after splitting, won't it be too tedious when I need to do cross-validation? My suggestion for that is to use an sklearn pipeline. It really simplifies your code and reduces the chance of making a mistake. See Pipeline
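A minimal sketch of the pipeline approach might look like the following. The synthetic data, the step names, and the choice of `LogisticRegression` are assumptions made for illustration. It also uses `SimpleImputer`, the imputer class in current scikit-learn releases, which may differ from what was available when this answer was written.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical): 100 rows, 3 features, binary target.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values.
missing = rng.rand(*X.shape) < 0.1
X[missing] = np.nan

# Imputation is a pipeline step, so it is re-fitted on the training
# portion of each fold and only applied to the held-out portion.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores)
```

Because `cross_val_score` clones and fits the whole pipeline on each training fold, the column means are never computed from the held-out fold, and no manual per-fold imputation code is needed.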

created May 22, 2018 at 22:37

chen636489
