Michael R. Chernick

I think you'd better split before you do imputation. For instance, you may want to impute missing values with the column mean. In this case, if you impute first using the combined train+valid data set and split afterwards, then you have used the validation data set before building your model, which is how a data leakage problem comes into the picture.
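To make the point concrete, here is a minimal sketch of the split-then-impute order using only NumPy. The data is synthetic, and the names (`fill_with`, `train_means`, `leaky_means`) are made up for illustration. The column means used for filling come from the training rows only; `leaky_means` is computed just to show that imputing before the split would use different numbers, ones that already contain validation information.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 100 rows, 3 features,
# with roughly 20% of the entries knocked out.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan

# 1) Split first.
idx = rng.permutation(len(X))
train_idx, valid_idx = idx[:80], idx[80:]
X_train, X_valid = X[train_idx], X[valid_idx]

# 2) Learn the imputation values from the training rows only.
train_means = np.nanmean(X_train, axis=0)


def fill_with(X_part, means):
    """Return a copy of X_part with each NaN replaced by its column's mean."""
    filled = X_part.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = means[cols]
    return filled


# 3) Apply the training means to both parts.
X_train_filled = fill_with(X_train, train_means)
X_valid_filled = fill_with(X_valid, train_means)

# For comparison only: the means you would get by imputing before splitting.
# These depend on the validation rows, which is the leakage described above.
leaky_means = np.nanmean(X, axis=0)
print("train-only means:", train_means)
print("all-rows means:  ", leaky_means)
```

The validation rows are filled with values learned from the training rows, just as unseen data would be at prediction time.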

But you might ask: if I impute after splitting, won't it be too tedious when I need to do cross-validation? My suggestion for that is to use an sklearn pipeline. It really simplifies your code and reduces the chance of making a mistake. See Pipeline
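As a rough sketch of what that can look like, the snippet below chains a mean imputer and a classifier in one `Pipeline` and hands it to `cross_val_score`. The synthetic data, the choice of `LogisticRegression`, and the step names `"impute"` and `"clf"` are my assumptions for illustration, not part of the original question. Because the imputer is a pipeline step, it is refit on the training portion of every fold, so the held-out portion never influences the imputed values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): labels are derived from
# the complete features, then about 10% of the entries are set to NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Split first, so the validation set stays untouched until the end.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Imputation and model live in one estimator.
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", LogisticRegression()),
])

# Each fold fits the imputer on that fold's training part only.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy per fold:", scores)

# Final fit on all training rows, then one evaluation on the validation set.
model.fit(X_train, y_train)
valid_accuracy = model.score(X_valid, y_valid)
print("validation accuracy:", valid_accuracy)
```

After the final fit, `model.named_steps["impute"].statistics_` holds the column means learned from `X_train` alone, which is a quick way to confirm that nothing from the validation set was used.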
