
I am self-learning machine learning right now, and I am confused about what I should do first.

  1. Should I impute missing values before encoding the categorical variables?
  2. Also, I am learning from Kaggle, and the notebooks always split the data into train and test sets before doing any feature engineering. What is the reason behind that? Can I do the feature engineering on the entire dataset instead?
  3. When should I perform cross-validation? Before splitting the data?

I also hope to understand the reasoning behind each of these decisions, because I don't want to just memorize rules. This is an extremely complex topic and it has been difficult to learn on my own.

  • Note that data splitting is typically a bad idea unless n > 20,000. Commented Jul 31, 2022 at 12:13
  • @FrankHarrell Do you mean that one should not split the dataset into train and test sets before doing any feature engineering, unless n > 20,000? If so, why? Commented Jul 31, 2022 at 18:34
  • I meant that data splitting is an enormously wasteful statistical procedure, giving unstable results unless the true signal:noise ratio is very high (outcomes are easy to predict) or n > 20,000. Details here. What is your sample size and distribution of Y? Most often resampling (100 repeats of 10-fold CV or 400 bootstrap reps) is more efficient than data splitting and also exposes the silliness of feature selection. Commented Jul 31, 2022 at 20:17
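
For concreteness, here is a minimal sketch of the resampling approach the last comment describes (100 repeats of 10-fold CV), assuming scikit-learn; the dataset, model, and metric are illustrative placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Toy classification data standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 repeats of 10-fold CV: every row is used for both fitting and
# evaluation, unlike a single train/test split that sets rows aside for good.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

# The spread across the 1000 folds shows how unstable any single split is.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```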

1 Answer


Most of the time, imputing missing values applies to numeric features and has nothing to do with encoding, which applies to categorical data. So dealing with missing values before encoding is a sensible choice. Note that categorical features can have missing values too (often filled with the most frequent category, or kept as an explicit "missing" level), and that imputation should also happen before encoding so the encoder only ever sees complete columns.
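
A minimal sketch of that ordering, assuming scikit-learn; the DataFrame and column names below are made-up placeholders. Fitting the preprocessing on the training rows only, and merely applying it to the test rows, is also the reason Kaggle notebooks split before feature engineering: it keeps test-set information out of the imputation statistics and encoder categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in numeric and categorical columns.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, 52, np.nan],
    "income": [50_000, 62_000, np.nan, 48_000, 75_000, 55_000],
    "city":   ["NY", "LA", np.nan, "NY", "LA", "NY"],
    "gender": ["F", np.nan, "M", "M", "F", "F"],
})

preprocess = ColumnTransformer([
    # Numeric features: impute missing values with the training-set median.
    ("num", SimpleImputer(strategy="median"), ["age", "income"]),
    # Categorical features: impute with the most frequent category *before*
    # encoding, so the encoder only ever sees complete columns.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city", "gender"]),
])

# Fit on the training rows only, then apply the *same* fitted transform to
# the test rows; this prevents test-set information from leaking into the
# imputation statistics and one-hot categories.
X_train, X_test = train_test_split(df, test_size=0.33, random_state=0)
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```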

