I am teaching myself machine learning right now, and I am confused about what I should do first.
- Should I impute missing values before encoding the categorical variables?
- Also, I am learning from Kaggle, and it always splits the data into train and test sets before doing any feature engineering. What is the reason behind this? Can I do the feature engineering on the entire dataset instead?
- When should I perform cross-validation? Before splitting the data?
I would also like to know the reasoning behind each decision, because I don’t want to just memorize the steps. It has been difficult to learn such a complex topic on my own.
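
To make the question concrete, here is a minimal sketch of the order I think the Kaggle tutorials follow: split first, then learn the imputation value and the category list from the training rows only, then apply them to both sets. It is plain Python with made-up toy rows, and the helper names (`split_rows`, `fit_preprocessing`, `transform`) are my own, not from any library. Fitting on the training rows only is my assumption about what the tutorials intend, and I may have the order wrong, which is part of what I am asking.

```python
import random
from statistics import mean

# Toy rows, made up for illustration: (age, city). None marks a missing value.
rows = [
    (25, "A"), (None, "B"), (40, "A"), (31, None),
    (None, "C"), (52, "B"), (38, "C"), (29, "A"),
]

def split_rows(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_preprocessing(train):
    """Learn the imputation value and the categories from training rows only."""
    ages = [age for age, _ in train if age is not None]
    cities = sorted({city for _, city in train if city is not None})
    return {"age_fill": mean(ages), "cities": cities}

def transform(data, params):
    """Impute missing ages first, then one-hot encode city."""
    out = []
    for age, city in data:
        age = params["age_fill"] if age is None else age
        # A missing or unseen city becomes an all-zero one-hot vector.
        one_hot = [1 if city == c else 0 for c in params["cities"]]
        out.append([age] + one_hot)
    return out

# Step 1: split before touching the features.
train, test = split_rows(rows)
# Step 2: learn the preprocessing from the training rows only.
params = fit_preprocessing(train)
# Step 3: apply the same learned preprocessing to both sets.
X_train = transform(train, params)
X_test = transform(test, params)
```

What I don't understand is why step 2 could not simply use all of `rows`, and where cross-validation would fit into these three steps.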