
It seems to me that when you scale a numeric variable, you should do it separately on the train and test sets.

For example, suppose you have a numeric variable X. The normalized X is: ( X - m(X) ) / s(X). When you have train and test sets, m(X) and s(X) should be calculated on each set separately. Otherwise, some information (contained in the global m and s) passes from the train set to the test set through the globally normalized variable.

Is it correct to think this way? Thanks for your answer.


2 Answers


Any transformation of the data representation that extracts ("takes") information from the data should be fitted on the training data only. This is because:

  1. If you used all the data, information would leak from the validation or test (also called holdout) data into your model. This is forbidden: as a result, your validation/test score estimates would be biased.
  2. The model should also be trained on one specific data representation. At prediction time, the transformation should be applied exactly as it was during training, in most cases (an example of an exception: some kinds of online settings).

So in the usual cases of batch training with ERM evaluation, or stochastic optimization in deep learning, this kind of normalization should only be fitted during the training stage.

This is also why, in most ML library designs, this transformation is grouped into a pipeline together with the model: they can then be fitted together and deployed as one unit.
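As a sketch of that pipeline idea (not part of the original answer; the toy data is made up), the scaler and model are fitted together, so the test set is only ever transformed with training-set statistics:

```python
# Minimal sketch: scaler + model fitted as one pipeline (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() runs the scaler's fit_transform on the training data only,
# then fits the model on that scaled representation.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# score() reuses the training-set mean/std to transform X_test.
accuracy = pipe.score(X_test, y_test)
```

Deploying `pipe` as one object guarantees that new data always goes through the exact transformation fitted during training.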

Of course, this can break assumptions at runtime. Say you min-max-normalize: you would expect that attribute to fall into $[0, 1]$ afterwards. But if the training max was $m$, it could very well be that new data contains a value $x > m$, so applying the min-max normalization yields a transformed value $\tilde{x} > 1$. This does not work well in some cases, so you might truncate, setting the value to $1$. If you expect many outliers, you may want to take a look at RobustScaler in scikit-learn, for example.
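A minimal illustration of that out-of-range effect (the numbers here are made up for the example): min-max statistics fitted on a small training set send an unseen larger value above 1, which can then be truncated back into range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [3.0], [5.0]])  # training min = 1, max = 5
new = np.array([[7.0]])                  # new value exceeds the training max

scaler = MinMaxScaler().fit(train)
raw = scaler.transform(new)              # (7 - 1) / (5 - 1) = 1.5, outside [0, 1]

# One simple fix: truncate the transformed value back into [0, 1].
clipped = np.clip(raw, 0.0, 1.0)
```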

  • Thanks for your answer. I will keep RobustScaler in mind. Commented Nov 4, 2018 at 7:39

You should use the training data's mean and standard deviation. For example, with scikit-learn's StandardScaler class, you first call fit() on the training data and then transform() on the test data; this transforms the test data using the m and s values fitted on the training data.

If the training and test sets have significantly different means and standard deviations, then your training set might not be a good representative of the overall population, which can cause more serious problems than the scaling issue you are asking about. Alternatively, your test set may simply be very skewed. If it is skewed and you standard-scale it separately, there is a danger that your test samples will appear similar in nature to your training samples, e.g. anomalous situations might seem normal. Also, what would you do if you received one test sample per day: would you wait until you had enough samples to calculate the test mean?
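A small sketch of this fit-on-train, transform-on-test workflow (the toy numbers are my own, not from the answer):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # training mean is 2.5
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns m and s from training data only
X_test_scaled = scaler.transform(X_test)  # applies the training m and s
```

Note that a test value equal to the training mean (2.5) maps to 0, regardless of what the test set's own mean happens to be.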

  • Thanks for the answer. I have just thought of this case: I first split my data into train and test sets. Then, with my train set, I fit a model using n-fold cross-validation: the train set is split into n folds, each fold being used for training and testing during the CV loop. I think I should scale inside the CV loop. Is that right? If yes, is there a way to handle this with scikit-learn's GridSearchCV? Commented Nov 4, 2018 at 7:38
  • GridSearchCV takes an estimator as input and calls the fit and transform methods implemented in that estimator. You can write a wrapper class for your model and, in its fit method, call StandardScaler's and your model's fit methods in order. Therefore, when GridSearchCV calls the fit method of your estimator, the validation set will be isolated. – gunes, Commented Nov 4, 2018 at 10:31
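A sketch of this idea (note: scikit-learn's Pipeline already implements the wrapping the comment describes, so a hand-written wrapper class is usually unnecessary). Putting the scaler and model into a Pipeline and passing that to GridSearchCV means the scaler is refitted on the training folds of every CV split, so the held-out fold never leaks into the scaling statistics:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)

# The whole pipeline is the estimator: within each CV split, StandardScaler
# is fitted on the training folds only, then applied to the validation fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
best_C = grid.best_params_["svc__C"]
```

Hyperparameters of pipeline steps are addressed with the `stepname__param` convention, here `svc__C`.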
