
There is lots of discussion about pre-processing methods and whether they need to be included within a cross-validation procedure or whether they can happen prior to splitting the data -- questions on stackexchange: 1 2 3 4 and papers analysing this 5 6.

The main view seems to be that any pre-processing should not include validation/test data, because:

Cross-validation is best viewed as a method to estimate the performance of a statistical procedure, rather than a statistical model. Thus in order to get an unbiased performance estimate, you need to repeat every element of that procedure separately in each fold of the cross-validation, which would include normalisation.
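As a concrete illustration of that advice, one way to follow it is to wrap the preprocessing and the model in a single pipeline, so the normalisation is re-fitted on the training folds of every split. A minimal sketch, assuming scikit-learn with standardisation as the preprocessing step (the data and model are illustrative):

```python
# Minimal sketch: preprocessing kept inside cross-validation.
# StandardScaler and Ridge stand in for whatever preprocessing/model is used.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# The whole pipeline is "the statistical procedure": cross_val_score re-fits
# the scaler on the training folds of each split, so the test fold never
# influences the normalisation constants.
pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```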

However, unsupervised methods are sometimes considered an admissible form of weak data leakage and 7 states "initial unsupervised screening steps can be done before samples are left out", giving the example of variance-based feature selection.
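For comparison, here is what that "admissible weak leakage" might look like in code: the variance-based screening sees every sample (but never the labels) before the folds are formed, while everything else stays inside the cross-validated pipeline. A sketch under the same illustrative assumptions as above; the threshold value is arbitrary:

```python
# Sketch of unsupervised screening done before the CV split (as in reference 7);
# the threshold, data and model are illustrative.
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, random_state=0)

# Screening uses every sample but never y, so it is applied once, up front.
X_screened = VarianceThreshold(threshold=1e-3).fit_transform(X)

# Steps whose estimates should not see the test fold stay inside the pipeline.
pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipe, X_screened, y, cv=10)
```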

In these scenarios, as outlined e.g. here, it is generally assumed that you have a procedure to train a model on some data and you want to be able to use this to predict on totally new samples.

However, suppose you have a dataset that captures all possible observations you are interested in (i.e. an exhaustive dataset), but some of the target labels you want to predict are missing --- say you have data on every single house in Europe (X_all, y_all) and for a portion of this data you don't know the price and want to predict it.

In this case, it seems unproblematic to me to use all of X_all in any unsupervised preprocessing steps, and therefore to do this prior to any cross-validation on the labelled data. X_all is a constant that is fixed in both the training and the testing scenario, the unsupervised preprocessing steps applied to X_all don't depend on the target labels, and so no leakage occurs.
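To make the setting concrete, here is a minimal sketch of that workflow; X_all, y_all and the labelled mask are hypothetical placeholders, and PCA stands in for an arbitrary unsupervised step:

```python
# Sketch: unsupervised step fitted on the exhaustive X_all, cross-validation
# only on the labelled rows. All names and data here are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 30))       # every house in the population
y_all = X_all @ rng.normal(size=30)       # prices, only partly observed
labelled = rng.random(1000) < 0.6         # rows whose price is known

# The unsupervised step is a fixed function of X_all, which is available in
# both the training and the prediction scenario and never touches y_all.
Z_all = PCA(n_components=10).fit(X_all).transform(X_all)

# Cross-validation then only has to cover the supervised part of the procedure.
scores = cross_val_score(Ridge(), Z_all[labelled], y_all[labelled], cv=10)
```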

Is my thinking correct or am I missing something?


1 Answer


The basic recommendation when you have true class labels is to fit the (unsupervised) transformations on the full set of training data, extract the resulting estimates, and apply those estimates to the objects in the test fold. For example, if I'm using 10-fold CV, I first normalize or mean-zero standardize the feature values for objects assigned to folds 1-9, then apply the min/max or mean/s.d. from the training objects to normalize or standardize the test objects in fold 10 prior to testing. Recall that all the objects in folds other than the test fold are used collectively for training, so you don't need 9 separate estimates of min(x) or mean/s.d., one from each training fold.
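A minimal sketch of that per-fold procedure, assuming mean/s.d. standardisation and an arbitrary estimator (data, model and fold count are illustrative):

```python
# Sketch: training-fold statistics are estimated once per split and then
# applied to the held-out fold. Data, model and fold count are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # One mean and s.d. per feature, pooled over the nine training folds.
    mu = X[train_idx].mean(axis=0)
    sd = X[train_idx].std(axis=0)

    X_train = (X[train_idx] - mu) / sd
    X_test = (X[test_idx] - mu) / sd   # test fold re-uses the training estimates

    model = Ridge().fit(X_train, y[train_idx])
    print(model.score(X_test, y[test_idx]))
```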

However, if you don't have true class labels for the objects at all, then you can view the unsupervised step as merely a mathematical transformation applied prior to classification. If you don't know the class labels (the truth), then you can't be "cheating" via information leakage from the training data to the test data.

  • Sorry, but I don't feel like your answer addresses the point I'm trying to make; I realise this may be my lack of clarity. In your first paragraph, you outline the usual approach, similar to those I've linked to. Your second paragraph discusses when you don't have any class labels at all, which isn't the problem I'm describing. My point is that, though y_all may contain some class labels, since X_all is fixed and constant across both training and testing scenarios, unsupervised pre-processing steps applied to X_all don't depend on the target labels at all, and so no leakage occurs.
    – A. Bollans
    Commented Feb 4 at 14:20
