
My supervisor has instructed another person in my lab to use both the training and testing data to impute missing values for use in building a machine learning model. The results of the analysis haven't been put into a publication, but my feeling was (A) this is wrong: the model should be trained without receiving any information from the testing set, and (B) if this were to go into a publication, it would be dodgy at best and potentially illegal if it weren't reported. I would be surprised if the results were published if this were reported.

Are my suspicions correct? Is there a rigorous reason why?

  • My guess is that if a paper was accepted without reporting this, and the journal found out afterwards, you would be asked to retract your paper at minimum. And if you submit a paper and report that you did this, I bet your reviewers will say "why did you do this? Now your estimates of predictive power on new data are biased." I trust you understand why falsely reporting methods would be unethical. As for rigor, this link explains why this practice is bad: stats.stackexchange.com/questions/95083/…
    – wzbillings
    Commented Aug 27, 2021 at 1:34
  • @wz-billings frankly, my supervisor suggesting we do this made me nervous.
    Commented Aug 27, 2021 at 1:59
  • It would make me nervous as well. It is probably best to discuss with your supervisor why they have recommended this; it could be the result of a miscommunication, misinformation, or a momentary bit of forgetfulness, so it is best to clear the air if possible.
    – wzbillings
    Commented Aug 27, 2021 at 13:20

1 Answer


The out-of-sample data mimic the real situation of applying the model to unseen data, such as expecting Siri or Alexa to understand speech that has yet to be uttered, perhaps even by people who have yet to be born. When you are modeling, you treat the out-of-sample data as if they do not exist.$^{\dagger}$ Consequently, this approach is unacceptable.

I like the analogy to speech recognition and would gladly use it if a colleague of mine suggested this buffoonery. I invite you to use the analogy.
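
For concreteness, here is a minimal sketch of the non-leaky version of the imputation step, using scikit-learn's SimpleImputer on randomly generated placeholder data (none of this is from the original analysis):

```python
# Non-leaky imputation: the imputer never sees the test rows while fitting.
# The arrays below are randomly generated placeholders for illustration.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_test = rng.normal(size=(25, 3))
X_train[rng.random(X_train.shape) < 0.1] = np.nan  # poke holes in the data
X_test[rng.random(X_test.shape) < 0.1] = np.nan

# Correct: the imputer learns its fill values from the training data only...
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
# ...and those same training-derived values are applied to the test data.
X_test_imputed = imputer.transform(X_test)

# Leaky (what the question describes): fitting the imputer on the stacked
# train + test data lets the held-out rows influence the imputed training values.
# leaky_imputer = SimpleImputer(strategy="mean").fit(np.vstack([X_train, X_test]))
```

The point of the design is that `transform` on the test set reuses the means computed from the training set alone, mirroring how the fitted model would later be applied to speech that has yet to be uttered.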

$^{\dagger}$It gets somewhat more complicated than this because of ideas like cross validation and having a train/validate/test split. Using the ($5$-fold) cross-validation idea, you take your four folds for training and do all of the modeling steps, including the imputation, on them, ignoring the fifth fold. Then you do the same but hold out a different fold (which means you now train on previously-held-out data), et cetera. This way, each training set remains ignorant of the fold on which it will be evaluated.
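
To make the footnote concrete, here is a sketch of that idea using a scikit-learn Pipeline inside cross_val_score; the data and the choice of logistic regression are placeholders of my own, not part of the original answer:

```python
# Cross validation without leakage: because the imputer sits inside the
# Pipeline, it is re-fit on the four training folds in every round, and the
# held-out fold is ignored during imputation.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # toy target
X[rng.random(X.shape) < 0.1] = np.nan  # introduce missing values afterwards

pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)  # imputation re-fit per fold
print(scores.mean())
```

Any final held-out test set would still be kept entirely aside and only passed through the already-fitted pipeline at the very end.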

  • But you do cross-validation using a training set; the cross-validation is to avoid overfitting, and then you show the model is not overfit with the test set. Correct?
    Commented Aug 27, 2021 at 1:58
  • That sounds about right, though your question warrants its own post. (My paragraph is a footnote for a reason, even if it's $60\%$ of the post.)
    – Dave
    Commented Aug 27, 2021 at 2:01
  • I will turn it into a post, then.
    Commented Aug 27, 2021 at 2:04
  • Scratch that; I believe there are several posts about this already, one here: stats.stackexchange.com/questions/223408/… and another here: stats.stackexchange.com/questions/18856/…
    Commented Aug 27, 2021 at 2:06
  • @Steely That sounds like the beginning of a new question to post. In your post, please indicate in what way my answer here is not clear.
    – Dave
    Commented Jun 7, 2022 at 12:54
