
My supervisor has instructed another person in my lab to use both the training and testing data to impute missing values for use in building a machine learning model. The results of the analysis haven't been put into a publication, but my feeling was (A) this is wrong: the model should be trained without receiving any information from the testing set, and (B) if this were to go into a publication, it would be dodgy at best and potentially illegal if it weren't reported. I would be surprised if the results were published if this were reported.

Are my suspicions correct? Is there a rigorous reason why?

  • My guess is that if a paper was accepted without reporting this, and the journal found out afterwards, you would be asked to retract your paper at minimum. And if you submit a paper and report that you did this, I bet your reviewers will say "why did you do this? Now your estimates of predictive power on new data are biased." I trust you understand why falsely reporting methods would be unethical. As for rigor, this link explains why this practice is bad: stats.stackexchange.com/questions/95083/…
    – wzbillings
    Commented Aug 27, 2021 at 1:34
  • @wz-billings frankly, my supervisor suggesting we do this made me nervous.
    Commented Aug 27, 2021 at 1:59
  • It would make me nervous as well. It is probably best to discuss with your supervisor why they have recommended this; it could be the result of a miscommunication, misinformation, or a momentary bit of forgetfulness, so it is best to clear the air if possible.
    – wzbillings
    Commented Aug 27, 2021 at 13:20

1 Answer


The out-of-sample data mimic the real situation of applying the model to unseen data, such as expecting Siri or Alexa to understand speech that has yet to be uttered, perhaps even by people who have yet to be born. When you are modeling, you treat the out-of-sample data as if they do not exist.$^{\dagger}$ Consequently, this approach is unacceptable.

I like the analogy to speech recognition and would gladly use it if a colleague of mine suggested this buffoonery. I invite you to use the analogy.
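
For concreteness, here is a minimal sketch of the non-leaky version of the imputation step, using scikit-learn's SimpleImputer on randomly generated placeholder data (none of this is from the original analysis):

```python
# Non-leaky imputation: the imputer never sees the test rows while fitting.
# The arrays below are randomly generated placeholders for illustration.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_test = rng.normal(size=(25, 3))
X_train[rng.random(X_train.shape) < 0.1] = np.nan  # poke holes in the data
X_test[rng.random(X_test.shape) < 0.1] = np.nan

# Correct: the imputer learns its fill values from the training data only...
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
# ...and those same training-derived values are applied to the test data.
X_test_imputed = imputer.transform(X_test)

# Leaky (what the question describes): fitting the imputer on the stacked
# train + test data lets the held-out rows influence the imputed training values.
# leaky_imputer = SimpleImputer(strategy="mean").fit(np.vstack([X_train, X_test]))
```

The point of the design is that `transform` on the test set reuses the means computed from the training set alone, mirroring how the fitted model would later be applied to speech that has yet to be uttered.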

$^{\dagger}$It gets somewhat more complicated than this because of ideas like cross validation and having a train/validate/test split. Using the ($5$-fold) cross-validation idea, you take your four folds for training and do all of the modeling steps, including the imputation, on them, ignoring the fifth fold. Then you do the same but hold out a different fold (which means you now train on previously-held-out data), et cetera. This way, each training set remains ignorant of the fold on which it will be evaluated.
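
To make the footnote concrete, here is a sketch of that idea using a scikit-learn Pipeline inside cross_val_score; the data and the choice of logistic regression are placeholders of my own, not part of the original answer:

```python
# Cross validation without leakage: because the imputer sits inside the
# Pipeline, it is re-fit on the four training folds in every round, and the
# held-out fold is ignored during imputation.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # toy target
X[rng.random(X.shape) < 0.1] = np.nan  # introduce missing values afterwards

pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)  # imputation re-fit per fold
print(scores.mean())
```

Any final held-out test set would still be kept entirely aside and only passed through the already-fitted pipeline at the very end.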

  • But you do cross-validation using a training set; the cross-validation is to avoid overfitting, and then you show the model is not overfit with the test set. Correct?
    Commented Aug 27, 2021 at 1:58
  • That sounds about right, though your question warrants its own post. (My paragraph is a footnote for a reason, even if it's $60\%$ of the post.)
    – Dave
    Commented Aug 27, 2021 at 2:01
  • I will turn it into a post, then.
    Commented Aug 27, 2021 at 2:04
  • Scratch that; I believe there are several posts about this already, one here: stats.stackexchange.com/questions/223408/… and another here: stats.stackexchange.com/questions/18856/…
    Commented Aug 27, 2021 at 2:06
  • @Steely That sounds like the beginning of a new question to post. In your post, please indicate in what way my answer here is not clear.
    – Dave
    Commented Jun 7, 2022 at 12:54
