The Lifecycle of the Test Set

Jonte Dancker
2 min read · Mar 2, 2024

Figure: The lifecycle of the test set.

The integrity of the test set is crucial in evaluating model performance. We want the test set to give us an unbiased estimate of how well the model generalizes to unseen data.

However, when we evaluate our model on the test set, we might see that it performs poorly on a specific subset of the test set. We go back to model development and add new features or change the data pre-processing. The next time we evaluate the model, we see that it improves on the test set. We repeat this process until we are happy with the model’s performance. Good.

Or bad. Using a test set too often can result in bias, overfitting, and reduced model generalizability. As a result, our evaluation of the model becomes less meaningful.

Why does this happen?

By repeatedly evaluating the model’s performance on the test set, we begin using information from the test set in the training process. We introduce information leakage. Such leakage can lead to selection bias when choosing hyperparameters or features. As a result, we overfit our model to the test set and thus reduce its ability to generalize.
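To make this concrete, here is a minimal sketch of the anti-pattern (using scikit-learn; the dataset, model, and hyperparameter grid are all illustrative, not from this article). Because the test score itself drives the hyperparameter choice, the final score becomes an optimistically biased estimate:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 samples split into train and test sets.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Anti-pattern: every candidate is scored on the test set,
# so the chosen hyperparameter is effectively fit to the test set.
best_score, best_depth = -1.0, None
for depth in [2, 4, 8, 16]:
    model = RandomForestClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # information leaks here
    if score > best_score:
        best_score, best_depth = score, depth

# best_score now overestimates how well the model generalizes.
```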

Hence, every time we use the test set, its quality degrades. The test set becomes less independent and no longer contains completely unseen data. Eventually, the test set will not be able to give us an unbiased estimate of our model’s ability to generalize to unseen data.

As the quality of the test set decreases, the model evaluation becomes less meaningful. We compromise the validity of our performance metrics: they no longer reflect the model’s ability to generalize to unseen data. Hence, using a test set too often can lead to misguided decisions. We cannot be sure of the model’s effectiveness after deployment.

To avoid such problems, we should periodically refresh the test set with new unseen data. With this, we ensure that the test set provides a reliable benchmark for model evaluation.
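Refreshing the test set helps; so does limiting how often we consult it in the first place. A common safeguard is to do all model selection on the training data, for example with cross-validation, and to touch the test set exactly once at the end. A minimal sketch, continuing the illustrative example above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Model selection uses only the training data via 5-fold cross-validation;
# the test set stays untouched until the final evaluation.
best_cv, best_depth = -1.0, None
for depth in [2, 4, 8, 16]:
    model = RandomForestClassifier(max_depth=depth, random_state=42)
    cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
    if cv_score > best_cv:
        best_cv, best_depth = cv_score, depth

# The test set is used exactly once, so the estimate stays unbiased.
final_model = RandomForestClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
```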

Hence, I like to think of a test set as having a lifecycle. The lifecycle reminds me of the consequences of using a test set too often; a small code sketch of it follows the list below.

1. Creation: We create the test set as a subset of the data set and put it aside.
2. Use: We use the test set to evaluate the model’s performance on unseen data.
3. Deterioration: The test set’s quality deteriorates the more often we use it. As a result, the evaluation becomes less effective.
4. Retirement: We should retire the over-used test set and replace it with a new one. The old test set can become part of our training set.
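As a rough illustration of this lifecycle, one could track how often a test set has been used and flag it for retirement once a usage budget is exhausted. The `TestSetLifecycle` class and its `usage_budget` below are hypothetical, not an established API, and the budget of 5 is an arbitrary choice:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TestSetLifecycle:
    """Hypothetical helper that tracks test-set usage and flags retirement."""

    X: np.ndarray
    y: np.ndarray
    usage_budget: int = 5  # arbitrary illustrative limit
    evaluations: int = 0

    @property
    def retired(self) -> bool:
        return self.evaluations >= self.usage_budget

    def evaluate(self, model) -> float:
        """Score a scikit-learn-style model; each call consumes the budget."""
        if self.retired:
            raise RuntimeError(
                "Test set retired: refresh it with new unseen data "
                "and move the old samples into the training set."
            )
        self.evaluations += 1  # every use degrades the test set's quality
        return model.score(self.X, self.y)
```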
