
I am referring to this question/scenario, Train neural network with unlimited training data, but unfortunately I cannot comment there.

Since I never see any training batch more than once, I would guess that my model is less likely to overfit and should generalize better. But given that I can generate infinite data, do I still need to generate test data at some point? I could train my model until I reach a very low loss value (if that is possible), and then, since I generate the data the same way as before (same distribution), I would already know that my model performs well on new, unseen data, because it was trained on data it saw only once.
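To make the setup concrete, here is a minimal sketch of what I mean. The `generate_batch` function and the linear model are hypothetical stand-ins for the real data generator and network, not a specific library API; the point is only that every sample is drawn fresh and seen exactly once, so there are no epochs:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_batch(n=64):
    """Draw a fresh batch from the (known) data distribution."""
    x = rng.normal(size=(n, 10))
    y = x.sum(axis=1, keepdims=True)  # some ground-truth relationship
    return x, y

w = np.zeros((10, 1))  # a linear model as a stand-in for the network
lr = 1e-3
for step in range(10_000):
    x, y = generate_batch()           # every sample is seen exactly once
    grad = x.T @ (x @ w - y) / len(x)
    w -= lr * grad                    # one SGD step; no epochs, no data reuse
```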

What additionally confuses me about this topic is that most of the literature deals with optimization algorithms under the assumption of a single, fixed dataset, for practical reasons. Generating fresh data on the fly actually makes the mathematical setting cleaner.


1 Answer


In short: you should still generate test data, because training and test performance can still differ significantly.

Explanation

In the end, your model was trained on a finite dataset, and the risk of overfitting depends on the complexity of the data distribution, the complexity of the model, and the number of training samples. Of course you can reduce the risk by training on more and more data, but how will you know that you have used enough training data to avoid overfitting?

Well, there is one obvious way: test it! So you should generate an independent test set to check for overfitting.
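As a minimal sketch of that test, continuing the hypothetical `generate_batch`/`w` example from the question above (the names are illustrative, not a specific library API): draw a fresh test set once, never train on it, and compare losses.

```python
# Evaluate the trained weights `w` on an independent, freshly generated test set.
x_test, y_test = generate_batch(n=10_000)   # drawn once, never trained on

def mse(w, x, y):
    return float(np.mean((x @ w - y) ** 2))

x_tr, y_tr = generate_batch()               # one more training-style batch
print("fresh train-batch loss:", mse(w, x_tr, y_tr))
print("held-out test loss:    ", mse(w, x_test, y_test))
# A test loss far above the training loss would indicate overfitting.
```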

Will the risk of overfitting become zero?

From a mathematical point of view, the probability that your sampled data is somehow skewed will never be zero (though it can get close enough to zero for any practical purpose).

The Always-Overfitting-Model

Even with an arbitrary amount of data, you might not be able to avoid overfitting. The following model is a corner case for theoretical consideration; I would not use it in any practical setting.

The model is trained by storing all training samples. When making a prediction, it looks for an exact match in the stored dataset and outputs the stored target value. If no exact match is found, it outputs nonsense (e.g. a constant or a random value). This model will perform perfectly on training data and miserably on test data, independent of the amount of training data (as long as the generated data comes from a continuous distribution, e.g. a normal distribution).
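Here is a toy sketch of that corner case (all names hypothetical, purely illustrative). Because the inputs are continuous, an unseen input matches a stored sample with probability zero, so the test error never improves no matter how many samples are memorized:

```python
import numpy as np

class MemorizingModel:
    """Store every training pair; return a nonsense fallback for unseen inputs."""
    def __init__(self, fallback=0.0):
        self.table = {}
        self.fallback = fallback

    def fit(self, X, y):
        for xi, yi in zip(X, y):
            self.table[tuple(xi)] = yi   # memorize every training sample

    def predict(self, X):
        # exact-match lookup; unseen inputs get the constant fallback
        return np.array([self.table.get(tuple(xi), self.fallback) for xi in X])

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100_000, 3)); y_train = X_train.sum(axis=1)
X_test  = rng.normal(size=(1_000, 3));   y_test  = X_test.sum(axis=1)

m = MemorizingModel()
m.fit(X_train, y_train)
print(np.mean((m.predict(X_train) - y_train) ** 2))  # 0.0: perfect on train
print(np.mean((m.predict(X_test)  - y_test) ** 2))   # large: fails on test
```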

  • I really like the last example you gave. Continuously training on generated data will probably lead to a low loss, but it does not tell me how the model performs on the same kind of generated data when it is not in 'train mode'. But from a theoretical point of view, I am in a much more natural and intuitive mathematical setting, since epochs are only introduced for practical reasons, i.e. having one limited dataset rather than a stream of data, right?
    – ZenDen
    Commented Jun 27 at 9:39
  • Even if you have a stream of data, you will typically have training and application separated. In that case, your model is trained on a finite dataset (even though you can always train it on a larger one). So mathematically, it is always a finite set. The advantage of a data generator is that you can have arbitrarily large (but still finite) sets. That allows you to train the model without using any sample twice (and hence epochs are meaningless with the data stream).
    – Broele
    Commented Jun 28 at 7:46
  • Right. To analyse my case mathematically, I would still define a finite training set, but I would not have to define a pairwise disjoint cover for each epoch, as is usually done, since I do not rely on epochs in the first place.
    – ZenDen
    Commented Jun 28 at 8:02
