
What is the best / typical way to plot the training and validation loss during the training of a neural network? Specifically, I am thinking of this as a way to help diagnose under- / over-fitting - perhaps for early stopping or some other method of parameter tuning (e.g. paper).

Here I am assuming there is a training set and an independent development / validation set.

Question #1:

First, does one plot the performance / loss at the end of each epoch, or at the end of each iteration (i.e. each mini-batch)?

Question #2:

Assuming we are plotting at the end of each mini-batch, what is the process?

  • Read mini-batch, run through network, compute loss and update parameters

Now what?

Training:

Do we use

  • The entire training set (or at least the portion read from disk so far - if we are reading from disk in batches or using a generator) to calculate the performance / loss?
  • Just the mini-batch we just used to update the parameters?
  • Some type of rolling average of the mini-batches read so far?
  • Something else?
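For concreteness, the rolling-average option is commonly implemented as an exponentially weighted average of the mini-batch losses with bias correction. The sketch below is a hypothetical illustration - the class name and the `beta` value are my own choices, not any framework's API:

```python
# Hypothetical sketch: exponentially weighted running average of the
# mini-batch training losses, similar to what many progress bars report.
class RunningLoss:
    def __init__(self, beta=0.98):
        self.beta = beta        # smoothing factor (assumed value)
        self.avg = 0.0
        self.steps = 0

    def update(self, batch_loss):
        self.steps += 1
        self.avg = self.beta * self.avg + (1 - self.beta) * batch_loss
        # bias correction keeps early estimates from being pulled toward zero
        return self.avg / (1 - self.beta ** self.steps)

tracker = RunningLoss()
smoothed = [tracker.update(l) for l in [2.0, 1.5, 1.2, 1.1]]
```

Plotting this smoothed curve per mini-batch avoids the jitter of the raw per-batch losses while still producing far more points than one per epoch.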

Development / Validation:

Do we use:

  • The entire development / validation set (or at least the portion read from disk so far - if we are reading from disk in batches or using a generator) to calculate the performance / loss?
  • Some type of rolling average of previous iteration calculations?
  • Something else?
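For what it's worth, the first option - streaming the entire development / validation set through a generator and averaging the loss weighted by batch size - could be sketched like this. Here `model` and `loss_fn` are toy stand-ins I made up to show the call shape, not any framework's API:

```python
# Hypothetical sketch: stream the full validation set batch by batch and
# average the loss, weighted by batch size, without holding it all in memory.
def validation_loss(batches, model, loss_fn):
    total, count = 0.0, 0
    for x, y in batches:                     # batches can come from a generator
        total += loss_fn(model(x), y) * len(x)
        count += len(x)
    return total / count

# Toy stand-ins just to demonstrate the call shape
model = lambda x: x                          # "predict" by identity
loss_fn = lambda pred, y: sum(abs(p - t) for p, t in zip(pred, y)) / len(pred)
val_loss = validation_loss([([1, 2], [1, 2]), ([3], [4])], model, loss_fn)
```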

2 Answers


Whether you use online training (updating connection weights for each object as it is used in training) or batch mode (summing the partial derivatives over all the objects during a sweep, or epoch, then updating connection weights at the end of the sweep), you only plot the error at the end of the sweep. Recall that the error is the sum of the MSE or cross-entropy over all the objects as they are processed in the sweep; it is plotted at the end of the sweep, then zeroed before the next one.

It's not common to plot the error for each object, since it jumps around too much and is too expensive computationally.
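A minimal sketch of that accumulate-then-zero bookkeeping, with toy stand-ins for the model, loss, and data (an illustration of the scheme, not a real training loop):

```python
# Hypothetical sketch: sum the per-object error over the sweep, record one
# point for the plot at the end, then zero the accumulator for the next sweep.
model = lambda x: x                        # toy "network": identity
loss_fn = lambda pred, y: (pred - y) ** 2  # per-object squared error
training_data = [(1.0, 2.0), (3.0, 3.0)]   # made-up (input, target) pairs

epoch_losses = []
for epoch in range(2):
    running = 0.0                          # zeroed before each sweep
    for x, y in training_data:             # one object at a time (online)
        running += loss_fn(model(x), y)    # accumulate the sweep error
        # ... the weight update would go here (online) or after the loop (batch)
    epoch_losses.append(running)           # one plotted point per sweep
```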

  • Are you using the term 'sweep' to mean a full epoch or a mini-batch? Also, I wonder if you think I am referring to online learning as a mini-batch of size 1? I mean reading data from disk or a generator incrementally (as opposed to holding the entire training set in memory - thus the whole training set is not an option for calculating the loss).
    – B_Miner
    Commented Mar 30, 2018 at 16:46
  • Sweep is an epoch, which means processing all the objects. I wouldn't use "mini-batch", since if you say "online" everyone knows you updated connection weights with the partial derivatives for only that object, then zeroed the partials for the next object. For large datasets, it is common to use online. Sometimes online is better than batch, because the gradient descent can shoot off into space when using the larger derivatives associated with batch. But it depends on the data.
    – user32398
    Commented Mar 30, 2018 at 16:53
  • So... is your response to my original question that you only plot the loss / performance at the end of each epoch? And thus the calculation of the performance / loss is over the entire training set?
    – B_Miner
    Commented Mar 30, 2018 at 16:56
  • I am not sure this addresses the question. I am not asking about what is used for the weight updates or k-fold CV. Thanks for the comments though.
    – B_Miner
    Commented Mar 30, 2018 at 17:06
  • If you look at the paper linked, they plot iterations, which seem defined as mini-batches.
    – B_Miner
    Commented Mar 30, 2018 at 17:13
