23
$\begingroup$

Can someone please tell me how I am supposed to build a neural network using the batch method?

I have read that, in batch mode, we go through all the samples in the training set, calculate the error, the delta and thus the delta weights for each neuron in the network, and then, instead of immediately updating the weights, we accumulate the delta weights and apply them once before starting the next epoch.
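For the first description, this is the rough sketch I have in mind (just my own numpy illustration, not taken from anywhere; the names X, d, W and alpha are placeholders, and the network is reduced to a single linear neuron to keep it short):

    import numpy as np

    # My understanding of description 1: accumulate the delta weights over the
    # whole training set, then apply them once before the next epoch.
    # Single linear neuron trained on half the squared error 0.5*(t - y)^2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))           # 100 samples, 3 inputs (made-up data)
    d = X @ np.array([1.0, -2.0, 0.5])      # desired outputs
    W = np.zeros(3)                         # weights of the neuron
    alpha = 0.005                           # learning rate

    for epoch in range(50):
        delta_W = np.zeros_like(W)          # accumulator for the weight updates
        for x, t in zip(X, d):
            y = W @ x                       # forward pass
            error = t - y                   # error for this sample
            delta_W += alpha * error * x    # accumulate, do not apply yet
        W += delta_W                        # update once, before the next epoch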

I have also read elsewhere that the batch method is like the online method, the only difference being that one sums the errors over all samples in the training set, takes their average, and then uses that average to update the weights just as in the online method, like this:

for epoch = 1 to numberOfEpochs

    SumOfErrors = 0

    for all i samples in training set
        calculate the error in the output layer
        SumOfErrors += (d[i] - y[i])
    end

    errorAvg = SumOfErrors / number of samples in training set

    now update the output layer with this error
    update all other previous layers

    go to the next epoch

end
  • Which one of these is truly the correct form of the batch method?
  • In the case of the first one, doesn't accumulating all the delta weights result in a huge number?
$\endgroup$
1
  • 1
    $\begingroup$ The "correct" method depends on the context. It turns out that in many cases, updating the weights only once per epoch will converge much more slowly than stochastic updating (updating the weights after each example). I'll add that there's a consensus that you'll generally want to use some form of batch updating, but much more often than once per epoch. $\endgroup$
    – Tahlor
    Commented Oct 16, 2017 at 2:40

3 Answers

16
$\begingroup$

Using the average or the sum is equivalent, in the sense that there exist pairs of learning rates for which they produce the same update.

To confirm this, first recall the update rule:

$$\Delta w_{ij} = -\alpha \frac{\partial E}{\partial w_{ij}}$$

Then, let $\mu_E$ be the average error for a dataset of size $n$ over an epoch. The sum of errors is then $n\mu_E$, and because $n$ doesn't depend on $w$, the following holds:

$$\Delta w_{ij} = -\alpha \frac{\partial (n\mu_E)}{\partial w_{ij}}= -\alpha n\frac{\partial \mu_E}{\partial w_{ij}}$$
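As a quick numeric sanity check (my own sketch, with made-up per-sample gradient values), rescaling the learning rate by $1/n$ makes the sum-based update identical to the average-based one:

    import numpy as np

    # Hypothetical per-sample gradients of E_i with respect to one weight.
    grads = np.array([0.2, -0.1, 0.4, 0.3])                   # made-up values, n = 4
    n = len(grads)
    alpha = 0.1

    update_from_average = -alpha * grads.mean()               # average-based update
    update_from_sum = -(alpha / n) * grads.sum()              # sum-based, alpha rescaled by 1/n
    print(np.isclose(update_from_average, update_from_sum))   # True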

To your second question, the phrase "accumulating the delta weights" would imply that one of these methods retains weight updates. That isn't the case: batch learning accumulates error. There's only one single $\Delta w$ vector in a given epoch. (Your pseudocode omits the step of applying the weight update, after which one can discard $\Delta w$.)

$\endgroup$
2
  • 1
    $\begingroup$ Is mini-batch gradient descent the same as batch gradient descent? I'm lost here! If not, what's the difference between them? Correct me if I'm wrong: in batch mode, the whole dataset is read in batches, gradients are calculated, and once everything has been read they are averaged and then the parameters are updated, while in mini-batch, each batch is read, gradients are calculated and then the parameters are updated, and then the next mini-batch is read, until the epoch is over. $\endgroup$
    – Hossein
    Commented Feb 14, 2017 at 10:55
  • 1
    $\begingroup$ That's the generally given definition: update parameters using one subset of the training data at a time. (There are some methods in which mini-batches are randomly sampled until convergence, i.e. the batch won't be traversed in an epoch.) See if this is helpful. $\endgroup$ Commented Feb 14, 2017 at 12:30
1
$\begingroup$

The two approaches are equivalent. I personally would think of it as the average error instead of the sum. But remember that gradient descent has a parameter called the learning rate; only a fraction of the gradient of the error is subtracted. So whether the error is defined as the total or the average can be compensated for by changing the learning rate.
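To make the compensation explicit (a small illustration with symbols of my own choosing, $n$ being the number of training samples):

$$-\alpha_{\text{avg}}\,\frac{\partial}{\partial w}\left(\frac{1}{n}\sum_{i=1}^{n} E_i\right) \;=\; -\frac{\alpha_{\text{avg}}}{n}\,\frac{\partial}{\partial w}\left(\sum_{i=1}^{n} E_i\right),$$

so using the summed error with a learning rate of $\alpha_{\text{avg}}/n$ produces exactly the same weight update as using the average error with $\alpha_{\text{avg}}$.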

$\endgroup$
1
  • $\begingroup$ Thanks, but if they are really the same, why waste so much memory on retaining the accumulated updates for each pattern, when we could just sum the errors, which would be a single variable? $\endgroup$
    – Hossein
    Commented Sep 30, 2015 at 4:06
1
$\begingroup$

Someone explained it like this: the batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

Think of a batch as a for-loop iterating over one or more samples and making predictions. At the end of the batch, the predictions are compared to the expected output variables and an error is calculated. From this error, the update algorithm is used to improve the model, e.g. move down along the error gradient.

A training dataset can be divided into one or more batches.

When all training samples are used to create one batch, the learning algorithm is called batch gradient descent. When the batch is the size of one sample, the learning algorithm is called stochastic gradient descent. When the batch size is more than one sample and less than the size of the training dataset, the learning algorithm is called mini-batch gradient descent.
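As a rough sketch of how those three cases look in code (my own numpy illustration with made-up data and names; only batch_size changes between the three regimes):

    import numpy as np

    # The same training loop covers all three cases; only batch_size changes.
    # batch_size = len(X) -> batch gradient descent
    # batch_size = 1      -> stochastic gradient descent
    # anything in between -> mini-batch gradient descent
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # made-up data: 100 samples, 3 features
    d = X @ np.array([1.0, -2.0, 0.5])       # made-up targets
    W = np.zeros(3)                          # weights of a single linear model
    alpha = 0.05                             # learning rate
    batch_size = 10

    for epoch in range(20):
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]       # one batch of samples
            db = d[start:start + batch_size]
            errors = db - xb @ W                   # prediction errors for the batch
            grad = -(xb.T @ errors) / len(xb)      # gradient of 0.5 * mean squared error
            W -= alpha * grad                      # update after every batch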

You can read more in Difference Between a Batch and an Epoch in a Neural Network.

$\endgroup$
