
How does gradient descent work when training a neural network with mini-batches (i.e., sampling a subset of the training set at each step)? I can think of three different possibilities:

  1. Epoch starts. We sample and feedforward one minibatch only, get the error and backprop it, i.e. update the weights. Epoch over.

  2. Epoch starts. We sample and feedforward a minibatch, get the error and backprop it, i.e. update the weights. We repeat this until we have sampled the full data set. Epoch over.

  3. Epoch starts. We sample and feedforward a minibatch, get the error and store it. We repeat this until we have sampled the full data set. We somehow average the errors and backprop them by updating the weights. Epoch over.


1 Answer

Mini-batch gradient descent is implemented essentially as you describe in option 2.

  1. Epoch starts. We sample and feedforward a minibatch, get the error and backprop it, i.e. update the weights. We repeat this until we have sampled the full data set. Epoch over.

Assume the network is minimizing the following regularized objective function: $$ \frac{\lambda}{2}||\theta||^2 + \frac{1}{n}\sum_{i=1}^n E(x^{(i)}, y^{(i)}, \theta) $$

Then the weight-update step for a mini-batch starting at example $i$ is essentially

$$ \theta = (1 - \alpha \lambda) \theta - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1} \frac{\partial E}{\partial \theta}(x^{(k)}, y^{(k)}, \theta) $$

where the following symbols mean:

$E$ = the error measure (also sometimes denoted as cost measure $J$)

$\theta$ = weights

$\alpha$ = learning rate

$\lambda$ = weight-decay (regularization) coefficient, so the factor $(1 - \alpha \lambda)$ shrinks the weights at every step

$b$ = batch size

$x, y$ = a training input and its target

You loop over consecutive batches (i.e., increment $i$ by $b$) and update the weights after each one. These more frequent weight updates, combined with vectorized computation over each batch, are why mini-batch gradient descent tends to converge more quickly than either full-batch or purely stochastic (single-example) methods.
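The loop above can be sketched in NumPy. The linear model, squared-error loss, and all variable names here are illustrative assumptions (not something from the answer); the update line itself matches the formula, with $(1 - \alpha\lambda)$ applied as weight decay and the gradient averaged over the batch:

```python
import numpy as np

# Toy data: a linear model with squared error, used purely for
# illustration. The update rule is the same for any differentiable model.
rng = np.random.default_rng(0)
n, d = 100, 5                      # training-set size, input dimension
X = rng.normal(size=(n, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.01 * rng.normal(size=n)

theta = np.zeros(d)                # weights (theta)
alpha = 0.1                        # learning rate
lam = 1e-4                         # weight-decay coefficient (lambda)
b = 20                             # batch size

for epoch in range(50):
    perm = rng.permutation(n)      # shuffle so each epoch covers the full set
    for i in range(0, n, b):       # loop over consecutive batches of size b
        idx = perm[i:i + b]
        Xb, yb = X[idx], y[idx]
        # gradient of the squared error, averaged over the batch
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)
        # theta <- (1 - alpha * lambda) * theta - alpha * (batch gradient)
        theta = (1 - alpha * lam) * theta - alpha * grad
```

Note that each epoch still visits every training example exactly once (option 2 in the question); the weights are simply updated once per batch rather than once per epoch.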

  • Two comments: 1) the update rule $\theta_j = ...$ assumes a particular loss function the way that you've written it. I suggest defining the update rule using $\nabla h_0(x)$ instead so that it is generic. 2) the update rule does not have a weight decay (also for the sake of generality); I would write it with the weight decay.
    – Sobi
    Commented Dec 18, 2015 at 19:39
  • @Sobi, good point about $\nabla h_0(x)$, I am just so used to that assumed loss function. However, I am not sure exactly what you mean by weight decay? Are you referring to the division by the batch size? Feel free to provide such a generalized function in an edit.
    – cdeterman
    Commented Dec 18, 2015 at 19:47
  • Thanks, I realize you have used a quadratic error function; that's not a problem. Could you maybe refer me to a publication that discusses this in detail?
    – Alex
    Commented Dec 18, 2015 at 20:10
  • @Sobi, regarding your edit, is $n$ supposed to be the batch size? Also, I am unfamiliar with the $\ell_2$-norm notation.
    – cdeterman
    Commented Dec 18, 2015 at 21:06
  • @cdeterman, $n$ is the number of all training samples. Note that $n$ only appears in the original objective function (i.e. $\frac{\lambda}{2}||\theta||^2 + ...$).
    – Sobi
    Commented Dec 18, 2015 at 21:09
