
I'm following along with this lecture on neural networks. The professor derives equations for the gradient of $e(w)$: $\frac{\partial e(w)}{\partial w_{ij}^l}$ for every $w_{ij}^l$, where $e(w)=e(h(x_n),y_n)$ is the error on a single data point $(x_n,y_n)$ and $w$ are the weights of the network.

For a node, $s$ is the input and $x=\theta(s)$ is the output after applying the activation function $\theta$. Here, $0 \leq i \leq d^{l-1}$ indexes the nodes of the layer feeding in (layer $l-1$), $1 \leq j \leq d^{l}$ indexes the nodes of the receiving layer $l$, and the layers are indexed by $1 \leq l \leq L$.

Starting at around the 50:00 mark, $\frac{\partial e(w)}{\partial w_{ij}^l}$ is found to equal $\frac{\partial e(w)}{\partial s_j^l}\frac{\partial s_j^l}{\partial w_{ij}^l} =\delta_j^l x_i^{l-1}$. For the final layer, $\delta_1^L=\frac{\partial e(w)}{\partial s_1^L}$ can be computed directly, since $e$ is a function of $s_1^L$ and $y_n$. Then, for every layer before the final one, $\delta_i^{l-1}=\frac{\partial e(w)}{\partial s_i^{l-1}}=\theta'(s_i^{l-1})\sum_{j=1}^{d^l} w_{ij}^l\,\delta_j^l$.
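
For concreteness, here is a minimal NumPy sketch of these per-sample equations. The tanh activation, the squared-error loss $e=(x^L-y)^2$, the bias handling (row 0 of each weight matrix), and the `forward`/`backward` helper names are my own illustrative assumptions, not something taken from the lecture:

```python
import numpy as np

def forward(x, W):
    """W[l] has shape (d^{l-1} + 1, d^l); row 0 holds the bias weights w_{0j}^l."""
    xs = [np.concatenate(([1.0], x))]          # x^0 with the constant 1 prepended
    ss = []
    for l, Wl in enumerate(W):
        s = xs[-1] @ Wl                        # s_j^l = sum_i w_{ij}^l x_i^{l-1}
        ss.append(s)
        x_next = np.tanh(s)                    # x_j^l = theta(s_j^l)  (tanh assumed)
        if l < len(W) - 1:
            x_next = np.concatenate(([1.0], x_next))
        xs.append(x_next)
    return xs, ss

def backward(xs, ss, W, y):
    """Per-sample gradients dE/dw_{ij}^l for one (x_n, y_n), squared error assumed."""
    L = len(W)
    grads = [None] * L
    # Final layer: delta^L = dE/ds^L for e = (tanh(s^L) - y)^2
    delta = 2.0 * (np.tanh(ss[-1]) - y) * (1.0 - np.tanh(ss[-1]) ** 2)
    for l in range(L - 1, -1, -1):
        grads[l] = np.outer(xs[l], delta)      # dE/dw_{ij}^l = x_i^{l-1} delta_j^l
        if l > 0:
            # delta_i^{l-1} = theta'(s_i^{l-1}) * sum_j w_{ij}^l delta_j^l
            # (row 0 of W[l] multiplies the constant bias input, so drop it)
            delta = (1.0 - np.tanh(ss[l - 1]) ** 2) * (W[l][1:, :] @ delta)
    return grads

# Example usage with made-up sizes: 2 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 4)), rng.normal(size=(5, 1))]
xs, ss = forward(np.array([0.3, -0.7]), W)
grads = backward(xs, ss, W, y=1.0)   # grads[l] has the same shape as W[l]
```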

These equations were derived with stochastic gradient descent in mind, but I'm wondering how they can be modified for mini-batch (or batch) gradient descent?

  • The changes you have calculated for each training example just need to be accumulated over many training examples (the size of the batch) so that the total change can be applied every few examples.
    – ajax2112
    Commented Apr 6, 2020 at 7:45
  • @ajax2112 My question is how this accumulation is computed: for a batch size $B$, do we accumulate the gradient for each point in the batch (after $B$ backpropagations) and then average them, or is it something more complicated?
    – Yandle
    Commented Apr 6, 2020 at 18:05
  • Yes, you accumulate the gradient for each point across the batch. However, it is common to take the sum of all of these gradients rather than an average; since some of the values will be positive and some negative, you can treat them like a sum of vectors.
    – ajax2112
    Commented Apr 6, 2020 at 23:34

1 Answer


Let $e_k$ be the error when you input the $k$-th training sample in a batch of size $B$. The gradient calculation will be modified as follows:

$$\nabla_w (e)=\sum_{k=1}^B\nabla_w(e_k)$$

When $B=1$ this is the SGD update; when $B$ is the full training set size it is the batch update; and for anything in between it is mini-batch GD.
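
As a rough sketch of how that accumulation looks in code (assuming per-sample `forward`/`backward` routines like the ones sketched in the question; the helper names and the `average` option are illustrative, not prescriptive):

```python
import numpy as np

def minibatch_gradient(batch, W, average=False):
    """Sum (or average) the per-sample gradients over a mini-batch of (x, y) pairs."""
    total = [np.zeros_like(Wl) for Wl in W]
    for x, y in batch:
        xs, ss = forward(x, W)                 # one forward pass per sample
        grads = backward(xs, ss, W, y)         # one backward pass per sample
        for g_tot, g in zip(total, grads):
            g_tot += g                         # accumulate sum_k grad(e_k)
    if average:
        total = [g / len(batch) for g in total]
    return total

# One mini-batch GD step with learning rate eta:
# W = [Wl - eta * g for Wl, g in zip(W, minibatch_gradient(batch, W))]
```

Summing versus averaging only rescales the gradient, and that scale can be absorbed into the learning rate.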

  • Are we doing backpropagation for each point individually and then averaging each $\frac{\partial e(w)}{\partial w_{ij}^l}$ over all points in the batch? So for a batch size $B$, we do both the forward pass and backpropagation $B$ times, but the weights are updated at a different frequency, as determined by $B$?
    – Yandle
    Commented Apr 6, 2020 at 17:57
  • Correct @Yandle, average or sum (it doesn't matter, because we can adjust the learning rate accordingly). And the weights are updated when the batch is finished.
    – gunes
    Commented Apr 7, 2020 at 13:53
  • One question: the weight derivative in stochastic gradient descent is the product of the delta for the next layer and the input to the current layer, $\delta_j^l x_i^{l-1}$, as the OP mentioned in the post. When moving to batch gradient descent, which input should we choose? The $x$ part can take multiple values, so do we average these as well? Commented Dec 20, 2022 at 14:28
  • Each summand (gradient) here is obtained for a single input, and the gradients are accumulated/averaged. The expression $\nabla_w(e_k)$ has the input expressions in it, so this is already accounted for there.
    – gunes
    Commented Dec 21, 2022 at 8:55
