
I'm following along with this lecture on neural networks. The professor derives equations for the gradient of $e(w)$: $\frac{\partial e(w)}{\partial w_{ij}^l}$ for every $w_{ij}^l$, where $e(w)=e(h(x_n),y_n)$ is the error on a single data point $(x_n,y_n)$ and $w$ are the weights of the network.

For a node, $s$ is the input and $x=\theta(s)$ is the output after applying the activation function $\theta$. Here, $0 \leq i \leq d^{l-1}$ indexes the nodes of the layer feeding in (layer $l-1$), $1 \leq j \leq d^{l}$ indexes the nodes of the receiving layer $l$, and the layers are indexed by $1 \leq l \leq L$.

Starting at around the 50:00 mark, $\frac{\partial e(w)}{\partial w_{ij}^l}$ is found to equal $\frac{\partial e(w)}{\partial s_j^l}\frac{\partial s_j^l}{\partial w_{ij}^l} =\delta_j^l x_i^{l-1}$. For the final layer, $\delta_1^L=\frac{\partial e(w)}{\partial s_1^L}$ can be computed directly, since $e$ is a function of $s_1^L$ and $y_n$. Then, for every layer before the final one, $\delta_i^{l-1}=\frac{\partial e(w)}{\partial s_i^{l-1}}=\theta'(s_i^{l-1})\sum_{j=1}^{d^l} w_{ij}^l\,\delta_j^l$.
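
For concreteness, here is a minimal NumPy sketch of these per-sample equations. The tanh activation, the squared-error loss $e=(x^L-y)^2$, the bias handling (row 0 of each weight matrix), and the `forward`/`backward` helper names are my own illustrative assumptions, not something taken from the lecture:

```python
import numpy as np

def forward(x, W):
    """W[l] has shape (d^{l-1} + 1, d^l); row 0 holds the bias weights w_{0j}^l."""
    xs = [np.concatenate(([1.0], x))]          # x^0 with the constant 1 prepended
    ss = []
    for l, Wl in enumerate(W):
        s = xs[-1] @ Wl                        # s_j^l = sum_i w_{ij}^l x_i^{l-1}
        ss.append(s)
        x_next = np.tanh(s)                    # x_j^l = theta(s_j^l)  (tanh assumed)
        if l < len(W) - 1:
            x_next = np.concatenate(([1.0], x_next))
        xs.append(x_next)
    return xs, ss

def backward(xs, ss, W, y):
    """Per-sample gradients dE/dw_{ij}^l for one (x_n, y_n), squared error assumed."""
    L = len(W)
    grads = [None] * L
    # Final layer: delta^L = dE/ds^L for e = (tanh(s^L) - y)^2
    delta = 2.0 * (np.tanh(ss[-1]) - y) * (1.0 - np.tanh(ss[-1]) ** 2)
    for l in range(L - 1, -1, -1):
        grads[l] = np.outer(xs[l], delta)      # dE/dw_{ij}^l = x_i^{l-1} delta_j^l
        if l > 0:
            # delta_i^{l-1} = theta'(s_i^{l-1}) * sum_j w_{ij}^l delta_j^l
            # (row 0 of W[l] multiplies the constant bias input, so drop it)
            delta = (1.0 - np.tanh(ss[l - 1]) ** 2) * (W[l][1:, :] @ delta)
    return grads

# Example usage with made-up sizes: 2 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 4)), rng.normal(size=(5, 1))]
xs, ss = forward(np.array([0.3, -0.7]), W)
grads = backward(xs, ss, W, y=1.0)   # grads[l] has the same shape as W[l]
```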

These equations were derived with stochastic gradient descent in mind, but I'm wondering how they can be modified for mini-batch (or batch) gradient descent?

  • The changes you have calculated for each training example just need to be accumulated over many training examples (the size of the batch) so that the total change can be applied every few examples.
    – ajax2112
    Commented Apr 6, 2020 at 7:45
  • @ajax2112 My question is how this accumulation is computed: for a batch size $B$, do we accumulate the gradient for each point in the batch (after $B$ backpropagations) and then average them, or is it something more complicated?
    – Yandle
    Commented Apr 6, 2020 at 18:05
  • Yes, you accumulate the gradient for each point across the batch. However, it is common to take the sum of all of these gradients rather than an average; since some of the values will be positive and some negative, you can treat them like a sum of vectors.
    – ajax2112
    Commented Apr 6, 2020 at 23:34

1 Answer


Let $e_k$ be the error when you input the $k$-th training sample in a batch of size $B$. The gradient calculation will be modified as follows:

$$\nabla_w (e)=\sum_{k=1}^B\nabla_w(e_k)$$

When $B=1$ this is the SGD update; when $B$ is the full training set size it is the batch update; and for anything in between it is mini-batch GD.
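
As a rough sketch of how that accumulation looks in code (assuming per-sample `forward`/`backward` routines like the ones sketched in the question; the helper names and the `average` option are illustrative, not prescriptive):

```python
import numpy as np

def minibatch_gradient(batch, W, average=False):
    """Sum (or average) the per-sample gradients over a mini-batch of (x, y) pairs."""
    total = [np.zeros_like(Wl) for Wl in W]
    for x, y in batch:
        xs, ss = forward(x, W)                 # one forward pass per sample
        grads = backward(xs, ss, W, y)         # one backward pass per sample
        for g_tot, g in zip(total, grads):
            g_tot += g                         # accumulate sum_k grad(e_k)
    if average:
        total = [g / len(batch) for g in total]
    return total

# One mini-batch GD step with learning rate eta:
# W = [Wl - eta * g for Wl, g in zip(W, minibatch_gradient(batch, W))]
```

Summing versus averaging only rescales the gradient, and that scale can be absorbed into the learning rate.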

  • Are we doing backpropagation for each point individually and then averaging each $\frac{\partial e(w)}{\partial w_{ij}^l}$ over all points in the batch? So for a batch size $B$, we do both the forward pass and backpropagation $B$ times, but the weights are updated at a different frequency, as determined by $B$?
    – Yandle
    Commented Apr 6, 2020 at 17:57
  • Correct @Yandle, average or sum (it doesn't matter, because we can adjust the learning rate accordingly). And the weights are updated when the batch is finished.
    – gunes
    Commented Apr 7, 2020 at 13:53
  • One question: the weight derivative in stochastic gradient descent is the product of the delta for the next layer and the input to the current layer, $\delta_j^l x_i^{l-1}$, as the OP mentioned in the post. When moving to batch gradient descent, which input should we choose? The $x$ part can take multiple values, so do we average these as well? Commented Dec 20, 2022 at 14:28
  • Each summand (gradient) here is obtained for a single input, and the gradients are accumulated/averaged. The expression $\nabla_w(e_k)$ has the input expressions in it, so this is already accounted for there.
    – gunes
    Commented Dec 21, 2022 at 8:55
