I’m following along with this lecture on neural networks. The professor derives the gradient of $e(w)$, i.e. $\frac{\partial e(w)}{\partial w_{ij}^l}$ for every weight $w_{ij}^l$, where $e(w)=e(h(x_n),y_n)$ is the error on a single data point $(x_n,y_n)$ and $w$ denotes the weights of the network.
For a node, $s$ is the input and $x=\theta(s)$ is the output after applying some activation function $\theta$. Here, $0 \leq i \leq d^{l-1}$ indexes the nodes of layer $l-1$ (the inputs to layer $l$), $1 \leq j \leq d^{l}$ indexes the nodes of layer $l$, and the network has layers $1 \leq l \leq L$.
Starting at around the 50:00 mark, $\frac{\partial e(w)}{\partial w_{ij}^l}$ is found to equal $\frac{\partial e(w)}{\partial s_j^l}\frac{\partial s_j^l}{\partial w_{ij}^l} =\delta_j^l x_i^{l-1}$. For the final layer, $\delta_1^L=\frac{\partial e(w)}{\partial s_1^L}$ can be calculated directly, since $e$ is a function of $s_1^L$ and $y_n$. Then, for every layer before the final layer, $\delta_i^{l-1}=\frac{\partial e(w)}{\partial s_i^{l-1}}=\theta'(s_i^{l-1})\sum_{j=1}^{d^l}w_{ij}^l\,\delta_j^l$.
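To check my understanding, here is a small NumPy sketch of these equations for one data point, under my own assumptions (not necessarily the lecture's exact setup): squared error $e = (x_1^L - y_n)^2$, $\tanh$ activation at every layer including the output, and weight matrices `W[l]` whose row 0 holds the bias weight ($x_0 = 1$):

```python
import numpy as np

def theta(s):        # tanh activation (assumed, as in the lecture's examples)
    return np.tanh(s)

def theta_prime(s):  # derivative of tanh
    return 1.0 - np.tanh(s) ** 2

def forward(W, x):
    """W[l-1] has shape (d^{l-1}+1, d^l); row 0 is the bias weight.
    Returns the layer inputs s^l and outputs x^l."""
    xs, ss = [x], [None]
    for Wl in W:
        s = Wl.T @ np.concatenate(([1.0], xs[-1]))  # s_j^l = sum_i w_ij^l x_i^{l-1}
        ss.append(s)
        xs.append(theta(s))
    return ss, xs

def gradients(W, x, y):
    """Gradient of e(w) = (x_1^L - y)^2 on one example via the delta recursion."""
    ss, xs = forward(W, x)
    L = len(W)
    grads = [None] * L
    # final layer: delta_1^L = de/ds_1^L = 2 (x_1^L - y) theta'(s_1^L)
    delta = 2.0 * (xs[L] - y) * theta_prime(ss[L])
    for l in range(L, 0, -1):
        x_prev = np.concatenate(([1.0], xs[l - 1]))
        grads[l - 1] = np.outer(x_prev, delta)      # de/dw_ij^l = delta_j^l x_i^{l-1}
        if l > 1:
            # delta_i^{l-1} = theta'(s_i^{l-1}) * sum_j w_ij^l delta_j^l
            delta = theta_prime(ss[l - 1]) * (W[l - 1][1:] @ delta)
    return grads
```

The gradients returned by the recursion can be sanity-checked against finite differences of the forward pass.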
These equations were derived with stochastic gradient descent in mind (one data point per update), but I'm wondering how they can be modified for mini-batch (or full-batch) gradient descent.