
How does back-propagation handle multiple different activation functions?

For example, in a neural network with 3 hidden layers, each using a different activation function such as tanh, sigmoid, and ReLU, the derivatives of these functions are all different. Do we just compute the error of each layer using the derivative of that layer's activation function?

Also, will the gradient of the cost function used to compute the output error of the last layer change at all, or is the activation function of that layer also involved in computing the gradient of the cost function?


1 Answer


In short, each activation function in the backpropagation algorithm is evaluated independently through the chain rule, so you can mix and match to your heart's content.


What are we optimizing in backpropagation?

Backpropagation allows you to update your weights using the gradient of the resulting loss, which drives the loss toward a (local) minimum. After each forward pass during training, you get an output at the last layer and calculate the resulting loss $E$.

The effect of each weight on the final loss is computed using its partial derivative. In other words, this is how much of the error can be attributed to that weight. The larger this value is (in magnitude), the more the weight will change to correct itself during training.

$\frac{\partial E}{\partial w^k_{i, j}}$

How can we compute such a partial derivative? Using the chain rule of derivatives, putting together everything that led to our output during the forward pass. Let's look at what led to our output before getting into the backpropagation.

The forward pass

In the final layer of a 3-layer neural network ($k = 3$), the output ($o$) is a function ($\phi$) of the outputs of the previous layer ($o^2$) and the weights feeding into the final layer ($w^3$, where $w^k$ denotes the weights going into layer $k$).

$y_1 = o^3_1 = \phi(a^3_1) = \phi(\sum_{l=1}^n w^3_{l,1}o^2_l)$

The function $\phi$ is the activation function for the current layer, typically chosen to be something with an easy-to-calculate derivative.

You can then see that the previous layer's outputs are calculated in the same way:

$o^2_1 = \phi(a^2_1) = \phi(\sum_{l=1}^n w^2_{l,1}o^1_l)$

So the outputs of the third layer can also be written as a function of the outputs of layer 1 by substituting in the outputs of layer 2. This point becomes important for how backpropagation propagates the error back along the network.
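
To make the forward pass concrete, here is a minimal NumPy sketch of such a 3-layer network with a different activation per layer (tanh, sigmoid, ReLU). The layer sizes, seed, and variable names (`phis`, `W`, `o`, `a`) are illustrative only, not part of the derivation above.

```python
# A minimal sketch of the forward pass, assuming NumPy and the naming below;
# the layer sizes and random seed are illustrative only.
import numpy as np

def tanh(a):    return np.tanh(a)
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))
def relu(a):    return np.maximum(0.0, a)

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                       # input, two hidden layers, output
phis = [tanh, sigmoid, relu]               # one activation per layer k = 1, 2, 3
# W[k-1] holds the weights w^k feeding into layer k (shape: fan-in x fan-out)
W = [0.1 * rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=sizes[0])
o, a = [x], []                             # o: layer outputs, a: pre-activations
for Wk, phi in zip(W, phis):
    a.append(o[-1] @ Wk)                   # a^k_j = sum_i w^k_{i,j} o^{k-1}_i
    o.append(phi(a[-1]))                   # o^k_j = phi(a^k_j)

y = o[-1]                                  # network output y_1 = o^3_1
```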

Backpropagation

The partial derivative of the error with respect to a weight is broken down using the chain rule into

$\frac{\partial E}{\partial w^k_{i, j}} = \frac{\partial E}{\partial o^k_{j}} \frac{\partial o^k_{j}}{\partial a^k_{j}} \frac{\partial a^k_{j}}{\partial w^k_{i,j}}$.
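
As a concrete example, for the first output node of the last layer, with a logistic activation and the squared-error loss used below, the three factors work out to

$\frac{\partial E}{\partial w^3_{i,1}} = \underbrace{(y_1 - \hat{y}_1)}_{\partial E/\partial o^3_{1}}\;\underbrace{\phi(a^3_1)(1-\phi(a^3_1))}_{\partial o^3_{1}/\partial a^3_{1}}\;\underbrace{o^2_i}_{\partial a^3_{1}/\partial w^3_{i,1}}$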

Let us look at each of these terms separately.

1. $\frac{\partial E}{\partial o^k_{j}}$

is how sensitive the error is to the output of the current layer. For the last layer, using a squared-error (L2) loss, the error term for the first output node is

$\frac{\partial E}{\partial o^3_{1}} = \frac{\partial E}{\partial y_{1}} = \frac{\partial }{\partial y_{1}} 1/2(\hat{y}_1-y_1)^2 = y_1 - \hat{y}_1$

In words, this is how far our result, $y_1$, is from the actual target $\hat{y}_1$.

The same term exists for all earlier layers, where we need to substitute in the errors propagating back through the network. This is written as

$\frac{\partial E}{\partial o^k_{j}} = \sum_{l \in L} \left(\frac{\partial E}{\partial o^{k+1}_{l}} \frac{\partial o^{k+1}_{l}}{\partial a^{k+1}_{l}} w^{k+1}_{j,l}\right)$

where $L$ is the set of all neurons in the next layer $k+1$.
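
As a sketch of what this recursion looks like in code (the function and argument names are assumptions for illustration, with weights stored fan-in × fan-out as in the forward-pass sketch above):

```python
# Sketch of propagating dE/do back one layer (term 1); W_next holds the
# weights w^{k+1} into layer k+1, a_next its pre-activations, and dphi_next
# the derivative of that layer's activation function.
import numpy as np

def backprop_error(dE_do_next, a_next, dphi_next, W_next):
    # delta^{k+1}_l = dE/do^{k+1}_l * dphi(a^{k+1}_l)  (terms 1 and 2 of layer k+1)
    delta_next = dE_do_next * dphi_next(a_next)
    # dE/do^k_j = sum_l delta^{k+1}_l * w^{k+1}_{j,l}
    return W_next @ delta_next
```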

2. $\frac{\partial o^k_{j}}{\partial a^k_{j}}$

This is where the current layer's activation function makes a difference, because we are taking the derivative of the output with respect to its input, and the output is related to the input through the activation function $\phi$.

$\frac{\partial o^k_{j}}{\partial a^k_{j}} = \frac{\partial \phi(a^k_{j})}{\partial a^k_{j}}$

So just take the derivative of the activation function. For the logistic function this is easy; it is

$\frac{\partial o^k_{j}}{\partial a^k_{j}} = \frac{\partial \phi(a^k_{j})}{\partial a^k_{j}} = \phi(a^k_j)(1-\phi(a^k_j))$
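
The same step works for any activation whose derivative you can write down. For example (the tanh and ReLU cases are added here for illustration; only the logistic one appears above):

```python
# Derivatives of a few common activations, each written in terms of the
# pre-activation a^k_j, so they can be swapped per layer.
import numpy as np

def dsigmoid(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)              # phi(a) * (1 - phi(a))

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2      # 1 - tanh(a)^2

def drelu(a):
    return (a > 0).astype(float)      # 1 where a > 0, else 0 (taken as 0 at a = 0)
```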

3. $\frac{\partial a^k_{j}}{\partial w^k_{i,j}}$

$a^k_j$ is simply a linear combination of the weights and the previous layer's outputs. Thus,

$\frac{\partial a^k_{j}}{\partial w^k_{i,j}} = o^{k-1}_i$
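
In code, once the first two terms are collapsed into a per-neuron value $\delta^k_j = \frac{\partial E}{\partial o^k_{j}} \frac{\partial o^k_{j}}{\partial a^k_{j}}$, this last term turns it into a weight gradient via an outer product with the previous layer's outputs (a sketch; the names are illustrative):

```python
# dE/dw^k_{i,j} = delta^k_j * o^{k-1}_i, i.e. an outer product over i and j.
import numpy as np

def weight_gradient(o_prev, delta):
    return np.outer(o_prev, delta)    # shape: (fan-in, fan-out), same as w^k
```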

Finally

You can see that the activation functions of your layers are evaluated separately in the backpropagation algorithm. They are just added onto your ever-growing chain as independent terms within the chain rule.
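
To see the mix-and-match point end to end, here is a self-contained sketch of one forward and one backward pass through a network that uses tanh, sigmoid, and ReLU in its three layers, with the squared-error loss from above. The only thing that changes per layer is which derivative is plugged in for term 2; the sizes, seed, target, and learning rate are illustrative.

```python
# Hedged end-to-end sketch: backpropagation with a different activation per layer.
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

# (activation, derivative) pairs for layers k = 1, 2, 3: tanh, sigmoid, ReLU.
acts = [(np.tanh,                      lambda a: 1.0 - np.tanh(a) ** 2),
        (sigmoid,                      lambda a: sigmoid(a) * (1.0 - sigmoid(a))),
        (lambda a: np.maximum(0.0, a), lambda a: (a > 0).astype(float))]

def forward(x, W):
    o, a = [x], []
    for Wk, (phi, _) in zip(W, acts):
        a.append(o[-1] @ Wk)                   # a^k = o^{k-1} w^k
        o.append(phi(a[-1]))                   # o^k = phi(a^k)
    return o, a

def backward(o, a, W, y_hat):
    dE_do = o[-1] - y_hat                      # dE/do^3 for E = 1/2 (y_hat - y)^2
    grads = [None] * len(W)
    for k in reversed(range(len(W))):          # walk back from the last layer
        delta = dE_do * acts[k][1](a[k])       # terms 1 and 2, with THIS layer's derivative
        grads[k] = np.outer(o[k], delta)       # term 3: outer product with previous outputs
        dE_do = W[k] @ delta                   # propagate the error to the layer below
    return grads

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]
W = [0.1 * rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x, y_hat = rng.normal(size=sizes[0]), np.array([1.0])

o, a = forward(x, W)
grads = backward(o, a, W, y_hat)
W = [Wk - 0.1 * g for Wk, g in zip(W, grads)]  # one gradient-descent step on every weight
```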

  • Comprehensive answer. I think OP might be getting confused due to the simple shortcut derivations of your section 2 combined with the cost function that is applied in the output layer. For instance, when the logistic function is paired with cross-entropy loss, the derivative wrt neuron inputs is just $y_1 - \hat{y}_1$, so it may look to a beginner like the derivative of the activation function is not part of things, when actually it has just neatly cancelled out. Commented Jan 12, 2018 at 8:29
