Application of the chain rule to $3$-layers neural network

Question

Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),L^3(x^3,\theta^3)$, where every $x_k,\theta^k$ are real vectors, for $k=1,2,3$. Also define $\theta=(\theta^1,\theta^2,\theta^3)$ (and note that $x$ is not $ (x^1,x^2,x^3)$).

What is the jacobian, with respect to $\theta$, of $f(x,\theta)=L^3(L^2(L^1(x,\theta^1),\theta^2),\theta^3)?$

This question arose when computing the gradients in the backpropagation phase of a neural network, and I got a result that I don't think is correct (or at least it is not the one which backpropagation algorithms use).

Here is my try. Using the chain rule:

$Jf=JL^3 \cdot J(L^2(L^1(x,\theta^1),\theta^2),\theta^3)=JL^3 \begin{pmatrix} J_{x,\theta^1,\theta^2}L^2(L^1(x,\theta^1),\theta^2) & 0\\ 0 & I \end{pmatrix}=\left ( J_{x^3}L^3\cdot J_{x,\theta^1,\theta^2}L^2(L^1(x,\theta^1),\theta^2)\middle |J_{\theta^3}L^3\right )$

Hence $Jf=\left ( J_{x^3}L^3\cdot JL^2(L^1(x,\theta^1),\theta^2)\middle |J_{\theta^3}L^3\right )$, and by the above reasoning: $$Jf=\left ( J_{x^3}L^3\cdot \left (J_{x^2}L^2\cdot JL^1\middle | J_{\theta^2}L^2 \right )\middle |J_{\theta^3}L^3\right )=\left ( J_{x^3}L^3\cdot J_{x^2}L^2\cdot JL^1 \middle | J_{x^3}L^3 \cdot J_{\theta^2}L^2 \middle | J_{\theta^3}L^3\right )$$

I must conclude that $J_\theta f=\left ( J_{x^3}L^3\cdot J_{x^2}L^2\cdot J_{\theta^1}L^1 \middle | J_{x^3}L^3 \cdot J_{\theta^2}L^2 \middle | J_{\theta^3}L^3\right )$: is this correct?

I see now, you probably can't fully understand my answer to the other question if you aren't able to derive back-propagation. I'll help you out later today (in a few hours). — f10w, Commented Oct 12, 2020 at 10:13
Sorry I was taken by some important work. I'm writing an answer now. — f10w, Commented Oct 14, 2020 at 14:27

f10w · Accepted Answer · 2020-10-14 16:29:09Z

Let me derive backpropagation for the above 3-layer feedforward neural network. For more general networks (e.g. residual ones), the idea is similar and I'll give some hints for dealing with those at the end of this answer.

Suppose that we feed an input $x\in\mathbb{R}^d$ to the network to produce an output loss value $L\in\mathbb{R}$ (given by a differentiable loss function $\ell$): \begin{align} x_1 &= f_1(x, \theta_1),\\ x_2 &= f_2(x_1, \theta_2),\\ x_3 &= f_3(x_2, \theta_3),\\ L &= \ell(x_3). \end{align} The goal is to compute the gradient $\left(\frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \frac{\partial L}{\partial \theta_3}\right)$.

Some notes on the notation:

I always prefer subscripts than superscripts whenever possible, to avoid confusion with exponents. In our case, it's possible so I use subscripts.
We could stop at $x_3$ and derive backpropagation from it, but I added a scalar loss function to reflect the real use case, and also for you to understand better my answer to your other question.
I use the explicit partial derivatives instead of your Jacobian notation, because on one hand, I find yours error prone, and on the other hand, many deep learning references seem to use the same notation as what I'm using so hopefully this will make you more confortable reading them later.

Now let's backprop. The computation graph in this case is simple: $\require{AMScd}$ \begin{CD} x @>>> x_1 @>>> x_2 @>>> x_3 @>>> L\\ @. @AAA @AAA @AAA\\ @. \theta_1 @. \theta_2 @. \theta_3 \end{CD} We go backward from the last node and use the chain rule: \begin{align} \frac{\partial L}{\partial x_3} &= \ell'(x_3),\\ \frac{\partial L}{\partial \theta_3} &= \frac{\partial x_3}{\partial \theta_3}\frac{\partial L}{\partial x_3},\\ \frac{\partial L}{\partial x_2} &= \frac{\partial x_3}{\partial x_2}\frac{\partial L}{\partial x_3},\\ \frac{\partial L}{\partial \theta_2} &= \frac{\partial x_2}{\partial \theta_2}\frac{\partial L}{\partial x_2},\\ \frac{\partial L}{\partial x_1} &= \frac{\partial x_2}{\partial x_1}\frac{\partial L}{\partial x_2},\\ \frac{\partial L}{\partial \theta_1} &= \frac{\partial x_1}{\partial \theta_1}\frac{\partial L}{\partial x_1}. \end{align} That was the exact order of how the derivative computation should be implemented. One can plug the intermediate terms into the main terms (i.e. the ones w.r.t. $\theta$) to get direct formula, for example \begin{equation}\label{Lt1} \frac{\partial L}{\partial \theta_1} = \frac{\partial x_1}{\partial \theta_1} \frac{\partial x_2}{\partial x_1} \frac{\partial x_3}{\partial x_2}\frac{\partial L}{\partial x_3}. \tag{1} \end{equation} However, this should not be used for an implementation as it will re-compute same quantities multiple times.

Some other important notes, again on the notation:

Because $x_3 = f_3(x_2, \theta_3)$, one can view $x_3$ as a function of $x_2$ and $\theta_3$, i.e. $x_3 := x_3(x_2, \theta_3)$ and thus $\frac{\partial x_3}{\partial \theta_3}$ denotes the partial derivative of that function with respect to $\theta_3$ (evaluated at $(x_2, \theta_3)$). Similarly for the other quantities. Using the Jacobian notation, \eqref{Lt1} for example can be written as: \begin{align} \frac{\partial L}{\partial \theta_1} &= (J_{\theta_1} f_1) (x, \theta_1) \times (J_{x_1} f_2) (x_1, \theta_2) \times (J_{x_2} f_3) (x_2, \theta_3) \times (J \ell)(x_3). \end{align} If with slight abuse of notation we omit the values at which the functions are evaluated, then the above become: \begin{align} \frac{\partial L}{\partial \theta_1} &= J_{\theta_1} f_1 \times (J_{x_1} f_2) \times (J_{x_2} f_3) \times (J \ell). \end{align} I hope this makes it easier for you because you seem to be familiar with the Jacobian notation. You can now easily compare this result with yours.
A more rigorous presentation should use the total derivative notation instead of using $\partial$ everywhere, like what I have done (and like in many deep learning references). For example, one should write: $\newcommand{\dv}[1]{\operatorname{d}\!{#1}}$ \begin{equation} \frac{\dv{L}}{\dv{\theta_1}} = \frac{\partial x_1}{\partial \theta_1}\frac{\dv{L}}{\dv{x_1}}. \end{equation}

For a more general computation graph, the principle is the same. One has to compute recursively $\frac{\dv{L}}{\dv{x_i}}$ for every nodes $x_i$ of the graph. The recursion lies in the fact that, the derivative with respect to one node can be computed once the derivatives of all its children have been computed, using the chain rule: \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\partial x_j}{\partial x_i} \frac{\dv{L}}{\dv{x_j}}. \end{equation}

Let me know if you have further questions.

Finally, I understood. What made me really understand was the difference between partial and total derivative, even if it was a minor remark. Thank you very much! — Lilla, Commented Oct 17, 2020 at 16:22

supinf · Accepted Answer · 2020-10-13 17:15:24Z

Your final expression and solution are (essentially) correct.

First, a remark on notation: If one wants to be more accurate, one would also add the point for Jacobi derivatives, for example some people write $(J_\theta f)(x,\theta)$ or something similar to refer to the partial Jacobi derivative of $f$ with respect to $\theta$ at the point $(x,\theta)$.

You should also be careful with notation such as $JL^2(L^1(x,\theta^1),\theta^2)$. Here, you use it to describe the Jacobi matrix of the function $h(x,\theta^1,\theta^2):=L^2(L^1(x,\theta^1),\theta^2)$, so it would be good to explain this. Other people could be confused, because to them it could mean the Jacobi matrix of $L^2$ at the point $(L^1(x,\theta^1),\theta^2)$, which would be a different mathematical object.

But in my opinion, its ok to use alternative notation if you explain what you use (or use it in a context where everyone uses the same notation).

An alternative solution would be to start with $$ J_\theta f= \begin{pmatrix} J_{\theta^1}f & J_{\theta^2}f & J_{\theta^3}f \end{pmatrix} $$ and then continue with calculating the partial Jacobi derivatives. This approach is not necessarily faster, but maybe it is easier to be convinced that your result is correct with this approach, because you have to "fight" fewer variables at once.

$\begingroup$ Totally agreed! $\endgroup$
– f10w
Commented Oct 14, 2020 at 16:55 — f10w, Commented Oct 14, 2020 at 16:55

paperskilltrees · Accepted Answer · 2020-10-14 17:26:57Z

3

Khue gave a good answer, and this is a comment to it.

Khue seems to be using the alternative convention for Jacobians, in which everything is transposed (cf. gradient vs derivative): $\newcommand{\dv}[1]{\operatorname{d}\!{#1}}$ \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\partial x_j}{\partial x_i} \frac{\dv{L}}{\dv{x_j}}. \end{equation}

The author of the question, warm_fish, uses the standard definition, in which case the chain rule is \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\dv{L}}{\dv{x_j}} \frac{\partial x_j}{\partial x_i}. \end{equation}

[need 50 reputation score to comment, such an annoyance! Please bump my rep up, so that I could write comments as comments!]

answered Oct 14, 2020 at 17:26

paperskilltrees

7026 silver badges19 bronze badges

1

$\begingroup$ Haha good catch! Yes I always use the "denominator layout" instead of the "numerator" one. That is to say, I prefer writing $\nabla f(x)^\top (y-x)$ than $\nabla f(x) (y-x)$. You've got my upvotes for all your answers, enough for posting comments now ;) $\endgroup$
– f10w
Commented Oct 14, 2020 at 17:37
$\begingroup$ P/s: Sorry the $\nabla f$ part in my above comment is not relevant (not sure why I brought that up, it makes no sense in view of your comment). Nice reminder about gradient vs derivative! $\endgroup$
– f10w
Commented Oct 14, 2020 at 18:25

Add a comment |

Stack Exchange Network

Application of the chain rule to $3$-layers neural network

3 Answers 3

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
real-analysis
statistics
multivariable-calculus
machine-learning
neural-networks
.

Linked

Hot Network Questions

Application of the chain rule to $3$-layers neural network

3 Answers 3

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged real-analysisstatisticsmultivariable-calculusmachine-learningneural-networks.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
real-analysis
statistics
multivariable-calculus
machine-learning
neural-networks
.