
In machine learning, I'm optimizing a parameter matrix $W$.

The loss function is $$L=f(y),$$ where $L$ is a scalar, $y=Wx$, $x\in \mathbb{R}^n$, $y\in \mathbb{R}^m$, and $W$ is an $m\times n$ matrix.

In math textbooks one usually sees $$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial x}=\frac{\partial L}{\partial y}W,$$ where $\dfrac{\partial L}{\partial y}$ is a $1\times m$ row vector. This is quite easy to understand.

However, in machine learning $x$ is the input and $W$ is the parameter matrix to optimize, so the chain rule would give $$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial W}.$$ But what is $\dfrac{\partial y}{\partial W}$? Is it simply $x$? Is that correct?

According to Wikipedia, the derivative of a scalar with respect to a matrix is itself a matrix:

\begin{equation*} \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial W_{11}} & \frac{\partial L}{\partial W_{21}} & \cdots & \frac{\partial L}{\partial W_{m1}} \\ \frac{\partial L}{\partial W_{12}} & \frac{\partial L}{\partial W_{22}} & \cdots & \frac{\partial L}{\partial W_{m2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial W_{1n}} & \frac{\partial L}{\partial W_{2n}} & \cdots & \frac{\partial L}{\partial W_{mn}} \end{pmatrix} \end{equation*}

where $$\frac{\partial L}{\partial W_{ji}}=\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial W_{ji}}=\frac{\partial L}{\partial y_j}x_i$$

therefore

\begin{equation*} \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial y_1}x_1 & \frac{\partial L}{\partial y_2}x_1 & \cdots & \frac{\partial L}{\partial y_m}x_1 \\ \frac{\partial L}{\partial y_1}x_2 & \frac{\partial L}{\partial y_2}x_2 & \cdots & \frac{\partial L}{\partial y_m}x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial y_1}x_n & \frac{\partial L}{\partial y_2}x_n & \cdots & \frac{\partial L}{\partial y_m}x_n \\ \end{pmatrix} \end{equation*}
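To make this concrete, I checked it numerically with NumPy, using $f(y)=\tfrac12\|y\|^2$ as a stand-in loss of my own choosing:

```python
import numpy as np

# Stand-in example (my own choice): f(y) = 0.5 * ||y||^2, so dL/dy_j = y_j.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def loss(W):
    y = W @ x
    return 0.5 * np.dot(y, y)

# Matrix laid out as above: entry (i, j) = dL/dy_j * x_i, i.e. the outer product of x and dL/dy.
dL_dy = W @ x
grad_analytic = np.outer(x, dL_dy)          # shape (n, m)

# Central-difference check of each entry dL/dW_{ji}.
eps = 1e-6
grad_numeric = np.zeros((n, m))
for j in range(m):
    for i in range(n):
        Wp = W.copy(); Wp[j, i] += eps
        Wm = W.copy(); Wm[j, i] -= eps
        grad_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```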

Does this even fit the chain rule?

To fit the chain rule $$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial W},$$ $\dfrac{\partial L}{\partial W}$ is an $n\times m$ matrix and $\dfrac{\partial L}{\partial y}$ is a $1\times m$ vector, so how can the shapes possibly match?

PS: I just found that there is an operation called the Kronecker product, and $\dfrac{\partial L}{\partial W}$ can be written as $\dfrac{\partial L}{\partial y}\otimes x$, but this is still beyond me. First, why does the chain rule lead to a Kronecker product? Isn't the chain rule about matrix multiplication?

Second, does this mean $\dfrac{\partial y}{\partial W} = x$? I couldn't find a definition of the derivative of a vector with respect to a matrix on Wikipedia.
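At least numerically the Kronecker form does match the entry-wise matrix above (here $g$ is a random stand-in for $\dfrac{\partial L}{\partial y}$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
x = rng.standard_normal(n)
g = rng.standard_normal(m)                  # stand-in for dL/dy (length m)

# Entry-wise matrix from above: (i, j) entry is (dL/dy_j) * x_i
entrywise = np.outer(x, g)                  # shape (n, m)

# Kronecker product of the 1 x m row dL/dy with the n x 1 column x
kron_form = np.kron(g.reshape(1, m), x.reshape(n, 1))   # also shape (n, m)

print(np.allclose(entrywise, kron_form))    # True
```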

The third and most important question: even if I know the derivative $\dfrac{\partial L}{\partial W}$, how should I update the parameter matrix? We all know gradient descent works because of the directional derivative $$\nabla_v f = \frac{\partial f}{\partial x}\,v,$$ so we should step in the negative gradient direction to lower $f$.

Does an analogue even exist for the derivative with respect to a matrix? Multiplying $\dfrac{\partial L}{\partial W}$ by $\Delta W$ as matrices won't reproduce $\Delta L$ anyway.
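For what it's worth, in a quick NumPy experiment (again with the stand-in loss $f(y)=\tfrac12\|y\|^2$) the entry-wise sum $\sum_{j,k}\frac{\partial L}{\partial W_{jk}}\,\Delta W_{jk}$ does seem to reproduce $\Delta L$ to first order, but I don't know whether that is the right generalization:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
dW = rng.standard_normal((m, n))            # an arbitrary perturbation of W

def loss(W):
    y = W @ x
    return 0.5 * np.dot(y, y)               # stand-in for f(y)

# Gradient matrix with entries dL/dW_{jk} = (dL/dy_j) * x_k
G = np.outer(W @ x, x)

# The entry-wise (Frobenius) pairing of G with dW recovers dL to first order
eps = 1e-6
dL_numeric = (loss(W + eps * dW) - loss(W)) / eps
dL_pairing = np.sum(G * dW)
print(dL_numeric, dL_pairing)               # agree up to O(eps)
```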

  • How about $\frac{\partial L}{\partial W} = \frac{\partial y}{\partial W}\frac{\partial L}{\partial y}=x\frac{\partial L}{\partial y}$, which is the product of an $n\times 1$ and a $1\times m$ matrix and corresponds to the Wikipedia formula. Commented Aug 1, 2022 at 16:40

3 Answers


First, it is important to keep track of which functions you are differentiating. You have an $m$-vector $y$ with component functions $y_i$ ($i=1,\dots,m$), and you are differentiating with respect to an $(m\times n)$-matrix $W$. Therefore, your derivative depends on three separate indices, i.e., it is a third-order tensor $$ \left(\frac{\partial y}{\partial W}\right)_{ijk} = \frac{\partial y_i}{\partial W_{jk}}, \qquad 1\leq i,j\leq m,\quad 1\leq k \leq n. $$ This tells you immediately that the derivative on the left cannot possibly be $x$.

To compute what it is, you should look for the linear mapping satisfying $$ dy_i = \sum_{j,k}\left(\frac{\partial y}{\partial W}\right)_{ijk} dW_{jk}\qquad \mathrm{for\,all\,} 1\leq i\leq m,$$ where $d$ denotes the exterior derivative. This can be accomplished explicitly by differentiating in the following way: $$ dy = d(Wx) = (dW)x = (I\otimes x):dW,$$ where the colon denotes double contraction over the last two indices, $\bigl(T:A\bigr)_i = \sum_{j,k} T_{ijk}A_{jk}$, and we have used that for all $1\leq i \leq m$, $$ \left((dW)x\right)_i = \sum_k dW_{ik}x_k = \sum_{j,k} \delta_{ij}x_k dW_{jk} = \left((I\otimes x):dW\right)_{\,i}\,.$$ Therefore, the derivative you are looking for is simply the tensor product $I\otimes x.$
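As a concrete illustration of this tensor, here is a minimal NumPy sketch (nothing assumed beyond $y=Wx$):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Third-order tensor (dy/dW)_{ijk} = delta_{ij} * x_k, i.e. "I tensor x"
dy_dW = np.einsum('ij,k->ijk', np.eye(m), x)       # shape (m, m, n)

# Since y = W x is linear in W, finite differences recover it exactly (up to rounding)
eps = 1e-6
for j in range(m):
    for k in range(n):
        Wp = W.copy()
        Wp[j, k] += eps
        dy_num = (Wp @ x - W @ x) / eps            # the m-vector dy/dW_{jk}
        assert np.allclose(dy_num, dy_dW[:, j, k], atol=1e-5)
print("all entries match")
```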

To answer your related question about the chain rule: it still works, but you have to be careful. Using your example, in components we have $$\left(f'(y)\,y'(W)\right)_{jk} = \sum_i \frac{\partial f}{\partial y_i} \frac{\partial y_i}{\partial W_{jk}} = \sum_{i}\frac{\partial f}{\partial y_i}\delta_{ij}x_k = \frac{\partial f}{\partial y_j}x_k = \left(\frac{\partial f}{\partial y} \otimes x\right)_{jk},$$ so that the total derivative depends on $m\times n$ functions. I leave it to you to check that this makes sense given that $f$ is a function of $W$.

With this, updating the parameters $W$ through gradient descent is straightforward. For each component, simply compute $$W_{jk} \gets W_{jk} - \eta \frac{\partial f}{\partial W_{jk}} = W_{jk} - \eta \frac{\partial f}{\partial y_j} x_k, $$ where $\eta>0$ is your step-size.
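In vectorized form this update is just an outer product. A minimal NumPy sketch, with $f(y)=\tfrac12\|y\|^2$ standing in as an assumed concrete loss:

```python
import numpy as np

def sgd_step(W, x, grad_f_y, eta=0.05):
    """One gradient step for L = f(Wx): W_{jk} <- W_{jk} - eta * (df/dy_j) * x_k."""
    return W - eta * np.outer(grad_f_y, x)

# Example run with the stand-in f(y) = 0.5 * ||y||^2, for which df/dy = y
rng = np.random.default_rng(4)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
for _ in range(5):
    W = sgd_step(W, x, grad_f_y=W @ x)
    print(0.5 * np.dot(W @ x, W @ x))       # decreases for a small enough eta
```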

  • Thank you very much! I don't even know what an exterior derivative is. Does it belong to standard calculus? So far I only understand it in the same way as a differential.
    – user900476
    Commented Aug 1, 2022 at 21:20
  • Yes, the exterior derivative is the basic operation of exterior calculus. The idea is to extend the differential to more general objects, see e.g. en.wikipedia.org/wiki/Exterior_derivative. For a more detailed treatment, I recommend the book by Bott and Tu.
    – MathIsArt
    Commented Aug 1, 2022 at 21:23
  • I'm sorry, I don't understand the $dy$ part. Why does $(dW)x=(I\otimes x):dW$, and what does the colon mean? Could you please explain it in more detail in the answer?
    – user900476
    Commented Aug 1, 2022 at 21:23
  • As explained in the answer, to extract the derivative we have to write $(dW)x$ as something multiplied by $dW$. Here we simply use that $\delta_{ij}(dW)_{jk} = (dW)_{ik}$. The double colon denotes a double contraction. I will add a note about it to the answer.
    – MathIsArt
    Commented Aug 1, 2022 at 21:42
  • Thank you very much! This answer is currently beyond me because I don't understand tensor multiplication and some other stuff. I will look into the book you recommended.
    – user900476
    Commented Aug 1, 2022 at 22:15

For this computation, a coordinate-based approach is not difficult. You can simply think of $\frac{\partial{y}}{\partial W}$ as being represented by the coordinates $\frac{\partial y_k}{\partial w_{ij}}$ for $k \in [m], i \in [m], j \in [n]$. We have $$y_k = \sum_{l = 1}^{n} w_{kl}x_l,$$ so $$\frac{\partial y_{k}}{\partial w_{ij}} = \delta_{ik}x_j.$$

We can also obtain the same result using a more abstract approach involving the Fréchet derivative. Write $y(W) = Wx$. Then the Fréchet derivative at a point $W \in M(m \times n, \mathbb{R})$ is the linear map $Dy(W) : M(m \times n, \mathbb{R}) \to \mathbb{R}^m$ given by $$Dy(W)H = Hx.$$ In other words, $Dy(W)$ is the linear operator of right multiplication by $x$. Hence the coordinates of $Dy(W)$ with respect to the standard bases for $M(m \times n, \mathbb{R})$ and $\mathbb{R}^m$ are $\frac{\partial{y}}{\partial w_{ij}} = Dy(W)\,e_ie_j^T = e_{i}e_j^Tx = x_je_i$. That is, $\frac{\partial y_{k}}{\partial w_{ij}} = x_{j}\delta_{ik}$, in agreement with what was obtained earlier.
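A minimal numerical sketch of this statement (assuming nothing beyond $y(W)=Wx$): since $y$ is linear in $W$, the increment $y(W+H)-y(W)$ equals $Dy(W)H = Hx$ exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 4
W = rng.standard_normal((m, n))
H = rng.standard_normal((m, n))             # an arbitrary direction in matrix space
x = rng.standard_normal(n)

y = lambda W: W @ x

# Because y(W) = W x is linear in W, the increment y(W + H) - y(W) equals
# the Frechet derivative applied to H, namely H x, with zero remainder.
print(np.allclose(y(W + H) - y(W), H @ x))  # True
```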


$ \def\L{{L}} \def\bbR#1{{\mathbb R}^{#1}} \def\l{\lambda} \def\o{{\tt1}} \def\LR#1{\left(#1\right)} \def\BR#1{\Big(#1\Big)} \def\bR#1{\big(#1\big)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $I'll assume that the gradient of $f(y)$ is known, i.e. $$g = \grad{f}{y} \qiq df = g:dy.$$

Differentiating $y=Wx$ is trivial: $$y = Wx \qiq dy = W\,dx + dW\,x.$$

Substitute this into the expression for $df$: $$df = g:\BR{W\,dx + dW\,x}.$$

Holding $W$ constant $\bR{{\rm i.e.}\;dW=0}$ yields the gradient with respect to $x$ $$df = g:\BR{W\,dx} = W^Tg:dx \qiq \grad{f}{x} = W^Tg,$$ while holding $x$ constant yields the gradient with respect to $W$ $$df = g:\BR{dW\,x} = gx^T:dW \qiq \grad{f}{W} = gx^T.$$


In the above, a colon is used to denote the Frobenius product $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \qquad \{ {\rm Frobenius\;norm} \} \\ }$$ This is also called the double-dot or double contraction product.
When applied to vectors $(n=\o)$ it reduces to the standard dot product.

The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many useful ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A = \LR{A^TC}:B \\ }$$ Consistent use of the Frobenius product avoids silly transposition errors which often occur during matrix calculus calculations and avoids the need for higher-order tensors like $\large\LR{\grad{y}{W}}$ which are required by the chain rule.
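These rearrangement rules are easy to verify numerically. A short NumPy sketch (the shapes here are arbitrary choices), with the Frobenius product implemented as an element-wise sum:

```python
import numpy as np

def frob(A, B):
    """Frobenius product A : B = sum_ij A_ij B_ij = Tr(A^T B)."""
    return np.sum(A * B)

rng = np.random.default_rng(6)
m, p, n = 3, 2, 4
A = rng.standard_normal((m, p))
B = rng.standard_normal((p, n))
C = rng.standard_normal((m, n))

M = A @ B
print(np.isclose(frob(M, M), np.linalg.norm(M, 'fro')**2))   # A:A = ||A||_F^2
print(np.isclose(frob(C, A @ B), frob(C @ B.T, A)))          # C:(AB) = (C B^T):A
print(np.isclose(frob(C, A @ B), frob(A.T @ C, B)))          # C:(AB) = (A^T C):B
```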
