In machine learning, I'm optimizing a parameter matrix $W$.
The loss function is$$L=f(y),$$where $L$ is a scalar, $y=Wx$, $x\in \mathbb{R}^n$, $y\in \mathbb{R}^m$, and $W$ is an $m\times n$ matrix.
Math textbooks usually write$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial x}=\frac{\partial L}{\partial y}W,$$where $\dfrac{\partial L}{\partial y}$ is a $1\times m$ row vector. This is easy to understand.
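To convince myself of this identity I ran a quick numerical check in NumPy, picking $f(y)=\lVert y\rVert^2$ as a stand-in loss (the concrete $f$ is my choice, not given above):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

f = lambda y: np.sum(y**2)      # stand-in loss, so dL/dy = 2*y
L = lambda v: f(W @ v)          # loss as a function of the input x

dL_dy = 2 * (W @ x)             # analytic dL/dy, treated as a 1 x m row
analytic = dL_dy @ W            # (dL/dy) W, a 1 x n row

# Central finite differences in each coordinate of x
eps = 1e-6
numeric = np.array([
    (L(x + eps * np.eye(n)[i]) - L(x - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])

assert np.allclose(analytic, numeric, atol=1e-5)
```

The finite-difference gradient matches $(\partial L/\partial y)\,W$ to numerical precision.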
However, in machine learning $x$ is the input and $W$ is the parameter matrix to optimize, so we should have$$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial W}.$$But what is $\dfrac{\partial y}{\partial W}$? Is it just $x$? Is that correct?
According to Wikipedia, the derivative of a scalar with respect to a matrix is a matrix:
\begin{equation*} \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial W_{11}} & \frac{\partial L}{\partial W_{21}} & \cdots & \frac{\partial L}{\partial W_{m1}} \\ \frac{\partial L}{\partial W_{12}} & \frac{\partial L}{\partial W_{22}} & \cdots & \frac{\partial L}{\partial W_{m2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial W_{1n}} & \frac{\partial L}{\partial W_{2n}} & \cdots & \frac{\partial L}{\partial W_{mn}} \end{pmatrix} \end{equation*}
where $$\frac{\partial L}{\partial W_{ji}}=\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial W_{ji}}=\frac{\partial L}{\partial y_j}x_i$$
therefore
\begin{equation*} \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial y_1}x_1 & \frac{\partial L}{\partial y_2}x_1 & \cdots & \frac{\partial L}{\partial y_m}x_1 \\ \frac{\partial L}{\partial y_1}x_2 & \frac{\partial L}{\partial y_2}x_2 & \cdots & \frac{\partial L}{\partial y_m}x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial y_1}x_n & \frac{\partial L}{\partial y_2}x_n & \cdots & \frac{\partial L}{\partial y_m}x_n \\ \end{pmatrix} \end{equation*}
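I verified this entrywise formula numerically as well (again with $f(y)=\lVert y\rVert^2$ as a stand-in loss, my choice): the matrix above is exactly the outer product of $x$ with the row vector $\dfrac{\partial L}{\partial y}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

f = lambda y: np.sum(y**2)      # stand-in loss, so dL/dy = 2*y
L = lambda M: f(M @ x)          # loss as a function of the parameters W

dL_dy = 2 * (W @ x)
grad = np.outer(x, dL_dy)       # n x m, entry (i, j) = x_i * dL/dy_j

# Central finite differences: numeric[i, j] approximates dL/dW_{ji},
# matching the transposed layout of the matrix above
eps = 1e-6
numeric = np.zeros((n, m))
for j in range(m):
    for i in range(n):
        E = np.zeros((m, n))
        E[j, i] = eps
        numeric[i, j] = (L(W + E) - L(W - E)) / (2 * eps)

assert np.allclose(grad, numeric, atol=1e-5)
```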
Does this even fit the chain rule?
To fit the chain rule $$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial W},$$ $\dfrac{\partial L}{\partial W}$ is an $n\times m$ matrix while $\dfrac{\partial L}{\partial y}$ is a $1\times m$ row vector. How can the dimensions possibly match?
PS: I just found out there is an operation called the Kronecker product, and $\dfrac{\partial L}{\partial W}$ can be written as $\dfrac{\partial L}{\partial y}\otimes x$, but this is still beyond me. First, why does the chain rule lead to a Kronecker product? Isn't the chain rule about matrix multiplication?
Second, does this mean $\dfrac{\partial y}{\partial W} = x$? I didn't find a definition of the derivative of a vector with respect to a matrix on Wikipedia.
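At least numerically, the Kronecker-product claim checks out: for a $1\times m$ row and an $n\times 1$ column, the Kronecker product coincides with the outer product $x\,\dfrac{\partial L}{\partial y}$ from above (the vectors here are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
x = rng.standard_normal(n)
dL_dy = rng.standard_normal(m)  # placeholder for an arbitrary dL/dy row

# Kronecker product of the 1 x m row with the n x 1 column
K = np.kron(dL_dy.reshape(1, m), x.reshape(n, 1))   # shape (n, m)

# Entry (i, j) is dL/dy_j * x_i, i.e. the same matrix as the outer product
assert np.allclose(K, np.outer(x, dL_dy))
```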
The third and most important question: even if I know the derivative $\dfrac{\partial L}{\partial W}$, how should I update my parameter matrix? We all know gradient descent works because of the directional derivative $$\nabla_v f = \frac{\partial f}{\partial v}v,$$ so we should move in the negative gradient direction to lower $f$.
Does an analogue of this even exist for the derivative with respect to a matrix? I mean, multiplying $\dfrac{\partial L}{\partial W}$ by $\Delta W$ as an ordinary matrix product won't reproduce $\Delta L$ anyway.
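For what it's worth, I experimented numerically (same stand-in loss $f(y)=\lVert y\rVert^2$ as before): pairing the gradient with $\Delta W$ entrywise, i.e. $\operatorname{tr}\!\big(\frac{\partial L}{\partial W}\,\Delta W\big)$ in the $n\times m$ layout above, does seem to reproduce the first-order change in $L$, and stepping against the gradient does lower the loss. I'd still like to understand why.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

f = lambda y: np.sum(y**2)      # stand-in loss, so dL/dy = 2*y
L = lambda M: f(M @ x)

dL_dy = 2 * (W @ x)
grad = np.outer(x, dL_dy)       # n x m, the layout used above

# First-order change: trace(grad @ dW) = sum_{j,i} dL/dW_{ji} * dW_{ji}
dW = 1e-4 * rng.standard_normal((m, n))
predicted = np.trace(grad @ dW)
actual = L(W + dW) - L(W)
assert np.isclose(predicted, actual, rtol=1e-2, atol=1e-6)

# Gradient step (transposed back to W's m x n shape) lowers the loss
lr = 0.01
assert L(W - lr * grad.T) < L(W)
```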