Consider the feature space $\mathcal{X}=\mathbb R^{d}$ and label set $\mathcal{Y}=\{1,...,c\}$ with $c > 2$. We consider an activation function $\alpha: \mathbb R^{c} \to \mathbb R^{c}$, a weight matrix $W\in \mathbb R^{c\times d}$, and a bias vector $b \in \mathbb R ^{c}$. Given a particular loss function $L$ on $\mathbb R^{c}\times\mathbb R^{c}$ (e.g. the quadratic loss), we one-hot encode each label $y \in \mathcal{Y}$ via the operation $\tilde{\cdot}$, so that $\tilde{y}= \hat{e}_{y}$, i.e. the vector with a one in position $y\in \{1,...,c\}$ and zeros elsewhere, in order to allow it as an argument of the loss function.

Question (keeping the situation as general as possible, i.e. with no specific loss function):

How do we compute $\frac{\partial \hat{L}}{\partial b_{j}}(b,W)$ and $\frac{\partial \hat{L}}{\partial W_{jk}}(b,W)$ for $j\in \{1,...,c\}$ and $k\in \{1,...,d\}$, where

$ \hat{L}(b,W) =\frac{1}{N}\sum\limits_{i=1}^{N}L(\tilde{y}^{(i)},\alpha(W\cdot x^{(i)}+b))$
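For concreteness, here is a minimal NumPy sketch of how I am thinking of the forward pass and the empirical loss $\hat L$. The dimensions, the componentwise sigmoid used for $\alpha$, and the quadratic loss are placeholder choices for illustration only, not part of the general question:

```python
import numpy as np

# Placeholder dimensions (assumptions for illustration only)
d, c, N = 5, 3, 10                      # features, classes, samples

rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))             # rows are the samples x^(i)
y = rng.integers(0, c, size=N)          # labels in {0, ..., c-1}
W = rng.normal(size=(c, d))             # weight matrix
b = rng.normal(size=c)                  # bias vector

Y_onehot = np.eye(c)[y]                 # rows are tilde{y}^(i), shape (N, c)

def alpha(a):
    """Placeholder activation R^c -> R^c (componentwise sigmoid)."""
    return 1.0 / (1.0 + np.exp(-a))

def L(y_tilde, z):
    """Placeholder loss on R^c x R^c (quadratic loss)."""
    return 0.5 * np.sum((z - y_tilde) ** 2)

def empirical_loss(b, W):
    """hat{L}(b, W) = (1/N) * sum_i L(tilde{y}^(i), alpha(W x^(i) + b))."""
    A = X @ W.T + b                     # pre-activations, shape (N, c)
    Z = alpha(A)                        # outputs, shape (N, c)
    return np.mean([L(Y_onehot[i], Z[i]) for i in range(N)])
```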

Initially I thought of doing something like:

$\frac{\partial \hat{L}}{\partial b_{j}}(b,W)=\frac{1}{N}\sum\limits_{i=1}^{N}\left.\frac{\partial L}{\partial z}(\tilde{y}^{(i)},z)\right\rvert_{z=\alpha(W\cdot x^{(i)}+b)}\,\alpha^{\prime}(W\cdot x^{(i)}+b)\cdot \hat{e}_{j}$

and

$\frac{\partial \hat{L}}{\partial W_{jk}}(b,W)=\frac{1}{N}\sum\limits_{i=1}^{N}\left.\frac{\partial L}{\partial z}(\tilde{y}^{(i)},z)\right\rvert_{z=\alpha(W\cdot x^{(i)}+b)}\,\alpha^{\prime}(W\cdot x^{(i)}+b)\cdot \hat{e}_{j}\,x^{(i)}$

But I can clearly see that the dimensions in my differentiation are not working out. Any ideas?
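To at least pin down what the shapes and values of the correct derivatives have to be, a finite-difference check against $\hat L$ directly seems useful. This is only a sketch reusing the placeholder `empirical_loss`, `b`, and `W` from the code above, and the step size `eps` is arbitrary:

```python
def numerical_grads(b, W, eps=1e-6):
    """Finite-difference approximations of d hat{L}/d b_j and d hat{L}/d W_jk."""
    grad_b = np.zeros_like(b)
    for j in range(b.size):
        e = np.zeros_like(b)
        e[j] = eps
        grad_b[j] = (empirical_loss(b + e, W) - empirical_loss(b - e, W)) / (2 * eps)

    grad_W = np.zeros_like(W)
    for j in range(W.shape[0]):
        for k in range(W.shape[1]):
            E = np.zeros_like(W)
            E[j, k] = eps
            grad_W[j, k] = (empirical_loss(b, W + E) - empirical_loss(b, W - E)) / (2 * eps)

    return grad_b, grad_W                # shapes (c,) and (c, d), matching b and W

grad_b, grad_W = numerical_grads(b, W)
print(grad_b.shape, grad_W.shape)        # any analytic formula has to match these
```

Whatever the correct analytic expressions are, they should agree with these numerical gradients, so this is the target my formulas above fail to match dimensionally.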
