
I am dealing with the problem of finding the gradient of the cross-entropy loss function w.r.t. the parameter $\theta$, where:

$CE(\theta) = -\sum\nolimits_{i}{y_i*log({\hat{y}_{i}})}$

where $\hat{y}_{i} = softmax(\theta_i)$ and $\theta_i$ is a vector input.

Also, $y$ is a one-hot vector for the correct class and $\hat{y}$ is the prediction for each class from the softmax function.

Hence, for example, let's have $y_i = \begin{pmatrix}0\\0\\0\\1\\0\end{pmatrix}$ and $\hat{y}_{i} = \begin{pmatrix}0.10\\0.20\\0.10\\0.40\\0.20\end{pmatrix}$
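For concreteness, plugging these numbers into the definition above, the loss for this single example is just the negative log-probability assigned to the correct class:

$CE(\theta) = -1 \cdot log(0.40) \approx 0.916$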

Taking the partial derivative, $\frac{\partial{CE(\theta)}}{\partial{\theta_{ik}}} = -(y_{ik} - \hat{y}_{ik}) = \hat{y}_{ik} - y_{ik}$

Taking it from there, for each $i$ the individual gradient vector will be $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = \begin{pmatrix}\hat{y}_{i1} - y_{i1}\\\hat{y}_{i2} - y_{i2}\\\hat{y}_{i3} - y_{i3}\\\hat{y}_{i4} - y_{i4}\\\hat{y}_{i5} - y_{i5}\end{pmatrix}$

But this is not true, because the gradient should actually be 0 in every row except the 4th, since we have used the property of the one-hot vector. So the actual gradient should be $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = \begin{pmatrix}0\\0\\0\\\hat{y}_{i4} - y_{i4}\\0\end{pmatrix}$

And hence the gradients for all $i$ should be $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \left( \begin{array}{ccccc} 0 & 0 & 0 & \hat{y}_{i4} - y_{i4} & 0 \\ 0 & 0 & \hat{y}_{i3} - y_{i3} & 0 & 0 \\ & & \vdots & & \\ 0 & \hat{y}_{i2} - y_{i2} & 0 & 0 & 0 \end{array} \right)$

But this is not equal to $\hat{y} - y$. So we should not call the gradient of the cross entropy function a vector difference between predicted and original.

Can someone clarify this?

UPDATE: Fixed my derivation

$\theta = \left( \begin{array}{c} \theta_{1} \\ \theta_{2} \\ \theta_{3} \\ \theta_{4} \\ \theta_{5} \\ \end{array} \right)$

$CE(\theta) = -\sum\nolimits_{i}{y_i*log({\hat{y}_{i}})}$

where $\hat{y} = softmax(\theta)$ and $\theta$ is the vector input.

Also, $y$ is a one-hot vector for the correct class and $\hat{y}$ is the prediction for each class from the softmax function.

Since $y$ is one-hot with $y_k = 1$ for the correct class $k$, the loss collapses to $CE(\theta) = - log({\hat{y}_{k}})$

UPDATE: Removed the index from $y$ and $\hat{y}$. Hence, for example, let's have $y = \begin{pmatrix}0\\0\\0\\1\\0\end{pmatrix}$ and $\hat{y} = \begin{pmatrix}0.10\\0.20\\0.10\\0.40\\0.20\end{pmatrix}$

UPDATE: Fixed. I was taking the derivative w.r.t. $\theta_{ik}$; it should only be w.r.t. $\theta_{i}$. The partial derivative is $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = -(y_{i} - \hat{y}_{i}) = \hat{y}_{i} - y_{i}$

Taking it from there and stacking the components for each $i$, the gradient will be $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \begin{pmatrix}\hat{y}_{1} - y_{1}\\\hat{y}_{2} - y_{2}\\\hat{y}_{3} - y_{3}\\\hat{y}_{4} - y_{4}\\\hat{y}_{5} - y_{5}\end{pmatrix} = \hat{y} - y$

The above happens because $CE(\theta) = -(y_k*log({\hat{y}_{k}})) = -log({\hat{y}_{k}})$, and $log(\hat{y}_{k}) = log(softmax(\theta)_k) = \theta_k - log(\sum\nolimits_{j}{exp(\theta_j)})$. Taking the partial derivative of $CE(\theta)$ w.r.t. $\theta_i$ we get:

$\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = - \left(\frac{\partial{\theta_k}}{\partial{\theta_{i}}} - softmax(\theta)_i\right) = \hat{y}_{i} - \frac{\partial{\theta_k}}{\partial{\theta_{i}}}$

MAIN STEP: The fact that $\frac{\partial{\theta_k}}{\partial{\theta_{i}}} = 0$ for $i \neq k$ and $\frac{\partial{\theta_k}}{\partial{\theta_{i}}} = 1$ for $i = k$, i.e. $\frac{\partial{\theta_k}}{\partial{\theta_{i}}} = y_i$, makes the vector $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \hat{y} - y$, which completes the proof.
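A quick numerical sanity check of this result (a minimal sketch in NumPy; the particular `theta` values below are an arbitrary assumption, not taken from the question):

```python
import numpy as np

def softmax(theta):
    # shift by the max for numerical stability
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def cross_entropy(theta, y):
    # CE(theta) = -sum_i y_i * log(softmax(theta)_i)
    return -np.sum(y * np.log(softmax(theta)))

theta = np.array([0.5, -1.0, 2.0, 0.1, 1.3])   # arbitrary logits (assumption)
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0])        # one-hot target, correct class k = 4

analytic = softmax(theta) - y                  # the claimed gradient, y_hat - y

# central finite differences, component by component
eps = 1e-6
numeric = np.array([
    (cross_entropy(theta + eps * np.eye(5)[i], y)
     - cross_entropy(theta - eps * np.eye(5)[i], y)) / (2 * eps)
    for i in range(5)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```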


2 Answers


No, the gradients should not be zero for the other components. If your prediction is $\hat y_{ij}$ for some $i,j$ and your observation $y_{ij}=0$, then you predicted too much by $\hat y_{ij}$.
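To make this concrete with the example vectors from the question (a minimal sketch; `y_hat` is simply the given softmax output, not recomputed from any $\theta$):

```python
import numpy as np

y     = np.array([0.0, 0.0, 0.0, 1.0, 0.0])       # one-hot target
y_hat = np.array([0.10, 0.20, 0.10, 0.40, 0.20])  # the given softmax output

grad = y_hat - y
print(grad)   # -> [ 0.1  0.2  0.1 -0.6  0.2]
# Every non-target component of the gradient equals the probability mass
# assigned to that (wrong) class, so none of them is zero.
```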

  • But $\hat{y}_{ij}$ will always be a softmax value and $y_{ij}$ the actual observation. And because we use the fact that $y_i$ is a one-hot vector, the partial derivative $\frac{\partial{CE(\theta)}}{\partial{\theta_{ij}}} = 0, \forall j \neq k$, given $y_{ik} = 1$. Am I making an error in the differentiation? Commented May 1, 2015 at 0:41
  • Thanks for your input @neil-g, I was able to correct my derivation of the gradient. Commented May 1, 2015 at 1:48

The following is the same content as the edit, but in (for me) slightly clearer step-by-step format:

We are trying to prove that:

$\frac{\partial{CE}}{\partial{\theta}} = \hat{y} - y$

given

$CE(\theta) = -\sum\nolimits_{i}{y_i*log({\hat{y}_{i}})}$

and

$\hat{y}_{i} = \frac{exp(\theta_i)}{\sum\nolimits_{j}{exp(\theta_j)}}$

We know that $y_{j} = 0$ for $j \neq k$ and $y_k = 1$, so:

$CE(\theta) = -\ log({\hat{y}_{k}})$

$= - \ log(\frac{exp(\theta_k)}{\sum\nolimits_{j}{exp(\theta_j)}})$

$ = - \ \theta_k + log(\sum\nolimits_{j}{exp(\theta_j)}) $

$\frac{\partial{CE}}{\partial{\theta}} = - \frac{\partial{\theta_k}}{\partial{\theta}} + \frac{\partial}{\partial{\theta}} log(\sum\nolimits_{j}{exp(\theta_j))}$

Use the fact that $ \frac{\partial{\theta_k}}{\partial{\theta_k}} = 1 $ and $ \frac{\partial{\theta_k}}{\partial{\theta_q}} = 0 $ for $q \neq k$ to show that:

$ \frac{\partial{\theta_k}}{\partial{\theta}} = y $

For the second part we write out the derivative for each individual element of $\theta$ and use the chain rule to get:

$\frac{\partial}{\partial{\theta_i}} log(\sum\nolimits_{j}{exp(\theta_j))} = \frac{exp(\theta_i)}{\sum\nolimits_{j}{exp(\theta_j)}} = \hat{y}_{i}$

Hence,

$\frac{\partial{CE}}{\partial{\theta}} = \frac{\partial}{\partial{\theta}} log(\sum\nolimits_{j}{exp(\theta_j)}) - \frac{\partial{\theta_k}}{\partial{\theta}} = \hat{y} - y$
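The identity used for the second term, $\frac{\partial}{\partial{\theta_i}} log(\sum\nolimits_{j}{exp(\theta_j)}) = \hat{y}_{i}$, is also easy to confirm numerically (a minimal sketch; the `theta` values are an arbitrary choice):

```python
import numpy as np

def logsumexp(theta):
    # numerically stable log(sum_j exp(theta_j))
    m = np.max(theta)
    return m + np.log(np.sum(np.exp(theta - m)))

theta = np.array([0.5, -1.0, 2.0, 0.1, 1.3])   # arbitrary example (assumption)
y_hat = np.exp(theta) / np.sum(np.exp(theta))  # softmax(theta)

# central finite differences of logsumexp in each coordinate
eps = 1e-6
numeric = np.array([
    (logsumexp(theta + eps * np.eye(5)[i])
     - logsumexp(theta - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

print(np.allclose(y_hat, numeric, atol=1e-6))   # expected: True
```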

  • Very helpful for me. – Imran Q, Commented Jan 17, 2022 at 3:06
