I am trying to find the gradient of the cross-entropy loss function w.r.t. the parameter $\theta$, where:
$CE(\theta) = -\sum\nolimits_{i}{y_i \log({\hat{y}_{i}})}$
where $\hat{y}_{i} = \mathrm{softmax}(\theta_i)$ and $\theta_i$ is a vector input.
Here, $y$ is a one-hot vector for the correct class and $\hat{y}$ is the predicted probability of each class from the softmax function.
For example, let $y_i = \begin{pmatrix}0\\0\\0\\1\\0\end{pmatrix}$ and $\hat{y}_{i} = \begin{pmatrix}0.10\\0.20\\0.10\\0.40\\0.20\end{pmatrix}$.
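As a quick numeric sanity check on these values (a NumPy sketch, not part of the derivation), the loss for this single example reduces to $-\log(0.40)$:

```python
import numpy as np

y = np.array([0., 0., 0., 1., 0.])                 # one-hot label, correct class at index 4
y_hat = np.array([0.10, 0.20, 0.10, 0.40, 0.20])   # softmax output

ce = -np.sum(y * np.log(y_hat))                    # cross-entropy for one example
print(ce)                                          # ~0.916, i.e. -log(0.40)
```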
The partial derivative works out to $\frac{\partial{CE(\theta)}}{\partial{\theta_{ik}}} = -(y_{ik} - \hat{y}_{ik}) = \hat{y}_{ik} - y_{ik}$.
Taking it from there, for each $i$ the individual partial gradients will be $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = \begin{pmatrix}\hat{y}_{i1} - y_{i1}\\\hat{y}_{i2} - y_{i2}\\\hat{y}_{i3} - y_{i3}\\\hat{y}_{i4} - y_{i4}\\\hat{y}_{i5} - y_{i5}\end{pmatrix}$
But this is not true, because the gradient should actually be 0 for every row except the 4th, since we have used the one-hot property of $y$. So the actual gradient should be $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = \begin{pmatrix}0\\0\\0\\\hat{y}_{i4} - y_{i4}\\0\end{pmatrix}$
And hence, stacking over all $i$, the gradient should be $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \left( \begin{array}{ccccc} 0 & 0 & 0 & \hat{y}_{i4} - y_{i4} & 0 \\ 0 & 0 & \hat{y}_{i3} - y_{i3} & 0 & 0 \\ ... \\ 0 & \hat{y}_{i2} - y_{i2} & 0 & 0 & 0 \end{array} \right)$
But this is not equal to $\hat{y} - y$, so we should not call the gradient of the cross-entropy loss the vector difference between the predicted and the true distribution.
Can someone clarify this?
UPDATE: Fixed my derivation
$\theta = \left( \begin{array}{c} \theta_{1} \\ \theta_{2} \\ \theta_{3} \\ \theta_{4} \\ \theta_{5} \\ \end{array} \right)$
$CE(\theta) = -\sum\nolimits_{i}{y_i \log({\hat{y}_{i}})}$
where $\hat{y}_{i} = \mathrm{softmax}(\theta_i)$ and $\theta_i$ is a vector input.
Here, $y$ is a one-hot vector for the correct class and $\hat{y}$ is the predicted probability of each class from the softmax function.
Since $y$ is one-hot with a $1$ at the index $k$ of the correct class, the loss reduces to $CE(\theta) = -\log(\hat{y}_{k})$.
UPDATE: Removed the index from $y$ and $\hat{y}$. For example, let $y = \begin{pmatrix}0\\0\\0\\1\\0\end{pmatrix}$ and $\hat{y} = \begin{pmatrix}0.10\\0.20\\0.10\\0.40\\0.20\end{pmatrix}$
UPDATE: Fixed. I was taking the derivative w.r.t. $\theta_{ik}$; it should be w.r.t. $\theta_{i}$ only. The partial derivative is $\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = -(y_{i} - \hat{y}_{i}) = \hat{y}_{i} - y_{i}$
Stacking these components for each $i$, the gradient is $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \begin{pmatrix}\hat{y}_{1} - y_{1}\\\hat{y}_{2} - y_{2}\\\hat{y}_{3} - y_{3}\\\hat{y}_{4} - y_{4}\\\hat{y}_{5} - y_{5}\end{pmatrix} = \hat{y} - y$
The above happens because $CE(\theta) = -y_k \log({\hat{y}_{k}}) = -\log({\hat{y}_{k}})$, and $\log(\hat{y}_{k}) = \log(\mathrm{softmax}(\theta)_k) = \theta_k - \log(\sum\nolimits_{j}{\exp(\theta_j)})$. Taking the partial derivative of $CE(\theta)$ w.r.t. $\theta_i$ we get:
$\frac{\partial{CE(\theta)}}{\partial{\theta_{i}}} = - \left(\frac{\partial{\theta_k}}{\partial{\theta_{i}}} - \frac{\exp(\theta_i)}{\sum\nolimits_{j}{\exp(\theta_j)}}\right) = - \left(\frac{\partial{\theta_k}}{\partial{\theta_{i}}} - \hat{y}_{i}\right)$
MAIN STEP: The fact that $\frac{\partial{\theta_k}}{\partial{\theta_{i}}} = 0$ for $i \neq k$ and $\frac{\partial{\theta_k}}{\partial{\theta_{i}}} = 1$ for $i = k$, i.e. $\frac{\partial{\theta_k}}{\partial{\theta_{i}}} = y_i$, makes the vector $\frac{\partial{CE(\theta)}}{\partial{\theta}} = \hat{y} - y$, which completes the proof.
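To double-check the result numerically (a small NumPy sketch, assuming a single 5-class example with the correct class at index 4), compare the claimed gradient $\hat{y} - y$ against central finite differences of the loss:

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())     # shift by the max for numerical stability
    return e / e.sum()

def ce(theta, y):
    return -np.sum(y * np.log(softmax(theta)))

rng = np.random.default_rng(0)
theta = rng.normal(size=5)              # arbitrary logits
y = np.zeros(5)
y[3] = 1.0                              # one-hot label, correct class k = 4

analytic = softmax(theta) - y           # the claimed gradient, y_hat - y

eps = 1e-6
numeric = np.array([
    (ce(theta + eps * np.eye(5)[i], y) - ce(theta - eps * np.eye(5)[i], y)) / (2 * eps)
    for i in range(5)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True: gradient matches y_hat - y
```

The check also makes the original confusion concrete: every component of the numeric gradient is nonzero, not just the one at the correct class, because $\hat{y}_i$ appears in every component.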