
I'm trying to understand the cross-entropy loss on the iris dataset for binary classification, where $y=1$ denotes that the plant belongs to the Setosa class and $y=0$ denotes that the example belongs to the Non-Setosa class.

Consider a given example with features $x=[1.9, 0.4]$ (a subset of the original attributes) that belongs to the Setosa class, so $y=1$.

Adapting the definition from the Deep Learning textbook, I believe the cross-entropy loss can be written as follows:

$${\displaystyle H(P,Q)=-\sum_{j=1}^2 P(x_j)\, \log Q(x_j)}$$

where $j=1$ denotes that the given plant belongs to the Setosa class and $j=2$ denotes Non-Setosa. $P(x)$ denotes the corresponding ground-truth label probabilities, which in this particular case are 100% for $j=1$ and 0% for $j=2$.

The meaning stated above is illustrated by this table

\begin{array}{l|c|c|c} \text{meaning} & a_j = Q(x_j) & j & y_j=P(x_j) \\ \hline \text{Setosa} & 0.284958 & 1 & 100\% \\ \text{Non-Setosa} & 0.715042 & 2 & 0\% \end{array}

where the computation of $a_j$ is given below.

Consider the example [1.9, 0.4],

I'm using $Q(x)$ to denote the output of a softmax regression model, representing the probability that the model assigns a given input to a particular class; what does $P(x)$ represent in that case?

$$ a_j = {\displaystyle \operatorname{softmax} {(z_j)}={\frac {e^{z_{j}}}{\sum _{k=1}^{2}e^{z_{k}}}}\ \ \ \ {\text{ for }}j=1,2} $$

where

$$z_j=w_j^{\mathsf T}x+b_j \ \ {\text{ for }}j=1,2$$

and $w_j = (w_{1j}, w_{2j})$ is the $j$-th column of the weight matrix $W$.


Assume all the biases $b_j$ are 0.0, and the weight matrix $W$ is initialized as

$$w_{11}=0.1, w_{21}=0.1, w_{12}=0.5, w_{22}=0.5$$

As stated at the beginning, $x_1=1.9, x_2=0.4$, so the confidence that the model predicts the given plant as Setosa is computed as follows:

$$z_1 = w_{11}x_1 + w_{21}x_2 + b_1 = 0.1 \times 1.9 + 0.1 \times 0.4 + 0 = 0.23$$

$$a_1 = \operatorname{softmax}{(z)}_1 = \frac{e^{z_1}}{e^{z_1}+e^{z_2}} = 0.284958$$

The Setosa term of the cross-entropy is then

$$-P(x_1)\, \log{Q(x_1)} = -1.0 \times \log{(0.284958)} \approx 1.2554,$$

taking $\log$ to be the natural logarithm.

The confidence that the model predicts the given feature vector $[1.9, 0.4]$ as Non-Setosa is computed as follows:

$$z_2 = w_{12}x_1 + w_{22}x_2 + b_2 = 0.5 \times 1.9 + 0.5 \times 0.4 + 0 = 1.15$$

$$a_2 = \operatorname{softmax}{(z)}_2 = \frac{e^{z_2}}{e^{z_1}+e^{z_2}} = 0.715042$$

Since the ground-truth probability of Non-Setosa is $P(x_2)=0$, this class contributes nothing to the loss:

$$-P(x_2)\, \log{Q(x_2)} = -0.0 \times \log{(0.715042)} = 0,$$

so the total cross-entropy is $H(P,Q) = -\log{(0.284958)} \approx 1.2554$.

In summary:

\begin{array}{l|c|c|c} \text{meaning} & a_j = Q(x_j) & j & y_j=P(x_j) \\ \hline \text{Setosa} & 0.284958 & 1 & 100\% \\ \text{Non-Setosa} & 0.715042 & 2 & 0\% \end{array}
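As a sanity check on these numbers, here is a short script that reproduces the table and the loss value (the variable names are my own, and I take the log to be the natural log):

```python
import numpy as np

x = np.array([1.9, 0.4])            # features of the given Setosa example
W = np.array([[0.1, 0.5],           # W[i, j] = w_ij, so column j holds (w_1j, w_2j)
              [0.1, 0.5]])
b = np.zeros(2)                     # all biases assumed 0.0

z = W.T @ x + b                     # scores: [0.23, 1.15]
a = np.exp(z) / np.exp(z).sum()     # softmax: [0.284958, 0.715042]

p = np.array([1.0, 0.0])            # ground-truth distribution (Setosa)
H = -(p * np.log(a)).sum()          # cross-entropy ≈ 1.2554
print(z, a, H)
```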

I've managed to calculate the derivative of $a_j$ with respect to $z_j$

$$\frac{da_j}{dz_j} = a_j(1-a_j) \ \ \text{for } j=1,2$$
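(As a quick numerical check of this expression, using the $z$ values from above; the helper name `softmax` and the finite-difference step are my own choices:)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # shift by max(z) only for numerical stability
    return e / e.sum()

z = np.array([0.23, 1.15])
a = softmax(z)

eps = 1e-6
a_eps = softmax(z + np.array([eps, 0.0]))   # perturb z_1 only
print((a_eps[0] - a[0]) / eps)              # numerical  da_1/dz_1
print(a[0] * (1 - a[0]))                    # analytic   a_1(1 - a_1) ≈ 0.2038
# Perturbing z_2 instead gives the cross term da_1/dz_2 = -a_1*a_2,
# since both outputs share the same softmax denominator.
```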

How do I compute the derivative of the loss $H(P,Q)$ with respect to the weights $W$ so that I can use the gradient descent algorithm to update the weights?

Consider $j=1$; I think the derivative of $H(P,Q)$ with respect to $W_{i,1}$ would be

\begin{align} \frac{\partial H}{\partial W_{i,1}} &=-\sum_{j=1}^2 \frac{P_j(x)}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i \\ &=-\frac{1}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i (P_1(x)+ P_2(x)) \end{align}

Is my understanding correct?

  • Here are some notes I wrote about computing the gradient for multiclass logistic regression. Those notes are my attempt to make the calculation as clean and clear as possible, making good use of vector and matrix notation and the multivariable chain rule.
    – littleO
    Commented Aug 2, 2021 at 7:07
  • This related question might help.
    – greg
    Commented Jan 25, 2023 at 17:47

1 Answer


$$H = -\sum_{j=1}^2 P_j(x) \log Q_j(x)=-\sum_{k=1}^2 P_k(x) \log Q_k(x)$$

(I have reindexed the sum with $k$ so that it does not clash with the $j$ in $W_{i,j}$ below.)

The trick is to apply the chain rule, keeping in mind that every output $Q_k(x)=a_k(x)$ depends on $z_j(x)$ through the shared softmax denominator, so none of the terms in the sum can be dropped. Using the softmax Jacobian $\frac{\partial a_k}{\partial z_j}=a_k(\delta_{kj}-a_j)$, the fact that only $z_j$ depends on $W_{i,j}$, and $\frac{\partial z_j}{\partial W_{i,j}}=x_i$:

\begin{align} \frac{\partial H}{\partial W_{i,j}} &= -\frac{\partial}{\partial W_{i,j}} \left(\sum_{k=1}^2 P_k(x) \log Q_k(x) \right) \\ &= -\sum_{k=1}^2 \frac{P_k(x)}{Q_k(x)} \cdot \frac{\partial Q_k(x)}{\partial W_{i,j}} \\ &= -\sum_{k=1}^2 \frac{P_k(x)}{a_k(x)} \cdot \frac{\partial a_k(x)}{\partial z_j(x)} \cdot \frac{\partial z_j(x)}{\partial W_{i,j}} \\ &= -\sum_{k=1}^2 P_k(x)\,\bigl(\delta_{kj}-a_j(x)\bigr)\, x_i \\ &= \Bigl(a_j(x)\sum_{k=1}^2 P_k(x) - P_j(x)\Bigr)\, x_i \\ &= \bigl(a_j(x)-P_j(x)\bigr)\, x_i \end{align}

where $\delta_{kj}=1$ if $k=j$ and $0$ otherwise, and $\sum_k P_k(x)=1$. For your one-hot labels ($P_1=1$, $P_2=0$) this gives $\bigl(a_1(x)-1\bigr)\,x_i$ for the Setosa column of $W$ and $a_2(x)\,x_i$ for the Non-Setosa column.

Now that we have the gradient, you can perform gradient descent.
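If it is useful, here is a rough sketch of one gradient-descent step based on the formula above, together with a finite-difference check of one gradient entry (the variable names and the learning rate are arbitrary choices of mine, not something from the question):

```python
import numpy as np

def loss(W, b, x, p):
    """Cross-entropy of softmax(W^T x + b) against the target distribution p."""
    z = W.T @ x + b
    e = np.exp(z - z.max())                  # shift for numerical stability
    return -(p * np.log(e / e.sum())).sum()

x = np.array([1.9, 0.4])
p = np.array([1.0, 0.0])                     # one-hot target: Setosa
W = np.array([[0.1, 0.5], [0.1, 0.5]])       # W[i, j] = w_ij
b = np.zeros(2)

z = W.T @ x + b
a = np.exp(z) / np.exp(z).sum()

grad_W = np.outer(x, a - p)                  # dH/dW[i, j] = (a_j - p_j) * x_i
grad_b = a - p                               # dH/db[j]    =  a_j - p_j

# Finite-difference check of one entry of the gradient.
eps = 1e-6
W_eps = W.copy()
W_eps[0, 0] += eps
print((loss(W_eps, b, x, p) - loss(W, b, x, p)) / eps, grad_W[0, 0])

# One gradient-descent step with an arbitrary learning rate.
lr = 0.1
W = W - lr * grad_W
b = b - lr * grad_b
```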

  • Thank you so much. Would you please explain a little how to get $ \sum_{j=1}^2 \frac{P(x_j)}{a_j}\cdot \frac{\partial a_j}{\partial W_{i,j}} $ from $\sum_{j=1}^2 \frac{P(x_j)}{Q(x_j)} \cdot \frac{\partial Q(x_j)}{\partial W_{i,j}}$?
    – JakeMZ
    Commented Aug 2, 2021 at 3:59
  • $a_j = Q(x_j)$, as you have stated in the first table. Commented Aug 2, 2021 at 4:14
  • You're so nice! Thank you! It seems the $j$ in $W_{i,j}$ conflicts a little bit with the one in $\sum_{j=1}^2$, doesn't it?
    – JakeMZ
    Commented Aug 2, 2021 at 4:31
  • True, but may I know where $a_j = Q(x_j)$ comes from? It seems that $j$ has been used with two different meanings. Commented Aug 2, 2021 at 6:21
  • Consider $j=1$; the derivative of $H(P,Q)$ is \begin{align} \frac{\partial H}{\partial W_{i,1}} =-\sum_{j=1}^2 \frac{P(x_j)}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i \end{align} Is my understanding correct?
    – JakeMZ
    Commented Aug 2, 2021 at 6:51
