
I'm trying to understand the cross-entropy loss on the iris dataset for binary classification, where $y=1$ denotes that the plant belongs to the Setosa class and $y=0$ denotes that the example belongs to the Non-Setosa class.

Consider a given example with features $x=[1.9, 0.4]$ (a subset of the original attributes) that belongs to the Setosa class, so $y=1$.

Adapting the definition from the Deep Learning textbook, I believe the cross-entropy loss can be written as follows:

$${\displaystyle H(P,Q)=-\sum_{j=1}^2 P(x_j)\, \log Q(x_j)}$$

where $j=1$ denotes that the given plant belongs to the Setosa class and $j=2$ denotes Non-Setosa. $P(x)$ denotes the corresponding ground-truth label probabilities, which in this particular case are 100% for $j=1$ and 0% for $j=2$.

The meaning stated above is illustrated by this table

\begin{array}{l|c|c|c} \text{meaning} & a_j = Q(x_j) & j & y_j=P(x_j) \\ \hline \text{Setosa} & 0.284958 & 1 & 100\% \\ \text{Non-Setosa} & 0.715042 & 2 & 0\% \end{array}

where the computation of $a_j$ is given below.

Consider the example [1.9, 0.4],

I'm using $Q(x)$ to denote the output of a softmax regression model, representing the probability that the model assigns a given input to a particular class; what does $P(x)$ represent in that case?

$$ a_j = {\displaystyle \operatorname{softmax} {(z_j)}={\frac {e^{z_{j}}}{\sum _{k=1}^{2}e^{z_{k}}}}\ \ \ \ {\text{ for }}j=1,2} $$

where

$$z_j=w_j^{\mathsf T}x+b_j \ \ {\text{ for }}j=1,2$$

and $w_j = (w_{1j}, w_{2j})$ is the $j$-th column of the weight matrix $W$.


Assume all the biases $b_j$ are 0.0, and the weight matrix $W$ is initialized as

$$w_{11}=0.1, w_{21}=0.1, w_{12}=0.5, w_{22}=0.5$$

As stated at the beginning, $x_1=1.9, x_2=0.4$, so the confidence that the model predicts the given plant as Setosa is computed as follows:

$$z_1 = w_{11}x_1 + w_{21}x_2 + b_1 = 0.1 \times 1.9 + 0.1 \times 0.4 + 0 = 0.23$$

$$a_1 = \operatorname{softmax}{(z)}_1 = \frac{e^{z_1}}{e^{z_1}+e^{z_2}} = 0.284958$$

The Setosa term of the cross-entropy is then

$$-P(x_1)\, \log{Q(x_1)} = -1.0 \times \log{(0.284958)} \approx 1.2554,$$

taking $\log$ to be the natural logarithm.

The confidence that the model predicts the given feature vector $[1.9, 0.4]$ as Non-Setosa is computed as follows:

$$z_2 = w_{12}x_1 + w_{22}x_2 + b_2 = 0.5 \times 1.9 + 0.5 \times 0.4 + 0 = 1.15$$

$$a_2 = \operatorname{softmax}{(z)}_2 = \frac{e^{z_2}}{e^{z_1}+e^{z_2}} = 0.715042$$

Since the ground-truth probability of Non-Setosa is $P(x_2)=0$, this class contributes nothing to the loss:

$$-P(x_2)\, \log{Q(x_2)} = -0.0 \times \log{(0.715042)} = 0,$$

so the total cross-entropy is $H(P,Q) = -\log{(0.284958)} \approx 1.2554$.

In summary:

\begin{array}{l|c|c|c} \text{meaning} & a_j = Q(x_j) & j & y_j=P(x_j) \\ \hline \text{Setosa} & 0.284958 & 1 & 100\% \\ \text{Non-Setosa} & 0.715042 & 2 & 0\% \end{array}
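As a sanity check on these numbers, here is a short script that reproduces the table and the loss value (the variable names are my own, and I take the log to be the natural log):

```python
import numpy as np

x = np.array([1.9, 0.4])            # features of the given Setosa example
W = np.array([[0.1, 0.5],           # W[i, j] = w_ij, so column j holds (w_1j, w_2j)
              [0.1, 0.5]])
b = np.zeros(2)                     # all biases assumed 0.0

z = W.T @ x + b                     # scores: [0.23, 1.15]
a = np.exp(z) / np.exp(z).sum()     # softmax: [0.284958, 0.715042]

p = np.array([1.0, 0.0])            # ground-truth distribution (Setosa)
H = -(p * np.log(a)).sum()          # cross-entropy ≈ 1.2554
print(z, a, H)
```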

I've managed to calculate the derivative of $a_j$ with respect to $z_j$

$$\frac{da_j}{dz_j} = a_j(1-a_j) \ \ \text{for } j=1,2$$
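(As a quick numerical check of this expression, using the $z$ values from above; the helper name `softmax` and the finite-difference step are my own choices:)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # shift by max(z) only for numerical stability
    return e / e.sum()

z = np.array([0.23, 1.15])
a = softmax(z)

eps = 1e-6
a_eps = softmax(z + np.array([eps, 0.0]))   # perturb z_1 only
print((a_eps[0] - a[0]) / eps)              # numerical  da_1/dz_1
print(a[0] * (1 - a[0]))                    # analytic   a_1(1 - a_1) ≈ 0.2038
# Perturbing z_2 instead gives the cross term da_1/dz_2 = -a_1*a_2,
# since both outputs share the same softmax denominator.
```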

How do I compute the derivative of the loss $H(P,Q)$ with respect to the weights $W$ so that I can use the gradient descent algorithm to update the weights?

Consider $j=1$; I think the derivative of $H(P,Q)$ with respect to $W_{i,1}$ would be

\begin{align} \frac{\partial H}{\partial W_{i,1}} &=-\sum_{j=1}^2 \frac{P_j(x)}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i \\ &=-\frac{1}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i (P_1(x)+ P_2(x)) \end{align}

Is my understanding correct?

  • Here are some notes I wrote about computing the gradient for multiclass logistic regression. Those notes are my attempt to make the calculation as clean and clear as possible, making good use of vector and matrix notation and the multivariable chain rule.
    – littleO
    Commented Aug 2, 2021 at 7:07
  • This related question might help.
    – greg
    Commented Jan 25, 2023 at 17:47

1 Answer


$$H = -\sum_{j=1}^2 P_j(x) \log Q_j(x)=-\sum_{k=1}^2 P_k(x) \log Q_k(x)$$

(I have reindexed the sum with $k$ so that it does not clash with the $j$ in $W_{i,j}$ below.)

The trick is to apply the chain rule, keeping in mind that every output $Q_k(x)=a_k(x)$ depends on $z_j(x)$ through the shared softmax denominator, so none of the terms in the sum can be dropped. Using the softmax Jacobian $\frac{\partial a_k}{\partial z_j}=a_k(\delta_{kj}-a_j)$, the fact that only $z_j$ depends on $W_{i,j}$, and $\frac{\partial z_j}{\partial W_{i,j}}=x_i$:

\begin{align} \frac{\partial H}{\partial W_{i,j}} &= -\frac{\partial}{\partial W_{i,j}} \left(\sum_{k=1}^2 P_k(x) \log Q_k(x) \right) \\ &= -\sum_{k=1}^2 \frac{P_k(x)}{Q_k(x)} \cdot \frac{\partial Q_k(x)}{\partial W_{i,j}} \\ &= -\sum_{k=1}^2 \frac{P_k(x)}{a_k(x)} \cdot \frac{\partial a_k(x)}{\partial z_j(x)} \cdot \frac{\partial z_j(x)}{\partial W_{i,j}} \\ &= -\sum_{k=1}^2 P_k(x)\,\bigl(\delta_{kj}-a_j(x)\bigr)\, x_i \\ &= \Bigl(a_j(x)\sum_{k=1}^2 P_k(x) - P_j(x)\Bigr)\, x_i \\ &= \bigl(a_j(x)-P_j(x)\bigr)\, x_i \end{align}

where $\delta_{kj}=1$ if $k=j$ and $0$ otherwise, and $\sum_k P_k(x)=1$. For your one-hot labels ($P_1=1$, $P_2=0$) this gives $\bigl(a_1(x)-1\bigr)\,x_i$ for the Setosa column of $W$ and $a_2(x)\,x_i$ for the Non-Setosa column.

Now that we have the gradient, you can perform gradient descent.
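If it is useful, here is a rough sketch of one gradient-descent step based on the formula above, together with a finite-difference check of one gradient entry (the variable names and the learning rate are arbitrary choices of mine, not something from the question):

```python
import numpy as np

def loss(W, b, x, p):
    """Cross-entropy of softmax(W^T x + b) against the target distribution p."""
    z = W.T @ x + b
    e = np.exp(z - z.max())                  # shift for numerical stability
    return -(p * np.log(e / e.sum())).sum()

x = np.array([1.9, 0.4])
p = np.array([1.0, 0.0])                     # one-hot target: Setosa
W = np.array([[0.1, 0.5], [0.1, 0.5]])       # W[i, j] = w_ij
b = np.zeros(2)

z = W.T @ x + b
a = np.exp(z) / np.exp(z).sum()

grad_W = np.outer(x, a - p)                  # dH/dW[i, j] = (a_j - p_j) * x_i
grad_b = a - p                               # dH/db[j]    =  a_j - p_j

# Finite-difference check of one entry of the gradient.
eps = 1e-6
W_eps = W.copy()
W_eps[0, 0] += eps
print((loss(W_eps, b, x, p) - loss(W, b, x, p)) / eps, grad_W[0, 0])

# One gradient-descent step with an arbitrary learning rate.
lr = 0.1
W = W - lr * grad_W
b = b - lr * grad_b
```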

  • Thank you so much. Would you please explain a little how to get $ \sum_{j=1}^2 \frac{P(x_j)}{a_j}\cdot \frac{\partial a_j}{\partial W_{i,j}} $ from $\sum_{j=1}^2 \frac{P(x_j)}{Q(x_j)} \cdot \frac{\partial Q(x_j)}{\partial W_{i,j}}$?
    – JakeMZ
    Commented Aug 2, 2021 at 3:59
  • $a_j = Q(x_j)$, as you have stated in the first table. Commented Aug 2, 2021 at 4:14
  • You're so nice! Thank you! It seems the $j$ in $W_{i,j}$ conflicts a little bit with the one in $\sum_{j=1}^2$, doesn't it?
    – JakeMZ
    Commented Aug 2, 2021 at 4:31
  • True, but may I know where $a_j = Q(x_j)$ comes from? It seems that $j$ has been used with two different meanings. Commented Aug 2, 2021 at 6:21
  • Consider $j=1$; the derivative of $H(P,Q)$ is \begin{align} \frac{\partial H}{\partial W_{i,1}} =-\sum_{j=1}^2 \frac{P(x_j)}{a_1} \cdot \frac{\partial a_1}{\partial z_1}\cdot x_i \end{align} Is my understanding correct?
    – JakeMZ
    Commented Aug 2, 2021 at 6:51
