
This is a sequel to a previous question about implementing binary logistic regression from scratch.

Background knowledge:

To train a logistic regression model for a classification problem with $K$ classes, we are given a training dataset consisting of feature vectors $x_1, x_2, \ldots, x_N \in \mathbb R^d$ and corresponding one-hot encoded label vectors $y_1, y_2, \ldots, y_N \in \mathbb R^K$. If example $i$ belongs to class $k$, then the label vector $y_i$ has a $1$ in the $k$th position and zeros elsewhere. Our goal is to find vectors $\beta_1, \beta_2, \ldots, \beta_K \in \mathbb R^{d+1}$ such that $$ \tag{1} y_i \approx S\left(\begin{bmatrix} \hat x_i^T \beta_1 \\ \vdots \\ \hat x_i^T \beta_K \end{bmatrix} \right) \quad \text{for } i = 1, \ldots, N. $$ Here $\hat x_i \in \mathbb R^{d+1}$ is the "augmented" feature vector obtained by prepending a $1$ to the feature vector $x_i \in \mathbb R^d$, and $S:\mathbb R^K \to \mathbb R^K$ is the softmax function defined by $$ S(u) = \begin{bmatrix} \frac{e^{u_1}}{e^{u_1} + \cdots + e^{u_K}} \\ \vdots \\ \frac{e^{u_K}}{e^{u_1} + \cdots + e^{u_K}} \end{bmatrix} \quad \text{for all vectors } u = \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix} \in \mathbb R^K. $$ The softmax function is useful in machine learning because it converts a vector into a "probability vector" (that is, a vector whose components are nonnegative and sum to $1$). Equation (1) can be expressed more concisely using vector and matrix notation as follows. Define $$ \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix} \in \mathbb R^{K(d+1)} \quad \text{and} \quad M_i = \begin{bmatrix} \hat x_i^T & & & \\ & \hat x_i^T & & \\ & & \ddots & \\ & & & \hat x_i^T \end{bmatrix} \in \mathbb R^{K \times K(d+1)}. $$ Then equation (1) says that we hope that $$ y_i \approx S(M_i \beta) \quad \text{for } i = 1, \ldots, N. $$
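For concreteness, here is a minimal NumPy sketch of the model in equation (1), storing $\beta$ as a $(d+1) \times K$ array whose $k$th column is $\beta_k$ (the helper names are purely illustrative):

```python
import numpy as np

def softmax(u):
    # Shift by max(u) for numerical stability; this does not change S(u).
    z = np.exp(u - np.max(u))
    return z / z.sum()

def predict_proba(beta, x):
    """Predicted probability vector S(M_i beta) for one raw feature vector x.

    beta : (d+1, K) array whose k-th column is beta_k
    x    : (d,) feature vector
    """
    x_hat = np.concatenate(([1.0], x))  # prepend a 1: the augmented feature vector
    return softmax(beta.T @ x_hat)      # equals S(M_i beta)
```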

I think of $S(M_i \beta) \in \mathbb R^K$ as being a "predicted probability vector" which tells you how strongly our model believes that example $i$ belongs to each of the $K$ classes. And $y_i$ is a "ground truth" probability vector which reflects certainty about which of the $K$ classes example $i$ belongs to.

We will use the cross-entropy loss function $$ \ell(p,q) = \sum_{k=1}^K -p_k \log(q_k) $$ to measure how well a predicted probability vector $q \in \mathbb R^K$ agrees with a ground truth probability vector $p \in \mathbb R^K$. Here $\log$ denotes the natural logarithm. The vector $\beta$ will be chosen to minimize the average cross-entropy $$ \tag{2} L(\beta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, S(M_i \beta)). $$ We can minimize $L(\beta)$ using an optimization algorithm such as gradient descent, stochastic gradient descent, or Newton's method. In order to use such methods, we need to know how to compute the gradient of $L$. For Newton's method, we also need to know how to compute the Hessian of $L$.
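Continuing the sketch above (again with illustrative names; the small `eps` is only there to guard the logarithm), the average cross-entropy in equation (2) can be evaluated over the whole training set as:

```python
import numpy as np

def average_cross_entropy(beta, X, Y, eps=1e-12):
    """L(beta) from equation (2).

    beta : (d+1, K) array whose columns are beta_1, ..., beta_K
    X    : (N, d) array whose rows are the raw feature vectors x_i
    Y    : (N, K) array whose rows are the one-hot label vectors y_i
    """
    N = X.shape[0]
    X_hat = np.hstack([np.ones((N, 1)), X])   # augmented features, shape (N, d+1)
    U = X_hat @ beta                          # row i is (M_i beta)^T
    U -= U.max(axis=1, keepdims=True)         # stabilize the softmax
    P = np.exp(U)
    P /= P.sum(axis=1, keepdims=True)         # row i is S(M_i beta)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))
```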

Question:

How can we compute the gradient and the Hessian of the average cross-entropy function $L(\beta)$ defined in equation (2)?

The goal is to compute the gradient and the Hessian of $L$ with finesse, making good use of vector and matrix notation.

(I'll post an answer below.)


3 Answers

Answer 1:

$ \def\bbR#1{{\mathbb R}^{#1}} \def\a{\alpha}\def\b{\beta}\def\g{\gamma}\def\t{\theta} \def\l{\lambda}\def\s{\sigma}\def\e{\varepsilon} \def\n{\nabla}\def\o{{\tt1}}\def\p{\partial} \def\E{R}\def\F{{\cal F}}\def\L{{\cal L}} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Big(#1\Big)} \def\vec#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3}} \def\c#1{\color{red}{#1}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} $It will be convenient to use $\{\e_k,\o,J,I\}$ as the standard basis vector, the all-ones vector, all-ones matrix, and identity matrix, respectively. And to use $\{\odot,\oslash\}$ to denote elementwise multiplication and division.

Further, the indexed column vectors can be collected into matrices, e.g. $$\eqalign{ X &= \m{x_\o&\ x_2&\ldots&\ x_N} &\qiq x_i = X\e_i \\ Y &= \m{y_\o&\ y_2&\ldots&\ y_N} &\qiq y_j = Y\e_j \\ B &= \m{\b_\o&\b_2&\ldots&\b_K} &\qiq \b_k = B\e_k \\ }$$

The various functions can be defined for the vector argument $u$. $$\eqalign{ e &= \exp(u) &\qiq de = e\odot du \\ s &= \operatorname{Softmax}(u) \\ &= \frac{e}{\o:e} = e \oslash Je &\qiq ds = \BR{\Diag{s}-ss^T}\,du \\ \l &= \operatorname{log-sum-exp}(u) \\ &= \log(\o:e) &\qiq d\l = s:du \\ }$$ $$\eqalign{ \log(s) &= \log(e)-\log(Je) \;&=\; u -\l\o \qquad\qquad\qquad\qquad\qquad\qquad \\ \ell(y,s) &= -y:\log(s) \;&=\; \l{y:\o} - y:u \;=\; \l - y:u \\ }$$ where a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:AB &= CB^T:A = A^TC:B \\ }$$ Furthermore, it commutes with elementwise multiplication and division $$\eqalign{ A:(B\odot C) &= (A\odot B):C &= \sum_{i=1}^m\sum_{j=1}^n A_{ij} B_{ij} C_{ij} \\ }$$
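As a quick numerical sanity check of the softmax differential $ds = \left(\operatorname{Diag}(s) - ss^T\right)du$ above, here is a minimal NumPy sketch (the test values are arbitrary):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

rng = np.random.default_rng(0)
u  = rng.normal(size=5)
du = 1e-6 * rng.normal(size=5)

s = softmax(u)
ds_analytic = (np.diag(s) - np.outer(s, s)) @ du    # (Diag(s) - s s^T) du
ds_numeric  = softmax(u + du) - softmax(u)
print(np.allclose(ds_analytic, ds_numeric, atol=1e-10))   # prints True
```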

The Kronecker product $(\otimes)$ can also be used to good effect since $$\eqalign{ M_i &= \LR{I\otimes x_i^T} \\ M_i\b &= \LR{I\otimes x_i^T} \vec{B} = \vec{\e_i^TX^TB} = {B^TX\e_i} \\ y_i &\approx \operatorname{Softmax}(M_i\b) = \exp\LR{B^TX\e_i}\oslash J\exp\LR{B^TX\e_i} \\ Y &\approx \exp\LR{B^TX}\oslash J\exp\LR{B^TX} \;\doteq\; S \\ }$$ The last equation is the Matrix Model that must be optimized for $B$ with respect to the cross-entropy loss function. Note that we now have matrix analogs of all of the vector quantities $$\eqalign{ u &\to U=B^TX \\ e &\to E=\exp(U) \\ s &\to S=\operatorname{Softmax}(U) &= E\oslash JE \\ &&= E\cdot\c{\Diag{E^T\o}^{-1}} \\ &&= E\c{D^{-1}} \\ y &\to Y,\quad Y^T\o=\o,\;&JY=J,\quad Y:J=N \\ }$$ Calculate the gradient of the loss function. $$\eqalign{ {\ell} &= -\BR{Y:\log(S)} \\ &= Y:\log(JE) - Y:U \\ \\ d{\ell} &= Y:d\log(JE) - Y:dU \\ &= Y:(J\,dE\oslash JE) - Y:dB^TX \\ &= (Y\oslash JE):(J\,dE) - Y^T:X^TdB \\ &= \LR{Y{D^{-1}}}:\BR{J\,dE} - XY^T:dB \\ &= JY{D^{-1}}:dE - XY^T:dB \\ &= JD^{-1}:(E\odot dU) - XY^T:dB \\ &= (E\odot JD^{-1}):dB^TX - XY^T:dB \\ &= \LR{ED^{-1}}^T:X^TdB - XY^T:dB \\ &= X\LR{ED^{-1}}^T:dB - XY^T:dB \\ \\ \grad{\ell}{B} &= X\LR{ED^{-1} - Y}^T \\ &= X\LR{S - Y}^T \;\doteq\; G \\ }$$ This $G$ matrix can be used in any standard gradient descent algorithm.
The Hessian will be a fourth-order tensor and much uglier.
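For reference, here is a minimal NumPy sketch of the gradient result above, with $X$, $Y$, $B$ stored column-wise as in this answer (the helper names are my own):

```python
import numpy as np

def softmax_columns(U):
    # Column-wise softmax: S = E / (J E) with E = exp(U).
    E = np.exp(U - U.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def loss_and_gradient(B, X, Y):
    """Summed cross-entropy and its matrix gradient G = X (S - Y)^T.

    X : (d+1, N) columns are the (augmented) feature vectors x_i
    Y : (K, N)   columns are the one-hot labels y_i
    B : (d+1, K) columns are beta_1, ..., beta_K
    """
    U = B.T @ X                             # (K, N), column i is u_i = B^T x_i
    S = softmax_columns(U)                  # (K, N), column i is s_i
    loss = -np.sum(Y * np.log(S + 1e-12))   # ell = -Y : log(S)
    G = X @ (S - Y).T                       # (d+1, K), gradient with respect to B
    return loss, G
```

A plain gradient-descent step is then simply `B -= learning_rate * G`.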

Update

Here is an attempt at a Hessian calculation (with flattening via vectorization).

First, recall some identities involving the Diag() operator $$\eqalign{ &E\odot JP = E\cdot\Diag{P^T\o} \\ &H = \Diag{h} \iff h = H\o \\ &\Diag{H\o} = \Diag{h} = H \qiq &\Diag{H^n\o} = H^n \\ }$$ Calculate the differential of $G$. $$\eqalign{ G &= X\LR{ED^{-1}-Y}^T \\ dG &= X\LR{dE\;D^{-1}}^T - X\LR{E\,dD\,D^{-2}}^T \\ &= XD^{-1}\,dE^T - XD^{-2}\,dD\,E^T \\ &= XD^{-1}\LR{E^T\odot dU^T} - XD^{-2}\Diag{dE^T\o}\,E^T \\ &= XD^{-1}\LR{\c{E^T\odot X^TdB}} - XD^{-2}\Diag{(\c{E^T\odot X^TdB})\o}\,E^T \\ }$$ Note the following vec expansion of the $\c{\rm red}$ term $$\eqalign{ &\E = \Diag{\vec{E^T}} \\ &\vec{\c{E^T\odot X^TdB}} = \E\cdot\vec{X^TdB} = \E \LR{I\otimes X^T} d\b \\ }$$ and use this to vectorize the differential relationship and recover the Hessian. $$\eqalign{ dg &= \vec{dG} \\ &= \LR{I\otimes XD^{-1}} \vec{\c{E^T\odot X^TdB}} - \LR{E\boxtimes XD^{-2}} \LR{\c{E^T\odot X^TdB}}\o \\ &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}} \;\vec{\c{E^T\odot X^TdB}} \\ &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} d\b \\ \grad{g}{\b} &= \BR{\LR{I\otimes XD^{-1}} - \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} \;=\; \hess{\ell}{\b\,}{\b^T} \\ }$$ where $\{\boxtimes\}$ denotes the Khatri-Rao product, which can be expanded in terms of Hadamard and Kronecker products $$\eqalign{ B\boxtimes A &= (B\otimes\o_m)\odot(\o_p\otimes A) \;\in \bbR{(pm)\times n} \\ \o_p &\in \bbR{p\times\o} \qquad A \in \bbR{m\times n} \qquad B \in \bbR{p\times n} \\\\ }$$ Note that the gradient can also be vectorized, if that is preferable $$\vec{G} \;=\; g \;=\; \grad{\ell}{\b}$$
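Rather than coding the Khatri-Rao expression directly, here is a minimal NumPy sketch that assembles the Hessian from the per-example form $\sum_i \left(\operatorname{Diag}(s_i) - s_i s_i^T\right) \otimes x_i x_i^T$ (no $\tfrac1N$ factor here, since $\ell$ above is a sum), which the vectorized expression above should reproduce; the column-stacking convention $\beta = \operatorname{vec}(B)$ is assumed:

```python
import numpy as np

def hessian(B, X):
    """Hessian of the summed cross-entropy w.r.t. b = vec(B), where vec stacks
    the columns beta_1, ..., beta_K, so H has shape (K*(d+1), K*(d+1)).

    Assembled from the per-example Kronecker form
        H = sum_i (Diag(s_i) - s_i s_i^T) kron (x_i x_i^T).
    """
    d1, K = B.shape
    U = B.T @ X
    E = np.exp(U - U.max(axis=0, keepdims=True))
    S = E / E.sum(axis=0, keepdims=True)           # column i is s_i
    H = np.zeros((K * d1, K * d1))
    for i in range(X.shape[1]):
        s, x = S[:, i], X[:, i]
        H += np.kron(np.diag(s) - np.outer(s, s), np.outer(x, x))
    return H
```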

Answer 2:

Notice that $$ L(\beta) = \frac{1}{N} \sum_{i=1}^N h_i(M_i \beta) $$ where $h_i: \mathbb R^K \to \mathbb R$ is the function defined by $$ h_i(u) = \ell(y_i, S(u)). $$ The softmax function and the cross-entropy loss function are natural companions, and they are happy to be combined together into the function $h_i$. By the chain rule, $$ L'(\beta) = \frac{1}{N} \sum_{i=1}^N h_i'(M_i \beta) M_i. $$ If we use the convention that the gradient is a column vector, then we have $$ \nabla L(\beta) = L'(\beta)^T = \frac{1}{N} \sum_{i=1}^N M_i^T \nabla h_i(M_i \beta). $$ So, to compute $\nabla L(\beta)$, we only need to figure out how to evaluate $\nabla h_i(u)$. We are nearly done already.

Before we compute $\nabla h_i(u)$, let's first simplify $h_i(u)$ as much as possible. The components of the label vector $y_i \in \mathbb R^K$ will be denoted $y_{i1}, \ldots, y_{iK}$. Note that $y_{i1} + \cdots + y_{iK} = 1$. Doing some high school algebra and using some properties of logarithms, we find that \begin{align} h_i(u) &= \sum_{k=1}^K - y_{ik} \log\left(\frac{e^{u_k}}{e^{u_1} + \cdots + e^{u_K}}\right) \\ &= \sum_{k=1}^K -y_{ik} u_k + y_{ik} \log(e^{u_1} + \cdots + e^{u_K}) \\ &= -\langle y_i, u \rangle + \sum_{k=1}^K y_{ik} \underbrace{\log(e^{u_1} + \cdots + e^{u_K})}_{\text{common factor}} \\ &= - \langle y_i, u \rangle + \log(e^{u_1} + \cdots + e^{u_K})\underbrace{\sum_{k=1}^K y_{ik}}_1 \\ &= \underbrace{\log(e^{u_1} + \cdots + e^{u_K})}_{\text{LogSumExp function}} - \langle y_i, u \rangle. \end{align} Look how much $h_i(u)$ simplified! And look how beautiful it is that our old friend the LogSumExp function has appeared. This goes to show that the softmax function and the cross-entropy loss function are meant to be together. This is why PyTorch's loss function torch.nn.functional.cross_entropy takes "logits" as input. (In other words, PyTorch's cross_entropy loss function assumes that you have not yet applied the softmax function.)
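For what it's worth, here is a minimal sketch of $h_i$ in this LogSumExp form (the helper name is mine); the usual $\max(u)$ shift makes it numerically stable, which is one reason libraries prefer to work with logits directly:

```python
import numpy as np

def h(u, y):
    """ell(y, S(u)) computed as LogSumExp(u) - <y, u>, never forming S(u).

    Shifting by max(u) does not change the value, since
    LogSumExp(u) = max(u) + LogSumExp(u - max(u)), and it avoids overflow.
    """
    m = np.max(u)
    return m + np.log(np.sum(np.exp(u - m))) - y @ u
```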

Now we are ready to compute the gradient. It can be shown that the gradient of the LogSumExp function is the softmax function, which is a beautiful fact. It follows that $$ \nabla h_i(u) = S(u) - y_i. $$ This is a beautiful and delightfully simple formula. Interpretation: If the predicted probability vector $S(u)$ agrees perfectly with the ground truth probability vector $y_i$, then $\nabla h_i(u)$ is zero, suggesting that no change to the value of $u$ is necessary.

Putting the above pieces together, we see that $$ \nabla L(\beta) = \frac{1}{N} \sum_{i=1}^N M_i^T (S(M_i \beta) - y_i). $$

We are now done with computing the gradient of $L$, but it's helpful to make one further observation. Let's look more closely at the $i$th term in the sum above, which we shall call $g_i$. The vector $\beta$ has block structure, since it is the concatenation of the vectors $\beta_1, \ldots, \beta_K$. Let $S_k$ be the $k$th component function of $S$. Notice that $$ g_i = M_i^T(S(M_i \beta) - y_i) = \underbrace{\begin{bmatrix} \hat x_i & & & \\ & \hat x_i & & \\ & & \ddots & \\ & & & \hat x_i \end{bmatrix}}_{K(d+1) \times K} \underbrace{\begin{bmatrix} S_1(M_i \beta) - y_{i1} \\ S_2(M_i \beta) - y_{i2} \\ \vdots \\ S_K(M_i \beta) - y_{iK}\end{bmatrix}}_{K \times 1}. $$ We can see that the $k$th block of $g_i$ is $(S_k(M_i \beta) - y_{ik}) \hat x_i$. This is a very nice formula. Interpretation: If the predicted probability $S_k(M_i \beta)$ agrees perfectly with the ground truth probability $y_{ik}$, then the $k$th block of $g_i$ is $0$, suggesting that no change needs to be made to $\beta_k$.


The above observation makes life easier when implementing multiclass logistic regression from scratch in Python. Due to the block structure of the vectors $\beta$ and $g_i \in \mathbb R^{K(d+1)}$, it is convenient to store $\beta$ and $g_i$ as Numpy arrays with shape $(d+1) \times K$. Then $M_i \beta$ can be computed using the Numpy command MiBeta = beta.T @ xiHat. And $g_i$ can be computed using the Numpy command gi = np.outer(xiHat,S(MiBeta) - yi). This makes it possible to implement multiclass logistic regression from scratch in Python very concisely, particularly when using stochastic gradient descent with a batch size of 1.
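For illustration, here is a minimal sketch along these lines, using stochastic gradient descent with a batch size of $1$ (the function name, hyperparameters, and random shuffling are my own choices, not anything prescribed above):

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / z.sum()

def train_sgd(X, Y, learning_rate=0.1, num_epochs=50, seed=0):
    """Multiclass logistic regression via SGD with a batch size of 1.

    X : (N, d) raw feature vectors;  Y : (N, K) one-hot labels.
    Returns beta as a (d+1, K) array whose k-th column is beta_k.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = Y.shape[1]
    beta = np.zeros((d + 1, K))
    for _ in range(num_epochs):
        for i in rng.permutation(N):
            xiHat = np.concatenate(([1.0], X[i]))         # augmented feature vector
            MiBeta = beta.T @ xiHat                       # M_i beta
            gi = np.outer(xiHat, softmax(MiBeta) - Y[i])  # i-th gradient term
            beta -= learning_rate * gi                    # SGD step
    return beta
```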


Next let's compute the Hessian of $L$. The Hessian matrix $H L(\beta)$ is the derivative of the function $$g(\beta) = \nabla L(\beta) = \frac{1}{N} \sum_{i=1}^N M_i^T (S(M_i \beta) - y_i). $$ To compute the derivative of $g$, it is helpful to know that $$ S'(u) = \text{diag}(S(u)) - S(u) S(u)^T, $$ which is analogous to the formula for the derivative of the sigmoid function. Again using the chain rule, we find that the derivative of $g$ is \begin{align} g'(\beta) &= \frac{1}{N} \sum_{i=1}^N M_i^T S'(M_i \beta) M_i. \end{align} If we look closely at this formula, it simplifies nicely: since $M_i = I_K \otimes \hat x_i^T$ (a block diagonal matrix with $\hat x_i^T$ repeated $K$ times), each term satisfies $M_i^T S'(M_i \beta) M_i = S'(M_i \beta) \otimes \hat x_i \hat x_i^T$, so $$ HL(\beta) = \frac{1}{N} \sum_{i=1}^N \left( \text{diag}(S(M_i \beta)) - S(M_i \beta) S(M_i \beta)^T \right) \otimes \hat x_i \hat x_i^T, $$ which matches the Kronecker-product form given in the last answer below.

So we have found our Hessian matrix $HL(\beta) = g'(\beta)$. Having computed the gradient and Hessian of $L$, it is now straightforward to train a multiclass logistic regression model from scratch using gradient descent or Newton's method in a language such as Python.
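As a sketch of what one Newton step could look like, using the simplification $M_i^T S'(M_i \beta) M_i = S'(M_i \beta) \otimes \hat x_i \hat x_i^T$ noted above (the function name, the array layout, and the small ridge term, added because this Hessian is singular, are my own choices):

```python
import numpy as np

def newton_step(beta, X, Y, ridge=1e-8):
    """One Newton update for L(beta) in equation (2).

    beta : (d+1, K);  X : (N, d) raw features;  Y : (N, K) one-hot labels.
    The Hessian is assembled from the blocks S'(M_i beta) kron xhat_i xhat_i^T;
    a small ridge is added because the softmax Hessian is singular (adding the
    same vector to every beta_k leaves the predictions unchanged).
    """
    N, d = X.shape
    K = Y.shape[1]
    X_hat = np.hstack([np.ones((N, 1)), X])            # (N, d+1), rows are xhat_i^T
    U = X_hat @ beta                                    # row i is (M_i beta)^T
    U -= U.max(axis=1, keepdims=True)
    P = np.exp(U)
    P /= P.sum(axis=1, keepdims=True)                   # row i is S(M_i beta)
    grad = (X_hat.T @ (P - Y)).ravel(order="F") / N     # the K blocks of nabla L
    H = np.zeros((K * (d + 1), K * (d + 1)))
    for i in range(N):
        Sp = np.diag(P[i]) - np.outer(P[i], P[i])       # S'(M_i beta)
        H += np.kron(Sp, np.outer(X_hat[i], X_hat[i]))
    H = H / N + ridge * np.eye(K * (d + 1))
    beta_vec = beta.ravel(order="F") - np.linalg.solve(H, grad)
    return beta_vec.reshape(d + 1, K, order="F")
```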

Answer 3:

The objective function in multinomial logistic regression can be written as $$ \phi(\mathbf{B}) = \frac{-1}{N} \sum_n \mathbf{y}_n^T \log(\mathbf{p}_n) $$ where $\mathbf{p}_n$ is the softmax probability vector for the $n$-th example computed as $\mathbf{p}_n = \mathrm{softmax}(\mathbf{B}^T \mathbf{x}_n)$. Note that $\log$ is applied elementwise.

The gradient is $$ \frac{\partial \phi}{\partial \mathbf{B}} = \frac{1}{N} \sum_n \mathbf{x}_n \left[\mathbf{p}_n - \mathbf{y}_n\right]^T $$ or in vectorized form $$ \mathbf{g} = \frac{\partial \phi} {\partial \mathrm{vec}(\mathbf{B})} = \frac{1}{N} \sum_n \left[\mathbf{p}_n - \mathbf{y}_n\right] \otimes \mathbf{x}_n $$

For a single example (dropping the subscript $n$ and the $1/N$ factor), we have \begin{eqnarray*} d\mathbf{g} &=& d\mathbf{p} \otimes \mathbf{x} = \left[ \mathbf{A} (d\mathbf{B})^T \mathbf{x} \right] \otimes \mathbf{x} \\ &=& \mathrm{vec} \left[ \mathbf{x} \mathbf{x}^T (d\mathbf{B}) \mathbf{A} \right] \\ &=& \left[ \mathbf{A} \otimes \mathbf{x} \mathbf{x}^T \right] \mathrm{vec}(d\mathbf{B}) \end{eqnarray*}

where we have used the Jacobian $d\mathbf{p} = [\mathrm{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^T] (d\mathbf{B})^T \mathbf{x} = \mathbf{A} (d\mathbf{B})^T \mathbf{x} $.

The Hessian is $$ \mathbf{H} = \frac{1}{N} \sum_n \left[ \mathrm{diag}(\mathbf{p}_n)-\mathbf{p}_n\mathbf{p}_n^T \right] \otimes \left[ \mathbf{x}_n \mathbf{x}_n^T \right] $$
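For completeness, a minimal single-example NumPy check of these formulas (the dimensions and random values are arbitrary, and the usual column-stacking $\mathrm{vec}$ convention is assumed):

```python
import numpy as np

# Single-example check of the formulas above (dimensions and values are arbitrary).
rng = np.random.default_rng(0)
d, K = 4, 3
x = rng.normal(size=d)                      # x_n
y = np.eye(K)[rng.integers(K)]              # one-hot y_n
B = rng.normal(size=(d, K))

u = B.T @ x
p = np.exp(u - u.max())
p /= p.sum()                                # p_n = softmax(B^T x_n)

grad_B = np.outer(x, p - y)                 # x_n [p_n - y_n]^T
g = np.kron(p - y, x)                       # [p_n - y_n] kron x_n
A = np.diag(p) - np.outer(p, p)             # diag(p_n) - p_n p_n^T
H = np.kron(A, np.outer(x, x))              # single-example Hessian term

print(np.allclose(g, grad_B.ravel(order="F")))   # True: g = vec(grad_B)
```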

Comment (greg, Nov 16, 2021): +1, nice compact notation.
