This is a sequel to a previous question about implementing binary logistic regression from scratch.
Background knowledge:
To train a logistic regression model for a classification problem with $K$ classes, we are given a training dataset consisting of feature vectors $x_1, x_2, \ldots, x_N \in \mathbb R^d$ and corresponding one-hot encoded label vectors $y_1, y_2, \ldots, y_N \in \mathbb R^K$. If example $i$ belongs to class $k$, then the label vector $y_i$ has a $1$ in the $k$th position and zeros elsewhere. Our goal is to find vectors $\beta_1, \beta_2, \ldots, \beta_K \in \mathbb R^{d+1}$ such that
$$ \tag{1} y_i \approx S\left(\begin{bmatrix} \hat x_i^T \beta_1 \\ \vdots \\ \hat x_i^T \beta_K \end{bmatrix} \right) \quad \text{for } i = 1, \ldots, N. $$
Here $\hat x_i \in \mathbb R^{d+1}$ is the "augmented" feature vector obtained by prepending a $1$ to the feature vector $x_i \in \mathbb R^d$, and $S:\mathbb R^K \to \mathbb R^K$ is the softmax function defined by
$$ S(u) = \begin{bmatrix} \frac{e^{u_1}}{e^{u_1} + \cdots + e^{u_K}} \\ \vdots \\ \frac{e^{u_K}}{e^{u_1} + \cdots + e^{u_K}} \end{bmatrix} \quad \text{for all vectors } u = \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix} \in \mathbb R^K. $$
The softmax function is useful in machine learning because it converts a vector into a "probability vector" (that is, a vector whose components are nonnegative and sum to $1$).

Equation (1) can be expressed more concisely using vector and matrix notation. Define
$$ \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix} \in \mathbb R^{K(d+1)} \quad \text{and} \quad M_i = \begin{bmatrix} \hat x_i^T & & & \\ & \hat x_i^T & & \\ & & \ddots & \\ & & & \hat x_i^T \end{bmatrix} \in \mathbb R^{K \times K(d+1)}. $$
Then equation (1) says that we hope that
$$ y_i \approx S(M_i \beta) \quad \text{for } i = 1, \ldots, N. $$
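To make the setup concrete, here is a minimal NumPy sketch of the softmax and the prediction $S(M_i \beta)$. The function names and the tiny random example are my own choices, not from the problem statement; note that $S(M_i \beta)$ can be computed as the softmax of $B \hat x_i$, where the rows of $B$ are $\beta_1^T, \ldots, \beta_K^T$, so the block matrix $M_i$ never needs to be formed explicitly.

```python
import numpy as np

def softmax(u):
    """Softmax of a vector u; subtracting max(u) leaves S(u) unchanged
    but avoids overflow in np.exp."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

# Hypothetical small example: K = 3 classes, d = 2 features.
rng = np.random.default_rng(0)
d, K = 2, 3
x = rng.standard_normal(d)
x_hat = np.concatenate(([1.0], x))    # augmented feature vector (prepend a 1)
B = rng.standard_normal((K, d + 1))   # rows are beta_1, ..., beta_K
p = softmax(B @ x_hat)                # same vector as S(M_i beta)
```

The result `p` is a probability vector: its entries are nonnegative and sum to $1$.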
I think of $S(M_i \beta) \in \mathbb R^K$ as being a "predicted probability vector" which tells you how strongly our model believes that example $i$ belongs to each of the $K$ classes. And $y_i$ is a "ground truth" probability vector which reflects certainty about which of the $K$ classes example $i$ belongs to.
We will use the cross-entropy loss function
$$ \ell(p,q) = \sum_{k=1}^K -p_k \log(q_k) $$
to measure how well a predicted probability vector $q \in \mathbb R^K$ agrees with a ground truth probability vector $p \in \mathbb R^K$. Here $\log$ denotes the natural logarithm. The vector $\beta$ will be chosen to minimize the average cross-entropy
$$ \tag{2} L(\beta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, S(M_i \beta)). $$

We can minimize $L(\beta)$ using an optimization algorithm such as gradient descent, stochastic gradient descent, or Newton's method. In order to use such methods, we need to know how to compute the gradient of $L$. For Newton's method, we also need to know how to compute the Hessian of $L$.
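For reference, equation (2) translates directly into a few lines of NumPy. This is only an illustrative sketch (the function names and the `(K, d+1)` layout for the stacked coefficients are my own conventions); a sanity check is that with $\beta = 0$ every predicted probability is $1/K$, so the loss is exactly $\log K$.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax of a vector."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

def average_cross_entropy(B, X, Y):
    """L(beta) from equation (2).
    B : (K, d+1) array whose rows are beta_1, ..., beta_K
    X : (N, d) array of feature vectors x_i
    Y : (N, K) array of one-hot label vectors y_i
    """
    N = X.shape[0]
    X_hat = np.hstack([np.ones((N, 1)), X])   # augment each x_i with a leading 1
    total = 0.0
    for x_hat, y in zip(X_hat, Y):
        q = softmax(B @ x_hat)                # predicted probability vector
        total += -np.sum(y * np.log(q))       # cross-entropy ell(y, q)
    return total / N
```

A loop over examples keeps the correspondence with equation (2) obvious; a vectorized version over all $N$ rows at once would be the natural next step.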
Question:
How can we compute the gradient and the Hessian of the average cross-entropy function $L(\beta)$ defined in equation (2)?
The goal is to compute the gradient and the Hessian of $L$ with finesse, making good use of vector and matrix notation.
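Whatever closed-form expressions an answer derives, they can be sanity-checked numerically. Here is a generic central-difference gradient approximation (my own helper, not part of the question) that can be compared against any candidate formula for $\nabla L(\beta)$; the same idea, applied to the gradient function, checks the Hessian.

```python
import numpy as np

def numerical_gradient(f, beta, h=1e-6):
    """Central-difference approximation of the gradient of f at beta,
    useful for verifying a hand-derived gradient formula."""
    g = np.zeros_like(beta)
    for j in range(beta.size):
        e = np.zeros_like(beta)
        e[j] = h
        g[j] = (f(beta + e) - f(beta - e)) / (2 * h)
    return g
```

For example, for $f(\beta) = \beta^T \beta$ the output should agree with the exact gradient $2\beta$ to within the finite-difference error.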
(I'll post an answer below.)