Background knowledge:
To train a logistic regression model for a classification problem with two classes (called class $0$ and class $1$), we are given a training dataset consisting of feature vectors $x_1, x_2, \ldots, x_N \in \mathbb R^d$ and corresponding target values $y_1, y_2, \ldots, y_N \in \{0,1\}$. Our goal is to find numbers $\beta_0, \beta_1, \ldots, \beta_d \in \mathbb R$ such that $$ \tag{1} y_i \approx \sigma(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id}) \quad \text{for } i = 1, \ldots, N. $$ Here $x_{i1}, \ldots, x_{id}$ are the components of the feature vector $x_i$, and $\sigma:\mathbb R \to \mathbb R$ is the sigmoid function (also called the "logistic function") defined by $$ \sigma(u) = \frac{e^u}{1 + e^u} \quad \text{for all } u \in \mathbb R. $$ The sigmoid function is useful in machine learning because it converts a real number into a probability (that is, a number between $0$ and $1$).

Equation (1) can be expressed more concisely using vector notation: we hope that $$ y_i \approx \sigma(\hat x_i^T \beta) \quad \text{for } i = 1, \ldots, N $$ where $$ \hat x_i = \begin{bmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{id} \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}. $$ (The vector $\hat x_i$ is called an "augmented" feature vector. It's the vector you get by prepending a $1$ to $x_i$.) I think of $\sigma(\hat x_i^T \beta)$ as a "predicted probability" which tells you how strongly our model believes that example $i$ belongs to class $1$. And $y_i$ is a "ground truth" probability: since $y_i$ is either $0$ or $1$, it expresses complete certainty about whether or not example $i$ belongs to class $1$.
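To make the setup concrete, here is a minimal NumPy sketch of the model's forward pass. (The helper names `sigmoid` and `predicted_probabilities` are my own; treat this as an illustration rather than production code.)

```python
import numpy as np

def sigmoid(u):
    """The logistic function sigma(u) = e^u / (1 + e^u)."""
    # 1 / (1 + e^{-u}) is the same quantity; for very negative u the
    # exponential can overflow, so this is a sketch, not robust code.
    return 1.0 / (1.0 + np.exp(-u))

def predicted_probabilities(X, beta):
    """Compute sigma(x_hat_i^T beta) for every row x_i of the N-by-d matrix X.

    beta has length d + 1, with beta[0] playing the role of the intercept beta_0.
    """
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a 1 to each row
    return sigmoid(X_hat @ beta)
```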
We will use the binary cross-entropy loss function $$ \ell(p,q) = -p \log(q) - (1 - p) \log(1 - q) $$ to measure how well a predicted probability $q \in (0,1)$ agrees with a ground truth probability $p \in [0,1]$. Here $\log$ denotes the natural logarithm. The vector $\beta$ will be chosen to minimize the average cross-entropy $$ \tag{2} L(\beta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, \sigma(\hat x_i^T \beta)). $$ We can minimize $L(\beta)$ using an optimization algorithm such as gradient descent, stochastic gradient descent, or Newton's method. In order to use such methods, we need to know how to compute the gradient of $L$. For Newton's method, we also need to know how to compute the Hessian of $L$.
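Continuing the sketch above, $L(\beta)$ itself is essentially one line of NumPy. (The clipping constant `eps` is my own guard against evaluating $\log(0)$ when a predicted probability saturates at $0$ or $1$.)

```python
def average_cross_entropy(beta, X, y, eps=1e-12):
    """The average cross-entropy L(beta) from equation (2).

    Uses predicted_probabilities from the sketch above.
    """
    q = predicted_probabilities(X, beta)  # q_i = sigma(x_hat_i^T beta)
    q = np.clip(q, eps, 1.0 - eps)        # keep the logs finite
    return -np.mean(y * np.log(q) + (1.0 - y) * np.log(1.0 - q))
```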
Question:
How can we compute the gradient and the Hessian of the average cross-entropy function $L(\beta)$ defined in equation (2)?
The goal is to compute the gradient and the Hessian of $L$ with finesse, making good use of vector notation.
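Whatever closed-form expressions we derive, they can be sanity-checked numerically. Here is a rough central-difference sketch (the helpers `numerical_gradient` and `numerical_hessian` are mine, and the step sizes are ad hoc):

```python
def numerical_gradient(f, beta, h=1e-6):
    """Central-difference approximation to the gradient of f at beta."""
    grad = np.zeros_like(beta, dtype=float)
    for j in range(beta.size):
        step = np.zeros_like(beta, dtype=float)
        step[j] = h
        grad[j] = (f(beta + step) - f(beta - step)) / (2.0 * h)
    return grad

def numerical_hessian(f, beta, h=1e-5):
    """Central-difference approximation to the Hessian of f at beta,
    obtained by differencing the numerical gradient once more."""
    n = beta.size
    H = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        H[:, j] = (numerical_gradient(f, beta + step)
                   - numerical_gradient(f, beta - step)) / (2.0 * h)
    return H
```

Comparing, say, `numerical_gradient(lambda b: average_cross_entropy(b, X, y), beta)` against the formula derived in the answer is a quick way to catch algebra mistakes.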
(I'll post an answer below.)