
Background knowledge:

To train a logistic regression model for a classification problem with two classes (called class $0$ and class $1$), we are given a training dataset consisting of feature vectors $x_1, x_2, \ldots, x_N \in \mathbb R^d$ and corresponding target values $y_1, y_2, \ldots, y_N \in \{0,1\}$. Our goal is to find numbers $\beta_0, \beta_1, \ldots, \beta_d \in \mathbb R$ such that $$ \tag{1} y_i \approx \sigma(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id}) \quad \text{for } i = 1, \ldots, N. $$ Here $x_{i1}, \ldots, x_{id}$ are the components of the feature vector $x_i$, and $\sigma:\mathbb R \to \mathbb R$ is the sigmoid function (also called the "logistic function") defined by $$ \sigma(u) = \frac{e^u}{1 + e^u} \quad \text{for all } u \in \mathbb R. $$ The sigmoid function is useful in machine learning because it converts a real number into a probability (that is, a number between $0$ and $1$).

Equation (1) can be expressed more concisely using vector notation: we hope that $$ y_i \approx \sigma(\hat x_i^T \beta) \quad \text{for } i = 1, \ldots, N $$ where $$ \hat x_i = \begin{bmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{id} \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}. $$ (The vector $\hat x_i$ is called an "augmented" feature vector; it is the vector you get by prepending a $1$ to $x_i$.) I think of $\sigma(\hat x_i^T \beta)$ as a "predicted probability" that tells you how strongly our model believes that example $i$ belongs to class $1$, while $y_i$ is a "ground truth" probability which reflects certainty about whether or not example $i$ belongs to class $1$.
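For concreteness, here is a minimal NumPy sketch of these definitions; the function names `sigmoid` and `augment` and the array names `X` and `beta` are my own choices for illustration, not part of the notation above.

```python
import numpy as np

def sigmoid(u):
    """Logistic function sigma(u) = e^u / (1 + e^u), written as 1 / (1 + e^-u)."""
    return 1.0 / (1.0 + np.exp(-u))

def augment(X):
    """Prepend a 1 to each row of X, producing the augmented vectors x_i hat."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Predicted probabilities sigma(x_i_hat^T beta) for every example at once:
# probs = sigmoid(augment(X) @ beta)
```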

We will use the binary cross-entropy loss function $$ \ell(p,q) = -p \log(q) - (1 - p) \log(1 - q) $$ to measure how well a predicted probability $q \in (0,1)$ agrees with a ground truth probability $p \in [0,1]$. Here $\log$ denotes the natural logarithm. The vector $\beta$ will be chosen to minimize the average cross-entropy $$ \tag{2} L(\beta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, \sigma(\hat x_i^T \beta)). $$ We can minimize $L(\beta)$ using an optimization algorithm such as gradient descent, stochastic gradient descent, or Newton's method. In order to use such methods, we need to know how to compute the gradient of $L$. For Newton's method, we also need to know how to compute the Hessian of $L$.
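As a hedged sketch, equation (2) can be evaluated directly from its definition. Here `X_hat` denotes a matrix whose $i$-th row is $\hat x_i^T$, and the small `eps` clip is my own addition to keep the logarithms finite when a predicted probability lands at $0$ or $1$.

```python
import numpy as np

def average_cross_entropy(beta, X_hat, y, eps=1e-12):
    """Equation (2): L(beta) = (1/N) * sum_i ell(y_i, sigma(x_i_hat^T beta))."""
    q = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))   # predicted probabilities
    q = np.clip(q, eps, 1.0 - eps)              # avoid log(0)
    return np.mean(-y * np.log(q) - (1.0 - y) * np.log(1.0 - q))
```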

Question:

How can we compute the gradient and the Hessian of the average cross-entropy function $L(\beta)$ defined in equation (2)?

The goal is to compute the gradient and the Hessian of $L$ with finesse, making good use of vector notation.

(I'll post an answer below.)


1 Answer


Notice that $$ L(\beta) = \frac{1}{N} \sum_{i=1}^N h_i(\hat x_i^T \beta) $$ where $h_i: \mathbb R \to \mathbb R$ is the function defined by $$ h_i(u) = \ell(y_i, \sigma(u)). $$ The sigmoid function and the cross-entropy loss function are natural companions, and they are happy to be combined together into the function $h_i$. By the chain rule, $$ L'(\beta) = \frac{1}{N} \sum_{i=1}^N h_i'(\hat x_i^T \beta) \hat x_i^T. $$ So, to compute $L'(\beta)$, we only need to figure out how to evaluate $h_i'(u)$. But this must be easy because $h_i$ is a function of a single variable. Evaluating $h_i'(u)$ is a routine exercise in single-variable calculus.

Before we take the derivative of $h_i$, let's first simplify $h_i(u)$ as much as possible: \begin{align} h_i(u) &= -y_i \log(\sigma(u)) - (1 - y_i) \log(1 - \sigma(u)) \\ &= -y_i \log(\frac{e^u}{1+e^u}) - (1 - y_i) \log(\frac{1}{1 + e^u}) \\ &= -y_i (u - \log(1 + e^u)) + (1 - y_i) \log(1 + e^u) \\ &= - y_i u + \log(1 + e^u). \end{align} Look how much $h_i(u)$ simplified! The sigmoid function and the cross-entropy loss function are meant to be together. This is why PyTorch offers the "binary cross-entropy with logits" loss function.
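This simplified form is also the numerically preferable way to evaluate the loss in code. A small sketch: `np.logaddexp(0, u)` computes $\log(1 + e^u)$ without overflow, which, as far as I know, is essentially the trick that "with logits" implementations such as PyTorch's `BCEWithLogitsLoss` rely on.

```python
import numpy as np

def h(u, y):
    """Per-example loss h_i(u) = -y_i * u + log(1 + e^u)."""
    return -y * u + np.logaddexp(0.0, u)
```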

Now we are ready to take the derivative: $$ h_i'(u) = -y_i +\frac{e^u}{1 + e^u} = \sigma(u) - y_i. $$ This is a beautiful and delightfully simple formula. Interpretation: If the predicted probability $\sigma(u)$ agrees perfectly with the ground truth probability $y_i$, then $h_i'(u)$ is zero, suggesting that no change to the value of $u$ is necessary.
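If you want to convince yourself of this formula numerically, a quick finite-difference check (my own, using the simplified form of $h_i$ from above) works:

```python
import numpy as np

u, y, d = 0.7, 1.0, 1e-6
h = lambda t: -y * t + np.logaddexp(0.0, t)   # simplified h_i
numeric = (h(u + d) - h(u - d)) / (2 * d)     # central difference
analytic = 1.0 / (1.0 + np.exp(-u)) - y       # sigma(u) - y_i
print(abs(numeric - analytic))                # should be tiny (~1e-10)
```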

Putting the above pieces together, we see that $$ L'(\beta) = \frac{1}{N} \sum_{i=1}^N (\sigma(\hat x_i^T \beta) - y_i) \hat x_i^T. $$ If we use the convention that the gradient is a column vector, then $$ \nabla L(\beta) = L'(\beta)^T = \frac{1}{N} \sum_{i=1}^N (\sigma(\hat x_i^T \beta) - y_i) \hat x_i. $$
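In code, this gradient formula vectorizes nicely. A sketch, with `X_hat` again denoting the matrix whose rows are the $\hat x_i^T$:

```python
import numpy as np

def gradient(beta, X_hat, y):
    """grad L(beta) = (1/N) * sum_i (sigma(x_i_hat^T beta) - y_i) * x_i_hat."""
    p = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))
    return X_hat.T @ (p - y) / len(y)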


Next let's compute the Hessian of $L$. The Hessian matrix $HL(\beta)$ is the derivative of the function $$g(\beta) = \nabla L(\beta)= \frac{1}{N} \sum_{i=1}^N \hat x_i (\sigma(\hat x_i^T \beta) - y_i). $$ To compute the derivative of $g$, it is helpful to know that $$ \sigma'(u) = \sigma(u) - \sigma(u)^2, $$ which is a beautiful result that you can easily check. Again using the chain rule, we find that the derivative of $g$ is \begin{align} g'(\beta) &= \frac{1}{N} \sum_{i=1}^N \hat x_i \sigma'(\hat x_i^T \beta) \hat x_i^T \\ &= \frac{1}{N} \sum_{i=1}^N (\sigma(u_i) - \sigma(u_i)^2) \hat x_i \hat x_i^T \end{align} where $u_i = \hat x_i^T \beta$.
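Equivalently, this sum is $\frac{1}{N} \hat X^T W \hat X$, where $\hat X$ is the matrix whose rows are the $\hat x_i^T$ and $W$ is the diagonal matrix with entries $\sigma(u_i) - \sigma(u_i)^2$. A short sketch computes it directly (again with `X_hat` holding the rows $\hat x_i^T$):

```python
import numpy as np

def hessian(beta, X_hat):
    """HL(beta) = (1/N) * sum_i (sigma(u_i) - sigma(u_i)^2) * x_i_hat x_i_hat^T."""
    p = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))
    w = p * (1.0 - p)                               # sigma'(u_i)
    return (X_hat * w[:, None]).T @ X_hat / X_hat.shape[0]
```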

So we have found our Hessian matrix $HL(\beta) = g'(\beta)$. Having computed the gradient and Hessian of $L$, it is now straightforward to train a logistic regression model from scratch using gradient descent or Newton's method in a language such as Python.
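To illustrate that last claim, here is a minimal from-scratch sketch of Newton's method using the gradient and Hessian derived above. The data layout and the tiny ridge term added before solving (only to keep the linear system well-posed) are my own assumptions rather than part of the derivation.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=20):
    X_hat = np.hstack([np.ones((X.shape[0], 1)), X])    # augmented features
    beta = np.zeros(X_hat.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))        # predicted probabilities
        grad = X_hat.T @ (p - y) / len(y)                # gradient of L
        w = p * (1.0 - p)
        H = (X_hat * w[:, None]).T @ X_hat / len(y)      # Hessian of L
        beta -= np.linalg.solve(H + 1e-8 * np.eye(len(beta)), grad)
    return beta
```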

