
If I have a dataset with a very rare positive class, and I down-sample the negative class, then perform a logistic regression, do I need to adjust the regression coefficients to reflect the fact that I changed the prevalence of the positive class?

For example, say I have a dataset with four variables: Y, A, B, and C. Y, A, and B are binary; C is continuous. Y=0 for 11,100 observations and Y=1 for 900:

set.seed(42)
n <- 12000                         # total number of observations
r <- 1/12                          # (defined but not used below)
A <- sample(0:1, n, replace=TRUE)  # binary predictor
B <- sample(0:1, n, replace=TRUE)  # binary predictor
C <- rnorm(n)                      # continuous predictor
Y <- ifelse(10 * A + 0.5 * B + 5 * C + rnorm(n)/10 > -5, 0, 1)  # rare positive class

I fit a logistic regression to predict Y, given A, B and C.

dat1 <- data.frame(Y, A, B, C)
mod1 <- glm(Y ~ ., data=dat1, family=binomial)

However, to save time I could remove 10,200 of the Y=0 observations, leaving 900 with Y=0 and 900 with Y=1:

library(caret)
dat2 <- downSample(data.frame(A, B, C), factor(Y), list=FALSE)
mod2 <- glm(Class ~ ., data=dat2, family=binomial)

The regression coefficients from the 2 models look very similar:

> coef(summary(mod1))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -127.67782  20.619858 -6.191983 5.941186e-10
A           -257.20668  41.650386 -6.175373 6.600728e-10
B            -13.20966   2.231606 -5.919353 3.232109e-09
C           -127.73597  20.630541 -6.191596 5.955818e-10
> coef(summary(mod2))
              Estimate  Std. Error     z value    Pr(>|z|)
(Intercept) -167.90178   59.126511 -2.83970391 0.004515542
A           -246.59975 4059.733845 -0.06074284 0.951564016
B            -16.93093    5.861286 -2.88860377 0.003869563
C           -170.18735   59.516021 -2.85952165 0.004242805

This leads me to believe that the down-sampling did not affect the coefficients. However, this is a single, contrived example, and I'd rather know for sure.
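
To make it slightly less of a one-off, here is a rough sketch (a sanity check only, not a proof) that re-runs the same simulation for a few seeds and compares the two fits; it assumes the caret package is available:

# Re-run the simulation for several seeds and compare the full-data fit
# with the down-sampled fit (illustrative sketch only).
library(caret)

compare_once <- function(seed) {
  set.seed(seed)
  n <- 12000
  A <- sample(0:1, n, replace=TRUE)
  B <- sample(0:1, n, replace=TRUE)
  C <- rnorm(n)
  Y <- ifelse(10 * A + 0.5 * B + 5 * C + rnorm(n)/10 > -5, 0, 1)
  full <- glm(Y ~ ., data=data.frame(Y, A, B, C), family=binomial)
  down <- glm(Class ~ ., data=downSample(data.frame(A, B, C), factor(Y)),
              family=binomial)
  rbind(full=coef(full), down=coef(down))
}

lapply(1:5, compare_once)  # eyeball the slopes; the point estimates stay close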

  • You're estimating the same population parameters when you down-sample, just with less precision. The exception is the intercept, which you can estimate if you know the population prevalence of the response. See Hosmer & Lemeshow (2000), Applied Logistic Regression, Ch. 6.3 for a proof. Down-sampling the majority response can sometimes introduce separation, though not commonly. Commented Aug 20, 2013 at 21:06
  • @Scortchi Post your comment as an answer; this seems sufficient for my question. Thanks for the reference.
    – Zach
    Commented Aug 20, 2013 at 21:43
  • @Scortchi and Zach: According to the down-sampled model (mod2), Pr(>|z|) for A is almost 1. We cannot reject the null hypothesis that the coefficient on A is 0, so we have lost a covariate that is used in mod1. Isn't this a substantial difference?
    – Zhubarb
    Commented Jan 8, 2015 at 11:43
  • @Zhubarb: As I noted, you might introduce separation, making the Wald standard-error estimates completely unreliable. Commented Jan 8, 2015 at 12:44
  • See also Scott (2006).
    – StasK
    Commented Jul 17, 2015 at 13:55

1 Answer

Down-sampling is equivalent to case–control designs in medical statistics—you're fixing the counts of responses & observing the covariate patterns (predictors). Perhaps the key reference is Prentice & Pyke (1979), "Logistic Disease Incidence Models and Case–Control Studies", Biometrika, 66, 3.

They used Bayes' Theorem to rewrite each term in the likelihood for the probability of a given covariate pattern, conditional on being a case or control, as the product of two factors: one representing an ordinary logistic regression (the probability of being a case or control conditional on a covariate pattern), & the other the marginal probability of the covariate pattern. They showed that maximizing the overall likelihood, subject to the constraint that the marginal probabilities of being a case or control are fixed by the sampling scheme, gives the same odds-ratio estimates as maximizing the first factor without the constraint (i.e. carrying out an ordinary logistic regression).
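
In outline (the notation here is mine, not theirs): writing $x$ for a covariate pattern & $y$ for case–control status, Bayes' Theorem gives

$$ \Pr(x \mid y) = \frac{\Pr(y \mid x)\,\Pr(x)}{\Pr(y)},$$

so the retrospective likelihood $\prod_i \Pr(x_i \mid y_i)$ splits into the prospective logistic factor $\prod_i \Pr(y_i \mid x_i)$ times a factor involving only the marginals $\Pr(x_i)$ & $\Pr(y_i)$; with the $\Pr(y_i)$ fixed by the sampling design, maximizing the logistic factor alone reproduces the slope (odds-ratio) estimates.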

The intercept for the population, $\beta_0^*$, can be estimated from the case–control intercept $\hat{\beta}_0$ if the population prevalence $\pi$ is known:

$$ \hat{\beta}_0^* = \hat{\beta}_0 - \log\left( \frac{1-\pi}{\pi}\cdot \frac{n_1}{n_0}\right)$$

where $n_0$ & $n_1$ are the number of controls & cases sampled, respectively.
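
As an illustration with the simulated data from the question (a sketch only; it assumes dat1, dat2, & mod2 are still in the workspace, and the near-separation there makes the actual numbers of little practical interest):

# Recover the population-scale intercept from the down-sampled fit,
# using the known prevalence of Y=1 in the full data (900/12000).
pi_pop <- mean(dat1$Y)          # population prevalence of the positive class
n1 <- sum(dat2$Class == "1")    # cases sampled (900)
n0 <- sum(dat2$Class == "0")    # controls sampled (900)

b0_star <- coef(mod2)["(Intercept)"] - log((1 - pi_pop)/pi_pop * n1/n0)
b0_star  # with n1 == n0 this just subtracts log((1 - pi_pop)/pi_pop), about 2.51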

Of course, by throwing away data you've gone to the trouble of collecting (albeit the least useful part), you're reducing the precision of your estimates. Constraints on computational resources are the only good reason I know of for doing this, but I mention it because some people seem to think that "a balanced data-set" is important for some other reason I've never been able to ascertain.

  • Thanks for the detailed answer. And yes, the reason I'm doing this is that running the full model (with no down-sampling) is computationally prohibitive.
    – Zach
    Commented Aug 30, 2013 at 12:32
  • Dear @Scortchi, thanks for the explanation, but in my case, where I want to use logistic regression, a balanced dataset seems necessary regardless of computational resources. I tried Firth's bias-reduced penalized-likelihood logit, to no avail. So it seems down-sampling is the only alternative for me, right?
    – Shahin
    Commented Sep 22, 2017 at 9:52
  • @Shahin Well, (1) why are you unhappy with a logistic regression fit by maximum likelihood? & (2) what exactly goes wrong using Firth's method? Commented Sep 22, 2017 at 10:55
  • @Shahin: You seem to be barking up the wrong tree there: down-sampling isn't going to improve the discrimination of your model. Bias correction or regularization might (on new data - are you assessing its performance on a test set?), but a more complex specification could perhaps help, or it could simply be that you need more informative predictors. You should probably ask a new question, giving details of the data, the subject-matter context, the model, diagnostics and your aims. Commented Sep 25, 2017 at 10:47
  • @Scortchi-ReinstateMonica Can you provide more information as to why you think balancing a dataset might be useless? Commented Oct 6, 2021 at 19:05
