
If I have a dataset with a very rare positive class, and I down-sample the negative class, then perform a logistic regression, do I need to adjust the regression coefficients to reflect the fact that I changed the prevalence of the positive class?

For example, say I have a dataset with four variables: Y, A, B, and C. Y, A, and B are binary; C is continuous. Y=0 for 11,100 observations and Y=1 for 900:

set.seed(42)
n <- 12000                         # total number of observations
r <- 1/12                          # (defined but not used below)
A <- sample(0:1, n, replace=TRUE)  # binary predictor
B <- sample(0:1, n, replace=TRUE)  # binary predictor
C <- rnorm(n)                      # continuous predictor
Y <- ifelse(10 * A + 0.5 * B + 5 * C + rnorm(n)/10 > -5, 0, 1)  # rare positive class

I fit a logistic regression to predict Y, given A, B and C.

dat1 <- data.frame(Y, A, B, C)
mod1 <- glm(Y ~ ., data=dat1, family=binomial)

However, to save time I could remove 10,200 of the Y=0 observations, leaving 900 with Y=0 and 900 with Y=1:

library(caret)
dat2 <- downSample(data.frame(A, B, C), factor(Y), list=FALSE)
mod2 <- glm(Class ~ ., data=dat2, family=binomial)

The regression coefficients from the 2 models look very similar:

> coef(summary(mod1))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -127.67782  20.619858 -6.191983 5.941186e-10
A           -257.20668  41.650386 -6.175373 6.600728e-10
B            -13.20966   2.231606 -5.919353 3.232109e-09
C           -127.73597  20.630541 -6.191596 5.955818e-10
> coef(summary(mod2))
              Estimate  Std. Error     z value    Pr(>|z|)
(Intercept) -167.90178   59.126511 -2.83970391 0.004515542
A           -246.59975 4059.733845 -0.06074284 0.951564016
B            -16.93093    5.861286 -2.88860377 0.003869563
C           -170.18735   59.516021 -2.85952165 0.004242805

This leads me to believe that the down-sampling did not affect the coefficients. However, this is a single, contrived example, and I'd rather know for sure.
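
To make it slightly less of a one-off, here is a rough sketch (a sanity check only, not a proof) that re-runs the same simulation for a few seeds and compares the two fits; it assumes the caret package is available:

# Re-run the simulation for several seeds and compare the full-data fit
# with the down-sampled fit (illustrative sketch only).
library(caret)

compare_once <- function(seed) {
  set.seed(seed)
  n <- 12000
  A <- sample(0:1, n, replace=TRUE)
  B <- sample(0:1, n, replace=TRUE)
  C <- rnorm(n)
  Y <- ifelse(10 * A + 0.5 * B + 5 * C + rnorm(n)/10 > -5, 0, 1)
  full <- glm(Y ~ ., data=data.frame(Y, A, B, C), family=binomial)
  down <- glm(Class ~ ., data=downSample(data.frame(A, B, C), factor(Y)),
              family=binomial)
  rbind(full=coef(full), down=coef(down))
}

lapply(1:5, compare_once)  # eyeball the slopes; the point estimates stay close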

  • You're estimating the same population parameters when you down-sample, just with less precision. The exception is the intercept, which you can estimate if you know the population prevalence of the response. See Hosmer & Lemeshow (2000), Applied Logistic Regression, Ch. 6.3 for a proof. Down-sampling the majority response can sometimes introduce separation, though not commonly. Commented Aug 20, 2013 at 21:06
  • @Scortchi Post your comment as an answer; this seems sufficient for my question. Thanks for the reference.
    – Zach
    Commented Aug 20, 2013 at 21:43
  • @Scortchi and Zach: According to the down-sampled model (mod2), Pr(>|z|) for A is almost 1. We cannot reject the null hypothesis that the coefficient on A is 0, so we have lost a covariate that is used in mod1. Isn't this a substantial difference?
    – Zhubarb
    Commented Jan 8, 2015 at 11:43
  • @Zhubarb: As I noted, you might introduce separation, making the Wald standard-error estimates completely unreliable. Commented Jan 8, 2015 at 12:44
  • See also Scott (2006).
    – StasK
    Commented Jul 17, 2015 at 13:55

1 Answer

Down-sampling is equivalent to case–control designs in medical statistics—you're fixing the counts of responses & observing the covariate patterns (predictors). Perhaps the key reference is Prentice & Pyke (1979), "Logistic Disease Incidence Models and Case–Control Studies", Biometrika, 66, 3.

They used Bayes' Theorem to rewrite each term in the likelihood for the probability of a given covariate pattern, conditional on being a case or control, as the product of two factors: one representing an ordinary logistic regression (the probability of being a case or control conditional on a covariate pattern), & the other the marginal probability of the covariate pattern. They showed that maximizing the overall likelihood, subject to the constraint that the marginal probabilities of being a case or control are fixed by the sampling scheme, gives the same odds-ratio estimates as maximizing the first factor without the constraint (i.e. carrying out an ordinary logistic regression).
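
In outline (the notation here is mine, not theirs): writing $x$ for a covariate pattern & $y$ for case–control status, Bayes' Theorem gives

$$ \Pr(x \mid y) = \frac{\Pr(y \mid x)\,\Pr(x)}{\Pr(y)},$$

so the retrospective likelihood $\prod_i \Pr(x_i \mid y_i)$ splits into the prospective logistic factor $\prod_i \Pr(y_i \mid x_i)$ times a factor involving only the marginals $\Pr(x_i)$ & $\Pr(y_i)$; with the $\Pr(y_i)$ fixed by the sampling design, maximizing the logistic factor alone reproduces the slope (odds-ratio) estimates.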

The intercept for the population, $\beta_0^*$, can be estimated from the case–control intercept $\hat{\beta}_0$ if the population prevalence $\pi$ is known:

$$ \hat{\beta}_0^* = \hat{\beta}_0 - \log\left( \frac{1-\pi}{\pi}\cdot \frac{n_1}{n_0}\right)$$

where $n_0$ & $n_1$ are the number of controls & cases sampled, respectively.
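
As an illustration with the simulated data from the question (a sketch only; it assumes dat1, dat2, & mod2 are still in the workspace, and the near-separation there makes the actual numbers of little practical interest):

# Recover the population-scale intercept from the down-sampled fit,
# using the known prevalence of Y=1 in the full data (900/12000).
pi_pop <- mean(dat1$Y)          # population prevalence of the positive class
n1 <- sum(dat2$Class == "1")    # cases sampled (900)
n0 <- sum(dat2$Class == "0")    # controls sampled (900)

b0_star <- coef(mod2)["(Intercept)"] - log((1 - pi_pop)/pi_pop * n1/n0)
b0_star  # with n1 == n0 this just subtracts log((1 - pi_pop)/pi_pop), about 2.51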

Of course, by throwing away data you've gone to the trouble of collecting (albeit the least useful part), you're reducing the precision of your estimates. Constraints on computational resources are the only good reason I know of for doing this, but I mention it because some people seem to think that "a balanced data-set" is important for some other reason I've never been able to ascertain.

  • Thanks for the detailed answer. And yes, the reason I'm doing this is that running the full model (with no down-sampling) is computationally prohibitive.
    – Zach
    Commented Aug 30, 2013 at 12:32
  • Dear @Scortchi, thanks for the explanation, but in my case, where I want to use logistic regression, a balanced dataset seems necessary regardless of computational resources. I tried Firth's bias-reduced penalized-likelihood logit, to no avail. So it seems down-sampling is the only alternative for me, right?
    – Shahin
    Commented Sep 22, 2017 at 9:52
  • @Shahin Well, (1) why are you unhappy with a logistic regression fit by maximum likelihood? & (2) what exactly goes wrong using Firth's method? Commented Sep 22, 2017 at 10:55
  • @Shahin: You seem to be barking up the wrong tree there: down-sampling isn't going to improve the discrimination of your model. Bias correction or regularization might (on new data - are you assessing its performance on a test set?), but a more complex specification could perhaps help, or it could simply be that you need more informative predictors. You should probably ask a new question, giving details of the data, the subject-matter context, the model, diagnostics and your aims. Commented Sep 25, 2017 at 10:47
  • @Scortchi-ReinstateMonica Can you provide more information as to why you think balancing a dataset might be useless? Commented Oct 6, 2021 at 19:05
