7
$\begingroup$

Sorry for the confusing title, I think this is a general statistics question, but I'm working in R. I have a combined dataset of two samples from different countries (n=240 and n=1,010). When I run a linear regression on the same three variables in each sample separately, both produce a significant result with almost identical coefficients. However, when I merge the datasets and run the same regression on the combined data, the result is no longer significant. Can anyone explain this?

In case it matters, the regression has the form lm(a~b*c).

$\endgroup$
8
  • 5
    $\begingroup$ If the two regressions have significant but different slopes (very different and possibly of different sign) there is no reason to think that combining the data into a single regression will give a significant slope. $\endgroup$ Commented Aug 18, 2017 at 23:42
  • $\begingroup$ Like I said, the coefficients are almost the same. Thank you for your comment though, and I'm curious to hear if you have any advice on how to proceed in trying to solve this problem! $\endgroup$
    – Benji
    Commented Aug 18, 2017 at 23:49
  • $\begingroup$ Sorry, I missed the part where you said that you had almost identical coefficients. But what exactly did you mean by almost identical? What were the significance levels for each? $\endgroup$ Commented Aug 19, 2017 at 0:06
  • $\begingroup$ Also what was the significance level for the combined regression? $\endgroup$ Commented Aug 19, 2017 at 0:08
  • 3
    $\begingroup$ What about the intercepts? $\endgroup$ Commented Aug 19, 2017 at 0:37

4 Answers

25
$\begingroup$

Without seeing your data, this is difficult to answer definitively. One possibility is that your datasets span different ranges of the independent variable. It is well-known that combining data across different groups can sometimes reverse correlations seen in each group individually. This effect is known as Simpson's Paradox.

$\endgroup$
8
  • $\begingroup$ Wow, that's really interesting, I had never heard of Simpson's Paradox! I wonder if you could give me some advice about how to proceed in trying to answer my research question, which is to see whether variable c moderates variable b 's effect on variable a. I'm puzzled about how to address something like this, because it seems that if I say c moderates b, I'm correct in each country individually but incorrect in general! I guess that's the paradox, but I'm still stumped. $\endgroup$
    – Benji
    Commented Aug 18, 2017 at 23:42
  • $\begingroup$ Assuming you are dealing with Simpson's paradox here (something that we haven't fully established!), I think there are two key questions. First, do your two datasets correspond to different levels of a meaningful grouping factor? Second, if so, does the variation introduced by this factor represent nuisance variation that you want to control for (as opposed to interesting variation that you want to study)? If you answered yes to both questions, then you might consider estimating a fixed effect of group (continued) $\endgroup$ Commented Aug 19, 2017 at 1:15
  • $\begingroup$ (continued) which might allow the model to draw parallel lines (with the slope of interest) through each of your two groups, while dealing with the between-group variation by giving the two lines different intercepts. But I emphasize that these are decisions that can only be made with a full conceptual/theoretical understanding of the problem that your analysis is supposed to answer. $\endgroup$ Commented Aug 19, 2017 at 1:17
  • 1
    $\begingroup$ +1 @Benji - Imagine two x-y scatter plots of data points with the same slope, but one scatter plot is shifted to the right of the other on the x-axis, such that the best fitting regression line for both scatter plots is essentially flat. $\endgroup$
    – RobertF
    Commented Aug 19, 2017 at 1:18
  • 1
    $\begingroup$ @BenjiKaveladze Yes (see my answer below), but you'll want to verify this is the case by, for example, removing the country with the quadratic relationship from the dataset and seeing whether the observed regression coefficients still change. At any rate, this illustrates how linear regressions can fail to detect nonlinear relationships which other techniques (like boosted regression, neural networks, decision trees, etc.) can model more effectively. $\endgroup$
    – RobertF
    Commented Aug 19, 2017 at 14:03
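
To make the comments above concrete, here is a small R sketch (with simulated data and hypothetical variable names, not the asker's data) of the fixed-effect-of-group approach: the pooled slope is misleading, but adding the grouping factor recovers the common within-group slope.

```r
# Two groups share a within-group slope of about 1, but group B is
# shifted down and to the right, so the pooled slope is near zero.
exdat <- data.frame(
  x     = c(1:10, 11:20),
  group = rep(c("A", "B"), each = 10))
exdat$y <- exdat$x + ifelse(exdat$group == "B", -15, 0) +
  rep(c(-0.1, 0.1), 10)   # small deterministic wiggle

pooled   <- lm(y ~ x,         data = exdat)  # grouping ignored
adjusted <- lm(y ~ x + group, data = exdat)  # parallel lines, one intercept per group

coef(pooled)["x"]    # slightly negative: the grouping masks the trend
coef(adjusted)["x"]  # close to the true within-group slope of 1
```

Whether the group effect should be treated as a nuisance (a fixed intercept shift) or as something of substantive interest is, as the comments stress, a conceptual decision rather than a statistical one.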
17
$\begingroup$

If your data looks something like this, then the reason may be more obvious. Your two original regression lines would be almost parallel and look reasonably plausible, but combined they produce a different result which is probably not very helpful.

[figure: scatterplot of the blue and red groups with their separate regression lines and the nearly flat combined regression line]

The chart was produced with this R code:

exdf <- data.frame(
  x   = c(-64:-59, -52:-47),
  y   = c(-8.29, -8.36, -9.05, -9.30, -9.20, -9.69,
          -7.90, -8.34, -8.49, -8.85, -9.38, -9.65),
  col = c(rep("blue", 6), rep("red", 6)))

# Fit each group separately, then fit the pooled data
fitblue  <- lm(y ~ x, data = exdf[exdf$col == "blue", ])
fitred   <- lm(y ~ x, data = exdf[exdf$col == "red" , ])
fitcombo <- lm(y ~ x, data = exdf)

# Plot the points with all three fitted lines
plot(y ~ x, data = exdf, col = col)
abline(fitblue , col = "blue")
abline(fitred  , col = "red" )
abline(fitcombo, col = "black")

which reports

> summary(fitblue)

Call:
lm(formula = y ~ x, data = exdf[exdf$col == "blue", ])

Residuals:
       1        2        3        4        5        6 
-0.00619  0.20295 -0.20790 -0.17876  0.20038 -0.01048 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -26.14895    2.91063  -8.984  0.00085 ***
x            -0.27914    0.04731  -5.900  0.00413 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1979 on 4 degrees of freedom
Multiple R-squared:  0.8969,    Adjusted R-squared:  0.8712 
F-statistic: 34.81 on 1 and 4 DF,  p-value: 0.004128

> summary(fitred)

Call:
lm(formula = y ~ x, data = exdf[exdf$col == "red", ])

Residuals:
        7         8         9        10        11        12 
-0.005238 -0.095810  0.103619  0.093048 -0.087524 -0.008095 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -26.06505    1.12832  -23.10 2.08e-05 ***
x            -0.34943    0.02278  -15.34 0.000105 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0953 on 4 degrees of freedom
Multiple R-squared:  0.9833,    Adjusted R-squared:  0.9791 
F-statistic: 235.3 on 1 and 4 DF,  p-value: 0.0001054

> summary(fitcombo)

Call:
lm(formula = y ~ x, data = exdf)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8399 -0.4548 -0.0750  0.4774  0.9999 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.269561   1.594455  -5.814  0.00017 ***
x           -0.007109   0.028549  -0.249  0.80839    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.617 on 10 degrees of freedom
Multiple R-squared:  0.006163,  Adjusted R-squared:  -0.09322 
F-statistic: 0.06201 on 1 and 10 DF,  p-value: 0.8084

These results are not too far from your statistics, and with further work the example could be made closer still.

$\endgroup$
1
4
$\begingroup$

It's also possible that the data points in each dataset have completely different distributions, due to outliers and/or nonlinear relationships between $x$ and $y$, and yet still share nearly identical linear regression coefficients, standard errors, and statistically significant $p$-values. Combining the two datasets could then create a dataset that no longer has a strong linear relationship. See Anscombe's Quartet. A visual representation of numerous datasets sharing the same summary statistics but radically different scatterplots can be found here. My recommendation would be to closely examine the scatterplots of both datasets.
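
Base R actually ships the `anscombe` data, so this is easy to verify for yourself; a quick sketch:

```r
# Anscombe's quartet: four x-y pairs with near-identical regression
# summaries but radically different scatterplots.
fits <- lapply(1:4, function(i)
  lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe))

t(sapply(fits, coef))  # every fit: intercept near 3, slope near 0.5

# ...yet the four scatterplots tell four different stories:
op <- par(mfrow = c(2, 2))
for (i in 1:4)
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
par(op)
```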

$\endgroup$
4
  • 1
    $\begingroup$ In addition to examining the scatterplots, I would try repeating the regression using the country as an additional variable (a ~ b*c*country). This way you will see whether some coefficients change significantly between countries. $\endgroup$
    – Pere
    Commented Aug 19, 2017 at 8:47
  • $\begingroup$ @Pere When I include country in the model (a ~ b*c*country), the result is that the b*c interaction term becomes significantly related to a (b=-0.35, p<0.001). Can I interpret that as evidence that b*c predicts a? It just seems weird to me that b*c only predicts a when I introduce the country variable into the equation. Thanks! $\endgroup$
    – Benji
    Commented Aug 19, 2017 at 22:40
  • $\begingroup$ @BenjiKaveladze I'm not sure I understand your comment. I suggest posting the whole summary, maybe in another question. However, a significant b*c interaction means that you can take b*c into account to get better predictions of a, which is equivalent to saying that for different values of c you get a different relation between a and b. $\endgroup$
    – Pere
    Commented Aug 20, 2017 at 9:19
  • $\begingroup$ @Pere Maybe it will help if I write the actual question, which is "do stressors moderate the link between social support and depression?" So the model is (depression ~ support*stress). My question is whether I should instead use the model (depression ~ support*stress*country). This is tricky for me because only when I use the second model does the support*stress interaction become significant. When country is not in the model, the support*stress interaction is not significant. Does that make sense? $\endgroup$
    – Benji
    Commented Aug 21, 2017 at 18:42
1
$\begingroup$

For more on Simpson's Paradox, see the chapter "Paradoxes Galore!" in Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect (Kindle ed., locations 2843-3283). New York: Basic Books. Also see Pearl's Causality.

In his book, Pearl gives an example very similar to yours. The problem is that there is a confounding variable that affects both the independent variable(s) and the dependent variable. In Pearl's example, the question is: why is an anti-heart-attack drug bad for women, bad for men, but good for people (when the two gender samples are combined)? The answer is that gender is a confounding variable that affects both who takes the drug (women are far more likely to) and the prevalence of heart attacks (men are far more likely to have them). The solution to confounding variables is to condition on them. This can be done in two ways: (1) using regression analysis, include gender as a variable; (2) analyze the average effect of the drug for the two genders separately, then compute the weighted average of the effects, weighted by each gender's share of the population (here 1/2 each).
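
Pearl's second option (stratify, then average) amounts to simple arithmetic once the per-group effects are in hand. A sketch in R with made-up numbers (the effects and shares below are purely illustrative, not Pearl's actual figures):

```r
# Hypothetical effect of the drug on recovery rate within each gender
# (invented numbers for illustration only)
effect_women <- -0.02
effect_men   <- -0.03

# Population shares used as weights (1/2 each, as in Pearl's example)
p_women <- 0.5
p_men   <- 0.5

# Gender-adjusted average effect: weighted average over the strata
adjusted_effect <- p_women * effect_women + p_men * effect_men
adjusted_effect  # -0.025: harmful once gender is conditioned on
```

The same conditioning is what option (1) does implicitly when gender enters the regression as a covariate.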

Pearl would say that you have to have a model of the phenomenon you are studying, i.e., an exhaustive theory that takes into account all the variables involved in the response. Developing such a model and theory can take months of reading to understand the work of others in the field. However, recall that a single left-out variable can bias the results and make them meaningless or just plain wrong.

Pearl would also write that you cannot extract causality from data; for that you need a theoretical model. However, once you have a theory and a model, you can use data to support them.

$\endgroup$
