
I am trying to understand Lord's paradox, where controlling for baseline status can affect inference. I tried to set up some data following the quotation in Wikipedia

“A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex differences in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded.” In both September and June, the distribution of male weights is the same: the males' weights have the same mean and variance, and likewise for the distribution of female weights. Lord posits two statisticians who use different but respected statistical methods to reach opposite conclusions about the effects of the diet provided in the university dining halls on students' weights.

So I gave each student a base weight, influenced by gender but with enough spread that the two groups overlapped and the overall distribution was unimodal. I then supposed that the initial weighings were the base weights plus some noise, and similarly the final weighings, so that the change was pure noise, the noise had the same distribution for each individual, and the initial and final weights had the same distributions. An attempt to analyse changes by group produced small, non-significant results.

But a regression of final weights against group and initial weights did produce apparently significant and substantial results for the group indicator. This, I believe, is what happens in Lord's paradox, though I am not certain whether it is supposed to apply in this situation.

I can give a hand-waving explanation of what happened: the second analysis in effect drew a regression line through each group's mean (which did not change between the initial and final weighings), but because of the noise the correlations were not perfect, so the slopes were inevitably shallower than the diagonal, meaning the two regression lines inevitably had different intercepts: a regression-to-the-mean effect. Here is a chart illustrating the point:

[Chart: final weights against initial weights by group, with each group's fitted regression line shallower than the diagonal]

But playing further, it seems to be the noise in the initial weights that causes this: a regression of final weights against group and base weights does not produce the effect, while a regression of base weights against group and initial weights does, with almost the same impact on the group coefficient. (In these last attempts, the base weights have lower variance than the initial and final weights, so the hand-waving argument about needing perfect correlation for a diagonal line does not apply so directly.) So it seems to be regression to the mean from the initial weights which produces the apparently paradoxical effect.

Simulating the data (weights in kilograms, with females in group 0 and males in group 1):

set.seed(2021)
N       <- 2000
group   <- rep(c(0, 1), times = N/2)               # 0 = female, 1 = male
base    <- rnorm(N, mean = 70 + 20*group, sd = 10) # underlying weight
initial <- base + rnorm(N, mean = 0, sd = 5)       # September: base plus noise
final   <- base + rnorm(N, mean = 0, sd = 5)       # June: base plus independent noise
change  <- final - initial                         # pure noise by construction
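
The construction is meant to reproduce Lord's setup, in which each group's weight distribution is the same in September and June. A quick check that the simulated initial and final weights match within each group (this assumes the variables above are in the workspace):

```r
# Within each group, initial and final are both base + independent N(0, 5^2)
# noise, so they share the same distribution: mean 70 + 20*group,
# variance 10^2 + 5^2 = 125.
for (g in 0:1) {
  cat(sprintf("group %d: initial mean %.2f sd %.2f | final mean %.2f sd %.2f\n",
              g,
              mean(initial[group == g]), sd(initial[group == g]),
              mean(final[group == g]),   sd(final[group == g])))
}
```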

Comparing change with group revealed nothing significant (as expected):

> summary(lm(change ~ group))  

Call:
lm(formula = change ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.092  -4.604   0.226   4.826  26.858 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03355    0.22250  -0.151    0.880
group       -0.04256    0.31466  -0.135    0.892

Residual standard error: 7.036 on 1998 degrees of freedom
Multiple R-squared:  9.157e-06, Adjusted R-squared:  -0.0004913 
F-statistic: 0.0183 on 1 and 1998 DF,  p-value: 0.8924

while regressing final against group and initial does produce a significant coefficient for group:

> summary(lm(final ~ group + initial))  

Call:
lm(formula = final ~ group + initial)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.308  -4.237   0.023   4.471  23.666 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.46634    0.94881  14.193   <2e-16 ***
group        3.73929    0.39579   9.448   <2e-16 ***
initial      0.80855    0.01312  61.643   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.69 on 1997 degrees of freedom
Multiple R-squared:  0.803,     Adjusted R-squared:  0.8028 
F-statistic:  4070 on 2 and 1997 DF,  p-value: < 2.2e-16
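
The size of the apparently significant group effect can be predicted from the regression-to-the-mean argument. Within each group, cov(final, initial) = var(base), while var(initial) = var(base) + var(noise), so the slope on initial is attenuated to 10^2 / (10^2 + 5^2) = 0.8, and the group coefficient absorbs the unexplained share of the 20 kg gap in group means, roughly 20 * (1 - 0.8) = 4 kg. A sketch of the check, assuming the simulated variables above:

```r
reliability    <- 10^2 / (10^2 + 5^2)     # var(base) / var(initial) = 0.8
expected_group <- 20 * (1 - reliability)  # attenuated share of the 20 kg gap = 4

fit <- lm(final ~ group + initial)
round(c(theory_initial = reliability, fitted_initial = coef(fit)[["initial"]],
        theory_group = expected_group,  fitted_group  = coef(fit)[["group"]]), 3)
```

These theoretical values (0.8 and 4) are close to the fitted 0.80855 and 3.73929 above.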

and playing with some other regressions suggests it is the presence of initial as a regressor that causes this:

> summary(lm(final ~ group + base))

Call:
lm(formula = final ~ group + base)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.6212  -3.2292  -0.1426   3.3610  16.8688 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.47777    0.78703  -0.607    0.544    
group        0.06003    0.30853   0.195    0.846    
base         1.00665    0.01094  92.021   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.979 on 1997 degrees of freedom
Multiple R-squared:  0.8909,    Adjusted R-squared:  0.8908 
F-statistic:  8152 on 2 and 1997 DF,  p-value: < 2.2e-16

while

> summary(lm(base ~ group + initial))

Call:
lm(formula = base ~ group + initial)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.3176  -2.8154   0.1182   3.1505  14.0104 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.250421   0.648053   21.99   <2e-16 ***
group        3.766569   0.270328   13.93   <2e-16 ***
initial      0.797561   0.008959   89.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.569 on 1997 degrees of freedom
Multiple R-squared:  0.8952,    Adjusted R-squared:  0.8951 
F-statistic:  8526 on 2 and 1997 DF,  p-value: < 2.2e-16
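
The near-identical group coefficients in the last two regressions are no coincidence: the attenuation comes from noise in the regressor, not the response. Since cov(base, initial) = var(base) as well, regressing base on initial is attenuated by the same factor var(base)/var(initial) = 0.8, so the group coefficient again picks up roughly 20 * (1 - 0.8) = 4 kg. A side-by-side comparison, assuming the simulated variables above:

```r
# Both responses (base and final) have the same covariance with initial,
# namely var(base), so the two models share the same theoretical slopes.
fit_final <- lm(final ~ group + initial)
fit_base  <- lm(base  ~ group + initial)
round(cbind(final_model = coef(fit_final), base_model = coef(fit_base)), 3)
```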
  • Could you make explicit what you assume to be paradoxical about the problem? As some authors (e.g. Pearl) have stated, the two statisticians are not so much disagreeing as answering different questions. Hence, I think it would help in answering you if you made explicit what you take the specific effect of interest to be here.
    – Kuku
    Commented May 4, 2021 at 23:08
  • @Kuku - I am trying to understand what Lord's paradox involves: I recognise that it is the two different conclusions on the group effect. While trying to read about it and then reproduce it (with the initial distributions matching the final distributions), all I could find was a regression-to-the-mean effect, with the apparently group-dependent result being a consequence of regression to the mean and the two groups having different means. My question is whether this is indeed what causes Lord's paradox and whether there is anything more to it than that.
    – Henry
    Commented May 5, 2021 at 0:00
  • 5 kg of measurement noise appears excessive: typical values for diurnal variation are about 1 kg (mostly variation in food, water and waste), and if not controlled for at weighing, clothes may vary systematically between September and June, but not likely by more than 1 kg. Typical research-instrument error will be in fractions of a gram. The link specifies that the scenario was specifically chosen for low measurement variation.
    – ReneBt
    Commented May 8, 2021 at 7:49
  • @ReneBt the model is of weights measured 9 months apart, after a switch from home food and exercise to college food and exercise, with the same distribution of weights before and after. The scale may be wrong, but in a sense that is not important, as smaller changes would lead to a similar conclusion on a smaller scale, and I chose my scale to have something visible on the charts. The difficult part of my model was ensuring the same distribution before and after, rather than an increased variance after, and it was this that led to my question about regression to the mean.
    – Henry
    Commented May 8, 2021 at 11:09
  • I think this finding is inevitable whenever you use an ANCOVA model where randomisation was not applied (or was not successful). This is why ANCOVA is standard practice in clinical trials but very dubious in observational studies. Commented Jun 26, 2021 at 20:32
