Can I include a variable related to the outcome variable in a statistical analysis?

Question

My research question is about contact patterns during the pandemic, and about the characteristics of people who had more contacts during the national lockdown.

The outcome variable is the number of contacts during the lockdown (the third wave) minus the number of contacts before the lockdown (the first wave).

For example, suppose a participant had contact with 24 people during the first wave, 10 people during the second wave, and 45 people during the third wave. For this person, the outcome variable is gap = 45 - 24 = 21.

The independent variables include age, gender, education, employment status, marital status, transportation method, and city.

There is another variable I want to include: the standard deviation of the number of contacts across the three waves. For this person, that is:

sd(c(24, 10, 45))
# [1] 17.61628

Then I built a generalized additive mixed model (GAMM):

library("mgcv")  # assumed source of gamm(), given the random = list(...) argument
gamm(gap_adjusted ~ age_cat2 + Sex + Education2 + Employment2 + region + hh_cat3 + Marital2 + Country2 + Transport2 + Weekend + IIV1,
     random = list(city = ~ 1), data = part_235_gap, family = Gamma(link = "log"))
# gap ranges from -10 to 10, so gap_adjusted = gap + 11 makes the outcome positive
# IIV1 is the standard deviation of the number of contacts across the three waves

The results show a statistically significant association between gap_adjusted and IIV1. Is this a proper method?

@FransRodenburg Sorry, I am not familiar with the GLS model and have only read about it on the wiki. I will try to use it. — Chao, Commented Jul 8 at 7:55
Ok no problem, but I will delete my earlier comment as it is no longer relevant. — Frans Rodenburg, Commented Jul 8 at 12:33

Frans Rodenburg · Accepted Answer · 2024-07-08 20:44:36Z

The Problem

My interpretation of the problem is that you want to know how the difference in the number of contacts depends on the demographic variables you listed.

What you have tried now is to take the difference and then model that as an outcome. This is not necessarily wrong, but it is inefficient, because you are reducing what used to be two measurements to a single number. Moreover, the current approach ignores the discrete nature of the outcome. Finally, the current approach has no way to incorporate the number of contacts during the second wave. It would be nice if we could somehow include all observations in a single model.

A Solution Using the Number of Contacts Directly

Instead of taking the difference, use the number of contacts directly as the outcome and include a new variable that indicates whether this observation was from the first, second, or third wave. You can then include an interaction between this variable and the demographic variables, to see if the difference in contacts between waves depends on those. This can be done with a generalized linear mixed model (GLMM).
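If your data currently has one row per participant and one column per wave, it would first need to be reshaped to one row per participant per wave. Here is a minimal sketch using base R's reshape(). The column names (contacts_w1, contacts_w2, contacts_w3) are hypothetical, and the values are made up apart from the 24, 10, 45 example from the question:

```r
# Hypothetical wide data: one row per participant, one column per wave
wide <- data.frame(
  ID  = factor(1:3),
  age = c(34, 58, 71),
  contacts_w1 = c(24, 12, 8),
  contacts_w2 = c(10, 7, 3),
  contacts_w3 = c(45, 9, 5)
)

# Stack the three count columns into a single 'contacts' column,
# with 'wave' recording which column each value came from
long <- reshape(
  wide,
  direction = "long",
  varying   = c("contacts_w1", "contacts_w2", "contacts_w3"),
  v.names   = "contacts",
  timevar   = "wave",
  times     = 1:3,
  idvar     = "ID"
)
long$wave <- factor(long$wave)                    # treat wave as categorical
long <- long[order(long$ID, long$wave), ]         # sort by participant, then wave
rownames(long) <- NULL
long
```

The resulting data frame has the same layout as the simulated DF used in this answer: one contacts value per participant per wave, with the demographic variables repeated on each row.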

I will use a simplified version of your problem to illustrate, with some simulated data (included at the bottom of this post):

library("lmerTest")
GLMM <- glmer(contacts ~ wave * I(age/100) + (1 | ID), family = "poisson", data = DF)
summary(GLMM)$coefficients

# Fixed effects:
#                    Estimate Std. Error    z value     Pr(>|z|)
# (Intercept)       2.8661864  0.2826758  10.139483 3.690461e-24
# wave2             0.9512341  0.1641407   5.795237 6.822456e-09
# wave3             2.7958852  0.1464731  19.088050 3.173992e-81
# I(age/100)       -4.0236674  0.5126397  -7.848919 4.196371e-15
# wave2:I(age/100) -2.7631425  0.4494294  -6.148113 7.841032e-10
# wave3:I(age/100) -5.4290274  0.4254496 -12.760681 2.718128e-37

(For brevity, I only show the fixed effects here.)

Keep in mind that this is simulated data, but in this example you can see that:

During the first wave, someone with a theoretical age of $0$ had on average $e^{2.8661864} \approx 18$ contacts;
This increased during the second wave to $e^{2.8661864 + 0.9512341} \approx 45$ contacts;
...and further to $e^{2.8661864 + 2.7958852} \approx 288$ contacts during the third wave;
However, older individuals had fewer contacts during the first wave (-4.0237 per 100 years on the scale of the linear predictor, i.e., the log scale);
This reduction is even stronger during the second wave (add an additional -2.7631 per 100 years on the log scale);
...and stronger still during the third wave (add an additional -5.4290 per 100 years on the log scale, relative to the first wave).
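To make the arithmetic above concrete, here is a small sketch that turns the fixed-effect estimates into expected counts for any age and wave. The estimates are copied by hand from the output, expected_contacts() is a hypothetical helper written for this illustration, and the result is for a participant whose random intercept is 0:

```r
# Fixed-effect estimates copied from the summary output above
b <- c(
  intercept = 2.8661864,
  wave2     = 0.9512341,
  wave3     = 2.7958852,
  age       = -4.0236674,
  wave2_age = -2.7631425,
  wave3_age = -5.4290274
)

# Hypothetical helper: expected contacts for a participant with random intercept 0
expected_contacts <- function(age, wave) {
  a   <- age / 100                                # the model uses I(age/100)
  eta <- b[["intercept"]] + b[["age"]] * a        # wave 1 is the reference level
  if (wave == 2) eta <- eta + b[["wave2"]] + b[["wave2_age"]] * a
  if (wave == 3) eta <- eta + b[["wave3"]] + b[["wave3_age"]] * a
  exp(eta)                                        # invert the log link
}

# Theoretical age of 0 in waves 1 to 3 (compare with 18, 45 and 288 above)
round(sapply(1:3, function(w) expected_contacts(0, w)))

# A 50-year-old in waves 1 to 3
sapply(1:3, function(w) expected_contacts(50, w))
```

With a fitted model at hand, predict(GLMM, newdata = ..., re.form = NA, type = "response") should give the same kind of population-level prediction without copying numbers by hand.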

You can include other demographic variables interacting with wave in the same manner (e.g., wave * age + wave * sex). Interpreting the whole thing becomes quite messy quite quickly, but with the help of some packages you can make it a lot easier:

Comparing the Effect of a Demographic Variable Between Waves

You can perform ANOVA-style comparisons with the emmeans package:

# Compare the slopes of age per wave ANOVA-style
library("emmeans")
EMT <- emtrends(GLMM, pairwise ~ wave, "age")
EMT$contrasts
#  contrast      estimate      SE  df z.ratio p.value
#  wave1 - wave2   0.0276 0.00449 Inf   6.148  <.0001
#  wave1 - wave3   0.0543 0.00425 Inf  12.761  <.0001
#  wave2 - wave3   0.0267 0.00473 Inf   5.630  <.0001
# 
# P value adjustment: tukey method for comparing a family of 3 estimates
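These contrasts are differences between the per-year age slopes of the waves on the log scale, so they can be cross-checked against the interaction coefficients in the summary output, which are expressed per 100 years. A small sanity check, with the estimates copied by hand:

```r
# Interaction estimates from the GLMM summary (per 100 years of age)
wave2_age <- -2.7631425
wave3_age <- -5.4290274

# Differences in age slope per year; wave 1 is the reference level,
# so its additional slope is 0
contrasts_per_year <- c(
  "wave1 - wave2" = (0 - wave2_age) / 100,
  "wave1 - wave3" = (0 - wave3_age) / 100,
  "wave2 - wave3" = (wave2_age - wave3_age) / 100
)
round(contrasts_per_year, 4)
```

The rounded values should agree with the estimate column of the emtrends output. What emmeans adds on top of this arithmetic is the standard error for wave2 - wave3, which is not directly available from the summary table, and the Tukey adjustment of the p-values.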

Visualizing the Estimated Differences

You can visualize the estimates as follows:

library("sjPlot")
plot_model(GLMM, type = "pred", terms = c("age", "wave"))

Simulated Data

If you want to run the example yourself, here is the code used to generate the data.

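set.seed(2024)
n <- 100  # number of participants
t <- 3    # number of waves
wave <- factor(rep(1:t, n))                      # wave indicator, cycling 1, 2, 3 per participant
ID   <- factor(rep(1:n, each = t))               # participant identifier
age  <- rep(round(runif(n, 18, 100)), each = t)  # one age per participant, repeated for each wave
X <- model.matrix(~ wave * scale(age))           # fixed-effects design matrix
Z <- model.matrix(~ ID)                          # participant-level design matrix
beta    <- rnorm(ncol(X))                        # simulated fixed effects
upsilon <- rnorm(ncol(Z))                        # simulated participant effects
eta <- X %*% beta + Z %*% upsilon                # linear predictor on the log scale
y   <- rpois(n * t, exp(eta))                    # Poisson-distributed contact counts
DF <- data.frame(
  contacts = y, age, wave, ID
)
rm(n, t, wave, ID, age, X, Z, beta, upsilon, eta, y)  # keep only DF in the workspace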

Sorry, I still have some questions about the interpretation. Is it OK if I send you an email? Thank you. — Chao, Commented Jul 9 at 12:03
Sorry, I will put it here: I am looking for variables that have a statistically significant effect on the number of contacts in the fup2 wave compared to the base wave, so the interaction terms with a p-value less than 0.05 should be considered. For example, if the p-value of the interaction term (Wavefup2:I(hhsize)) is less than 0.05, then I can say that during the second wave, participants with a bigger household size tend to have more contacts compared to the base wave. — Chao, Commented Jul 9 at 13:55
@Chao, sorry for the late reply, but yes that is correct. The interaction term is what tells you to what extent the difference in waves depends on the demographic variable, and its $p$-value can be used as a significance test for that. — Frans Rodenburg, Commented Jul 12 at 7:12
Thank you so much. I will finish your online book. Your explanations help me a lot. Nice weekend! — Chao, Commented Jul 12 at 10:52
