The Problem
My interpretation of the problem is that you want to know how the difference in the number of contacts depends on the demographic variables you listed.
What you have tried now is to take the difference and then model that as an outcome. This is not necessarily wrong, but it is inefficient, because you are reducing what used to be two measurements to a single number. Moreover, the current approach ignores the discrete nature of the outcome. Finally, the current approach has no way to incorporate the number of contacts during the second wave. It would be nice if we can somehow include all observations in a single model.
A Solution Using the Number of Contacts Directly
Instead of taking the difference, use the number of contacts directly as the outcome and include a new variable that indicates whether this observation was from the first, second, or third wave. You can then include an interaction between this variable and the demographic variables, to see if the difference in contacts between waves depends on those. This can be done with a generalized linear mixed model (GLMM).
I will use a simplified version of your problem to illustrate, with some simulated data (included at the bottom of this post):
library("lmerTest")
GLMM <- glmer(contacts ~ wave * I(age/100) + (1 | ID), family = "poisson", data = DF)
summary(GLMM)$coefficients
# Fixed effects:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.8661864 0.2826758 10.139483 3.690461e-24
# wave2 0.9512341 0.1641407 5.795237 6.822456e-09
# wave3 2.7958852 0.1464731 19.088050 3.173992e-81
# I(age/100) -4.0236674 0.5126397 -7.848919 4.196371e-15
# wave2:I(age/100) -2.7631425 0.4494294 -6.148113 7.841032e-10
# wave3:I(age/100) -5.4290274 0.4254496 -12.760681 2.718128e-37
(For brevity, I only show the fixed effects here.)
Reminder, this is fake data, but in the example here, you can see that:
- During the first wave, someone with a theoretical age of $0$ had on average $e^{2.8661864} \approx 18$ contacts;
- This increased during the second wave to $e^{2.8661864 + 0.9512341} \approx 45$ contacts;
- ...and further to $e^{2.8661864 + 2.7958852} \approx 288$ during the third wave
- However, older individuals had fewer contacts during the first wave (
-4.0237
on the linear scale);
- This is even further reduced during the second wave (add an additional
-2.7631
per 100 years on the linear scale);
- ...and even further during the third wave (add an additional
-5.4290
per 100 years on the linear scale).
You can include other demographic variables interacting with wave
in the same manner (e.g., wave * age + wave * sex
). Interpreting the whole thing becomes quite messy quite quickly, but with the help of some packages you can make it a lot easier:
Comparing the Effect of a Demographic Variable Between Waves
You can perform ANOVA-style comparisons with the emmeans
package:
# Compare the slopes of age per wave ANOVA-style
library("emmeans")
EMT <- emtrends(GLMM, pairwise ~ wave, "age")
EMT$contrasts
# contrast estimate SE df z.ratio p.value
# wave1 - wave2 0.0276 0.00449 Inf 6.148 <.0001
# wave1 - wave3 0.0543 0.00425 Inf 12.761 <.0001
# wave2 - wave3 0.0267 0.00473 Inf 5.630 <.0001
#
# P value adjustment: tukey method for comparing a family of 3 estimates
Visualizing the Estimated Differences
You can visualize the estimates as follows:
library("sjPlot")
plot_model(GLMM, type = "pred", terms = c("age", "wave"))
![CV_contacts_per_wave](https://cdn.statically.io/img/i.sstatic.net/kZ5gkHob.png)
Simulated Data
If you want to run the example yourself, here is the code used to generate the data.
set.seed(2024)
n <- 100
t <- 3
wave <- factor(rep(1:t, n))
ID <- factor(rep(1:n, each = 3))
age <- rep(round(runif(n, 18, 100)), each = 3)
X <- model.matrix(~ wave * scale(age))
Z <- model.matrix(~ ID)
beta <- rnorm(ncol(X))
upsilon <- rnorm(ncol(Z))
eta <- X %*% beta + Z %*% upsilon
y <- rpois(n * t, exp(eta))
DF <- data.frame(
contacts = y, age, wave, ID
)
rm(n, t, wave, ID, age, X, Z, beta, upsilon, eta, y)