R lmer, 3 time point longitudinal data, non linear, messy residual help!

Question

I am new to mixed models so please go easy on me!

I am attempting to longitudinally model a normally distributed continuous outcome with values in my dataset from -30-10. I have data collected at 1, 4 and 12 months post event on 256 individuals. This is a spaghetti plot of the data by age category.

spaghetti plot.png

The mean of the outcome is 1.56 at baseline (pre event), -4.57, -0.24 and -0.42 at 1, 4 and 12 months post event respectively. Both based on the plot and logic (the outcome can't just become less negative linearly throughout time after the event forever onward) a linear relationship between time and this outcome doesn't seem to fit, however I am limited by only 3 post event time points AND an outcome with negative values.

I built a linear mixed model with a fixed time slope, fixed quadratic time slope and random intercept and random linear time including a significant covariance between intercept and slope(random quadratic time was not significant). I tested my time invariant predictors one at a time in the model and included those variables that resulted in a pseudo R^2 > 5% in the final model (the goal of model building was a) test the hypothesis that age category was related to the outcome over time and b) determine if any other variables collected help predict the outcome). The above selection criteria left only age category as a significant predictor of the outcome in the final model. Below there is a table of my model building process.

The parameter estimates make sense given the distribution of the data. I think a piecewise discontinuous model or exponential model for time would fit the data better, however with only 3 time points I can't fit these (model doesn't converge).

So all is well, but then I run the model diagnostics and my residuals demonstrate a non normal (light tailed?) distribution and the residual vs fitted looks like it has a linear drift.

From what I understand this basically indicates that the model isn't well fit and usually means that the parameter estimates are accurate, but the p values and CI would be too narrow. With this in mind I thought about a log transformation of the outcome - but with the negative values I would have to add an arbitrary amount to the outcome (recenter around 100?) which I know is frowned upon. In addition, I think it would be very hard to interpret a model of log transformed outcome with a quadratic time function.

Any advice on next steps (or corrections on the steps taken so far) would be greatly appreciated.

Here is a second table comparing the final model with continuous time with a quadratic and time as a factor. And the qq plot of time as a factor. Using time as a factor doesn't appear to improve model fit (right?).

Here are the residual vs time and age category plots

and this is the categorical time model with baseline differences. Baseline differences are not significant (nor do they change the residuals) when included in the model with continuous time.

Your time points are unevenly spaced. Why not code time as a factor with 3 levels (assuming your outcome is a change from pre to post-event at each of your post-event time points)? This way, you will be able to capture the nonlinearity of the time effect without imposing a parametric form (e.g., quadratic) on the shape of the effect. — Isabella Ghement, Commented Oct 5, 2018 at 2:08
Thank you so much for taking the time to respond. I'm hitting my head against the wall here. I actually did try that in one of my many iterations and it doesn't appear to help model fit. I have included the table and QQ plot with time as a factor in my original question now. Any other thoughts or comments on where I have gone astray are greatly appreciated. — newbie, Commented Oct 5, 2018 at 3:19
I measured HRQoL following injury and I am now modelling the difference between parent and child report of this measure. The measure itself is bound between 0-100 and is left skewed, however the differences are normally distributed. — newbie, Commented Oct 5, 2018 at 3:36
One would expect there to be some baseline line difference that this difference would return to at some point during recovery. I have a baseline measure, but it is retrospective and inflated. — newbie, Commented Oct 5, 2018 at 3:39

Dimitris Rizopoulos · Accepted Answer · 2018-10-09 07:26:18Z

4

If you have an outcome that is bounded, and you see that the normal distribution for the error terms does not seem appropriate, you may want to try a Beta distribution. In particular, say that your original outcome is the range $y \in [a, b]$, then $y^* = (y - a) / (b - a) \in [0, 1]$, and you could fit a Beta mixed effects model for $y^*$.

This can be done, for instance, using the beta.fam() object in the family argument of the mixed_model() function of the GLMMadaptive package. E.g.,

library("GLMMadaptive")

fm <- mixed_model(y_new ~ time + agecat, data = your_data, 
                  random = ~ time | studyID, family = beta.fam())
summary(fm)

You can also find more info in the vignette Custom Models.

Model Fit

To investigate the fit of your model to the observed data, you could use the idea of replication checks (in Bayesian posterior predictive checks). The code below illustrates how this could be done:

x_vals <- with(your_data, seq(min(y_new), max(y_new), length.out = 500))
sim_fun <- function (n, mu, phis, eta_zi) {
    phi <- exp(phis)
    rbeta(n, shape1 = mu * phi, shape2 = phi * (1 - mu))
}
out <- simulate(fm, nsim = 30, acount_MLEs_var = TRUE, sim_fun = sim_fun)
rep_y <- apply(out, 2, function (x, x_vals) ecdf(x)(x_vals), x_vals = x_vals)

matplot(x_vals, rep_y, type = "l", lty = 1, col = "lightgrey", 
        xlab = "Response Variable", ylab = "Empirical CDF")
lines(x_vals, ecdf(your_data$y_new)(x_vals))
legend("bottomright", c("replicated data", "observed data"), lty = 1, 
       col = c("lightgrey", "black"), bty = "n", cex = 0.8)

edited Oct 9, 2018 at 7:26

answered Oct 5, 2018 at 8:44

Dimitris Rizopoulos

21.5k2 gold badges23 silver badges50 bronze badges

$\begingroup$ Your answers are always so helpful and informative, Dimitris! In this case, the outcome variable seems to be a difference between two ratings (parent vs. child), with each rating ranging between 0 and 100. So a Beta mixed effects may not be appropriate, unless one decided to model the parent rating separately from the child rating. I wonder if, with Time treated as categorical, it may be possible to improve the model fit by allowing for error variance heterogeneity across time points and/or age categories? It's hard to tell without plots of the model residuals against time & age. $\endgroup$
– Isabella Ghement
Commented Oct 5, 2018 at 20:06
$\begingroup$ The lack of a baseline measure in the model can potentially be problematic. $\endgroup$
– Isabella Ghement
Commented Oct 5, 2018 at 20:09
$\begingroup$ Thank you both so much for your time. When modelling HRQoL on it's own the beta mixed effects model is very useful! I have attached the residual vs time category and age category plots, and the model of categorical time including baseline difference as a covariate. I have included a random categorical time component in the model, it doesn't converge if I try to allow for correlation between the random time and intercept, so they are uncorrelated in the model. I'm still getting a light tailed qq plot. $\endgroup$
– newbie
Commented Oct 5, 2018 at 22:49
1

$\begingroup$ @IsabellaGhement Thanks! My line of thought was that the difference between two bounded outcomes will also be a bounded outcome. In addition, the reason why I suggested the Beta distribution is because it can admit symmetric and assymetric shapes (such as it seems to be in this case). $\endgroup$
– Dimitris Rizopoulos
Commented Oct 6, 2018 at 6:06
1

$\begingroup$ @newbie note that the residuals need to be compared with the corresponding Beta distribution not the normal distribution. Hence, you need to use the version of the QQ-plot in which the theoretical quantiles come from the Beta. $\endgroup$
– Dimitris Rizopoulos
Commented Oct 6, 2018 at 6:07

| Show 4 more comments

Stack Exchange Network

R lmer, 3 time point longitudinal data, non linear, messy residual help!

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
lme4-nlme
nonlinear
longitudinal-data-analysis
or ask your own question.

Linked

Hot Network Questions

R lmer, 3 time point longitudinal data, non linear, messy residual help!

1 Answer 1

Not the answer you're looking for? Browse other questions tagged lme4-nlmenonlinearlongitudinal-data-analysis or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
lme4-nlme
nonlinear
longitudinal-data-analysis
or ask your own question.