Firstly, I would like to admit that even though it is not the first time I am working with linear mixed models, the mathematical foundations escape me. I am running a linear mixed-effects model using the nlme package, specifically the lme
function. Firstly, I'm using this function instead of lme4
because of my previous experience with it. As you know, dput()
of long-format databases is almost impossible, so I will describe the data in as much detail as possible (if necessary I will provide through some platform).
I will try to explain step by step what I have done. I will work with an example, actually the first component of my model, which has yielded an error.
- The model: This model could be overfitted, and this is the first issue I am facing. However, the problem for me is how to identify this excess of covariates.
lme(cd86 ~ time:grupo_int_v00 + time + grupo_int_v00 + edad_s1 +
sexo_s1 + imc + hta_s1 + diab_s1 + dislip + geaf_tot, random= ~ 1|paciente, control = lmeControl(opt = "optim"), method="REML", data=tmp, na.action = na.omit)
- The data = tmp: Long-format data, with 2 temporal points, time = 0,1. There are multiple categorical variables in the model, two of which include an additional category for unknown information coded as value 9 (in diab_s1 and hta_s1).
tibble [8,874 × 13] (S3: tbl_df/tbl/data.frame)
$ cd86 : num [1:8874] NA NA NA NA NA ...
$ grupo_int_v00: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ time : num [1:8874] 0 0 0 0 0 0 0 0 0 0 ...
$ paciente : num [1:8874] 6291 6291 6291 6291 6291 ...
$ edad_s1 : num [1:8874] 68 68 68 68 68 68 68 68 68 68 ...
$ sexo_s1 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ peso : num [1:8874] 92.6 92.6 92.6 92.6 92.6 92.6 92.6 92.6 92.6 92.6 ...
$ imc : num [1:8874] 30.2 30.2 30.2 30.2 30.2 ...
$ hta_s1 : Factor w/ 3 levels "0","1","9": 2 2 2 2 2 2 2 2 2 2 ...
$ diab_s1 : Factor w/ 3 levels "0","1","9": 1 1 1 1 1 1 1 1 1 1 ...
$ dislip : num [1:8874] 1 1 1 1 1 1 1 1 1 1 ...
$ geaf_tot : num [1:8874] 280 280 280 280 280 ...
$ alcoholg_v00 : num [1:8874] 31.5 31.5 31.5 31.5 31.5 ...
sapply(vars_modelo, function(x) sum(!is.na(tmp[[x]])))
cd86 time grupo_int_v00 edad_s1
169 8874 8874 8874
sexo_s1 imc hta_s1 diab_s1
8874 8874 8874 8874
dislip geaf_tot paciente
8874 4437 8874
- Multicollinearity? Following previous threads here by Ben Bolker, I went for creating the matrix and test the "linear combos"
X <- model.matrix(~ time:grupo_int_v00 + time + grupo_int_v00 + edad_s1 + sexo_s1 + imc + hta_s1 + diab_s1 + dislip + geaf_tot, data = tmp)
lc <- caret::findLinearCombos(X)
colnames(X)[lc$linearCombos[[1]]]
[1] "time"
> colnames(X)[lc$linearCombos[[2]]]
[1] "hta_s19"
> colnames(X)[lc$linearCombos[[3]]]
[1] "diab_s19"
> colnames(X)[lc$linearCombos[[4]]]
[1] "time:grupo_int_v001"
colnames(X)
[1] "(Intercept)" "time"
[3] "grupo_int_v001" "edad_s1"
[5] "sexo_s11" "imc"
[7] "hta_s11" "hta_s19"
[9] "diab_s11" "diab_s19"
[11] "dislip" "geaf_tot"
[13] "time:grupo_int_v001
Regarding these linear combinations, I have some doubts that have arisen. At least hta_s19 and diab_s19 only have one category in the matrix created. I'm not sure if this is related with the additional category for unknown information codified as "9" which actually has 0 individuals on it (Note that the variable name has switched to hta_s19, I guess this is related to value "9")
unique(X[,"hta_s19"])
[1] 0
table(tmp$hta_s1)
0 1 9
1122 7752 0
unique(X[,"diab_s19"])
[1] 0
table(tmp$diab_s1)
0 1 9
5916 2958 0
What I don't fully comprehend is the significance of the other two linear combinations (time and "time:grupo_int_v001"). Does it imply that only time = 0 is detected in the matrix, suggesting a specific occurrence at this time?
Additionally, considering the matrix construction's aim to detect collinearity, are these variables indeed collinear?
unique(X[,"time"])
[1] 0
# Freq in the original databsase
table(tmp$time)
0 1
4437 4437
unique(X[,"time:grupo_int_v001"])
[1] 0
table(tmp$grupo_int_v00)
0 1
4386 4488
I would like to clarify within this volume of information.
- If I continue dropping from the matrix the tagged columns, I would be discarding time:group interaction term, which is in fact the term I am looking for.
- I don't know whether the problem is related to convergence or collinearity, or both of them (not sure this affirmation is nonsense)
- In case I should discard columns and use the matrix, I wasn't able to follow this . I wouldn't know from the matrix of covariance how to reformulate the model due to the different notation of the variables
#variables to remove from the matrix
variables_discard <- c("time", "hta_s19", "diab_s19", "time:grupo_int_v001")
# Remove columns
X1 <- with(tmp,
data.frame(cd86,
as.data.frame(X),
paciente))
X1 <- X1[, -which(colnames(X1) %in% variables_discard)]
X1 <- as.data.frame(X1)
lme(reformulate(colnames(X1)[2:15], response = "cd86"), random = ~1|paciente,control = lmeControl(opt = "optim"), method="REML", data=X1, na.action = na.omit)
#Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
NA
!!!! Only 169 rows get used at all. Also 50% ofgeaf_tot
is missing too. $\endgroup$