
First, I should admit that even though this is not the first time I have worked with linear mixed models, their mathematical foundations escape me. I am fitting a linear mixed-effects model using the nlme package, specifically the lme function; I use it instead of lme4 because of my previous experience with it. As you know, dput() of a long-format database is almost impossible, so I will describe the data in as much detail as possible (if necessary I can share it through some platform).

I will explain step by step what I have done, working through an example: the first component of my model, which has produced an error.

  1. The model: it could be overfitted, and this is the first issue I am facing. The problem for me is how to identify this excess of covariates (a rough check is sketched after the call below).

lme(cd86 ~ time:grupo_int_v00 + time + grupo_int_v00 + edad_s1 +
      sexo_s1 + imc + hta_s1 + diab_s1 + dislip + geaf_tot,
    random = ~ 1 | paciente, control = lmeControl(opt = "optim"),
    method = "REML", data = tmp, na.action = na.omit)
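
The rough check: compare the number of complete-case observations against the number of fixed-effect columns the formula generates (vars_modelo holds the model's variable names, as in the sapply() call below).

vars_modelo <- c("cd86", "time", "grupo_int_v00", "edad_s1", "sexo_s1",
                 "imc", "hta_s1", "diab_s1", "dislip", "geaf_tot", "paciente")
# rows usable after listwise deletion
n_complete <- sum(complete.cases(tmp[, vars_modelo]))
# fixed-effect columns (including the intercept) implied by the formula
n_params <- ncol(model.matrix(~ time * grupo_int_v00 + edad_s1 + sexo_s1 +
                                imc + hta_s1 + diab_s1 + dislip + geaf_tot,
                              data = tmp))
c(n_complete = n_complete, n_params = n_params)
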
  2. The data (tmp): long-format data with two time points (time = 0, 1). Several of the variables in the model are categorical; two of them, hta_s1 and diab_s1, include an additional category for unknown information coded as 9.
tibble [8,874 × 13] (S3: tbl_df/tbl/data.frame)
 $ cd86         : num [1:8874] NA NA NA NA NA ...
 $ grupo_int_v00: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ time         : num [1:8874] 0 0 0 0 0 0 0 0 0 0 ...
 $ paciente     : num [1:8874] 6291 6291 6291 6291 6291 ...
 $ edad_s1      : num [1:8874] 68 68 68 68 68 68 68 68 68 68 ...
 $ sexo_s1      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ peso         : num [1:8874] 92.6 92.6 92.6 92.6 92.6 92.6 92.6 92.6 92.6 92.6 ...
 $ imc          : num [1:8874] 30.2 30.2 30.2 30.2 30.2 ...
 $ hta_s1       : Factor w/ 3 levels "0","1","9": 2 2 2 2 2 2 2 2 2 2 ...
 $ diab_s1      : Factor w/ 3 levels "0","1","9": 1 1 1 1 1 1 1 1 1 1 ...
 $ dislip       : num [1:8874] 1 1 1 1 1 1 1 1 1 1 ...
 $ geaf_tot     : num [1:8874] 280 280 280 280 280 ...
 $ alcoholg_v00 : num [1:8874] 31.5 31.5 31.5 31.5 31.5 ...

 sapply(vars_modelo, function(x) sum(!is.na(tmp[[x]])))
         cd86          time grupo_int_v00       edad_s1 
          169          8874          8874          8874 
      sexo_s1           imc        hta_s1       diab_s1 
         8874          8874          8874          8874 
       dislip      geaf_tot      paciente 
         8874          4437          8874 
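
Since geaf_tot is observed in exactly half of the rows, a quick cross-tabulation checks whether its missingness is tied to time:

with(tmp, table(time, missing_geaf = is.na(geaf_tot)))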


  3. Multicollinearity? Following previous threads here by Ben Bolker, I created the model matrix and tested for "linear combos":
X <- model.matrix(~ time:grupo_int_v00 + time + grupo_int_v00 + edad_s1 + sexo_s1 + imc + hta_s1 + diab_s1 + dislip + geaf_tot, data = tmp)

lc <- caret::findLinearCombos(X)

colnames(X)[lc$linearCombos[[1]]]
[1] "time"
colnames(X)[lc$linearCombos[[2]]]
[1] "hta_s19"
colnames(X)[lc$linearCombos[[3]]]
[1] "diab_s19"
colnames(X)[lc$linearCombos[[4]]]
[1] "time:grupo_int_v001"

colnames(X)
 [1] "(Intercept)"         "time"               
 [3] "grupo_int_v001"      "edad_s1"            
 [5] "sexo_s11"            "imc"                
 [7] "hta_s11"             "hta_s19"            
 [9] "diab_s11"            "diab_s19"           
[11] "dislip"              "geaf_tot"           
[13] "time:grupo_int_v001
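
For reference, findLinearCombos also returns the column indices it suggests dropping, which can be applied directly to the matrix:

lc$remove                      # indices of columns flagged for removal
X_reduced <- X[, -lc$remove]   # model matrix without the flagged columns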

Regarding these linear combinations, some doubts have arisen. At least hta_s19 and diab_s19 correspond to columns with a single value in the constructed matrix. I'm not sure whether this is related to the additional category for unknown information coded as "9", which actually has 0 individuals in it (note that the variable name has become hta_s19; I take the appended "9" to refer to that level).

unique(X[,"hta_s19"])
[1] 0

table(tmp$hta_s1)

   0    1    9 
1122 7752    0 


unique(X[,"diab_s19"])
[1] 0


table(tmp$diab_s1)

   0    1    9 
5916 2958    0 
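
If the "9" level is genuinely empty, one option would be to drop unused factor levels before building the matrix, which should remove these all-zero columns:

tmp <- droplevels(tmp)   # drops empty factor levels such as hta_s1 == "9"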

What I don't fully understand is the meaning of the other two linear combinations (time and time:grupo_int_v001). Does it imply that only time = 0 is present in the matrix, suggesting that something specific happens at this time point?

Additionally, given that the matrix was constructed to detect collinearity, are these variables actually collinear?

 unique(X[,"time"])
[1] 0

# Freq in the original database
table(tmp$time)

   0    1 
4437 4437

unique(X[,"time:grupo_int_v001"])
[1] 0

table(tmp$grupo_int_v00)

   0    1 
4386 4488
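
One way to check which rows model.matrix() actually kept (it drops any row with an NA among the formula variables, and the retained row names index the original data, assuming default row names):

nrow(X)                           # rows surviving the NA handling
kept <- as.integer(rownames(X))   # original row indices retained in X
table(tmp$time[kept])             # time distribution among the retained rows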

Within this volume of information, I would like to clarify the following:

  • If I keep dropping the tagged columns from the matrix, I will discard the time:group interaction term, which is in fact the term I am looking for.
  • I don't know whether the problem is related to convergence, to collinearity, or to both (I'm not sure this statement even makes sense).
  • If I do have to discard columns and use the matrix, I wasn't able to follow this: I wouldn't know how to reformulate the model from the matrix of covariates, given the different notation of the variables (see the sketch after the error output below).
# variables to remove from the matrix
variables_discard <- c("time", "hta_s19", "diab_s19", "time:grupo_int_v001")

# remove the flagged columns
X1 <- with(tmp,
           data.frame(cd86,
                      as.data.frame(X),
                      paciente))
X1 <- X1[, -which(colnames(X1) %in% variables_discard)]

lme(reformulate(colnames(X1)[2:15], response = "cd86"),
    random = ~ 1 | paciente, control = lmeControl(opt = "optim"),
    method = "REML", data = X1, na.action = na.omit)

Error in terms.formula(formula, data = data) : 
  invalid model formula in ExtractVars
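
One possible cause of this error: data.frame() converts non-syntactic column names, so "(Intercept)" becomes "X.Intercept." and "time:grupo_int_v001" becomes "time.grupo_int_v001". The latter therefore never matches variables_discard, and the fixed index 2:15 then overruns the reduced column set, feeding NA names into reformulate(). A sketch of a more robust reformulation:

# build the formula from the columns actually present in X1;
# note "time:grupo_int_v001" is now called "time.grupo_int_v001"
pred <- setdiff(colnames(X1), c("cd86", "X.Intercept.", "paciente"))
form <- reformulate(pred, response = "cd86")
lme(form, random = ~ 1 | paciente, control = lmeControl(opt = "optim"),
    method = "REML", data = X1, na.action = na.omit)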
Comments:

  • Your response variable is almost all NA! Only 169 rows get used at all. Also, 50% of geaf_tot is missing too. (Mar 21 at 15:27)
  • Could you provide a brief description of each of your variables? They appear to be in Spanish and are a bit de-contextualized for those who aren't in the medical community. I also agree that the DV's missingness may be problematic and should be checked. (Mar 25 at 23:36)

1 Answer

Preliminary Responses

I provide an answer for now in case others do not respond. I first address some of your initial questions, then move on to more important matters that are visible in your modeling.

The model: This model could be overfitted, and this is the first issue I am facing. However, the problem for me is how to identify this excess of covariates.

You indeed have many variables here. I would first determine which of them are theoretically defensible to include; many "controls" are just artefacts of past literature and may not be that important (see the section "Control Variables" in the references provided). If they absolutely need to be included, one could use a PCA reduction of the variables, but this comes at the cost of interpretability (e.g., what would the first principal component "mean" with respect to its influence on the DV?). A sketch of such a reduction follows.
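
A minimal sketch with base R's prcomp, restricted (purely for illustration) to the numeric covariates:

covs <- na.omit(tmp[, c("edad_s1", "imc", "geaf_tot")])  # numeric covariates only
pca  <- prcomp(covs, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained per component
# the leading component scores could then replace the raw covariates,
# at the cost of interpretability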

Multicollinearity? Following previous threads here by Ben Bolker, I went for creating the matrix and test the "linear combos"

I'm wondering why you haven't first checked the variance inflation factors (VIFs) of the model (see Miles, 2014 for a brief explanation). You can check them easily by running check_collinearity(model) from the performance package in R; a sketch follows. I'm not sure which threads you are referring to, so a link to his posts may be helpful, but the VIF should at the very least give you a quick look at what is going on. If there is something more critical, such as a perfect linear combination of two variables, then that can of course be addressed by removing one of the variables if it can be considered redundant. Unfortunately I don't speak Spanish and know fairly little about medical research, so you may need to determine that yourself.
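
A minimal sketch, assuming your fitted model is stored as fit (whether check_collinearity supports nlme::lme objects is worth verifying in the package documentation):

library(performance)
# fit <- lme(cd86 ~ ..., random = ~ 1 | paciente, data = tmp)  # your model
check_collinearity(fit)   # reports a VIF for each fixed-effect term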

More Important Issues

On to some more important matters. As pointed out in the comments, your dependent variable appears to have a lot of missingness. I would first determine what is causing it and then decide how to handle it (see Little et al., 2014 for a quick guide). In your code you have chosen na.omit, which is probably not ideal: there may be structural reasons for the missingness, and simply removing the affected rows isn't the best strategy. A missingness map/matrix may help a lot with identifying what is going on; a sketch follows.
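
One way to build such a map; naniar is one option among several:

library(naniar)
vis_miss(tmp)            # visual map of missing values by variable and row
miss_var_summary(tmp)    # per-variable missingness counts and percentages
# cross-tabulate missingness against time to spot structural patterns
with(tmp, table(time, missing_cd86 = is.na(cd86)))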

It would also be helpful to show your original convergence errors in R, as these would offer some potential clues (I know that lme4 provides these explicitly). The last part of your code seems to be more an error in formulating the lme call; I'm not as familiar with lme or reformulate, but that looks like a technical error rather than something related to your data. I will note, in any case, that you shouldn't exclude what seems to be one of the most critical parts of your model, time. For the others you excluded, it would help to provide much more information about what your variables actually mean; otherwise the rest is just guesswork.

My hunch is this: based on the data I see, you only have two time points. If the patients do not vary much between time points, this may contribute to the model not converging. Given that you have already accounted for time in your model, I'm not entirely sure it is even necessary to fit random intercepts. I would check how much patients vary by time and see if that is contributing in some way (two quick checks are sketched below). If you add a lot of missingness into the mix, I can only imagine this doesn't help with estimating the model (see the section here which notes missingness as a potential contributor; its "Convergence" section is useful for your own purposes).
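
Two quick checks along these lines (again assuming your model object is fit; icc support for lme objects is worth verifying):

performance::icc(fit)   # share of variance explained by the patient intercept
# within-patient spread of the outcome across the two time points
head(aggregate(cd86 ~ paciente, data = tmp, FUN = sd))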

References

Control Variables

Additional References

Comments:

  • First of all, I appreciate the effort and time of your response. As far as I can see, missingness is a serious problem in my database; in fact, there seems to be a problem with one of the variables, which was distorted during the long-format reshaping. But I need to point out some details of your comments. Which na.action would you recommend, na.omit or na.exclude? As far as I know, lme doesn't work under the listwise-deletion procedure, losing statistical power. About the collinearity: I didn't know about check_collinearity. Is it different from caret::findLinearCombos? (Mar 26 at 9:20)
  • The findLinearCombos function uses a QR decomposition to check for linear combinations and then removes them. This is not the same as the VIF, which estimates how much the variance is inflated by the inclusion of certain predictors as a result of multicollinearity. I don't recommend any action for missingness before finding out why it exists. Filter the data for NA values, check their cause, figure out whether it is due to completely random factors or to observed data, and then act accordingly (per the article I referenced). Without knowing more about the NA values, I can't give concrete advice. (Mar 26 at 10:02)
  • One should only delete missing values if there are strong reasons for doing so (such as clearly erroneously coded values, or an artefact of some pivoting of the data, which should in any case be fixed before moving forward). (Mar 26 at 10:03)
