
I have a dataset in which multiple doctors assessed a score for three experimental conditions; they also saw each patient from three different angles. I am now trying to predict the doctors' assessments.

As I now have multiple non-independent data points per patient, I am trying to fit a mixed model to my data.

Each doctor has a bias that affects the assessment, each patient is different, and each condition is slightly different; all three affect the assessment. Should I therefore take all three as random effects in my model?

library(lme4)

model1 <- lmer(assessment ~ LIST_PREDICTORS +
                 (1 | participant) + (1 | patient) + (1 | condition),
               data = mixedModel_df)

I am mainly interested in predicting the assessment for new cases, so I want to control for the above random effects.
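Concretely, for new cases I am planning something like the following sketch, where new_df is a hypothetical data frame of new cases; as far as I understand, re.form = NA gives predictions from the fixed effects only:

# Sketch: population-level predictions for new cases (new_df is a placeholder
# data frame with the same predictor columns). re.form = NA drops all random
# effects, so only the fixed effects are used.
pred_new <- predict(model1, newdata = new_df, re.form = NA)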

I now have around 20 predictor variables. Many of them are heavily correlated with each other, often with Pearson r > 0.9. Looking at variance inflation factors confirms heavy multicollinearity in my model:

library(car)
vif(model1)
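To see which pairs drive this, I also inspect the raw correlations, roughly like this (predictor_cols is a placeholder for my predictor column names):

# Sketch: list predictor pairs with |r| > 0.9
cors <- cor(mixedModel_df[, predictor_cols])
cors[lower.tri(cors, diag = TRUE)] <- NA          # keep each pair only once
high <- which(abs(cors) > 0.9, arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           r    = cors[high])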

How should I best reduce the number of predictors for my data? How can I find the best model for my data?

I looked into lasso regression methods, but found lasso adaptations only for generalized linear mixed models, not for linear mixed models with a ratio-scaled dependent variable like mine.

step(model1)

does not produce convincing results. Also, from manually trying to eliminate predictors, it seems like there are multiple equally plausible predictor combinations.

I also tried running hierarchical clustering on my predictors and then using, from each cluster, the variable with the strongest correlation with the dependent variable (assessment) in my mixed model. I am also thinking about doing feature selection with a lasso regression model and then plugging the identified variables into a mixed model to properly control for the random effects, roughly as sketched below.
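Something along these lines, where predictor_cols is again a placeholder and I am aware that ignoring the grouping structure during the lasso step is only an approximation:

library(glmnet)
library(lme4)

# Sketch: lasso on the fixed effects only (ignores the grouping structure),
# then refit a mixed model with the selected predictors.
x <- as.matrix(mixedModel_df[, predictor_cols])   # placeholder column names
y <- mixedModel_df$assessment

cv_fit   <- cv.glmnet(x, y, alpha = 1)            # lasso with cross-validated lambda
betas    <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- rownames(betas)[betas[, 1] != 0]
selected <- setdiff(selected, "(Intercept)")

form <- reformulate(c(selected,
                      "(1 | participant)", "(1 | patient)", "(1 | condition)"),
                    response = "assessment")
model_sel <- lmer(form, data = mixedModel_df)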


2 Answers


One approach to dealing with this type of problem is to group the predictors to reduce the extent of the correlations among them. Ideally this should be done with expert knowledge, or it might be obvious even to non-experts that certain predictors belong together, such as systolic blood pressure (SBP) and diastolic blood pressure (DBP). If there is no knowledge that can help to identify the groups, then a clustering algorithm or an exploratory factor analysis could be used. The general idea is that all of the variables in a group measure the same thing; in the case of SBP and DBP, they both measure blood pressure.

From there you can take a number of approaches. A very simple one is to take the average of each group (if doing this, it is a good idea to scale the variables first if they are not already on the same scale) and use that average as a predictor in place of the individual variables. A perhaps better approach, which recognises that the observed variables are measuring another, unobserved (latent) variable, is to adopt a latent-variable approach and use the factor loadings as weights to create a new variable that you then use as a predictor in your regression with random effects. You could also combine both steps by using a mixed-effects latent variable model.
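For illustration, here is a rough sketch of the averaging variant in R; the predictor column names, the clustering cut-off, and the number of resulting groups are placeholders that would need to be chosen with subject knowledge:

library(lme4)

# Sketch: cluster the predictors by |correlation|, then replace each cluster
# with the average of its scaled members (predictor_cols is a placeholder).
X   <- scale(mixedModel_df[, predictor_cols])
hc  <- hclust(as.dist(1 - abs(cor(X))))
grp <- cutree(hc, h = 0.2)        # groups variables with |r| > 0.8; arbitrary cut-off

grouped <- sapply(unique(grp), function(g) rowMeans(X[, grp == g, drop = FALSE]))
colnames(grouped) <- paste0("group", unique(grp))

mm_df2 <- cbind(mixedModel_df[, c("assessment", "participant", "patient", "condition")],
                grouped)
model2 <- lmer(assessment ~ group1 + group2 +          # etc.
                 (1 | participant) + (1 | patient) + (1 | condition),
               data = mm_df2)

# Alternative (latent-variable idea): use factor scores instead of averages, e.g.
# fa <- factanal(X, factors = 3, scores = "regression"); fa$scores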

  • Thanks for the suggestions. I am mainly interested in predicting assessment scores for new patients. Acquiring more predictor data points is very costly, so I want to reduce the number of predictors by feature elimination and by focusing on established measures. Constructing new "compound predictors" using approaches like averaging/PCA/LDA is therefore very costly in my use case.
    – florian
    Commented Nov 9, 2020 at 19:23
  • I'm not sure how, for example, averaging is costly to you? Commented Nov 9, 2020 at 19:39
  • To compute averages I need to acquire data for all the predictors. Ideally I manage to reduce each cluster to one feature.
    – florian
    Commented Nov 9, 2020 at 20:20
  • I really don't understand. Why do you need to acquire more data to calculate an average? Commented Nov 9, 2020 at 20:52
  • Because my aim is to have a model that can predict for new cases, so if my model depends on averages I need to acquire every single data point.
    – florian
    Commented Nov 9, 2020 at 22:15

I am dealing with a similar problem. Looking at the literature, I found the paper "Confronting multicollinearity in ecological multiple regression" (Graham, 2003) https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/02-3114, which suggests using AIC to compare models and to guide model/feature selection. That is probably a better approach than simply looking at the Wald statistic (i.e., the t-value).
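For example, with lmer one could fit the candidate fixed-effect structures by maximum likelihood and compare them; a sketch, with x1, x2, x3 as placeholder predictors:

library(lme4)

# Sketch: compare candidate fixed-effect sets by AIC. The models are fit with
# ML (REML = FALSE) so that AICs are comparable across different fixed effects.
m_a <- lmer(assessment ~ x1 + x2 +
              (1 | participant) + (1 | patient) + (1 | condition),
            data = mixedModel_df, REML = FALSE)
m_b <- lmer(assessment ~ x1 + x3 +
              (1 | participant) + (1 | patient) + (1 | condition),
            data = mixedModel_df, REML = FALSE)
AIC(m_a, m_b)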

I would be more than happy to hear other thoughts on this issue.

