3
$\begingroup$

. I would like to study the link between mortality (outcome and binary variable) and competition between hospitals (Predictor). The competition faced by the hospital is measured by the Herfindahl-Hirschmann index (HHI), which is a continuous variable. I have patient-level variables (age, sex, diagnosis, general state of the patient), area (city) level variables (social deprivation index of the city, care offer in the city...), hospital level variables (HHI, public or private status of the hospital, hospital caseload). The last two groups of variables are at a higher level, since I want to make a cross-classified multilevel model (Patients are nested in both hospitals and cities). I'm looking for R scripts to implement the right modele. I have hundred of hospitals et thousand of cities, so I would consider hospitals and area effects as random.

Here's how I plan to proceed:

library lme4

model<- lmer(Death~age+sex+diagnos+patient_stat+ (1|Hospit_ID+HHI+Hospital_stat+HospCaseLoad) +
(1|City_ID+Deprivation_index,care offer),data=mydata).

But I'm not sure that the model is well implemented.

Another concern is the travel distance between the patient's city and the hospital where he or she is treated. If I decide to put this variable as level 2, I don't know whether to associate it at hospital level or city level, since not all patients living in the same city are at the same distance from their hospital of care (if they are treated in different hospitals), and this is true in the other way, not all patients treated in the same hospital are at the same distance from this hospital. But, all patients living in the same city and treated in the same hospital will share the same travel distance. Could I consider this variable to be level 1? What are the risks for this?

Should I follow the same steps as a simple model for the selection of variables to put in the model? That is, do bivariate analyses with each fitting variable (apart from the relevant variables) and the outcome? In practice, what are the validity conditions to check for such a model?

$\endgroup$

1 Answer 1

5
$\begingroup$

First:

Should I follow the same steps as a simple model for the selection of variables to put in the model? That is, do bivariate analyses with each fitting variable (apart from the relevant variables) and the outcome?

That is never a good way to do variable selection. Choose your variables by considering the causal paths between your main exposure, HHI, and the outcome, and all the other variables of interest. You should include variables if they are a potential confounders, or competing exposures, but not if they are mediators. See this answer for further details:
How do DAGs help to reduce bias in causal inference?

Regarding your model:

model<- lmer(Death~age+sex+diagnos+patient_stat+ (1|Hospit_ID+HHI+Hospital_stat+HospCaseLoad) + (1|City_ID+Deprivation_index,care offer),data=mydata)

This does not make much sense. You cannot have several variables after the | symbol in the random effects structure, you can only have one (or an interaction term). You said that your grouping variables are hospital and city and that these are crossed, so your model should be something like:

Death ~ HHI + confounders + competing_exposures + (1|Hospit_ID) + (1|City_ID)

Since the outcome is binary, you would want to fit a logistic model using glmer with family=binomial not lmer.

Regading the question of the level at which the travel distance variable varies: in mixed effects models this does not matter. The software will automatically handle it at the correct level.


Edit: To address the query in the comment about how to tell the software at which "level" a variable varies.

It is not neccessary, or even possible to tell the software the level at which a variable varies. It does not need to know. We can demonstrate this with a simple simulation:

We simulate patients within hospitals, and two fixed effects, one which varies at the hospital level and one which varies at the patient level and we will simulate them with parameters 10 and 5 respectively:

> set.seed(1)
> dt <- expand.grid(hospID = 1:10, patientID = 1:20)
> dt$hosp_var <- dt$hospID/3
> dt$patient_var <- dt$patientID/3
> dt$Y <- 1
> 
> X <- model.matrix(~ hosp_var + patient_var, data = dt)
> myFormula <- "Y ~ hosp_var + patient_var + (1 | hospID)"
> 
> foo <- lFormula(eval(myFormula), dt)
> Z <- t(as.matrix(foo$reTrms$Zt))
> 
> betas <- c(20, 10, 5)  # fixed effects
> b <- rnorm(10)         # random effects
> 
> dt$Y <- X %*% betas + Z %*% b + rnorm(nrow(dt))
> lmer(eval(myFormula), dt) %>% summary()

Fixed effects:
             Estimate Std. Error  
(Intercept)  19.97767    0.37073  
hosp_var     10.08795    0.15773
patient_var   5.01977    0.05032  

..and we recoved the values 10 and 5, as expected. All we had to do was include the variables as fixed effects.

$\endgroup$
3
  • $\begingroup$ Thank you for all your clarifications, which help me a lot. But how can the software know, for example, that the variables HHI and hospital status are level 2 variables and that they are relative to the hospital. Also, that the deprivation index variable is related to the city and that they are not related to the patients, to avoid some bias. $\endgroup$ Commented Aug 25, 2020 at 16:51
  • 1
    $\begingroup$ @SeydouGORO Perhaps it would be better to say that it does not need to know, because it has no impact on the estimation procedure. There is no way for you to tell the software at what level it varies anyway! $\endgroup$ Commented Aug 25, 2020 at 17:49
  • 1
    $\begingroup$ @SeydouGORO I have added a simulation to demonstrate that it doesn't matter what level a variable varies at. $\endgroup$ Commented Aug 25, 2020 at 18:43

Not the answer you're looking for? Browse other questions tagged or ask your own question.