Multiple imputation before or after creating variables?

Question

This question seemed simple, but I cannot find the answer in books. I know that the assumptions of multiple imputation require that only the variables are imputed that will be used in the analysis. So far, so good.

However, what if you don't yet have the variables you will use in the analysis, but only the "source" variables you will derive them from? For instance, the source dataset contains the two variables "History of myocardial infarction" (No/Yes) and "Age at myocardial infarction" (of course, only present for those individuals with "Yes" at "History of myocardial infarction"). In my analysis I will need to define a variable "Myocardial infarction at early age", which will be "Yes" for those with myocardial infarction AND age at myocardial infarction <55 years. So, I need both source variables to create mine. Now, what should I do:

1. Impute the two source variables separately (so I have no missing values at "History of myocardial infarction" and "age at myocardial infarction") and then, in all the multiply imputed datasets, create the variable "Myocardial infarction at early age"?
2. OR: first create the variable "Myocardial infarction at early age" from the two source variables in the source dataset, and then multiply impute the several missing values this variable will invariably have?
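
To make the derivation step concrete, here is a minimal Python sketch of the rule described above and of how missing source values carry over into the derived variable. The function name `derive_early_mi` and the choice to treat an unknown history as an unknown derived value are assumptions made for illustration, not part of any package.

```python
from typing import Optional

EARLY_AGE_CUTOFF = 55  # years, taken from the definition in the question


def derive_early_mi(history: Optional[str],
                    age_at_mi: Optional[float]) -> Optional[str]:
    """Return "Yes"/"No" for 'MI at early age', or None if it cannot be determined."""
    if history is None:
        # Assumption: an unknown history leaves the derived value unknown.
        return None
    if history == "No":
        # No MI at all, so the (structurally absent) age is irrelevant.
        return "No"
    if age_at_mi is None:
        # MI occurred, but the age is missing.
        return None
    return "Yes" if age_at_mi < EARLY_AGE_CUTOFF else "No"


# (history, age at MI) pairs; None marks a missing value.
rows = [("Yes", 48), ("Yes", 61), ("No", None), ("Yes", None), (None, None)]
derived = [derive_early_mi(h, a) for h, a in rows]
print(derived)  # ['Yes', 'No', 'No', None, None]
```

Under the first option this function would be applied to each completed dataset after imputation; under the second, it is applied once to the source data and the remaining `None` values are what gets imputed.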

fwiw I always impute as early in the process as possible. (Not least because some subsequent processing e.g. applying IRT model software, even requires a full data set to begin.) However, in the example you give you'd then also have a fiddly conditional imputation step if you did it all first. — conjugateprior, Commented Mar 24, 2016 at 23:41
As to your question, my intuition (not firm enough for a proper answer) is that it doesn't matter except insofar as the early (late) variables turn out to be more predictable on the basis of the other variables. — conjugateprior, Commented Mar 24, 2016 at 23:45
How will you be using the "myocardial infarction at an early age" variable differently from the 2 underlying source variables? In general, cutoffs (here, age of 55) tend to throw out useful information unnecessarily. — EdM, Commented Mar 25, 2016 at 2:36
@conjugateprior: by conditional imputation, do you mean only imputing age at myocardial infarction for those with myocardial infarction=yes? If you mean that, indeed, that is difficult with most software packages. But I might have found a way around it fiddling with mice in R (though I will accept advice). Do you propose that I try both approaches to check whether the early or the late variables seem to be "more predictable" based on the other ones? — torwart, Commented Mar 25, 2016 at 14:17
@EdM: well, in this case it makes biological sense, as there is a suspicion that premature vascular disease has different risk factors than later vascular disease (one of the objects of my PhD). Anyway, I also have other examples; for instance, I need to define a variable "Current smoker or ex-smoker who has stopped less than 5 years ago". So I have to combine a variable "Current smoker, ex-smoker, or never smoked" with the variable "If ex-smoker, years since he/she stopped", and I have the same problem again. — torwart, Commented Mar 25, 2016 at 14:24

EdM · Accepted Answer · 2016-03-25 20:03:38Z

This problem is closely related to the issue of what are called "passive" variables: for example, the square of an imputed variable if the analysis involves a quadratic term in the variable, or an interaction term between two imputed predictor variables. This leads to an argument for Option 3, which is to impute both the source and the derived variables separately, treating the derived variable as "Just Another Variable" (see section 6 of that very useful overview of multiple imputation).

Although a particular imputed case then can be internally inconsistent, this approach works better in some examples than predicting only the source variables and then (passively) getting the value of the derived variable. This web page, in its discussion of transformations, non-linearities, and interactions, similarly favors the "Just Another Variable" approach, despite its lack of internal logical consistency in some individual imputed cases. It would seem wise to try different imputation approaches and see if they substantially affect the conclusions of your analyses.
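
As a rough illustration of what "internally inconsistent" means here, the following Python sketch checks whether a separately imputed derived value agrees with the value implied by the imputed source variables, and reports the share of rows that disagree. The function names and the example rows are hypothetical; the values are invented purely to show the check, not taken from any real imputation.

```python
from typing import List, Optional, Tuple

# (history, age_at_mi, early_mi): all three treated as separately imputed.
Row = Tuple[str, Optional[float], str]


def is_consistent(history: str, age_at_mi: Optional[float],
                  early_mi: str, cutoff: float = 55) -> bool:
    """True if the imputed derived value matches what the source values imply."""
    # The age is only inspected when history is "Yes", so None is safe for "No".
    implied = "Yes" if (history == "Yes" and age_at_mi < cutoff) else "No"
    return early_mi == implied


def share_inconsistent(rows: List[Row]) -> float:
    """Fraction of rows whose derived value contradicts the source variables."""
    bad = sum(not is_consistent(h, a, e) for h, a, e in rows)
    return bad / len(rows)


# Invented rows from one hypothetical "Just Another Variable" imputed dataset.
imputed = [
    ("Yes", 50.0, "Yes"),  # consistent
    ("Yes", 63.0, "Yes"),  # inconsistent: age is above the cutoff
    ("No", None, "No"),    # consistent
    ("Yes", 52.0, "No"),   # inconsistent: age is below the cutoff
]
print(share_inconsistent(imputed))  # 0.5
```

With passive derivation this share is zero by construction, so a check like this only describes how often the "Just Another Variable" approach produces contradictory cases; it says nothing about which approach gives less biased estimates.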

For your particular application, however, I would suggest that you do not use "Premature/early onset myocardial infarction" as a binary outcome variable, as it throws away most information about the age at MI. That particular variable would seem much better handled by some form of survival analysis, which incorporates the MI event, the age at which it happened, and the lack of an event in individuals who have not had an MI by their age at last follow up. Yes, the biological basis of early-onset MI may be different than for later-onset, but a continuous-time survival model that can incorporate time-dependent influences of predictors might demonstrate that more usefully.
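
A minimal Python sketch of the data layout this implies: each individual becomes a (time, event) pair, with age as the time scale and individuals without an MI censored at their last follow-up. The variable `age_at_last_follow_up` is an assumption, since the question does not say whether the dataset records it, and the sketch assumes the inputs are already complete (for example, after imputation).

```python
from typing import List, Optional, Tuple


def to_survival_record(history: str,
                       age_at_mi: Optional[float],
                       age_at_last_follow_up: float) -> Tuple[float, bool]:
    """Map one individual to (time, event) with age as the time scale."""
    if history == "Yes":
        # Event observed: time is the age at which the MI happened.
        return (age_at_mi, True)
    # No MI so far: censored at the age of last follow-up.
    return (age_at_last_follow_up, False)


# (history, age at MI, age at last follow-up); values are invented.
people = [("Yes", 48.0, 70.0), ("No", None, 66.0)]
records: List[Tuple[float, bool]] = [to_survival_record(*p) for p in people]
print(records)  # [(48.0, True), (66.0, False)]
```

Pairs in this form are the usual input to survival-model software, and they keep the full age information rather than collapsing it at the 55-year cutoff.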

Finally, I'm not sure that I understand what you mean by "only the variables are imputed that will be used in the analysis." As I understand it, variables that may help impute missing predictor values should also be imputed if you are using chained equations as in mice, even if they are not used as outcome predictors in your model.
