This question seems simple, but I cannot find the answer in any book. As I understand it, multiple imputation assumes that only the variables that will be used in the analysis are imputed. So far, so good.
However, what if you don't yet have the variables you will use in the analysis, but only the "source" variables from which they will be derived? For instance, my source dataset contains two variables: "History of myocardial infarction" (No/Yes) and "Age at myocardial infarction" (recorded, of course, only for individuals with "Yes" for "History of myocardial infarction"). For my analysis I need to define a variable "Myocardial infarction at early age", which will be "Yes" for those with a myocardial infarction AND an age at myocardial infarction <55 years. So I need both source variables to create mine. Now, what should I do:
- Impute the two source variables separately (so that there are no missing values in "History of myocardial infarction" and "Age at myocardial infarction") and then, in each of the multiply imputed datasets, create the variable "Myocardial infarction at early age"?
- OR: first create the variable "Myocardial infarction at early age" from the two source variables in the source dataset, and then multiply impute the missing values this variable will inevitably have?
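
For concreteness, the derivation rule described above could be sketched in Python as follows. This is only an illustration of the rule, not an imputation procedure: the function name `derive_early_mi`, the constant `EARLY_AGE_CUTOFF`, the use of `None` for a missing value, and the toy records are all assumptions made for the example. It also assumes that people without a myocardial infarction get "No", and that a missing history makes the derived value missing.

```python
from typing import Optional

EARLY_AGE_CUTOFF = 55  # years; threshold taken from the definition above


def derive_early_mi(history: Optional[str], age_at_mi: Optional[float]) -> Optional[str]:
    """Derive "Myocardial infarction at early age" from the two source variables.

    None stands for a missing value (an assumption of this sketch).
    """
    if history is None:
        # History unknown: the derived value is treated as missing.
        return None
    if history == "No":
        # No infarction at all, so no early infarction (age is not applicable).
        return "No"
    if age_at_mi is None:
        # Infarction reported but age unknown: cannot classify.
        return None
    return "Yes" if age_at_mi < EARLY_AGE_CUTOFF else "No"


# Hypothetical toy records: (history, age at myocardial infarction)
records = [("Yes", 48), ("Yes", 61), ("No", None), ("Yes", None), (None, None)]
derived = [derive_early_mi(h, a) for h, a in records]
print(derived)
```

Under the first option, this rule would be applied to each completed dataset after imputing the source variables, so it never returns a missing value. Under the second option, it is applied once to the source data, and the last two toy records come out missing and would then have to be imputed.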