
I have a data set with N ~ 5000, and about half the cases are missing on at least one important variable. Comparing the people with missing data to those with complete data, it is fairly clear that the data are not missing at random (nonignorable nonresponse). The missing data pattern is not monotone, and data are missing on about a dozen variables.

The main analytic method will be Cox proportional hazards.

I am using SAS, which now offers an MNAR method for such data. If the pattern is monotone, it offers options to impute the missing data using either nearest neighbors or complete cases. If the pattern is non-monotone, it offers only a method of 'adjusting' the parameters, which (as far as I understand it) amounts to a pure sensitivity analysis.
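If I understand the documentation correctly, the non-monotone 'adjust' route is the MNAR ADJUST statement combined with FCS. A rough sketch of what I mean, with placeholder dataset and variable names (and the syntax may not be exactly right):

```sas
/* Sensitivity analysis via MNAR ADJUST with FCS for a
   non-monotone pattern. Dataset and variable names are
   placeholders; the shift value is arbitrary.            */
proc mi data=mydata nimpute=20 seed=54321 out=mi_adj;
   fcs reg(x1 x2 x3);
   mnar adjust(x1 / shift=-0.5);  /* shift imputed x1 down by 0.5 */
   var x1 x2 x3 time event;
run;
```

As I read it, one would repeat this over a range of shift values and see how sensitive the final estimates are, which is why it looks like a pure sensitivity analysis rather than an imputation model.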

SAS also offers an MCMC method of creating a monotone missing pattern from a non-monotone pattern.

My current plan is first to create a monotone missing pattern, then apply nearest-neighbor imputation, and then analyze the multiply imputed data.
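In PROC MI terms, I believe the two-step plan would look roughly like this (dataset and variable names are placeholders; I understand REGPMM, predictive mean matching, to be the nearest-neighbor-style option):

```sas
/* Step 1: use MCMC to impute just enough values to make
   the missing pattern monotone (a partial imputation).    */
proc mi data=mydata nimpute=20 seed=12345 out=mono;
   mcmc impute=monotone;
   var x1 x2 x3 time event;
run;

/* Step 2: complete the imputation with a monotone method.
   REGPMM (predictive mean matching) draws each imputed
   value from the closest observed cases, i.e. a
   nearest-neighbor-type imputation.                       */
proc mi data=mono nimpute=1 seed=12345 out=mi_out;
   by _Imputation_;
   monotone regpmm;
   var x1 x2 x3 time event;
run;
```

Note the second step runs with NIMPUTE=1 within each existing imputation (BY _Imputation_), so the final data set still contains 20 completed imputations.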

However, I am not sure this is best, nor do I know what determines which options to choose in a scenario like this. Advice is welcome, as are references to the literature.


1 Answer

Whether an imputation method is the "best" depends largely on the discipline.

If this is survey data with non-response because the question(s) were deemed sensitive by the respondents, then imputing with the nearest-neighbor method is probably inappropriate.

If the data are missing because a recorder malfunctioned, then nearest-neighbor imputation is appropriate. What others have done in the same discipline may also be worth considering.

Whether any method is "best" is debatable. You can run the analysis under several scenarios (ignore the missing data, exclude the variables with missing data altogether, nearest-neighbor imputation, etc.) and see whether you come to different conclusions. This is where experience and subject-matter expertise come in handy.
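For the imputed-data scenario, a Cox model is fit within each imputation and the results are then pooled. A sketch, assuming a multiply imputed data set `mi_out` indexed by `_Imputation_` and placeholder variable names:

```sas
/* Fit the Cox model separately within each imputation ... */
proc phreg data=mi_out;
   by _Imputation_;
   model time*event(0) = x1 x2 x3;
   ods output ParameterEstimates=parms;
run;

/* ... then combine the estimates with Rubin's rules. */
proc mianalyze parms=parms;
   modeleffects x1 x2 x3;
run;
```

The complete-case scenario is the same PHREG call run once on the unimputed data, which makes the side-by-side comparison straightforward.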

  • These are questions about college students' entry into college. It's not clear why there was so much missing data, but the questions don't seem particularly sensitive. I can certainly do a sensitivity analysis (and plan to), but if the scenarios give different results, then what?
    – Peter Flom
    Commented Apr 28, 2014 at 10:02
  • I take "sensitive" to cover a lot of reasons, ranging from "I'm not answering this because it's too personal" to "This survey is boring me now so I will skip the rest of the questions". If you get different conclusions, I would be conservative and stick with the non-imputed results, because I don't think imputation algorithms were designed to address non-response due to sensitivity. But that's just a personal practice of mine; I don't think there's a hard rule in the community. Ultimately, I think it has to do with how confident you are in your imputation algorithm.
    – rocinante
    Commented Apr 28, 2014 at 17:21
