Multiple imputation is a state-of-the-art technique for handling missing values in survey data analysis. Typically, methods such as multivariate imputation by chained equations (mice, van Buuren 2018) are employed, which replace missing values on a variable-by-variable basis. The information used for imputation generally comes from the survey data set being analysed. Analysis results are valid when the values are either missing completely at random (MCAR) or missing at random (MAR). The situation becomes more complex, however, if the values are missing not at random (MNAR).
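The difference between the three mechanisms can be illustrated with a small simulation. The following is only a minimal sketch under assumed settings: the variables `age` and `log_income`, all coefficients, and the logistic missingness models are hypothetical choices made for illustration and are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data: an always-observed covariate and an income variable.
age = rng.normal(45, 12, n)
log_income = 10 + 0.01 * (age - 45) + rng.normal(0, 0.5, n)


def logistic(z):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every value has the same probability of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed covariate (age).
miss_mar = rng.random(n) < logistic(-1 + 0.05 * (age - 45))
# MNAR: missingness depends on the (partly unobserved) income itself.
miss_mnar = rng.random(n) < logistic(-1 + 1.5 * (log_income - 10))

# Bias of the complete-case mean relative to the full-data mean.
true_mean = log_income.mean()
bias = {
    name: log_income[~miss].mean() - true_mean
    for name, miss in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]
}
for name, value in bias.items():
    print(f"{name}: bias of observed mean = {value:+.3f}")
```

In this sketch the observed mean is close to the full-data mean under MCAR, whereas under MNAR high incomes are missing more often and the observed mean is biased downwards. Under MAR the remaining bias can in principle be removed by conditioning on `age`, which is what an imputation model does.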
There are several approaches to dealing with suspected MNAR. One approach incorporates sensitivity analyses into the imputation, i.e. it aims to make the imputation as robust as possible. Alternatively, the target data set to be imputed can be enriched with further information so that an MNAR mechanism becomes MAR, making the imputation and the analysis of the imputed data valid. The advantages of this second approach are clear, but often the full range of variables in the data set is already included in the imputation and the suspicion of MNAR still remains.
We present a new method that integrates prior information from an external distribution into the mice imputation process to reduce the risk of MNAR and better justify the assumption of a MAR mechanism. If required covariates are entirely unobserved in the target data set, we draw the missing variable values from the external data set with predictive mean matching. For this to work, the target and external data sets must overlap sufficiently in other variables, besides the variable of interest, that are measured in the same way.
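The idea of drawing values from external donors via predictive mean matching can be sketched as follows. This is a simplified illustration and not the authors' implementation: the function name `pmm_from_external`, its parameters, and the simulated data are hypothetical, and the regression uses a plain least-squares point estimate rather than the parameter draws used by predictive mean matching in mice.

```python
import numpy as np


def pmm_from_external(target_X, external_X, external_y, k=5, rng=None):
    """Impute a variable that is unobserved in the target data by
    borrowing observed values from an external data set.

    target_X, external_X: overlapping covariates, measured the same way.
    external_y: the variable of interest, observed only externally.
    k: number of closest donors to sample from (hypothetical default).
    """
    rng = np.random.default_rng(rng)
    # Add an intercept column to both design matrices.
    Xe = np.column_stack([np.ones(len(external_X)), external_X])
    Xt = np.column_stack([np.ones(len(target_X)), target_X])
    # Fit a linear model on the external data (simplification: no
    # posterior draw of the coefficients).
    beta, *_ = np.linalg.lstsq(Xe, external_y, rcond=None)
    pred_external = Xe @ beta
    pred_target = Xt @ beta
    imputed = np.empty(len(Xt))
    for i, p in enumerate(pred_target):
        # Find the k external cases with the closest predicted mean ...
        donors = np.argsort(np.abs(pred_external - p))[:k]
        # ... and copy the observed value of one randomly chosen donor.
        imputed[i] = external_y[rng.choice(donors)]
    return imputed


# Usage with simulated (hypothetical) data.
sim = np.random.default_rng(0)
external_X = sim.normal(size=(200, 2))
external_y = 1.0 + external_X @ np.array([2.0, -1.0]) + sim.normal(scale=0.5, size=200)
target_X = sim.normal(size=(50, 2))
imputed = pmm_from_external(target_X, external_X, external_y, k=5, rng=1)
```

Because every imputed value is an observed value from the external data, the imputations stay within the range of plausible values. For multiple imputation, the draw would be repeated for each imputed data set, ideally with coefficients drawn anew each time.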
We demonstrate the effectiveness and limitations of our new approach with a simulation example involving the imputation of a typical income variable. Additionally, we apply the method to the German Socio-Economic Panel Study (SOEP) Core and Innovation Sample, where the multiply imputed income variable from the former is used to inform the imputation in the latter.
References
(i) van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.