Abstract
This article presents findings from a case study of different approaches to the treatment of missing data. Simulations based on data from the Los Angeles Mammography Promotion in Churches Program (LAMP) led the authors to the following cautionary conclusions about the treatment of missing data: (1) Automated selection of the imputation model in the use of full Bayesian multiple imputation can lead to unexpected bias in coefficients of substantive models. (2) Under conditions that occur in actual data, casewise deletion can perform less well than we were led to expect by the existing literature. (3) Relatively unsophisticated imputations, such as mean imputation and conditional mean imputation, performed better than the technical literature led us to expect. (4) To underscore points (1), (2), and (3), the article concludes that imputation models are substantive models, and require the same caution with respect to specificity and calculability.
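For readers unfamiliar with the simpler treatments named in points (2) and (3), the following is a minimal sketch of casewise deletion, mean imputation, and conditional mean imputation for a single incomplete predictor. It is an illustration only, not the authors' simulation design. The function names, the simulated variables `x`, `z`, and `y`, the coefficient values, and the missingness mechanism are all assumptions made for this example, and the data are synthetic rather than drawn from LAMP.

```python
import numpy as np


def casewise_deletion(x, *others):
    """Drop every case whose value of x is missing (NaN)."""
    keep = ~np.isnan(x)
    return (x[keep], *(a[keep] for a in others))


def mean_imputation(x):
    """Replace each missing x with the mean of the observed x values."""
    filled = x.copy()
    filled[np.isnan(x)] = np.nanmean(x)
    return filled


def conditional_mean_imputation(x, z):
    """Replace each missing x with its least-squares prediction from z."""
    obs = ~np.isnan(x)
    slope, intercept = np.polyfit(z[obs], x[obs], 1)
    filled = x.copy()
    filled[~obs] = intercept + slope * z[~obs]
    return filled


def ols_slopes(y, *predictors):
    """Return OLS slope estimates for the predictors (intercept dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


# Synthetic data (hypothetical): y depends on x and z, and x is correlated with z.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.3 * z + rng.normal(size=n)

# Assumed mechanism: the chance that x is missing rises with the observed z.
p_miss = 1.0 / (1.0 + np.exp(-(z - 1.0)))
x_obs = np.where(rng.random(n) < p_miss, np.nan, x)

x_cc, y_cc, z_cc = casewise_deletion(x_obs, y, z)
estimates = {
    "casewise deletion": ols_slopes(y_cc, x_cc, z_cc),
    "mean imputation": ols_slopes(y, mean_imputation(x_obs), z),
    "conditional mean imputation": ols_slopes(
        y, conditional_mean_imputation(x_obs, z), z
    ),
}

for name, (b_x, b_z) in estimates.items():
    print(f"{name}: slope on x = {b_x:.3f}, slope on z = {b_z:.3f}")
```

Comparing the printed slopes with the values used to generate `y` gives a rough sense of how each treatment behaves under this one assumed mechanism. A single draw of this kind says nothing about performance in general.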
Additional information
The research reported here was partially supported by National Institutes of Health, National Cancer Institute, R01 CA65879 (SAF). We thank Nicholas Wolfinger, Naihua Duan, John Adams, John Fox, and the anonymous referees for their thoughtful comments on earlier drafts. The responsibility for any remaining errors is ours alone. Benjamin Stein was exceptionally helpful in orchestrating the simulations at the labs of UCLA Social Science Computing. Michael Mitchell of the UCLA Academic Technology Services Statistical Consulting Group artfully created Fig. 1 using the Stata graphics language; we are most grateful.
Cite this article
Paul, C., Mason, W.M., McCaffrey, D. et al. A cautionary case study of approaches to the treatment of missing data. Stat Meth Appl 17, 351–372 (2008). https://doi.org/10.1007/s10260-007-0090-4