A cautionary case study of approaches to the treatment of missing data

Paul, Christopher; Mason, William M.; McCaffrey, Daniel; Fox, Sarah A.

doi:10.1007/s10260-007-0090-4

Original Article
Published: 08 January 2008

Volume 17, pages 351–372, (2008)

Statistical Methods and Applications

Christopher Paul¹,
William M. Mason²,
Daniel McCaffrey¹ &
Sarah A. Fox³

Abstract

This article presents findings from a case study of different approaches to the treatment of missing data. Simulations based on data from the Los Angeles Mammography Promotion in Churches Program (LAMP) led the authors to the following cautionary conclusions about the treatment of missing data: (1) Automated selection of the imputation model in the use of full Bayesian multiple imputation can lead to unexpected bias in coefficients of substantive models. (2) Under conditions that occur in actual data, casewise deletion can perform less well than we were led to expect by the existing literature. (3) Relatively unsophisticated imputations, such as mean imputation and conditional mean imputation, performed better than the technical literature led us to expect. (4) To underscore points (1), (2), and (3), the article concludes that imputation models are substantive models, and require the same caution with respect to specificity and calculability.
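The three simpler treatments named in the abstract (casewise deletion, mean imputation, and conditional mean imputation) can be illustrated with a minimal sketch. This is not the authors' simulation code: the function names, the toy data, and the choice of a single fully observed predictor `x` with a simple linear regression for the conditional mean are all assumptions made for illustration.

```python
from statistics import mean


def casewise_deletion(rows):
    """Keep only the rows in which every field is observed (not None)."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def conditional_mean_impute(x, y):
    """Replace each missing y with its predicted value from a least-squares
    line of y on x, fitted to the complete cases only.

    Assumes x is fully observed (an illustrative simplification).
    """
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(xi for xi, _ in complete)
    y_bar = mean(yi for _, yi in complete)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in complete)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in complete)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi if yi is None else yi
            for xi, yi in zip(x, y)]


# Hypothetical toy data: y is missing for the second and fifth cases.
x = [1, 2, 3, 4, 5]
y = [2.0, None, 6.0, 8.0, None]

print(casewise_deletion(list(zip(x, y))))
print(mean_impute(y))
print(conditional_mean_impute(x, y))
```

In this sketch the conditional mean step is itself a small regression model, which is one way to read the abstract's point that imputation models are substantive models. Both imputation functions fill in a single value per missing entry, so they do not carry the imputation uncertainty that multiple imputation is designed to propagate.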

Author information

Authors and Affiliations

RAND, 4570 Fifth Ave., Suite 600, Pittsburgh, PA, 15213, USA
Christopher Paul & Daniel McCaffrey
California Center for Population Research, University of California, Los Angeles, 4284 Public Policy Building, PO Box 951484, Los Angeles, CA, 90095, USA
William M. Mason
Department of Medicine, Division of General Internal Medicine and Health Services Research, University of California, Los Angeles, 1100 Glendon Ave., Suite 2010, Los Angeles, CA, 90024, USA
Sarah A. Fox

Corresponding author

Correspondence to Christopher Paul.

Additional information

The research reported here was partially supported by National Institutes of Health, National Cancer Institute, R01 CA65879 (SAF). We thank Nicholas Wolfinger, Naihua Duan, John Adams, John Fox, and the anonymous referees for their thoughtful comments on earlier drafts. The responsibility for any remaining errors is ours alone. Benjamin Stein was exceptionally helpful in orchestrating the simulations at the labs of UCLA Social Science Computing. Michael Mitchell of the UCLA Academic Technology Services Statistical Consulting Group artfully created Fig. 1 using the Stata graphics language; we are most grateful.

About this article

Cite this article

Paul, C., Mason, W.M., McCaffrey, D. et al. A cautionary case study of approaches to the treatment of missing data. Stat Meth Appl 17, 351–372 (2008). https://doi.org/10.1007/s10260-007-0090-4

Accepted: 11 December 2007
Published: 08 January 2008
Issue Date: July 2008
DOI: https://doi.org/10.1007/s10260-007-0090-4
