Introduction

The validity of statistical inferences is at risk when analyzed data are incomplete, especially if missing data are handled incorrectly. It has been shown that even very small proportions of incomplete cases in randomized controlled trials (RCTs) can lead to substantial missing information and misleading inferences [1]. Although the statistical tools to deal with incomplete data are available in the statistics and biostatistics literature [2–4], the degree to which HIV prevention scientists are applying them to their studies is unknown. For example, although drop-out is a common complication in longitudinal studies of health and health behavior, it is still the convention to use only the available data [2, 5, 6]. It has been shown repeatedly that ignoring the problems caused by missing data can lead to biased results, flawed interpretation, loss of statistical power, and inefficiency [2, 5].

Many studies show that incomplete data may differ by key variables such as treatment group, gender, age, race, and education level [7–9]. Hence, we expect a higher probability of nonresponse for some subgroups compared with others. Differential missing data can lead to differences between those with complete data and those with incomplete data, causing a lack of generalizability to nonresponders. Despite this fact, one of the most commonly used missing data techniques is list-wise deletion, which makes use of complete case data to the exclusion of cases with incomplete data. When study completers differ substantively from non-completers, statistical conclusions drawn from the selected data will be particularly misleading. Although deleting cases with incomplete data is straightforward and is the default in many statistical packages (e.g. SAS, SPSS, MINITAB), this technique may lead to important biases and loss of statistical power. Fortunately, methods have been developed to handle missing data with significant advantages over case deletion. The purpose of this commentary is to review the techniques and assumptions used for managing missing data in recently published HIV prevention trials.

In this review we examine the missing data assumptions, their applications, and their solutions. Our focus is on the extent of missing data in HIV prevention trials and the implications for interpreting findings. We conclude with some recommendations for managing missing data in future prevention trials.

Missing Data Assumptions

Prior to examining the methods used for managing missing data in HIV prevention trials, we review the underlying assumptions. Assumptions for managing missing values are built upon conceptual mechanisms, which can be thought of as the reasons for the missing values. Understanding these assumptions is necessary for choosing the correct analysis procedures, and reporting them lets readers of a manuscript know exactly which assumptions were made. The main mechanisms for missing values are: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) [3, 10].

MCAR is what most people would think of if told that data were randomly missing. Under the MCAR mechanism, the observed data are a random subset of the hypothetical (but unobserved) complete data set, and are representative of the hypothetically complete set and population. This happens when missingness is unrelated to values in the data set, either missing or observed [3]. Consider an HIV prevention study where HIV status is missing due to a random error in data entry. This condition of “nonresponse” would be MCAR; the missingness is unrelated to the response variables. Another analogy is to think of MCAR as a scenario in which a lightning strike destroyed certain parts of the data completely by chance.

MAR should be thought of as conditional missingness. Under the MAR assumption, missingness can be related to an observed part of the data. For example, in the same HIV prevention trial, if HIV status is missing as a function of age and gender alone, then having complete data on the variables age and gender constitutes a MAR mechanism. Similarly, consider a case in which missing values are more prevalent in the treatment arm than in the control arm. As long as the treatment assignment is observed, this is considered a MAR mechanism.

When MAR cannot be assumed, we have to assume the data are missing by some non-random mechanism, MNAR. Under this assumption, missingness can depend on the unobserved (missing) values themselves, even after controlling for other variables. For example, if HIV status is more likely to be missing for individuals whose unobserved HIV status is positive, then the unobserved HIV status values are MNAR. In this case, the observed data represent a subgroup of participants whose HIV status is more likely to be negative. Clearly, statistical inferences derived from the available data would be unrepresentative of non-responders.
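The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the binary covariate `older`, the outcome probabilities (0.30 and 0.10), and the missingness probabilities are all hypothetical and are not taken from any of the reviewed trials. It compares the true prevalence with the complete-case prevalence and with an estimate that is standardized over the observed covariate.

```python
import random


def simulate(mechanism, n=100_000, seed=0):
    """Return (true, complete-case, covariate-adjusted) prevalence.

    All probabilities below are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    n_positive = 0
    # Per-stratum counts: [all participants, observed, observed positives]
    strata = {False: [0, 0, 0], True: [0, 0, 0]}
    for _ in range(n):
        older = rng.random() < 0.5                  # fully observed covariate
        positive = rng.random() < (0.30 if older else 0.10)
        if mechanism == "MCAR":
            p_missing = 0.30                        # unrelated to any value
        elif mechanism == "MAR":
            p_missing = 0.50 if older else 0.10     # depends on observed covariate
        elif mechanism == "MNAR":
            p_missing = 0.50 if positive else 0.10  # depends on the outcome itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        missing = rng.random() < p_missing
        n_positive += positive
        counts = strata[older]
        counts[0] += 1
        if not missing:
            counts[1] += 1
            counts[2] += positive
    truth = n_positive / n
    complete_case = (sum(c[2] for c in strata.values())
                     / sum(c[1] for c in strata.values()))
    # Standardize the stratum-specific observed prevalence to the full sample.
    adjusted = sum((c[0] / n) * (c[2] / c[1]) for c in strata.values())
    return truth, complete_case, adjusted


for mechanism in ("MCAR", "MAR", "MNAR"):
    truth, complete_case, adjusted = simulate(mechanism)
    print(f"{mechanism}: truth={truth:.3f} "
          f"complete-case={complete_case:.3f} adjusted={adjusted:.3f}")
```

With these assumed probabilities the true prevalence is 0.20. The complete-case estimate is expected to be close to 0.20 under MCAR, about 0.17 under MAR, and about 0.12 under MNAR. Conditioning on the observed covariate is expected to remove the bias under MAR but not under MNAR.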

The caveat is that, with unplanned missingness, the distinction between the MAR and MNAR assumptions cannot be verified without follow-up of non-responders, i.e. without obtaining more information about the missing values. Obviously, following up non-respondents does not occur in prevention trials. The distinction between MAR and MNAR is important in order to define ignorable missingness.

When data are complete, researchers have to come up with a substantive model (e.g. a regression model) in order to explain the data. When data are incomplete, researchers have to model not only the available data but the missingness as well, unless they are willing to make an ignorability assumption. Ignorable missingness refers to whether or not the mechanism accounting for the missingness must be explicitly modeled together with the substantive model [3, 10]. Researchers often confuse the ignorability concept with the claim that the missing data can be ignored. However, when making statistical inferences the missing data should never be ignored. The difference between ignorable and non-ignorable missingness is whether a separate model for missingness must be included together with the substantive model. Under the ignorability assumption only the substantive model needs to be specified (e.g. a regression model), while under non-ignorable set-ups a joint model for the substantive and missingness models needs to be specified using selection models [3], pattern-mixture models [3], or shared parameter models [3].

We can assume ignorable missingness when data are MAR and when the missingness has no bearing on the substantive model parameters [5]. When using likelihood-based or Bayesian estimation techniques, MAR and MCAR can reasonably be treated as ignorable [11], which means that no additional information is required about the distribution of the nonresponse [10]. However, when using semi-parametric techniques such as generalized estimating equations (GEE) [12], only the MCAR condition is ignorable. If data are MNAR, the missingness is always non-ignorable.

Possible Solutions to Missing Data

There are several statistical procedures that deal with incomplete data. We introduce a few of them here with relative advantages and disadvantages.

Complete Case Analysis

The most common and straightforward approach to dealing with incomplete data is to omit those subjects with incomplete data from the analysis. This is often the default method of handling incomplete data by statistical procedures in commonly used statistical software packages, such as Stata [13], SAS [14], and SPSS [15]. The advantage of case deletion is that it can be used for any kind of statistical analysis and no special computational methods are required. When data are MCAR this approach may yield results that are unbiased [3, 16]. However, the disadvantages are loss of power, inefficiency, and possible bias. Reduced sample size may impose limitations on the types of analyses that can be conducted, and may preclude the use of large-sample techniques. In particular, consider trial data with 10 variables in which each variable is missing 5% of its values at random (MCAR). Case deletion will reduce the data to around 60% of the original cases. Larger data sets and larger rates of missing values can have even bigger impacts (e.g. 20 variables with 10% missing values each will leave about 12% of the data).
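The figures in this example can be reproduced with a short calculation. The sketch below assumes, in addition to MCAR, that values are missing independently across variables, so that the probability that a case is complete is the product of the per-variable observation rates. The function name is ours and is used only for illustration.

```python
def complete_case_fraction(n_variables, missing_rate):
    """Expected fraction of fully observed cases.

    Assumes each variable is missing independently with the same rate,
    so a case is complete with probability (1 - rate) ** n_variables.
    """
    return (1.0 - missing_rate) ** n_variables


print(round(complete_case_fraction(10, 0.05), 3))  # about 0.60
print(round(complete_case_fraction(20, 0.10), 3))  # about 0.12
```

If missingness is concentrated in the same participants across variables, more cases remain complete than this calculation suggests, so it is better read as a pessimistic benchmark than as a prediction for any particular trial.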

Generalized Estimating Equation (GEE)

For correlated data, generalized estimating equations (GEE) [12] have become one of the most widely used procedures in practice. GEE procedures are used regularly in large studies when clustering or longitudinal structures are desired or unavoidable. Using this procedure, the researcher specifies a working correlation structure, but this structure does not have to hold exactly. This makes GEE a semi-parametric procedure: the model for the data has to be specified, but the correlations are not of main interest. Unfortunately, GEE with incomplete data is unbiased only under the MCAR assumption. However, an infrequently used extension, weighted GEE [17], accommodates missing data under the MAR condition.
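The weighting idea behind weighted GEE can be sketched in a much simpler setting. The example below is not a GEE fit. It only shows inverse probability weighting of a prevalence estimate, in which each observed participant is weighted by the inverse of the estimated probability of being observed given a fully observed covariate. The data set, the stratum labels, and the function name are hypothetical.

```python
from collections import defaultdict


def ipw_mean(records):
    """Inverse-probability-weighted mean of an outcome.

    records: list of (stratum, outcome) pairs; outcome is None when missing.
    The probability of being observed is estimated within each stratum.
    """
    totals = defaultdict(int)
    observed = defaultdict(int)
    for stratum, outcome in records:
        totals[stratum] += 1
        if outcome is not None:
            observed[stratum] += 1
    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, outcome in records:
        if outcome is None:
            continue
        # Weight = 1 / estimated P(observed | stratum)
        weight = totals[stratum] / observed[stratum]
        weighted_sum += weight * outcome
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical data: 20 younger participants, all observed (2 positive);
# 20 older participants, half missing (3 of the 10 observed are positive).
records = (
    [("younger", 1)] * 2 + [("younger", 0)] * 18
    + [("older", 1)] * 3 + [("older", 0)] * 7 + [("older", None)] * 10
)
observed_values = [y for _, y in records if y is not None]
print(sum(observed_values) / len(observed_values))  # complete-case estimate
print(ipw_mean(records))                            # weighted estimate
```

In this constructed example the complete-case estimate is 5/30, while the weighted estimate is 8/40 = 0.20, because the older stratum, which has both higher prevalence and more missing data, is weighted up. The correction relies on the MAR assumption, here that missingness depends only on the stratum.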

Maximum Likelihood

Maximum likelihood is a large sample technique looking for the parameter estimates that have the greatest likelihood of producing the observed data, given a specified model. These parameters are called the maximum likelihood estimates (MLE) [3, 16, 18, 19]. Maximum likelihood estimation does not require observations to be balanced; individuals may have differing numbers of observations spaced at different intervals. All complete and partially-observed cases contribute to the maximum likelihood estimation of model parameters, and the missing data values are treated as random variables to be averaged over [20].

Bayesian Estimation and Multiple Imputation

Bayesian estimation techniques use prior information (a prior distribution) together with the likelihood to produce a posterior distribution. The estimates drawn from the posterior distribution take into account prior knowledge and the distribution of the data [3, 21]. This is usually done using Markov Chain Monte Carlo (MCMC) estimation, which allows the analysis of the data without dropping cases.

Multiple imputation (MI) [2] replaces missing observations with m > 1 plausible values to create multiple alternative completed data sets [3, 4, 11]. The completed data sets are analyzed individually, and the multiple parameter estimates are combined. MI provides the advantage of allowing complete-data analytical routines while accounting for the uncertainty of estimates due to imputation. In the past a small number of imputations was considered adequate for efficient parameter estimation [5], but many more may be needed to improve efficiency [22–24].
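The combination step is commonly carried out with Rubin's rules, which the following sketch illustrates. It assumes that each of the m completed data sets has already been analyzed and has produced a point estimate and its squared standard error. The input numbers and the function name are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data results using Rubin's rules.

    estimates: point estimates from the m completed data sets
    variances: their squared standard errors
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m > 1 estimates with matching variances")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total, math.sqrt(total)


# Hypothetical results from m = 5 imputed data sets
estimates = [0.42, 0.47, 0.39, 0.45, 0.44]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]
pooled, total, se = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, standard error = {se:.3f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing values. Analyzing a single imputed data set as if it were complete omits this term and understates the standard error.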

Missing Data in HIV Prevention Trials

Our review was performed with the assistance of the HIV/AIDS Prevention Research Synthesis (PRS) Project at the Centers for Disease Control and Prevention (CDC) [25]. This is a database of prevention studies maintained by the CDC for monitoring evidence-based HIV behavioral interventions. The review process is conducted using well-established systematic procedures for searching and reviewing the intervention research literature. Our search was based on automated strategies in four electronic bibliographic databases (EMBASE, MEDLINE, PsycINFO, and Sociological Abstracts) together with a manual search which involved reviewing approximately 35 journals to identify articles not yet indexed in the electronic databases. More detailed information about the CDC PRS database can be found in the CDC literature [25].

The PRS cumulative database was searched on June 15, 2010 and all citations that met the following criteria were retrieved.

  1. Reports of HIV/STD/HBV/HCV behavioral interventions. Rationale: the interpretation of biological and behavioral interventions differs along multiple dimensions. We chose to exclude biological interventions because the conditions under which data are missing vary considerably from those of behavioral interventions.

  2. Based on a randomized controlled trial (RCT) research design. The RCT is the most rigorous design for testing clinical methods and procedures, and its findings can be blurred by missing data.

  3. Reported a biological outcome. In particular, any sexually transmitted infection/disease endpoint, such as HIV incidence/seroconversion, STD incidence/re-infection, or Hepatitis B or C infection. Biological endpoints in HIV prevention trials avoid self-report biases and represent a clinical disease outcome.

  4. From 2005 to present. Studies prior to 2005 do not represent the current state of HIV prevention science and the implications for missing data are fewer for these studies.

The search resulted in 57 citations that met the inclusion criteria. A reviewer with a background in HIV prevention assessed each study and extracted the pre-specified information (see Table 1). A second reviewer with a background in biostatistics assessed a sample of studies and arrived at 100% agreement with the first reviewer. The results of the reviewing process, plus descriptive information on the studies, were entered into a computer database and are summarized in Table 1 [26–82].

Table 1 Summary of behavioral HIV prevention trials reporting biological outcomes

The trials were conducted all over the world with regions/countries ranging from the United States and Mexico in North America, Jamaica in Central America, United Kingdom and Belgium in Western Europe, Russia and Bulgaria in Eastern Europe, Thailand, Philippines and China in Asia and several countries in Sub-Saharan Africa such as Tanzania, Zimbabwe, South Africa, Kenya, Uganda and Madagascar.

All of the trials had some level of missing values. Although not all studies reported them in the same manner, we found that the average percentage of missing values per study ranged between 3 and 97%. Averaged over all studies, the percentage of missing values was 26% (median 23%). Within this sample of missing-data rates, values greater than 50% were considered outliers; there were four such outliers. We extracted the information about the missing data levels from the participant flow charts reported in each trial. It is clear from the range of cell sizes that the amount of available data varied across analyses within many studies. We speculate that the majority of missing values are due to missing outcomes, but we cannot know this for certain for all studies because of the differences in reporting. However, because missing outcomes and other missing values may both bias the results, we do not distinguish between them.

None (0%) of the studies reported any information on what missing data assumptions were used in their analyses. In most cases this implies that the results will be unbiased only if the data are MCAR. The majority of studies (42, 74%) used complete case analysis (CCA), restricting the sample to those with complete data (Table 1, references [26–67]). Eight studies (14%) used some variation of GEE analysis, which uses all available observations, including those from participants with incomplete data, but is potentially biased under the MAR assumption (Table 1, references [68–75]). A few studies (7, 12%) used maximum likelihood estimation, and their results will therefore be unbiased under the MAR assumption (Table 1, references [76–82]). Collins et al. [21] showed that if one collects enough auxiliary information, one can get close to the MAR assumption. Assuming all studies collected enough information so that the MAR assumption is reasonable, and since we know that MCAR rarely happens in practice, only seven of the 57 studies (12%) conducted at least some of their analyses in a way that can be expected to yield unbiased results.

The studies reviewed that used complete case analysis (n = 42, 74%; references [26–67]) did so for many different types of analyses. Examples include parametric tests such as the t test, F test, and χ2 test; non-parametric tests such as rank tests; regression analyses such as linear regression, multiple regression, logistic regression, and Poisson and binomial regressions; and analysis of variance (ANOVA) with its derivatives MANOVA and ANCOVA. All these analyses are in danger of being biased under MAR and MNAR, and have a chance of being unbiased under MCAR.

The studies using GEE in our review (n = 8, 14%; references [68–75]) reported conventional (unweighted) GEE, which implies possibly biased results unless missing data were missing completely at random (MCAR).

Studies using maximum likelihood estimation (n = 7, 12%; references [76–82]) used generalized multilevel models and linear mixed models. These procedures are also called generalized linear mixed models, mixed effect linear regression, random effect regression, and multilevel random effect models. These procedures are expected to be unbiased under both MAR and MCAR.

Bayesian and multiple imputation procedures are well equipped to deal with incomplete data. Unfortunately, none of the trials we reviewed used these procedures. With adequate modeling, both of these procedures can be unbiased under MCAR, MAR, and MNAR scenarios.

Recommendations

With any applied research, and in particular RCTs, the best thing to do with regard to missing data is to avoid it. The second best thing is to plan for it, understand it, and address it with appropriate modeling techniques. (1) Plan for missingness. Researchers should anticipate unavoidable missing data. Variables determined to relate to non-response should be identified and measured. (2) Minimize nonresponse. Incorporate procedures into the study plan to reduce missed assessments and ensure regular review of data. (3) Determine the mechanism of missingness. Researchers should test the assumption of MCAR, and carefully consider the plausibility of ignorable missingness. (4) Apply appropriate techniques. Techniques such as ML, GEE, Bayesian estimation, and MI are effective when applied appropriately under proper assumptions, but will provide misleading results when implemented incorrectly. (5) Report missingness and techniques used. Researchers should fully describe missing data methods: the incomplete data structure, the missing data assumptions, and the techniques selected to handle them. (6) Conduct sensitivity analysis. Researchers should analyze their data under different missing data assumptions and report the differences these assumptions make to the conclusions.
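Recommendation (6) can be illustrated with a very simple sensitivity analysis for a binary outcome. The sketch below recomputes an overall prevalence under a range of assumed prevalences among participants with missing outcomes, in the spirit of a pattern-mixture analysis. The counts and the function name are hypothetical and are not drawn from any reviewed trial.

```python
def prevalence_under_assumption(n_observed, n_positive_observed,
                                n_missing, assumed_prevalence_missing):
    """Overall prevalence if non-responders had the assumed prevalence."""
    expected_positive = (n_positive_observed
                         + assumed_prevalence_missing * n_missing)
    return expected_positive / (n_observed + n_missing)


# Hypothetical trial arm: 74 participants observed (10 positive), 26 missing
n_observed, n_positive, n_missing = 74, 10, 26
observed_prevalence = n_positive / n_observed

scenarios = {
    "all missing negative (best case)": 0.0,
    "same as observed": observed_prevalence,
    "twice the observed prevalence": 2 * observed_prevalence,
    "all missing positive (worst case)": 1.0,
}
for label, assumed in scenarios.items():
    estimate = prevalence_under_assumption(
        n_observed, n_positive, n_missing, assumed)
    print(f"{label}: {estimate:.3f}")
```

If the substantive conclusion is the same across the plausible scenarios, it is robust to the missing data assumption; if it changes, that dependence should be reported alongside the primary analysis.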

Conclusions

In this review, we examined the past 5 years of behavioral HIV prevention RCTs reporting biological outcomes. We found that all the reviewed publications had varying degrees of missing data, and yet none reported assumptions regarding the management of missing data. Most studies used statistical methods that are most probably biased under the most common missing data assumptions. In particular, most studies reviewed used complete case analysis (n = 42, 74%; references [26–67]), eight studies (14%) used some GEE-type procedure [68–75], seven studies (12%) used maximum likelihood procedures [76–82], and none used Bayesian or multiple imputation procedures. Although we cannot comment on the direction and magnitude of the bias, the fact that approximately 88% (74% + 14%) of the studies reported possibly biased results (under the MAR assumption) is alarming. We touched on some available methodology more appropriate for dealing with incomplete data and gave some general recommendations on how to handle it.

The idea that missing data can impact the results of clinical trials is not new. Researchers in many fields have shown the risk of ignoring the complications caused by missing data [3, 5, 11, 16]. Recently, several reviews have examined the problem from different directions. One study, for example, reports on the use and abuse of missing data procedures in longitudinal data settings in developmental psychology [83], while another discusses issues of noncompliance in randomized trials [84].

We hope researchers will attend more closely to the missing data in HIV prevention trials. Methods for incomplete data are available and offer the potential for unbiased and efficient estimation. Not thinking about the missing data problem does not make the problem go away. Leaving the problem to the defaults of statistical software will, in most cases, reduce the data to the complete cases, an unsatisfactory solution to missing data.

We entreat researchers to disclose missing data rates, missing data assumptions, and the methods used to address them in published work. We hope that this practice will promote the application of proper techniques and a greater understanding of the methodological and statistical issues involved in handling incomplete data.