Abstract
Missing data such as data missing at random (MAR) are unavoidable in real data and have the potential to undermine the validity of research results. Multiple imputation is one of the most widely used MAR-based methods in education and behavioral science applications. Arbitrarily specifying imputation models can lead to incompatibility and cause biased estimation. Building on the recent developments of model-based imputation and Arnold’s compatibility work, this paper systematically summarizes when the traditional fully conditional specification (FCS) is applicable and how to specify a model-based imputation model if needed. We summarize two Compatibility Requirements to help researchers check compatibility more easily and a decision tree to check whether the traditional FCS is applicable in a given scenario. Additionally, we present a clear overview of two types of model-based imputation: the sequential and separate specifications. We illustrate how to specify model-based imputation with examples. Additionally, we provide example code of a free software program, Blimp, for implementing model-based imputation.
Missing data are unavoidable in practice and have the potential to undermine the validity of research results. Various methods have been proposed to deal with the missing at random (MAR) mechanism where the probability of missingness is only related to observed variables (Little & Rubin, 2019). Among those methods, multiple imputation is one of the most widely used MAR-based methods. Multiple imputation consists of three major steps: the imputation phase, the analysis phase, and the pooling phase (Enders, 2022; Little & Rubin, 2019; Rubin, 2004; Schafer, 1997). We focus on the imputation step for incomplete covariates in this paper. When covariates (i.e., predictors) and outcomes are incomplete, the missing outcomes or covariates are estimated and imputed in the imputation step in order to construct a complete dataset to analyze the substantive model. We refer to the model of substantive interest as the substantive model and the model used to estimate the incomplete covariates or outcomes as the imputation model. We focus on the imputation model of covariates because the substantive model itself is the imputation model of the missing outcome unless there are auxiliary variables. There are no existing imputation models for covariates and we need to specify an imputation model for the incomplete covariate(s) in order to provide the needed information. A popular imputation approach is the fully conditional specification (FCS; e.g., Enders et al., 2016, 2017; Van Buuren 2012, 2011; Van Buuren et al. 2006). FCS uses a round robin sequence of regression models where each incomplete covariate is regressed on all other covariates and the outcome, complete or previously imputed. An important feature of FCS is that the regression models used to impute the substantive model outcome and to impute each covariate take on identical functional forms. 
Because FCS imputes all model variables in an identical manner, the imputation algorithm does not need to distinguish the substantive model from the models for covariates, leading Enders (2022) to refer to FCS imputation as a type of “agnostic imputation”. For example, a three-predictor regression analysis requires four regression models to define the distributions of the missing variables (the substantive model and one regression model for each covariate). When all specified regression models are linear (i.e., without any polynomial or interaction terms) with normally distributed errors, the implied joint distribution of all covariates and outcome is multivariate normal. However, models are not necessarily linear in practice. Recent research on missing data has revealed that arbitrarily specifying an imputation model such as FCS may lead to the so-called incompatibility issue and cause noticeable bias in estimation when the substantive model contains nonlinear covariate effects, such as quadratic terms and interaction effects, or random slopes (e.g., Bartlett et al., 2015; Enders et al., 2018; Erler et al., 2016; Grund et al., 2018; Kim et al., 2015; Seaman et al., 2012; Van Buuren et al., 2006). Only when compatibility is ensured can we get accurate imputation and parameter estimation. A succinct definition of compatibility is that the joint distribution of all covariates and outcome exists (existence indicates that we can write out a joint density function and it meets all requirements of a distribution such as integrability). For example, suppose we are interested in understanding the relationship between number of hours worked (Y) and reported happiness (X). Some participants’ happiness scores (X) are missing. Based on a scatterplot of existing data, we find the relationship is quadratic. Then, the substantive model is \(Y=\beta _{0}+\beta _{1}X+\beta _{2}X^{2}+e_{Y}\) with \(e_{Y}\sim N\left (0,{\sigma _{Y}^{2}}\right )\).
To impute X (happiness scores), FCS regresses X on Y, and this regression can take various forms. Researchers usually either assume that the incomplete covariate is a linear function of Y, \(X=\gamma _{0}+\gamma _{1}Y+e_{X}\), or assume that the imputation model has a form similar to the substantive model, \(X=\gamma _{0}+\gamma _{1}Y+\gamma _{2}Y^{2}+e_{X}\). But regardless of which imputation model is chosen, the joint distribution of X and Y does not exist (i.e., we cannot write out a valid density function) because the substantive model contains a quadratic term. Throughout this paper, we use the term “traditional FCS” to refer to this round robin sequence of identically-specified regressions.
To solve the dilemma of the incompatibility issue, imputation models guaranteeing that the conditional distribution of the incomplete covariates is mathematically correct and compatible with the substantive model have been proposed and developed by researchers such as Bartlett et al., (2015), Enders et al., (2020), Erler et al. (2016; 2019), Goldstein et al., (2014), Kim et al., (2015), Kim et al., (2018), and Zhang and Wang (2017). Following Du et al., (2021), Enders et al., (2020), and Kim et al., (2018), we refer to the imputation models of covariates that are compatible with the substantive model and are compatible with all other covariates as the model-based imputation model. The rationale of the model-based imputation method is that instead of specifying the imputation models of covariates directly, we use the substantive model and the so-called covariate models that capture the relationship among the covariates to construct the imputation model for incomplete covariates. In this way, we can ensure that the joint distribution of all covariates and outcome exists. “Model-based imputation” emphasizes that imputation models are specified based on the substantive model and covariate models to ensure compatibility.
The existing model-based imputation methods were proposed in different scenarios for different types of substantive models. To help researchers better understand and use model-based imputation, the overall framework for model-based imputation methods needs to be summarized. Why, when, and how to use model-based imputation is not clear to researchers, except for a few papers focusing on the consequences of incompatibility (e.g., Van Buuren et al., 2006). In addition, there are two kinds of model-based specifications, the sequential specification and the separate specification, and they have not been systematically summarized and compared except in a recent paper by Lüdtke et al., (2020). It is necessary to synthesize previous work and relevant findings to provide a theoretical framework for methodological researchers. Therefore, the aims of this paper are to: 1) provide a decision tree and the requirements of compatibility to help researchers choose appropriate imputation methods to ensure compatibility (i.e., traditional FCS vs. model-based imputation), 2) present a clear overview of the sequential and separate specifications, and 3) note which method (e.g., sequential vs. separate) to prefer under specific circumstances, as well as how these are implemented in Blimp.
The outline of this paper is as follows. In “Compatibility and related concepts” section, we define compatibility. In “Requirements of compatibility and decision tree” section, we present the requirements of compatibility in the normal distribution family, and illustrate how to use a decision tree to check compatibility and select the appropriate imputation approach with examples. In “Model-based imputation: Single incomplete covariate” section, we illustrate how to calculate the model-based imputation model when a single covariate is incomplete. In “Model-based imputation: Multiple incomplete covariates” section, we present and compare two specification strategies for model-based imputation when multiple covariates are incomplete. In “Misuse in model-based imputation” section, we give examples in which the model-based imputation may be misused. In “Hypothetical data examples” section, we illustrate and compare the two specification strategies for model-based imputation with hypothetical data. In “Conclusion and recommendation” section, we end with several concluding remarks.
Compatibility and related concepts
The general definition of compatibility is that at least one joint distribution exists whose conditional distributions match the specified conditional distributions (Arnold et al., 1999; Arnold and Press, 1989; Liu et al., 2014; Van Buuren et al., 2006; Van Buuren, 2012). It implies that given the conditional distributions, the joint distribution exists. We can view the substantive model and the imputation model for each covariate as conditional distributions. For example, the substantive model with one covariate has a conditional distribution \(p\left (Y|X\right )\) and the imputation model for the covariate has a conditional distribution \(p\left (X|Y\right )\). The imputation model for X and substantive model are compatible when the joint distribution \(p\left (Y,X\right )\) exists. Meng (1994) referred to this type of compatibility as congeniality. As mentioned above, FCS directly specifies the imputation model (i.e., \(p\left (X|Y\right )\)), but there is no guarantee that the implied joint distribution of the covariate and outcome \(p\left (Y,X\right )\) exists. When the specified imputation models of covariates are incompatible with the substantive model, the imputation models are mathematically misspecified and can lead to biased parameter estimates and inaccurate coverage rates. For example, the simulation study by Enders et al., (2020) showed that in a two-level model with random slopes, the traditional FCS misspecified the imputation model and consistently underestimated the random slope variance by 10% to 20% even with a large sample size. The coverage rates of the fixed effects could be as low as 0.85, compared with the nominal level of 0.95, which indicates that statistical inferences could be inaccurate under this condition.
When there is more than one incomplete covariate (X), we need to consider whether the joint distribution of all incomplete covariates exists (\(p\left (\boldsymbol {X}\right )\); compatibility between all the covariate models) and whether the joint distribution of all incomplete covariates and the outcome exists (\(p\left (\boldsymbol {X},Y\right )\); compatibility between the covariate models and the substantive model). There are two ways to specify the model-based imputation model when there are multiple incomplete covariates: the sequential specification and the separate specification (we will elaborate on them later). The sequential specification can ensure compatibility between all the covariate models and the substantive model, whereas the separate specification risks failing to ensure the existence of \(p\left (\boldsymbol {X}\right )\) and \(p\left (\boldsymbol {X},Y\right )\).
Compatibility indicates whether the imputation model is mathematically correctly specified, but it cannot tell us whether the imputation model is correctly specified to capture the true data generating model. In other words, when compatibility is met, we only know the imputation model is not mathematically wrong, but it still may be misspecified.
Requirements of compatibility and decision tree
We focus on the normal distribution because it is widely used for substantive models with continuous outcomes. In this section, based on the two conclusions from Arnold and Press (1989) and Arnold et al., (1999) (see the Appendix for more details), we summarize two Compatibility Requirements. The conclusions can be used to check the compatibility of a normal substantive model and a normal traditional FCS imputation model. Additionally, based on these two conclusions, we provide a decision tree and illustrate how to use the decision tree (see Fig. 1) to check whether the traditional FCS procedure can provide a compatible imputation model or whether the model-based imputation is needed.
Compatibility requirement 1
When both the substantive model \(p\left (Y|X\right )\) and the covariate imputation model \(p\left (X|Y\right )\) have normally distributed errors and constant variances (which implies the absence of random slopes), compatibility exists if and only if the conditional means of two models are linear.
Compatibility requirement 2
After integrating out all other variables (i.e., other covariates and random effects) in both the covariate imputation model \(p\left (X|Y\right )\) and substantive model \(p\left (Y|X\right )\), if both \(p\left (Y|X\right )\) and \(p\left (X|Y\right )\) are conditional normal and either of them has a conditional variance whose highest exponent is higher than 0 (such as \(var\left (Y|X\right )=X\sigma ^{2}\) or \(X^{2}\sigma ^{2}\)), the imputation model of X (\(p\left (X|Y\right )\)) is not compatible with the substantive model \(p\left (Y|X\right )\) (i.e., the joint distribution of X and Y does not exist).
Based on these two Compatibility Requirements, we also provide a decision tree for checking compatibility (Fig. 1). We will use two examples where there is one incomplete covariate to illustrate how to use the aforementioned Compatibility Requirements and a decision tree to check compatibility. With only one covariate, we do not need to distinguish separate and sequential specifications (we will elaborate on them later) in Fig. 1 since they are special cases of model-based imputation when there are multiple incomplete covariates.
Example 1: Incompatibility example with a quadratic substantive model
In this example, we show that only model-based imputation can be used for a quadratic substantive model. As illustrated earlier, when the substantive model is \(Y=\beta _{0}+\beta _{1}X+\beta _{2}X^{2}+e_{Y}\) with \(e_{Y}\sim N\left (0,{\sigma _{Y}^{2}}\right )\) and we specify a traditional FCS imputation model, based on the aforementioned Compatibility Requirement 1, the joint distribution of X and Y does not exist. If we still assume that the joint distribution of X and Y exists, the imputed values of missing X and Y, and consequently the estimates of the regression coefficients, can be biased (Bartlett et al., 2015; Grund et al., 2018; Seaman et al., 2012). This model corresponds to Branches 1 and 2 in the decision tree. More specifically, in terms of the question “does each regression model in the path model have only linear terms”, the answer is no because the substantive model has a nonlinear term, \(X^{2}\). And because there is no need to distinguish separate and sequential specifications with one covariate, we should use model-based imputation in this example (see Fig. 2 for the route to Branches 1 and 2).
Example 2: Incompatibility example with a random slope model
In this example, we show that only model-based imputation can be used for a random slope substantive model. Consider a two-level random slope model as the substantive model with an incomplete Level-1 covariate,
\[Y_{ij}=\beta _{0}+\beta _{1}X_{ij}+u_{0j}+u_{1j}X_{ij}+e_{ij},\qquad (1)\]
where j indicates clusters (j = 1,...,J), i indicates individuals (i = 1,...,nj), nj indicates the sample size in the j th cluster, Xij indicates the Level-1 covariate, β0 is the average intercept, β1 is the average slope, u0j and u1j are the random effects with \(\boldsymbol {u_{j}}=\left (\begin {array}{c} u_{0j}\\ u_{1j} \end {array}\right )\sim MVN\left (\boldsymbol {0},\boldsymbol {D=}\left (\begin {array}{cc} \sigma _{u0}^{2} & \sigma _{u01}\\ \sigma _{u01} & \sigma _{u1}^{2} \end {array}\right )\right )\), and eij is the Level-1 error term with \(e_{ij}\sim N\left (0,{\sigma _{e}^{2}}\right )\). Because the variance of Yij conditional on Xij depends on Xij, it is not constant, and the highest exponent of Xij in \(var\left (Y_{ij}|X_{ij}\right )\) is 2 (i.e., \(X_{ij}^{2}\sigma _{u1}^{2}\) in \(var\left (Y_{ij}|X_{ij}\right )=\sigma _{u0}^{2}+X_{ij}^{2}\sigma _{u1}^{2}+2X_{ij}\sigma _{u01}+{\sigma _{e}^{2}}\)). In the random slope analysis case, FCS usually employs a “reverse random coefficient” approach where the outcome serves as a random slope predictor of the incomplete covariate (Grund et al., 2016),
\[X_{ij}=\gamma _{0}+\gamma _{1}Y_{ij}+u_{0j\left (x\right )}+u_{1j\left (x\right )}Y_{ij}+e_{ij\left (x\right )},\qquad (2)\]
where \(e_{ij\left (x\right )}\) follows a univariate normal distribution and \(\boldsymbol {u_{j\left (x\right )}}=\left (\begin {array}{c} u_{0j\left (x\right )}\\ u_{1j\left (x\right )} \end {array}\right )\) follows a multivariate normal distribution, which are similar to the substantive model. Consequently, \(var\left (X_{ij}|Y_{ij}\right )\) is also not a constant and the highest exponent of Yij is 2.
Based on either the decision tree or Compatibility Requirement 2, we can conclude that Eq. 2 is incompatible with the substantive model in this example. Based on Compatibility Requirement 2, the highest exponent of Yij in \(var\left (X_{ij}|Y_{ij}\right )\) after integrating out the random effects should be between -2 and 0. In this example, Compatibility Requirement 2 is thus violated since in both \(p\left (X_{ij}|Y_{ij}\right )\) and \(p\left (Y_{ij}|X_{ij}\right )\), the highest exponents of Yij and Xij in the respective conditional variances are 2. Additionally, the substantive model leads to Branches 1 and 2 in the decision tree because of the random slope (see Fig. 3 for the route to Branches 1 and 2). More specifically, in terms of the question “does each regression model in the path model have only random intercepts”, the answer is no. Branches 1 and 2 indicate that we can only use model-based imputation in this case.
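The variance argument in Example 2 can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names and the variance component values are illustrative assumptions, and the check only encodes the "highest exponent higher than 0" condition of Compatibility Requirement 2 for a variance written as a polynomial in the conditioning variable.

```python
def var_y_given_x(x, var_u0, var_u1, cov_u01, var_e):
    """Variance of Y_ij given X_ij after integrating out u_0j and u_1j."""
    return var_u0 + x ** 2 * var_u1 + 2.0 * x * cov_u01 + var_e


def highest_exponent(coefs, tol=1e-12):
    """coefs[p] is the coefficient of x**p in the conditional variance."""
    nonzero = [p for p, c in enumerate(coefs) if abs(c) > tol]
    return max(nonzero) if nonzero else 0


def violates_requirement_2(coefs):
    """True when the conditional variance grows with a positive power of x."""
    return highest_exponent(coefs) > 0


# Illustrative variance components (assumed values, not estimates from data).
var_u0, var_u1, cov_u01, var_e = 1.0, 0.5, 0.2, 2.0

# Polynomial form of var(Y|X): constant, linear, and quadratic coefficients.
random_slope_poly = [var_u0 + var_e, 2.0 * cov_u01, var_u1]
random_intercept_poly = [var_u0 + var_e, 0.0, 0.0]

for x in (-2.0, 0.0, 2.0):
    print(x, var_y_given_x(x, var_u0, var_u1, cov_u01, var_e))

print(violates_requirement_2(random_slope_poly))      # random slope model
print(violates_requirement_2(random_intercept_poly))  # random intercept only
```

With a random slope the conditional variance changes with X, whereas a random-intercept-only model yields a constant variance and passes this particular check.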
Substantive model with complete nonlinear terms or random slopes with complete covariates
When random slopes are only associated with complete covariates and/or all nonlinear terms are complete in the substantive model, we can indeed use FCS. However, we need to be very careful in specifying FCS. FCS is a general way to specify the imputation model where one incomplete variable is regressed on all other variables, but there are various options to implement FCS. In many cases, only one way to specify FCS can ensure compatibility.
For example, the substantive model is Yij = β0 + β1X1ij + β2X2ij + u0j + u1jX1ij + eij where X1ij is complete and X2ij is incomplete. In this example, the random slopes are only associated with the complete covariate and we can use FCS. However, when X1ij and X2ij are linearly correlated, the FCS imputation model for X2ij must be \(X_{2ij}=\gamma _{0}+\gamma _{1}Y_{ij}+\gamma _{2}X_{1ij}+u_{0j\left (x\right )}+u_{1j\left (x\right )}X_{1ij}+e_{ij\left (x\right )}\) to ensure compatibility. Neither \(X_{2ij}=\gamma _{0}+\gamma _{1}Y_{ij}+\gamma _{2}X_{1ij}+u_{0j\left (x\right )}+u_{1j\left (x\right )}Y_{ij}+e_{ij\left (x\right )}\) nor \(X_{2ij}=\gamma _{0}+\gamma _{1}Y_{ij}+\gamma _{2}X_{1ij}+u_{0j\left (x\right )}+e_{ij\left (x\right )}\) works, although these two models can also be called FCS imputation models. In another example, the substantive model is \(Y=\beta _{0}+\beta _{1}{X_{1}^{2}}+\beta _{2}X_{2}+e_{Y}\) where X1 is complete and X2 is incomplete. In this example, the quadratic term is only associated with the complete covariate and we can use FCS. However, when X1 and X2 are linearly correlated, the FCS model for X2 must be \(X_{2}=\gamma _{0}+\gamma _{1}{X_{1}^{2}}+\gamma _{2}X_{1}+\gamma _{3}Y+e_{x}\) to ensure compatibility, whereas \(X_{2}=\gamma _{0}+\gamma _{1}{X_{1}^{2}}+\gamma _{3}Y+e_{x}\) cannot guarantee compatibility. To avoid the risk of failing compatibility, we suggest always using model-based imputation instead of FCS when the substantive model has nonlinear terms and/or random slopes, regardless of whether those terms involve complete covariates. The decision tree also reflects this suggestion.
Model-based imputation: Single incomplete covariate
In this paper, we focus on calculating the model-based imputation in the Bayesian framework. To introduce the model-based imputation, we begin with the simplest case where there is only one incomplete covariate. We need to specify a model for the incomplete covariate itself p(X), which we refer to as the covariate model. Based on Bayes’ theorem, the model-based imputation model is
\[p\left (X|Y\right )=\frac {p\left (Y|X\right )p\left (X\right )}{p\left (Y\right )},\qquad (3)\]
where Y indicates the outcome and \(p\left (Y\right )\) is the marginal distribution of the outcome. \(p\left (X|Y\right )\) in Eq. 3 is the model-based imputation model for X. \(p\left (X|Y\right )\) is compatible with the substantive model \(p\left (Y|X\right )\) because the imputation model for X is calculated based on the joint distribution, \(p\left (Y,X\right )=p\left (Y|X\right )p\left (X\right )\). If there are other complete covariates or auxiliary variables Z in addition to the incomplete covariate, a covariate model that captures the relationship between the incomplete and complete covariates/auxiliaries \(p\left (X|Z\right )\) should be specified and the model-based imputation model is \(p\left (X|Y,Z\right )=p\left (Y|X,Z\right )p\left (X|Z\right )/p\left (Y|Z\right )\). Auxiliary variables are not treated differently from covariates.
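As a numerical illustration of Eq. 3, the sketch below evaluates \(p\left (Y|X\right )p\left (X\right )/p\left (Y\right )\) on a grid for a linear substantive model with a normal covariate model, and compares the resulting mean of X given Y with the standard closed-form result for two normal densities. The parameter values, the grid, and all function names are illustrative assumptions rather than anything taken from the paper or from Blimp.

```python
import math


def normal_pdf(v, mean, sd):
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def imputation_density_on_grid(y, grid, substantive_mean, sd_y, mu_x, sd_x):
    """Evaluate p(X | Y = y) = p(y | X) p(X) / p(y) on an equally spaced grid.

    p(y) is approximated by a Riemann sum of p(y | X) p(X) over the grid.
    """
    step = grid[1] - grid[0]
    unnormalized = [
        normal_pdf(y, substantive_mean(x), sd_y) * normal_pdf(x, mu_x, sd_x)
        for x in grid
    ]
    p_y = sum(unnormalized) * step
    return [u / p_y for u in unnormalized]


# Assumed parameter values for a linear substantive model Y = b0 + b1 X + e.
b0, b1, sd_y = 1.0, 0.5, 1.0
mu_x, sd_x = 0.0, 2.0          # covariate model p(X) = N(mu_x, sd_x^2)
y_obs = 3.0

grid = [-15.0 + 0.01 * i for i in range(3001)]
step = grid[1] - grid[0]
density = imputation_density_on_grid(
    y_obs, grid, lambda x: b0 + b1 * x, sd_y, mu_x, sd_x
)
grid_mean = sum(x * d for x, d in zip(grid, density)) * step

# Closed-form conditional mean for the linear-normal case.
precision = 1.0 / sd_x ** 2 + b1 ** 2 / sd_y ** 2
closed_form_mean = (mu_x / sd_x ** 2 + b1 * (y_obs - b0) / sd_y ** 2) / precision

print(grid_mean, closed_form_mean)
```

Because the substantive model here is linear, the imputation model is itself normal; the same grid function accepts a nonlinear `substantive_mean`, in which case no closed form is generally available.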
We can specify the covariate model and estimate the missing covariate via Bayesian analysis software and R packages such as BUGS (Spiegelhalter et al., 2003) and (Plummer, 2016), but they require relatively high programming skills. There are other more user-friendly R packages that we will mention in the later sections. We will focus on a free software program, Blimp, which offers a user-friendly environment for implementing model-based imputation (Keller and Enders, 2021). Besides the Blimp code that we illustrate in the Appendix, more examples are available in the Blimp user’s manual (Keller & Enders, 2021). To better illustrate how to specify Bayesian model-based imputation, we continue to use the two examples discussed in the previous section.
Example 1 (continued): Bayesian model-based imputation model for a quadratic substantive model
Previously in Example 1, we examined a substantive model with a quadratic term (\(Y|X\sim N\left (\beta _{0}+\beta _{1}X+\beta _{2}X^{2},{\sigma _{Y}^{2}}\right )\)), and concluded that the arbitrarily specified FCS imputation model for the incomplete covariate \(X|Y\sim N\left (\gamma _{0}+\gamma _{1}Y,{\sigma _{X}^{2}}\right )\) or \(X|Y\sim N\left (\gamma _{0}+\gamma _{1}Y+\gamma _{2}Y^{2},{\sigma _{X}^{2}}\right )\) is not compatible with the substantive model. Now we use model-based imputation, assuming a covariate model of \(p(X)=N\left (\gamma ,{\sigma _{X}^{2}}\right )\), and calculate the model-based imputation model by Eq. 3,
\[p\left (X|Y\right )\propto p\left (Y|X\right )p\left (X\right )\propto \exp \left \{ -\frac {\left (Y-\beta _{0}-\beta _{1}X-\beta _{2}X^{2}\right )^{2}}{2{\sigma _{Y}^{2}}}-\frac {\left (X-\gamma \right )^{2}}{2{\sigma _{X}^{2}}}\right \} .\qquad (4)\]
The kernel of p(X|Y) follows a quartic exponential family (Cobb et al., 1983; Lüdtke et al., 2020) and it is difficult to directly sample from this family. Instead, we can use the Metropolis-Hastings (MH) algorithm to empirically construct the distribution based on the kernel and estimate the missing covariate (Gilks et al., 1996; Hastings, 1970). In the MH algorithm, the sampled X moves to a new position of the target kernel (e.g., Eq. 4) given its current position using a jumping distribution, and keeps updating. Bayesian software and Bayesian R packages can easily handle this case. Specifically, we provide the Blimp code for this example in the Appendix.
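To make the MH step concrete, here is a minimal random-walk Metropolis sketch for imputing a single missing X from the kernel implied by a quadratic substantive model and a normal covariate model. It is an illustration under simplifying assumptions, not the algorithm used by Blimp or any other package: the model parameters are held fixed at assumed values (in a full Bayesian analysis they would be updated in the same sampler), and the proposal standard deviation, seed, and function names are arbitrary choices.

```python
import math
import random


def log_kernel(x, y, b0, b1, b2, var_y, gamma, var_x):
    """Log of p(y | x) p(x) up to a constant: quadratic substantive model
    with normal errors times a normal covariate model N(gamma, var_x)."""
    resid = y - (b0 + b1 * x + b2 * x ** 2)
    return -resid ** 2 / (2.0 * var_y) - (x - gamma) ** 2 / (2.0 * var_x)


def mh_draws(y, n_draws, start, step_sd, rng, **params):
    """Random-walk Metropolis draws of a missing x given an observed y."""
    x = start
    current = log_kernel(x, y, **params)
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step_sd)
        candidate = log_kernel(proposal, y, **params)
        # Symmetric proposal: accept with probability min(1, kernel ratio).
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            x, current = proposal, candidate
        draws.append(x)
    return draws


# Assumed (fixed) parameter values for illustration only.
params = dict(b0=0.0, b1=0.0, b2=1.0, var_y=1.0, gamma=0.0, var_x=1.0)
rng = random.Random(2024)
draws = mh_draws(y=4.0, n_draws=5000, start=2.0, step_sd=0.5, rng=rng, **params)
kept = draws[1000:]  # discard burn-in
mean_abs = sum(abs(d) for d in kept) / len(kept)
print(mean_abs)
```

With these assumed values the kernel has two modes, near plus and minus the square root of 3.5, which no single normal regression of X on Y could reproduce. A simple random-walk chain may stay in one mode for long stretches, so the sketch summarizes the draws by their absolute value.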
Example 2 (continued): Bayesian model-based imputation model for a random slope model
We considered the multilevel substantive model previously in Example 2. Now we use matrices to express the equation of a random slope model for convenience,
where Yj is an nj × 1 vector of the outcome in the j th cluster (j = 1,...,J), \(\boldsymbol {\beta }=\left (\begin {array}{c} \beta _{0}\\ \beta _{1} \end {array}\right )\)is a 2 × 1 vector of the fixed effects, \(\boldsymbol {u_{j}}=\left (\begin {array}{c} u_{0j}\\ u_{1j} \end {array}\right )\) is a 2 × 1 vector of the random effects in the j th cluster, Xj is an nj × 1 matrix for the covariate in the j th cluster, and ej is an nj × 1 independently normally distributed error term with E(ej) = 0 and \(var(\boldsymbol {e_{j}})={\sigma _{e}^{2}}\boldsymbol {I_{nj}}\). The likelihood by augmenting the random effects uj is \(p(\boldsymbol {Y},\boldsymbol {u}|{\sigma _{e}^{2}},\boldsymbol {\beta },\boldsymbol {X},\boldsymbol {D})=\underset {j=1}{\overset {J}{\prod }}f(\boldsymbol {Y_{j}}|\boldsymbol {u_{j}},{\sigma _{e}^{2}},\boldsymbol {\beta },\boldsymbol {X_{j}})f(\boldsymbol {u_{j}}|\boldsymbol {D})\). To implement the model-based imputation, we assume \(\boldsymbol {X_{j}}\sim MN\left (\alpha \boldsymbol {1_{nj}},{\sigma _{X}^{2}}\boldsymbol {I_{nj}}\right )\) as the covariate model. The model-based imputation model for Xj is calculated by Eq. 3,
Model-based imputation: Multiple incomplete covariates
When multiple covariates are incomplete, we need to make sure that the joint distribution of all the incomplete covariates and outcome exists. In other words, the imputation model of each covariate should be compatible with the substantive model, and all the imputation models of incomplete covariates should be compatible with each other. In this section, we introduce and compare two ways to specify covariate models in the model-based imputation framework when multiple covariates are incomplete: the sequential and separate specification approaches. There is a third way to specify the imputation model and ensure compatibility in the context of multiple incomplete covariates: the substantive model based version of joint modeling (Carpenter & Kenward, 2013). Since we do not focus on joint modeling in this paper, we present it in the Appendix.
Sequential specification
In the first approach, the joint distribution of all the incomplete covariates, p(X), is specified as the covariate model. Considering the difficulty of specifying a multivariate distribution, Ibrahim et al., (1999) proposed to factor the joint distribution into a sequence of univariate distributions,
\[p\left (\boldsymbol {X}\right )=p\left (X_{1}|X_{2},\ldots ,X_{K}\right )p\left (X_{2}|X_{3},\ldots ,X_{K}\right )\cdots p\left (X_{K}\right ),\]
where K is the number of incomplete covariates.
This approach is referred to as the sequential specification. The model-based imputation model for the k th covariate Xk using the sequential specification, with \(X_{>s}\) denoting the covariates that follow Xs in the sequence, is
\[p\left (X_{k}|\boldsymbol {X_{-k}},Y\right )\propto p\left (Y|\boldsymbol {X}\right )\underset {s=1}{\overset {k}\prod }p\left (X_{s}|X_{>s}\right ),\]
where X−k denotes all the covariates except Xk. If there are complete covariates or auxiliary variables, we specify the joint distribution of the incomplete covariates conditional on the complete covariates or auxiliary variables, p(X|Z), where Z is the set of complete covariates or complete auxiliary variables. This specification has been widely used in the imputation literature (e.g., Erler et al., 2016; Lüdtke et al. 2020). For different research questions, it may be more reasonable to use some specific orders for the sequential specification. But it has been found that the sequential specification is quite robust against changes in the ordering (Chen and Ibrahim, 2001; Zhu & Raghunathan, 2015). The sequential specification is available in the R packages JointAI (Erler et al., 2019) and mdmb (Robitzsch & Luedtke, 2019), and the software program Blimp (Keller & Enders, 2021). Additionally, we show how to accommodate categorical variables in the sequential specification in the Appendix.
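The sequential construction can be sketched for two incomplete covariates. The example below assumes, purely for illustration, a moderated substantive model, a covariate model for X1 that is quadratic in X2, and a normal marginal model for X2; the parameter values and function names are hypothetical. It shows that the imputation kernel for X1 needs only the factors of the joint density that involve X1, whereas the kernel for X2 needs all of them.

```python
import math


def log_normal(v, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mean) ** 2 / (2.0 * var)


def log_substantive(y, x1, x2, p):
    b0, b1, b2, b3 = p["beta"]
    return log_normal(y, b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2, p["var_y"])


def log_x1_given_x2(x1, x2, p):
    g0, g1, g2 = p["gamma_x1"]
    return log_normal(x1, g0 + g1 * x2 + g2 * x2 ** 2, p["var_x1"])


def log_x2(x2, p):
    return log_normal(x2, p["mean_x2"], p["var_x2"])


def log_joint(y, x1, x2, p):
    """log p(Y, X1, X2) = log p(Y | X1, X2) + log p(X1 | X2) + log p(X2)."""
    return log_substantive(y, x1, x2, p) + log_x1_given_x2(x1, x2, p) + log_x2(x2, p)


def log_kernel_x1(x1, y, x2, p):
    """Imputation kernel for X1: p(X2) does not involve X1 and is dropped."""
    return log_substantive(y, x1, x2, p) + log_x1_given_x2(x1, x2, p)


def log_kernel_x2(x2, y, x1, p):
    """Imputation kernel for X2: every factor involves X2, so all are kept."""
    return log_joint(y, x1, x2, p)


# Assumed parameter values for illustration only.
p = {
    "beta": (0.5, 1.0, -0.5, 0.3), "var_y": 1.0,
    "gamma_x1": (0.0, 0.4, 0.2), "var_x1": 1.5,
    "mean_x2": 0.0, "var_x2": 2.0,
}

# Differences in the X1 kernel match differences in the full joint density.
d_joint = log_joint(1.0, 0.7, -0.3, p) - log_joint(1.0, -1.2, -0.3, p)
d_kernel = log_kernel_x1(0.7, 1.0, -0.3, p) - log_kernel_x1(-1.2, 1.0, -0.3, p)
print(d_joint, d_kernel)
```

Either kernel could then be passed to a sampler such as Metropolis-Hastings to draw imputations for the corresponding covariate.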
Separate specification
Alternatively, we can specify the univariate conditional distributions for each incomplete covariate one by one as the covariate model, instead of focusing on the joint distribution of the covariates (Bartlett et al., 2015; Enders et al., 2020). The univariate conditional distribution is specified as regressing each incomplete covariate on all other incomplete covariates, p(Xk|X−k). Then the model-based imputation model is
\[p\left (X_{k}|\boldsymbol {X_{-k}},Y\right )\propto p\left (Y|\boldsymbol {X}\right )p\left (X_{k}|\boldsymbol {X_{-k}}\right ).\]
We refer to this approach as the separate specification. If there are complete covariates or auxiliary variables, we specify the univariate conditional distribution of each incomplete covariate conditional on the complete covariates or auxiliary variables, p(Xk|X−k,Z) where Z is the set of complete covariates or complete auxiliary variables. The separate specification is available in the R package smcfcs (Bartlett & Keogh, 2019) and the software program Blimp (Keller & Enders, 2021). However, the separate specification has two issues. First, the covariate models p(Xk|X−k) may not be mutually compatible and lead to biased estimation. That is, based on the univariate conditional distributions of the covariates, the joint distribution of the incomplete covariates (p(X)) does not exist, and consequently the joint distribution of all the incomplete covariates and the outcome (p(X,Y )) does not exist. Based on the aforementioned Compatibility Requirement 1, if all the covariate models meet the following requirements: 1) the mean structure is linear, 2) normal errors, 3) no random slopes, and 4) constant variance, the covariate models are mutually compatible. If we can ensure compatibility among the covariate models (i.e., p(X) exists), the sequential specification and separate specification approaches are equivalent and would lead to the same imputation model because \(p\left (X_{k}|\boldsymbol {X_{-k}}\right )\propto \underset {s=1}{\overset {k}\prod }p\left (X_{s}|X_{>s}\right )\). Second, the separate specification is over-parameterized. For example, when two incomplete covariates follow a bivariate normal distribution, there are 5 non-redundant pieces of information: 2 means and 3 variance-covariance components. If we specify a simple linear regression in both p(X1|X2) and p(X2|X1), we estimate 2 intercepts, 2 slopes, and 2 residual variances. Thus one more parameter is estimated compared to the number of pieces of information in the data.
The linear slope in p(X1|X2) is determined by the linear slope in p(X2|X1). If we freely estimate both slopes, it may cause problems especially when informative priors are used because the priors of two slopes may contain conflicting information (Hughes et al., 2014; Liu et al., 2014). In addition, because the separate specification approach estimates more parameters, it is a less efficient approach.
We do not propose to only use the separate specification or only use the sequential specification. The decision depends on our assumptions and what is known. If the joint distribution of covariates p(X) is already known or is assumed known, we can use p(X) to calculate all the univariate marginal and conditional distributions (e.g., \(p\left (X_{2}|X_{1}\right )\), \(p\left (X_{1}|X_{2}\right )\), \(p\left (X_{1}\right )\), and \(p\left (X_{2}\right )\)), and the sequential specification and separate specification are interchangeable, and either can be used though one might be easier to use than the other. Particularly, if we know or assume p(X) is a multivariate normal distribution, we can use either the sequential specification or the separate specification easily. In this case, we just specify linear regressions with normally distributed residuals for \(p\left (X_{2}|X_{1}\right )\), \(p\left (X_{1}|X_{2}\right )\), \(p\left (X_{1}\right )\), and \(p\left (X_{2}\right )\). It is easy to prove that both of these specifications reach the same conclusion. In the case where we are not sure of p(X) (e.g., Example 3), the sequential specification and separate specification provide a different set of considerations and challenges. When we do not know the correct form of p(Xk|X−k), the separate specification can only arbitrarily specify the covariate models. The consequence is either the covariate models are mutually incompatible, or the separate specification uses linear regressions to ensure compatibility but misspecifies the relationship among the covariates. In this case, we need to use the sequential specification to guarantee compatibility and accommodate the nonlinear relationship among covariates (see Branch 1 in Fig. 1). Although the sequential specification can accommodate the nonlinear relationship among the covariates, it is based on the specific assumption of the nonlinear relationship.
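The over-parameterization can be seen by deriving both regressions from the five bivariate normal moments. The sketch below uses standard bivariate normal identities; the moment values and function name are illustrative assumptions. It shows that the six regression parameters satisfy one constraint, so the slope of X1 on X2 is fixed once the other regression and the two residual variances are known.

```python
def regressions_from_bivariate_normal(mean1, mean2, var1, var2, cov12):
    """Return (intercept, slope, residual variance) for X1 on X2 and X2 on X1,
    both implied by the same five bivariate normal moments."""
    slope_1_on_2 = cov12 / var2
    slope_2_on_1 = cov12 / var1
    return {
        "x1_on_x2": (mean1 - slope_1_on_2 * mean2, slope_1_on_2,
                     var1 - cov12 ** 2 / var2),
        "x2_on_x1": (mean2 - slope_2_on_1 * mean1, slope_2_on_1,
                     var2 - cov12 ** 2 / var1),
    }


# Five assumed moments: 2 means, 2 variances, 1 covariance.
reg = regressions_from_bivariate_normal(mean1=1.0, mean2=2.0,
                                        var1=4.0, var2=9.0, cov12=3.0)
_, slope_12, resvar_1 = reg["x1_on_x2"]
_, slope_21, resvar_2 = reg["x2_on_x1"]

# The sixth regression parameter is redundant:
# slope(X1 on X2) * resvar(X2|X1) equals slope(X2 on X1) * resvar(X1|X2).
implied_slope_12 = slope_21 * resvar_1 / resvar_2
print(slope_12, implied_slope_12)
```

Freely estimating both slopes, as the separate specification does, ignores this constraint, which is why conflicting priors on the two slopes can cause problems.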
It cannot guarantee that the specified covariate models and the joint distribution of the covariates are the true data generating models. Indeed, no matter which approach we use, whether the specified model is the true data generating model is untestable. We only can guarantee that the specification is mathematically valid.
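The relationship between the two slopes, and the incompatibility that arises once a nonlinear term is added, can be checked numerically. The sketch below is illustrative only: the parameter values and function names are hypothetical, and the check uses the factorization condition from Arnold and Press (1989), namely that the ratio p(X1|X2)/p(X2|X1) must factor into a function of X1 times a function of X2. This is a necessary condition for compatibility, not a full proof of it.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)


def max_nonseparability(log_f, log_g, grid):
    """Largest |R(a,b) - R(a,b0) - R(a0,b) + R(a0,b0)| over the grid, where
    R(a,b) = log f(a|b) - log g(b|a). It is zero (up to rounding) exactly when
    R is additively separable on the grid, i.e. f/g = u(x1) * v(x2) there."""
    def ratio(a, b):
        return log_f(a, b) - log_g(b, a)

    a0 = b0 = grid[0]
    return max(
        abs(ratio(a, b) - ratio(a, b0) - ratio(a0, b) + ratio(a0, b0))
        for a in grid
        for b in grid
    )


def implied_reverse_regression(m2, v2, g0, g1, s1_sq):
    """Given X2 ~ N(m2, v2) and X1 | X2 ~ N(g0 + g1*X2, s1_sq), return the
    intercept, slope, and residual variance of the implied X2 | X1."""
    var_x1 = g1**2 * v2 + s1_sq
    slope = g1 * v2 / var_x1
    intercept = m2 - slope * (g0 + g1 * m2)
    resid_var = v2 - slope**2 * var_x1
    return intercept, slope, resid_var


# Hypothetical parameter values, chosen only for illustration.
M2, V2, G0, G1, S1_SQ = 5.0, 2.0, 10.0, -0.5, 1.0
C0, C1, S2_SQ = implied_reverse_regression(M2, V2, G0, G1, S1_SQ)
GRID = [-2.0, -1.0, 0.0, 1.0, 2.0]


def log_f_linear(x1, x2):      # log p(X1 | X2) with a linear mean
    return log_normal_pdf(x1, G0 + G1 * x2, math.sqrt(S1_SQ))


def log_f_quadratic(x1, x2):   # log p(X1 | X2) with a quadratic mean
    return log_normal_pdf(x1, G0 + G1 * x2 + 0.2 * x2**2, math.sqrt(S1_SQ))


def log_g_implied(x2, x1):     # log p(X2 | X1) using the implied slope
    return log_normal_pdf(x2, C0 + C1 * x1, math.sqrt(S2_SQ))


def log_g_free(x2, x1):        # log p(X2 | X1) using an arbitrary slope
    return log_normal_pdf(x2, C0 - 0.3 * x1, math.sqrt(S2_SQ))


linear_implied = max_nonseparability(log_f_linear, log_g_implied, GRID)
linear_free = max_nonseparability(log_f_linear, log_g_free, GRID)
quadratic_linear = max_nonseparability(log_f_quadratic, log_g_implied, GRID)
print(linear_implied, linear_free, quadratic_linear)
```

With these values, only the linear pair that uses the implied reverse slope passes the check. The freely chosen slope and the quadratic mean with constant variance both leave a cross term in X1 and X2 that cannot be factored out.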
We use Example 3 to illustrate how to use the decision tree to select the appropriate imputation method when there are multiple incomplete covariates and the difference between the separate and sequential specifications. We use Example 4 to illustrate the imputation specification when there are multiple regressions in a path model.
Example 3: Moderated regression with nonlinearly related covariates
In this example, we show that only the sequential specification can capture the nonlinear relationship among covariates. We assume a substantive model with an interaction term, Y = β0 + β1X1 + β2X2 + β3X1X2 + eY (i.e., \(p\left (Y|X_{1},X_{2}\right )=N\left (\beta _{0}+\beta _{1}X_{1}+\beta _{2}X_{2}+\beta _{3}X_{1}X_{2},{\sigma _{Y}^{2}}\right )\)). Both X1 and X2 are incomplete. First, as demonstrated in the decision tree (the answer is no for “does each regression model in the path model have only linear terms”), we cannot use FCS since the substantive model has an incomplete nonlinear term. Second, if any theory or assumption reveals that X1 and X2 are not linearly related, we need to use the sequential specification (the answer is no for “is there only one regression model with linearly related covariates” in the decision tree). See Fig. 4 for the route to Branch 1. For example, suppose the two covariates are nonlinearly related such that \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}{X_{2}^{2}},\sigma _{X1}^{2}\right )\). In terms of the covariate model of X2, if we consider the separate specification, regardless of whether we assume \(p\left (X_{2}|X_{1}\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}{X_{1}^{2}},\sigma _{X2}^{2}\right )\) or \(p\left (X_{2}|X_{1}\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1},\sigma _{X2}^{2}\right )\), \(p\left (X_{1}|X_{2}\right )\) and \(p\left (X_{2}|X_{1}\right )\) are incompatible because \(p\left (X_{1}|X_{2}\right )\) contains a nonlinear term and constant variance (Compatibility Requirement 1). In the sequential specification, we assume \(p\left (X_{2}\right )=N\left (\gamma _{0X2},\sigma _{X2}^{2}\right )\). Based on Eq. 8, we calculate the model-based imputation models for X1 and X2 by \(p\left (X_{1}|X_{2},Y\right )\propto p\left (Y|X_{1},X_{2}\right )p\left (X_{1}|X_{2}\right )\) and \(p\left (X_{2}|X_{1},Y\right )\propto p\left (Y|X_{1},X_{2}\right )p\left (X_{1}|X_{2}\right )p\left (X_{2}\right )\). Again, although the sequential specification can be used to ensure compatibility based on the assumed \(p\left (X_{1}|X_{2}\right )\), the resultant imputation model may or may not reflect the true underlying data generating process. We provide the Blimp code for this example in Appendix A.
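To make the imputation model for X1 concrete, note that both p(Y|X1,X2) and p(X1|X2) are normal and linear in X1, so their product is again normal in X1. The sketch below works out that normal distribution and checks it against brute-force integration on a grid. It is an illustration under stated assumptions, not a description of how Blimp samples: the function names are hypothetical, the β values are borrowed from the hypothetical data example later in the paper, and the γ values and standard deviations are made up.

```python
import math
import random


def x1_full_conditional(y, x2, beta, sd_y, gamma, sd_x1):
    """Mean and SD of the normal p(X1 | X2, Y) implied by
    p(Y | X1, X2) * p(X1 | X2) in the moderated regression."""
    b0, b1, b2, b3 = beta
    g0, g1, g2 = gamma
    slope = b1 + b3 * x2                    # coefficient of X1 in the Y model
    resid = y - b0 - b2 * x2                # part of Y left for X1 to explain
    prior_mean = g0 + g1 * x2 + g2 * x2**2  # mean of p(X1 | X2)
    precision = slope**2 / sd_y**2 + 1.0 / sd_x1**2
    mean = (slope * resid / sd_y**2 + prior_mean / sd_x1**2) / precision
    return mean, math.sqrt(1.0 / precision)


def log_product(x1, y, x2, beta, sd_y, gamma, sd_x1):
    """Unnormalized log of p(Y | X1, X2) * p(X1 | X2) as a function of x1."""
    b0, b1, b2, b3 = beta
    g0, g1, g2 = gamma
    mean_y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
    mean_x1 = g0 + g1 * x2 + g2 * x2**2
    return (-((y - mean_y) ** 2) / (2 * sd_y**2)
            - ((x1 - mean_x1) ** 2) / (2 * sd_x1**2))


def grid_moments(y, x2, beta, sd_y, gamma, sd_x1, lo, hi, n=20001):
    """Mean and SD of the same distribution by numerical integration."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    logs = [log_product(x, y, x2, beta, sd_y, gamma, sd_x1) for x in xs]
    top = max(logs)
    ws = [math.exp(v - top) for v in logs]
    total = sum(ws)
    mean = sum(x * w for x, w in zip(xs, ws)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, ws)) / total
    return mean, math.sqrt(var)


BETA = (10.0, 0.5, -0.5, -0.1)   # substantive-model coefficients
GAMMA = (1.0, 0.5, 0.1)          # hypothetical covariate-model coefficients
Y_OBS, X2_OBS = 9.0, 2.0         # one hypothetical case with X1 missing

mean, sd = x1_full_conditional(Y_OBS, X2_OBS, BETA, 1.0, GAMMA, 1.0)
check_mean, check_sd = grid_moments(
    Y_OBS, X2_OBS, BETA, 1.0, GAMMA, 1.0, mean - 10.0, mean + 10.0
)
imputed_x1 = random.Random(398721).gauss(mean, sd)  # one imputation draw
print(mean, sd, check_mean, check_sd, imputed_x1)
```

The same shortcut is not available for X2 when γ2X1 is nonzero: as a function of X2, p(X1|X2) has a quartic exponent, so p(X2|X1,Y) is not normal and would need a general-purpose sampler.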
Example 4: Moderated mediation model
It is frequently of interest to fit large path models that may contain nonlinear relationships among upstream variables that later serve as covariates in a downstream regression. To illustrate the sequential specification in this case, we provide an example of a moderated mediation model. When there are multiple regressions in a model, such as a mediation model, we need to consider each regression model. Suppose the model is a moderated mediation model M = β0M + β1MX1 + β2MX2 + β3MX1X2 + eM and Y = β0Y + β1YM + β2YX3 + β3YMX3 + eY. X1, X2, X3, and M are incomplete. First, as demonstrated in the decision tree (the answer is no for “does each regression model in the path model have only linear terms”), we cannot use FCS because both the M model and Y model have interaction terms and thus X1, X2, X3, M, and Y cannot come from a multivariate normal distribution. Second, since this is a mediation model with two regressions, a carefully specified separate specification is feasible but too complex. With the sequential specification, \(p\left (X_{1},X_{2},X_{3},M,Y\right )=p\left (X_{1}|X_{2}\right )p\left (X_{2}\right )p\left (M|X_{1},X_{2}\right )p\left (X_{3}|M\right )p\left (Y|M,X_{3}\right )\). If any theory or assumption reveals that the covariates in the M model or Y model are linearly correlated, we can specify \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\), \(p\left (X_{2}\right )=N\left (\gamma _{0X2},\sigma _{X2}^{2}\right )\), \(p\left (X_{3}|M\right )=N\left (\gamma _{0X3}+\gamma _{1X3}M,\sigma _{X3}^{2}\right )\), and \(p\left (M\right )=N\left (\gamma _{0M},{\sigma _{M}^{2}}\right )\). Since the sequential specification can clearly decompose the joint distribution into a sequence of regressions, we suggest using the sequential specification in path models even when covariates are linearly related. 
Hence, in the decision tree, when the answer is no for “are there multiple regression models in the path model with only linear terms”, the suggested specification is the sequential specification (see Fig. 5 for the route to Branch 1). We provide the Blimp code for this example in Appendix A. If any theory or assumption reveals that the covariates in the M model or Y model are nonlinearly correlated, we have to use the sequential specification to capture the nonlinear relation among covariates. With the sequential specification, we calculate the model-based imputation models for X1 by \(p\left (X_{1}|X_{2},M\right )\propto p\left (X_{1}|X_{2}\right )p\left (M|X_{1},X_{2}\right )\), for X2 by \(p\left (X_{2}|X_{1},M\right )\propto p\left (X_{1}|X_{2}\right )p\left (X_{2}\right )p\left (M|X_{1},X_{2}\right )\), for M by \(p\left (M|X_{1},X_{2},X_{3},Y\right )\propto p\left (M|X_{1},X_{2}\right )p\left (X_{3}|M\right )p\left (Y|M,X_{3}\right )\), and for X3 by \(p\left (X_{3}|M,Y\right )\propto p\left (X_{3}|M\right )p\left (Y|M,X_{3}\right )\).
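The four imputation models above follow one mechanical rule: the imputation model of a variable is proportional to the product of every factor of the sequential factorization in which that variable appears, on either side of the conditioning bar. A minimal sketch of that rule, using hypothetical helper names and the factorization of Example 4:

```python
# Sequential factorization of Example 4: (variable, conditioning set).
FACTORS = [
    ("X1", ("X2",)),
    ("X2", ()),
    ("M", ("X1", "X2")),
    ("X3", ("M",)),
    ("Y", ("M", "X3")),
]


def imputation_factors(target, factors):
    """Return the factors that involve `target`; their product is
    proportional to the full conditional distribution of `target`."""
    return [
        (var, parents)
        for var, parents in factors
        if var == target or target in parents
    ]


def format_factor(var, parents):
    """Render one factor as text, e.g. p(M|X1,X2) or p(X2)."""
    if parents:
        return f"p({var}|{','.join(parents)})"
    return f"p({var})"


def imputation_model(target, factors=FACTORS):
    """Text form of the product of factors used to impute `target`."""
    return "".join(format_factor(v, p) for v, p in imputation_factors(target, factors))


for name in ("X1", "X2", "M", "X3"):
    print(name, "is imputed from", imputation_model(name))
```

Factors that do not mention the target are constant with respect to it and drop out of the proportionality, which is why X3 needs only two factors while M needs three.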
Misuse in model-based imputation
Regardless of whether we use the sequential or the separate specification, we should be careful in how we specify covariate models and use them for imputation. The specification of the covariate models and imputation models has to obey probability rules. We use a single-level model with three incomplete covariates as an example. Under the separate specification, \(p\left (X_{3}|X_{1},X_{2}\right )\) is the covariate model for X3, and the model-based imputation of X3 should be calculated as \(p\left (X_{3}|X_{1},X_{2},Y\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{3}|X_{1},X_{2}\right )\), but not \(p\left (X_{3}|X_{1},X_{2},Y\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{3}|X_{1}\right )\) or \(p\left (X_{3}|X_{1},X_{2},Y\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{3}\right )\), unless we can prove that \(p\left (X_{3}|X_{1},X_{2}\right )\propto p\left (X_{3}|X_{1}\right )\) or \(p\left (X_{3}|X_{1},X_{2}\right )\propto p\left (X_{3}\right )\). Under the sequential specification, the model-based imputation of X3 is \(p\left (X_{3}|Y,X_{1},X_{2}\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{1}|X_{2},X_{3}\right )p\left (X_{2}|X_{3}\right )p\left (X_{3}\right )\). We will see in Example 5 that when the substantive model is complex, with a multilevel structure and/or multiple covariates, it is easy to make mistakes in choosing the covariate models, especially when cluster means are considered.
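Both expressions for the imputation model of X3 follow from Bayes' theorem. The short derivation below is added for illustration and uses only the distributions already named in this section. Every density is read as a function of X3 alone, so any factor free of X3 is absorbed into the proportionality constant.

```latex
\begin{align*}
p(X_3 \mid X_1, X_2, Y)
  &= \frac{p(Y \mid X_1, X_2, X_3)\, p(X_3 \mid X_1, X_2)}{p(Y \mid X_1, X_2)} \\
  &\propto p(Y \mid X_1, X_2, X_3)\, p(X_3 \mid X_1, X_2)
  && \text{(separate specification)} \\[4pt]
p(X_3 \mid X_1, X_2, Y)
  &= \frac{p(Y, X_1, X_2, X_3)}{p(Y, X_1, X_2)} \\
  &\propto p(Y \mid X_1, X_2, X_3)\, p(X_1 \mid X_2, X_3)\, p(X_2 \mid X_3)\, p(X_3)
  && \text{(sequential specification)}
\end{align*}
```

In the sequential line every factor depends on X3, so none of them can be dropped. In the separate line, replacing \(p(X_3 \mid X_1, X_2)\) with \(p(X_3 \mid X_1)\) is valid only if X3 is conditionally independent of X2 given X1.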
Example 5: A two-level model with multiple covariates
We consider a two-level model with two incomplete Level-1 covariates, X1ij and X2ij,
Since the substantive model contains random slopes, FCS is not applicable here. Instead, we can use the sequential approach or separate approach if we assume X1ij and X2ij are linearly related. We use latent cluster mean centering in the covariate models to partition covariates into within- and between-cluster components. Therefore, X1ij and X2ij consist of latent cluster means and within-cluster deviations. Latent cluster means are employed to accommodate unequal group sizes (Grund et al., 2018). We further define the within- and between-cluster parts of the covariates as normally distributed. Specifically, the within-cluster parts are within-cluster deviation scores, distributed as \(p\left (X_{1ij},X_{2ij}\right )=MN\left (\left (\begin {array}{c} \mu _{1j}\\ \mu _{2j} \end {array}\right ),\boldsymbol {\varSigma }_{w}\right )\). The between-cluster parts are the latent cluster means, distributed as \(p\left (\mu _{1j},\mu _{2j}\right )=MN\left (\left (\begin {array}{c} \mu _{1}\\ \mu _{2} \end {array}\right ),\boldsymbol {\varSigma }_{b}\right )\). X1ij, X2ij, μ1j, and μ2j are the variables that need to be estimated in this example.
We begin with the sequential specification, in which there are four covariate models in this example,
We summarize how to impute X1ij, X2ij, μ1j, and μ2j using these covariate models in Eqs. 10-13 respectively in Table 1. For example, the correct imputation model of μ2j is
It may look as though there are simpler ways to impute μ2j. First, how about \(p\left (\mu _{2j}|\mu _{1j},\boldsymbol {X_{1j}},\boldsymbol {X_{2j}}\right )\propto p\left (\boldsymbol {X_{1j}}|\mu _{1j},\mu _{2j},\boldsymbol {X_{2j}}\right )p\left (\mu _{2j}\right )\)? This specification is mathematically invalid unless \(p\left (\mu _{2j}\right )\propto p\left (\mu _{2j}|\mu _{1j},\boldsymbol {X_{2j}}\right )\). Second, how about \(p\left (\mu _{2j}|\mu _{1j},\boldsymbol {X_{2j}}\right )\propto p\left (\boldsymbol {X_{2j}}|\mu _{2j}\right )p\left (\mu _{1j}|\mu _{2j}\right )p\left (\mu _{2j}\right )\)? This specification is mathematically correct, but it incorporates less information than Eq. 14 during iterations. Similarly, the imputation model of X2ij is proportional to \(p\left (Y_{ij}|u_{0j},u_{1j},X_{1ij},X_{2ij}\right )p\left (X_{1ij}|\mu _{1j},\mu _{2j},X_{2ij}\right )p\left (X_{2ij}|\mu _{2j}\right )\) but not proportional to \(p\left (Y_{ij}|u_{0j},u_{1j},X_{1ij},X_{2ij}\right )p\left (X_{2ij}|\mu _{2j}\right )\). When there are three incomplete Level-1 covariates, the use of covariate models in the sequential specification is illustrated in Table 1. As shown in Table 1, when there are more incomplete predictors, the sequential specification becomes much more complex.
In the separate specification, we also need to be careful about specifying and using the appropriate covariate model. In this example, we have four covariate models under separate specification,
The imputation model of μ1j is calculated as
but not \(p\left (\boldsymbol {X_{1j}}|\mu _{1j},\mu _{2j},\boldsymbol {X_{2j}}\right )p\left (\mu _{2j}|\mu _{1j}\right )\). We summarize the use of covariate models in the separate specification in Table 1 with two or three incomplete covariates. As shown in Table 1, when there are more incomplete predictors and the separate specification is applicable, the separate specification has a simpler form than the sequential specification and can be easier to use. But if one relies on software to impute missing observations, the difference in complexity in specifying imputation models is not a concern. For example, in Blimp, users only need to specify covariate models and Blimp will automatically impute the missing observations based on the correct imputation models. We provide the Blimp code for this example in Appendix A.
Hypothetical data examples
Two hypothetical data examples are presented in order to illustrate and compare the separate specification and sequential specification when two covariates are linearly related and nonlinearly related, respectively. In both examples, the research goal is to test for the effect of family support (X2) in moderating the relationship between life event stress (X1) and depression (Y ). A reasonable model is that life event stress leads to depression but that strong family support might buffer the effects of stress. Hence, the true regression model is Y = 10 + 0.5X1 − 0.5X2 − 0.1X1 × X2 + eY. Because the substantive model contains incomplete nonlinear terms, the traditional FCS method causes incompatibility. We will compare different model-based imputation specifications with FCS in this section.
In the first scenario, we specify the relationship between life event stress (X1) and family support (X2) as X2 = 10 − X1 + eX2 to generate data. \(10^{6}\) observations were simulated. 25% of the observations in both X1 and X2 were made missing if participants’ depression scores (Y) were higher than the group mean. Hence, X1 and X2 are missing at random (MAR). The ordinary least squares (OLS) estimates and standard errors of the coefficients in the substantive model from the complete data are presented in Table 2. Given such a large sample size, the estimates were almost the same as the population values and the standard errors were small. Since X1 and X2 are linearly related, we can use either the separate or sequential specification to impute the missing X1 and X2 (see Branches 1 and 2). In the separate specification, we specified the covariate models as \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}|X_{1}\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1},\sigma _{X2}^{2}\right )\); and in the sequential specification, we specified the covariate models as \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}\right )=N\left (\gamma _{MX2},\sigma _{MX2}^{2}\right )\). Additionally, we also consider FCS, which specifies \(p\left (X_{1}|X_{2},Y\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}Y,\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}|X_{1},Y\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}Y,\sigma _{X2}^{2}\right )\). We applied multiple imputation with 20 imputed datasets from the posterior samples. The point estimates and standard errors of the regression coefficients in the substantive model β from these specifications are presented in Table 2. 
The separate specification and sequential specification provided similar parameter estimates and standard errors of the substantive model regression coefficients. Their estimates had only small biases (i.e., the relative biases, \(\frac {\hat {\beta }-\beta }{\beta }\times 100\%\), were smaller than 5%). But the estimates from FCS had biases larger than 5%.
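The data-generating setup of the first scenario can be sketched as follows. Several details are assumptions made only so that the example runs, because the text does not state them: X1 is drawn from a standard normal distribution, all residual standard deviations are 1, the sample is smaller than in the paper, and the 25% missing rate is produced by deleting values at random among cases whose Y exceeds the sample mean. The function names are hypothetical.

```python
import numpy as np

TRUE_BETA = np.array([10.0, 0.5, -0.5, -0.1])  # Y = 10 + 0.5*X1 - 0.5*X2 - 0.1*X1*X2


def simulate_scenario_one(n, rng):
    """Generate (X1, X2, Y) with X2 = 10 - X1 + e_X2 and the moderated Y model."""
    x1 = rng.normal(0.0, 1.0, n)                 # assumed distribution of X1
    x2 = 10.0 - x1 + rng.normal(0.0, 1.0, n)     # assumed residual SD of 1
    mean_y = TRUE_BETA[0] + TRUE_BETA[1] * x1 + TRUE_BETA[2] * x2 + TRUE_BETA[3] * x1 * x2
    y = mean_y + rng.normal(0.0, 1.0, n)         # assumed residual SD of 1
    return x1, x2, y


def make_mar(x, y, rate, rng):
    """Set entries of x to NaN only where y exceeds its mean, targeting an
    overall missing rate of `rate` (an assumed reading of the design)."""
    eligible = y > y.mean()
    prob = min(1.0, rate / eligible.mean())
    missing = eligible & (rng.random(len(y)) < prob)
    out = x.copy()
    out[missing] = np.nan
    return out


def ols(x1, x2, y):
    """OLS estimates of the intercept, X1, X2, and X1*X2 coefficients."""
    design = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def relative_bias(estimate, truth):
    """Relative bias in percent: (estimate - truth) / truth * 100."""
    return (estimate - truth) / truth * 100.0


rng = np.random.default_rng(398721)
x1, x2, y = simulate_scenario_one(200_000, rng)
x1_obs = make_mar(x1, y, 0.25, rng)
x2_obs = make_mar(x2, y, 0.25, rng)
coef = ols(x1, x2, y)                            # complete-data benchmark
print(coef, relative_bias(coef, TRUE_BETA))
print(np.isnan(x1_obs).mean(), np.isnan(x2_obs).mean())
```

Because missingness depends only on the fully observed Y, the mechanism is MAR. The incomplete columns `x1_obs` and `x2_obs` are what an imputation procedure would receive.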
In the second scenario, we specify the relationship between life event stress (X1) and family support (X2) as \(X_{2}=10+0.5\times X_{1}-0.5\times {X_{1}^{2}}+e_{X2}\) to generate data. \(10^{6}\) observations were simulated, and 25% of X1 and 25% of X2 were missing depending on depression scores (Y). The OLS estimates and standard errors of regression coefficients in the substantive model from the complete data are presented in Table 2. We compare four ways to specify the covariate models to impute the missing X1 and X2. First, under the separate specification, if we specify the covariate models as \(X_{1}|X_{2}\sim N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\) and \(X_{2}|X_{1}\sim N\left (\gamma _{0X2}+\gamma _{1X2}X_{1},\sigma _{X2}^{2}\right )\), it implies that X1 and X2 are linearly related and should follow a bivariate normal distribution. Although the specification does not match the true data generating model, the joint distribution of X1 and X2 theoretically exists and the imputation models are not mathematically wrong. Second, if we assume the covariate models as \(X_{1}|X_{2}\sim N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}{X_{2}^{2}},\sigma _{X1}^{2}\right )\) and \(X_{2}|X_{1}\sim N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}{X_{1}^{2}},\sigma _{X2}^{2}\right )\), it is an incompatible separate specification based on Compatibility Requirement 1. Hence, the specification is mathematically invalid and does not match the true data generating model. Third, under the sequential specification, we can specify \(X_{2}|X_{1}\sim N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}{X_{1}^{2}},\sigma _{X2}^{2}\right )\) and \(X_{1}\sim N\left (\gamma _{X1},\sigma _{X1}^{2}\right )\) to capture the nonlinear relationship between X1 and X2. 
Fourth, we consider FCS as \(p\left (X_{1}|X_{2},Y\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}Y,\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}|X_{1},Y\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}Y,\sigma _{X2}^{2}\right )\). We applied multiple imputation with 20 imputed datasets from the posterior samples. The parameter estimates and standard errors of the coefficients in the substantive model from these four specifications are presented in Table 2. The estimates from the compatible separate specification were biased, because the imputation models were misspecified and failed to capture the nonlinear relationship between covariates. The estimates from the incompatible separate specification were severely biased, because both misspecification and incompatibility cause biases. FCS also had biased estimates. In contrast, the compatible sequential specification provided estimates close to the true values.
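Under the sequential specification of the second scenario, the imputation model for X1 is proportional to p(Y|X1,X2) p(X2|X1) p(X1). Because the mean of X2 is quadratic in X1, this product is not a normal density in X1, so a general-purpose sampler is needed. The sketch below uses a random-walk Metropolis step as one such option and compares it with brute-force integration on a grid. It is an illustration, not Blimp's algorithm: the marginal X1 ~ N(0, 1), the unit residual standard deviations, the observed values, and all function names are assumptions.

```python
import math
import random

BETA = (10.0, 0.5, -0.5, -0.1)   # Y = 10 + 0.5*X1 - 0.5*X2 - 0.1*X1*X2 + e_Y
GAMMA = (10.0, 0.5, -0.5)        # X2 = 10 + 0.5*X1 - 0.5*X1^2 + e_X2


def log_target(x1, x2, y, mu_x1=0.0, sd_x1=1.0, sd_x2=1.0, sd_y=1.0):
    """Unnormalized log of p(Y|X1,X2) * p(X2|X1) * p(X1) as a function of x1.
    The marginal of X1 and the three standard deviations are assumed values."""
    b0, b1, b2, b3 = BETA
    g0, g1, g2 = GAMMA
    mean_y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
    mean_x2 = g0 + g1 * x1 + g2 * x1**2
    return (
        -((y - mean_y) ** 2) / (2 * sd_y**2)
        - ((x2 - mean_x2) ** 2) / (2 * sd_x2**2)
        - ((x1 - mu_x1) ** 2) / (2 * sd_x1**2)
    )


def metropolis_x1(x2, y, n_steps, rng, step_sd=1.0, start=0.0):
    """Random-walk Metropolis chain targeting p(X1 | X2, Y)."""
    current, current_lp = start, log_target(start, x2, y)
    chain = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_target(proposal, x2, y)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain


def grid_mean(x2, y, lo=-6.0, hi=6.0, n=12001):
    """Mean of p(X1 | X2, Y) by numerical integration on a grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ws = [math.exp(log_target(x, x2, y)) for x in xs]
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)


rng = random.Random(398721)
chain = metropolis_x1(x2=10.0, y=4.5, n_steps=20000, rng=rng)
kept = chain[2000:]                       # discard burn-in draws
mcmc_mean = sum(kept) / len(kept)
reference_mean = grid_mean(x2=10.0, y=4.5)
imputed_x1 = chain[-1]                    # one draw used as the imputed value
print(mcmc_mean, reference_mean, imputed_x1)
```

In a full imputation routine this step would alternate with updates of X2 and of the model parameters; only the X1 update is shown here.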
In the first scenario, we illustrate a case of linearly related covariates, in which both the separate specification and sequential specification are applicable and equivalent. In the second scenario, we illustrate a case of nonlinearly correlated covariates, in which there is no way to use the separate specification to capture such a relationship. We need to highlight that even if we use the sequential specification to capture the nonlinear relationship, it only reflects our assumption of the relationship among covariates. If, in the second scenario, we specify the sequential specification as \(X_{1}|X_{2}\sim N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}{X_{2}^{2}},\sigma _{X1}^{2}\right )\) and \(X_{2}\sim N\left (\gamma _{X2},\sigma _{X2}^{2}\right )\), the covariate models are mathematically correct but they are misspecified, because they do not match the true data generating model. Overall, the sequential specification is more flexible than the separate specification in terms of allowing nonlinear relationships among X, but it does not help identify the true data generating model.
Conclusion and recommendation
Despite the broad appeal of multiple imputation and other imputation approaches, researchers might ignore the importance of compatibility when specifying imputation models. Incompatibility has been found to cause bias-inducing problems (e.g., Bartlett et al., 2015; Enders et al., 2020; Van Buuren et al., 2006). Developed from different models, a growing body of recent missing data work has focused on the model-based imputation model that is compatible with the substantive model (Bartlett et al., 2015; Enders et al., 2020; Erler et al., 2016, 2019; Goldstein et al., 2014; Kim et al., 2015, 2018; Zhang & Wang 2017). Building on these recent developments and Arnold’s compatibility work that is not limited to the missing data area (e.g., Arnold et al.; Arnold & Press; Liu et al.; Van Buuren et al.; Van Buuren), this paper systematically summarizes when the traditional FCS is applicable and how to specify a model-based imputation model if needed.
When researchers have a strong assumption of the imputation models and prefer to specify them directly (usually via the FCS procedure), compatibility should always be checked to assure that the imputation models for covariates are compatible with the substantive model and the imputation models are mutually compatible. To help researchers check compatibility more easily, first, we summarize two Compatibility Requirements which can help researchers decide whether the imputation models for covariates are compatible with the substantive model. Compatibility Requirement 1 is that if the conditional variances from the normal substantive model and imputation model are constant, the mean structure regarding the incomplete covariates cannot be nonlinear. Compatibility Requirement 2 is that after integrating out all other covariates and random effects, if both \(p\left (Y|X\right )\) and \(p\left (X|Y\right )\) are conditional normal, they cannot have a conditional variance whose highest exponent on X or Y is higher than 0. Second, we provide a decision tree to check whether the traditional FCS is applicable in a given scenario.
When the Compatibility Requirements or the decision tree reveals that FCS is not applicable, we should use the model-based imputation approach. The model-based imputation approach ensures the existence of a joint distribution of all incomplete covariates and outcome. With this goal, the model-based imputation procedure begins with specifying appropriate covariate models, and calculating the imputation models based on the substantive model and covariate models using Bayes’ theorem. When there are multiple incomplete covariates, this paper illustrates and compares two types of specifications: the separate specification and the sequential specification. If we know or assume that all the incomplete covariates come from a multivariate normal distribution, the two specifications are essentially the same. If the relationship between the covariates is not linear, we should use the sequential specification to capture the nonlinear relationship and guarantee compatibility. Although the sequential specification is more flexible compared to the separate approach because the sequential specification allows for nonlinearities among covariates, it cannot guarantee that the specified covariate model is the true underlying model but only guarantees that the specification is mathematically valid.
We need to caution researchers who calculate or program model-based imputation models by themselves that the specification of the covariate models and imputation models has to obey probability rules. Omitting covariate models or using the wrong covariate models when calculating an imputation model will lead to wrongly imputed data and biased parameter estimates.
References
Agresti, A. (2018) An introduction to categorical data analysis. Hoboken, NJ: Wiley.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.
Arnold, B. C., Castillo, E., & Sarabia, J. M. (1999) Conditional specification of statistical models. New York, NY: Springer Science & Business Media.
Arnold, B. C., & Press, S. J. (1989). Compatible conditional distributions. Journal of the American Statistical Association, 84(405), 152–156.
Bartlett, J. W., & Keogh, R. (2019). smcfcs: Multiple imputation of covariates by substantive model compatible fully conditional specification [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=smcfcs (R package version 1.4.0).
Bartlett, J. W., Seaman, S. R., White, I. R., & Carpenter, J. R. (2015). Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 24(4), 462–487.
Carpenter, J., & Kenward, M. (2013) Multiple imputation and its application. John Wiley & Sons.
Chen, M.-H., & Ibrahim, J. G. (2001). Maximum likelihood methods for cure rate models with missing covariates. Biometrics, 57(1), 43–52.
Cobb, L., Koppstein, P., & Chen, N. H. (1983). Estimation and moment recursion relations for multimodal distributions of the exponential family. Journal of the American Statistical Association, 78(381), 124–130.
Du, H., Enders, C., Keller, B. T., Bradbury, T. N., & Karney, B. R. (2021). A bayesian latent variable selection model for nonignorable missingness. Multivariate Behavioral Research, 1–49.
Enders, C. K. (2022) Applied missing data analysis, (edition 2). New York, NY: Guilford Press.
Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel regression models with random coefficients, interaction effects, and non-linear terms. Psychological Methods, 25 (1), 88–112.
Enders, C. K., Hayes, T., & Du, H. (2018). A comparison of multilevel imputation schemes for random coefficient models: Fully conditional specification and joint model imputation with random covariance matrices. Multivariate Behavioral Research, 53(5), 695–713.
Enders, C. K., Keller, B. T., & Levy, R. (2017). A fully conditional specification approach to multilevel imputation of categorical and continuous variables. Psychological Methods, 23(2), 298–317.
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychological Methods, 21(2), 222–240.
Erler, N. S., Rizopoulos, D., Jaddoe, V. W., Franco, O. H., & Lesaffre, E. M. (2019). Bayesian imputation of time-varying covariates in linear mixed models. Statistical Methods in Medical Research, 28(2), 555–568.
Erler, N. S., Rizopoulos, D., & Lesaffre, E. M. (2019). JointAI: Joint analysis and imputation of incomplete data in r. arXiv:https://arxiv.org/abs/1907.10867.
Erler, N. S., Rizopoulos, D., Rosmalen, J. v., Jaddoe, V. W., Franco, O. H., & Lesaffre, E. M. (2016). Dealing with missing covariates in epidemiologic studies: a comparison between multiple imputation and a full bayesian approach. Statistics in Medicine, 35(17), 2955–2974.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Introducing markov chain monte carlo. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.) Markov chain monte carlo in practice (pp. 339–357). London: Chapman & Hall.
Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal of the Royal Statistical Society: Series A (Statistics in Society), 177(2), 553–564.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of missing covariate values in multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48(2), 640–649.
Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data for multilevel models: Simulations and recommendations. Organizational Research Methods, 21(1), 111–149.
Hastings, W. K. (1970). Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1), 97–109.
Hughes, R. A., White, I. R., Seaman, S. R., Carpenter, J. R., Tilling, K., & Sterne, J. A. (2014). Joint modelling rationale for chained equations. BMC Medical Research Methodology, 14(1), 28.
Ibrahim, J. G., Chen, M.-H., & Lipsitz, S. R. (1999). Monte carlo em for missing covariates in parametric regression models. Biometrics, 55(2), 591–596.
Johnson, V. E., & Albert, J. H. (2006) Ordinal data modeling. New York, NY: Springer Science & Business Media.
Keller, B. T., & Enders, C. K. (2021) Blimp user’s manual (version 3). Los Angeles, CA.
Kim, S., Belin, T. R., & Sugar, C. A. (2018). Multiple imputation with non-additively related variables: Joint-modeling and approximations. Statistical Methods in Medical Research, 27(6), 1683–1694.
Kim, S., Sugar, C. A., & Belin, T. R. (2015). Evaluating model-based imputation methods for missing covariates in regression models with interactions. Statistics in Medicine, 34(11), 1876–1888.
Little, R. J., & Rubin, D. B. (2019) Statistical analysis with missing data Vol. 333. Hoboken, NJ: John Wiley & Sons.
Liu, J., Gelman, A., Hill, J., Su, Y. -S., & Kropko, J. (2014). On the stationary distribution of iterative imputations. Biometrika, 101(1), 155–173.
Lüdtke, O., Robitzsch, A., & West, S. G. (2020). Regression models involving nonlinear effects with missing data: A sequential modeling approach using bayesian estimation. Psychological Methods, 25(2), 157–181.
McCulloch, R., & Rossi, P. E. (1994). An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1-2), 207–240.
Meng, X. -L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 538–558.
Muthén, L., & Muthén, B. (1998) Mplus user’s guide, 8th edition. Los Angeles, CA: Author.
Plummer, M. (2016). rjags: Bayesian graphical models using mcmc [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=rjags (R package version 4-6).
Robitzsch, A., & Luedtke, O. (2019). mdmb: Model based treatment of missing data [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=mdmb (R package version 1.3-18).
Rubin, D. B. (2004) Multiple imputation for nonresponse in surveys (Vol. 81). John Wiley & Sons.
Schafer, J. L. (1997) Analysis of incomplete multivariate data. New York, NY: CRC press.
Seaman, S. R., Bartlett, J. W., & White, I. R. (2012). Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Medical Research Methodology, 12 (1), 46.
Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2003) Winbugs user manual. Citeseer.
Van Buuren, S. (2011). Multiple imputation of multilevel data. In J. Hox, & J. K. Roberts (Eds.) Handbook of advanced multilevel analysis (pp. 173–196). New York, NY: Routledge.
Van Buuren, S. (2012) Flexible imputation of missing data. New York, NY: CRC press.
Van Buuren, S., Brand, J. P., Groothuis-Oudshoorn, C. G., & Rubin, D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), 1049–1064.
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psychological Methods, 22(4), 649–666.
Zhu, J., & Raghunathan, T. E. (2015). Convergence properties of a sequential regression multiple imputation algorithm. Journal of the American Statistical Association, 110(511), 1112–1124.
Appendices
Appendix A: Blimp code
################################################
######## Example 1 (continued): Bayesian model-based imputation model for a quadratic substantive model
################################################
DATA: mydatafile.dat;        # Read Data in
VARIABLES: Y X;              # Variable Names
MISSING: NA;                 # What missing data is coded
MODEL: Y ~ X X^2;            # Specify Substantive Model
SEED: 398721;                # Set a seed
BURN: 10000;                 # Set number of burn iterations per chain
ITERATIONS: 10000;           # Set number of iterations after burn-in period across all chains
NIMP: 20;                    # Set number of imputations to be saved across all chains
CHAINS: 4 processors 4;      # Set number of chains total across how many processors
SAVE: stacked = imps.dat;    # Request all imputations stacked into one space delim file.
################################################
######## Example 2 (continued): Bayesian model-based imputation model for a random slope model
################################################
DATA: mydatafile.dat;        # Read Data in
VARIABLES: cluster Y X;      # Variable Names
CLUSTERID: cluster;          # Specify Clustering Variable
MISSING: NA;                 # What missing data is coded
MODEL: Y ~ X | X;            # Specify Substantive Model. Note that "|" denotes a random slope
SEED: 398721;                # Set a seed
BURN: 10000;                 # Set number of burn iterations per chain
ITERATIONS: 10000;           # Set number of iterations after burn-in period across all chains
NIMP: 20;                    # Set number of imputations to be saved across all chains
CHAINS: 4 processors 4;      # Set number of chains total across how many processors
SAVE: stacked = imps.dat;    # Request all imputations stacked into one space delim file.
################################################
######## Example 3: Moderated regression with nonlinearly related covariates
################################################
DATA: mydatafile.dat;        # Read Data in
VARIABLES: Y X1 X2;          # Variable Names
MISSING: NA;                 # What missing data is coded
MODEL:                       # Specify Sequential Models
  Y ~ X1 X2 X1*X2;           # p(Y | X1 X2 )
  X1 ~ X2 X2^2;              # p(X1 | X2 )
  X2 ~ 1;                    # p(X2 )
SEED: 398721;                # Set a seed
BURN: 10000;                 # Set number of burn iterations per chain
ITERATIONS: 10000;           # Set number of iterations after burn-in period across all chains
NIMP: 20;                    # Set number of imputations to be saved across all chains
CHAINS: 4 processors 4;      # Set number of chains total across how many processors
SAVE: stacked = imps.dat;    # Request all imputations stacked into one space delim file.
################################################
######## Example 4: Moderated mediation model
################################################
DATA: mydatafile.dat;        # Read Data in
VARIABLES: Y X1 X2 X3 M;     # Variable Names
MISSING: NA;                 # What missing data is coded
MODEL:                       # Specify Sequential Models
  M ~ X1 X2 X1*X2;           # p(M | X1 X2)
  Y ~ M X3 M*X3;             # p(Y | X3 M)
  X1 ~ X2 X2^2;              # p(X1 | X2)
  X2 ~ 1;                    # p(X2)
  X3 ~ M;                    # p(X3 | M)
SEED: 398721;                # Set a seed
BURN: 10000;                 # Set number of burn iterations per chain
ITERATIONS: 10000;           # Set number of iterations after burn-in period across all chains
NIMP: 20;                    # Set number of imputations to be saved across all chains
CHAINS: 4 processors 4;      # Set number of chains total across how many processors
SAVE: stacked = imps.dat;    # Request all imputations stacked into one space delim file.
################################################
######## Example 5: A two-level model with multiple covariates
################################################
# Separate specification
DATA: mydatafile.dat;        # Read Data in
VARIABLES: cluster Y X1 X2;  # Variable Names
CLUSTERID: cluster;          # Specify Clustering Variable
MISSING: NA;                 # What missing data is coded
MODEL: Y ~ X1 X2 | X1;       # Analysis Model (The separate specification is the default setting)
SEED: 398721;                # Set a seed
BURN: 10000;                 # Set number of burn iterations per chain
ITERATIONS: 10000;           # Set number of iterations after burn-in period across all chains
NIMP: 20;                    # Set number of imputations to be saved across all chains
CHAINS: 4 processors 4;      # Set number of chains total across how many processors
SAVE: stacked = imps.dat;    # Request all imputations stacked into one space delim file.

# Sequential specification
DATA: mydatafile.dat;        # Read Data in
VARIABLES: cluster Y X1 X2;  # Variable Names
CLUSTERID: cluster;          # Specify Clustering Variable
MISSING: NA;                 # What missing data is coded
LATENT: l2 = mu1 mu2;        # Specify Names of Latent Means
MODEL:                       # Specify Sequential Models
  Y ~ X1 X2 | X1;            # p(Y | X1 X2 mu1 mu2)
  X1 ~ 1@mu1 X2-mu2;         # p(X1 | mu1 X2)
  X2 ~ 1@mu2;                # p(X2 | mu2)
  mu1 ~ mu2;                 # p(mu1 | mu2)
  mu2 ~ 1;                   # p(mu2)
SEED: 398721;                # Set a seed
BURN: 10000;                 # Set number of burn iterations per chain
ITERATIONS: 10000;           # Set number of iterations after burn-in period across all chains
NIMP: 20;                    # Set number of imputations to be saved across all chains
CHAINS: 4 processors 4;      # Set number of chains total across how many processors
SAVE: stacked = imps.dat;    # Request all imputations stacked into one space delim file.
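The scripts above cover only the imputation phase; the analysis and pooling phases take place after the stacked imputations are saved. As a rough illustration of the pooling arithmetic (Rubin's rules) for a single coefficient, the Python sketch below assumes the substantive model has already been fit to each imputed data set. The function name `pool_rubin` and the input numbers are hypothetical and are not output from Blimp.

```python
from statistics import mean, variance


def pool_rubin(estimates, squared_ses):
    """Pool one parameter across M imputed data sets with Rubin's rules.

    estimates   -- point estimates, one per imputed data set
    squared_ses -- squared standard errors (sampling variances), same order
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(squared_ses):
        raise ValueError("need matching lists from at least two imputations")
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(squared_ses)         # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical slope estimates and squared standard errors from M = 4 imputations.
est, tot = pool_rubin([0.50, 0.54, 0.46, 0.50], [0.010, 0.012, 0.011, 0.011])
print(round(est, 3), round(tot, 5))
```

The pooled standard error is the square root of the total variance; degrees of freedom for interval estimates are omitted from this sketch.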
Appendix B: Technical Details
Arnold’s Conclusions
The two conclusions from Arnold and Press (1989) and Arnold et al. (1999) can be used to check the compatibility of a normal substantive model and a normal traditional FCS imputation model.
Conclusion 1: Compatibility of K-variate normal distribution with linear mean structures and constant variances
Suppose there are K random variables \(\boldsymbol{X}=\left\{X_{1},X_{2},\ldots,X_{K}\right\}\), and the conditional distribution of the k th variable, Xk, given all other variables, X−k, is univariate normal (k = 1,...,K). Then X follows a K-variate normal distribution if and only if the regression of Xk on X−k is linear and the conditional variance of Xk given X−k is constant (Arnold et al., 1999; Arnold and Press, 1989).
Note that a joint distribution of X may still exist when Conclusion 1 is violated, but it is no longer a K-variate normal distribution. In this case, compatibility is more difficult to check, especially in higher dimensions. Hence, we focus on the two-dimensional case in Conclusion 2.
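As a concrete bivariate instance of Conclusion 1, the Python sketch below computes the two conditional distributions implied by a bivariate normal distribution. Each conditional mean is linear in the other variable, and each conditional variance does not depend on it. The function name and the numeric values are illustrative assumptions.

```python
def bivariate_normal_conditional(mu_a, mu_b, sd_a, sd_b, rho):
    """Return (intercept, slope, variance) of p(A | B) for a bivariate normal.

    The conditional mean is intercept + slope * B, and the conditional
    variance is the same for every value of B.
    """
    slope = rho * sd_a / sd_b
    intercept = mu_a - slope * mu_b
    cond_var = sd_a ** 2 * (1 - rho ** 2)
    return intercept, slope, cond_var


# Hypothetical joint distribution of (X1, X2).
mu1, mu2, sd1, sd2, rho = 1.0, 2.0, 2.0, 1.0, 0.5

x1_given_x2 = bivariate_normal_conditional(mu1, mu2, sd1, sd2, rho)
x2_given_x1 = bivariate_normal_conditional(mu2, mu1, sd2, sd1, rho)
print("p(X1 | X2):", x1_given_x2)
print("p(X2 | X1):", x2_given_x1)
```

Two linear regressions with constant residual variances of this form are therefore compatible with a single bivariate normal joint distribution.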
Conclusion 2: Compatibility of two normal conditional distributions without linear mean structures and constant variances
To ensure compatibility, any two normal conditional distributions \(p\left(X_{1}|X_{2}\right)\) and \(p\left(X_{2}|X_{1}\right)\) have to be written as \(p\left(X_{1}|X_{2}\right)=N\left(\mu_{X_{1}}=-\frac{m_{12}X_{2}^{2}+m_{11}X_{2}+m_{10}}{2\left(m_{22}X_{2}^{2}+m_{21}X_{2}+m_{20}\right)},\ \sigma_{X_{1}}^{2}=-\frac{1}{2\left(m_{22}X_{2}^{2}+m_{21}X_{2}+m_{20}\right)}\right)\) and \(p\left(X_{2}|X_{1}\right)=N\left(\mu_{X_{2}}=-\frac{m_{21}X_{1}^{2}+m_{11}X_{1}+m_{01}}{2\left(m_{22}X_{1}^{2}+m_{12}X_{1}+m_{02}\right)},\ \sigma_{X_{2}}^{2}=-\frac{1}{2\left(m_{22}X_{1}^{2}+m_{12}X_{1}+m_{02}\right)}\right)\) with some constants mij (i,j = 0, 1, 2). Additionally, the constants must meet some requirements to ensure that the marginal distributions and the joint distribution exist (e.g., positive variances). More specifically, mij must satisfy either (1) m22 < 0, \(4m_{22}m_{02}>m_{12}^{2}\), and \(4m_{22}m_{20}>m_{21}^{2}\), or (2) m22 = m12 = m21 = 0, m20 < 0, m02 < 0, and \(4m_{20}m_{02}>m_{11}^{2}\) (Arnold et al., 1999). Conclusion 2 implies that, in order to ensure compatibility, the highest exponents of X2 and X1 in \(\mu_{X_{1}}\) and \(\mu_{X_{2}}\), respectively, are between -2 and 2, and the highest exponents of X2 and X1 in \(\sigma_{X_{1}}^{2}\) and \(\sigma_{X_{2}}^{2}\), respectively, should be between -2 and 0. More specifically, we can prove that when the conditional variances are constant (i.e., m22 = m12 = m21 = 0), the conditional distributions are compatible if and only if the conditional means are linear functions of X1 or X2 (the highest exponent is 0 or 1), which would bring us back to Conclusion 1.
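The formulas in Conclusion 2 can be evaluated directly. The Python sketch below computes the conditional mean and variance from a 3 × 3 table of constants mij and checks the two existence conditions stated above. The function names and the example constants are hypothetical and serve only to illustrate the algebra.

```python
def conditional_moments(m, given_value, target="X1"):
    """Mean and variance of one normal conditional under Conclusion 2.

    m is a 3x3 nested list with m[i][j] = m_ij (i, j = 0, 1, 2).
    target="X1" gives p(X1 | X2 = given_value);
    target="X2" gives p(X2 | X1 = given_value).
    """
    v = given_value
    if target == "X1":
        numerator = m[1][2] * v ** 2 + m[1][1] * v + m[1][0]
        denominator = m[2][2] * v ** 2 + m[2][1] * v + m[2][0]
    elif target == "X2":
        numerator = m[2][1] * v ** 2 + m[1][1] * v + m[0][1]
        denominator = m[2][2] * v ** 2 + m[1][2] * v + m[0][2]
    else:
        raise ValueError("target must be 'X1' or 'X2'")
    return -numerator / (2 * denominator), -1 / (2 * denominator)


def satisfies_existence_conditions(m):
    """Check conditions (1) or (2) on the constants m_ij."""
    case1 = (m[2][2] < 0
             and 4 * m[2][2] * m[0][2] > m[1][2] ** 2
             and 4 * m[2][2] * m[2][0] > m[2][1] ** 2)
    case2 = (m[2][2] == 0 and m[1][2] == 0 and m[2][1] == 0
             and m[2][0] < 0 and m[0][2] < 0
             and 4 * m[2][0] * m[0][2] > m[1][1] ** 2)
    return case1 or case2


# Hypothetical constants for condition (2): linear means, constant variances.
m_linear = [[0.0, 0.0, -0.5],
            [0.0, 0.5, 0.0],
            [-0.5, 0.0, 0.0]]
# Hypothetical constants for condition (1): the variance depends on the other variable.
m_nonconstant = [[0.0, 0.0, -1.0],
                 [0.0, 0.0, 0.0],
                 [-1.0, 0.0, -1.0]]

print(satisfies_existence_conditions(m_linear), conditional_moments(m_linear, 2.0))
print(satisfies_existence_conditions(m_nonconstant),
      conditional_moments(m_nonconstant, 0.0), conditional_moments(m_nonconstant, 1.0))
```

In the second example the conditional variance of X1 changes with X2, so the compatible joint distribution exists but is not bivariate normal.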
We can also prove that if the highest exponent of X2 in \(\mu_{X_{1}}\) and the highest exponent of X1 in \(\mu_{X_{2}}\) are both 2 (i.e., both conditional means contain quadratic terms), these two conditional distributions are incompatible, because \(\mu_{X_{1}}\) requires that m21 = 0 and m12≠ 0, but \(\mu_{X_{2}}\) requires that m12 = 0 and m21≠ 0.
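The argument can be written out in terms of the constants of Conclusion 2. The sketch below assumes that a quadratic conditional mean is a second-degree polynomial, so the denominator of the mean must reduce to a constant:

```latex
\begin{align*}
\mu_{X_{1}} &= -\frac{m_{12}X_{2}^{2}+m_{11}X_{2}+m_{10}}{2\left(m_{22}X_{2}^{2}+m_{21}X_{2}+m_{20}\right)}
  \text{ is quadratic in } X_{2}
  &&\Rightarrow\quad m_{22}=m_{21}=0,\; m_{12}\neq 0,\\
\mu_{X_{2}} &= -\frac{m_{21}X_{1}^{2}+m_{11}X_{1}+m_{01}}{2\left(m_{22}X_{1}^{2}+m_{12}X_{1}+m_{02}\right)}
  \text{ is quadratic in } X_{1}
  &&\Rightarrow\quad m_{22}=m_{12}=0,\; m_{21}\neq 0.
\end{align*}
```

The two sets of requirements cannot hold at the same time, since m12 (and likewise m21) would have to be both zero and nonzero.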
Specification of model-based imputation: categorical covariates
When covariates are categorical variables, generalized linear models are adopted to specify the covariate models. In particular, logistic regression and probit regression are the most widely used. For ordinal variables, we can use a cumulative probit model or an ordinal logistic model, and for nominal responses, we can use a multinomial probit model or a multinomial logistic model (e.g., Agresti, 2018; Albert & Chib, 1993; Johnson & Albert, 2006; McCulloch & Rossi, 1994). In these generalized linear models, there are underlying latent continuous variables. Instead of specifying covariate models on the categorical covariates, we specify covariate models on the latent continuous variables and calculate the model-based imputation. In this paper, we illustrate the procedure for imputing binary covariates with probit models.
Consider a two-level model with incomplete covariates at different levels and a cross-level interaction,
where j indicates clusters (j = 1,...,J), i indicates individuals (i = 1,...,nj), nj indicates the sample size in the j th cluster, Xij indicates the Level-1 covariate, Zj indicates a binary Level-2 covariate, u0j and u1j are the random effects with \(\boldsymbol{u}_{j}=\left(\begin{array}{c} u_{0j}\\ u_{1j} \end{array}\right)\sim MVN\left(\boldsymbol{0},\boldsymbol{D}=\left(\begin{array}{cc} \sigma_{u0}^{2} & \sigma_{u01}\\ \sigma_{u01} & \sigma_{u1}^{2} \end{array}\right)\right)\), and eij is the Level-1 error term with \(e_{ij}\sim N\left(0,\sigma_{e}^{2}\right)\). We introduce an auxiliary continuous random variable Z∗, which is the latent Z. In a probit model, Z∗ follows a standard normal distribution with mean of 0 and variance of 1. For a binary variable, a threshold divides the normal distribution of the latent variable into two segments, such that the latent variable is below the threshold when the binary variable equals zero and above the threshold when the binary variable equals one. This threshold parameter is usually fixed at zero. Thus, Z can be viewed as an indicator of whether Z∗ is positive. The probit model for ordinal variables is similar but incorporates additional threshold parameters. For example, an ordered categorical variable with C response options requires C − 1 threshold parameters. If the latent variable is between the \(\left(c-1\right)\)th and c th thresholds, the categorical variable is c. In this case, the first threshold is still fixed at zero, but the remaining thresholds need to be estimated. Hence, the likelihood augmented with the random effects u is
We consider the sequential specification by specifying covariate models on the latent continuous variables, \(p\left(X_{ij}|Z_{j}^{*}\right)\) and \(p\left(Z_{j}^{*}\right)\). That is, \(p\left(X_{ij}|Z_{j}^{*}\right)=N\left(\gamma_{0}+\gamma_{1}Z_{j}^{*},\sigma_{X}^{2}\right)\) and \(p\left(Z_{j}^{*}\right)=N\left(\eta_{0},1\right)\). The residual variance of \(Z_{j}^{*}\) is fixed to 1 for identification in the probit model. First, the model-based imputation model of Xij by Eq. 8 is
Second, the model-based imputation model of Zj in each cluster by Eq. 8 is
There is no analytical form for the model-based imputation model \(p\left(Z_{j}^{*}|\cdotp\right)\); therefore, we use the Metropolis-Hastings algorithm to draw \(Z_{j}^{*}\) in Gibbs sampling.
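To make the Metropolis-Hastings step concrete, the Python sketch below performs random-walk updates of a single latent value. It is a minimal illustration under stated assumptions, not Blimp's implementation. The target combines only the covariate-model pieces \(p\left(X_{ij}|Z_{j}^{*}\right)\) and \(p\left(Z_{j}^{*}\right)\) with hypothetical parameter values and data; the substantive-model term is omitted. This toy target therefore happens to have a closed form, which is used here only so that the sampler's output can be checked.

```python
import math
import random


def metropolis_step(current, log_target, step_sd, rng):
    """One random-walk Metropolis-Hastings update for a scalar latent value."""
    proposal = current + rng.gauss(0.0, step_sd)
    log_ratio = log_target(proposal) - log_target(current)
    # Symmetric proposal: accept with probability min(1, target ratio).
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return current, False


# Hypothetical parameter values and Level-1 covariate scores for one cluster j.
eta0, gamma0, gamma1, sigma2_x = 0.0, 0.0, 1.0, 1.0
x_cluster = [0.5, 1.0, 1.5]


def log_target(z_star):
    """Unnormalized log density of the latent Z*_j given the cluster's data."""
    log_prior = -0.5 * (z_star - eta0) ** 2  # p(Z*) = N(eta0, 1)
    log_x = sum(-0.5 * (x - gamma0 - gamma1 * z_star) ** 2 / sigma2_x
                for x in x_cluster)          # product of p(X_ij | Z*)
    # The substantive-model term, which depends on Z_j = 1 if Z* > 0 else 0,
    # would be added here; it is what removes the closed form.
    return log_prior + log_x


rng = random.Random(2022)
z_star, draws, n_accepted = 0.0, [], 0
for iteration in range(5000):
    z_star, accepted = metropolis_step(z_star, log_target, 1.0, rng)
    n_accepted += accepted
    if iteration >= 500:                     # discard burn-in draws
        draws.append(z_star)

z_imputed = 1 if z_star > 0 else 0           # binary Z_j implied by the last draw
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 2), n_accepted / 5000, z_imputed)
```

For these hypothetical values the toy target is normal with mean 0.75 and variance 0.25, so the average of the retained draws should be close to 0.75.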
Substantive model-based joint modeling
Although this paper focuses on the sequential and separate specifications, in this section we briefly sketch the substantive model-based version of joint modeling (Carpenter & Kenward, 2013). Mplus (Muthén & Muthén, 1998) adopts the joint modeling approach. Unlike the sequential and separate specifications, which are model-based versions of the conditional modeling specification, substantive model-based joint modeling applies the model-based idea to a joint modeling specification. It specifies a joint distribution of all covariates and the outcome, in which the outcome (Y) and the covariates that are linearly related to the outcome (X) are specified conditional on the covariates that are not linearly related to the outcome (Z). Based on the joint distribution \(p\left(Y,\boldsymbol{X},\boldsymbol{Z}\right)=p\left(Y,\boldsymbol{X}|\boldsymbol{Z}\right)p\left(\boldsymbol{Z}\right)\), we can impute X and Z using Bayesian techniques. Hence, joint modeling can accommodate nonlinear relations between the outcome and covariates. However, because \(p\left(\boldsymbol{Z}\right)\) is usually specified as a multivariate normal distribution, joint modeling fails to handle nonlinear relationships within Z (Lüdtke et al., 2020). If we modify joint modeling by decomposing \(p\left(\boldsymbol{Z}\right)\) so that it accommodates nonlinear relationships within Z, it is no different from the sequential specification approach.
Cite this article
Du, H., Alacam, E., Mena, S. et al. Compatibility in imputation specification. Behav Res 54, 2962–2980 (2022). https://doi.org/10.3758/s13428-021-01749-5
DOI: https://doi.org/10.3758/s13428-021-01749-5