Missing data are unavoidable in practice and have the potential to undermine the validity of research results. Various methods have been proposed to deal with the missing at random (MAR) mechanism where the probability of missingness is only related to observed variables (Little & Rubin, 2019). Among those methods, multiple imputation is one of the most widely used MAR-based methods. Multiple imputation consists of three major steps: the imputation phase, the analysis phase, and the pooling phase (Enders, 2022; Little & Rubin, 2019; Rubin, 2004; Schafer, 1997). We focus on the imputation step for incomplete covariates in this paper. When covariates (i.e., predictors) and outcomes are incomplete, the missing outcomes or covariates are estimated and imputed in the imputation step in order to construct a complete dataset to analyze the substantive model. We refer to the model of substantive interest as the substantive model and the model used to estimate the incomplete covariates or outcomes as the imputation model. We focus on the imputation model of covariates because the substantive model itself is the imputation model of the missing outcome unless there are auxiliary variables. There are no existing imputation models for covariates and we need to specify an imputation model for the incomplete covariate(s) in order to provide the needed information. A popular imputation approach is the fully conditional specification (FCS; e.g., Enders et al., 2016, 2017; Van Buuren 2012, 2011; Van Buuren et al. 2006). FCS uses a round robin sequence of regression models where each incomplete covariate is regressed on all other covariates and the outcome, complete or previously imputed. An important feature of FCS is that the regression models used to impute the substantive model outcome and to impute each covariate take on identical functional forms. 
Because FCS imputes all model variables in an identical manner, the imputation algorithm does not need to distinguish the substantive model from the models for covariates, leading Enders (2022) to refer to FCS imputation as a type of “agnostic imputation”. For example, a three-predictor regression analysis requires four regression models to define the distributions of the missing variables (the substantive model and one regression model for each covariate). When all specified regression models are linear (i.e., without any polynomial or interaction terms) with normally distributed errors, the implied joint distribution of all covariates and outcome is multivariate normal. However, models are not necessarily linear in practice. Recent research on missing data has revealed that arbitrarily specifying an imputation model such as FCS may lead to the so-called incompatibility issue and cause noticeable bias in estimation when the substantive model contains nonlinear covariate effects, such as quadratic terms and interaction effects, or random slopes (e.g., Bartlett et al., 2015; Enders et al., 2018; Erler et al., 2016; Grund et al., 2018; Kim et al., 2015; Seaman et al., 2012; Van Buuren et al., 2006). Only when compatibility is ensured can we get accurate imputation and parameter estimation. A succinct definition of compatibility is that the joint distribution of all covariates and outcome exists (existence indicates that we can write out a joint density function and it meets all requirements of a distribution such as integrability). For example, suppose we are interested in understanding the relationship between number of hours worked (Y) and reported happiness (X). Some participants’ happiness scores (X) are missing. Based on a scatterplot of existing data, we find the relationship is quadratic. Then, the substantive model is \(Y=\beta _{0}+\beta _{1}X+\beta _{2}X^{2}+e_{Y}\) with \(e_{Y}\sim N\left (0,{\sigma _{Y}^{2}}\right )\). 
To impute X (happiness scores), FCS regresses X on Y, and this regression can take various forms. Researchers usually either assume that the incomplete covariate is a linear function of Y, \(X=\gamma _{0}+\gamma _{1}Y+e_{X}\), or assume that the imputation model has a form similar to the substantive model, \(X=\gamma _{0}+\gamma _{1}Y+\gamma _{2}Y^{2}+e_{X}\). But regardless of which imputation model is used, the joint distribution of X and Y does not exist (we cannot write out a valid density function) because the substantive model contains a quadratic term. Throughout this paper, we use the term “traditional FCS” to refer to this round robin sequence of identically-specified regressions.

To resolve the incompatibility issue, imputation models guaranteeing that the conditional distribution of the incomplete covariates is mathematically correct and compatible with the substantive model have been proposed and developed by researchers such as Bartlett et al. (2015), Enders et al. (2020), Erler et al. (2016; 2019), Goldstein et al. (2014), Kim et al. (2015), Kim et al. (2018), and Zhang and Wang (2017). Following Du et al. (2021), Enders et al. (2020), and Kim et al. (2018), we refer to the imputation models of covariates that are compatible with the substantive model and with all other covariates as the model-based imputation model. The rationale of the model-based imputation method is that instead of specifying the imputation models of covariates directly, we use the substantive model and the so-called covariate models that capture the relationship among the covariates to construct the imputation model for incomplete covariates. In this way, we can ensure that the joint distribution of all covariates and outcome exists. “Model-based imputation” emphasizes that imputation models are specified based on the substantive model and covariate models to ensure compatibility.

The existing model-based imputation methods were proposed in different scenarios for different types of substantive models. To help researchers better understand and use model-based imputation, the overall framework for model-based imputation methods needs to be summarized. Why, when, and how to use model-based imputation is not clear to researchers, except in a few papers focusing on the consequences of incompatibility (e.g., Van Buuren et al., 2006). In addition, there are two kinds of model-based specifications, the sequential specification and the separate specification, and they have not been systematically summarized and compared except in a recent paper by Lüdtke et al. (2020). It is necessary to synthesize previous work and relevant findings to provide a theoretical framework for methodological researchers. Therefore, the aims of this paper are to: 1) provide a decision tree and the requirements of compatibility to help researchers choose appropriate imputation methods to ensure compatibility (i.e., traditional FCS vs. model-based imputation), 2) present a clear overview of the sequential and separate specifications, and 3) note differences about which method (e.g., sequential vs. separate) to prefer under specific circumstances, as well as how these are implemented in Blimp.

The outline of this paper is as follows. In “Compatibility and related concepts” section, we define compatibility. In “Requirements of compatibility and decision tree” section, we present the requirements of compatibility in the normal distribution family, and illustrate how to use a decision tree to check compatibility and select the appropriate imputation approach with examples. In “Model-based imputation: Single incomplete covariate” section, we illustrate how to calculate the model-based imputation model when a single covariate is incomplete. In “Model-based imputation: Multiple incomplete covariates” section, we present and compare two specification strategies for model-based imputation when multiple covariates are incomplete. In “Misuse in model-based imputation” section, we give examples in which the model-based imputation may be misused. In “Hypothetical data examples” section, we illustrate and compare the two specification strategies for model-based imputation with hypothetical data. In “Conclusion and recommendation” section, we end with several concluding remarks.

Compatibility and related concepts

The general definition of compatibility is that at least one joint distribution exists whose conditional distributions match the specified conditional distributions (Arnold et al., 1999; Arnold and Press, 1989; Liu et al., 2014; Van Buuren et al., 2006; Van Buuren, 2012). It implies that given the conditional distributions, the joint distribution exists. We can view the substantive model and the imputation model for each covariate as conditional distributions. For example, the substantive model with one covariate has a conditional distribution \(p\left (Y|X\right )\) and the imputation model for the covariate has a conditional distribution \(p\left (X|Y\right )\). The imputation model for X and the substantive model are compatible when the joint distribution \(p\left (Y,X\right )\) exists. Meng (1994) referred to this type of compatibility as congeniality. As mentioned above, FCS directly specifies the imputation model (i.e., \(p\left (X|Y\right )\)), but there is no guarantee that the implied joint distribution of the covariate and outcome \(p\left (Y,X\right )\) exists. When the specified imputation models of covariates are incompatible with the substantive model, the imputation models are mathematically misspecified and can lead to biased parameter estimates and inaccurate coverage rates. For example, the simulation study by Enders et al. (2020) showed that in a two-level model with random slopes, the traditional FCS misspecified the imputation model and consistently underestimated the random slope variance by 10% to 20% even with a large sample size. The coverage rates of the fixed effects could be as low as 0.85, compared to the nominal level of 0.95, which indicates that statistical inferences could be inaccurate under this condition.

When there is more than one incomplete covariate (X), we need to consider whether the joint distribution of all incomplete covariates exists (\(p\left (\boldsymbol {X}\right )\); compatibility between all the covariate models) and whether the joint distribution of all incomplete covariates and outcome exists (\(p\left (\boldsymbol {X},Y\right )\); compatibility between the covariate models and the substantive model). There are two ways to specify the model-based imputation model when there are multiple incomplete covariates, the sequential specification and the separate specification (we will elaborate on them later). The sequential specification can ensure compatibility between all the covariate models and the substantive model, whereas the separate specification risks failing to ensure the existence of \(p\left (\boldsymbol {X}\right )\) and \(p\left (\boldsymbol {X},Y\right )\).

Compatibility indicates whether the imputation model is mathematically valid, but it cannot tell us whether the imputation model correctly captures the true data generating model. In other words, when compatibility is met, we only know the imputation model is not mathematically wrong; it still may be misspecified.

Requirements of compatibility and decision tree

We focus on the normal distribution because it is widely used for substantive models with continuous outcomes. In this section, based on the two conclusions from Arnold and Press (1989) and Arnold et al. (1999) (see the Appendix for more details), we summarize two Compatibility Requirements. The conclusions can be used to check the compatibility of a normal substantive model and a normal traditional FCS imputation model. Additionally, based on these two conclusions, we provide a decision tree and illustrate how to use the decision tree (see Fig. 1) to check whether the traditional FCS procedure can provide a compatible imputation model or whether the model-based imputation is needed.

Fig. 1. Decision tree of different imputation procedures for ensuring compatibility

Compatibility requirement 1

When both the substantive model \(p\left (Y|X\right )\) and the covariate imputation model \(p\left (X|Y\right )\) have normally distributed errors and constant variances (which implies the absence of random slopes), compatibility exists if and only if the conditional means of the two models are linear.

Compatibility requirement 2

After integrating out all other variables (i.e., other covariates and random effects) in both the covariate imputation model \(p\left (X|Y\right )\) and substantive model \(p\left (Y|X\right )\), if both \(p\left (Y|X\right )\) and \(p\left (X|Y\right )\) are conditional normal and either of them has a conditional variance whose highest exponent is higher than 0 (such as \(var\left (Y|X\right )=X\sigma ^{2}\) or \(X^{2}\sigma ^{2}\)), the imputation model of X (\(p\left (X|Y\right )\)) is not compatible with the substantive model \(p\left (Y|X\right )\) (i.e., the joint distribution of X and Y does not exist).

Based on these two Compatibility Requirements, we also provide a decision tree for checking compatibility (Fig. 1). We will use two examples where there is one incomplete covariate to illustrate how to use the aforementioned Compatibility Requirements and a decision tree to check compatibility. With only one covariate, we do not need to distinguish separate and sequential specifications (we will elaborate on them later) in Fig. 1 since they are special cases of model-based imputation when there are multiple incomplete covariates.

Example 1: Incompatibility example with a quadratic substantive model

In this example, we show that only model-based imputation can be used for a quadratic substantive model. As illustrated earlier, when the substantive model is \(Y=\beta _{0}+\beta _{1}X+\beta _{2}X^{2}+e_{Y}\) with \(e_{Y}\sim N\left (0,{\sigma _{Y}^{2}}\right )\) and we specify a traditional FCS imputation model, based on the aforementioned Compatibility Requirement 1, the joint distribution of X and Y does not exist. If we still assume that the joint distribution of X and Y exists, the imputed values of missing X and Y, and consequently the estimates of the regression coefficients, can be biased (Bartlett et al., 2015; Grund et al., 2018; Seaman et al., 2012). This model corresponds to Branches 1 and 2 in the decision tree. More specifically, in terms of the question “does each regression model in the path model have only linear terms”, the answer is no because the substantive model has a nonlinear term, \(X^{2}\). And because there is no need to distinguish separate and sequential specifications with one covariate, we should use model-based imputation in this example (see Fig. 2 for the route to Branches 1 and 2).

Fig. 2. Example 1 - Using the decision tree to ensure compatibility of a quadratic substantive model

Example 2: Incompatibility example with a random slope model

In this example, we show that only model-based imputation can be used for a random slope substantive model. Consider a two-level random slope model as the substantive model with an incomplete Level-1 covariate,

$$ Y_{ij}=\beta_{0}+\beta_{1}X_{ij}+u_{0j}+u_{1j}X_{ij}+e_{ij}, $$
(1)

where j indicates clusters (j = 1,...,J), i indicates individuals (i = 1,...,nj), nj indicates the sample size in the j th cluster, Xij indicates the Level-1 covariate, β0 is the average intercept, β1 is the average slope, u0j and u1j are the random effects with \(\boldsymbol {u_{j}}=\left (\begin {array}{c} u_{0j}\\ u_{1j} \end {array}\right )\sim MVN\left (\boldsymbol {0},\boldsymbol {D}=\left (\begin {array}{cc} \sigma _{u0}^{2} & \sigma _{u01}\\ \sigma _{u01} & \sigma _{u1}^{2} \end {array}\right )\right )\), and eij is the Level-1 error term with \(e_{ij}\sim N\left (0,{\sigma _{e}^{2}}\right )\). The variance of Yij conditional on Xij is not constant: it varies with Xij, and the highest exponent of Xij in \(var\left (Y_{ij}|X_{ij}\right )\) is 2 (i.e., \(X_{ij}^{2}\sigma _{u1}^{2}\) in \(var\left (Y_{ij}|X_{ij}\right )=\sigma _{u0}^{2}+X_{ij}^{2}\sigma _{u1}^{2}+2X_{ij}\sigma _{u01}+{\sigma _{e}^{2}}\) ). In the random slope analysis case, FCS usually employs a “reverse random coefficient” approach where the outcome serves as a random slope predictor of the incomplete covariate (Grund et al., 2016),

$$ X_{ij}=\gamma_{0}+\gamma_{1}Y_{ij}+u_{0j\left( x\right)}+u_{1j\left( x\right)}Y_{ij}+\text{{e}}_{ij\left( x\right)}, $$
(2)

where \(e_{ij\left (x\right )}\) follows a univariate normal distribution and \(\boldsymbol {u_{j\left (x\right )}}=\left (\begin {array}{c} u_{0j\left (x\right )}\\ u_{1j\left (x\right )} \end {array}\right )\) follows a multivariate normal distribution, which are similar to the substantive model. Consequently, \(var\left (X_{ij}|Y_{ij}\right )\) is also not a constant and the highest exponent of Yij is 2.

Based on either the decision tree or Compatibility Requirement 2, we can conclude that Eq. 2 is incompatible with the substantive model in this example. Based on Compatibility Requirement 2, the highest exponent of Xij in \(var\left (Y_{ij}|X_{ij}\right )\) after integrating out u0j and u1j should be between -2 and 0. In this example, Compatibility Requirement 2 is thus violated since in both \(p\left (X_{ij}|Y_{ij}\right )\) and \(p\left (Y_{ij}|X_{ij}\right )\), the highest exponents of Yij and Xij in the respective conditional variances are 2. Additionally, the substantive model leads to Branches 1 and 2 in the decision tree because of the random slope (see Fig. 3 for the route to Branches 1 and 2). More specifically, in terms of the question “does each regression model in the path model have only random intercepts”, the answer is no. Branches 1 and 2 indicate that we can only use model-based imputation in this case.
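The non-constant conditional variance in this example can be checked numerically. The following minimal Python sketch evaluates \(var\left (Y_{ij}|X_{ij}\right )\) as the quadratic form \(\left (1,X_{ij}\right )\boldsymbol {D}\left (1,X_{ij}\right )^{\prime }+{\sigma _{e}^{2}}\), which expands to the expression given for Eq. 1. The function name and the numerical values of D and \({\sigma _{e}^{2}}\) are hypothetical and chosen only for illustration.

```python
import numpy as np

def var_y_given_x(x, D, sigma2_e):
    """Marginal variance of Y_ij given X_ij after integrating out (u_0j, u_1j).

    Equals sigma_u0^2 + x^2 * sigma_u1^2 + 2 * x * sigma_u01 + sigma_e^2.
    """
    z = np.array([1.0, x])          # random-effect design vector (1, X_ij)
    return float(z @ D @ z + sigma2_e)

# Hypothetical values, for illustration only.
D = np.array([[1.0, 0.3],
              [0.3, 0.5]])
sigma2_e = 1.0

for x in (0.0, 1.0, 2.0):
    print(x, var_y_given_x(x, D, sigma2_e))
```

Because the variance grows with the square of the covariate, the highest exponent of Xij is 2, which is the condition that Compatibility Requirement 2 rules out.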

Fig. 3. Example 2 - Using the decision tree to ensure compatibility of a random slope model

Substantive model with complete nonlinear terms or random slopes with complete covariates

When random slopes are only associated with complete covariates and/or all nonlinear terms are complete in the substantive model, we can indeed use FCS; however, we need to be very careful in specifying it. FCS is a general way to specify the imputation model where one incomplete variable is regressed on all other variables, but there are various options to implement FCS. In many cases, only one way to specify FCS can ensure compatibility.

For example, suppose the substantive model is Yij = β0 + β1X1ij + β2X2ij + u0j + u1jX1ij + eij where X1ij is complete and X2ij is incomplete. In this example, the random slopes are only associated with the complete covariate and we can use FCS. However, when X1ij and X2ij are linearly correlated, the FCS imputation model for X2ij must be \(X_{2ij}=\gamma _{0}+\gamma _{1}Y_{ij}+\gamma _{2}X_{1ij}+u_{0j\left (x\right )}+u_{1j\left (x\right )}X_{1ij}+\text {{e}}_{ij\left (x\right )}\) to ensure compatibility. Neither \(X_{2ij}=\gamma _{0}+\gamma _{1}Y_{ij}+\gamma _{2}X_{1ij}+u_{0j\left (x\right )}+u_{1j\left (x\right )}Y_{ij}+\text {{e}}_{ij\left (x\right )}\) nor \(X_{2ij}=\gamma _{0}+\gamma _{1}Y_{ij}+\gamma _{2}X_{1ij}+u_{0j\left (x\right )}+\text {{e}}_{ij\left (x\right )}\) works, although these two models can also be called FCS imputation models. In another example, the substantive model is \(Y=\beta _{0}+\beta _{1}{X_{1}^{2}}+\beta _{2}X_{2}+e_{Y}\) where X1 is complete and X2 is incomplete. In this example, the quadratic term is only associated with the complete covariate and we can use FCS. However, when X1 and X2 are linearly correlated, the FCS model for X2 must be \(X_{2}=\gamma _{0}+\gamma _{1}{X_{1}^{2}}+\gamma _{2}X_{1}+\gamma _{3}Y+e_{x}\) to ensure compatibility, whereas \(X_{2}=\gamma _{0}+\gamma _{1}{X_{1}^{2}}+\gamma _{3}Y+e_{x}\) cannot guarantee compatibility. To avoid the risk of failing compatibility, we suggest always using model-based imputation instead of FCS when the substantive model has nonlinear terms and/or random slopes, regardless of whether those terms involve complete covariates. The decision tree also reflects this suggestion.

Model-based imputation: Single incomplete covariate

In this paper, we focus on calculating the model-based imputation in the Bayesian framework. To introduce the model-based imputation, we begin with the simplest case where there is only one incomplete covariate. We need to specify a model for the incomplete covariate itself p(X), which we refer to as the covariate model. Based on Bayes’ theorem, the model-based imputation model is

$$ p\left( X|Y\right)=p\left( Y|X\right)p\left( X\right)/p\left( Y\right), $$
(3)

where Y indicates the outcome and \(p\left (Y\right )\) is the marginal distribution of the outcome. \(p\left (X|Y\right )\) in Eq. 3 is the model-based imputation model for X. \(p\left (X|Y\right )\) is compatible with the substantive model \(p\left (Y|X\right )\) because the imputation model for X is calculated based on the joint distribution, \(p\left (Y,X\right )=p\left (Y|X\right )p\left (X\right )\). If there are other complete covariates or auxiliary variables Z in addition to the incomplete covariate, a covariate model that captures the relationship between the incomplete and complete covariates/auxiliaries \(p\left (X|Z\right )\) should be specified and the model-based imputation model is \(p\left (X|Y,Z\right )=p\left (Y|X,Z\right )p\left (X|Z\right )/p\left (Y|Z\right )\). Auxiliary variables are not treated differently from covariates.

We can specify the covariate model and estimate the missing covariate via Bayesian analysis software and R packages such as BUGS (Spiegelhalter et al., 2003) and (Plummer, 2016), but they require relatively high programming skills. There are other more user-friendly R packages that we will mention in the later sections. We will focus on a free software program, Blimp, which offers a user-friendly environment for implementing model-based imputation (Keller and Enders, 2021). Besides the Blimp code that we illustrate in the Appendix, more examples are available in the Blimp user’s manual (Keller & Enders, 2021). To better illustrate how to specify Bayesian model-based imputation, we continue to use the two examples discussed in the previous section.

Example 1 (continued): Bayesian model-based imputation model for a quadratic substantive model

Previously in Example 1, we examined a substantive model with a quadratic term (\(Y|X\sim N\left (\beta _{0}+\beta _{1}X+\beta _{2}X^{2},{\sigma _{Y}^{2}}\right )\)), and concluded that the arbitrarily specified FCS imputation model for the incomplete covariate \(X|Y\sim N\left (\gamma _{0}+\gamma _{1}Y,{\sigma _{X}^{2}}\right )\) or \(X|Y\sim N\left (\gamma _{0}+\gamma _{1}Y+\gamma _{2}Y^{2},{\sigma _{X}^{2}}\right )\) is not compatible with the substantive model. Now we use model-based imputation, assuming a covariate model of \(p(X)=N\left (\gamma ,{\sigma _{X}^{2}}\right )\), and calculate the model-based imputation model by Eq. 3,

$$ \begin{array}{@{}rcl@{}} p\left( X|Y\right) & \propto& p\left( Y|X\right)p\left( X\right) \\ & \propto &exp\left( -\frac{{\sigma_{X}^{2}}\left( \beta_{2}X^{2}+\beta_{1}X-Y+\beta_{0}\right)^{2}+{\sigma_{Y}^{2}}\left( X-\gamma\right)^{2}}{2{\sigma_{X}^{2}}{\sigma_{Y}^{2}}}\right). \end{array} $$
(4)

The kernel of p(X|Y) follows a quartic exponential family (Cobb et al., 1983; Lüdtke et al., 2020) and it is difficult to directly sample from this family. Instead, we can use the Metropolis-Hastings (MH) algorithm to empirically construct the distribution based on the kernel and estimate the missing covariate (Gilks et al., 1996; Hastings, 1970). In the MH algorithm, a candidate value of X is proposed from a jumping distribution centered at the current value, the candidate is accepted or rejected according to the target kernel (e.g., Eq. 4), and the process is repeated. Bayesian software and Bayesian R packages can easily handle this case. Specifically, we provide the Blimp code for this example in the Appendix.
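To make the sampling step concrete, the following is a minimal Python sketch of a random-walk Metropolis sampler (a special case of MH with a symmetric jumping distribution) that targets the kernel in Eq. 4 for a single case with observed Y and missing X. It is an illustration under simplifying assumptions, not the algorithm implemented in Blimp: the model parameters are held fixed at hypothetical values, whereas a full Bayesian imputation routine would also update the parameters at each iteration. The function names, step size, and parameter values are our own choices.

```python
import numpy as np

def log_kernel(x, y, beta0, beta1, beta2, sigma2_y, gamma, sigma2_x):
    """Log of the unnormalized density in Eq. 4: log p(Y|X) + log p(X)."""
    resid = y - (beta0 + beta1 * x + beta2 * x**2)
    return -0.5 * resid**2 / sigma2_y - 0.5 * (x - gamma)**2 / sigma2_x

def mh_impute(y, params, n_iter=2000, step=0.5, x_init=0.0, seed=0):
    """Random-walk Metropolis draws of one missing X given its observed Y."""
    rng = np.random.default_rng(seed)
    x = x_init
    draws = np.empty(n_iter)
    for t in range(n_iter):
        proposal = x + step * rng.normal()   # symmetric jumping distribution
        log_ratio = (log_kernel(proposal, y, **params)
                     - log_kernel(x, y, **params))
        if np.log(rng.uniform()) < log_ratio:
            x = proposal                     # accept the candidate
        draws[t] = x                         # otherwise keep the current value
    return draws

# Hypothetical parameter values, for illustration only.
params = dict(beta0=0.0, beta1=1.0, beta2=0.5,
              sigma2_y=1.0, gamma=0.0, sigma2_x=1.0)
draws = mh_impute(y=2.0, params=params)
```

After discarding an initial burn-in portion, any retained draw can serve as an imputed value of X for that case. Because only ratios of the kernel enter the acceptance step, the normalizing constant p(Y) in Eq. 3 is never needed.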

Example 2 (continued): Bayesian model-based imputation model for a random slope model

We considered the multilevel substantive model previously in Example 2. Now we use matrices to express the equation of a random slope model for convenience,

$$ \boldsymbol{Y_{j}}=\boldsymbol{X_{j}\beta}+\boldsymbol{X_{j}u_{j}}+\boldsymbol{e_{j}}, $$
(5)

where Yj is an nj × 1 vector of the outcome in the j th cluster (j = 1,...,J), \(\boldsymbol {\beta }=\left (\begin {array}{c} \beta _{0}\\ \beta _{1} \end {array}\right )\) is a 2 × 1 vector of the fixed effects, \(\boldsymbol {u_{j}}=\left (\begin {array}{c} u_{0j}\\ u_{1j} \end {array}\right )\) is a 2 × 1 vector of the random effects in the j th cluster, Xj is an nj × 2 design matrix in the j th cluster (a column of ones and a column of covariate values), and ej is an nj × 1 independently normally distributed error term with E(ej) = 0 and \(var(\boldsymbol {e_{j}})={\sigma _{e}^{2}}\boldsymbol {I_{nj}}\). The likelihood by augmenting the random effects uj is \(p(\boldsymbol {Y},\boldsymbol {u}|{\sigma _{e}^{2}},\boldsymbol {\beta },\boldsymbol {X},\boldsymbol {D})=\underset {j=1}{\overset {J}{\prod }}f(\boldsymbol {Y_{j}}|\boldsymbol {u_{j}},{\sigma _{e}^{2}},\boldsymbol {\beta },\boldsymbol {X_{j}})f(\boldsymbol {u_{j}}|\boldsymbol {D})\). To implement the model-based imputation, we assume \(\boldsymbol {X_{j}}\sim MN\left (\alpha \boldsymbol {1_{nj}},{\sigma _{X}^{2}}\boldsymbol {I_{nj}}\right )\) as the covariate model, where with a slight abuse of notation Xj here denotes the nj × 1 vector of covariate values. The model-based imputation model for Xj is calculated by Eq. 3,

$$ \begin{array}{@{}rcl@{}} &&p(\boldsymbol{X_{j}}|\boldsymbol{Y_{j}},{\sigma_{e}^{2}},\boldsymbol{\beta},\boldsymbol{u_{j}},\alpha,{\sigma_{X}^{2}},\boldsymbol{D})\\ & \propto &p(\boldsymbol{Y_{j}},\boldsymbol{u_{j}}|{\sigma_{e}^{2}},\boldsymbol{\beta},\boldsymbol{X},\boldsymbol{D})p(\boldsymbol{X_{j}}|\alpha,{\sigma_{X}^{2}}) \\ &=&MN\left( \frac{{\sigma_{e}^{2}}\alpha\boldsymbol{1_{nj}} + {\sigma_{X}^{2}}\left( \beta_{1} + u_{1j}\right)\left( \boldsymbol{Y_{j}} - \beta_{0}\boldsymbol{1_{nj}} - u_{0j}\boldsymbol{1_{nj}}\right)}{{\sigma_{e}^{2}}+\left( \beta_{1}+u_{1j}\right)^{2}{\sigma_{X}^{2}}},\right.\\ &&\qquad\qquad\left.\frac{{\sigma_{e}^{2}}{\sigma_{X}^{2}}\boldsymbol{I_{nj}}}{{\sigma_{e}^{2}}+\left( \beta_{1}+u_{1j}\right)^{2}{\sigma_{X}^{2}}}\right). \end{array} $$
(6)
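Because Eq. 6 is a normal distribution, imputations for this example can be drawn directly without an MH step. The following minimal Python sketch computes the conditional mean and variance in Eq. 6 for one cluster and draws candidate values, conditional on the current values of the parameters and random effects. The function name and all numerical values are hypothetical; in practice only the missing entries of Xj would be replaced by the draws.

```python
import numpy as np

def impute_x_cluster(y_j, beta0, beta1, u0j, u1j, alpha,
                     sigma2_e, sigma2_x, rng):
    """Mean, variance, and one draw from the conditional in Eq. 6."""
    slope = beta1 + u1j                       # cluster-specific slope
    denom = sigma2_e + slope**2 * sigma2_x
    mean = (sigma2_e * alpha
            + sigma2_x * slope * (y_j - beta0 - u0j)) / denom
    var = sigma2_e * sigma2_x / denom
    draw = rng.normal(mean, np.sqrt(var))     # elementwise independent draws
    return mean, var, draw

# Hypothetical values, for illustration only.
rng = np.random.default_rng(1)
y_j = np.array([2.0, 0.0, -1.0])
mean, var, draw = impute_x_cluster(y_j, beta0=0.0, beta1=1.0, u0j=0.0,
                                   u1j=0.0, alpha=0.0, sigma2_e=1.0,
                                   sigma2_x=1.0, rng=rng)
```

Note that when the cluster-specific slope \(\beta _{1}+u_{1j}\) is zero, the outcome carries no information about the covariate and Eq. 6 reduces to the covariate model \(N\left (\alpha ,{\sigma _{X}^{2}}\right )\).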

Model-based imputation: Multiple incomplete covariates

When multiple covariates are incomplete, we need to make sure that the joint distribution of all the incomplete covariates and outcome exists. In other words, the imputation model of each covariate should be compatible with the substantive model, and all the imputation models of incomplete covariates should be compatible with each other. In this section, we introduce and compare two ways to specify covariate models in the model-based imputation framework when multiple covariates are incomplete: the sequential and separate specification approaches. There is a third way to specify the imputation model and ensure compatibility in the context of multiple incomplete covariates: the substantive model based version of joint modeling (Carpenter & Kenward, 2013). Since we do not focus on joint modeling in this paper, we present it in the Appendix.

Sequential specification

In the first approach, the joint distribution of all the incomplete covariates, p(X), is specified as the covariate model. Considering the difficulty of specifying a multivariate distribution, Ibrahim et al. (1999) proposed to factor the joint distribution into a sequence of univariate distributions,

$$ p(\boldsymbol{X}) = p(X_{K})p(X_{K-1}|X_{K})p(X_{K-2}|X_{K},X_{K-1})...p(X_{1}|X_{>1}) $$
(7)

This approach is referred to as the sequential specification. The model-based imputation model for the k th covariate Xk using the sequential specification is

$$ \begin{array}{@{}rcl@{}} p\left( X_{k}|Y,\boldsymbol{X_{-k}}\right) & =&p\left( Y|\boldsymbol{X}\right)p(X_{K})p(X_{K-1}|X_{K})...\\ &&p(X_{1}|X_{>1})/p\left( Y,\boldsymbol{X_{-k}}\right)\\ & \propto& {p}\left( Y|\boldsymbol{X}\right)\underset{s=1}{\overset{k}\prod}p\left( X_{s}|X_{>s}\right), \end{array} $$
(8)

where X−k denotes all the covariates except Xk. If there are complete covariates or auxiliary variables, we specify the joint distribution of the incomplete covariates conditional on the complete covariates or auxiliary variables, p(X|Z), where Z is the set of complete covariates or complete auxiliary variables. This specification has been widely used in the imputation literature (e.g., Erler et al., 2016; Lüdtke et al., 2020). For different research questions, it may be more reasonable to use some specific orders for the sequential specification. But it has been found that the sequential specification is quite robust against changes in the ordering (Chen and Ibrahim, 2001; Zhu & Raghunathan, 2015). The sequential specification is available in the R packages JointAI (Erler et al., 2019) and mdmb (Robitzsch & Luedtke, 2019), and the software program Blimp (Keller & Enders, 2021). Additionally, we show how to accommodate categorical variables in the sequential specification in the Appendix.
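As an illustration of Eq. 8, the following minimal Python sketch evaluates the log kernel of the sequential imputation model for K = 2 incomplete covariates with the factorization \(p(\boldsymbol {X})=p(X_{2})p(X_{1}|X_{2})\). The substantive model with an interaction term, the linear normal covariate models, the parameter names, and the numerical values are all hypothetical choices made for the sketch. For k = 1 only \(p(X_{1}|X_{2})\) involves X1, whereas for k = 2 both covariate factors involve X2, which matches the product up to k in Eq. 8.

```python
import numpy as np

def log_normal(value, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (value - mean)**2 / var

def log_kernel_sequential(k, y, x1, x2, th):
    """Log kernel of p(X_k | Y, X_-k) in Eq. 8 for K = 2 covariates."""
    # Hypothetical substantive model p(Y | X1, X2) with an interaction term.
    mean_y = th["b0"] + th["b1"] * x1 + th["b2"] * x2 + th["b3"] * x1 * x2
    log_sub = log_normal(y, mean_y, th["s2_y"])
    # Sequential covariate models: p(X1 | X2) and p(X2).
    log_x1_given_x2 = log_normal(x1, th["g0"] + th["g1"] * x2, th["s2_x1"])
    log_x2 = log_normal(x2, th["a0"], th["s2_x2"])
    if k == 1:
        return log_sub + log_x1_given_x2            # factors containing X1
    if k == 2:
        return log_sub + log_x1_given_x2 + log_x2   # factors containing X2
    raise ValueError("k must be 1 or 2")

# Hypothetical parameter values, for illustration only.
theta = dict(b0=0.0, b1=0.5, b2=0.5, b3=0.2, s2_y=1.0,
             g0=0.0, g1=0.4, s2_x1=1.0, a0=0.0, s2_x2=1.0)
lk1 = log_kernel_sequential(1, y=1.0, x1=0.3, x2=-0.2, th=theta)
lk2 = log_kernel_sequential(2, y=1.0, x1=0.3, x2=-0.2, th=theta)
```

Either kernel could be passed to an MH step to update the corresponding missing covariate in turn.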

Separate specification

Alternatively, we can specify the univariate conditional distributions for each incomplete covariate one by one as the covariate model, instead of focusing on the joint distribution of the covariates (Bartlett et al., 2015; Enders et al., 2020). The univariate conditional distribution is specified as regressing each incomplete covariate on all other incomplete covariates, p(Xk|X−k). Then the model-based imputation model is

$$ \begin{array}{@{}rcl@{}} p\left( X_{k}|Y,\boldsymbol{X_{-k}}\right) & =&p\left( Y|\boldsymbol{X}\right)p\left( X_{k}|\boldsymbol{X_{-k}}\right)/p\left( Y|\boldsymbol{X_{-k}}\right)\\ & \propto& p\left( Y|\boldsymbol{X}\right)p\left( X_{k}|\boldsymbol{X_{-k}}\right) \end{array} $$
(9)

We refer to this approach as the separate specification. If there are complete covariates or auxiliary variables, we specify the univariate conditional distribution of each incomplete covariate conditional on the complete covariates or auxiliary variables, p(Xk|X−k,Z) where Z is the set of complete covariates or complete auxiliary variables. The separate specification is available in the R package smcfcs (Bartlett & Keogh, 2019) and the software program Blimp (Keller & Enders, 2021). However, the separate specification has two issues. First, the covariate models p(Xk|X−k) may not be mutually compatible, which can lead to biased estimation. That is, based on the univariate conditional distributions of the covariates, the joint distribution of the incomplete covariates (p(X)) does not exist, and consequently the joint distribution of all the incomplete covariates and the outcome (p(X,Y)) does not exist. Based on the aforementioned Compatibility Requirement 1, if all the covariate models meet the following requirements: 1) the mean structure is linear, 2) normal errors, 3) no random slopes, and 4) constant variance, the covariate models are mutually compatible. If we can ensure compatibility among the covariate models (i.e., p(X) exists), the sequential specification and separate specification approaches are equivalent and would lead to the same imputation model because \(p\left (X_{k}|\boldsymbol {X_{-k}}\right )\propto \underset {s=1}{\overset {k}\prod }p\left (X_{s}|X_{>s}\right )\). Second, the separate specification is over-parameterized. For example, when two incomplete covariates follow a bivariate normal distribution, there are 5 nonredundant pieces of information: 2 means and 3 variance-covariance components. If we specify a simple linear regression in both p(X1|X2) and p(X2|X1), we estimate 2 intercepts, 2 slopes, and 2 residual variances. Thus one more parameter is estimated compared to the number of pieces of information in the data. 
The linear slope in p(X1|X2) is determined by the parameters of p(X2|X1) together with the remaining parameters of p(X1|X2). If we freely estimate both slopes, it may cause problems especially when informative priors are used because the priors of two slopes may contain conflicting information (Hughes et al., 2014; Liu et al., 2014). In addition, because the separate specification approach estimates more parameters, it is a less efficient approach.
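The over-parameterization argument can be verified numerically. The following minimal Python sketch derives the two regressions implied by a bivariate normal distribution and checks one constraint that ties the six regression parameters together: the slope of X1 on X2 times the residual variance of X2 given X1 equals the slope of X2 on X1 times the residual variance of X1 given X2. The function name and the values of the mean vector and covariance matrix are hypothetical.

```python
import numpy as np

def conditional_regression(mu, Sigma, target, given):
    """Intercept, slope, and residual variance of X_target regressed on X_given."""
    slope = Sigma[target, given] / Sigma[given, given]
    intercept = mu[target] - slope * mu[given]
    resid_var = Sigma[target, target] - slope * Sigma[target, given]
    return intercept, slope, resid_var

# Hypothetical bivariate normal: 2 means + 3 variance-covariance components.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

a12, b12, r12 = conditional_regression(mu, Sigma, target=0, given=1)  # X1 | X2
a21, b21, r21 = conditional_regression(mu, Sigma, target=1, given=0)  # X2 | X1

# Six regression parameters come from five quantities, so one constraint holds.
print(b12 * r21, b21 * r12)
```

If both slopes are estimated freely, as in the separate specification, nothing forces the estimates to satisfy this constraint.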

We do not propose to use only the separate specification or only the sequential specification. The decision depends on our assumptions and what is known. If the joint distribution of covariates p(X) is already known or is assumed known, we can use p(X) to calculate all the univariate marginal and conditional distributions (e.g., \(p\left (X_{2}|X_{1}\right )\), \(p\left (X_{1}|X_{2}\right )\), \(p\left (X_{1}\right )\), and \(p\left (X_{2}\right )\)), so the sequential specification and separate specification are interchangeable, and either can be used, though one might be easier to use than the other. In particular, if we know or assume p(X) is a multivariate normal distribution, we can use either the sequential specification or the separate specification easily. In this case, we just specify linear regressions with normally distributed residuals for \(p\left (X_{2}|X_{1}\right )\), \(p\left (X_{1}|X_{2}\right )\), \(p\left (X_{1}\right )\), and \(p\left (X_{2}\right )\). It is easy to prove that both of these specifications reach the same conclusion. In the case where we are not sure of p(X) (e.g., Example 3), the sequential specification and separate specification present different sets of considerations and challenges. When we do not know the correct form of p(Xk|X−k), the covariate models in the separate specification can only be specified arbitrarily. The consequence is that either the covariate models are mutually incompatible, or the separate specification uses linear regressions to ensure compatibility but misspecifies the relationship among the covariates. In this case, we need to use the sequential specification to guarantee compatibility and accommodate the nonlinear relationship among covariates (see Branch 1 in Fig. 1). Although the sequential specification can accommodate the nonlinear relationship among the covariates, it is based on a specific assumption about the nonlinear relationship. It cannot guarantee that the specified covariate models and the joint distribution of the covariates are the true data generating models. Indeed, no matter which approach we use, whether the specified model is the true data generating model is untestable. We can only guarantee that the specification is mathematically valid.

We use Example 3 to illustrate how to use the decision tree to select the appropriate imputation method when there are multiple incomplete covariates and the difference between the separate and sequential specifications. We use Example 4 to illustrate the imputation specification when there are multiple regressions in a path model.

Example 3: Moderated regression with nonlinearly related covariates

In this example, we show that only the sequential specification can capture the nonlinear relationship among covariates. We assume a substantive model with an interaction term as Y = β0 + β1X1 + β2X2 + β3X1X2 + eY (i.e., \(p\left (Y|X_{1},X_{2}\right )=N\left (\beta _{0}+\beta _{1}X_{1}+\beta _{2}X_{2}+\beta _{3}X_{1}X_{2},{\sigma _{Y}^{2}}\right )\)). Both X1 and X2 are incomplete. First of all, as demonstrated in the decision tree (the answer is no for “does each regression model in the path model have only linear terms”), we cannot use FCS since the substantive model has an incomplete nonlinear term. Second, if any theory or assumption reveals that X1 and X2 are not linearly related, we need to use the sequential specification (the answer is no for “is there only one regression model with linearly related covariates” in the decision tree). See Fig. 4 for the route to Branch 1. For example, suppose the two covariates are nonlinearly related such that \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}{X_{2}^{2}},\sigma _{X1}^{2}\right )\). In terms of the covariate model of X2, if we consider the separate specification, regardless of whether we assume \(p\left (X_{2}|X_{1}\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}{X_{1}^{2}},\sigma _{X2}^{2}\right )\) or \(p\left (X_{2}|X_{1}\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1},\sigma _{X2}^{2}\right )\), \(p\left (X_{1}|X_{2}\right )\) and \(p\left (X_{2}|X_{1}\right )\) are incompatible because \(p\left (X_{1}|X_{2}\right )\) contains a nonlinear term and constant variance (Compatibility Requirement 1). In the sequential specification, we assume \(p\left (X_{2}\right )=N\left (\gamma _{0X2},\sigma _{X2}^{2}\right )\). Based on Eq. 8, we calculate the model-based imputation models for X1 and X2 by \(p\left (X_{1}|X_{2},Y\right )\propto p\left (Y|X_{1},X_{2}\right )p\left (X_{1}|X_{2}\right )\) and \(p\left (X_{2}|X_{1},Y\right )\propto p\left (Y|X_{1},X_{2}\right )p\left (X_{1}|X_{2}\right )p\left (X_{2}\right )\). Again, although the sequential specification can be used to ensure compatibility based on the assumed \(p\left (X_{1}|X_{2}\right )\), the resultant imputation model may or may not reflect the true underlying data generating process. We provide the Blimp code for this example in the Appendix.
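To make the two imputation models of Example 3 concrete, the following is a minimal sketch of how they could be evaluated and sampled. It is not the Blimp implementation: all parameter values are hypothetical, the function names are ours, and a random-walk Metropolis step is used only as one simple way to draw from an unnormalized density.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
BETA = (1.0, 0.5, -0.5, 0.3)      # beta_0..beta_3 of the substantive model
SIGMA2_Y = 1.0
GAMMA_X1 = (0.2, 0.4, -0.1)       # gamma_0X1, gamma_1X1, gamma_2X1 in p(X1|X2)
SIGMA2_X1 = 1.5
GAMMA_0X2, SIGMA2_X2 = 0.0, 2.0   # p(X2) = N(gamma_0X2, sigma2_X2)


def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_p_y(y, x1, x2):
    """Substantive model p(Y|X1,X2) with the X1*X2 interaction."""
    b0, b1, b2, b3 = BETA
    return log_normal(y, b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2, SIGMA2_Y)


def log_p_x1_given_x2(x1, x2):
    """Covariate model p(X1|X2) with a quadratic mean in X2."""
    g0, g1, g2 = GAMMA_X1
    return log_normal(x1, g0 + g1 * x2 + g2 * x2**2, SIGMA2_X1)


def log_target_x1(x1, x2, y):
    """log p(X1|X2,Y) up to a constant: p(Y|X1,X2) p(X1|X2)."""
    return log_p_y(y, x1, x2) + log_p_x1_given_x2(x1, x2)


def log_target_x2(x2, x1, y):
    """log p(X2|X1,Y) up to a constant: all three factors contain X2."""
    return (log_p_y(y, x1, x2) + log_p_x1_given_x2(x1, x2)
            + log_normal(x2, GAMMA_0X2, SIGMA2_X2))


def metropolis_step(current, log_target, rng, step=1.0):
    """One random-walk Metropolis update for a scalar missing value."""
    proposal = current + step * rng.normal()
    log_ratio = log_target(proposal) - log_target(current)
    return proposal if np.log(rng.uniform()) < log_ratio else current


rng = np.random.default_rng(1)
x1_obs, y_obs = 0.8, 1.2          # a case with X2 missing
x2_draw = 0.0
for _ in range(200):
    x2_draw = metropolis_step(
        x2_draw, lambda v: log_target_x2(v, x1_obs, y_obs), rng
    )
print("one imputed value for X2:", x2_draw)
```

Note that `log_target_x2` must include `log_p_x1_given_x2`, because X2 appears in the mean of the X1 model; dropping that factor would break the probability rules discussed in the text.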

Fig. 4 Example 3 - Using the decision tree to ensure compatibility of a moderated regression with nonlinearly related covariates

Example 4: Moderated mediation model

It is frequently of interest to fit large path models that may contain nonlinear relationships among upstream variables that later serve as covariates in a downstream regression. To illustrate the sequential specification in this case, we provide an example of a moderated mediation model. When there are multiple regressions in a model, such as a mediation model, we need to consider each regression model. Suppose the model is a moderated mediation model M = β0M + β1MX1 + β2MX2 + β3MX1X2 + eM and Y = β0Y + β1YM + β2YX3 + β3YMX3 + eY. X1, X2, X3, and M are incomplete. First, as demonstrated in the decision tree (the answer is no for “does each regression model in the path model have only linear terms”), we cannot use FCS because both the M model and Y model have interaction terms and thus X1, X2, X3, M, and Y cannot come from a multivariate normal distribution. Second, since this is a mediation model with two regressions, a carefully specified separate specification is feasible but too complex. With the sequential specification, \(p\left (X_{1},X_{2},X_{3},M,Y\right )=p\left (X_{1}|X_{2}\right )p\left (X_{2}\right )p\left (M|X_{1},X_{2}\right )p\left (X_{3}|M\right )p\left (Y|M,X_{3}\right )\). If any theory or assumption reveals that the covariates in the M model or Y model are linearly correlated, we can specify \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\), \(p\left (X_{2}\right )=N\left (\gamma _{0X2},\sigma _{X2}^{2}\right )\), \(p\left (X_{3}|M\right )=N\left (\gamma _{0X3}+\gamma _{1X3}M,\sigma _{X3}^{2}\right )\), and \(p\left (M\right )=N\left (\gamma _{0M},{\sigma _{M}^{2}}\right )\). Since the sequential specification can clearly decompose the joint distribution into a sequence of regressions, we suggest using the sequential specification in path models even when covariates are linearly related. Hence, in the decision tree, when the answer is no for “are there multiple regression models in the path model with only linear terms”, the suggested specification is the sequential specification (see Fig. 5 for the route to Branch 1). We provide the Blimp code for this example in the Appendix. If any theory or assumption reveals that the covariates in the M model or Y model are nonlinearly correlated, we have to use the sequential specification to capture the nonlinear relation among covariates. With the sequential specification, we calculate the model-based imputation models for X1 by \(p\left (X_{1}|X_{2},M\right )\propto p\left (X_{1}|X_{2}\right )p\left (M|X_{1},X_{2}\right )\), for X2 by \(p\left (X_{2}|X_{1},M\right )\propto p\left (X_{1}|X_{2}\right )p\left (X_{2}\right )p\left (M|X_{1},X_{2}\right )\), for M by \(p\left (M|X_{1},X_{2},X_{3},Y\right )\propto p\left (M|X_{1},X_{2}\right )p\left (X_{3}|M\right )p\left (Y|M,X_{3}\right )\), and for X3 by \(p\left (X_{3}|M,Y\right )\propto p\left (X_{3}|M\right )p\left (Y|M,X_{3}\right )\).
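The rule used to obtain each imputation model in Example 4 is mechanical: keep every factor of the sequential factorization in which the variable appears, either as the modeled variable or as a predictor. A minimal sketch of that rule is given below; the dictionary layout and function names are ours, introduced only for illustration.

```python
# Each variable maps to the variables it is conditioned on in the
# sequential factorization p(X1|X2) p(X2) p(M|X1,X2) p(X3|M) p(Y|M,X3).
FACTORIZATION = {
    "X2": [],
    "X1": ["X2"],
    "M": ["X1", "X2"],
    "X3": ["M"],
    "Y": ["M", "X3"],
}


def factor_label(child, parents):
    """Format one factor, e.g. p(M|X1,X2) or p(X2)."""
    return f"p({child}|{','.join(parents)})" if parents else f"p({child})"


def imputation_factors(target, factorization):
    """List every factor in which `target` appears as child or parent."""
    return [
        factor_label(child, parents)
        for child, parents in factorization.items()
        if child == target or target in parents
    ]


for variable in ["X1", "X2", "M", "X3"]:
    factors = " ".join(imputation_factors(variable, FACTORIZATION))
    print(f"{variable}: proportional to {factors}")
```

The factors are listed in the order of the dictionary rather than the order used in the text; the product is the same.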

Fig. 5 Example 4 - Using the decision tree to ensure compatibility of a moderated mediation model

Misuse in model-based imputation

Regardless of whether we use the sequential specification or the separate specification, we should be careful in how we specify covariate models and use them for imputation. The specification of the covariate models and imputation models has to obey probability rules. We use a single-level model with three incomplete covariates as an example. Under the separate specification, \(p\left (X_{3}|X_{1},X_{2}\right )\) is the covariate model for X3, and the model-based imputation of X3 should be calculated as \(p\left (X_{3}|X_{1},X_{2},Y\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{3}|X_{1},X_{2}\right )\), but not \(p\left (X_{3}|X_{1},X_{2},Y\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{3}|X_{1}\right )\) or \(p\left (X_{3}|X_{1},X_{2},Y\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{3}\right )\), unless we can prove that \(p\left (X_{3}|X_{1},X_{2}\right )\propto p\left (X_{3}|X_{1}\right )\) or \(p\left (X_{3}|X_{1},X_{2}\right )\propto p\left (X_{3}\right )\). Under the sequential specification, the model-based imputation of X3 is \(p\left (X_{3}|Y,X_{1},X_{2}\right )\propto p\left (Y|X_{1},X_{2},X_{3}\right )p\left (X_{1}|X_{2},X_{3}\right )p\left (X_{2}|X_{3}\right )p\left (X_{3}\right )\). We will see in Example 5 that when the substantive model is complex, with a multilevel structure and/or multiple covariates, it is easy to make mistakes in choosing the covariate models, especially when cluster means are considered.
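The consequence of dropping a conditioning variable can be verified on a toy case. The sketch below assumes four binary variables with an arbitrary random joint distribution (a discrete stand-in for the continuous models in the text, used only because every conditional can then be computed exactly). It compares the exact \(p\left (X_{3}|X_{1},X_{2},Y\right )\) with the product that uses \(p\left (X_{3}|X_{1},X_{2}\right )\) and with the shortcut that uses \(p\left (X_{3}\right )\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint distribution over binary (X1, X2, X3, Y); axes in that order
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()

# Exact target: p(X3 | X1, X2, Y), normalized over the X3 axis
target = joint / joint.sum(axis=2, keepdims=True)

# Building blocks derived from the same joint distribution
p_y_given_x = joint / joint.sum(axis=3, keepdims=True)   # p(Y | X1, X2, X3)
p_x = joint.sum(axis=3)                                  # p(X1, X2, X3)
p_x3_given_x12 = p_x / p_x.sum(axis=2, keepdims=True)    # p(X3 | X1, X2)
p_x3 = joint.sum(axis=(0, 1, 3))                         # p(X3)


def normalize_over_x3(unnormalized):
    """Turn an unnormalized product into a conditional distribution of X3."""
    return unnormalized / unnormalized.sum(axis=2, keepdims=True)


# Valid: p(Y|X1,X2,X3) p(X3|X1,X2)
correct = normalize_over_x3(p_y_given_x * p_x3_given_x12[..., np.newaxis])
# Invalid shortcut: p(Y|X1,X2,X3) p(X3)
wrong = normalize_over_x3(
    p_y_given_x * p_x3[np.newaxis, np.newaxis, :, np.newaxis]
)

max_error_correct = np.abs(correct - target).max()
max_error_wrong = np.abs(wrong - target).max()
print("max error, valid product:   ", max_error_correct)
print("max error, invalid shortcut:", max_error_wrong)
```

The valid product reproduces the target up to floating-point error, whereas the shortcut does not, because X3 is not independent of X1 and X2 in this joint distribution.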

Example 5: A two-level model with multiple covariates

We consider a two-level model with two incomplete Level-1 covariates, X1ij and X2ij,

$$ Y_{ij}=\beta_{0}+\beta_{1}X_{1ij}+\beta_{2}X_{2ij}+u_{0j}+u_{1j}X_{1ij}+e_{ij}. $$

Since the substantive model contains random slopes, FCS is not applicable here. Instead, we can use the sequential approach or separate approach if we assume X1ij and X2ij are linearly related. We use latent cluster mean centering in the covariate models to partition covariates into within- and between-cluster components. Therefore, X1ij and X2ij consist of latent cluster means and within-cluster deviations. Latent cluster means are employed to accommodate unequal group sizes (Grund et al., 2018). We further define the within- and between-cluster parts of the covariates as normally distributed. Specifically, the within-cluster parts are within-cluster deviation scores, distributed as \(p\left (X_{1ij},X_{2ij}\right )=MN\left (\left (\begin {array}{c} \mu _{1j}\\ \mu _{2j} \end {array}\right ),\boldsymbol {\varSigma _{w}}\right )\). The between-cluster parts are the latent cluster means, distributed as \(p\left (\mu _{1j},\mu _{2j}\right )=MN\left (\left (\begin {array}{c} \mu _{1}\\ \mu _{2} \end {array}\right ),\boldsymbol {\varSigma _{b}}\right )\). X1ij, X2ij, μ1j, and μ2j are the variables that need to be estimated in this example.

We begin with the sequential specification, in which there are four covariate models in this example,

$$ p\left( \boldsymbol{X_{1j}}|\mu_{1j},\mu_{2j},\boldsymbol{X_{2j}}\right):\quad X_{1ij}=\mu_{1j}+\gamma_{1}\left( X_{2ij}-\mu_{2j}\right)+\epsilon_{x1ij},\qquad \epsilon_{x1ij}\sim N\left( 0,\sigma_{1X}^{2}\right) $$
(10)
$$ p\left( \boldsymbol{X_{2j}}|\mu_{2j}\right):\quad X_{2ij}=\mu_{2j}+\epsilon_{x2ij},\qquad \epsilon_{x2ij}\sim N\left( 0,\sigma_{2X}^{2}\right) $$
(11)
$$ p\left( \mu_{1j}|\mu_{2j}\right):\quad \mu_{1j}=\eta_{11}+\eta_{12}\mu_{2j}+\epsilon_{u1ij},\qquad \epsilon_{u1ij}\sim N\left( 0,\sigma_{1\mu}^{2}\right) $$
(12)
$$ p\left( \mu_{2j}\right):\quad \mu_{2j}=\eta_{21}+\epsilon_{u2ij},\qquad \epsilon_{u2ij}\sim N\left( 0,\sigma_{2\mu}^{2}\right). $$
(13)

We summarize how to impute X1ij, X2ij, μ1j, and μ2j using these covariate models in Eqs. 10-13 respectively in Table 1. For example, the correct imputation model of μ2j is

$$ \begin{array}{@{}rcl@{}} p\left( \mu_{2j}|\mu_{1j},\boldsymbol{X_{1j},X_{2j}}\right) & \propto& p\left( \boldsymbol{X_{1j}}|\mu_{1j},\mu_{2j},\boldsymbol{X_{2j}}\right)\\&&p\left( \boldsymbol{X_{2j}}|\mu_{2j}\right)\\&&p\left( \mu_{1j}|\mu_{2j}\right)p\left( \mu_{2j}\right). \end{array} $$
(14)

It may look as if there are simpler ways to impute μ2j. First, how about \(p\left (\mu _{2j}|\mu _{1j},\boldsymbol {X_{1j},X_{2j}}\right )\propto p\left (\boldsymbol {X_{1j}}|\mu _{1j},\mu _{2j},\boldsymbol {X_{2j}}\right )p\left (\mu _{2j}\right )\)? This specification is mathematically invalid unless \(p\left (\mu _{2j}\right )\propto p\left (\mu _{2j}|\mu _{1j},\boldsymbol {X_{2j}}\right )\). Second, how about \(p\left (\mu _{2j}|\mu _{1j},\boldsymbol {X_{2j}}\right )\propto p\left (\boldsymbol {X_{2j}}|\mu _{2j}\right )p\left (\mu _{1j}|\mu _{2j}\right )p\left (\mu _{2j}\right )\)? This specification is mathematically correct but it incorporates less information compared to Eq. 14 during iterations. Similarly, the imputation model of X2ij is proportional to \(p\left (Y_{ij}|u_{0j},u_{1j},X_{1ij},X_{2ij}\right )p\left (X_{1ij}|\mu _{1j},\mu _{2j},X_{2ij}\right )p\left (X_{2ij}|\mu _{2j}\right )\) but not proportional to \(p\left (Y_{ij}|u_{0j},u_{1j},X_{1ij},X_{2ij}\right )p\left (X_{2ij}|\mu _{2j}\right )\). When there are three incomplete Level-1 covariates, the use of covariate models in the sequential specification is illustrated in Table 1. As shown in Table 1, when there are more incomplete predictors, the sequential specification becomes much more complex.
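The difference between Eq. 14 and the invalid shortcut can be seen numerically for a single cluster. The sketch below uses made-up data for one cluster and hypothetical parameter values (the dictionary keys and function names are ours); it evaluates both unnormalized log densities of μ2j on a grid and reports where each one peaks.

```python
import numpy as np

# Hypothetical data for one cluster j and hypothetical parameter values
x1 = np.array([1.2, 0.8, 1.5, 0.6, 1.4])   # X_1ij
x2 = np.array([2.5, 3.0, 3.5, 2.0, 4.0])   # X_2ij
mu1 = 1.0                                   # current value of mu_1j
par = {"gamma1": 0.5, "var_x1": 1.0, "var_x2": 1.0,
       "eta11": 0.5, "eta12": 0.2, "var_mu1": 1.0,
       "eta21": 0.0, "var_mu2": 1.0}


def log_normal(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_target_full(mu2, mu1, x1, x2, p):
    """Eq. 14 on the log scale (up to a constant), for an array of mu2 values."""
    mu2 = np.atleast_1d(mu2)
    # p(X_1j | mu_1j, mu_2j, X_2j): sum over observations in the cluster
    lp = log_normal(x1[:, None],
                    mu1 + p["gamma1"] * (x2[:, None] - mu2[None, :]),
                    p["var_x1"]).sum(axis=0)
    # p(X_2j | mu_2j)
    lp = lp + log_normal(x2[:, None], mu2[None, :], p["var_x2"]).sum(axis=0)
    # p(mu_1j | mu_2j) and p(mu_2j)
    lp = lp + log_normal(mu1, p["eta11"] + p["eta12"] * mu2, p["var_mu1"])
    lp = lp + log_normal(mu2, p["eta21"], p["var_mu2"])
    return lp


def log_target_shortcut(mu2, mu1, x1, x2, p):
    """Invalid shortcut: only p(X_1j | ...) p(mu_2j); two factors are dropped."""
    mu2 = np.atleast_1d(mu2)
    lp = log_normal(x1[:, None],
                    mu1 + p["gamma1"] * (x2[:, None] - mu2[None, :]),
                    p["var_x1"]).sum(axis=0)
    return lp + log_normal(mu2, p["eta21"], p["var_mu2"])


grid = np.linspace(-5.0, 10.0, 15001)
mode_full = grid[np.argmax(log_target_full(grid, mu1, x1, x2, par))]
mode_shortcut = grid[np.argmax(log_target_shortcut(grid, mu1, x1, x2, par))]
print("mode under Eq. 14:      ", mode_full)
print("mode under the shortcut:", mode_shortcut)
```

Because the shortcut ignores the information that X2j and μ1j carry about μ2j, its density is centered at a different value, so draws from it would not be draws from the correct conditional distribution.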

Table 1 The use of covariate models in sequential and separate specifications for the special cases with two or three incomplete covariates in Example 5

In the separate specification, we also need to be careful about specifying and using the appropriate covariate model. In this example, we have four covariate models under separate specification,

$$ p\left( \boldsymbol{X_{1j}}|\mu_{1j},\mu_{2j},\boldsymbol{X_{2j}}\right):\quad X_{1ij}=\mu_{1j}+\gamma_{1}\left( X_{2ij}-\mu_{2j}\right)+\epsilon_{x1ij},\qquad \epsilon_{x1ij}\sim N\left( 0,\sigma_{1X}^{2}\right) $$
(15)
$$ p\left( \boldsymbol{X_{2j}}|\mu_{1j},\mu_{2j},\boldsymbol{X_{1j}}\right):\quad X_{2ij}=\mu_{2j}+\gamma_{2}\left( X_{1ij}-\mu_{1j}\right)+\epsilon_{x2ij},\qquad \epsilon_{x2ij}\sim N\left( 0,\sigma_{2X}^{2}\right) $$
(16)
$$ p\left( \mu_{1j}|\mu_{2j}\right):\quad \mu_{1j}=\eta_{11}+\eta_{12}\mu_{2j}+\epsilon_{u1ij},\qquad \epsilon_{u1ij}\sim N\left( 0,\sigma_{1\mu}^{2}\right) $$
(17)
$$ p\left( \mu_{2j}|\mu_{1j}\right):\quad \mu_{2j}=\eta_{21}+\eta_{22}\mu_{1j}+\epsilon_{u2ij},\qquad \epsilon_{u2ij}\sim N\left( 0,\sigma_{2\mu}^{2}\right) $$
(18)

The imputation model of μ1j is calculated as

$$ \begin{array}{@{}rcl@{}} p\left( \mu_{1j}|\mu_{2j},\boldsymbol{X_{1j},X_{2j}}\right) & \propto p\left( \boldsymbol{X_{1j}}|\mu_{1j},\mu_{2j},\boldsymbol{X_{2j}}\right)p\left( \mu_{1j}|\mu_{2j}\right) \end{array} $$
(19)

but not \(p\left (\boldsymbol {X_{1j}}|\mu _{1j},\mu _{2j},\boldsymbol {X_{2j}}\right )p\left (\mu _{2j}|\mu _{1j}\right )\). We summarize the use of covariate models in the separate specification in Table 1 with two or three incomplete covariates. As shown in Table 1, when there are more incomplete predictors and the separate specification is applicable, the separate specification has a simpler form than the sequential specification and can be easier to use. But if one relies on software to impute missing observations, the difference in complexity in specifying imputation models is not a concern. For example, in Blimp, users only need to specify covariate models and Blimp will automatically impute the missing observations based on the correct imputation models. We provide the Blimp code for this example in the Appendix.

Hypothetical data examples

Two hypothetical data examples are presented in order to illustrate and compare the separate specification and sequential specification when two covariates are linearly related and nonlinearly related, respectively. In both examples, the research goal is to test for the effect of family support (X2) in moderating the relationship between life event stress (X1) and depression (Y ). A reasonable model is that life event stress leads to depression but that strong family support might buffer the effects of stress. Hence, the true regression model is Y = 10 + 0.5X1 − 0.5X2 − 0.1X1 × X2 + eY. Because the substantive model contains incomplete nonlinear terms, the traditional FCS method causes incompatibility. We will compare different model-based imputation specifications with FCS in this section.

In the first scenario, we specify the relationship between life event stress (X1) and family support (X2) as X2 = 10 − X1 + eX2 to generate data. \(10^{6}\) observations were simulated. 25% of the observations in both X1 and X2 were made missing if participants’ depression scores (Y ) were higher than the group mean. Hence, X1 and X2 are missing at random (MAR). The ordinary least squares (OLS) estimates and standard errors of the coefficients in the substantive model from the complete data are presented in Table 2. Given such a large sample size, the estimates were almost the same as the population values and the standard errors were small. Since X1 and X2 are linearly related, we can use either the separate or sequential specification to impute the missing X1 and X2 (see Branches 1 and 2). In the separate specification, we specified the covariate models as \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}|X_{1}\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1},\sigma _{X2}^{2}\right )\); and in the sequential specification, we specified the covariate models as \(p\left (X_{1}|X_{2}\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}\right )=N\left (\gamma _{MX2},\sigma _{MX2}^{2}\right )\). Additionally, we consider FCS, which specifies \(p\left (X_{1}|X_{2},Y\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}Y,\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}|X_{1},Y\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}Y,\sigma _{X2}^{2}\right )\). We applied multiple imputation with 20 imputed datasets from the posterior samples. The point estimates and standard errors of the regression coefficients in the substantive model β from these three specifications are presented in Table 2. 
The separate specification and sequential specification provided similar parameter estimates and standard errors of the substantive model regression coefficients. Their estimates had only small biases (i.e., the relative biases, \(\frac {\hat {\beta }-\beta }{\beta }\times 100\%\), were smaller than 5%). But the estimates from FCS had biases larger than 5%.
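For readers who want to reproduce the setup of the first scenario, the following is a minimal data-generation sketch, not the authors' simulation code. The text does not state the distribution of X1, the error variances, or the exact missingness rule, so the choices below (X1 ~ N(5, 1), unit error variances, and a 50% deletion probability among cases with Y above its mean, giving roughly 25% missingness) are assumptions. The sample size is set to 100,000 only to keep the sketch fast.

```python
import numpy as np

rng = np.random.default_rng(2022)
n = 100_000  # assumption: reduced sample size so the sketch runs quickly

# Covariates: X2 = 10 - X1 + e_X2 (distribution of X1 and error variance assumed)
x1 = rng.normal(5.0, 1.0, n)
x2 = 10.0 - x1 + rng.normal(0.0, 1.0, n)

# Substantive model from the text: Y = 10 + 0.5 X1 - 0.5 X2 - 0.1 X1 X2 + e_Y
true_beta = np.array([10.0, 0.5, -0.5, -0.1])
y = (true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2
     + true_beta[3] * x1 * x2 + rng.normal(0.0, 1.0, n))

# Assumed MAR mechanism: only cases with Y above its mean can lose X1 or X2,
# each with probability 0.5, so about 25% of each covariate is missing.
above_mean = y > y.mean()
miss_x1 = above_mean & (rng.uniform(size=n) < 0.5)
miss_x2 = above_mean & (rng.uniform(size=n) < 0.5)
x1_incomplete = np.where(miss_x1, np.nan, x1)
x2_incomplete = np.where(miss_x2, np.nan, x2)

# Complete-data OLS estimates of the substantive model, used as the benchmark
design = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)


def relative_bias(estimate, truth):
    """Relative bias in percent: (estimate - truth) / truth * 100."""
    return (estimate - truth) / truth * 100.0


print("missing rate in X1:", miss_x1.mean())
print("missing rate in X2:", miss_x2.mean())
print("relative bias of complete-data OLS (%):", relative_bias(beta_hat, true_beta))
```

The incomplete arrays `x1_incomplete` and `x2_incomplete` would then be passed to the imputation software of choice; the same `relative_bias` function applies to the pooled estimates.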

Table 2 The point estimates and standard errors from hypothetical data examples

In the second scenario, we specify the relationship between life event stress (X1) and family support (X2) as \(X_{2}=10+0.5\times X_{1}-0.5\times {X_{1}^{2}}+e_{X2}\) to generate data. \(10^{6}\) observations were simulated, and 25% of X1 and 25% of X2 were missing depending on depression scores (Y ). The OLS estimates and standard errors of regression coefficients in the substantive model from the complete data are presented in Table 2. We compare four ways to specify the covariate models to impute the missing X1 and X2. First, under the separate specification, if we specify the covariate models as \(X_{1}|X_{2}\sim N\left (\gamma _{0X1}+\gamma _{1X1}X_{2},\sigma _{X1}^{2}\right )\) and \(X_{2}|X_{1}\sim N\left (\gamma _{0X2}+\gamma _{1X2}X_{1},\sigma _{X2}^{2}\right )\), it implies that X1 and X2 are linearly related and should follow a bivariate normal distribution. Although the specification does not match the true data generating model, the joint distribution of X1 and X2 theoretically exists and the imputation models are not mathematically wrong. Second, if we assume the covariate models as \(X_{1}|X_{2}\sim N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}{X_{2}^{2}},\sigma _{X1}^{2}\right )\) and \(X_{2}|X_{1}\sim N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}{X_{1}^{2}},\sigma _{X2}^{2}\right )\), it is an incompatible separate specification based on Compatibility Requirement 1. Hence, the specification is mathematically invalid and does not match the true data generating model. Third, under the sequential specification, we can specify \(X_{2}|X_{1}\sim N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}{X_{1}^{2}},\sigma _{X2}^{2}\right )\) and \(X_{1}\sim N\left (\gamma _{X1},\sigma _{X1}^{2}\right )\) to capture the nonlinear relationship between X1 and X2. Fourth, we consider FCS as \(p\left (X_{1}|X_{2},Y\right )=N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}Y,\sigma _{X1}^{2}\right )\) and \(p\left (X_{2}|X_{1},Y\right )=N\left (\gamma _{0X2}+\gamma _{1X2}X_{1}+\gamma _{2X2}Y,\sigma _{X2}^{2}\right )\). We applied multiple imputation with 20 imputed datasets from the posterior samples. The parameter estimates and standard errors of the coefficients in the substantive model from these four specifications are presented in Table 2. The estimates from the compatible separate specification were biased, because the imputation models were misspecified and failed to capture the nonlinear relationship between covariates. The estimates from the incompatible separate specification were severely biased, because both misspecification and incompatibility cause biases. FCS also had biased estimates. In contrast, the compatible sequential specification provided estimates close to the true values.

In the first scenario, we illustrate a case of linearly related covariates, in which both the separate specification and sequential specification are applicable and equivalent. In the second scenario, we illustrate a case of nonlinearly correlated covariates, in which there is no way to use the separate specification to capture such a relationship. We need to highlight that even if we use the sequential specification to capture the nonlinear relationship, it only reflects our assumption of the relationship among covariates. If, in the second scenario, we specify the sequential specification model as \(X_{1}|X_{2}\sim N\left (\gamma _{0X1}+\gamma _{1X1}X_{2}+\gamma _{2X1}{X_{2}^{2}},\sigma _{X1}^{2}\right )\) and \(X_{2}\sim N\left (\gamma _{X2},\sigma _{X2}^{2}\right )\), the covariate models are mathematically correct but they are misspecified, because they do not match the true data generating model. Overall, the sequential specification is more flexible than the separate specification in terms of allowing nonlinear relationships among X, but it does not help identify the true data generating model.

Conclusion and recommendation

Despite the broad appeal of multiple imputation and other imputation approaches, researchers might ignore the importance of compatibility when specifying imputation models. Incompatibility has been found to cause bias-inducing problems (e.g., Bartlett et al., 2015; Enders et al., 2020; Van Buuren et al., 2006). Developed from different models, a growing body of recent missing data work has focused on the model-based imputation model that is compatible with the substantive model (Bartlett et al., 2015; Enders et al., 2020; Erler et al., 2016, 2019; Goldstein et al., 2014; Kim et al., 2015, 2018; Zhang & Wang 2017). Building on these recent developments and Arnold’s compatibility work that is not limited to the missing data area (e.g., Arnold et al.; Arnold & Press; Liu et al.; Van Buuren et al.; Van Buuren), this paper systematically summarizes when the traditional FCS is applicable and how to specify a model-based imputation model if needed.

When researchers have a strong assumption of the imputation models and prefer to specify them directly (usually via the FCS procedure), compatibility should always be checked to assure that the imputation models for covariates are compatible with the substantive model and the imputation models are mutually compatible. To help researchers check compatibility more easily, first, we summarize two Compatibility Requirements which can help researchers decide whether the imputation models for covariates are compatible with the substantive model. Compatibility Requirement 1 is that if the conditional variances from the normal substantive model and imputation model are constant, the mean structure regarding the incomplete covariates cannot be nonlinear. Compatibility Requirement 2 is that after integrating out all other covariates and random effects, if both \(p\left (Y|X\right )\) and \(p\left (X|Y\right )\) are conditional normal, they cannot have a conditional variance whose highest exponent on X or Y is higher than 0. Second, we provide a decision tree to check whether the traditional FCS is applicable in a given scenario.

When the Compatibility Requirements or the decision tree reveals that FCS is not applicable, we should use the model-based imputation approach. The model-based imputation approach ensures the existence of a joint distribution of all incomplete covariates and outcome. With this goal, the model-based imputation procedure begins with specifying appropriate covariate models, and calculating the imputation models based on the substantive model and covariate models using Bayes’ theorem. When there are multiple incomplete covariates, this paper illustrates and compares two types of specifications: the separate specification and the sequential specification. If we know or assume that all the incomplete covariates come from a multivariate normal distribution, the two specifications are essentially the same. If the relationship between the covariates is not linear, we should use the sequential specification to capture the nonlinear relationship and guarantee compatibility. Although the sequential specification is more flexible compared to the separate approach because the sequential specification allows for nonlinearities among covariates, it cannot guarantee that the specified covariate model is the true underlying model but only guarantees that the specification is mathematically valid.

We need to caution researchers who calculate or program model-based imputation models by themselves that the specification of the covariate models and imputation models has to obey probability rules. Omitting covariate models or using wrong covariate models in the imputation model calculation will lead to wrongly imputed data and biased parameter estimation.