1 Introduction

Latent variable modeling is an increasingly popular statistical method in public health, health services, and social science research since many constructs of interest in these fields are not directly observable. For example, mental health conditions, such as depression, are not directly observable but rather measured through a diagnostic checklist. Standard analytic approaches would treat depression status, as measured by these diagnostic items, the same as any fully observed variable, such as gender or clinic site. On the other hand, latent variable methods appropriately account for the measurement error inherent in using a set of observed items to represent an underlying latent construct (Collins and Lanza 2010; Hagenaars and McCutcheon 2002; Lazarsfeld and Henry 1968), resulting in more appropriate statistical inferences.

One common type of latent variable modeling is latent variable regression, which models the association between a latent variable and auxiliary variables of interest (either predictors or distal outcomes). When causal inference is the objective, a common estimand is the average treatment effect (ATE). Under a potential outcomes framework, the ATE is the average difference (across the population) between the outcome an individual would have had under the treatment condition and the outcome he or she would have had under the control condition (Stuart 2010). With more than two treatment conditions, treatment effects are estimated as average pairwise differences in potential outcomes between two given treatments. The validity of the estimated causal effect relies on comparable treatment groups, obtained through randomization or careful analysis of observational data. When one is interested in estimating the causal effects of a latent treatment variable on a distal outcome, one must utilize observational data, given the impossibility of randomizing a latent treatment. Estimation of causal effects using observational data requires carefully addressing potential confounding; as in settings where all variables are fully observed, latent variable regression that does not account for potential confounding may conflate true treatment effects with baseline group differences.

When interested in the effect of an observed treatment on a latent outcome, traditional methods to address confounding, such as propensity scores, can be easily implemented, given the fully observed nature of the treatment (Butera et al. 2013; Lanza et al. 2013b). In this paper we focus on the converse, estimating the effect of a latent treatment on a directly observed outcome; implementing propensity score methods is less straightforward in this context since standard propensity score approaches require that each individual have an observed treatment group. One could use a classify-analyze strategy (Clogg 1995) to predict the latent class for each individual and then implement standard propensity score methods with regard to the predicted latent class. Alternatively, one could use a recently proposed joint modeling approach that estimates the effect of a latent treatment on a distal outcome while adjusting for confounders (Kang and Schafer 2010). In this paper, using both a simulation study and a motivating example, we compare the performance of classify-analyze methods that incorporate propensity scores to a joint estimation strategy.

Our motivating example involves estimating the effect of substance use treatment services for adolescents on subsequent substance use problems. Using national evaluation data from outpatient treatment providers funded through the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Center for Substance Abuse Treatment (CSAT), we identified latent classes characterized by groupings of treatment services received by youth. Given the observational nature of our data, it is important to control for baseline differences across groups when estimating the effects of treatment class on substance use outcomes. It is plausible that demographic characteristics, justice system involvement, and baseline substance use may be associated with both the types of treatment a youth receives and substance use outcomes. Failing to account for baseline differences could lead to biased effect estimates, as is the case with non-experimental studies more generally.

In this paper, we first discuss the challenges associated with addressing confounding when estimating the effects of a latent variable on a distal outcome and review current methods. We then conduct a simulation study that compares three proposed methods for addressing confounding when estimating the effects of a latent treatment, as well as two methods that do not adjust for potential confounding in order to demonstrate the potential for bias. Finally, we apply these methods to our adolescent substance treatment dataset in order to address the substantive question at hand. We highlight that the statistical inference can change markedly when confounding is not addressed.

2 Background

2.1 Latent class analysis

Latent class analysis (LCA) is a widely used latent variable model that assumes an underlying structure of discrete, mutually exclusive, and exhaustive latent classes. Latent class membership cannot be directly observed; instead, it is indirectly measured using a comprehensive set of indicators that span the latent construct. LCA models individuals’ latent class membership based on their observed response pattern across the indicators; each individual, by definition, belongs to exactly one latent class (Collins and Lanza 2010; Hagenaars and McCutcheon 2002; Lazarsfeld and Henry 1968).

Let \(C = c_k\) denote latent class membership in class \(c_k\), where \(k = 1, 2, \ldots, K\), and let \(U_j\) denote one of the \(J\) observed latent class indicators, where \(j = 1, 2, \ldots, J\). The classical LCA model represents the probability of observing response pattern \(\mathbf{u}\) as follows: \(\Pr(\mathbf{U} = \mathbf{u}) = \sum_{k=1}^{K} \Pr(C = c_k) \prod_{j=1}^{J} \Pr(U_j = u_j \mid C = c_k)\), where \(\Pr(C = c_k)\) denotes the probability of belonging to class \(c_k\) and \(\Pr(U_j = u_j \mid C = c_k)\) denotes the conditional indicator probability, namely the probability of responding to indicator \(U_j\) with value \(u_j\), given membership in class \(c_k\). An additional quantity of interest is the posterior class membership probability, \(\Pr(C = c_k \mid \mathbf{U} = \mathbf{u})\), namely the probability of membership in class \(c_k\) given an observed response pattern \(\mathbf{u}\). A fundamental assumption of classical LCA is local independence, meaning that the indicators \(U_1, U_2, \ldots, U_J\) are assumed to be mutually independent after conditioning on latent class membership \(c_k\). This assumption is denoted in Fig. 1 by the lack of correlation arrows among the indicators \(U_1, U_2, \ldots, U_J\). Maximum likelihood estimation is typically used to estimate LCA parameters.
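To make this computation concrete, the following minimal R sketch evaluates the response-pattern probability and the posterior class membership probabilities for a toy model; the class prevalences and conditional indicator probabilities shown are hypothetical illustrative values, not estimates from any dataset in this paper.

```r
# Minimal sketch: probability of an observed response pattern under a
# 2-class LCA with 3 binary indicators (all parameter values hypothetical).
class_prob <- c(0.6, 0.4)                      # Pr(C = c_k), k = 1, 2
item_prob  <- rbind(c(0.9, 0.2),               # Pr(U_j = 1 | C = c_k), rows = items
                    c(0.8, 0.3),
                    c(0.7, 0.1))

pattern_prob <- function(u, class_prob, item_prob) {
  # For each class, multiply the conditional probabilities of the observed
  # responses, then weight by the class prevalence and sum over classes.
  per_class <- sapply(seq_along(class_prob), function(k) {
    class_prob[k] * prod(ifelse(u == 1, item_prob[, k], 1 - item_prob[, k]))
  })
  list(marginal  = sum(per_class),              # Pr(U = u)
       posterior = per_class / sum(per_class))  # Pr(C = c_k | U = u)
}

pattern_prob(c(1, 1, 0), class_prob, item_prob)
```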

Fig. 1 Schematic figure of latent class analysis with distal outcomes. \(C\) denotes the latent class variable, \(U_1, U_2, \ldots, U_J\) denote the \(J\) latent class indicators, and \(Y\) denotes the distal outcome

2.2 Latent class analysis with distal outcomes

Latent class models that regress latent classes on predictive covariates have long been used in social and behavioral research and are widely available in standard statistical software. Typically, a latent class model with covariates is estimated with a binary or multinomial logistic regression model (depending on the number of classes). In contrast, methods to regress a distal outcome on latent class (see Fig. 1) have been developed more recently and are the focus of this paper (Asparouhov and Muthén 2013; Kang and Schafer 2010; Lanza et al. 2013a).

In addition to the standard LCA assumption of local independence, latent class regression with a distal outcome requires an additional assumption of unconfounded measurement, which assumes that the indicators are independent of the distal outcome, given latent class (Bakk et al. 2013; Kang and Schafer 2010). This assumption is denoted by the lack of a direct effect arrow connecting the indicators \(U_1, U_2, \ldots, U_J\) and the outcome \(Y\) in Fig. 1.

LCA with distal outcomes may be conducted using either a 1-step or a 3-step method. The relative merits of each approach have been previously discussed (Asparouhov and Muthén 2013; Bolck et al. 2004; Feingold et al. 2013; Vermunt 2010). In brief, 1-step methods fit a joint model that simultaneously estimates the latent class measurement model and the structural model describing the relationship between the latent classes and the auxiliary variable (i.e., the distal outcome). In general, 1-step methods yield unbiased and efficient parameter estimates, yet they may not converge in some cases due to the complexity of the joint likelihood and are not easily implemented for all possible analyses. Thus, 3-step ("classify-analyze") methods are also commonly used, the most common of which are modal assignment and pseudoclass assignment. Three-step methods first fit a latent class model and predict latent class membership based on the estimated posterior class membership probabilities. The association between the latent classes and the auxiliary variable is then estimated through a regression model using predicted latent class membership. Under modal assignment, individuals are assigned to the latent class for which they have the highest posterior class membership probability (Clogg 1995; Nagin 2005). Under pseudoclass assignment, latent class membership is predicted by random draws from the multinomial distribution defined by an individual's posterior class membership probabilities (Bandeen-Roche et al. 1997; Goodman 2007; Wang et al. 2005); pseudoclass assignment is often performed multiple times (e.g., 20 draws), with final estimates obtained by using multiple imputation combining rules to combine results across draws (Rubin 1987).
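The classification step of these 3-step methods can be sketched in a few lines of R; the posterior probability matrix `post` below is a hypothetical example, not output from our analyses.

```r
# Minimal sketch of the classification step in a 3-step analysis, given a
# matrix of estimated posterior class membership probabilities
# (rows = individuals, columns = classes). `post` is a hypothetical example.
set.seed(123)
post <- matrix(c(0.70, 0.20, 0.10,
                 0.15, 0.60, 0.25,
                 0.05, 0.45, 0.50), nrow = 3, byrow = TRUE)

# Modal assignment: each person gets the class with the largest posterior.
modal_class <- apply(post, 1, which.max)

# Pseudoclass assignment: each person's class is a random draw from the
# multinomial distribution defined by his or her posterior probabilities;
# typically repeated several times (e.g., 20 draws).
pseudo_draw <- function(post) {
  apply(post, 1, function(p) sample(seq_along(p), size = 1, prob = p))
}
pseudo_classes <- replicate(20, pseudo_draw(post))  # one column per draw
```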

2.3 Propensity score methods as a means to address confounding

Propensity score methods are standard methods for addressing selection bias in an observational study (Rosenbaum and Rubin 1983; Rubin 2001; Stuart 2010). In the case of two treatment groups, \(T = \{0, 1\}\), the propensity score is defined as the probability that an individual received the treatment (\(T = 1\)), conditional on the individual's observed covariates, and is denoted \(p(\mathbf{x}_i) = \Pr(T_i = 1 \mid \mathbf{X}_i = \mathbf{x}_i)\), where \(\mathbf{X}_i\) represents the individual's vector of observed covariates and \(i = 1, 2, \ldots, N\). The propensity score can be extended to cases in which there are more than two treatment groups; Imbens (2000) refers to this as the generalized propensity score, defined as \(p(t, \mathbf{x}_i) = \Pr(T_i = t \mid \mathbf{X}_i = \mathbf{x}_i)\), where \(t \in \mathcal{T}\).

In order to obtain unbiased estimates of treatment effects, one would like to compare groups that differ only with regard to treatment assignment. Randomized experiments achieve this goal through randomization, which balances groups with regard to both observed and unobserved factors; observational studies attempt to mimic randomized studies by balancing groups on a rich set of observed factors. Rosenbaum and Rubin (1983) show that groups that are matched with regard to propensity score values are also balanced, on average, with regard to all of the covariates that went into estimating the propensity score, making propensity scores a parsimonious and potent analytical approach. Propensity scores can be incorporated in the final analysis through propensity score matching, subclassification, or weighting; we primarily focus on propensity score weighting in this paper; details on other methods can be found in Stuart (2010). Propensity score weighting implements a weighted regression, in which each individual's weight is a function of his or her propensity score. A common weighting approach is inverse probability of treatment weighting (IPTW), which weights each individual by the inverse probability of receiving the treatment he or she actually received; treated individuals receive a weight of \(1/p(\mathbf{x}_i)\) and control individuals a weight of \(1/[1 - p(\mathbf{x}_i)]\) (Lunceford and Davidian 2004). Under IPTW, both the treatment and control groups are weighted to reflect the overall study population; thus the difference in outcomes between the treatment and control groups after weighting estimates the ATE. In this study, we use an extension of IPTW for more than two treatment groups proposed by McCaffrey et al. (2013). In brief, this approach fits \(k\) binary propensity score models for \(k\) treatment groups (Class 1 vs. not, Class 2 vs. not, ..., Class \(k\) vs. not); each individual's IPTW is based on the propensity score estimated from the model corresponding to his or her observed treatment group. Each of the \(k\) treatment groups is weighted to resemble the overall study population, and ATE estimates comparing pairwise differences between treatment groups can be obtained after weighting.
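A minimal R sketch of this multi-group ATE weighting scheme is shown below; it assumes a data frame `dat` with a treatment-group variable `cls` (coded 1, ..., K) and covariates `x1` and `x2`, all hypothetical names, and uses plain logistic regression for the binary propensity score models.

```r
# Minimal sketch of IPTW for K treatment groups (ATE weights), assuming a data
# frame `dat` with group variable `cls` (1, ..., K) and covariates x1, x2.
# Variable names are hypothetical placeholders.
iptw_multigroup <- function(dat, K) {
  w <- numeric(nrow(dat))
  for (k in seq_len(K)) {
    # Binary propensity score model: membership in class k vs. not.
    ps_mod <- glm(I(cls == k) ~ x1 + x2, data = dat, family = binomial)
    p_k <- predict(ps_mod, type = "response")
    # Individuals observed in class k are weighted by 1 / Pr(class k | X),
    # so each class is reweighted to resemble the full study population.
    w[dat$cls == k] <- 1 / p_k[dat$cls == k]
  }
  w
}
```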

Propensity score methods are preferable to regression covariate adjustment for several reasons. First, propensity score methods do not necessarily rely on the parametric modeling assumptions required by regression adjustment (Ho et al. 2007). Additionally, propensity score methods avoid potential bias that arises from extrapolating beyond observed data in traditional regression models when the treatment groups have little overlap with respect to covariates (Stuart 2010). Furthermore, propensity scores are an advantageous dimension reduction technique when there are a substantial number of baseline covariates to adjust for (Rosenbaum and Rubin 1984). Finally, as advocated by Rubin, it is philosophically cleaner to separate the analytic step of controlling for confounding from the step of implementing the final structural model (Rubin 2001). Separation prevents potential bias that may arise from adjusting for covariates solely because they favorably influence the treatment effect estimates.

2.4 Confounding in latent variable regression when treatment is latent

We now discuss extensions of LCA with distal outcomes that can account for potential confounders (see Fig. 2). Controlling for confounding is more complex when the treatment variable of interest is latent, given the uncertainty regarding an individual's true treatment status.

Fig. 2 Schematic figure of latent variable regression with confounding when the treatment is a latent variable. \(C\) denotes the latent class variable, \(U_1, U_2, \ldots, U_J\) denote the \(J\) latent class indicators, \(X_1, X_2, \ldots, X_L\) denote the \(L\) potential confounders, and \(Y\) denotes the distal outcome

Recently, Kang and Schafer (2010) proposed a 1-step method known as latent class causal analysis (LCCA) that jointly models the latent class indicators \(\mathbf{U}\), the potential confounders \(\mathbf{X}\), and the distal outcome \(Y\), modeled as the vector of potential outcomes \((Y_i(1), Y_i(2), \ldots, Y_i(K))\) corresponding to the \(K\) classes. Again, let \((U_1, U_2, \ldots, U_J)\) denote the \(J\) latent class indicators and let \((X_1, X_2, \ldots, X_L)\) denote the \(L\) potential confounders. One component of the LCCA model is the LCA measurement model relating the indicators to the latent classes; the parameters of interest are the conditional indicator probabilities, \(\Pr(U_j = u_j \mid C = c_k)\). LCCA models the relation between the covariates \(\mathbf{X}\) and latent class membership with a multinomial logistic regression model; the parameters of interest are denoted \(\boldsymbol{\alpha}\), an \(L \times K\) matrix of class-specific coefficients, the \(c_k\)th column of which is \(\boldsymbol{\alpha}_{c_k} = (\alpha_{1,c_k}, \alpha_{2,c_k}, \ldots, \alpha_{L,c_k})^T\). LCCA specifies a linear model for the potential outcomes, such that \(Y \mid \mathbf{X}_i = \mathbf{x}_i \sim N(\boldsymbol{\beta}^T \mathbf{x}_i, \boldsymbol{\Sigma})\), where \(\boldsymbol{\beta}\) denotes an \(L \times K\) matrix of class-specific coefficients, the \(c_k\)th column of which is \(\boldsymbol{\beta}_{c_k} = (\beta_{1,c_k}, \beta_{2,c_k}, \ldots, \beta_{L,c_k})^T\), and \(\boldsymbol{\Sigma}\) is a \(K \times K\) covariance matrix for \(Y\). Thus, the general form of the individual log-likelihood contribution can be expressed as follows:

$$\ell_{i}(\theta) = \log \sum_{c_{k}=1}^{K} \left\{ \frac{\exp(\mathbf{x}_{i}^{T}\boldsymbol{\alpha}_{c_{k}})}{\sum_{c_{k}'=1}^{K} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\alpha}_{c_{k}'})} \right\} \left\{ \prod_{j=1}^{J} \Pr(U_{j} = u_{j} \mid C = c_{k}) \right\} \left\{ \left(2\pi\sigma_{c_{k}}^{2}\right)^{-1/2} \exp\left[ \frac{-1}{2\sigma_{c_{k}}^{2}} \left(y_{i} - \mathbf{x}_{i}^{T}\boldsymbol{\beta}_{c_{k}}\right)^{2} \right] \right\}$$

where \(\boldsymbol{\alpha}_{c_k} = (\alpha_{1,c_k}, \alpha_{2,c_k}, \ldots, \alpha_{L,c_k})^T\), \(\boldsymbol{\beta}_{c_k} = (\beta_{1,c_k}, \beta_{2,c_k}, \ldots, \beta_{L,c_k})^T\), and \(\sigma_{c_k}^2 \in \boldsymbol{\Sigma}\) for \(c_k = 1, 2, \ldots, K\). Estimates of the ATE are then obtained from the maximum-likelihood parameter estimates via expected estimating equations (Wang et al. 2008). LCCA is implemented in the lcca package for R (Kang and Schafer 2010; Schafer and Kang 2013).
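To make the pieces of this likelihood concrete, the following R sketch evaluates a single individual's log-likelihood contribution under the model above; it is an illustration of the formula, not the lcca package's internal implementation, and all inputs are hypothetical.

```r
# Minimal sketch of one individual's log-likelihood contribution under the
# LCCA model. Inputs (all hypothetical): covariate vector `x` (length L),
# binary indicator vector `u` (length J), outcome `y`, `alpha` and `beta`
# (L x K matrices), `item_prob` (J x K matrix of Pr(U_j = 1 | C = c_k)),
# and `sigma2` (length-K vector of class-specific variances).
lcca_loglik_i <- function(x, u, y, alpha, beta, item_prob, sigma2) {
  K <- ncol(alpha)
  lin <- as.vector(t(alpha) %*% x)              # x^T alpha_{c_k} for each class
  class_prob <- exp(lin) / sum(exp(lin))        # multinomial logistic class probabilities
  per_class <- sapply(seq_len(K), function(k) {
    meas <- prod(ifelse(u == 1, item_prob[, k], 1 - item_prob[, k]))  # measurement part
    dens <- dnorm(y, mean = sum(x * beta[, k]), sd = sqrt(sigma2[k])) # outcome density
    class_prob[k] * meas * dens
  })
  log(sum(per_class))
}
```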

This 1-step method for latent class regression with confounders faces the same limitations previously discussed for 1-step methods for latent class regression. In particular, LCCA may not converge under all conditions, given the added complexity of the joint model due to the inclusion of the confounders. Furthermore, these methods require specialized software in order to maximize the joint likelihood; although 1-step methods for latent class regression are currently available in some statistical packages (e.g., Mplus and SAS), the lcca package for R is one of the only packages that implements a 1-step method that addresses confounding. Given the implementation challenges of a 1-step approach, as well as the fact that the LCCA method uses covariate adjustment rather than propensity score methods, which may be more flexible and yield better statistical performance in some settings, we investigate the incorporation of propensity score methods into two commonly used 3-step methods, namely modal and pseudoclass assignment.

Three-step methods resolve the challenge of uncertain latent class membership by creating a predicted latent class variable, a trade-off that introduces some misclassification of individuals in order to gain tractability. A 3-step approach allows the use of standard propensity score methods, as the propensity scores are estimated with regard to the predicted latent class. The general outline for incorporating propensity scores into a 3-step approach is as follows: (1) fit an LCA model and obtain estimates of posterior class membership probabilities; (2) use modal or pseudoclass assignment to create the predicted latent class variable; (3) estimate the propensity score model by regressing predicted latent class on potential confounders; (4) calculate propensity score weights (IPTW) and assess covariate balance after weighting; (5) fit a weighted regression model for the distal outcome on predicted latent class, using the propensity score weights. Under the pseudoclass approach, steps (3)–(5) are implemented multiple times (once per draw); final estimates are obtained through the use of standard multiple imputation combining rules.
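Under the pseudoclass approach, steps (3)–(5) yield one effect estimate and standard error per draw, which are then pooled with Rubin's (1987) combining rules. A minimal R sketch of that pooling step follows, with hypothetical per-draw estimates and standard errors.

```r
# Minimal sketch of Rubin's (1987) combining rules applied to pseudoclass draws:
# `est` and `se` hold the effect estimate and its standard error from each of
# the m draws (hypothetical inputs).
pool_pseudoclass <- function(est, se) {
  m <- length(est)
  qbar <- mean(est)               # pooled point estimate
  W <- mean(se^2)                 # within-draw variance
  B <- var(est)                   # between-draw variance
  Tvar <- W + (1 + 1/m) * B       # total variance
  c(estimate = qbar, se = sqrt(Tvar))
}

pool_pseudoclass(est = c(-0.40, -0.35, -0.45), se = c(0.12, 0.11, 0.13))
```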

3 Simulation study

3.1 Methods

First, we conducted a latent class simulation study to compare Kang and Schafer's 1-step method to the two proposed 3-step approaches: modal assignment with propensity score weighting and pseudoclass assignment with propensity score weighting. In addition to these three methods, we also considered the 1-step method without covariate adjustment and modal assignment without propensity score weighting in order to assess the impact of ignoring potential confounding.

Data were simulated in R (R Core Team 2013) and consisted of 15 binary latent class indicators defining a 3-class structure, 8 independent, normally distributed covariates, and a single continuous distal outcome. For the purpose of data generation, we created a random variable representing true treatment class \(T = t\), where \(t = \{1, 2, 3\}\), generated from a multinomial distribution with equal probabilities for the three treatment groups. Based on one's true latent class, the 15 binary latent class indicators were generated as independent Bernoulli random variables. Within a given class, all indicators were generated with the same probability (conceptually, "low," "medium," or "high"); the more distinct these indicator probabilities were across classes, the greater the class separation. We considered the following sets of indicator probabilities for Classes 1, 2, and 3: (5, 50, 95 %), (10, 50, 90 %), (20, 50, 80 %), and (30, 50, 70 %).

The covariates, representing potential confounders, were associated with both true latent class and the outcome; the strength of these associations was controlled by way of the \(\boldsymbol{\alpha}\) parameters (i.e., the coefficient vector linking the covariates and class membership) and the \(\boldsymbol{\beta}\) parameters (i.e., the coefficient vector linking the covariates and the distal outcome). We specified class-specific parameter vectors \(\boldsymbol{\alpha}_c = (\alpha_{1,c}, \alpha_{2,c}, \ldots, \alpha_{8,c})\) and \(\boldsymbol{\beta}_c = (\beta_{1,c}, \beta_{2,c}, \ldots, \beta_{8,c})\), where \(c = \{1, 2, 3\}\). Each individual's vector of covariates \(\mathbf{X} = (X_1, X_2, \ldots, X_8)\) was generated as the elementwise product of the vector \(\boldsymbol{\alpha}_c\) corresponding to his or her true treatment class (\(c = t\)) and a vector of independent standard normal random variables \(\mathbf{Z} = (Z_1, Z_2, \ldots, Z_8)\), where \(Z_l \sim N(0, 1)\). Subsequently, the potential outcome for class \(c\) was generated as a linear combination of an individual's covariates \(\mathbf{X}\) and the parameters \(\boldsymbol{\beta}_c\), such that \(Y_c = \beta_{0,c} + \beta_{1,c} X_1 + \beta_{2,c} X_2 + \cdots + \beta_{8,c} X_8\). An individual's observed outcome was taken to be the potential outcome associated with his or her true treatment class (\(c = t\)). We specified the true treatment effect sizes in terms of \(\beta_{0,c}\): for all simulations, \(\beta_{0,1} = 1\), \(\beta_{0,2} = 1.5\), and \(\beta_{0,3} = 2\).

Simulations investigated the effect of varying both class separation (i.e., entropy) and degree of confounding. By varying the magnitude of both the \(\boldsymbol{\alpha}\) parameters and the \(\boldsymbol{\beta}\) parameters, we could control the magnitude of the confounding. For simplicity, within a given class, all \(\boldsymbol{\alpha}\) parameters were equal (\(\alpha_1 = \alpha_2 = \cdots = \alpha_8\)) and all \(\boldsymbol{\beta}\) parameters were equal (\(\beta_1 = \beta_2 = \cdots = \beta_8\)). We considered the following values for the \(\boldsymbol{\alpha}\) and \(\boldsymbol{\beta}\) vectors, where larger values indicate greater confounding: (\(\boldsymbol{\alpha}_1 = \boldsymbol{\alpha}_2 = \boldsymbol{\alpha}_3 = \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \boldsymbol{\beta}_3 = 1\)); (\(\boldsymbol{\alpha}_1 = \boldsymbol{\beta}_1 = 1\); \(\boldsymbol{\alpha}_2 = \boldsymbol{\beta}_2 = 1.1\); \(\boldsymbol{\alpha}_3 = \boldsymbol{\beta}_3 = 1.2\)); (\(\boldsymbol{\alpha}_1 = \boldsymbol{\beta}_1 = 1\); \(\boldsymbol{\alpha}_2 = \boldsymbol{\beta}_2 = 1.25\); \(\boldsymbol{\alpha}_3 = \boldsymbol{\beta}_3 = 1.5\)); and (\(\boldsymbol{\alpha}_1 = \boldsymbol{\beta}_1 = 1\); \(\boldsymbol{\alpha}_2 = \boldsymbol{\beta}_2 = 1.5\); \(\boldsymbol{\alpha}_3 = \boldsymbol{\beta}_3 = 2\)). Each simulated dataset contained 5,000 observations, and 1,000 simulations were performed at each setting.
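For concreteness, the following condensed R sketch generates a single dataset under this design for one illustrative condition (indicator probabilities of 10, 50, and 90 % and \(\boldsymbol{\alpha} = \boldsymbol{\beta} = (1, 1.25, 1.5)\) across classes); it mirrors the generative steps described above but is a simplified sketch rather than our exact simulation code.

```r
# Condensed sketch of the data-generating process for one simulated dataset
# (n = 5000, 3 classes, 15 binary indicators, 8 confounders), shown for one
# illustrative condition.
set.seed(1)
n <- 5000; K <- 3; J <- 15; L <- 8
t_class <- sample(1:K, n, replace = TRUE)              # true latent (treatment) class
item_p  <- c(0.10, 0.50, 0.90)                         # class-specific indicator probability
U <- matrix(rbinom(n * J, 1, item_p[t_class]), n, J)   # 15 Bernoulli indicators
alpha <- c(1, 1.25, 1.5); beta <- c(1, 1.25, 1.5)      # confounding strength by class
beta0 <- c(1, 1.5, 2)                                  # class-specific intercepts (true effects)
Z <- matrix(rnorm(n * L), n, L)
X <- alpha[t_class] * Z                                # covariates depend on true class
# Observed outcome: potential outcome for the individual's true class,
# an intercept plus a linear combination of the covariates.
Y <- beta0[t_class] + rowSums(beta[t_class] * X)
```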

The 1-step method was implemented using the lcca function in the lcca package for R (Schafer and Kang 2013), specifying a 3-class model. In the lcca function, the user separately specifies covariates to control for with respect to the latent class indicators and with respect to the outcome; we allowed all 8 covariates to predict both the indicators and the outcome. We obtained estimates of the ATE from the lcca function. We implemented modal and pseudoclass assignment based on 3-class LCA results obtained using the lca function in the lcca package. Propensity scores, modeling modal or pseudoclass predicted class, were estimated using logistic regression; propensity score weighting for multiple groups was conducted using the method described by McCaffrey et al. (2013). We fit 3 binary propensity score models (Class 1 vs. not, Class 2 vs. not, and Class 3 vs. not) and for each individual used the propensity score estimated from the model corresponding to his or her predicted class membership to calculate an inverse probability of treatment weight (IPTW). Weights were trimmed at the 98th percentile to avoid extreme weights (Cole and Hernán 2008). Differences in outcomes across classes were then estimated using propensity score weighted models that regressed the distal outcome on modal or pseudoclass assignment; this was implemented using the survey package in R (Lumley 2004, 2013). IPTW also generates estimates of the ATE, making these results directly comparable to the results from LCCA. Twenty pseudoclass draws were obtained (Graham et al. 2007), which generated 20 effect estimates that were then combined using the multiple imputation combining rule (Rubin 1987; Wang et al. 2005). Unadjusted models were estimated by implementing the lcca function specifying no covariates and by implementing modal assignment without propensity score weighting. For the purposes of this simulation, all outcome and propensity score models were correctly specified.
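The weighting and outcome-regression steps for the 3-step methods can be sketched as follows; the data frame `dat` and its columns (`y`, `modal_class`, and untrimmed weights `w`) are hypothetical names, and the sketch simplifies our actual analysis code.

```r
# Minimal sketch of the weighted outcome analysis: trim IPTW weights at the
# 98th percentile (Cole and Hernan 2008), then regress the distal outcome on
# predicted class using the survey package.
library(survey)

fit_weighted_outcome <- function(dat) {
  dat$w_trim <- pmin(dat$w, quantile(dat$w, 0.98))           # trim extreme weights
  des <- svydesign(ids = ~1, weights = ~w_trim, data = dat)  # design carrying IPTW weights
  svyglm(y ~ factor(modal_class), design = des)              # class contrasts vs. reference class
}
```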

Our primary interest was estimation of the three pairwise class effects with regard to the distal outcome (namely \(\bar{Y}_2 - \bar{Y}_1\), \(\bar{Y}_3 - \bar{Y}_1\), and \(\bar{Y}_3 - \bar{Y}_2\)). We assessed statistical performance in terms of percent bias (% bias), standard error (SE), root mean squared error (RMSE), and the 95 % confidence interval (CI) coverage rate (i.e., the percentage of 95 % confidence intervals that contained the true difference in means). For each simulation condition investigated, performance statistics were calculated with regard to each of the three pairwise class effects; the results we report are averages across the three class effects. Bias is reported as the standardized percent bias, \(((\hat{\theta} - \theta)/\theta) \times 100\), to account for the fact that the three true treatment effects were not equal across pairwise comparisons.
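For reference, a minimal R sketch of these performance summaries is given below; `est` and `se` denote the effect estimates and estimated standard errors across simulation replicates for a single contrast and method, and `truth` is the true effect (all hypothetical inputs).

```r
# Minimal sketch of the simulation performance summaries for one contrast.
sim_performance <- function(est, se, truth) {
  pct_bias <- 100 * (mean(est) - truth) / truth            # standardized percent bias
  emp_se   <- sd(est)                                      # empirical standard error
  rmse     <- sqrt(mean((est - truth)^2))                  # root mean squared error
  covered  <- mean(est - 1.96 * se <= truth &
                   truth <= est + 1.96 * se)               # 95% CI coverage rate
  c(pct_bias = pct_bias, SE = emp_se, RMSE = rmse, coverage = covered)
}
```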

3.2 Results

Figure 3 presents four panels depicting percent bias, SE, RMSE, and 95 % CI coverage rates for each method as a function of both entropy and degree of confounding (numerical results are presented in Table 1). In the absence of confounding, the percent bias for unadjusted and adjusted LCCA was similar, as was the percent bias for unadjusted and adjusted modal assignment, as expected. When confounding was present, the percent bias for both unadjusted methods was an order of magnitude larger than the percent bias of the three adjusted methods for nearly every condition. The percent bias for LCCA (unadjusted and adjusted) was primarily affected by the degree of confounding, whereas the 3-step methods (unadjusted and adjusted) were affected by both the degree of confounding and entropy. Adjusted LCCA showed very small percent bias (<10 %) regardless of the degree of confounding or entropy. Modal and pseudoclass assignment with IPTW both showed notable reductions in the magnitude of percent bias compared to the unadjusted methods, yet the percent bias for these methods was consistently larger than for adjusted LCCA. Modal and pseudoclass assignment with IPTW generally performed similarly with respect to percent bias, with the exception that modal assignment performed consistently better than pseudoclass assignment under the conditions with the lowest entropy (denoted E4 in Fig. 3 and Table 1).

Fig. 3 Average percent bias (% bias), standard error, root mean square error, and 95 % confidence interval (CI) coverage across the three pairwise class contrasts as a function of both entropy and degree of confounding. Abbreviations: UN.1 = unadjusted 1-step, UN.M = unadjusted modal assignment, ADJ.1 = 1-step with covariates, ADJ.M = modal assignment with IPTW, ADJ.PC = pseudoclass assignment with IPTW. Entropy: E1 = 0.96, E2 = 0.90, E3 = 0.70, E4 = 0.50. Confounding: C0 = (\(\boldsymbol{\alpha}_1 = \boldsymbol{\alpha}_2 = \boldsymbol{\alpha}_3 = \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \boldsymbol{\beta}_3 = 1\)); C1 = (\(\boldsymbol{\alpha}_1 = \boldsymbol{\beta}_1 = 1\); \(\boldsymbol{\alpha}_2 = \boldsymbol{\beta}_2 = 1.1\); \(\boldsymbol{\alpha}_3 = \boldsymbol{\beta}_3 = 1.2\)); C2 = (\(\boldsymbol{\alpha}_1 = \boldsymbol{\beta}_1 = 1\); \(\boldsymbol{\alpha}_2 = \boldsymbol{\beta}_2 = 1.25\); \(\boldsymbol{\alpha}_3 = \boldsymbol{\beta}_3 = 1.5\)); C3 = (\(\boldsymbol{\alpha}_1 = \boldsymbol{\beta}_1 = 1\); \(\boldsymbol{\alpha}_2 = \boldsymbol{\beta}_2 = 1.5\); \(\boldsymbol{\alpha}_3 = \boldsymbol{\beta}_3 = 2\)). In all panels, darker shading indicates worse performance

Fig. 4 Estimated class differences, relative to the low services class, with respect to change in the substance problem scale (from baseline to 3 months), as estimated by three methods that adjust for potential confounding and two unadjusted methods. * denotes p < 0.05; † denotes stepwise Bonferroni-corrected p < 0.05. Abbreviations: Low = low service utilization class, Indiv = individual-focused services class, Indiv & Fam = individual- and family-focused services class, Multiple = multiple services class

Table 1 Average percent bias (% bias), standard error (SE), root mean square error (RMSE), and 95 % confidence interval (CI) coverage across the three pairwise class contrasts

With regard to SE, adjusted LCCA consistently yielded the smallest SE estimates, while the other four methods yielded SE estimates approximately 2–6 times larger in magnitude. When there was no confounding (denoted C0) or minimal confounding (denoted C1), these four methods had similar SEs; as confounding increased (denoted C2 and C3 in our simulations), both modal and pseudoclass assignment with IPTW yielded notably larger SEs than unadjusted LCCA and unadjusted modal assignment. For all methods, SEs increased as entropy decreased; the magnitude of this increase was smallest for adjusted LCCA.

In general, the RMSE estimates for the three adjusted methods were much smaller than the RMSE estimates for the two unadjusted methods, with the RMSE for adjusted LCCA being particularly small. The large RMSE for the unadjusted methods was primarily driven by the magnitude of the bias. The magnitude of the RMSE for the adjusted 3-step methods reflects both notable bias and larger SE estimates, whereas the RMSE for adjusted LCCA is quite small due to both smaller bias and smaller SE. RMSE estimates for both modal and pseudoclass assignment with IPTW were substantially smaller than those for the unadjusted methods, yet were often an order of magnitude larger than those for adjusted LCCA under the conditions with the greatest degree of confounding.

In the absence of confounding and with high entropy (C0E1, C0E2), both unadjusted methods showed close to nominal 95 % CI coverage, yet coverage for these methods was 0 % under almost all conditions that involved confounding (C1, C2, and C3). In general, adjusted LCCA yielded conservative 95 % CI coverage (greater than 97 % for all conditions) and was not strongly affected by the degree of confounding or entropy. Coverage rates for both modal and pseudoclass assignment with IPTW were also conservative (near 100 %) when there was little or no confounding and entropy was high; however, coverage notably decreased as entropy decreased and confounding increased. Both of these methods showed quite poor coverage rates under the conditions with the greatest confounding (C2 and C3).

4 Motivating example

4.1 Methods

We applied the five previously discussed methods to our substantive question of interest: what is the effect of the classes of substance use treatment services that youth receive in typical outpatient treatment on subsequent substance use problems? First, we empirically identified classes of treatment services (grouped into domains of individual-focused, family-based, and case management services) that are commonly provided in outpatient treatment using LCA; we then estimated the association between class membership and subsequent substance use problems, while controlling for potential confounding associated with the nonrandomized allocation of treatment services. Data came from a national database of adolescents who received drug treatment services funded by the Substance Abuse and Mental Health Services Administration's Center for Substance Abuse Treatment (see Table 5 in the Appendix for details). This analysis was restricted to the 5,527 youth ages 12–18 who exclusively received outpatient services (i.e., no inpatient or residential treatment services) between study baseline and 3 months (see Table 2 for youth characteristics). For study participation, parents provided written informed consent and adolescents provided assent; institutional review boards at each site approved the study protocol.

Table 2 Descriptive statistics of the overall adolescent sample (n = 5,527)

All youth were assessed with the Global Appraisal of Individual Needs (GAIN; Dennis et al. 2003b), a comprehensive instrument that assesses the following domains: demographics, substance use and substance use treatment, risk behaviors, mental and physical health, legal status, environmental risk factors, and education/vocation status. All data collected with the GAIN are based on youth self-report; reliability studies have found very good reliability statistics for the majority of the GAIN indices (i.e., Cronbach's α greater than 0.85; Dennis et al. 2010). The GAIN's Treatment Received Scale (TxRS) was used to assess the substance use treatment services that youth received from study baseline to 3 months; this 20-item scale includes subscales that measure provision of direct (i.e., individual-focused), family (i.e., family-based), and external (i.e., case management) services (Dennis et al. 2010). A total of 12 items, four from each of the subscales, were used as latent class indicators (see Table 3). In previous work, we determined that a 4-class model best described our data, based on information criteria (BIC, adjusted BIC, AIC), entropy, and class interpretability (Schuler 2013). We identified the following classes: a low service utilization class (10.5 % of youth), an individual-focused services class (42.3 %), an individual- and family-focused services class (36.5 %), and a multiple services class (10.7 %).

Table 3 Probability of endorsing the latent class indicators in the total sample and by latent class

Our objective was to estimate the causal effects of these four treatment classes on subsequent substance use problems. The distal outcome of interest is the change in the past month substance problem scale (SPS) score from baseline to 3 months. The SPS scale is a count of 16 symptoms, including the 7 DSM-IV criteria for substance dependence, the 4 DSM-IV criteria for substance abuse, 2 items concerning substance-related health and psychological problems, and 3 items related to less severe symptoms (e.g., hiding use, people complaining about use, and weekly use; Dennis et al. 2010).

Given the observational nature of the data, it is likely that both the treatment services received by youth and their substance use outcomes are associated with baseline youth characteristics such as demographics, baseline substance use, and justice system involvement. Thus, adjusted analyses controlled for the following potential confounders: demographic variables [age, sex, and race/ethnicity (self-reported as White, Black, Hispanic, and Other)]; baseline substance use variables [prior substance use treatment (lifetime), current recognition of substance problems, days of substance use (past 90 days), substance dependence scale (past year), and Treatment Motivation Index]; legal status variables [any justice system involvement, any arrests, any days in a controlled environment (each with respect to the past 90 days), and the Crime Violence Scale]; and mental health variables [days affected by emotional problems (past 90 days) and the Behavioral Complexity Scale].

Analyses for adjusted LCCA, modal assignment with propensity score weighting, and pseudoclass assignment with propensity score weighting included the 12 latent class indicators, the distal outcome (SPS change score), and the potential confounders. Analyses using the unadjusted LCCA model and modal assignment included only the 12 latent class indicators and the distal outcome. A 4-class model was specified for all methods. The same covariates were included in the LCCA model as were included in the propensity score models. For each method we present all 6 of the estimated pairwise differences in distal outcomes between classes; we applied a stepwise Bonferroni correction to adjust for multiple comparisons (Hochberg 1988).
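The stepwise Bonferroni (Hochberg) adjustment is available in base R via `p.adjust`; a minimal sketch follows, with hypothetical placeholder p values for the 6 contrasts.

```r
# Minimal sketch of the multiple-comparison adjustment for the 6 pairwise
# class contrasts, using R's built-in step-up (Hochberg 1988) correction.
# The p values shown are hypothetical placeholders, not our results.
p_raw <- c(0.04, 0.001, 0.01, 0.20, 0.35, 0.60)
p_adj <- p.adjust(p_raw, method = "hochberg")
```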

4.2 Results

As Fig. 4 and Table 4 show, the unadjusted and adjusted estimates differ markedly with regard to the resulting statistical inference. The unadjusted 1-step method suggests that the individual-focused services class, the individual- and family-focused services class, and the multiple services class each have significantly larger decreases on the substance problem scale from baseline to 3 months than the low service utilization class (respective estimates are −0.37, p = 0.04; −0.60, p = 0.001; and −0.59, p = 0.01). Similarly, the unadjusted analysis based on modal assignment also suggests that the individual- and family-focused services class and the multiple services class each have significantly larger decreases on the SPS than the low service utilization class (respective estimates are −0.49, p = 0.004; and −0.45, p = 0.03). When a stepwise Bonferroni correction was applied, the following contrasts remained significant: individual- and family-focused services versus low service utilization and multiple services versus low service utilization (unadjusted LCCA), and individual- and family-focused services versus low service utilization (unadjusted modal assignment). However, none of the adjusted methods (LCCA, modal assignment with IPTW, or pseudoclass assignment with IPTW) show any significant differences across classes with regard to changes in SPS.

Table 4 Estimated class differences with respect to change in substance problem scale (from baseline to 3 months), as estimated by three methods that adjust for potential confounding and two unadjusted methods

This example highlights that conducting LCA with distal outcomes with and without controlling for potential confounding can lead to notably different substantive interpretations. The unadjusted analyses suggest that youth in each of the three other treatment classes show significantly larger decreases on the substance problem scale at 3 months compared to the low services class; yet the adjusted analyses find no significant differences in substance problems among the groups after controlling for baseline substance use, demographics, and factors such as juvenile justice involvement. These unadjusted and adjusted comparisons suggest different clinical interpretations. The unadjusted analyses indicate that the low services group shows significantly smaller improvement in substance problems relative to the other groups, implying that youth should be provided a greater number of treatment services (in keeping with the other latent classes) in order to achieve greater reductions in substance problems. However, the adjusted analyses indicate that the treatment groups are similarly effective given baseline need, such that the services provided to youth in the low services class are as effective, given their baseline characteristics, as the services provided to youth in the three other classes, given their baseline characteristics. One interpretation of the adjusted results is that the similar effect sizes seen across treatment groups reflect efficient referral to, self-selection into, or tailoring of services based on youth need. Alternatively, treatment effectiveness may be relatively independent of the specific treatment services a youth receives and may instead reflect a general supervision effect; in that case, the similar effect sizes would represent the magnitude of a general supervision effect, adjusting for case mix differences across classes. Given the markedly different clinical ramifications of our unadjusted and adjusted analyses, this example highlights the importance of accounting for baseline differences across treatment groups in order to facilitate an unbiased comparison when conducting latent class regression with distal outcomes.

5 Discussion

The results from our simulation study and our motivating example of adolescents in substance use treatment both demonstrate that effect estimates from latent class regression with distal outcomes may vary substantially depending on whether or not potential confounding is adjusted for. Confounding in settings where all variables are fully observed is widely recognized and addressed statistically, yet recognition of, and statistical methods for, confounding in latent variable regression are only recently emerging. Controlling for confounding in latent variable regression presents unique challenges, particularly when the latent variable is the treatment of interest. In this paper we examined a recently proposed 1-step method, LCCA, which addresses confounding through joint modeling of the latent class indicators, confounders, and the distal outcome. Additionally, we examined methods that incorporate propensity score weighting with classical 3-step methods, namely modal and pseudoclass assignment.

In general, our results indicate that LCCA performs quite well under a range of conditions, yielding very small bias, reasonable SEs, and small RMSE estimates. Confidence interval coverage rates were somewhat conservative in our simulation results. However, LCCA (or, more broadly, 1-step methods) may not be feasible in all settings due to implementation challenges, such as lack of model convergence. Additionally, in some cases, the latent class estimation under a 1-step method may be unduly influenced by the distal outcome. Initially, we considered an additional outcome, the Substance Frequency Scale. However, implementation of LCCA with this outcome yielded a notably different 4-class structure, with regard to both conditional indicator probabilities and estimated class prevalences, than the 4-class model presented. We were unable to present results for the Substance Frequency Scale given that the 1-step and 3-step results were not comparable due to differences in the estimated latent class structure. Conceptually, it is undesirable for the distal outcome to substantially influence the latent classes, particularly when the goal is to estimate the causal effect of class membership on the distal outcome; Petras and Masyn (2010) further discuss this limitation of 1-step methods for distal outcomes.

We found that modal and pseudoclass assignment combined with IPTW were able to substantially reduce the bias in the effect estimates in the presence of confounding, indicating that combining 3-step methods with propensity score methods is a promising approach. Consistent with previous studies, the 3-step methods performed much more favorably under conditions of high entropy and showed poor performance when entropy was low (0.50), since low entropy increased the rate of misclassification (Asparouhov and Muthén 2013; Bolck et al. 2004; Vermunt 2010). These methods also showed worse performance under conditions with greater degrees of confounding, indicating that the IPTW approach was not able to fully adjust for confounding. This may be because the propensity score is calculated with respect to the predicted latent class, meaning that IPTW balances predicted latent classes, rather than true latent classes, on baseline covariates. Misclassification of individuals' latent class in the propensity score model results in residual bias, since the propensity score cannot fully adjust for the true association between latent classes and confounders. In a sense, by adding an additional estimation step (i.e., propensity score modeling) that relies on predicted latent class, these 3-step methods introduce additional bias relative to 3-step methods in contexts with no confounding. This additional bias is evident when comparing simulation results for conditions with no confounding to conditions with confounding at a given entropy level. However, this limitation is inherent to the nature of latent variables: since latent classes are unobserved, we can never estimate the propensity score or assess balance with regard to the true latent class.

For simplicity, and given that corrected 3-step methods have not yet been widely adopted by applied researchers, we chose to focus only on classical 3-step methods in this study. Several methods have been proposed in recent years to correct the bias of 3-step methods for latent class regression with covariates or with distal outcomes; see Asparouhov and Muthén (2013), Bakk et al. (2013), Lanza et al. (2013a), Petersen et al. (2012), and Vermunt (2010) for details on these correction methods. In the absence of confounding, these corrected 3-step methods perform quite similarly to 1-step methods with respect to bias, SE, RMSE, and 95 % CI coverage, yet are often less computationally intensive. Given that classical 3-step methods combined with IPTW substantially reduced confounding bias, future work will explore extending correction methods for use in conjunction with propensity score methods to further improve the performance of 3-step methods in the context of confounding.

Note that our simulation studies did not vary sample size, although other work has shown that sample size affects the performance of both 1-step and 3-step methods in contexts without confounding. Thus, it is plausible that sample size would also affect performance in our context. Previous work has shown that sample size particularly impacts performance when combined with poor class separation (low entropy). As Vermunt (2010) describes, maximum likelihood estimation of LCA models often generates solutions that overestimate class differences, particularly in cases of low entropy and small sample size; this yields bias in both the 1-step (typically resulting in parameter overestimation) and 3-step (typically resulting in parameter underestimation) methods, although 3-step methods are more sensitive to sample size. In the context of confounding, the addition of covariates in the 1-step model does provide additional information on class membership and improve classification, thus potentially offsetting some of the deleterious effects of small sample size.

Although our simulation results highlight the strong performance of the 1-step method, a 3-step approach offers greater modeling flexibility. One notable advantage is that 3-step methods allow estimation of the latent classes without influence from the distal outcome. Additionally, a 3-step method with propensity scores may be more advantageous than a 1-step model in the case of model misspecification. In our simulation study, both the latent class/covariate model and the covariate/distal outcome model were correctly specified. Given the parametric constraints of the joint model specified in the lcca package, it is likely in practice that one or both of these association models will be incorrectly specified. Propensity score methods have been shown to be more robust to model misspecification than covariate adjustment; additional strategies to buffer the effects of model misspecification include non-parametric estimation of the propensity score (Lee et al. 2009; McCaffrey et al. 2004; Stuart 2010) and doubly robust estimation (Kang and Schafer 2007). Thus, another notable advantage of a 3-step approach is that it allows the incorporation of propensity score methods, which may perform better under some conditions, particularly model misspecification, than the covariate adjustment implemented in 1-step methods.

Overall, this paper highlights that applied researchers should think critically about confounding in the context of latent variable regression; as in contexts with fully observed variables, failure to adjust for potential confounders may lead to substantially biased results and potentially misleading inferences. Although methodological development in this area has so far been limited, given the complications of latent treatment groups, we discuss three proposed methods: a 1-step approach and two 3-step approaches that incorporate propensity score weighting. As we discuss, each of these approaches reduces confounding bias, the 1-step method more effectively than the 3-step methods, yet each approach has limitations. Future methodological work should focus on developing and refining methods that can address confounding for LCA with distal outcomes, and should assess performance under a broader array of conditions, including model misspecification.