
1 Introduction

Querying or estimating the causal effect of a treatment (a.k.a. exposure, intervention or action) on an outcome of interest is a fundamental task in causal inference. Causal effect estimation has wide applications across many fields, including, but not limited to, economics [19], epidemiology [16, 28], and computer science [30]. The gold standard for causal effect estimation is the randomised controlled trial (RCT), but RCTs are often impractical due to cost or ethical constraints [19, 30]. Estimating causal effects from observational data offers an alternative way to evaluate the effect of a treatment on an outcome of interest.

Fig. 1.

Three causal DAGs illustrating the problems of causal effect estimation from observational data. In all three DAGs, \(\textbf{X}\), U, W and Y are the set of pretreatment variables, a latent confounder, the treatment and the outcome, respectively. (a) shows a setting in which the unconfoundedness assumption holds; (b) shows that the causal effect of W on Y is non-identifiable since there is a latent confounder U; (c) illustrates the problem studied in this work, in which the set \(\textbf{X}\) is disentangled into the three representations \(\{\textbf{S, C, F}\}\).

Confounding bias is a major obstacle in estimating causal effects from observational data. It arises from confounders that affect both the treatment variable W and the outcome variable Y. When all confounders are measured (i.e. the unconfoundedness assumption [19, 31] is satisfied), adjusting for the set of all measured confounders is sufficient to obtain an unbiased estimate of the causal effect from observational data [1, 19]. For example, in the causal graph of Fig. 1(a), unconfoundedness is satisfied given \(\textbf{X}\). Nevertheless, the unconfoundedness assumption is untestable, and in many real-world applications there exists a latent (a.k.a. unobserved or unmeasured) confounder affecting both W and Y, e.g. the latent confounder U in the causal graph of Fig. 1(b). In such a situation, the causal effect of W on Y is non-identifiable [30]. Most existing data-driven methods rely on the unconfoundedness assumption, so obtaining unbiased causal effect estimates from data with latent confounders is challenging and questionable for them.

The instrumental variable (IV) approach is a practical and powerful technique for addressing the challenging problem of causal effect estimation in the presence of latent confounders. The IV approach requires a valid IV to eliminate the confounding bias caused by latent confounders [2, 18]. Valid IVs are exogenous variables that are associated with W but not directly associated with Y [16, 27]. A valid IV S needs to satisfy three conditions: (1) S is correlated with W; (2) S and Y do not share confounders (i.e. the instrument is unconfounded); and (3) the effect of S on Y is entirely through W (i.e. the instrument is exogenous) [16, 27]. However, the last two conditions are strict and not testable in real-world applications. Therefore, in many existing IV-based methods, an IV is nominated based on prior or domain knowledge. In many real-world applications, however, an IV nominated from domain knowledge may violate one of the three conditions, resulting in a biased estimate and potentially leading to incorrect conclusions [6, 16].

It is a challenging problem to discover a valid IV directly from data. Investigators usually collect as many covariates as possible, but few of them are valid IVs satisfying the three conditions. Instead of discovering a valid IV, Kang et al. [20] proposed a data-driven method, referred to as sisVIVE, based on the assumption of some invalid and some valid IVs (i.e. more than half of the candidate IVs are valid) to provide a bound on causal effect estimates. Hartford et al. [14] proposed DeepIV, a deep learning based IV approach for counterfactual prediction, but it requires a nominated IV and the corresponding conditioning set. Kuang et al. [23] developed a method that models a summary IV as a latent variable based on the statistical dependencies of the set of candidate IVs. Yuan et al. [36] proposed a data-driven method to automatically generate a synthetic IV for counterfactual prediction, but the method does not consider the confounding bias between the IV and the outcome, so the unconfounded instrument condition may be violated in many cases. It is therefore desirable to develop an algorithm that learns a valid IV satisfying the unconfounded instrument condition for causal effect estimation, especially conditional average causal effect estimation, from data with latent confounders.

To provide a practical solution for conditional average causal effect estimation, in this work we focus on the conditional IV (CIV), which can be viewed as an IV with relaxed conditions; a CIV requires a conditioning set that instrumentalises it so that it functions as an IV (see Definition 1 for details). We propose to leverage disentangled representation learning to learn from data the representations of a CIV and its conditioning set.

Specifically, as shown in Fig. 1(c), we assume that the observed covariates can be disentangled into three representations, \(\textbf{S}\), \(\textbf{C}\) and \(\textbf{F}\). Here, \(\textbf{S}\) affects both the treatment W and \(\textbf{C}\); \(\textbf{C}\) represents the confounding factors affecting both W and the outcome Y; and \(\textbf{F}\) represents the risk factors affecting both \(\textbf{C}\) and Y. We then establish a theorem showing that \(\textbf{S}\) is a valid CIV instrumentalised by \(\{\textbf{C, F}\}\), i.e. \(\{\textbf{C, F}\}\) is the conditioning set of \(\textbf{S}\). Supported by this theorem, we design and develop a novel disentangled representation learning algorithm, the DVAE.CIV model, based on the Variational AutoEncoder (VAE) [22]. This model learns the representations of the CIV \(\textbf{S}\) and its conditioning set \(\{\textbf{C, F}\}\), enabling us to use \(\textbf{S}\) as a valid IV conditioning on \(\{\textbf{C, F}\}\) for estimating the conditional average causal effect of W on Y from data with latent confounders. The main contributions of the paper are summarised as follows.

  • We address a challenging problem in conditional average causal effect estimations from data with latent confounders by utilising the CIV approach and VAE models.

  • We propose a novel disentanglement learning model based on the conditional VAE model to learn and disentangle the representations of covariates into the representations of a CIV \(\textbf{S}\) and its conditioning set \(\{\textbf{C, F}\}\) for conditional average causal effect estimations from data with latent confounders.

  • We conduct extensive experiments on synthetic and real-world datasets to show the performance of the DVAE.CIV model, w.r.t. causal effect estimations from data with latent confounders.

2 Preliminaries

In this paper, uppercase and lowercase letters are utilised to represent variables and their values, respectively. Bold-faced uppercase and lowercase letters indicate a set of variables and a value assignment of the set, respectively.

A DAG (directed acyclic graph) is a graph that contains only directed edges (i.e. \(\rightarrow \)) and no cycles. A DAG is a causal DAG when each directed edge \(X_i \rightarrow X_j\) represents that \(X_i\) is a cause of \(X_j\) (and \(X_j\) an effect of \(X_i\)). In this work, we assume a causal DAG \(\mathcal {G}=(\textbf{V, E})\) representing the underlying system, where \(\textbf{V}=\textbf{X}\cup \textbf{U}\cup \{W, Y\}\) and \(\textbf{E}\subseteq \textbf{V} \times \textbf{V}\) denotes the set of directed edges. In \(\textbf{V}\), we assume that \(\textbf{X}\) is the set of pretreatment variables, \(\textbf{U}\) is the set of latent confounders, W is a binary treatment variable (\(w=1\) and \(w=0\) denote a treated and a control sample, respectively), and Y is the outcome of interest. Following the potential outcome model [19, 31], we have the potential outcomes \(Y(w=1)\) and \(Y(w=0)\) relative to the treatment W. Note that only one of the two potential outcomes can be measured for a given individual \(x_i\). Conceptually, the individual causal effect (ICE) at \(x_i\) is defined as \(ICE_i = Y_i(w=1) -Y_i(w=0)\). The average causal effect of W on Y is defined as ACE\((W, Y)= \mathbb {E}[Y_i(w=1) -Y_i(w=0)]\), where \(\mathbb {E}\) is the expectation operator.

The conditional average causal effect (CACE) of W on Y, denoted CACE(W, Y), is defined through \(P(Y| do(w), \textbf{X})\), where \(do(\cdot )\) is the do-operator, indicating an intervention that sets the value of the treatment W [30]. Conceptually, CACE(W, Y) can be written as:

$$\begin{aligned} \textrm{CACE}(W, Y) = \mathbb {E}[Y_i (w=1) - Y_i(w=0)\mid \textbf{x}_i = \textbf{x}] \end{aligned}$$
(1)

In this work, we aim to estimate CACE(W, Y) from data in which there exists at least one latent confounder U affecting both W and Y. When an IV S and a set of conditioning covariates \(\textbf{Z}\) are available in the data, CACE(W, Y) can be calculated by the following formula, as in [3, 19]:

$$\begin{aligned} \textrm{CACE}(W, Y)= \frac{\mathbb {E}(Y|S = 1, \textbf{Z}) - \mathbb {E}(Y|S = 0, \textbf{Z})}{\mathbb {E}(W|S = 1, \textbf{Z}) - \mathbb {E}(W|S = 0, \textbf{Z})} \end{aligned}$$
(2)
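To make Eq. (2) concrete, the following is a minimal sketch (not the authors' code) of a plug-in conditional Wald estimator: the two conditional expectations are fitted with off-the-shelf regression models on a toy data-generating process whose true effect is 2. All names and model choices are illustrative assumptions.

```python
# Sketch: plug-in conditional Wald estimator for Eq. (2) on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
Z = rng.normal(size=(n, 2))                     # measured conditioning covariates
s = rng.binomial(1, 0.5, size=n)                # binary instrument S
u = rng.normal(size=n)                          # latent confounder (never observed)
w = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * s + Z[:, 0] + u))))  # treatment
y = 2.0 * w + Z[:, 1] + u + rng.normal(size=n)  # outcome; true effect is 2

# Fit E[Y | S, Z] and E[W | S, Z]; the latent U is deliberately excluded.
f_y = LinearRegression().fit(np.column_stack([s, Z]), y)
f_w = LogisticRegression().fit(np.column_stack([s, Z]), w)

def cace_at(z):
    """Conditional Wald ratio of Eq. (2) at a covariate value z."""
    x1 = np.concatenate([[1.0], z])[None]       # feature row with S = 1
    x0 = np.concatenate([[0.0], z])[None]       # feature row with S = 0
    num = f_y.predict(x1)[0] - f_y.predict(x0)[0]
    den = f_w.predict_proba(x1)[0, 1] - f_w.predict_proba(x0)[0, 1]
    return num / den

print(cace_at(np.zeros(2)))   # roughly 2; the simple models are approximations
```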

The CIV approach allows a measured covariate to act as a valid IV when conditioning on a set of measured variables. The formal definition of a CIV in a DAG (Definition 7.4.1 on page 248 of [30]) is as follows.

Definition 1 (Conditional IV)

Let \(\mathcal {G}=(\textbf{V, E})\) be a DAG with \(\textbf{V}=\textbf{X}\cup \textbf{U}\cup \{W, Y\}\). A variable \(Q\in \textbf{X}\) is a conditional IV w.r.t. \(W\rightarrow Y\) if there exists a set of measured variables \(\textbf{Z}\subseteq \textbf{X}\) such that (i) \(Q \not\!\perp \!\!\! \perp _{d} W \mid \textbf{Z}\), (ii) \(Q \perp \!\!\! \perp _{d} Y \mid \textbf{Z}\) in \(\mathcal {G}_{\underline{W}}\), and (iii) \(\forall Z\in \textbf{Z}\), Z is not a descendant of Y.

Here, \(\perp \!\!\! \perp _{d}\) and \(\not\!\perp \!\!\! \perp _{d}\) denote d-separation and d-connection, respectively, which are used to read conditional independence relationships between nodes in a DAG [30]. The manipulated DAG \(\mathcal {G}_{\underline{W}}\) in Definition 1 is obtained by deleting the directed edge \(W\rightarrow Y\) from the DAG \(\mathcal {G}\). Note that Definition 1 is stated for a single CIV Q but generalises easily to a set of CIVs \(\textbf{Q}\).

Under the pretreatment variables assumption, \(\textbf{X}\) contains no descendant of Y, so condition (iii) of Definition 1 always holds. This means that only the first two conditions need to be checked to verify whether a variable is a CIV. Note that discovering a conditioning set \(\textbf{Z}\) from a given DAG is NP-complete [37], and it remains NP-complete under the pretreatment assumption. Instead of discovering a conditioning set from a given causal DAG, in this work we utilise disentangled representation learning to learn the representations of CIVs and of the conditioning set directly from data with latent confounders.

3 The Proposed DVAE.CIV Model

3.1 The Disentangled Representation Learning Scheme for Causal Effect Estimation

In this work, we aim to estimate CACE(W, Y) from observational data with latent confounders. Note that the causal effect of W on Y is non-identifiable when there exists a latent confounder \(U\in \textbf{U}\) affecting both W and Y, i.e. \(W\leftarrow U \rightarrow Y\) in the underlying DAG [6, 30]. It is challenging to recover CACE(W, Y) from data with latent confounders because the effect of U cannot be computed. If a nominated CIV and its corresponding conditioning set are available, CACE(W, Y) can be obtained unbiasedly from data by using an IV-based estimator. However, a CIV and its conditioning set are usually unknown in many real-world applications. Furthermore, if an invalid CIV is used, a wrong result or conclusion may be drawn [9, 27].

To estimate conditional average causal effects and average causal effects from data with latent confounders, we propose and design the DVAE.CIV model to learn three representations \(\{\textbf{S, C, F}\}\) following the scheme of Fig. 1(c). Here, \(\textbf{S}\) is the representation of CIVs that affect W but not Y directly, \(\textbf{F}\) is the representation of the risk factors that affect Y but not W, and \(\textbf{C}\) is the confounding representation that affects both W and Y.

Our proposed DVAE.CIV model relies on VAEs: we assume that the measured covariates factorise conditioning on the latent variables, and we use an inference model [22] that follows a factorisation of the true posterior [15, 26]. Based on our disentanglement setting in Fig. 1(c), we have the following theoretical result for causal effect estimation from data with latent confounders.

Theorem 1

Let \(\mathcal {G}=(\textbf{X}\cup \textbf{U}\cup \{W, Y\}, \textbf{E})\) be a causal DAG, in which \(\textbf{X}\) is a set of pretreatment variables, \(\textbf{U}\) is a set of latent confounders, W and Y are the treatment and outcome variables, respectively, and \(W\rightarrow Y\) is in \(\textbf{E}\). If the three representations can be learned as per the scheme in Fig. 1(c), then CACE(W, Y) can be calculated by using an IV-based method.

Proof

The directed edge \(W\rightarrow Y\) in \(\mathcal {G}\) ensures that W has a causal effect on Y. In the causal DAG of Fig. 1(c), we first show that the set \(\textbf{C}\cup \textbf{F}\) instrumentalises \(\textbf{S}\) to be a valid CIV. \(\textbf{S}\) is a common cause of W and \(\textbf{C}\), so \(\textbf{S} \not\!\perp \!\!\! \perp _{d} W \mid \textbf{C}\cup \textbf{F}\), i.e. the first condition of Definition 1 holds. In the causal DAG \(\mathcal {G}\) of Fig. 1(c), \(\textbf{C}\) is a collider on the path \(W\leftarrow \textbf{S}\rightarrow \textbf{C}\leftarrow \textbf{F}\rightarrow Y\) and a common cause of W and Y. That is, conditioning on \(\textbf{C}\) opens this path, but conditioning on \(\textbf{F}\) is sufficient to block it again. The path \(\textbf{S}\rightarrow \textbf{C}\rightarrow Y\) is blocked by \(\textbf{C}\). Furthermore, in the manipulated DAG \(\mathcal {G}_{\underline{W}}\), W is a collider, so the empty set blocks the three paths between \(\textbf{S}\) and Y through W, i.e. \(\textbf{S}\rightarrow W\leftarrow U\rightarrow Y\), \(\textbf{S}\rightarrow W\leftarrow \textbf{C}\leftarrow \textbf{F}\rightarrow Y\) and \(\textbf{S}\rightarrow W\leftarrow \textbf{C}\rightarrow Y\). Hence, the set \(\textbf{C}\cup \textbf{F}\) blocks all paths between \(\textbf{S}\) and Y in \(\mathcal {G}_{\underline{W}}\), i.e. the second condition of Definition 1 holds. Finally, \(\textbf{C}\cup \textbf{F}\) does not contain a descendant of Y, by the pretreatment variables assumption. Thus, the set \(\textbf{C}\cup \textbf{F}\) instrumentalises \(\textbf{S}\). As in Eq. (2), an IV-based estimator, such as DeepIV [14], can then be applied to remove the effect of \(\textbf{U}\) by taking as input the CIV representation \(\textbf{S}\) and the representation of its conditioning set \(\textbf{C}\cup \textbf{F}\). Therefore, \(\textrm{CACE}(W, Y)\) can be obtained by using the CIV \(\textbf{S}\) and its conditioning set \(\textbf{C}\cup \textbf{F}\) in an IV-based estimator.

Theorem 1 ensures that a family of data-driven methods can be applied for causal effect estimation from data with latent confounders.
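The d-separation claims in this proof can also be checked mechanically. Below is a small sketch (not part of the paper) that encodes the DAG of Fig. 1(c) with networkx and verifies conditions (i) and (ii) of Definition 1 for \(\textbf{S}\) with the conditioning set \(\{\textbf{C, F}\}\); networkx \(\ge \) 3.3 provides nx.is_d_separator (earlier releases call the same check nx.d_separated).

```python
# Sketch: mechanically verify the d-separation claims in the proof of
# Theorem 1 on the DAG of Fig. 1(c), using networkx (>= 3.3 assumed for
# `is_d_separator`; older versions name the same check `d_separated`).
import networkx as nx

G = nx.DiGraph([
    ("S", "W"), ("S", "C"),   # S causes W and C
    ("C", "W"), ("C", "Y"),   # C confounds W and Y
    ("F", "C"), ("F", "Y"),   # F causes C and Y
    ("U", "W"), ("U", "Y"),   # latent confounder U
    ("W", "Y"),               # the treatment effect of interest
])

# Condition (i) of Definition 1: S and W are d-connected given {C, F} in G.
assert not nx.is_d_separator(G, {"S"}, {"W"}, {"C", "F"})

# Condition (ii): S and Y are d-separated given {C, F} in the manipulated
# DAG G_{\underline{W}}, i.e. G with the edge W -> Y removed.
G_w = G.copy()
G_w.remove_edge("W", "Y")
assert nx.is_d_separator(G_w, {"S"}, {"Y"}, {"C", "F"})

print("{C, F} instrumentalises S, as claimed.")
```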

3.2 Learning the Three Representations

Based on Theorem 1, we know that the set \(\{\textbf{C}, \textbf{F}\}\) instrumentalises \(\textbf{S}\). In this section, we present our proposed DVAE.CIV model for obtaining the three latent representations from data by using the VAE technique [22]; the architecture of DVAE.CIV is presented in Fig. 2. As shown in Fig. 2, the DVAE.CIV model learns and disentangles the latent representation \(\mathbf {\Phi }\) of \(\textbf{X}\) into two disjoint sets \(\{\textbf{S, F}\}\) by using a disentangled variational autoencoder [15, 38], and generates the representation \(\textbf{C}\) conditioning on \(\textbf{X}\) by jointly employing a Conditional Variational AutoEncoder (CVAE) network [32].

Fig. 2.

The architecture of the DVAE.CIV model. A yellow box indicates the drawing of samples from the respective distribution, a grey box indicates a parameterised deterministic neural network transition, and a circle represents switching paths based on the value of W. (Color figure online)

The DVAE.CIV model is designed to learn the three representations shown in Fig. 1(c) by utilising an inference model and a generative model to approximate the posterior distribution \(p(\textbf{X}|\textbf{S},\textbf{C},\textbf{F})\). The inference model comprises three independent encoders \(q(\textbf{S}|\textbf{X})\), \(q(\textbf{C}|\textbf{X})\) and \(q(\textbf{F}|\textbf{X})\), which serve as variational posteriors over the three latent representations. The generative model uses the three latent representations with a decoder \(p(\textbf{X}|\textbf{S},\textbf{C}, \textbf{F})\) to reconstruct the measured covariates \(\textbf{X}\).

Following the standard VAE model [22], the prior distributions \(p(\textbf{S})\) and \(p(\textbf{F})\) are specified as factorised standard Gaussians:

$$\begin{aligned} \begin{aligned} p(\textbf{S}) = \prod _{i=1}^{D_{\textbf{S}}} \mathcal {N}(S_{i} | 0, 1);~ p(\textbf{F}) = \prod _{i=1}^{D_{\textbf{F}}} \mathcal {N}(F_{i} | 0, 1). \end{aligned} \end{aligned}$$
(3)

where \(D_{\textbf{S}}\) and \(D_{\textbf{F}}\) are the dimensions of \(\textbf{S}\) and \(\textbf{F}\), respectively. In the inference model, the variational approximations of the posteriors are described as:

$$\begin{aligned} \begin{aligned}&q(\textbf{S}|\textbf{X}) = \prod _{i=1}^{D_{\textbf{S}}} \mathcal {N}(\mu = \hat{\mu }_{\textbf{S}_{i}}, \sigma ^2 = \hat{\sigma }^2_{\textbf{S}_{i}});~ q(\textbf{C}|\textbf{X}) = \prod _{i=1}^{D_{\textbf{C}}} \mathcal {N}(\mu = \hat{\mu }_{\textbf{C}_i}, \sigma ^2 = \hat{\sigma }^2_{\textbf{C}_i}); \\&q(\textbf{F}|\textbf{X}) = \prod _{i=1}^{D_{\textbf{F}}} \mathcal {N}(\mu = \hat{\mu }_{\textbf{F}_i}, \sigma ^2 = \hat{\sigma }^2_{\textbf{F}_i}), \end{aligned} \end{aligned}$$
(4)

where \(D_{\textbf{C}}\) is the dimension of \(\textbf{C}\), and \(\hat{\mu }_{\textbf{S}}, \hat{\mu }_{\textbf{C}}, \hat{\mu }_{\textbf{F}}\) and \(\hat{\sigma }^2_{\textbf{S}}, \hat{\sigma }^2_{\textbf{C}}, \hat{\sigma }^2_{\textbf{F}}\) are the means and variances of the Gaussian distributions, parameterised by neural networks.
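As an illustration of Eqs. (3)-(4), the following PyTorch sketch implements the three factorised Gaussian encoders with the reparameterisation trick. It reflects our reading of the model rather than the released implementation: the hidden-layer width and activation are assumptions, while the latent dimensions follow the experimental setup (\(|S|=1\), \(|C|=|F|=5\)).

```python
# Sketch of the inference model of Eq. (4): three independent Gaussian
# encoders q(S|X), q(C|X), q(F|X). Hidden sizes/activations are assumptions.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, d_in, d_latent, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ELU())
        self.mu = nn.Linear(d_hidden, d_latent)      # \hat{mu} in Eq. (4)
        self.logvar = nn.Linear(d_hidden, d_latent)  # log \hat{sigma}^2 in Eq. (4)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation: differentiable sample from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

d_x = 10                           # number of measured covariates (example)
enc_s = GaussianEncoder(d_x, 1)    # q(S|X), |S| = 1
enc_c = GaussianEncoder(d_x, 5)    # q(C|X), |C| = 5
enc_f = GaussianEncoder(d_x, 5)    # q(F|X), |F| = 5

x = torch.randn(32, d_x)           # a dummy mini-batch
s, mu_s, lv_s = enc_s(x)
c, mu_c, lv_c = enc_c(x)
f, mu_f, lv_f = enc_f(x)
print(s.shape, c.shape, f.shape)   # (32, 1), (32, 5), (32, 5)
```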

In the generative model, we utilise Monte Carlo (MC) sampling to draw the representation \(\textbf{C}\) based on the Conditional Variational AutoEncoder (CVAE) network [32], such that the latent representation \(\textbf{C}\) is generated conditionally on \(\textbf{X}\):

$$\begin{aligned} p(\textbf{C}) \sim p(\textbf{C} |\textbf{X}). \end{aligned}$$
(5)

Furthermore, the generative models for W and \(\textbf{X}\) with the three latent representations are formalised as:

$$\begin{aligned} p(W|\textbf{S},\textbf{C}) = Bern(\sigma (\psi _1(\textbf{S}, \textbf{C})));~ p(\textbf{X}|\textbf{S}, \textbf{C}, \textbf{F}) = \prod _{i=1}^{D_{\textbf{X}}} p(X_i|\textbf{S}, \textbf{C}, \textbf{F}), \end{aligned}$$
(6)

where \(\psi _1(\cdot )\) is a function parameterised by a neural network, \(\sigma (\cdot )\) is the logistic function, and \(Bern(\cdot )\) denotes the Bernoulli distribution.

In our generative model, the model for the outcome Y depends on the data type of Y. For a continuous-valued outcome Y, we use a Gaussian distribution whose mean and variance are parameterised by independent neural networks for \(p(Y | w = 0, \textbf{C}, \textbf{F})\) and \(p(Y | w = 1, \textbf{C}, \textbf{F})\). Thus, the continuous Y is modelled by:

$$\begin{aligned} \begin{aligned}&p(Y | W, \textbf{C}, \textbf{F}) = \mathcal {N}(\mu = \hat{\mu }_{Y}, \sigma ^2 = \hat{\sigma }^2_{Y}),\\&\hat{\mu }_{Y} = W \cdot \psi _2(\textbf{C}, \textbf{F}) + (1-W) \cdot \psi _3(\textbf{C},\textbf{F});\\&\hat{\sigma }^2_{Y} = W \cdot \psi _4(\textbf{C}, \textbf{F}) + (1-W) \cdot \psi _5(\textbf{C}, \textbf{F}), \end{aligned} \end{aligned}$$
(7)

where \(\psi _2(\cdot ), \psi _3(\cdot ), \psi _4(\cdot )\) and \(\psi _5(\cdot )\) are neural networks parameterised by their own parameters.
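The treatment-switched Gaussian outcome model of Eq. (7) can be sketched in PyTorch as follows, with \(\psi _2\)-\(\psi _5\) as independent networks, as in the text. Predicting the log-variance rather than the variance is our choice for numerical stability, and the layer sizes are assumptions.

```python
# Sketch of Eq. (7): treatment-switched Gaussian model for a continuous Y.
import torch
import torch.nn as nn

def make_head(d_in, d_hidden=64):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ELU(),
                         nn.Linear(d_hidden, 1))

class ContinuousOutcomeModel(nn.Module):
    def __init__(self, d_cf):
        super().__init__()
        self.psi2, self.psi3 = make_head(d_cf), make_head(d_cf)  # means: w=1 / w=0
        self.psi4, self.psi5 = make_head(d_cf), make_head(d_cf)  # log-variances

    def forward(self, w, c, f):
        cf = torch.cat([c, f], dim=-1)
        mu = w * self.psi2(cf) + (1 - w) * self.psi3(cf)         # \hat{mu}_Y
        logvar = w * self.psi4(cf) + (1 - w) * self.psi5(cf)     # log \hat{sigma}^2_Y
        return mu, logvar          # parameters of p(Y | W, C, F)

outcome = ContinuousOutcomeModel(d_cf=10)      # |C| + |F| = 5 + 5
w = torch.randint(0, 2, (32, 1)).float()
mu, logvar = outcome(w, torch.randn(32, 5), torch.randn(32, 5))
```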

For a binary-valued outcome Y, a Bernoulli distribution parameterised by a neural network is employed:

$$\begin{aligned} \begin{aligned}&p(Y|W,\textbf{C}, \textbf{F}) = Bern(\sigma (\psi _6(W, \textbf{C}, \textbf{F}))), \end{aligned} \end{aligned}$$
(8)

where \(\psi _6(\cdot )\) is a neural network of the same form as \(\psi _1\). The parameters of these neural networks are learned by maximising the evidence lower bound (ELBO) \(\mathcal {L}_{ELBO}\):

$$\begin{aligned} \begin{aligned} \mathcal {L}_{ELBO} (\textbf{X}, W, Y)=~&\mathbb {E}_{q}[\log p(\textbf{X}|\textbf{S}, \textbf{C}, \textbf{F})] - D_{KL}[q(\textbf{S}|\textbf{X})||p(\textbf{S})] \\ {}&- D_{KL}[q(\textbf{C}|\textbf{X})||p(\textbf{C}|\textbf{X})] - D_{KL}[q(\textbf{F}|\textbf{X})||p(\textbf{F})], \end{aligned} \end{aligned}$$
(9)

where the conditional distribution \(p(\textbf{C}|\textbf{X})\) ensures that the latent representation \(\textbf{C}\) captures as much information about \(\textbf{X}\) as possible.

To ensure that the treatment W can be recovered from the latent representations \(\textbf{S}\) and \(\textbf{C}\), and that the outcome Y can be recovered from \(\textbf{C}\) and \(\textbf{F}\), two auxiliary predictors are added, and the objective function of DVAE.CIV is formalised as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{DVAE.CIV} =~&-\mathcal {L}_{ELBO} (\textbf{X}, W, Y) + \alpha \mathbb {E}_{q}[\log q(W|\textbf{S}, \textbf{C})] \\ {}&+ \beta \mathbb {E}_{q}[\log q(Y|W, \textbf{C}, \textbf{F})], \end{aligned} \end{aligned}$$
(10)

where \(\alpha \) and \(\beta \) are the weights for the auxiliary predictors.
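For concreteness, the following self-contained sketch assembles the objective of Eqs. (9)-(10) from encoder and decoder outputs, using closed-form KL divergences between factorised Gaussians. The sign convention, with the auxiliary log-likelihood terms implemented as minimised cross-entropy and squared-error losses, is our reading of the text, not the released code.

```python
# Sketch of the DVAE.CIV objective (Eqs. (9)-(10)) as pure tensor functions.
import torch
import torch.nn.functional as F

def kl_std_normal(mu, logvar):
    """KL[ N(mu, sigma^2) || N(0, I) ], summed over latent dimensions."""
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1)

def kl_gaussians(mu_q, lv_q, mu_p, lv_p):
    """KL[ q || p ] for factorised Gaussians, e.g. q(C|X) vs p(C|X)."""
    return 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p).pow(2)) / lv_p.exp()
                  - 1.0).sum(dim=-1)

def dvae_civ_loss(recon_loglik,              # log p(X|S,C,F), shape (batch,)
                  mu_s, lv_s, mu_f, lv_f,    # q(S|X) and q(F|X) parameters
                  mu_c, lv_c, mu_pc, lv_pc,  # q(C|X) vs conditional p(C|X)
                  w_logits, w, y_pred, y, alpha=1.0, beta=1.0):
    # Eq. (9): ELBO = reconstruction term minus the three KL terms.
    elbo = (recon_loglik
            - kl_std_normal(mu_s, lv_s)                # D_KL[q(S|X) || p(S)]
            - kl_gaussians(mu_c, lv_c, mu_pc, lv_pc)   # D_KL[q(C|X) || p(C|X)]
            - kl_std_normal(mu_f, lv_f))               # D_KL[q(F|X) || p(F)]
    # Eq. (10): negative ELBO plus the two weighted auxiliary predictors.
    aux_w = F.binary_cross_entropy_with_logits(w_logits, w, reduction="none").sum(-1)
    aux_y = (y_pred - y).pow(2).sum(-1)                # continuous outcome case
    return (-elbo + alpha * aux_w + beta * aux_y).mean()
```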

After training the DVAE.CIV model, we obtain the CIV representation \(\textbf{S}\) and the conditioning set representation \(\{\textbf{C, F}\}\), justified by Theorem 1. For estimating conditional causal effects, we employ an IV-based estimator, DeepIV [14], i.e. we feed \(\textbf{S}\) and \(\{\textbf{C, F}\}\) into the DeepIV method for conditional causal effect estimation.
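This plug-in step can be sketched as follows. The paper feeds \(\textbf{S}\) and \(\{\textbf{C, F}\}\) into DeepIV; purely for illustration, the snippet below uses econml's OrthoIV (one of the compared IV estimators) because of its compact interface. Argument names may vary across econml versions, and all arrays are hypothetical placeholders for the outputs of a trained DVAE.CIV model.

```python
# Sketch of the downstream IV estimation with learned representations.
# OrthoIV stands in for DeepIV here purely to keep the example short.
import numpy as np
from econml.iv.dml import OrthoIV

rng = np.random.default_rng(0)
n = 1000
S = rng.normal(size=(n, 1))              # learned CIV representation (placeholder)
CF = rng.normal(size=(n, 10))            # learned conditioning set [C, F] (placeholder)
W = rng.binomial(1, 1 / (1 + np.exp(-2.0 * S[:, 0])))  # treatment driven by S
Y = 2.0 * W + CF[:, 0] + rng.normal(size=n)            # outcome (placeholder)

est = OrthoIV(discrete_treatment=True, discrete_instrument=False)
est.fit(Y, W, Z=S, X=CF)                 # instrument Z = S, effect modifiers X = [C, F]
cace = est.effect(CF)                    # conditional average causal effects
print(cace[:5])
```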

4 Experiments

In this section, we evaluate the performance of the proposed DVAE.CIV model by applying it to a set of synthetic datasets and three real-world datasets for CACE(W, Y) and average causal effect (ACE) estimation. The three real-world datasets are SchoolingReturns [7], Cattaneo [8] and RHC [11], which are commonly used for evaluating methods of causal effect estimation from observational data. Details of the implementation of DVAE.CIV and the appendix are provided in our GitHub repository.

4.1 Experimental Setup

We compare DVAE.CIV against well-established estimators for conditional causal effect estimation that are widely used in causal inference from observational data. Note that the ACE can be obtained by averaging CACE(W, Y) over all individuals. The compared causal effect estimators are introduced below.

Compared Causal Effect Estimators. We compare our proposed DVAE.CIV with two Variational AutoEncoder based (VAE-based) causal effect estimators, three tree-based causal effect estimators, two machine learning based (ML-based) causal effect estimators, and three IV-based causal effect estimators. The two VAE-based estimators are the Causal Effect Variational AutoEncoder (CEVAE) [26] and Treatment Effect estimation by Disentangled Variational AutoEncoder (TEDVAE) [38]. The three tree-based estimators are standard Bayesian Additive Regression Trees (BART) [17], causal random forest (CF) [35] and causal random forest for IV regression (CFIVR) [4]; note that CFIVR also belongs to the IV-based estimators. The two ML-based estimators are double machine learning (DML) [10] and doubly robust learning (DRL) [12]. The three IV-based estimators are DeepIV [14], orthogonal instrumental variable (OrthIV) [33] and double machine learning based IV (DMLIV) [10].

Remarks. The five estimators TEDVAE, BART, CF, DML and DRL rely on the unconfoundedness assumption [19] (i.e. no latent confounders in the data), so they cannot handle data with latent confounders. CEVAE can deal with latent confounders, but it requires that all measured variables be proxy variables of the latent confounders, a restriction our DVAE.CIV model does not have. The IV-based estimators CFIVR, DeepIV, OrthIV and DMLIV require a known IV nominated based on domain knowledge, but a nominated IV is often not a valid IV and may thus lead to a wrong conclusion, as argued in the Introduction.

Implementation Details. We use Python and the libraries pytorch [29], pyro [5] and econml to implement DVAE.CIV. In our experiments, the dimensions of the latent representations are set as \(\left| S \right| =1\), \(\left| C \right| =5\) and \(\left| F \right| =5\), respectively. The implementation of CEVAE is based on the Python library pyro [5], and the code of TEDVAE is from the authors' GitHub repository. For BART, we use the implementation in the R package bartCause [17]. For CF and CFIVR, we use the R functions causal\(\_\)forest and instrumental\(\_\)forest in the R package grf [4], respectively. The implementations of DML, DRL, DeepIV, OrthIV and DMLIV are from the Python package econml.

Evaluation Metrics. Two commonly used metrics are employed in our experiments. For the synthetic datasets, we use the absolute error of the average causal effect [17], \(\varepsilon _{ACE} = |ACE-\hat{ACE}|\), where ACE is the true causal effect and \(\hat{ACE}\) is the estimated causal effect, and the Precision in the Estimation of Heterogeneous Effects (PEHE) [17, 26], which evaluates CACE estimation: \(\sqrt{\varepsilon _{PEHE}} = \sqrt{\mathbb {E}(((y_1 - y_0)-(\hat{y}_1 - \hat{y}_0))^{2})}\), where \(y_1, y_0\) are the true outcomes and \(\hat{y}_1, \hat{y}_0\) are the predicted outcomes. Lower values of both metrics indicate better performance. For multiple replications, we report the mean and standard deviation. For the three real-world datasets, since no ground truth causal effects are available, we use the reference causal effects reported in the literature as baselines to evaluate all estimators.
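Both metrics translate directly into code; the following is a short transcription with hypothetical array names.

```python
# The two evaluation metrics, transcribed directly from their definitions.
import numpy as np

def eps_ace(ace_true, ace_hat):
    """Absolute error of the estimated average causal effect."""
    return abs(ace_true - ace_hat)

def sqrt_pehe(y1, y0, y1_hat, y0_hat):
    """Root of the Precision in Estimation of Heterogeneous Effects (PEHE)."""
    return np.sqrt(np.mean(((y1 - y0) - (y1_hat - y0_hat)) ** 2))
```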

4.2 Simulation Study

It is challenging to evaluate a causal effect estimation method with real-world data since no ground truth is available. In this section, we design simulation studies to evaluate the performance of our proposed DVAE.CIV method in the case where a latent confounder U affects both W and Y, and a CIV and its conditioning set exist in the synthetic datasets.

We use a causal DAG \(\mathcal {G}\), provided in the appendix, to generate synthetic datasets with a range of sample sizes: 2k, 6k, 10k and 20k. In \(\mathcal {G}\), \(\textbf{X}=\{S, X_1, X_2, X_3, X_4, X_5\}\) is the set of measured covariates and \(\textbf{U}=\{U, U_1, U_2, U_3, U_4\}\) is the set of latent confounders, in which U affects both W and Y. Note that S is a CIV conditioning on the set \(\{X_1, X_2\}\) in all synthetic datasets. Moreover, the data generation process retains the true individual causal effects, as sketched below; the full details are provided in the appendix. In our experiments, the IV-based estimators OrthIV, DMLIV, DeepIV and CFIVR take the true CIV S and the conditioning set \(\{X_1, X_2\}\) as input for causal effect estimation.
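The exact functional forms of the generating process are in the appendix; the snippet below is only an illustrative generator consistent with the description above: S is a CIV whose conditioning set is \(\{X_1, X_2\}\), U is a latent confounder of W and Y, and both potential outcomes are retained so that the true individual causal effects are known.

```python
# Illustrative synthetic-data generator (NOT the paper's exact process, which
# is specified in the appendix): S is a CIV conditioning on {X1, X2}, and the
# latent U confounds both W and Y.
import numpy as np

def make_dataset(n, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2, x3, x4, x5 = rng.normal(size=(5, n))
    u = rng.normal(size=n)                    # latent confounder (never released)
    s = x1 + x2 + rng.normal(size=n)          # CIV generated from {X1, X2}
    w = rng.binomial(1, 1 / (1 + np.exp(-(s + x1 + u))))   # treatment
    ice = 2.0 + 0.5 * x2                      # heterogeneous true effect
    y0 = x2 + x3 + u + rng.normal(size=n)     # potential outcome under w = 0
    y1 = y0 + ice                             # potential outcome under w = 1
    y = np.where(w == 1, y1, y0)              # observed (factual) outcome
    X = np.column_stack([s, x1, x2, x3, x4, x5])  # measured covariates only
    return X, w, y, y0, y1

X, w, y, y0, y1 = make_dataset(2000)
```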

Table 1. The out-of-sample absolute error \(\varepsilon _{ACE}\) (mean ± std) over 30 synthetic datasets. The best results are highlighted in boldface and the runner-up results are underlined. DVAE.CIV is the runner-up on all synthetic datasets, and it relies on the least domain knowledge among all compared estimators since it learns and disentangles the representations of the CIV and its conditioning set directly from data.

To provide a reliable assessment, we generate 30 synthetic datasets for each sample size and use the aforementioned metrics to evaluate DVAE.CIV against the compared estimators on ACE and CACE estimation from data with latent confounders. For each dataset, we randomly take 70% of the samples for training and 30% for testing. The results of all estimators for ACE and CACE estimation, measured by \(\varepsilon _{ACE}\) and \(\sqrt{\varepsilon _{PEHE}}\) on the out-of-sample set, are provided in Tables 1 and 2, respectively. The out-of-sample set consists of the testing samples, and the within-sample set of the training samples. The within-sample results are provided in the appendix.

Results. From the experimental results in Table 1, we make the following observations: (1) the ML-based and VAE-based estimators DML, DRL, CEVAE and TEDVAE have the largest \(\varepsilon _{ACE}\), because the confounding bias caused by the measured confounders and the latent confounder U is not adjusted for at all; (2) the tree-based estimators BART and CF have the second largest \(\varepsilon _{ACE}\), as they fail to deal with the confounding bias caused by the latent confounder U; (3) the IV-based estimators, including DVAE.CIV, significantly outperform the other estimators, i.e. DML, DRL, BART, CF, CEVAE and TEDVAE; (4) DVAE.CIV is the second-best performer on all synthetic datasets, and its performance is comparable with CFIVR and DeepIV; (5) as the sample size increases, the standard deviation of most estimators, including DVAE.CIV, decreases significantly. It is worth mentioning that DVAE.CIV requires the least domain knowledge among all estimators, since it only relies on the assumption that there exist a CIV and a conditioning set (possibly empty). This is very important in practice, as in many real-world applications there is rarely sufficient prior knowledge for nominating a valid IV.

Table 2. The out-of-sample \(\sqrt{\varepsilon _{PEHE}}\) (mean ± std) over 30 synthetic datasets. The lowest \(\sqrt{\varepsilon _{PEHE}}\) values are highlighted in boldface and the runner-up results are underlined. DVAE.CIV is the runner-up on the first two groups of synthetic datasets and achieves the third smallest \(\sqrt{\varepsilon _{PEHE}}\) on the last four groups. It is worth mentioning that DVAE.CIV obtains the lowest standard deviation on all synthetic datasets.

From the results in Table 2, we conclude that: (1) the ML-based, tree-based and VAE-based estimators have the worst performance with respect to conditional causal effect estimation; (2) among the IV-based estimators, DeepIV achieves the best performance on the first two groups of synthetic datasets and the second-best on the others, while CFIVR obtains the best performance on the last four groups and the second-best on the first two; (3) DVAE.CIV obtains the second-best performance on all synthetic datasets; (4) DVAE.CIV has the smallest standard deviation on all datasets, and its standard deviation reduces significantly as the sample size increases. These results demonstrate that DVAE.CIV can learn and disentangle the representations of the CIV and its conditioning set for CACE estimation from data with latent confounders.

In conclusion, DVAE.CIV achieves competitive performance compared to state-of-the-art causal effect estimators while requiring the least prior knowledge in ACE and CACE estimations from observational data with latent confounders.

Table 3. Estimated ACEs of all methods on the three real-world datasets. We highlight the estimated causal effects that fall within the empirical interval on SchoolingReturns and Cattaneo. We use '-' to indicate that an IV-based estimator does not work on Cattaneo and RHC, since there is no nominated IV. Note that all estimators obtain a consistent result on RHC.

4.3 Experiments on Three Real-World Datasets

We selected three real-world datasets that have empirical causal effect values available and are commonly used in the literature to assess the performance of DVAE.CIV in ACE estimation. We did not conduct experiments on CACE estimation on these datasets since no ground truth or empirical estimates of CACEs are available for them. The three real-world datasets are SchoolingReturns [7], Cattaneo [8] and RHC [11]. These datasets are widely utilised in the evaluation of either IV estimators or data-driven causal effect estimators [13]. Note that SchoolingReturns has a nominated CIV, whereas the other two datasets do not have a nominated IV for causal effect estimation. Thus, we compare the DVAE.CIV model with all the aforementioned estimators on SchoolingReturns, and with the ML-based, tree-based and VAE-based estimators on the Cattaneo and RHC datasets.

SchoolingReturns. This dataset is from the national longitudinal survey of youth (NLSY), a well-known survey of young US employees aged 24 to 34 [7]. The dataset has 3,010 samples and 19 variables [7]. The education of employees is the treatment variable, and the raw wage in 1976 (in cents per hour) is the outcome variable. The dataset was collected to study the causal effect of education on earnings. Note that geographical proximity to a college, i.e. the variable nearcollege, was nominated as an IV by Card [7]. The empirical estimate ACE\((W, Y) = 0.1329\), with 95% confidence interval (0.0484, 0.2175), is from [34] and is used as the reference value.

Cattaneo. This dataset contains the birth weights of 4,642 singleton births with 20 variables [8], collected in Pennsylvania, USA, for studying the average effect of maternal smoking during pregnancy (W) on a baby's birth weight (Y, in grams). The dataset contains several covariates: mother's age, mother's marital status, an indicator of the death of a previous infant, mother's race, mother's education, father's education, number of prenatal care visits, months since last birth, an indicator of firstborn infant, and an indicator of alcohol consumption during pregnancy. The authors of [8] found a strong negative effect of maternal smoking on birth weight: a baby is about 200 g to 250 g lighter when the mother smoked during pregnancy.

Right Heart Catheterization (RHC). RHC is a real-world dataset from an observational study of a diagnostic procedure for the management of critically ill patients [11]; it can be downloaded from the R package Hmisc. The dataset contains 2,707 samples with 72 covariates [11, 25] on adult patients who participated in the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT). The treatment variable W is whether a patient received an RHC within 24 hours of admission, and the outcome variable Y is whether the patient died at any time up to 180 days after admission. The empirical conclusion is that applying RHC leads to higher mortality within 180 days than not applying RHC [11].

Results. All results on the three real-world datasets are reported in Table 3. From Table 3, we make the following observations: (1) the causal effects estimated by DVAE.CIV and CF on SchoolingReturns and Cattaneo fall within the empirical intervals, while DML, DRL and BART produce estimates opposite to the empirical value on SchoolingReturns; (2) since there is no nominated IV on Cattaneo and RHC, the estimators OrthIV, DMLIV, DeepIV and CFIVR do not work on these two datasets; (3) all estimators, including DVAE.CIV, obtain a consistent estimate on the RHC data and reach the same conclusion as the empirical one [11]. These observations further confirm that DVAE.CIV is capable of removing the confounding bias between W and Y in real-world datasets.

In conclusion, our simulation studies show the strong performance of DVAE.CIV in ACE and CACE estimation from data with latent confounders, and our experiments on three real-world datasets further confirm its capability in ACE estimation from observational data.

Limitations. The performance of DVAE.CIV relies on the assumptions made in this work and on the assumptions of the VAE model. Note that the identifiability of the VAE model [21] is an important issue for our proposed DVAE.CIV model. When some of these assumptions or the VAE identifiability do not hold, DVAE.CIV may yield an inconsistent conclusion. In real-world applications, it is therefore advisable to conduct a sensitivity analysis [19, 30] together with DVAE.CIV to reach a reliable conclusion.

5 Conclusion

Dealing with the bias caused by latent confounders is a crucial challenge in conditional causal effect estimation from observational data. IV-based methods can remove such confounding bias effectively, but they rely on an IV/CIV nominated from domain knowledge. In this paper, we propose an efficient approach, DVAE.CIV, for conditional causal effect estimation from observational data with latent confounders. DVAE.CIV utilises the advantages of deep generative models to learn the representations of a CIV and its conditioning set from data with latent confounders. We theoretically show the soundness of the DVAE.CIV model, and extensive experiments demonstrate its effectiveness and potential. In simulation studies, DVAE.CIV achieves competitive performance against state-of-the-art estimators that require extra prior knowledge for ACE and CACE estimation from data with latent confounders. The experimental results on three real-world datasets show the superiority of the DVAE.CIV model over the existing estimators for ACE estimation.