1.1 Introduction

We consider a general resource-allocation problem, namely, to maximize a mean outcome given a cost constraint, through the choice of a treatment rule that is a function of an arbitrary fixed subset of an individual’s covariates. In pharmaceutical applications, we typically think of maximizing a clinical outcome given a monetary cost constraint, through the allocation of medication to patients, although our model is much more general. We focus on the setting where unmeasured confounding is a possibility, but a valid instrumental variable is available. Thus, our setup allows for consistent estimation of the optimal treatment rule and causal effects in a range of non-randomized studies, including post-market and other observational studies, as well as studies involving imperfect randomization due to non-adherence. The goal is twofold: (1) to find an optimal intervention d(V) for maximizing the mean counterfactual outcome, where V is an arbitrary fixed subset of baseline covariates W, and (2) to estimate the mean counterfactual outcome under this rule d(V). We make no restrictions on the type of data; however, the case of a continuous or categorical instrument or treatment variable is discussed in Toth (2016). To our knowledge, this work is the first to estimate the effect of an optimal individualized treatment regime, under a non-unit cost constraint, in the instrumental variables setting.

Utilizing instrumental variables. A classic solution for obtaining a consistent estimate of a causal effect under unmeasured confounding is to use an instrumental variable, assuming one exists. Informally, an instrumental variable, or instrument, is a variable Z that affects the outcome Y only through its effect on the treatment A, and the residual (error) term of the instrument is uncorrelated with the residual term of the outcome (Imbens and Angrist 1994; Angrist et al. 1996; Angrist and Krueger 1991). Thus, the instrument produces exogenous variation in the treatment. Instrumental variables have been used widely in biostatistics and pharmaceutics. (See Brookhart et al. 2010 for a large collection of references.) In these settings, the instrumental variable is usually some attribute that is related to the health care a patient receives, but is not at the level of individual patients. For example, Brookhart and Schneeweiss (2007) exploit variation in physician preference for prescribing NSAID medications to infer the effect of these medications on gastrointestinal bleeding.

In this work, we solve two versions of the optimal individualized treatment problem: (1) when the intervention is on the treatment variable A (Sect. 1.7), and (2) when the intervention is actually on the instrument Z (Sect. 1.6). For example, consider a study in which HIV-positive patients were encouraged to undergo antiretroviral therapy (ART) with a randomized (or quasi-randomized) encouragement design, but a number of factors caused non-adherence among some patients (Chesney 2006). The methods in this chapter allow one to infer what would be the optimal assignment of patients to ART treatment, based on patient characteristics, to achieve a desirable outcome (e.g., suppressed viral load, 5-year survival), given a limited budget. One parameter of interest is the mean outcome under optimal assignment of individuals to actually receive ART. This is the problem of finding an optimal treatment regime. However, in this setting of non-adherence, it might not be possible to intervene directly on the treatment variable. Thus, another parameter of interest is the mean outcome under the optimal intervention on the instrumental variable. We call this the problem of finding an optimal intent-to-treat regime, so named because the instrument is often a randomized assignment to treatment or encouragement mechanism. Under our randomization assumption on instrument Z, the optimal intent-to-treat problem is the same as an optimal treatment problem without unmeasured confounding, as Z can be seen as a treatment variable that is unconfounded with Y.

Causal effects given arbitrary subgroups of the population.

A key feature of our work is that the optimal intervention d(V) is a function of a fixed arbitrary subset V of all baseline covariates W. Designing individualized treatment regimes based on a patient’s characteristics and biomarkers is now both of great interest and computationally feasible. The paradigm of precision medicine calls for incorporating high-dimensional spaces of genetic, environmental, and lifestyle variables into treatment decisions (National Research Council 2011). Incorporating many covariates for estimating relevant components of the data-generating distribution can be helpful in: (1) improving the precision of the statistical model and (2) ensuring that the instrument induces exogenous variation given the covariates. However, a physician typically has a smaller set of patient variables that are available and that he/she considers reliable predictors. Thus, being able to calculate an optimal treatment (or intent-to-treat) regime as a function of an arbitrary subset of baseline covariates is of great use.

The targeted minimum loss-based framework.

Our estimators use targeted minimum loss-based estimation (TMLE), a methodology for semiparametric estimation that has very favorable theoretical properties and can be superior to other estimators in practice (van der Laan and Rubin 2006; van der Laan and Rose 2011). TMLE guarantees asymptotic efficiency when certain components of the data-generating distribution are consistently estimated. Thus, under certain conditions, the TMLE estimator achieves the lowest asymptotic variance among consistent estimators in a general semiparametric model, attaining the semiparametric Cramér–Rao lower bound (Newey 1990). The TMLE method also has a robustness guarantee: it produces consistent estimates even when the functional form is not known for all relevant components of the parameter of interest (see Sects. 1.6.3.4 and 1.7.3). Another beneficial property is asymptotic linearity, which ensures that TMLE-based estimates are close to normally distributed at moderate sample sizes, making for accurate coverage of confidence intervals. Finally, TMLE has the advantage over other semiparametric efficient estimators that it is a substitution estimator, meaning that the final estimate is made by evaluating the parameter of interest on the estimates of its relevant components. This property has been linked to good performance in sparse data (Gruber and van der Laan 2010).

The TMLE methodology uses the following procedure for constructing an estimator:

  1.

    Let \(P_0\) denote the true data-generating distribution. One first notes that the parameter of interest \(\varPsi (P_0)\) depends on \(P_0\) only through certain relevant components \(Q_0\) of the full distribution \(P_0\); in other words, \(\varPsi (P_0)=\varPsi (Q_0)\). TMLE targets these relevant components by only estimating \(Q_0\) and certain nuisance parameters \(g_0\) that are needed for updating the relevant components. An initial estimate \((Q_n^0,g_n)\) is formed of the relevant components and nuisance parameters. This is typically done using the Super Learner approach described in van der Laan et al. (2007), in which the best combination of learning algorithms is chosen from a library using cross-validation.

  2.

    Then, the relevant components \(Q_n^0\) are fluctuated, possibly in an iterative process, in an optimal direction for removing bias efficiently. To do so, one defines a fluctuation function \(\varepsilon \rightarrow Q(\varepsilon |g_n)\) and a loss function \(L(\dots )\), where we fluctuate \(Q_n^0\) to \(Q_n^0(\varepsilon |g_n)\) by solving for fluctuation \(\varepsilon = \text {argmin}_\varepsilon \ \frac{1}{n} \sum _{i=1}^n L(Q^0_n(\varepsilon |g_n), g_n)(O_i)\). For example, the loss function might be the mean squared error or the negative log likelihood function.

  3.

    Finally, one evaluates the statistical target parameter on the updated relevant components \(Q^*_n\) and arrives at estimate \(\psi ^*_n= \varPsi (Q^*_n)\).

The key requirement is to choose the fluctuation and loss functions so that, upon convergence of the components to their final estimate \(Q^*_n\) and \(g^*_n\), the efficient influence curve equation is solved:

$$ P_n \ D^*(Q^*_n, g^*_n) = 0 $$

Above, \(P_n\) denotes the empirical distribution of \((O_1, \ldots , O_n)\), and we use the shorthand notation \(P_n f = \frac{1}{n} \sum _{i=1}^n f(O_i)\). \(D^*\) denotes the efficient influence curve.
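To make the three-step recipe concrete, here is a minimal numerical sketch for the simplest point-treatment parameter \(E_W E(Y|A=1,W)\) (not this chapter's instrumental-variable parameter), assuming no unmeasured confounding. The function name, the array-based interface, and the Newton solver for the fluctuation are illustrative choices, not part of the chapter:

```python
import numpy as np

def tmle_mean_outcome(A, Y, Q_init, g_hat):
    """Minimal TMLE sketch for psi = E_W E(Y | A=1, W), assuming no
    unmeasured confounding.  Q_init: initial estimates of E(Y | A=1, W_i)
    in (0,1); g_hat: estimates of Pr(A=1 | W_i); Y binary or in [0,1]."""
    logit = lambda p: np.log(p / (1 - p))
    expit = lambda x: 1 / (1 + np.exp(-x))
    H = A / g_hat                       # "clever covariate"; 0 for untreated
    eps = 0.0
    for _ in range(100):                # Newton-Raphson: maximize the logistic
        Q_eps = expit(logit(Q_init) + eps * H)     # log-likelihood in eps
        score = np.sum(H * (Y - Q_eps))
        info = np.sum(H ** 2 * Q_eps * (1 - Q_eps))
        eps += score / info
        if abs(score / info) < 1e-10:
            break
    # The updated fit solves the efficient influence curve equation
    # P_n D* = 0; evaluating the parameter on it is the substitution step.
    Q_star = expit(logit(Q_init) + eps / g_hat)    # fluctuated Q at A = 1
    return float(np.mean(Q_star))
```

The one-dimensional logistic fluctuation plays the role of \(Q_n^0(\varepsilon |g_n)\) above, and the final line is the plug-in evaluation \(\varPsi (Q^*_n)\).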

1.2 Prior Work

Luedtke and van der Laan (2016a) is a recent work that gives a TMLE estimator for the mean outcome under optimal treatment given a cost constraint. That problem is very similar to the one we solve in Sect. 1.6, with the main difference being that we allow a more general non-unit cost constraint which results in a different closed-form solution to the optimal rule. Luedtke and van der Laan (2016b) tackles the issue of possible non-unique solutions and resulting violations of pathwise differentiability. The conditions we require in assumptions (A2)–(A4) are adopted from these works.

A large body of work focuses on the case of optimal treatment regimes in the unconstrained case, such as Robins (2004). More recently, various approaches tackle the constrained ODT problem: Zhang et al. (2012) describe a solution that assumes the optimal treatment regime is indexed by a finite-dimensional parameter, while Chakraborty et al. (2013) describe a bootstrapping method for learning ODT regimes with confidence intervals that shrink at a slower than root-n rate. Chakraborty and Moodie (2013) give a review of recent work on the constrained case.

1.3 Model and Problem

We consider the problem of estimation and inference under an optimal intervention, in the context of an instrumental variable model. We take an iid sample of n data points \(O=(W,Z,A,Y)\) drawn from a distribution in the semiparametric model \(\mathscr {M}\). Z is assumed to be a valid instrument for identifying the effect of treatment A on outcome Y, when one has to account for unmeasured confounding. In applications, instrument Z is often a randomized encouragement mechanism or randomized assignment to treatment which may or may not be followed. In other cases, Z is not perfectly randomized but nevertheless promotes or discourages individuals in receiving treatment. \(V \subseteq W \) is an arbitrary fixed subset of the baseline covariates, and \(F_V(W)\) gives the mapping \(W \rightarrow V\). \(d(V)\) refers to a decision rule as a function of V, where \(Z=d(V)\) is used to denote the optimal intervention on the instrument Z, in other words, the optimal assignment to treatment or the optimal intent-to-treat. \(A=d(V)\) refers to the optimal treatment rule. We are interested in estimating the mean counterfactual outcome under an optimal rule \(Z=d(V)\) or \(A=d(V)\). Figure 1.1 shows a diagram.

Fig. 1.1

Causal diagram

There are no restrictions on the type of data. However, the cases of categorical or continuous Z or A are dealt with separately in Toth (2016).

Further, we let \(c_A(A,W)\) be a cost function that gives the cost associated with assigning an individual with covariates W to a particular A value. We let \(c_T(Z,W)\) be a cost function that gives the total cost associated with assigning an individual with covariates W to a particular Z value. We can think of \(c_T(Z,W)\) as the sum of \(c_Z(Z,W)\), a cost incurred directly from setting Z, and \(E_{A|W,Z}c_A(A,W)\), an average cost incurred from the actual treatment A. We need to find the optimal rule \(Z=d(V)\) under cost constraint \(E{\ c_T}(Z,W) \le K\), for a fixed cost K, and the optimal rule \(A=d(V)\) under constraint \(E\ c_A(A,W) \le K\).

Notation. Let \(P_W \equiv \Pr (W)\) and \(\rho (W)\equiv \Pr (Z=1|W)\). Also let \(\varPi (Z,W)\equiv E(A\mid Z,W)\) be the conditional mean of A given Z, W, and \({\mu }(Z,W)\equiv E(Y\mid Z,W)\).

We also define \(\mu _b(V) \triangleq E_{W|V}\big [ \mu (Z=1,W)-\mu (Z=0,W) \big ]\), which gives the mean difference in outcome between setting \(Z=1\) and \(Z=0\) given V. Similarly, \(c_{b,Z}(V) \triangleq E_{W|V}\big [ c_T(Z=1,W)-c_T(Z=0,W) \big ]\), and \(c_{b,A}(V) \triangleq E_{W|V}\big [ c_A(A=1,W)-c_A(A=0,W) \big ]\). We also use notation \(m(V) \triangleq E_{W|V}m(W)\), where m is the causal effect function defined in the causal assumptions.

We further assume wlog that intent-to-treat \(Z=0\) has lower cost for all V: \(E_{W|V}c_T(0,W) \le E_{W|V}c_T(1,W)\). Let \(\underline{K_Z} \triangleq E_W c_T(0,W)\) be the total cost of not assigning any individuals to intent-to-treat, and \(\overline{K_Z} \triangleq E_W c_T(1,W)\) be the total cost of assigning everyone, and we assume a non-trivial constraint \(\underline{K_Z}<K < \overline{K_Z}\). Define \(\underline{K_A} \triangleq E_W c_A(0,W)\), and \(\overline{K_A}\) similarly.

Causal model.

Using the structural equation framework of (Pearl 2000), we assume that each variable is a function of other variables that affect it and a random term (also called error term). Let U denote the error terms. Thus, we have

\(W = f_W(U_W), Z = f_Z(W,U_Z), A = f_A(W,Z,U_A), Y = f_Y(W,Z,A, U_Y)\)

where \(U = (U_W,U_Z,U_A,U_Y) \sim \) \(P_{U,0}\) is an exogenous random variable, and \(f_W\), \(f_Z\), \(f_A\), \(f_Y\) may be unspecified or partially specified (for instance, we might know that the instrument is randomized). \(U_Y\) is possibly confounded with \(U_A\).

We use notation that a subscript of 0 denotes the true distribution, in expressions such as \(E_0, \ P_0\).

Assumption (A1) Assumptions ensuring that Z is a valid instrument:

  1.

    Exclusion restriction. Z only affects outcome Y through its effect on treatment A. Thus, \(f_Y(W,Z,A,U_Y)=f_Y(W,A,U_Y)\).

  2.

    Exogeneity of the instrument. \(E(U_Y|W,Z)=0\) for any W, Z.

  3.

    Z induces variation in A. \(\text {Var}_0[E_0(A|Z,W)|W] > 0\) for all W.

    Structural equation for outcome Y:

  4.

    \(Y = Am(W)+\theta (W)+U_Y\) for continuous Y, and

    \(\Pr (Y=1|W,A,\tilde{U}_Y) = Am(W)+\theta (W)+\tilde{U}_Y\) for binary Y,

    where \(U_Y=(\tilde{U}_Y,U'_Y)\) for an exogenous r.v. \(U'_Y\), and m, \(\theta \) are unspecified functions.

Assumptions 2 and 4 yield that, whether Y is binary or continuous,

$$ E(Y|W,Z)=m_0(W)\varPi _0(W,Z)+\theta _0(W) $$

We use \(Y(A=a)\) to denote the counterfactual from setting treatment to \(A=a\). These assumptions guarantee that \(E(Y(A=a))\) equals \(E_W \big [ m(W)a+\theta (W) \big ]\) for identifiable functions \(m, \theta \).

It should be noted that we do not require the instrument to be randomized with respect to treatment (\(U_Z \perp \! \! \! \perp U_A | \ W\) is not necessary).

It is simple to see from the above instrumental variable assumptions that Z is randomized with respect to Y, so we have:

Corollary 1

(Randomization of Z.) \(U_Z \perp U_Y | W \).

This implies \(E(Y(Z)|W) = E(Y|W,Z)\).

Statistical model. The above-stated causal model implies the statistical model \(\mathscr {M}\) consisting of all distributions P of \(O=(W,Z,A,Y)\) satisfying \(E_P (Y|W,Z) = m_P(W)\cdot \varPi _P(W,Z)+\theta _P(W)\). Here, \(m_P\) and \(\theta _P\) are unspecified functions and \(\varPi _P(W,Z)=E_P(A|W,Z)\) such that \(Var_P(\varPi _P(Z,W)|W)>0\) for all W. Note that the regression equation \(E_P (Y|W,Z) = m_P(W)\cdot \varPi _P(W,Z)+\theta _P(W)\) is always satisfied for some choice of m(W), \(\theta (W)\) when Z is binary. The distribution for the instrument \(\rho (W)\) may or may not be known, and we generally think of all other components \(P_W, \varPi , m, \theta \) as unspecified.
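As a concrete check of this statistical model, the following sketch simulates data from one structural equation system satisfying the assumptions: Z affects Y only through A, \(E(U_Y|W,Z)=0\), and an unmeasured confounder enters both the treatment and the outcome equations. All functional forms (the choice of \(\rho \), the uptake model, m, \(\theta \)) are hypothetical choices for illustration:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def simulate_iv_data(n, seed=0):
    """Simulate (W, Z, A, Y) consistent with the IV model: Z affects Y
    only through A, E(U_Y | W, Z) = 0, and U confounds A and Y."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=n)
    Z = rng.binomial(1, 0.4 + 0.2 * W)          # rho(W) = Pr(Z=1 | W)
    U = rng.normal(size=n)                      # unmeasured confounder
    A = rng.binomial(1, expit(0.5 * W + 1.5 * Z + U - 1))
    m, theta = 1 + W, W                         # m(W) and theta(W)
    Y = m * A + theta + U + rng.normal(size=n)  # U enters both A and Y
    return W, Z, A, Y
```

Because U is drawn independently of (W, Z), the regression equation \(E(Y|W,Z) = m(W)\varPi (W,Z)+\theta (W)\) holds here even though A and Y share the confounder U.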

1.3.1 Parameter of Interest, with Optimal Intent-to-Treat

Causal parameter of interest.

$$ \varPsi _Z(P_0) \triangleq \text {Max}_{d} \ E_{P_0} Y(Z=d(V)) \ \text {s.t.} \ E_{P_0} [c_T(Z=d(V),W)] \le K $$

Statistical target parameter.

$$\begin{aligned} \varPsi _{Z,0} = E_{P_{0}}\mu _0(Z=d_0(V),W) \end{aligned}$$
(1.1)

where \(d_0\) is the optimal intent-to-treat rule:

\(d_0 = \text {argmax}_{d} \ E_{P_0} \mu _0(Z=d(V),W) \text { s.t.} \ E_{P_0} [c_T(Z=d(V),W)] \le K \)

We also use the notation \(\varPsi _Z(P_0)=\varPsi _Z(P_{W,0},\mu _0)\).

1.3.2 Parameter of Interest, with Optimal Treatment

Causal parameter of interest.

$$\begin{aligned} \varPsi _A(P_0) \triangleq \text {Max}_{d} \ E_{0} Y(A=d(V)) \text { s.t.} \ E_{0} [c_A(A=d(V),W)] \le K \end{aligned}$$
(1.2)

Identifiability. m(W) is identified as \(\big [ (\mu (Z=1,W)-\mu (Z=0,W))/(\varPi (Z=1,W)-\varPi (Z=0,W)) \big ]\). \(\theta (W)\) is identified as \(\big [ \mu (Z,W)-\varPi (Z,W)\cdot m(W) \big ]\).
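The identification formulas can be applied pointwise once the regressions \(\mu \) and \(\varPi \) are estimated; the helper below is an illustrative transcription (the function name and array-based interface are our own):

```python
import numpy as np

def identify_m_theta(mu1, mu0, pi1, pi0):
    """Pointwise identification of m(W) and theta(W) from the regressions
    mu(z,W) and Pi(z,W), per the identities above (per-subject arrays)."""
    m = (mu1 - mu0) / (pi1 - pi0)   # requires Pi(1,W) != Pi(0,W), per (A1)
    theta = mu1 - pi1 * m           # equivalently mu0 - pi0 * m
    return m, theta
```

The two expressions for \(\theta (W)\) (using \(Z=1\) or \(Z=0\)) agree exactly when the model equation \(\mu = m\varPi + \theta \) holds.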

Statistical target parameter.

Lemma 1

The causal parameter given in Eq. (1.2) is identified by the statistical target parameter:

$$\begin{aligned} \varPsi _{A,0} = E_{P_{W,0}} \big [ m_0(W)d_0(V)+\theta _0(W) \big ] \end{aligned}$$
(1.3)

Note that the optimal decision rule \(d_0\) is a function of \(m_0, P_{W,0}\). For \(\varPsi _{A,0}\) we also use the notation \(\varPsi _A(P_{W,0},m_0, \theta _0)\), or alternately \(\varPsi _A(P_{W,0}, \varPi _0, \mu _0)\), using the above identifiability results.

This lemma follows from our causal assumptions:

\(\varPsi _A(P_0)=E\,Y(A=d_0(V)) = E_{W}E_{U_Y|W}E\big [Y(A=d_0(V))\,\big |\,W,U_Y\big ]\)

The right-hand side becomes \(E_W E_{U_Y|W}(m(W)d_0(V)+\theta (W)+U_Y)\) for a continuous Y, and \(E_W E_{U_Y|W}(m(W)d_0(V)+\theta (W)+\tilde{U}_Y)\) for a binary Y.

1.4 Closed-Form Solution for Optimal Rule \(d_0\) in the Case of Binary Treatment

The problem of finding the optimal deterministic treatment rule d(V) is NP-hard (Karp 1972). However, when allowing possible non-deterministic treatments, there is a simple closed-form solution for the optimal treatment or the optimal intent-to-treat. The optimal rule is to treat all strata with the highest marginal gain per marginal cost, so that the total cost of the policy equals the cost constraint.

This section introduces key quantities and notation used in the rest of the chapter. We present the solution in detail for the case of intervening on the instrument, when \(Z=d_0(V)\). Recall that wlog we think of \(Z=0\) as the ‘baseline’ intent-to-treat (ITT) value having lower cost. We define a scoring function \(T(V)= \frac{\mu _b(V)}{c_b(V)}\) for ordering subgroups (given by V) based on the effect of setting \(Z=1\) per unit cost. In the optimal intent-to-treat policy, all groups with the highest T(V) values deterministically have Z set to 1, up to cost K and assuming \(\mu _b \ge 0\). We write \(T_P(V)\) to make explicit the dependence on \(P_W,\mu (Z,W)\) from distribution P.

Define a function \(S_P\): \([-\infty ,+\infty ]\rightarrow \mathbb {R}\) as

$$ S_P(x)= E_V \big [I(T_P(V)\ge x)\,c_b(V)\big ] $$

In other words, \(S_P(x)\) gives the expected (additional above baseline) cost of setting \(Z=1\) for all subgroups having \(T_P(V)\ge x\). We use \(S_0(\cdot )\) to denote \(S_{P_0}\) from here on.

Define cutoff \(\eta _P\) as

$$ \eta _P=S^{-1}_P(K-\underline{K_{Z,P}}) $$

The assumptions below in Sect. 1.5 guarantee that \(S^{-1}_P(K-\underline{K_{Z,P}})\) exists and \(\eta _P\) is well defined. \(\eta _P\) is set so that setting \(Z=1\) for everyone having \(T(V)\ge \eta _P\) incurs a total cost of exactly K. Further let:

$$ \tau _P = \text {max}\{\eta _P,0\} $$

Thus, \(\tau \) gives the cutoff for the scoring function T(V), so the optimal rule is

$$ d_P(V) = 1 \text { iff } T_P(V) \ge \tau _P $$

Lemma 2

Assume (A2)–(A4). Then, the optimal decision rule \(d_0\) for parameter \(\varPsi _{Z,0}\) as defined in Eq. 1.1 is the deterministic solution \(d_0(V) = 1 \text { iff } T_0(V) \ge \tau _0 \), with \(T_0, \ \tau _0\) as defined above.

The proof is given in Toth (2016). That work also describes modifications to the optimal solution for \(d_0\) when Z is continuous or categorical.
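For finitely many V strata with known (or estimated) stratum probabilities, the closed-form rule amounts to sorting on the score and spending the budget greedily. The sketch below is illustrative (the function name and discrete-stratum setup are our own); the boundary stratum that exactly exhausts the budget, which the continuous theory handles via (A4), is ignored here:

```python
import numpy as np

def optimal_itt_rule(pV, mu_b, c_b, K_extra):
    """Closed-form optimal rule over discrete V strata (a sketch).
    pV: stratum probabilities; mu_b, c_b: effect and additional cost of
    Z=1 vs Z=0 per stratum (c_b > 0); K_extra: K minus the baseline cost
    of assigning Z=0 to everyone."""
    T = mu_b / c_b                             # scoring function T(V)
    order = np.argsort(-T)                     # highest score first
    spend = np.cumsum(pV[order] * c_b[order])  # S evaluated along the order
    within = spend <= K_extra                  # strata the budget covers
    eta = T[order][within][-1] if within.any() else np.inf
    tau = max(eta, 0.0)                        # tau = max(eta, 0)
    return (T >= tau).astype(int), tau         # d(V) = 1 iff T(V) >= tau
```

For example, four equal-probability strata with unit incremental costs and effects (4, 3, 2, 1), under a budget covering half the population, yield the rule that treats exactly the two highest-scoring strata.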

1.4.1 Closed-Form Solution for Optimal Treatment Rule \(A=d_0(V)\)

The solution given above goes through for the case of intervening on the treatment, with the two main modifications that: (1) replace intervention variable Z with A, and (2) replace \(\mu _b(V)\) with m(V). These latter quantities represent the effect on Y of applying the intervention versus the baseline value (at Z or A, respectively).

1.5 Assumptions for Pathwise Differentiability of \(\varPsi _{Z,0}\) and \(\varPsi _{A,0}\)

We use notation \(d_0=d_{P_0}\), \(\tau _0=\tau _{P_0}\), etc. We state these assumptions for \(\varPsi _{Z,0}\). The exact same assumptions apply for \(\varPsi _{A,0}\), replacing Z with A in a few places.

These three assumptions are needed to ensure pathwise differentiability and prove the form of the canonical gradient (Theorem 1).

Assumptions (A2)–(A4).

(A2) Positivity assumption: \(0<\rho _0(W) <1\).

(A3) There is a neighborhood of \(\eta _0\) where \(S_0(x)\) is Lipschitz continuous, and a neighborhood of \(S_0(\eta _0)=K-\underline{K_Z}_0\) where \(S^{-1}_0(y)\) is Lipschitz continuous.

(A4) \(Pr_0(T_0(V)=\tau )=0\) for all \(\tau \) in a neighborhood of \(\tau _0\).

Note that (A3) implies that \(S^{-1}_0(K-\underline{K_Z}_0)\) exists. Note also that (A3) actually implies \(Pr_0(T_0(V)=\eta )=0\) for \(\eta \) in a neighborhood of \(\eta _0\), and thus, (A3) implies (A4) when \(\eta _0>0\) and \(\tau _0=\eta _0\).

Need for (A4) (Guarantee of non-exceptional law).

If (A4) does not hold and there is positive probability of individuals being at the threshold for being treated or not under the optimal rule, then the solution d(V) is not unique, and \(\varPsi _{Z,0}\) is no longer pathwise differentiable. It is easy to see that under (A4), the optimal d(V) over the broader set of non-deterministic decision rules is a deterministic rule. Toth (2016) describes why (A4) is a reasonable assumption in practice when we have a constraint \(\underline{K_Z}<K < \overline{K_Z}\) that allows for only a strict subset of the population to be treated.

1.6 TMLE for Optimal Intent-to-Treat Problem (\(\varPsi _{Z,0}\))

All proofs and derivations for what follows are given in Toth (2016).

1.6.1 Canonical Gradient for \(\varPsi _{Z,0}\)

For \(O = (W,Z,A,Y)\), and deterministic rule d(V), define

$$\begin{aligned} D_1({d},P)(O) \triangleq \frac{I(Z={d}(V))}{\rho _P(W)}(Y-\mu _P(Z,W)) \end{aligned}$$
(1.4)
$$\begin{aligned} D_2({d},P)(O) \triangleq \mu _P({d}(V),W)-E_P\mu _P({d}(V),W) \end{aligned}$$
(1.5)
$$\begin{aligned} D_3(d, \tau , P)(O) = -\tau (c_T(d(V),W)-K) \end{aligned}$$
(1.6)

Define

$$ D^*(d,\tau , P)(O)\triangleq D_1(d,P)(O)+D_2(d,P)(O) +D_3(d, \tau , P)(O) $$

Theorem 1

Assume (A1)–(A4) above. Then \(\varPsi _Z\) is pathwise differentiable at \(P_0\) with canonical gradient \(D_0 = D^*(d_0,\tau _0, P_0)\).

1.6.2 TMLE

The relevant components for estimating \(\varPsi _Z= E_W\mu (Z=d(V),W)\) are \(Q=(P_W,\mu (Z,W))\). Decision rule d is also part of \(\varPsi \), but it is a function of \(P_W, \ \mu (Z,W)\). The nuisance parameter is \(g=\rho (W)\). First convert Y to the unit interval via a linear transformation \(Y \rightarrow \tilde{Y}\), so that \(\tilde{Y}=0\) corresponds to \(Y_{\min }\) and \(\tilde{Y}=1\) to \(Y_{\max }\). We assume \(Y \in [0,1]\) from here.

  1.

    Use the empirical distribution \(P_{W,n}\) to estimate \(P_W\). Make initial estimates of \(\mu _n(Z,W)\) and \(g_n = \rho _n(W)\) using any strategy desired. Data-adaptive learning using Super Learner is recommended.

  2.

    The empirical estimate \(P_{W,n}\) gives an estimate of \(Pr_{V,n}(V)=E_{W,n}I(F_V(W)=V)\), \(\underline{K_{Z,n}}=E_{W,n}c_T(0,W)\), \(\overline{K_{Z,n}}=E_{W,n}c_T(1,W)\), and \(c_{b,Z,n}(V)=E_{W,n|V}(c_T(1,W)-c_T(0,W))\).

  3.

    Estimate \(\mu _{b,0}\) as \(\mu _{b,n}(V)=E_{W,n|V}(\mu _n(1,W)-\mu _n(0,W))\).

  4.

    Estimate \(T_0(V)\) as \(T_n(V)=\frac{\mu _{b,n}(V)}{c_{b,Z,n}(V)}\).

  5.

    Estimate \(S_0(x)\) using \(S_n(x) = E_{V,n} \big [I(T_{n}(V)\ge x)\,c_{b,Z,n}(V)\big ]\).

  6.

    Estimate \(\eta _0\) as \(\eta _n\) using \(\eta _n= S_n^{-1}(K-\underline{K_{Z,n}})\) and \(\tau _n = \max \{0,\eta _n\}\).

  7.

    Estimate the decision rule as \(d_n(V) = 1 \text { iff } T_n(V) \ge \tau _n\).

  8.

    Now fluctuate the initial estimate of \(\mu _n(Z,W)\) as follows: For \(Z \in [0,1]\), define covariate \(H(Z,W)\triangleq \frac{I(d_n(V)=Z)}{g_n(W)}\). Run a logistic regression using:

    \(\begin{array}{rl} \text {Outcome: } & (Y_i: i =1,\ldots ,n)\\ \text {Offset: } & (\text {logit} \ \mu _n(Z_i,W_i): i=1,\ldots ,n) \\ \text {Covariate: } & (H(Z_i,W_i): i=1,\ldots ,n) \\ \end{array}\)

    Let \(\varepsilon _n\) represent the level of fluctuation, with

    \(\varepsilon _n = \text {argmax}_{\varepsilon } \frac{1}{n}\sum _{i=1}^n[Y_i \log \mu _n(\varepsilon )(Z_i,W_i) + (1-Y_i)\log (1-\mu _n(\varepsilon )(Z_i,W_i))]\) and \(\mu _n(\varepsilon )(Z,W)= \text {logit}^{-1}(\text {logit} \ \mu _n(Z,W)+\varepsilon H(Z,W))\).

  9.

    Set the final estimate of \(\mu (Z,W)\) to \(\mu ^*_n(Z,W)=\mu _n(\varepsilon _n)(Z,W)\).

  10.

    Finally, form final estimate of \(\varPsi _{Z,0}=\varPsi _{Z,d_0}(P_0)\) using the plug-in estimator

    $$ \varPsi ^*_Z = \varPsi _{Z,d_n}(P_n^*) = \frac{1}{n}\sum _{i=1}^{n}\mu _n^*(Z=d_n(V_i),W_i) $$

    We have used the notation \(\varPsi _{Z,d}(P)\) to refer to the mean outcome under decision rule \(Z=d(V)\), and \(P_n^*\) to denote the final estimate of the data-generating distribution.

It is easy to see that \(P_n D^*(d_n, \tau _n, P_n^*)=0\): We have \(P_n D_1(d_n,P_n^*) =\) \(P_n \frac{d}{d\varepsilon } L(Q_n(\varepsilon |g_n),g_n,(O_1, \ldots , O_n))|_{\varepsilon =0}=0\); \(P_n D_2(d_n, P_n^*)=0\) when we are using the empirical distribution \(P_{W,n}\); and \(P_n D_3(d_n, \tau _n,P_n^*) = 0\) is described in the proof of optimality of the closed-form solution in Toth (2016).
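Steps 8–10 above can be sketched numerically as follows, transcribing the clever covariate \(H(Z,W)=I(d_n(V)=Z)/g_n(W)\) and the logistic fluctuation literally; the Newton solver for \(\varepsilon _n\) and the per-subject array interface are illustrative choices:

```python
import numpy as np

def tmle_psi_z(Z, Y, mu_obs, mu_d, d_Z, rho):
    """Steps 8-10 of the TMLE, for binary Z and Y rescaled to [0,1].
    mu_obs: initial mu_n(Z_i, W_i);  mu_d: initial mu_n(d_n(V_i), W_i);
    d_Z: d_n(V_i);  rho: g_n(W_i) = estimated Pr(Z=1 | W_i)."""
    logit = lambda p: np.log(p / (1 - p))
    expit = lambda x: 1 / (1 + np.exp(-x))
    H_obs = (Z == d_Z).astype(float) / rho   # H(Z_i, W_i) at observed Z
    H_d = 1.0 / rho                          # H evaluated at Z = d_n(V)
    eps = 0.0
    for _ in range(100):                     # Newton-Raphson for eps_n:
        mu_eps = expit(logit(mu_obs) + eps * H_obs)
        score = np.sum(H_obs * (Y - mu_eps))     # logistic score equation
        info = np.sum(H_obs ** 2 * mu_eps * (1 - mu_eps))
        eps += score / info
        if abs(score / info) < 1e-10:
            break
    mu_star = expit(logit(mu_d) + eps * H_d)  # step 9: fluctuated mu*_n
    return float(np.mean(mu_star))            # step 10: plug-in Psi*_Z
```

Setting the score to zero at the final \(\varepsilon _n\) is exactly what makes \(P_n D_1(d_n,P_n^*)=0\) in the display above.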

1.6.3 Theoretical Results for \(\varPsi _{Z}^*\)

1.6.3.1 Conditions for Efficiency of \(\varPsi _{Z}^*\)

These six conditions are needed to prove asymptotic efficiency (Theorem 2). As discussed in Toth (2016), when all relevant components and nuisance parameters (\(P_{W,n}, \ \rho _n, \ \mu _n\)) are consistent, then (C3) and (C4) hold, while (C6) holds by construction of the TMLE estimator.

(C1) \(\rho _0(W)\) satisfies the strong positivity assumption: \(Pr_0(\delta< \rho _0(W)< 1-\delta )=1\) for some \(\delta >0\).

(C2) The estimate \(\rho _n(W)\) satisfies the strong positivity assumption, for a fixed \(\delta >0\) with probability approaching 1, so we have \(Pr_0(\delta< \rho _n(W)< 1-\delta ) \rightarrow 1\).

Define second-order terms as follows:

$$ R_{1}(d,P)\triangleq E_{P_0}\Big [ \Big ( 1-\frac{Pr_{P_0}(Z=d|W)}{Pr_{P}(Z=d|W)} \Big ) \big ( \mu _P(Z=d,W)-\mu _0(Z=d,W) \big ) \Big ] $$
$$ R_2(d,\tau _0,P) \triangleq E_{P_0} \Big [(d-d_0)(\mu _{b,0}(V) - \tau _0 c_{b,0}(V) ) \Big ] $$

Let \(R_0(d,\tau _0,P) = R_1(d,P)+R_2(d,\tau _0,P)\).

(C3) \(R_0(d_n,\tau _0,P_n^*)=o_{P_0}(n^{-\frac{1}{2}})\).

(C4) \({P_0}[(D^*(d_n,\tau _0, P_n^*)-D_0)^2] = o_{P_0}(1)\).

(C5) \(D^*(d_n,\tau _0,P_n^*)\) belongs to a \(P_0\)-Donsker class with probability approaching 1.

(C6) \(\frac{1}{n} \sum _{i=1}^n D^*(d_n,\tau _0,P_n^*)(O_i)=o_{P_0}(n^{-\frac{1}{2}})\).

1.6.3.2 Sufficient Conditions for Lemma 3

(E1) GC-like property for \(c_{b,Z}(V)\), \(\mu _{b,n}(V)\):

\(\sup _V|(E_{W,n|V}-E_{W,0|V})c_{b,T}(W)| = \sup _V(|c_{b,Z,n}(V)-c_{b,Z,0}(V)|)=o_{P_0}(1)\)

(E2) \(\sup _V|E_{W,0|V}\mu _{b,n}(W)-E_{W,0|V}\mu _{b,0}(W)|=o_{P_0}(1)\)

(E3) \(S_n(x)\), defined as \(x\rightarrow E_{V,n} [I(T_n(V)\ge x)c_{b,Z,n}(V)]\) is a GC-class.

(E4) Convergence of \(\rho _n\), \(\mu _n\) to \(\rho _0\), \(\mu _0\), respectively, in \(L^2(P_0)\) norm at a \(O(n^{-1/2})\) rate in each case.

When all relevant components and nuisance parameters are consistent, as is the case when Theorem 2 below holds and our estimator is efficient, we also expect conditions (E1)–(E4) to hold.

Toth (2016) discusses the assumptions and conditions above in detail.

1.6.3.3 Efficiency and Inference

Theorem 2

(\(\varPsi ^*_Z\) is asymptotically linear and efficient.) Assume assumptions (A1)–(A4) and conditions (C1)–(C6). Then, \(\varPsi ^*_Z = \varPsi _Z(P_n^*)=\varPsi _{Z,d_n}(P_n^*)\) as defined by the TMLE procedure is a RAL estimator of \(\varPsi _Z(P_0)\) with influence curve \(D_0\), so

$$ \varPsi _Z(P_n^*) -\varPsi _Z(P_0) = \frac{1}{n}\sum _{i=1}^n D_0(O_i)+ o_{P_0}(n^{-\frac{1}{2}}). $$

Further, \(\varPsi ^*_Z\) is efficient among all RAL estimators of \(\varPsi _Z(P_0)\).

Inference. Let \(\sigma _0^2= Var_{O \sim P_0}D_0(O)\). By Theorem 2 and the central limit theorem, \(\sqrt{n}(\varPsi _Z(P_n^*)-\varPsi _Z(P_0))\) converges in distribution to \(N(0, \sigma _0^2)\). Let \(\sigma _n^2 = \frac{1}{n}\sum _{i=1}^n D^*(d_n, \tau _n, P^*_n)(O_i)^2\) be an estimate of \(\sigma _0^2\).

Lemma 3

Under the assumptions (C1) and (C2), and conditions (E1)–(E4), we have \(\sigma _n \rightarrow _{P_0} \sigma _0\). Thus, an asymptotically valid 2-sided \(1-\alpha \) confidence interval is given by

$$ \varPsi ^*_Z \pm z_{1-\frac{\alpha }{2}}\frac{\sigma _n}{\sqrt{n}} $$

where \(z_{1-\frac{\alpha }{2}}\) denotes the \((1-\frac{\alpha }{2})\)-quantile of a N(0, 1) r.v.
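The interval of Lemma 3 is straightforward to compute once the influence curve values \(D^*(d_n,\tau _n,P^*_n)(O_i)\) are in hand; the helper below is an illustrative sketch (the function name is ours):

```python
import numpy as np
from statistics import NormalDist

def wald_ci(psi_star, D_values, alpha=0.05):
    """Two-sided 1-alpha Wald interval from estimated influence curve
    values D*(d_n, tau_n, P*_n)(O_i), with sigma_n^2 = P_n (D*)^2."""
    D = np.asarray(D_values, dtype=float)
    sigma_n = np.sqrt(np.mean(D ** 2))     # sigma_n as defined above
    half = NormalDist().inv_cdf(1 - alpha / 2) * sigma_n / np.sqrt(len(D))
    return psi_star - half, psi_star + half
```

Note that \(\sigma _n^2\) is the empirical second moment rather than the variance; the two agree asymptotically because \(P_n D^* = 0\) by construction of the TMLE.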

1.6.3.4 Double Robustness of \(\varPsi _{Z,n}^*\)

Theorem 2 demonstrates consistency and efficiency when all relevant components and nuisance parameters are consistently estimated. Another important issue is under which cases of partial misspecification we still get a consistent estimate of \(\varPsi _{Z,0}\), albeit an inefficient one. Our TMLE-based estimate \(\varPsi ^*_Z\) is a consistent estimate of \(\varPsi _{Z,0}\) under misspecification of \(\rho _n(W)\) in the initial estimates, but not under misspecification of \(\mu _n(W,Z)\). However, it turns out there is still an important double robustness property. If we consider \(\varPsi ^*_Z=\varPsi _{Z,d_n}(P_n^*)\) as an estimate of \(\varPsi _{Z,d_n}(P_0)\), where the optimal decision rule \(d_n(V)\) is estimated from the data, then we have that \(\varPsi ^*_Z\) is double robust to misspecification of \(\rho _n\) or \(\mu _n\) in the initial estimates.

Lemma 4

(\(\varPsi ^*_Z\) is a double robust estimator of \(\varPsi _{Z,d_n}(P_0)\).) Assume assumptions (A1)–(A4) and conditions (C1)–(C2). Also assume the following version of (C4): \(Var_{O \sim P_0}(D_1(d_n, P_n^*)(O)+D_2(d_n,P_n^*)(O)) < \infty \).

Then, \(\varPsi ^*_Z=\varPsi _{Z,d_n}(P_n^*)\) is a consistent estimator of \(\varPsi _{Z,d_n}(P_0)\) when either \(\mu _n\) is specified correctly, or \(\rho _n\) is specified correctly.

The proof of this lemma is based on the equation

\(\varPsi _{Z,d_n}(P^*_n)-\varPsi _{Z,d_n}(P_0)=-P_0 \big [D_1(d_n,P_n^*)+D_2(d_n,P_n^*) \big ]+R_1(d_n,P^*_n)\)

where \(D_1, \ D_2\), and \(R_1\) are as defined in Sects. 1.6.1 and 1.6.3.1.

1.7 TMLE for Optimal Treatment Problem (\(\varPsi _{A,0}\))

We now present results for the case of intervening on the treatment, setting \(A=d(V)\).

1.7.1 Efficient Influence Curve \(D^*_A(\varPsi _0)\)

Lemma 5

Let

$$ J_0(Z,W) = \frac{I(Z=1)}{\rho _0(W)}+ \frac{\big ( \frac{I(Z=1)}{\rho _0(W)} -\frac{I(Z=0)}{1-\rho _0(W)} \big ){\big (d_0(V)-\varPi _0(W,Z=1)\big )} }{\varPi _0(W,Z=1)-\varPi _0(W,Z=0)} $$

The efficient influence curve \(D^*_A(\varPsi _0)\) is

$$\begin{aligned}&D^*_A(\varPsi _0) = -\tau _0 E_{P_0}[c_T(d_0(V),W)-K] \end{aligned}$$
(1.7)
$$\begin{aligned}&+ m_0(W)d_0(V)+\theta _0(W)-\varPsi _0 \ \end{aligned}$$
(1.8)
$$\begin{aligned}&- J_0(Z,W) m_0(W)\big [ A-\varPi _0(W,Z) \big ] \end{aligned}$$
(1.9)
$$\begin{aligned}&+ J_0(Z,W) \big [ Y-(m_0(W)\varPi _0(W,Z)-\theta _0(W)) \big ] \end{aligned}$$
(1.10)

We also write \(D^*(d_0, \tau _0, P_0)\). For convenience, denote lines (1.7)–(1.10) of \(D^*\) above as \(D^*_c, \ D^*_W, \ D^*_\varPi \), and \(D^*_\mu \), respectively. Finally, let \(D^*_{A,d_n}\) denote the efficient influence curve for \(\varPsi _{A,d_n}(P_0) \triangleq E_{P_{W,0}} \big [ m_0(W)d_n(V)+\theta _0(W) \big ]\), which is the mean counterfactual outcome when the decision rule is estimated from the data. We have \(D^*_{A,d_n} = D^*_W+D^*_\varPi +D^*_\mu \) (see Toth 2016).

1.7.2 Iterative TMLE Estimator

We have derived two different TMLE-based estimators for \(\varPsi _{A,0}\). We present an iterative estimator here, which involves a standard, numerically well-behaved, and easily understood likelihood maximization operation at each step. The other estimator uses a logistic fluctuation in a single non-iterative step and has the advantage that the estimate \(\mu \) respects the bounds of Y found in the data (see Toth 2016; Toth and van der Laan 2016).

The relevant components for estimating \(\varPsi _A= E_W[m(W)d(V)+\theta (W)]\) are \(Q=(P_W,m, \theta )\). The nuisance parameters are \(g=(\rho ,\varPi )\). d(V) and \(\tau \) can be thought of as functions of \(P_W, m\) here. Let

\(h_1(W) \triangleq \frac{1}{\rho (W)(\varPi (W,1)-\varPi (W,0))}+\frac{d(V)-\varPi (W,1)}{(\varPi (W,1)-\varPi (W,0))^2}\frac{1}{\rho (W)(1-\rho (W))}\). Also, let \(h_2(W) \triangleq \frac{1}{\rho }\big [ 1-\frac{\varPi (W,1)}{\varPi (W,1)-\varPi (W,0)} + \frac{d-\varPi (W,1)}{\varPi (W,1)-\varPi (W,0)}(1-\frac{\varPi (W,1)}{\varPi (W,1)-\varPi (W,0)}\frac{1}{1-\rho }) \big ]\).

Then, we have that \(D^*_\mu = (h_1\varPi +h_2)(Y-m\varPi -\theta )\).
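As a sanity check on these dense expressions, the identity \(h_1(W)\varPi (W,Z)+h_2(W)=J(Z,W)\), with J as in Lemma 5, can be verified numerically, which confirms the factorization \(D^*_\mu = (h_1\varPi +h_2)(Y-m\varPi -\theta )\). A minimal numpy sketch with arbitrary synthetic values of \(\rho \), \(\varPi \), and d (all variable names and values hypothetical):

```python
import numpy as np

def J(z, rho, pi1, pi0, d):
    """Clever covariate J(Z,W) from Lemma 5."""
    iz1 = (z == 1).astype(float)
    iz0 = (z == 0).astype(float)
    return iz1 / rho + (iz1 / rho - iz0 / (1 - rho)) * (d - pi1) / (pi1 - pi0)

def h1(rho, pi1, pi0, d):
    dpi = pi1 - pi0
    return 1.0 / (rho * dpi) + (d - pi1) / dpi**2 / (rho * (1 - rho))

def h2(rho, pi1, pi0, d):
    dpi = pi1 - pi0
    return (1.0 / rho) * (1 - pi1 / dpi
                          + (d - pi1) / dpi * (1 - pi1 / (dpi * (1 - rho))))

rng = np.random.default_rng(0)
n = 1000
rho = rng.uniform(0.2, 0.8, n)           # instrument propensity rho(W)
pi1 = rng.uniform(0.55, 0.9, n)          # Pi(W, Z=1)
pi0 = rng.uniform(0.1, 0.45, n)          # Pi(W, Z=0)
d = rng.integers(0, 2, n).astype(float)  # a candidate decision rule d(V)
z = rng.integers(0, 2, n)
pi_z = np.where(z == 1, pi1, pi0)

# the covariate h1*Pi + h2 used in the OLS fluctuation agrees with J(Z,W)
assert np.allclose(h1(rho, pi1, pi0, d) * pi_z + h2(rho, pi1, pi0, d),
                   J(z, rho, pi1, pi0, d))
```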

If A is not binary, convert A to the unit interval via a linear transformation \(A \rightarrow \tilde{A}\) so that \(\tilde{A}=0\) corresponds to \(A_{\min }\) and \(\tilde{A}=1\) to \(A_{\max }\). We assume \(A \in [0,1]\) from here.

1. Use the empirical distribution \(P_{W,n}\) to estimate \(P_W\). Make initial estimates of \(Q=\{m_n(W), \ \theta _n(W)\}\) and \(g_n = \{\rho _n(W), \varPi _n(W,Z)\}\) using any strategy desired. Data-adaptive learning using Super Learner is recommended.

2. The empirical estimate \(P_{W,n}\) gives an estimate of \(Pr_{V,n}(V)=E_{W,n}I(F_V(W)=V)\), \(\underline{K_{A,n}}=E_{W,n}c_A(0,W)\), \(\overline{K_{A,n}}=E_{W,n}c_A(1,W)\), and \(c_{b,A,n}(V)=E_{W,n|V}(c_A(1,W)-c_A(0,W))\).

3. Estimate \(m_0(V)\) as \(m_n(V)=E_{W,n|V}m_n(W)\).

4. Estimate \(T_0(V)\) as \(T_n(V)=\frac{m_{n}(V)}{c_{b,A,n}(V)}\).

5. Estimate \(S_0(x)\) using \(S_n(x) = E_{V,n} [I(T_{n}(V)\ge x)(c_{b,A,n}(V))]\).

6. Estimate \(\eta _0\) using \(\eta _n = S_n^{-1}(K-\underline{K_{A,n}})\), and set \(\tau _n = \max \{0,\eta _n\}\).

7. Estimate the decision rule as \(d_n(V) = 1 \text { iff } T_n(V) \ge \tau _n\) (the decision rule is not updated iteratively).

    ITERATE STEPS (8)–(9) UNTIL CONVERGENCE:

8. Fluctuate the initial estimates of \(m_n(W)\) and \(\theta _n(W)\) as follows: Using \(\mu _n(Z,W)=m_n(W)\varPi _n(Z,W)+\theta _n(W)\), run an OLS regression: \(\begin{array}{rl} \text {Outcome: } &{} (Y_i: i =1,\ldots ,n)\\ \text {Offset: } &{} (\mu _n(Z_i,W_i),i=1,\ldots ,n) \\ \text {Covariate: } &{} (h_1(W_i)\varPi _n(Z_i,W_i)+h_2(W_i):i=1,\ldots ,n) \\ \end{array} \)

    Let \(\varepsilon _n\) represent the level of fluctuation, with

\(\varepsilon _n = \text {argmin}_{\varepsilon } \frac{1}{n}\sum _{i=1}^n(Y_i-\mu _n(\varepsilon )(Z_i,W_i))^2\) and

    \(\mu _n(\varepsilon )(Z,W)= \mu _n(Z,W)+\varepsilon (h_1(W)\varPi _n(Z,W)+h_2(W))\).

    Note that \(\mu _n(\varepsilon ) = (m_n+\varepsilon h_1 )\varPi _n+(\theta _n+\varepsilon h_2)\) stays in the semiparametric regression model.

    Update \(m_n\) to \(m_n(\varepsilon )=m_n+\varepsilon h_1\), \(\theta _n\) to \(\theta _n(\varepsilon )=\theta _n+\varepsilon h_2\).

9. Now fluctuate the initial estimate of \(\varPi _n(Z,W)\) as follows: Use the covariate J(Z, W) as defined in Lemma 5. Run a logistic regression using:

\(\begin{array}{rl} \text {Outcome: } &{} (A_i: i =1,\dots ,n)\\ \text {Offset: } &{} (\text {logit} \ \varPi _n(Z_i,W_i),i=1,\dots ,n) \\ \text {Covariate: } &{} (J(Z_i,W_i)m_n(W_i):i=1,\dots ,n) \\ \end{array} \)

    Let \(\varepsilon _n\) represent the level of fluctuation, with

\(\varepsilon _n = \text {argmax}_{\varepsilon } \frac{1}{n}\sum _{i=1}^n[A_i \log \varPi _n(\varepsilon )(Z_i,W_i) + (1-A_i) \log (1-\varPi _n(\varepsilon )(Z_i,W_i))]\) and \(\varPi _n(\varepsilon )(Z,W)= \text {logit}^{-1}(\text {logit} \ \varPi _n(Z,W)+\varepsilon J(Z,W)m_n(W))\).

    Update \(\varPi _n\) to \(\varPi _n(\varepsilon )\). Also update \(h_1(W), \ h_2(W)\) to reflect the new \(\varPi _n\).

10. Finally, form the final estimate of \(\varPsi _{A,0}=\varPsi _{A,d_0}(P_0)\) using a plug-in estimator with the estimates upon convergence, \(m_n^*\) and \(\theta _n^*\):

    $$ \varPsi ^*_A = \varPsi _{A,d_n}(P_n^*) = \frac{1}{n}\sum _{i=1}^{n} \Big [ m_n^*(W_i)\cdot d_n(V_i)+\theta _n^*(W_i) \Big ] $$

As for \(\varPsi _Z\), it is straightforward to check that the efficient influence curve equation \(P_n D^*(d_n, \tau _n, P_n^*)=0\) holds.
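The iterative fluctuation above can be sketched in code. The following is a minimal numpy sketch of steps 8–10, under the assumptions that A is binary and that all initial fits have been evaluated at the observed data as length-n arrays; it is an illustration of the scheme, not the authors' implementation. The one-dimensional least-squares \(\varepsilon \) has a closed form, and the logistic \(\varepsilon \) is found by Newton steps on the (concave) log-likelihood.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def tmle_psi_a(Y, A, Z, m, theta, pi1, pi0, rho, d, max_iter=50, tol=1e-10):
    """Sketch of the iterative fluctuation (steps 8-9) and plug-in (step 10).
    Inputs are length-n arrays of initial estimates at the data points:
    m = m_n(W_i), theta = theta_n(W_i), pi1 = Pi_n(W_i, 1),
    pi0 = Pi_n(W_i, 0), rho = rho_n(W_i), d = d_n(V_i)."""
    iz1, iz0 = (Z == 1).astype(float), (Z == 0).astype(float)
    for _ in range(max_iter):
        dpi = pi1 - pi0
        h1 = 1/(rho*dpi) + (d - pi1) / dpi**2 / (rho*(1 - rho))
        h2 = (1/rho)*(1 - pi1/dpi + (d - pi1)/dpi * (1 - pi1/(dpi*(1 - rho))))
        pi_z = iz1*pi1 + iz0*pi0
        H = h1*pi_z + h2                             # OLS fluctuation covariate
        eps1 = H @ (Y - (m*pi_z + theta)) / (H @ H)  # closed-form least squares
        m, theta = m + eps1*h1, theta + eps1*h2
        # step 9: logistic fluctuation of Pi along covariate J(Z,W)*m(W)
        J1 = (1/rho) * (1 + (d - pi1)/dpi)           # J(Z=1, W)
        J0 = -(d - pi1) / (dpi*(1 - rho))            # J(Z=0, W)
        C = (iz1*J1 + iz0*J0) * m
        eps2 = 0.0
        for _ in range(25):                          # Newton on the 1-d log-likelihood
            p = expit(logit(pi_z) + eps2*C)
            step = (C @ (A - p)) / (C**2 @ (p*(1 - p)))
            eps2 += step
            if abs(step) < tol:
                break
        pi1 = expit(logit(pi1) + eps2*J1*m)
        pi0 = expit(logit(pi0) + eps2*J0*m)
        if abs(eps1) < tol and abs(eps2) < tol:
            break
    return np.mean(m*d + theta)                      # step 10: plug-in estimate

# hypothetical demo on synthetic data where the initial fits equal the truth
rng = np.random.default_rng(1)
n = 500
rho = np.full(n, 0.5); pi1 = np.full(n, 0.8); pi0 = np.full(n, 0.2)
Z = rng.integers(0, 2, n)
A = rng.binomial(1, np.where(Z == 1, pi1, pi0)).astype(float)
m = rng.uniform(0.5, 1.5, n); theta = rng.uniform(-0.5, 0.5, n)
d = np.ones(n)
Y = m*np.where(Z == 1, pi1, pi0) + theta + 0.1*rng.standard_normal(n)
psi = tmle_psi_a(Y, A, Z, m, theta, pi1, pi0, rho, d)
```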

1.7.3 Double Robustness of \(\varPsi _A^*\)

As in Sect. 1.6.3.4, \(\varPsi ^*_A\) is not a double robust estimator of \(\varPsi _{A,0}\): the component m(W) must always be consistently specified as a necessary condition for consistency of \(\varPsi ^*_A\). However, if we consider \(\varPsi ^*_A=\varPsi _{A,d_n}(P_n^*)\) as an estimate of \(\varPsi _{A,d_n}(P_0)\), where the optimal decision rule \(d_n(V)\) is estimated from the data, then \(\varPsi ^*_A\) is double robust:

Lemma 6

(\(\varPsi ^*_A\) is a double robust estimator of \(\varPsi _{A,d_n}(P_0)\).) Assume (A1)–(A4) and (C1)–(C2). Also assume \(Var_{O \sim P_0}(D^*_d(d_n,P_n^*)(O)) < \infty \).

Then, \(\varPsi ^*_A=\varPsi _{A,d_n}(P_n^*)\) is a consistent estimator of \(\varPsi _{A,d_n}(P_0)\) when either:

  • \(m_n\) and \(\theta _n\) are consistent

  • \(\rho _n\) and \(\varPi _n\) are consistent

  • \(m_n\) and \(\rho _n\) are consistent

Above \(D^*_d\) refers to \(D^*_\mu +D^*_\varPi +D^*_W\), the portions of the efficient influence curve that are orthogonal to variation in decision rule d. The proof is straightforward (see Toth 2016).

1.8 Simulations

1.8.1 Setup

We use two main data-generating functions:

Dataset 1 (categorical Y).

Data is generated according to:

$$\begin{aligned} U_{AY}\sim & {} \text {Bernoulli}(1/2) \\ W1\sim & {} \text {Uniform}(-1,1) \\ W2\sim & {} \text {Bernoulli}(1/2) \\ Z\sim & {} \text {Bernoulli}(\alpha ) \\ A\sim & {} \text {Bernoulli}(W1+10\cdot Z+2\cdot U_{AY}-10) \\ Y\sim & {} \text {Bernoulli}\big ((1-A)\cdot \text {plogis}(W2-2- U_{AY})+ A\cdot \text {plogis}(W1+4)\big ) \\ \end{aligned}$$

\(U_{AY}\) is the confounding term. For the simulations where \(V \subset W\), we take \(V = (I(W1\ge 0)-I(W1<0), \ W2)\). We have \(c_T(Z=1,W)=1\) and \(c_T(Z=0,W)=0\) for all W here.
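A minimal numpy sketch of the Dataset 1 generating process follows. One assumption is flagged: we pass the linear predictor for A through plogis, since the display writes the Y equation with plogis but omits it for A, whose linear predictor is not otherwise a valid probability.

```python
import numpy as np

def plogis(x):
    return 1.0 / (1.0 + np.exp(-x))

def gen_dataset1(n, alpha=0.5, seed=0):
    """Sketch of Dataset 1 (categorical Y). The plogis link on A's linear
    predictor is an assumption (the text writes it only for Y)."""
    rng = np.random.default_rng(seed)
    U = rng.binomial(1, 0.5, n)                  # unmeasured confounder U_AY
    W1 = rng.uniform(-1, 1, n)
    W2 = rng.binomial(1, 0.5, n)
    Z = rng.binomial(1, alpha, n)                # instrument
    A = rng.binomial(1, plogis(W1 + 10*Z + 2*U - 10))
    pY = (1 - A)*plogis(W2 - 2 - U) + A*plogis(W1 + 4)
    Y = rng.binomial(1, pY)
    V = np.column_stack([np.where(W1 >= 0, 1, -1), W2])  # V for the V subset-of-W runs
    return W1, W2, Z, A, Y, V

W1, W2, Z, A, Y, V = gen_dataset1(500)
```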

Dataset 2 (continuous Y).

We use three-dimensional W and distribution

$$\begin{aligned} U_{AY}\sim & {} \text {Normal}(0,1) \\ W\sim & {} \text {Normal}(\mu _\beta , \Sigma ) \\ Z\sim & {} \text {Bernoulli}(0.1) \\ A\sim & {} -2\cdot W1+W2^2+ 4\cdot W3 \cdot Z+U_{AY} \\ Y\sim & {} 0.5 \cdot W1\cdot W2- W3+3\cdot A \cdot W2+ U_{AY} \\ \end{aligned}$$

When \(V \subset W\), we take V to be either W1 rounded to the nearest 0.2 or, alternately, W3 rounded to the nearest 0.2. We also have \(c_T(0,W) = 0\) for all W and \(c_T(1,W) =1+b\cdot W1\), with varying \(\mu _\beta \), \(\Sigma \), and b.
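Dataset 2 can be sketched in the same way. Since \(\mu _\beta \), \(\Sigma \), and b are varied across simulations, the defaults below (zero mean, identity covariance, b = 0.5) are purely illustrative assumptions.

```python
import numpy as np

def gen_dataset2(n, b=0.5, mu_beta=None, Sigma=None, seed=0):
    """Sketch of Dataset 2 (continuous Y). mu_beta, Sigma, and b vary
    across simulations; the defaults here are illustrative."""
    rng = np.random.default_rng(seed)
    if mu_beta is None:
        mu_beta = np.zeros(3)
    if Sigma is None:
        Sigma = np.eye(3)
    U = rng.standard_normal(n)                   # unmeasured confounder U_AY
    W = rng.multivariate_normal(mu_beta, Sigma, n)
    W1, W2, W3 = W[:, 0], W[:, 1], W[:, 2]
    Z = rng.binomial(1, 0.1, n)
    A = -2*W1 + W2**2 + 4*W3*Z + U
    Y = 0.5*W1*W2 - W3 + 3*A*W2 + U
    cost = np.where(Z == 1, 1 + b*W1, 0.0)       # c_T(1,W)=1+b*W1, c_T(0,W)=0
    V = np.round(W1/0.2)*0.2                     # e.g., W1 rounded to nearest 0.2
    return W, Z, A, Y, V, cost

W, Z, A, Y, V, cost = gen_dataset2(1000)
```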

Forming initial estimates.

We use the empirical distribution \(P_{W,n}\) for the distribution of W. For learning \(\mu _n\), we use Super Learner, with the following libraries of learners (names as specified in the SuperLearner package; van der Laan et al. 2007):

For continuous Y: glm, step, randomForest, nnet, svm, polymars, rpart, ridge, glmnet, gam, bayesglm, loess, mean.

For categorical Y: glm, step, svm, step.interaction, glm.interaction, nnet.4, gam, randomForest, knn, mean, glmnet, rpart.

Further, we included different parameterizations of some of the learners given above, such as ntree = 100, 300, 500, 1000 for randomForest.

Finally, for learning \(\rho _n\), we use a correctly specified logistic regression, regressing Z on W (except for simulation (C) as described below).

Estimators used.

For both parameters of interest \(\varPsi _Z\) and \(\varPsi _A\), we report results on the TMLE estimator \(\varPsi _Z^*\) (or \(\varPsi _A^*\)), and the initial substitution estimator \(\varPsi _{Z,n}^0\) (or \(\varPsi _{A,n}^0\)). The latter is the plug-in estimate, for instance \(\varPsi _{Z,n}^0 \triangleq \varPsi _Z(P_{W,n}, \mu _n)\), that uses the same initial estimates of relevant components and the nuisance parameter as TMLE. Thus, the initial substitution estimator gives a comparison of TMLE to a straightforward semiparametric, machine learning-based approach. Each simulation is repeated 1000 times.

Table 1.1 (Simulation A.) Consistent estimation of \(\varPsi _{Z,0}\) using machine learning, categorical Y. \(\varPsi _{Z,0}=0.3456\), \(K=0.3\), and \(V \subset W\). \(\sigma _n^2 = \text {Var}_{O \sim P_n} D^*_Z(d_n,\tau _n,P_n^*)(O)\)
Table 1.2 (Simulation B.) Consistent estimation of \(\varPsi _{A,0}\) using machine learning, continuous Y. \(\varPsi _{A,0}=336.2\), \(K=0.8\), and \(V \subset W\). \(\sigma _n^2 = \text {Var}_{O \sim P_n} D^*_A(d_n,\tau _n,P_n^*)(O)\)

Simulations (A–B): using a large library of learning algorithms for consistent initial estimates.

Tables 1.1 and 1.2 show the behavior of our estimators when machine learning is used to consistently estimate all relevant components and nuisance parameters. Table 1.1 deals with estimating \(\varPsi _Z\) when Y is categorical. In this case, bias is very low with or without the TMLE fluctuation step. \(\sigma _n^2/n\) gives a consistent estimate of the variance of \(\varPsi _Z^*\), in this case where efficiency holds. We see that both estimators have very low variance that converges to \(\sigma _n^2/n\) by \(n=1000\). Coverage of 95% confidence intervals is also displayed, with intervals calculated as \(\varPsi ^*_n \pm 1.96\frac{\sigma _n}{\sqrt{n}}\), as in Lemma 3. The coverage is given in parentheses for the initial substitution estimator, as \(\sigma _n^2\) is not necessarily the right variance. The TMLE estimators show better coverage, even though, in this example, the width of the confidence intervals was accurate for all estimators for \(n \ge 1000\). This may be due to the asymptotic linearity property of the TMLE-based estimators, ensuring that they follow a normal distribution as n becomes large.

Y is continuous in Table 1.2. TMLE convincingly outperforms the initial substitution estimator in both bias and variance here. Only the TMLE estimator is guaranteed to be efficient, and we see a significant improvement in variance. The estimated asymptotic variance \(\sigma _n^2/n\) approximates the variance seen in \(\varPsi _A^*\) fairly well for \(n \ge 1000\). The coverage of confidence intervals for TMLE seems to converge to 95% more slowly than for the previous case of categorical Y.

Simulation (C): double robustness under partial misspecification.

As described in Sect. 1.7.3, \(\varPsi ^*_A=\varPsi ^*_{A,d_n}\) is a double robust estimator of \(\varPsi _{A,d_n}(P_0)\), but not necessarily of \(\varPsi _{A,0}\).

Table 1.3 (Simulation C.) Robustness of \(\varPsi _A^*\) to partial misspecification; \(\mu _n\) is misspecified. \(\varPsi _{A,0}=0.63\), \(K=0.5\), and \(V=W\)
Table 1.4 (Simulation D.) Estimation of true mean outcome \(\varPsi _{Z,d_n}(P_0)\) under rule \(d_n\). \(\varPsi _{Z,d_n}(P_0)= 162.8\) when \(K=0.2\), and \(\varPsi _{Z,d_n}(P_0)=289.1\) when \(K=0.8\). Sample size is \(N=1000\) and \(V=W\)

Table 1.3 verifies consistency of \(\varPsi _A^*\) when the initial estimate for \(\mu _n\) is grossly misspecified as \(\mu _n=\text {mean}(Y)\). This creates a discrepancy of \(\sim \)0.1 points between \(\varPsi _{A,d_n}(P_0)\) and \(\varPsi _{A,0}\). The initial substitution estimator retains a bias of around −0.09 in estimating \(\varPsi _{A,d_n}(P_0)\), while TMLE demonstrates practically zero bias by \(n=1000\). TMLE is not efficient in this setting of partial misspecification. It has significantly larger variance than the initial substitution estimator for smaller sample sizes, but the variances are similar by \(n=4000\). For confidence intervals, the width was calculated by estimating Var(\(\varPsi _{A,d_n}\)) as \(\sigma _n^2 = \text {Var}_{O \sim P_n} D^*_{d_n}(P_n^*)(O) \), where \(D^*_{d_n}(P)\) is the efficient influence curve of \(\varPsi _{A,d_n}(P)\) as defined in Sect. 1.7.1. It provides a conservative (over)-estimate of variance for confidence intervals, as discussed in Toth and van der Laan (2016). We see that TMLE’s coverage converges to just above 95%. On the other hand, coverage is very low for the initial substitution estimator due to its bias. This is despite the fact that the intervals are too wide in this case.

Simulation (D): quality of the estimate of \(d_n\) versus the true mean outcome attained under rule \(d_n\).

We study how more accurate estimation of the decision rule \(d_n\) can lead to a higher objective. The objective maximized here is the mean outcome under rule \(d_n\), where \(d_n\) must satisfy a cost constraint. We use the known true distributions \(P_{W,0}\) and \(\mu _0\) to calculate the true mean outcome under \(d_n\) as \(\varPsi _{d_n}(P_0) = E_{P_0}\mu _0(W,Z=d_n(V))\). The true mean outcome under any decision rule satisfying \(E_{P_0}c_T(W,Z=d(V)) \le K\) is at most \(\varPsi _0\), attained by the optimal rule \(d=d_0\). Therefore, the discrepancy between \(\varPsi _{d_n}(P_0)\) and \(\varPsi _0\) measures how much inaccurate estimation of the decision rule diminishes the objective.
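The evaluation of a candidate rule against the truth can be sketched as a Monte Carlo calculation under Dataset 2. The form of \(\mu _0(W,Z)\) below is obtained by substituting the structural equation for A into that for Y and using \(E[U_{AY} \mid W, Z]=0\) (Z is randomized); the choice \(W \sim N(0, I_3)\) and the rounding that defines V are illustrative assumptions.

```python
import numpy as np

def mu0(W, z):
    """True outcome regression E[Y | W, Z=z] under Dataset 2, derived by
    substituting A = -2*W1 + W2^2 + 4*W3*z + U into the Y equation and
    taking expectations over U (mean zero, independent of W and Z)."""
    W1, W2, W3 = W[:, 0], W[:, 1], W[:, 2]
    return 0.5*W1*W2 - W3 + 3*W2*(-2*W1 + W2**2 + 4*W3*z)

def true_value(rule, n=1_000_000, seed=0):
    """Monte Carlo estimate of Psi_d(P_0) = E_P0 mu_0(W, Z=d(V)) for a
    candidate intent-to-treat rule d; W ~ N(0, I_3) is an illustrative
    choice of (mu_beta, Sigma)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, 3))
    V = np.round(W[:, 0]/0.2)*0.2        # V = W1 rounded to the nearest 0.2
    return np.mean(mu0(W, rule(V)))

# evaluate a trivial candidate rule d(V) = 0 for all V
val = true_value(lambda v: np.zeros_like(v))
```

Plugging the learned \(d_n\) in for `rule` gives \(\varPsi _{d_n}(P_0)\), whose gap from \(\varPsi _0\) is the quantity reported in Table 1.4.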

We compare \(\varPsi _{d_n}(P_0)\) when estimating \(\mu _n\) using the usual large library of learners; when using a smaller library consisting of mean, loess, nnet.size = 3, nnet.size = 4, and nnet.size = 5; and finally when we set \(\mu _n= \text {mean}(Y)\). In each case \(d_n\) is estimated using \(\mu _n\) as usual (note that it is the same between the initial substitution and TMLE-based estimates). Table 1.4 confirms the importance of forming a good fit with the data for achieving a high mean outcome. For \(K=0.2\), when roughly 20% of the population could be assigned \(Z=1\), the mean outcome using the full library of learners was only a few points below the true optimal mean outcome \(\varPsi _{d_0}\) (158.9 vs. 162.8). However, it was about 15 points lower when using the much smaller library of learners. In fact, even though the smaller library still applies machine learning with several nonparametric methods, the objective \(\varPsi _{d_n}(P_0)\) it attained was not far from that attained with the most uninformative choice \(\mu _n=\text {mean}(Y)\). Very similar results hold for the less constrained case of \(K=0.8\).

1.9 Discussion

We considered the resource-allocation problem of finding the optimal mean counterfactual outcome given a general cost constraint, in the setting where unmeasured confounding is a possibility and an instrumental variable is available. This work dealt with both the problem of finding an optimal treatment regime and that of finding an optimal intent-to-treat regime. For both cases, we gave closed-form solutions for the optimal intervention and derived estimators of the optimal mean counterfactual outcome. Our model allows the individualized treatment (or intent-to-treat) rules to be a function of an arbitrary subset of baseline covariates. Estimation is done using the targeted maximum likelihood estimation (TMLE) methodology, a semiparametric approach with a number of desirable properties (efficiency, robustness to misspecification, asymptotic normality, and being a substitution estimator). Simulation results showed that TMLE can simultaneously demonstrate both finite-sample bias reduction and lower variance than straightforward machine learning approaches. The empirical variance of the TMLE estimators appears to converge to the semiparametric efficiency bound, and confidence intervals are accurate for sample sizes of a few thousand. Consistency in the case of partial misspecification was confirmed, in the sense of Lemmas 4 and 6. Our simulations also addressed the important question of to what extent improved statistical estimation can lead to better optimization results. We were able to demonstrate significant increases in the value of the mean outcome under the estimated optimal rule when a larger library of data-adaptive learners achieved a closer fit.