
The bootstrap is used to obtain statistical inference (confidence intervals, hypothesis tests) in a wide variety of settings (Efron and Tibshirani 1993; Davison and Hinkley 1997). Bootstrap-based confidence intervals have been shown in some settings to have higher-order accuracy compared to Wald-style intervals based on the normal approximation (Hall 1988, 1992; DiCiccio and Romano 1988). For this reason it has been widely adopted as a method for generating inference in a range of contexts, not all of which have theoretical support. One setting in which it fails to work as typically applied is the framework of targeted learning. We describe the reasons for this failure in detail and present a solution in the form of a targeted bootstrap, designed to be consistent for the first two moments of the sampling distribution.

Suppose we want to estimate a particular pathwise differentiable parameter using a targeted learning approach. The typical workflow is to obtain initial estimates for relevant factors of the likelihood using super learner, and then generate a targeted estimate using TMLE. By using super learner and TMLE, we can generate correct inference for our parameter of interest without assuming that the likelihood can be modeled by simple parametric models. Relying on the fact that TMLE is an asymptotically linear estimator, we can use the normal approximation to generate Wald-style confidence intervals where the standard error is based on the influence curve. These confidence intervals are first-order accurate. It is tempting to instead obtain higher-order correct confidence intervals by applying the nonparametric bootstrap. However, in the case of TMLE with initial estimates obtained via the super learner algorithm, naïve application of the nonparametric bootstrap is not justified and, as we will show, performs poorly, because super learner, and therefore TMLE, behaves differently on nonparametric bootstrap samples than it does on samples from the true data generating distribution. It is therefore important to develop a bootstrap method that works in the context of targeted learning.

We illustrate the reason for this difference in super learner’s behavior, and present a solution in the form of the targeted bootstrap, a novel model-based bootstrap that samples from a distribution targeted to the parameter of interest and to the asymptotic variance of its influence curve. In the process, we outline a TMLE that targets both a parameter of interest and its asymptotic variance. This TMLE can be used to generate another Wald-style confidence interval, by directly using the targeted estimate of the asymptotic variance. Additionally, it can be used to generate a confidence interval for the asymptotic variance itself. We demonstrate the practical performance of the targeted bootstrap confidence intervals relative to the Wald-type confidence intervals as well as confidence intervals generated by other bootstrap approaches.

1 Problem Statement

Suppose that we observe n independent and identically distributed copies of O with probability distribution P 0 known to be an element of the statistical model \(\mathcal{M}\). In addition, assume we are concerned with statistical inference of the target parameter value ψ 0 = Ψ(P 0) for a given parameter mapping \(\varPsi: \mathcal{M}\rightarrow \mathbb{R}\). Consider a given estimator \(\hat{\varPsi }: \mathcal{M}_{np} \rightarrow \mathbb{R}\) that maps an empirical distribution P n of O 1, …, O n into an estimate of ψ 0, and assume that this estimator \(\psi _{n} =\hat{\varPsi } (P_{n})\) is asymptotically linear at P 0 with influence curve \(O\mapsto D(P_{0})(O)\), so that we can write:

$$\displaystyle\begin{array}{rcl} \psi _{n} -\psi _{0}& =& (P_{n} - P_{0})D(P_{0}) + o_{P}(1/\sqrt{n}). {}\\ \end{array}$$

In that case, we have that \(\sqrt{n}(\psi _{n} -\psi _{0})\) converges in distribution to a normal distribution N(0, Σ 2(P 0)), where \(\varSigma ^{2}: \mathcal{M}\rightarrow \mathbb{R}\) is defined by Σ 2(P) = PD(P)2 as the variance of the influence curve D(P) under P.

We wish to estimate a confidence interval for ψ 0. A one-sided confidence interval is defined by a quantity ψ n, [α] such that P 0(ψ 0 < ψ n, [α]) = α. Two-sided confidence intervals are typically equal-tailed intervals, having the same error in each tail: P 0(ψ 0 < ψ n, [α∕2]) = P 0(ψ 0 > ψ n, [1−α∕2]) = α∕2. These can be constructed using a pair of one-sided intervals. A one-sided Wald confidence interval can be generated using the asymptotic normality discussed above: defining the variance estimator \(\hat{\sigma }_{n}^{2} =\varSigma ^{2}(P_{n})\), the endpoint is \(\psi _{n,[\alpha ],\text{Wald}} =\psi _{n} - n^{-1/2}\hat{\sigma }_{n}\phi ^{-1}(1-\alpha )\), where ϕ −1(1 −α) is the (1 −α)th quantile of the standard normal distribution. Similarly, a two-sided Wald confidence interval is given by \(\psi _{n} \pm \phi ^{-1}(1-\alpha /2)n^{-1/2}\hat{\sigma }_{n}\). This approach ignores the remainder term \(o_{P}(1/\sqrt{n})\) and is therefore said to be first-order correct.
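As a concrete sketch of this construction, the two-sided Wald interval can be computed from estimated influence curve values in a few lines. This is only an illustration: the function name `wald_ci` and the synthetic influence curve values are ours, not part of any targeted learning package.

```python
import numpy as np
from scipy.stats import norm

def wald_ci(psi_n, ic_values, alpha=0.05):
    """Two-sided Wald interval psi_n +/- phi^{-1}(1 - alpha/2) * sigma_n / sqrt(n),
    with sigma_n^2 the empirical variance of the estimated influence curve."""
    n = len(ic_values)
    se = np.sqrt(np.var(ic_values) / n)  # sigma_n / sqrt(n)
    z = norm.ppf(1 - alpha / 2)
    return psi_n - z * se, psi_n + z * se

# Illustration with synthetic influence curve values
rng = np.random.default_rng(0)
lo, hi = wald_ci(0.5, rng.normal(size=100))
```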

To provide a concrete motivating example, suppose we observe n i.i.d. observations of O = (W, A, Y ) ∼ P 0, for baseline covariates W, treatment A ∈ {0, 1}, and outcome Y ∈ {0, 1}, and suppose that \(\mathcal{M}\) is the nonparametric model, making no assumptions about the distribution from which O is sampled. The target parameter \(\varPsi: \mathcal{M}\rightarrow \mathbb{R}\) is a treatment-specific mean defined as Ψ(P) = E P E P (Y ∣ A = 1, W). Let \(\bar{Q}(P)(W) = E_{P}(Y \mid A = 1,W)\) and \(\bar{g}(P)(W) = E_{P}(A\mid W)\).

2 TMLE

As previously discussed elsewhere in this book, TMLE produces asymptotically linear substitution estimators of target parameters. TMLE fluctuates initial estimates of the relevant factors of the likelihood, resulting in an estimate that makes the correct bias-variance trade-off for the target parameter. TMLEs are asymptotically linear with a known influence curve, even when the components of the likelihood are estimated using data-adaptive methods like super learner.

2.1 TMLE for Treatment Specific Mean

The efficient influence curve of Ψ at P is given by:

$$\displaystyle{D^{{\ast}}(P)(O) = \frac{A} {\bar{g}(W)}(Y -\bar{ Q}(W)) +\bar{ Q}(W) -\varPsi (P)}$$

(van der Laan and Robins 2003). Note that Ψ(P) only depends on P through \(\bar{Q}(P)\) and the probability distribution Q W (P) of W. Let \(Q(P) = (Q_{W}(P),\bar{Q}(P))\) and let \(Q(\mathcal{M}) =\{ Q(P): P \in \mathcal{M}\}\) be its model space. Abusing notation, we will also use Ψ for the mapping \(\varPsi: Q(\mathcal{M}) \rightarrow \mathbb{R}\) that maps a Q in the parameter space \(Q(\mathcal{M})\) into a numeric value. Similarly, we will also denote D ∗(P) by D ∗(Q, g). The efficient influence curve D ∗(P) satisfies the expansion Ψ(P) −Ψ(P 0) = −P 0 D ∗(P) + R ψ (P, P 0), where

$$\displaystyle{R_{\psi }(P,P_{0}) = P_{0}\frac{\bar{g} -\bar{ g}_{0}} {\bar{g}} (\bar{Q} -\bar{ Q}_{0}).}$$

Let ψ n be a TMLE of ψ 0, so that it is asymptotically linear at P 0 with influence curve D ∗(P 0). This TMLE is defined by: an initial estimator \(\bar{Q}_{n}^{0}\) of \(\bar{Q}_{0}\); an estimator \(\bar{g}_{n}\) of \(\bar{g}_{0}\); the log-likelihood loss function \(L(\bar{Q})(O) = -I(A = 1)(Y \log \bar{Q}(W) + (1 - Y )\log (1 -\bar{ Q}(W)))\) for \(\bar{Q}_{0}\); the submodel \(\text{Logit}\bar{Q}_{n}^{0}(\epsilon ) = \text{Logit}\bar{Q}_{n}^{0} +\epsilon H(\bar{g}_{n})\) with \(H(\bar{g}_{n}) = A/\bar{g}_{n}(W)\); and the update \(\bar{Q}_{n}^{1} =\bar{ Q}_{n}^{0}(\epsilon _{n}^{0})\) with \(\epsilon _{n}^{0} =\arg \min _{\epsilon }P_{n}L(\bar{Q}_{n}^{0}(\epsilon ))\). Then ψ n = Ψ(Q n 1), where \(Q_{n}^{1} = (\bar{Q}_{n}^{1},Q_{W,n})\) and Q W, n is the empirical distribution of W 1, …, W n . Let \(P_{n}^{{\ast}}\) be a probability distribution compatible with \(Q_{n}^{1}\).
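The fluctuation step just described can be sketched in code. This is a minimal illustration under simplifying assumptions, not the chapter's implementation: the function and variable names are ours, the initial estimates of \(\bar{Q}_{0}\) and \(\bar{g}_{0}\) are passed in as plain arrays, and ε is fit by directly minimizing the empirical loss rather than by a logistic regression routine.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit, logit

def tmle_treatment_specific_mean(W, A, Y, Qbar0, gbar):
    """One-step TMLE of E_P E_P(Y | A=1, W): fluctuate the initial Qbar on the
    logit scale along the clever covariate H = A / gbar(W), fit epsilon by
    minimizing the empirical log-likelihood loss, then plug in."""
    H = A / gbar
    def empirical_loss(eps):
        Qeps = expit(logit(Qbar0) + eps * H)
        return -np.mean(A * (Y * np.log(Qeps) + (1 - Y) * np.log(1 - Qeps)))
    eps_n = minimize_scalar(empirical_loss).x
    # Updated Qbar evaluated at A = 1, i.e. with covariate 1 / gbar(W)
    Qbar1 = expit(logit(Qbar0) + eps_n / gbar)
    return np.mean(Qbar1)  # substitution estimator over the empirical Q_{W,n}

# Demo with simulated data and a deliberately crude initial estimate
rng = np.random.default_rng(1)
n = 500
W = rng.uniform(-1, 1, n)
gbar = np.full(n, 0.5)
Qtrue = expit(0.3 + 0.5 * W)
A = rng.binomial(1, gbar)
Y = rng.binomial(1, np.where(A == 1, Qtrue, 0.5))
psi_star = tmle_treatment_specific_mean(W, A, Y, np.full(n, 0.5), gbar)
```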

2.2 TMLE of the Variance of the Influence Curve

Suppose \(O \sim P_{0} \in \mathcal{M}\), and that we have two target parameters \(\varPsi: \mathcal{M}\rightarrow \mathbb{R}\) and \(\varSigma ^{2}: \mathcal{M}\rightarrow \mathbb{R}\). We are given an estimator ψ n that is asymptotically linear at P 0 with influence curve D(P 0). For simplicity, we will consider the case that ψ n = Ψ(P n ) is an efficient targeted maximum likelihood estimator so that D(P) = D ∗(P) and D ∗(P) is the efficient influence curve of Ψ at P. In this case, Σ 2(P) = P{D ∗(P)}2. Let D Σ ∗(P) be the efficient influence curve of Σ 2 at P. Suppose that Σ 2(P) = Σ 1 2(Q Σ (P)) for some parameter Q Σ (P) that can be defined by minimizing the risk of a loss function L Σ (Q Σ ) so that \(Q_{\varSigma }(P) =\arg \min _{Q_{\varSigma }}PL_{\varSigma }(Q)\). In addition, we assume that D Σ ∗(P) only depends on P through Q Σ (P) and some other parameter g Σ (P). For notational convenience, we will denote these two alternative representations of the asymptotic variance parameter and its efficient influence curve by Σ 2(Q Σ ) and D Σ ∗(Q Σ (P), g Σ (P)), respectively.

We now develop a TMLE of Σ 2(P 0) as follows. First, let Q Σ, n 0 be an initial estimator of Q Σ (P 0), which could be based on the super learner ensemble algorithm using the loss function L Σ (). Similarly, let g Σ, n be an estimator of g Σ, 0. Set k = 0. Consider now a submodel \(\{Q_{\varSigma,n}^{k}(\epsilon \mid g_{\varSigma,n}):\epsilon \}\subset Q_{\varSigma }(\mathcal{M})\) so that the linear span of the components of the generalized score \(\frac{d} {d\epsilon }L_{\varSigma }(Q_{\varSigma,n}^{k}(\epsilon \mid g_{\varSigma,n}))\) at ε = 0 spans D Σ ∗(Q Σ, n k, g Σ, n). Define ε n k = argmin ε P n L Σ (Q Σ, n k(ε ∣ g Σ, n)) as the MLE and define the update Q Σ, n k+1 = Q Σ, n k(ε n k ∣ g Σ, n). We iterate this updating process until convergence, at which step K we have ε n K ≈ 0. We denote this final update with Q Σ, n ∗, and we call it the TMLE of Q Σ (P 0), while Σ 2(Q Σ, n ∗) is the TMLE of the asymptotic variance Σ 2(Q Σ, 0) of the TMLE ψ n of ψ 0. Let \(\tilde{P}_{n}^{{\ast}}\) be a probability distribution compatible with Q Σ, n ∗.

Application to the Treatment Specific Mean. The asymptotic variance of \(\sqrt{n}(\psi _{ n}^{{\ast}}-\psi _{ 0})\) is given by:

$$\displaystyle\begin{array}{rcl} \varSigma ^{2}(P_{ 0})& =& E_{P_{0}}\{D^{{\ast}}(P_{ 0})\}^{2} {}\\ & =& Q_{0,W}\left (\frac{\bar{Q}_{0}(1 -\bar{ Q}_{0})} {\bar{g}_{0}} + (\bar{Q}_{0} - Q_{0,W}\bar{Q}_{0})^{2}\right ) {}\\ \end{array}$$

The following lemma presents its efficient influence curve D Σ (P).

Lemma 28.1

The efficient influence curve D Σ (P) of Σ 2 at P is given by:

$$\displaystyle\begin{array}{rcl} D_{\varSigma }^{{\ast}}(P)(W,A,Y )& =& D_{\varSigma ^{ 2},Q_{W}}(P)(W) + D_{\varSigma ^{2},\bar{Q}}(P)(O) + D_{\varSigma ^{2},\bar{g}}(P)(O), {}\\ \end{array}$$

where

$$\displaystyle\begin{array}{rcl} D_{\varSigma ^{2},Q_{W}}(P)(W)& =& \frac{\bar{Q}(1 -\bar{ Q})} {\bar{g}} (W) - Q_{W}\frac{\bar{Q}(1 -\bar{ Q})} {\bar{g}} {}\\ & & +(\bar{Q}(W) -\varPsi (Q))^{2} - Q_{ W}(\bar{Q} -\varPsi (Q))^{2} {}\\ D_{\varSigma ^{2},\bar{Q}}(P)(O)& =& \frac{I(A = 1)} {\bar{g}(W)} \left (\frac{1 - 2\bar{Q}(W)} {\bar{g}(W)} + 2(\bar{Q}(W) -\varPsi (Q))\right )(Y -\bar{ Q}(W)) {}\\ D_{\varSigma ^{2},\bar{g}}(P)(O)& =& -\frac{\bar{Q}(1 -\bar{ Q})(W)} {\bar{g}^{2}(W)} (A -\bar{ g}(W)). {}\\ \end{array}$$

This allows us to develop a TMLE \(\varSigma ^{2}(Q_{W,n},\bar{Q}_{n}^{{\ast}},\bar{g}_{n}^{{\ast}})\) of \(\varSigma ^{2}(Q_{W,0},\bar{Q}_{0},\bar{g}_{0})\). Define the clever covariates:

$$\displaystyle\begin{array}{rcl} C_{Y }(\bar{g},Q)(A,W)& \equiv & \frac{I(A = 1)} {\bar{g}(W)} \left (\frac{1 - 2\bar{Q}(W)} {\bar{g}(W)} + 2(\bar{Q}(W) -\varPsi (Q))\right ) {}\\ C_{A}(\bar{g},\bar{Q})(W)& \equiv & \frac{\bar{Q}(1 -\bar{ Q})(W)} {\bar{g}^{2}(W)}. {}\\ \end{array}$$

Let \(Q_{n}^{0} = (Q_{W,n},\bar{Q}_{n}^{0})\) for an initial estimator \(\bar{Q}_{n}^{0}\) of \(\bar{Q}_{0}\), where Q W, n is the empirical distribution which will not be changed by the TMLE algorithm. Let k = 0. Consider the submodels

$$\displaystyle\begin{array}{rcl} \text{Logit}\bar{Q}_{n}^{k}(\epsilon _{ 1})& =& \text{Logit}\bar{Q}_{n}^{k} +\epsilon _{ 1}C_{Y }(\bar{g}_{n}^{k},Q_{ n}^{k}) {}\\ \text{Logit}\bar{g}_{n}^{k}(\epsilon _{ 2})& =& \text{Logit}\bar{g}_{n}^{k} +\epsilon _{ 2}C_{A}(\bar{g}_{n}^{k},\bar{Q}_{ n}^{k}). {}\\ \end{array}$$

In addition, consider the log-likelihood loss functions

$$\displaystyle\begin{array}{rcl} L_{1}(\bar{Q})& =& -I(A = 1)\left \{Y \log \bar{Q}(W) + (1 - Y )\log (1 -\bar{ Q}(W))\right \} {}\\ L_{2}(\bar{g})& =& -\left \{A\log \bar{g}(W) + (1 - A)\log (1 -\bar{ g}(W))\right \}. {}\\ \end{array}$$

Define the MLEs \(\epsilon _{1n}^{k} =\arg \min _{\epsilon }P_{n}L_{1}(\bar{Q}_{n}^{k}(\epsilon ))\) and \(\epsilon _{2n}^{k} =\arg \min _{\epsilon }P_{n}L_{2}(\bar{g}_{n}^{k}(\epsilon ))\). This defines the update \(\bar{Q}_{n}^{k+1} =\bar{ Q}_{n}^{k}(\epsilon _{1n}^{k})\) and \(\bar{g}_{n}^{k+1} =\bar{ g}_{n}^{k}(\epsilon _{2n}^{k})\). Now set k = k + 1 and iterate this process until convergence, defined by (ε 1n , ε 2n ) being close enough to (0, 0). Let \(\bar{g}_{n}^{{\ast}},\bar{Q}_{n}^{{\ast}}\) denote the limits of this TMLE procedure, and let \(Q_{n}^{{\ast}} = (Q_{W,n},\bar{Q}_{n}^{{\ast}})\). The TMLE of Σ 2(P 0) is given by \(\varSigma ^{2}(\tilde{P}_{n}^{{\ast}})\), where \(\tilde{P}_{n}^{{\ast}}\) is defined by \((Q_{W,n},\bar{Q}_{n}^{{\ast}},\bar{g}_{n}^{{\ast}})\). We note that at (ε 1n , ε 2n ) = (0, 0), we have

$$\displaystyle{P_{n}D_{\varSigma ^{2}}(\tilde{P}_{n}^{{\ast}}) = 0,}$$

and if the algorithm stops earlier at step K, and one defines \(\tilde{P}_{n}^{{\ast}} = P_{n}^{K}\), one just needs to make sure that

$$\displaystyle{P_{n}D_{\varSigma ^{2}}(\tilde{P}_{n}^{{\ast}}) = o_{ P}(1/\sqrt{n}).}$$

2.3 Joint TMLE of Both the Target Parameter and Its Asymptotic Variance

We could also define a TMLE targeting both parameters Ψ and Σ 2. This is defined exactly as above, but now using a submodel \(\{P_{n}^{k}(\epsilon ):\epsilon \}\subset \mathcal{M}\) that has a score \(\frac{d} {d\epsilon }L(P_{n}^{k}(\epsilon ))\) at ε = 0 whose components span both efficient influence curves (D ψ ∗(P), D Σ ∗(P)). In this manner, one obtains a TMLE \(\tilde{P}_{n}^{{\ast}}\) that solves \(P_{n}D_{\psi }^{{\ast}}(\tilde{P}_{n}^{{\ast}}) = P_{n}D_{\varSigma }^{{\ast}}(\tilde{P}_{n}^{{\ast}}) = 0\) and, under regularity conditions, yields an asymptotically efficient estimator of both ψ 0 and σ 0 2 = Σ 2(P 0): the bivariate TMLE \((\varPsi (\tilde{P}_{n}^{{\ast}}),\varSigma ^{2}(\tilde{P}_{n}^{{\ast}}))\). In this special case, \(P_{n}^{{\ast}} =\tilde{ P}_{n}^{{\ast}}\), so that \(\psi _{n}^{{\ast}} =\varPsi (\tilde{P}_{n}^{{\ast}})\) and \(\sigma _{n}^{2{\ast}} =\varSigma ^{2}(\tilde{P}_{n}^{{\ast}})\). This TMLE can be defined as the above iterative TMLE of Σ 2(P 0) but now using the augmented submodel:

$$\displaystyle\begin{array}{rcl} \mbox{ Logit}\bar{Q}_{n}^{k}(\epsilon _{ 1})& =& \mbox{ Logit}\bar{Q}_{n}^{k} +\epsilon _{ 0}H(\bar{g}_{n}^{k}) +\epsilon _{ 1}C_{Y }(\bar{g}_{n}^{k},Q_{ n}^{k}), {}\\ \end{array}$$

where \(H(\bar{g})(A,W) = I(A = 1)/\bar{g}(W)\). Conditions under which \((\varPsi (\tilde{P}_{n}^{{\ast}}),\varSigma ^{2}(\tilde{P}_{n}^{{\ast}}))\) is an asymptotically efficient estimator of (Ψ(P 0), Σ 2(P 0)) are essentially the same as those needed for asymptotic efficiency of the TMLE of ψ 0 alone.

3 Super Learner

TMLE requires initial estimates of factors of the likelihood. For the treatment-specific mean example we need estimates of \(\bar{Q}(A,W)\) and \(\bar{g}(W)\). In the targeted learning framework, these factors are typically estimated with super learner, discussed earlier in Chap. 3 and elsewhere. For the purposes of this chapter, we applied discrete super learner (also referred to as the cross-validation selector), which selects the individual algorithm that minimizes the cross-validated risk.

Loss-based estimation allows us to objectively evaluate the quality of estimates and select amongst competing estimators based on this evaluation. Super learner is a particular implementation of this framework, and understanding this framework is important to understanding the behavior of super learner on nonparametric bootstrap samples. In the context of super learner, we will refer to estimators of the factors of the likelihood as “learners”. This exposition will focus on the example of learning an estimate of \(\bar{Q}_{0} =\bar{Q}(P_{0}) = E_{P_{0}}[Y \vert A,W]\), but it applies equally well to other factors of the likelihood. Here, \(\bar{Q}(P)\) indicates an estimate of \(\bar{Q}_{0}\) based on P. Consider the problem of selecting an estimate \(\bar{Q}\) from a class of candidate functions \(\mathbf{\bar{Q}}\). In the context of discrete super learner, this becomes selecting from a set of candidate learners: \(\bar{Q}_{k}: k = 1,\ldots,K\).

The key ingredient in this framework is a loss function, \(L(\bar{Q},O)\), that describes the severity of the difference between a value predicted by a learner and the true observed value. For example, the squared error loss: \(L(\bar{Q},O) = (\bar{Q} - Y )^{2}\). This leads to the risk, which is the expected value of the loss with respect to a distribution P: \(\theta (\bar{Q},P) = PL(\bar{Q},O) = E_{P}[L(\bar{Q},O)]\). Evaluated at the truth, P 0, we get the true risk \(\theta _{0}(\bar{Q}) =\theta (\bar{Q},P_{0})\), which provides a criterion by which to select a learner: \(\bar{Q}_{0} =\arg \min _{\bar{Q}\in \mathbf{\bar{Q}}}\theta _{0}(\bar{Q}) =\bar{ Q}(P_{0})\); that is, the learner we want is the one that minimizes the true risk. Crucially, this minimizer should equal the parameter we are trying to estimate, here \(\bar{Q}_{0}(A,W)\). The value of the risk at the minimizer is called the optimal risk, or the irreducible error: \(\theta _{0}(\bar{Q}_{0}) =\min _{\bar{Q}\in \mathbf{\bar{Q}}}\theta _{0}(\bar{Q})\). Then, in the context of discrete super learner, we can define the oracle selector as \(\tilde{k}_{n} =\arg \min _{k\in \{1,\ldots,K_{n}\}}\theta _{0}(\bar{Q}_{k})\), which selects the learner that minimizes the true risk amongst the set of candidates. This is the learner we would select if we knew P 0.

The empirical or resubstitution risk estimate, \(\hat{\theta }_{P_{n}}(\bar{Q}(P_{n})) =\theta (\bar{Q}(P_{n}),P_{n})\), estimates the risk on the same dataset used to train the learner. This is known to be optimistic (biased downwards) in most circumstances, with the optimism increasing as a function of model complexity. Therefore, the resubstitution selector \(\arg \min _{k\in \{1,\ldots,K_{n}\}}\hat{\theta }_{P_{n}}(\bar{Q}_{k}(P_{n}))\) tends to “overfit” the data, selecting learners that are unnecessarily complex and therefore have a higher risk than learners that make the correct bias-variance trade-off. Hastie et al. (2001) discusses this phenomenon in more detail.

Cross-validation allows more accurate risk estimates that are not biased towards more complex models. Our formulation relies on a split vector B n ∈ {0, 1}n, which divides the data into two sets, a training set (O i : B n (i) = 0), with the empirical distribution \(P_{n,B_{n}}^{0}\), and a validation set (O i : B n (i) = 1), with the empirical distribution \(P_{n,B_{n}}^{1}\). Averaging over the distribution of B n yields a cross-validated risk estimate: \(\hat{\theta }_{\mbox{ CV}}(\bar{Q}) = E_{B_{n}}\theta (\bar{Q}(P_{n,B_{n}}^{0}),P_{n,B_{n}}^{1})\). This yields a cross-validated selector \(\hat{k}_{n} =\arg \min _{k\in \{1,\ldots,K_{n}\}}\hat{\theta }_{\mbox{ CV}}(\bar{Q}_{k})\), which selects the learner that minimizes the cross-validated risk estimate. Because cross-validation uses separate data for training and risk estimation for each split vector B n , it has an important oracle property.
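The cross-validation selector can be sketched as follows. Here the candidate “learners” are polynomial fits of varying degree, a stand-in for a real learner library; the function name and demo data are ours.

```python
import numpy as np

def discrete_super_learner(x, y, degrees, n_folds=10, seed=0):
    """Cross-validation selector: estimate the risk of each candidate learner
    (polynomial fits of varying degree) by V-fold cross-validation under
    squared-error loss, and select the risk minimizer."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(x))  # random split vector B_n
    cv_risks = []
    for d in degrees:
        fold_risks = []
        for v in range(n_folds):
            train, val = fold != v, fold == v
            coef = np.polyfit(x[train], y[train], d)   # train on the training split
            resid = y[val] - np.polyval(coef, x[val])  # evaluate on the validation split
            fold_risks.append(np.mean(resid ** 2))     # squared-error loss
        cv_risks.append(np.mean(fold_risks))
    return degrees[int(np.argmin(cv_risks))], cv_risks

# Demo: quadratic truth; degree 1 underfits, degree 10 tends to overfit
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 300)
y = 1 + 2 * x - x ** 2 + rng.normal(scale=0.3, size=300)
best_degree, cv_risks = discrete_super_learner(x, y, [1, 2, 10])
```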

Under appropriate conditions the cross-validation selector will do asymptotically as well as the oracle selector in terms of a risk difference ratio:

$$\displaystyle{ \frac{\theta _{0}(\bar{Q}_{\hat{k}_{n}}) -\theta _{0}(\bar{Q}_{0})} {\theta _{0}(\bar{Q}_{\tilde{k}_{n}}) -\theta _{0}(\bar{Q}_{0})}\mathop{ \rightarrow }\limits^{ P}1. }$$
(28.1)

That is, the ratio of the risk difference of the cross-validation selector relative to the optimal risk and the risk difference of the oracle selector relative to the optimal risk approaches 1 in probability. Conditions and proofs for this result are given in Dudoit and van der Laan (2005); van der Laan and Dudoit (2003); van der Vaart et al. (2006). It is through this property that discrete super learner does asymptotically as well as the best of its candidate learners. We will soon describe how this property fails for nonparametric bootstrap samples.

4 Bootstrap

Before presenting our generalization of the bootstrap, we briefly review the theoretical framework of the bootstrap. The key idea of the bootstrap is as follows: we wish to estimate the sampling distribution \(G(x) = P(\hat{\varPsi }(P_{n}) \leq x)\) of an estimator \(\hat{\varPsi }: \mathcal{M}_{NP} \rightarrow \mathbf{\varPsi }\), where \(\hat{\varPsi }(P_{n})\) is viewed as a random variable through the random P n . If the estimator is asymptotically linear, then this sampling distribution can be approximated with a normal distribution, so that it suffices to estimate its first and second moment. It is difficult to estimate this distribution directly because we only observe one sample from P 0, and therefore only one realization of ψ n . However, we can draw B repeated samples of size n from some estimate of P 0 and apply our estimator to those samples. Denoting such a sample O 1 #, …, O n #, the empirical distribution corresponding to that sample P n #, and the corresponding estimate \(\psi _{n}^{\#} =\hat{\varPsi } (P_{n}^{\#})\), we can obtain an estimate of the desired sampling distribution:

$$\displaystyle\begin{array}{rcl} \hat{G}(x)& =& \frac{1} {B}\sum _{i=1}^{B}I(\psi _{ n,i}^{\#} \leq x) {}\\ \end{array}$$
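A minimal sketch of this procedure, resampling here from the empirical distribution for concreteness (any estimate of P 0 could supply the draws); the function and variable names are ours.

```python
import numpy as np

def bootstrap_distribution(data, estimator, B=1000, seed=0):
    """Approximate the sampling distribution of an estimator by applying it to
    B resamples of size n drawn from the empirical distribution P_n, and
    return the replicates together with the estimate G-hat(x)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    psi_sharp = np.array([estimator(rng.choice(data, size=n, replace=True))
                          for _ in range(B)])
    return psi_sharp, lambda x: np.mean(psi_sharp <= x)  # G-hat(x)

# Demo: sampling distribution of the sample mean
rng = np.random.default_rng(7)
sample = rng.normal(loc=2.0, size=200)
psi_sharp, G_hat = bootstrap_distribution(sample, np.mean)
```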

4.1 Nonparametric Bootstrap

The nonparametric bootstrap applies this approach by sampling from the empirical distribution, P n . This approach has been demonstrated to be an effective tool for estimating the sampling distribution in a wide range of settings. However, the nonparametric bootstrap is not universally appropriate for sampling distribution estimation. Because P n is a discrete distribution, repeated sampling from it will create “copied” observations: bootstrap samples will typically contain multiple identical observations. Bickel et al. (1997a) note that the bootstrap can fail if the estimator is sensitive to ties in the dataset. One example of a class of estimators that may be sensitive to ties are those that use cross-validation to make a bias-variance trade-off. If cross-validation is applied to a nonparametric bootstrap sample, duplicate observations have the potential to appear in both the training and validation portions of a given sample split. This creates an issue for estimators that rely on cross-validation. Hall (1992) specifically notes the issue of ties for cross-validation-based model selection.

The severity of this problem is determined by how many copied observations we can expect. For a bootstrap sample of size n, and validation proportion \(p_{B_{n}}\), the probability of a validation observation having a copy in the training sample is given by

$$\displaystyle\begin{array}{rcl} p(\mbox{ copy})& =& 1 -\left (1 - \frac{1} {n}\right )^{(1-p_{B_{n}})n} \approx 1 - e^{-(1-p_{B_{n}})} {}\\ \end{array}$$

For ten-fold cross-validation \(p_{B_{n}} = 0.1\), so we can expect ≈ 59% of validation observations to also be in the training sample. With, on average, 59% of validation observations duplicated in the training sample, a cross-validated risk estimate on a bootstrap sample therefore behaves in large part like a resubstitution risk estimate, and thus has suboptimal properties. One ad hoc solution to the problem of duplicate observations for cross-validation is to do “clustered” cross-validation, where a cluster is defined as a set of identical bootstrap observations, and the clusters are then split between training and validation. This way, no observation will appear in both the training and validation sets. Although we lack rigorous justification for this approach, it was evaluated in the simulation study below. It appears in the results as “NP Bootstrap + CVID”.
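As a quick check on these figures, the copy probability and its exponential approximation can be computed directly (the function name `p_copy` is ours):

```python
import numpy as np

def p_copy(n, p_val):
    """Probability that a given validation observation in a bootstrap sample of
    size n also appears in the training portion, for validation proportion p_val."""
    return 1 - (1 - 1 / n) ** ((1 - p_val) * n)

# Ten-fold cross-validation: p_val = 0.1, with limit 1 - exp(-0.9), about 59%
exact = p_copy(1000, 0.1)
approx = 1 - np.exp(-0.9)
```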

4.2 Model-Based Bootstrap

In contrast, the parametric bootstrap draws samples from an estimate of P 0 based on an assumed parametric model: P n, β. The parametric bootstrap can be generalized to a “model-based” bootstrap that uses semi- or nonparametric estimates of factors of the likelihood. In the context of our treatment-specific mean example, this means using estimates of P(Y | A, W) and \(\bar{g}\). If Y is binary, as is the case in our simulation, \(P(Y = 1\vert A,W) = E(Y \vert A,W) =\bar{ Q}(A,W)\). Otherwise, we need an estimate of the conditional distribution of ε(A, W), given A, W, where Y = E(Y | A, W) + ε(A, W). To be explicit, observations are drawn from an estimate \(\tilde{P}_{n} = Q_{W,n}g_{n}Q_{Y,n}\) according to the algorithm below.
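For the binary-outcome case, one such draw can be sketched as follows, with illustrative plug-in functions standing in for fitted estimates of \(\bar{g}\) and \(\bar{Q}\); all names here are ours.

```python
import numpy as np

def model_based_bootstrap_sample(W, gbar, Qbar, seed=0):
    """Draw one bootstrap sample of size n from P~_n = Q_{W,n} g_n Q_{Y,n}:
    W# from the empirical distribution of W, then A# | W# ~ Bernoulli(gbar(W#)),
    then Y# | A#, W# ~ Bernoulli(Qbar(A#, W#)) for binary Y."""
    rng = np.random.default_rng(seed)
    n = len(W)
    W_b = rng.choice(W, size=n, replace=True)  # sample from Q_{W,n}
    A_b = rng.binomial(1, gbar(W_b))           # sample from g_n
    Y_b = rng.binomial(1, Qbar(A_b, W_b))      # sample from Q_{Y,n}
    return W_b, A_b, Y_b

# Demo with simple illustrative plug-in estimates of gbar and Qbar
W = np.random.default_rng(3).uniform(-1, 1, 500)
gbar = lambda w: 1 / (1 + np.exp(0.5 * w))
Qbar = lambda a, w: 1 / (1 + np.exp(-a * (1 + w)))
W_b, A_b, Y_b = model_based_bootstrap_sample(W, gbar, Qbar)
```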

The targeted bootstrap, described in the next section, is a particular model-based bootstrap that uses estimates \(\bar{Q}_{n}^{{\ast}}\) and \(\bar{g}_{n}^{{\ast}}\) targeted to ensure correct asymptotic performance.

4.3 Targeted Bootstrap

The idea of the targeted bootstrap is to construct a TMLE \(\tilde{P}_{n}^{{\ast}}\) so that \(\varSigma ^{2}(\tilde{P}_{n}^{{\ast}})\) is a TMLE of σ 0 2 = Σ 2(P 0). We then know that, under regularity conditions, \(\varSigma ^{2}(\tilde{P}_{n}^{{\ast}})\) is an asymptotically linear and efficient estimator of σ 0 2 at P 0, so that we can construct a confidence interval for σ 0 2. In addition, since it is a substitution estimator of σ 0 2, it is often more reliable in finite samples, resulting in potential finite sample improvements in the coverage of the confidence interval. In addition, we will show that, under appropriate regularity conditions, due to the consistency of \(\varSigma ^{2}(\tilde{P}_{n}^{{\ast}})\), the bootstrap distribution of \(\sqrt{n}(\hat{\varPsi }(P_{n}^{\#}) -\varPsi (\tilde{P}_{n}^{{\ast}}))\), based on sampling \(O_{1}^{\#},\ldots,O_{n}^{\#} \sim _{iid}\tilde{P}_{n}^{{\ast}}\), converges, given almost every (P n : n), to the desired limit distribution N(0, σ 0 2) of \(\sqrt{n}(\hat{\varPsi }(P_{n}) -\psi _{0})\), even when \(\tilde{P}_{n}^{{\ast}}\) is otherwise misspecified. Thus, the \(\tilde{P}_{n}^{{\ast}}\)-bootstrap is consistent almost everywhere for the purpose of estimating the limit distribution of \(\sqrt{n}(\psi _{n} -\psi _{0})\), and it has the additional advantage of providing an estimate of the finite sample sampling distribution of the estimator under this bootstrap distribution. Normally, the consistency of a model-based bootstrap that samples from an estimate \(\tilde{P}_{n}^{{\ast}}\) of P 0 relies on consistency of the density of \(\tilde{P}_{n}^{{\ast}}\) as an estimator of the density of P 0. Here, however, the consistency of the bootstrap only relies on the conditions under which the TMLE \(\varSigma ^{2}(\tilde{P}_{n}^{{\ast}})\) is a consistent estimator of Σ 2(P 0). This in turn allows parts of P 0 to be inconsistently estimated within \(\tilde{P}_{n}^{{\ast}}\). We refer to this bootstrap as the targeted bootstrap.

The TMLE of Σ 2(P 0) is typically represented as Σ 2(Q Σ, n ∗) for a smaller parameter \(P\mapsto Q_{\varSigma }(P)\), utilizing a possible nuisance parameter estimator g Σ, n of g Σ (P 0). As a consequence, \(\tilde{P}_{n}^{{\ast}}\) can be defined as any distribution for which \(Q_{\varSigma }(\tilde{P}_{n}^{{\ast}}) = Q_{\varSigma,n}^{{\ast}}\), without affecting the consistency of the targeted bootstrap. This demonstrates that the targeted bootstrap is robust to certain types of model misspecification. For the best finite sample performance in estimating the actual finite sample sampling distribution of \(\hat{\varPsi }(P_{n})\), it may still help for the remaining parts of P 0, beyond Q Σ, 0, to be well approximated; however, that contribution is asymptotically negligible.

4.4 Bootstrap Confidence Intervals

Once a bootstrap sampling distribution is obtained, a number of methods can be used to generate confidence interval endpoints from it. Hall (1988) presents a framework by which to evaluate bootstrap confidence intervals, and we follow that approach here. We are interested in studying the accuracy of various confidence intervals. For a given confidence interval endpoint, ψ n, [α], we say that it is jth-order accurate if we can write P 0(ψ 0 < ψ n, [α]) = α + O(n −j∕2). Coverage probability of a one-sided interval is closely related to its accuracy. We also discuss the coverage error of a two-sided confidence interval.

The most general theoretical results on the accuracy of nonparametric bootstrap confidence intervals come from the smooth functions setting of Hall (1988). This setting covers parameters that can be written as f(P 0 Y ), where Y is a vector generated from a set of transformations of O (i.e., Y j = h j (O)) and f is a smooth function. This setting accommodates many common parameters, such as means and other moments, but excludes other common parameters like quantiles. Notably, it does not include the treatment-specific mean or other kinds of targeted learning parameters. Below we present some bootstrap confidence interval methods and state the relevant theoretical results in this setting.

Bootstrap Wald Interval. A Wald interval using the bootstrap estimate of variance:

$$\displaystyle\begin{array}{rcl} \hat{\sigma }_{n,\text{boot}}^{2}& =& \frac{1} {B}\sum _{i=1}^{B}\left (\varPsi (P_{ n,\#,i}) -\bar{\varPsi }^{\#}\right )^{2} {}\\ \end{array}$$

with \(\bar{\varPsi }^{\#} = \frac{1} {B}\sum _{i=1}^{B}\varPsi (P_{ n,\#,i})\). As before:

$$\displaystyle\begin{array}{rcl} \psi _{n,[\alpha ],\text{Wald}}& =& \psi _{n} - n^{-1/2}\hat{\sigma }_{ n,\text{boot}}\phi ^{-1}(1-\alpha ) {}\\ \end{array}$$

The Wald interval method is first order accurate in the smooth functions setting (Hall 1988).

Percentile Interval. Efron’s percentile interval directly uses the α quantile of \(\hat{G}(x)\):

$$\displaystyle\begin{array}{rcl} \psi _{n,[\alpha ],\text{Percentile}}& =& \hat{G}^{-1}(\alpha ) {}\\ \end{array}$$

The percentile interval is also first order accurate in the smooth functions setting (Hall 1988).

Bootstrap-t Interval. The bootstrap-t interval can be thought of as an improvement to the Wald-style interval. It relies on the following “studentized” distribution function.

$$\displaystyle\begin{array}{rcl} K(x)& =& P\left (\frac{n^{1/2}(\hat{\psi }_{n} -\psi _{0})} {\hat{\sigma }_{n}} <x\right ) {}\\ \end{array}$$

The bootstrap estimate of this distribution is as follows:

$$\displaystyle\begin{array}{rcl} \hat{K}(x)& =& \frac{1} {B}\sum _{i=1}^{B}I\left (\frac{n^{1/2}(\hat{\psi }_{ n}^{\#} -\hat{\psi }_{ n})} {\hat{\sigma }_{n}^{\#}} <x\right ) {}\\ \end{array}$$

Defining \(\hat{y}_{\alpha } =\hat{ K}^{-1}(\alpha )\) as the estimate of the α quantile of this distribution, we modify the Wald interval as follows:

$$\displaystyle\begin{array}{rcl} \psi _{n,[\alpha ],\text{bootstrap-}t}& =& \psi _{n} + n^{-1/2}\hat{\sigma }_{ n}\hat{y}_{\alpha } {}\\ \end{array}$$

A commonly cited drawback of this method is that it requires a reliable estimate of σ (Hall 1988). However, in our setting we have access to estimates of σ both from influence curves and from targeted estimates of variance. In our simulation study (below), we used the influence curve variance estimate, except in the case of the targeted and joint targeted bootstraps, where we used the targeted estimate. The bootstrap-t interval is second-order accurate in the smooth functions setting (Hall 1988).
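A sketch of the construction, following the ψ n + n −1∕2 σ̂ n ŷ α form of the endpoints and studentizing each bootstrap replicate by its own standard-error estimate; the function and argument names are ours.

```python
import numpy as np

def bootstrap_t_interval(psi_n, sigma_n, psi_sharp, sigma_sharp, n, alpha=0.05):
    """Equal-tailed bootstrap-t interval. Each bootstrap replicate psi_sharp[i]
    is studentized by its own standard-error estimate sigma_sharp[i]; the
    empirical quantiles y_alpha of the studentized replicates then enter the
    endpoints psi_n + n^{-1/2} * sigma_n * y_alpha."""
    t_sharp = np.sqrt(n) * (psi_sharp - psi_n) / sigma_sharp
    y_lo, y_hi = np.quantile(t_sharp, [alpha / 2, 1 - alpha / 2])
    return (psi_n + sigma_n * y_lo / np.sqrt(n),
            psi_n + sigma_n * y_hi / np.sqrt(n))

# Demo: roughly normal replicates recover roughly psi_n -/+ 1.96 sigma_n / sqrt(n)
rng = np.random.default_rng(5)
n, B = 100, 2000
psi_sharp = rng.normal(0.5, 0.1, size=B)
lo, hi = bootstrap_t_interval(0.5, 1.0, psi_sharp, np.ones(B), n)
```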

BC\(_{a}\) Interval. The BC\(_{a}\) (bias-corrected and accelerated) interval, first presented in Efron (1987), accounts for bias and skew in a sampling distribution when forming a confidence interval. Its development was motivated by the practice of employing monotone transformations to normalize the sampling distribution of an estimator. It depends on two additional parameters: the bias constant \(z_{0}\) captures the bias in the sampling distribution, while the acceleration constant \(a\) captures its skewness.

Given both of these quantities, the BC\(_{a}\) interval defines a new quantile to look up:

$$\displaystyle\begin{array}{rcl} \beta _{z_{0},a,\alpha }& =& \varPhi \left (z_{0} + \frac{z_{0} + z_{\alpha }} {1 - a(z_{0} + z_{\alpha })}\right ) {}\\ \psi _{n,[\alpha ],BC_{a}}& =& \hat{G}^{-1}(\beta _{ z_{0},a,\alpha }), {}\\ \end{array}$$

where \(\varPhi (x)\) is the standard normal distribution function and \(z_{\alpha } =\varPhi ^{-1}(\alpha )\) is its α quantile. To generate this interval in practice, we require estimates of \(z_{0}\) and \(a\). We estimate \(z_{0}\) as the normal quantile of the proportion of the bootstrap estimates that fall below the original sample estimate:

$$\displaystyle\begin{array}{rcl} \hat{z_{0}}& =& \varPhi ^{-1}\left [\hat{G}(\hat{\psi }_{ n})\right ]. {}\\ \end{array}$$

We use our knowledge of the influence function to estimate the acceleration constant a from the original sample:

$$\displaystyle\begin{array}{rcl} \hat{a}& =& \frac{\sum _{i=1}^{n}D(O_{i})^{3}} {6\left (\sum _{i=1}^{n}D(O_{i})^{2}\right )^{3/2}}. {}\\ \end{array}$$

The BC\(_{a}\) interval is also second-order accurate in the smooth functions setting (Hall 1988).
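The BC\(_{a}\) computations above can be sketched as follows; `psi_boot` and the influence-curve values `D` are simulated stand-ins rather than output of an actual TMLE fit:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
norm = NormalDist()

# Hypothetical inputs: the original estimate, bootstrap replicates, and the
# influence-curve values D(O_i) evaluated on the original sample.
psi_n = 0.5
psi_boot = psi_n + rng.normal(0.002, 0.03, size=1000)
D = rng.standard_normal(1000)

# Bias constant: normal quantile of the proportion of replicates below psi_n.
z0 = norm.inv_cdf(np.mean(psi_boot < psi_n))

# Acceleration constant from the influence-curve values.
a = (D ** 3).sum() / (6.0 * ((D ** 2).sum()) ** 1.5)

def bca_endpoint(alpha):
    # Adjusted quantile beta, then look it up in the bootstrap distribution.
    z_alpha = norm.inv_cdf(alpha)
    beta = norm.cdf(z0 + (z0 + z_alpha) / (1.0 - a * (z0 + z_alpha)))
    return np.quantile(psi_boot, beta)

lower, upper = bca_endpoint(0.025), bca_endpoint(0.975)
```

When \(z_{0} = a = 0\), the adjusted quantile reduces to α and the interval collapses to the plain percentile interval.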

5 Simulation

To evaluate the practical performance of the targeted bootstrap, we simulate data from the following \(P_{0}\):

$$\displaystyle\begin{array}{rcl} W_{1}& \sim & U(-1,1), {}\\ W_{2}& \sim & U(-1,1), {}\\ W^{{\ast}}& =& W_{2} - W_{1}, {}\\ A\vert W& \sim & \text{Bernoulli}\left (\text{inv.logit}(-0.5W^{{\ast}})\right ), {}\\ Y \vert A,W& \sim & \text{Bernoulli}\left (\text{inv.logit}(A(1 - 0.5W^{{\ast}} +\sin (W^{{\ast}})))\right ). {}\\ \end{array}$$

Positivity, \(\bar{g}(P_{0})(W) = P_{0}(A = 1\vert W)> 0\), is met: \(0.26 <\bar{g}(P_{0})(W) <0.74\). Samples of size n = 1000 were generated for each of B = 1000 Monte Carlo simulations.
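A direct sketch of drawing from this \(P_{0}\), which also confirms the stated positivity bounds numerically (function and variable names are ours, chosen to mirror the notation above):

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_p0(n, rng):
    # Draw (W1, W2, A, Y) from the data-generating distribution P_0 above.
    W1 = rng.uniform(-1.0, 1.0, size=n)
    W2 = rng.uniform(-1.0, 1.0, size=n)
    Wstar = W2 - W1
    g = inv_logit(-0.5 * Wstar)  # P(A = 1 | W)
    A = rng.binomial(1, g)
    Qbar = inv_logit(A * (1.0 - 0.5 * Wstar + np.sin(Wstar)))
    Y = rng.binomial(1, Qbar)
    return W1, W2, A, Y, g

rng = np.random.default_rng(0)
W1, W2, A, Y, g = simulate_p0(1000, rng)

# W* = W2 - W1 lies in (-2, 2), so g is bounded between inv_logit(-1) ~ 0.27
# and inv_logit(1) ~ 0.73, consistent with the stated positivity bounds.
```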

In our simulation, we estimated \(\bar{Q}(P_{0})(W) = E[Y \vert A = 1,W]\) using kernel regression with bandwidth selected by 10-fold cross-validation (i.e., the discrete super learner). We estimated \(\bar{g}(P_{0})(W)\) using a correctly specified logistic regression. For each simulation iteration, we estimated Q, and fit three TMLEs: a TMLE for the treatment-specific mean, a TMLE for its asymptotic variance, and a joint TMLE for both the treatment-specific mean and its asymptotic variance. After fitting the TMLEs, we generated 1000 repeated bootstrap samples from five different methods: nonparametric bootstrap, clustered nonparametric bootstrap, model-based bootstrap based on the initial super learner fit, the targeted bootstrap sampling from the TMLE distribution targeting the asymptotic variance, and the joint targeted bootstrap sampling from the joint targeted TMLE distribution.

The three TMLEs fit to the simulated dataset generated different confidence interval estimates: one Wald-style interval based on the influence curve from the first TMLE, and direct estimates of the variance for the remaining two TMLEs. For the five bootstrap methods, we estimated intervals of all four types described in Sect. 28.4.4. We evaluated coverage and interval lengths for all estimated confidence intervals. We also compared the performance of the super learner on full samples and on samples from all the bootstrap approaches. To evaluate the performance of super learner on bootstrap samples, we compared which bandwidths were selected on the various sample types, as well as the risk difference ratios for those selections.

Results. As described above, super learner behaves differently on nonparametric bootstrap samples than on full samples, behaving more like a resubstitution estimator. On full samples, super learner (cross-validation) often selects bandwidths close to those chosen by the oracle selector (which minimizes the true risk); on nonparametric bootstrap samples, super learner most often selects the smallest available bandwidth, over-fitting the data. On other kinds of bootstrap samples, including targeted bootstrap samples, super learner behaves more like it does on full samples, suggesting that these bootstrap methods do not suffer from the same problem. This difference in selection behavior impacts the performance of the resulting super learner in terms of the risk difference ratio (Eq. (28.1)). This can be seen in Fig. 28.1. Again, other bootstrap samples behave more like full samples in terms of the risk difference ratio.
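The mechanism at work here can be illustrated directly: a nonparametric bootstrap sample contains ties, so when it is split into cross-validation folds, the validation fold shares observations with the training folds and the cross-validated risk partly behaves like a resubstitution risk. A small sketch (the fold construction is a simplified stand-in for an actual cross-validation scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# A nonparametric bootstrap sample is a draw of n indices with replacement.
boot_idx = rng.choice(n, size=n, replace=True)

# Only about 1 - 1/e ~ 63% of the original observations appear at all.
frac_unique = np.unique(boot_idx).size / n

# Split the bootstrap sample into a validation fold and training folds
# (10-fold style). Because of ties, many validation observations also sit
# in the training folds, so the "validation" risk is partly resubstitution.
perm = rng.permutation(n)
val_idx = boot_idx[perm[:n // 10]]
train_idx = boot_idx[perm[n // 10:]]
leakage = np.isin(val_idx, train_idx).mean()
```

With these sizes, well over half of the validation-fold observations also appear in the training folds, which is why cross-validation rewards over-fit (small-bandwidth) candidates on bootstrap samples.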

Fig. 28.1
figure 1

Risk difference ratio (defined in Eq. (28.1)) of super learner on different sample types

Figures 28.2, 28.3, and 28.4 show how super learner performance impacts confidence interval performance for the resulting TMLE estimate. In general, the over-fit super learner used in TMLE on nonparametric bootstrap samples is more variable than the well-fit super learner used on full samples. Therefore, nonparametric bootstrap confidence intervals are unnecessarily long and over-cover. The effect on length is modest at n = 1000, with Wald intervals estimated from the nonparametric bootstrap on average just 4% longer than the standard influence-curve-based confidence intervals. At smaller sample sizes, the effect is more severe: nonparametric bootstrap intervals are 21% longer than influence curve intervals at n = 250, and 39% longer at n = 100. This substantial increase in length will negatively impact the power of nonparametric bootstrap confidence intervals. In our simulation, the set of bandwidths from which super learner could select was fixed with respect to sample size. We expect that, had smaller bandwidths been available, super learner on nonparametric bootstrap samples would have chosen them, increasing the impact of over-fitting at larger sample sizes.

Fig. 28.2
figure 2

Confidence interval coverage and length for n = 1000

Fig. 28.3
figure 3

Confidence interval coverage and length for n = 250

Fig. 28.4
figure 4

Confidence interval coverage and length for n = 100

These figures also show the importance of a bootstrap that jointly targets both the parameter of interest and its asymptotic variance. For interval types other than Wald, the (variance-only) targeted bootstrap intervals have very poor coverage. This is because these intervals are not centered on the treatment-specific mean estimate from the full dataset, but on the average of the bootstrap estimates. In the case of targeted bootstrap samples, these estimates are biased, because the targeted bootstrap targets only the variance, not the actual parameter of interest.

Figure 28.4 shows that at small sample sizes, the asymptotic Wald intervals have lower than nominal coverage, with all methods under-covering by at least 2.5%. Small sample sizes such as this are where the bootstrap has the most potential to improve upon asymptotic confidence intervals. At larger sample sizes, the second-order terms become relatively unimportant. However, even at this small sample size, asymptotic intervals are only modestly anti-conservative in this simulation. This may be due to the fact that even at n = 100, our simulated sampling distribution is already very close to normal.

Focusing only on the joint targeted bootstrap, we can compare the performance of the different bootstrap confidence interval types. Figure 28.5 shows this comparison. At modest sample sizes, bootstrap-t intervals over-cover and are longer than the other interval types. The other bootstrap methods generate intervals of similar length. Of these three, BC\(_{a}\) has the closest to nominal coverage over the range of sample sizes tested. Therefore, it is recommended that this interval type be used with the joint targeted bootstrap going forward.

Fig. 28.5
figure 5

Joint targeted bootstrap interval performance comparison

6 Conclusion

We have shown, both theoretically and through simulation, the effectiveness of the targeted bootstrap for estimating properties of the sampling distribution. Our simulation illustrates the problems of applying the nonparametric bootstrap to a TMLE estimate with initial estimates based on super learner. Specifically, ties in nonparametric bootstrap samples sabotage the sample splitting that occurs in cross-validation, causing cross-validated risk estimates to behave more like resubstitution estimates. This leads super learner to select over-fit models on nonparametric bootstrap samples. By sampling from a continuous distribution estimate, one that is targeted to the parameters of interest, the targeted bootstrap does not create the ties that break cross-validation, and therefore generates confidence intervals with acceptable performance. We have demonstrated the superiority of the targeted bootstrap over the nonparametric bootstrap in the context of targeted learning.

Additional work is necessary to further explore the issue of bootstrap confidence intervals for targeted learning. A simulation study with a continuous outcome variable, especially one with a skewed error distribution, would be interesting in several ways. First, it would validate the targeted bootstrap approach for continuous outcomes, which theory tells us should be consistent even when the error distribution is misspecified. Secondly, it would allow us to investigate the magnitude of second-order terms in a setting with a sampling distribution that might be more skewed at smaller sample sizes. Another extension would be to investigate additional bootstrap confidence interval types, especially the tilted and automatic percentile interval types (DiCiccio and Romano 1990).

We also want to highlight another type of bootstrap that is asymptotically consistent in great generality. In this bootstrap method, one estimates the sampling distribution of the TMLE by the sampling distribution, under sampling from the empirical probability distribution \(P_{n}\), of the TMLE that fixes the initial estimator and only recomputes the TMLE update step in the TMLE algorithm. It follows that this is a consistent bootstrap method as long as the TMLE itself is asymptotically efficient. Of course, this type of bootstrap fails to pick up second-order terms due to the estimation of the nuisance parameters.