1 Introduction

In the last two decades there has been a remarkable progress in cancer treatments, which led to longer patient survival and improved their quality of life. Consequently, a spate of statistical research to develop cure models arose. These models are a useful tool to analyze and describe cancer survival data, since they express and predict the prognosis of a patient considering, as a novelty, the real possibility that the subject may never experience the event of interest. Importantly, cure models should not be used in an indiscriminate way (see Farewell 1986). They generally require long-term follow-up and large sample sizes, as well as empirical and biological evidence of a nonsusceptible subpopulation. More specifically, they are used to estimate the probability of cure, also known as incidence, and the survival function of the uncured population denoted by latency.

Cure models can be split into two major types: the mixture and the nonmixture models. Mixture cure models were proposed by Boag (1949). They consider the survival function as a mixture of two groups of subjects: the susceptible group and the cured group. An important benefit of the mixture cure model is that it allows covariates to have different influence on patients who will experience the event of interest (e.g., death by the cancer under study) and on those who will not.

In the literature, the covariate effect is generally expressed parametrically or semiparametrically (see, among others, Farewell 1982; Goldman 1984; Kuk and Chen 1992; Maller and Zhou 1996; Sposto et al. 1992; Chappell et al. 1995; Taylor 1995; Peng and Dear 2000; Sy and Taylor 2000; Peng 2003; Yu and Peng 2008). Recently Louzada and Cobre (2012) considered recurrent event data in the presence of a cure fraction. Very few papers exist that use a nonparametric view to deal with the problem (see Maller and Zhou 1992; Laska and Meisner 1992; Wang et al. 2012). In the discussion by Van Keilegom to the paper González-Manteiga and Crujeiras (2013), the problem of goodness-of-fit tests for regression models with cured data is briefly considered. A completely nonparametric approach to the mixture cure model was firstly addressed by Xu and Peng (2014), proposing a nonparametric incidence estimator which works with continuous covariates, and proving its consistency and asymptotic normality. This nonparametric incidence estimator was studied later by López-Cheda et al. (2017), who obtained an i.i.d. representation, the asymptotically optimal bandwidth and proposed a bootstrap bandwidth selector. Regarding the latency function, a nonparametric estimator was proposed by López-Cheda et al. (2017), but no further properties were studied. The present paper contributes to this lacuna studying the asymptotic properties of that nonparametric latency estimator and proposing a bootstrap bandwidth selector. This enables the mixture cure model with covariates to be addressed in a completely nonparametric way.

The rest of the paper is organized as follows. Section 2 introduces the notation and presents the nonparametric mixture cure model and the nonparametric latency estimation. The asymptotic results for this estimator as well as the required assumptions are also introduced in Sect. 2. An i.i.d. representation is presented, an asymptotic expression for the mean squared error is found and the asymptotic normality is established for the nonparametric latency estimator. The problem of choosing the smoothing parameter is addressed in Sect. 3, where a bootstrap bandwidth selector is presented. The practical performance of this bootstrap bandwidth selector is assessed by a simulation study in Sect. 4. The application of these methods to a colorectal cancer data set is considered in Sect. 5. A final Appendix contains the proofs of the theoretical results stated in Sect. 2.

2 Main results

2.1 Notation and nonparametric estimators

To distinguish between cured and uncured subjects we use a binary indicator: \(\nu \). If the subject belongs to the susceptible group we set \(\nu =0\). This means that the individual will experience the event of interest if followed during enough time. If the subject is cured we set \(\nu =1\). In such a case the event will never be experienced by that subject. The probability of being cured and the survival function in the group of uncured patients may depend on a vector of covariates, \(\mathbf {X}\), measured on the subject. Let us consider \(p(\mathbf {x}) = P(\nu =0 | \mathbf {X}=\mathbf {x})\), the conditional probability of not being cured, and let Y be the time to the event of interest. If \(\nu =1\), we set \(Y =\infty \).

We define \(F(t|\mathbf {x})=P(Y\le t |\mathbf {X}=\mathbf {x})\), the conditional distribution function of Y. When the cure probability is positive, then the corresponding survival function, \(S(t|\mathbf {x})\), is improper. In other terms, \(\lim _{t \rightarrow \infty } S(t|\mathbf {x}) = 1-p(\mathbf {x}) >0\).

Using the conditional survival function for susceptible subjects, \(S_0(t|\mathbf {x})=P(Y > t | \mathbf {X}=\mathbf {x}, \nu =0)\), the mixture cure model can be written as:

$$\begin{aligned} S(t|\mathbf {x})=1 - p(\mathbf {x}) + p(\mathbf {x}) S_0 (t|\mathbf {x}). \end{aligned}$$
(1)

The function \(1 - p(\mathbf {x})\) is called the incidence and \(S_0(t|\mathbf {x})\) is the latency.

Random right censoring is assumed. The censoring time is denoted by C and G denotes its distribution function (\(\bar{G}\) is its survival function). The variable C is assumed to be independent of Y given the covariates \(\mathbf {X}\). The observed time is defined as \(T=\min \{Y,C \}\) and \(\delta =1 \{Y \le C\}\) is the uncensoring indicator. We denote by H the distribution function of T. It is clear that \(\delta =0\) for all the cured patients, and also for uncured patients with censored lifetime (\(T=C\)). From now on we restrict ourselves to the case where \(\mathbf {X}\) is a univariate continuous covariate X with density function m(x). As a consequence of the previous definitions and assumptions, the sample is denoted by \(\{(X_i, T_i, \delta _i), i=1, \dots , n \}\), which collects i.i.d. observations of the random vector \((X, T, \delta )\). Whenever is needed \((X_{(i)},T_{(i)},\delta _{(i)})\) will denote the observation corresponding to the ith order statistic with respect to the sample \((T_1, T_2, \ldots , T_n)\), where \(X_{(i)}\) and \(\delta _{(i)}\) are the concomitants of the X and \(\delta \)-samples.

The conditional distribution, survival and subdistribution functions are denoted by \(G(t|x)=P\left( C \le t|X=x \right) \), \(\bar{G}(t|x)= 1-G(t|x)\), \(H(t|x)=P\left( T \le t | \right. \) \(\left. X=x \right) \), \(H^1(t|x)=P\left( T \le t, \delta =1|X=x \right) \) and \(H_{c,1}(t)=P\left( T < t|\delta = \right. \) \(\left. 1 \right) \).

We will consider the nonparametric approach in mixture cure models by López-Cheda et al. (2017). It departs from the generalized Kaplan–Meier estimator by Beran (1981) to estimate the conditional survival function:

$$\begin{aligned} \hat{S}_{h}(t|x) = \prod _{T_{(i)} \le t} \left( 1 - \frac{\delta _{(i)} B_{h(i)} (x) }{\sum _{r=i}^n B_{h(r)}(x)}\right) , \end{aligned}$$
(2)

where \(B_{h(i)}(x) = K_h (x - X_{(i)}) / \sum _{j=1}^n K_h(x - X_{(j)}) \) are the Nadaraya–Watson (NW) weights and \(K_h(\cdot )=\frac{1}{h}K \left( \frac{ \cdot }{h} \right) \) the rescaled kernel with bandwidth \(h>0\). We denote by \(\hat{F}_h(t|x) = 1 - \hat{S}_h(t|x)\) the Beran estimator of F(t|x).

Departing from the Beran estimator, Xu and Peng (2014) introduced a kernel type estimator for the incidence function:

$$\begin{aligned} 1-\hat{p}_h(x)= \prod _{i=1}^{n} \left( 1 - \frac{\delta _{(i)} B_{h(i)}(x)}{\sum _{r=i}^n B_{h(r)}(x) } \right) = \hat{S}_h(T_{\max }^1|x), \end{aligned}$$
(3)

where \(T_{\max }^1=\max \nolimits _{i:\delta _i=1}(T_i)\) is the largest uncensored failure time. These authors proved the consistency and asymptotic normality of \(\hat{p}_h(x)\). López-Cheda et al. (2017) obtained an i.i.d. representation and the asymptotically optimal bandwidth, proposed a bootstrap bandwidth selector for \(\hat{p}_h(x)\), and introduced the following nonparametric latency estimator:

$$\begin{aligned} \hat{S}_{0,h} (t|x) = \frac{\hat{S}_{h}(t|x) - (1 - \hat{p}_h(x))}{\hat{p}_h(x)}, \end{aligned}$$
(4)

with \(\hat{S}_{h}(t|x)\), in (2), the Beran estimator of S(t|x) and \(1 - \hat{p}_h(x)\) the estimator by Xu and Peng (2014) in (3). They also addressed identifiability of model (1). Note that the optimal bandwidth for \(\hat{S}_{0,h}(t|x)\) is not necessarily the optimal bandwidth for \(\hat{p}_h(x)\). A more general function than (4) using different bandwidths for the incidence and for the improper survival function:

$$\begin{aligned} \hat{S}_{0,h_1,h_2} (t|x) = \frac{\hat{S}_{h_1}(t|x) - (1 - \hat{p}_{h_2}(x))}{\hat{p}_{h_2}(x)} \end{aligned}$$
(5)

could be considered as an estimator of the latency. However, it does not yield necessarily a proper survival function since its limit as t tends to infinity needs not to be zero. In fact, it is not even guaranteed to be non-negative. On the other hand, as it will be shown in Sect. 4.1, the optimal values for \(h_1\) and \(h_2\) in (5) are nearly equal. As a consequence, in this work only the asymptotic properties of the nonparametric latency estimator in (4), that depends on one unique bandwidth h, will be studied. Similar theoretical results, not included in this paper, are easily extended to the estimator in (5).

Let us define: \(\tau _{S_{0}}(x) = \sup \left\{ t:S_{0}(t|x)> 0\right\} \). Since S(t|x) is an improper survival function and \(1-H(t|x)=S(t|x) \bar{G}(t|x)\), then \(\tau _{H}(x)=\tau _{G}(x)\), where \(\tau _{H}(x) = \sup \left\{ t:H(t|x) < 1\right\} \) and \(\tau _{G}(x) = \sup \left\{ t:G(t|x) < 1\right\} \).

Let \(\tau _{0}=\sup _{x \in D} \tau _{S_{0}}(x) \), where D is the support of X. As in Xu and Peng (2014), we consider

$$\begin{aligned} \tau _{0}<\tau _{G}\left( x\right) , \quad \forall x \in D. \end{aligned}$$
(6)

The rationale of this condition has been discussed by López-Cheda et al. (2017), Xu and Peng (2014) and Maller and Zhou (1992, 1996). Note that if the censoring variable takes values always below a time \(\tau _G < \tau _0\), the largest uncensored observation may occur at a time not larger than \(\tau _G\) and therefore always before \(\tau _0\). Laska and Meisner (1992) stated that, for a large sample size, the nonparametric incidence estimator in (3) is an estimator of \(1-p(x)+p(x)S_0(\tau _G)\), which is strictly larger than \(1-p(x)\). Specifically, as it is mentioned in Maller and Zhou (1992), consistent estimates of the incidence are possible if and only if there is zero probability of a susceptible individual surviving longer than the largest possible censoring time. That is, condition (6) guarantees that censored subjects beyond the largest observable failure time are cured, since the support of the censoring variable, C, is not contained in the support of Y, the time to occurrence of the event. Therefore, the nonparametric estimator does not overestimate the true cure rate. A nonparametric test for this condition on the censoring mechanism was proposed by Maller and Zhou (1992) in an unconditional setting, and by López-Cheda et al. (2017) with covariates.

2.2 Theoretical results

The following assumptions are needed to prove the asymptotic results in this section.

  1. (A1)

    X, Y and C are absolutely continuous random variables.

  2. (A2)

    Condition (6) holds.

  3. (A3)
    1. (a)

      Let \(I=[x_{1},x_{2}]\) be an interval contained in the support of m, and \(I_{\delta }=[x_{1}-\delta ,x_{2}+\delta ]\) for some \(\delta >0\) such that \(0<\gamma =\inf [m\left( x\right) :x\in I_{\delta }]<\sup [m\left( x\right) :x\in I_{\delta }]=\varGamma <\infty \) and \(0<\delta \varGamma <1\). Then for all \(x\in I_{\delta }\) the random variables Y and C are conditionally independent given \(X=x\).

    2. (b)

      There exist \(a,b\in \mathbb {R}\), with \(a<b\) satisfying \(1 - H(t|x) \ge \theta > 0\) for \((t,x) \in [a,b] \times I_{\delta }\).

  4. (A4)

    The first derivative of the function m(x) exists and is continuous in \(x\in I_{\delta }\) and the first derivatives with respect to x of the functions H(t|x) and \(H^{1}(t|x)\) exist and are continuous and bounded in \(\left( t,x\right) \in [0,\infty )\times I_{\delta }\).

  5. (A5)

    The second derivative of the function m(x) exists and is continuous in \(x\in I_{\delta }\) and the second derivatives with respect to x of the functions H(t|x) and \(H^{1}(t|x) \) exist and are continuous and bounded in \(\left( t,x\right) \in [0,\infty )\times I_{\delta }\).

  6. (A6)

    The first derivatives with respect to t of the functions G(t|x), H(t|x), \(H^{1}(t|x)\) and \(S_0(t|x)\) exist and are continuous in \(\left( t,x\right) \in [a,b] \times D\).

  7. (A7)

    The second derivatives with respect to t of the functions H(t|x) and \(H^{1}(t|x) \) exist and are continuous in \(\left( t,x\right) \in [a,b]\times D\).

  8. (A8)

    The second partial derivatives with respect to t and x of the functions H(t|x) and \(H^1(t|x)\) exist and are continuous and bounded for \((t,x)\in [0,\infty ) \times D\).

  9. (A9)

    The first and second derivatives of the distribution and subdistribution functions H(t) and \(H_{c,1}(t)\) are bounded away from zero in [ab]. Moreover, \(H^{\prime }_{c,1}(\tau _0) > 0\).

  10. (A10)

    The functions H(t|x), \(S_{0}(t|x)\) and G(t|x) have bounded second-order derivatives with respect to x for any given value of t.

  11. (A11)

    The kernel function, K, is a symmetric density vanishing outside \(\left( -1,1\right) \) and the total variation of K is less than some \(\lambda<\) \(\infty \).

  12. (A12)

    The density function of T, \(f_T\), is bounded away from 0 in \([0,\infty )\).

The proof of Theorem 1 is based on Theorem 2 in Iglesias-Pérez and González-Manteiga (1999), where the assumptions (A1), (A3)–(A9) and (A11)–(A12) are required. Assumptions (A2) and (A10) ensure that Theorem 2 in Iglesias-Pérez and González-Manteiga (1999), stated for a fixed t such that \(1-H(t|x) \ge \theta > 0 \in [a,b]\times I_\delta \), can be applied to the random value \(t=T_{\max }^1\). Assumptions (A4)–(A8) and (A10) are regularity conditions for the functions involved in the proofs and in the asymptotic results.

In Theorem 1 we obtain an i.i.d. representation for \(\hat{S}_{0,h}(t|x)\) in (4).

Theorem 1

Suppose that conditions (A1)–(A12) hold, together with \(\frac{\ln n}{nh} \rightarrow 0\) and \(h=O\left( \left( \frac{\ln n}{n}\right) ^{{1}/{5}} \right) \), then we have an i.i.d. representation for the nonparametric latency estimator for any \(t \in [a,b]\):

$$\begin{aligned} \hat{S}_{0,h}(t|x)-S_{0}(t|x)=\sum _{i=1}^{n}\eta _{h}(T_{i},\delta _{i},X_{i},t,x) + O \left( \left( \frac{\ln n}{nh} \right) ^{3/4}\right) \,\mathrm{a.s.}, \end{aligned}$$

where

$$\begin{aligned} \eta _{h}(T_{i},\delta _{i},X_{i},t,x)= & {} -\frac{S(t|x)}{p(x)} \tilde{B}_{h, i}(x)\xi (T_{i},\delta _{i},t,x) \nonumber \\&- \frac{(1-p(x))(1-S(t|x))}{p^{2}(x)}\tilde{B}_{h,i}(x)\xi (T_{i},\delta _{i},\infty ,x),\nonumber \\ \xi \left( T_{i},\delta _{i},t,x\right)= & {} \frac{1\{T_{i}\le t,\delta _{i}=1\} }{1-H(T_{i}|x)} -\int _{0}^{t}\frac{1\{u\le T_{i}\}dH^{1}(u|x)}{ \left( 1-H(u|x)\right) ^{2}} \end{aligned}$$
(7)

and

$$\begin{aligned} \tilde{B}_{h,i}(x)=\frac{\frac{1}{nh} K\left( \frac{x-X_{i}}{h}\right) }{m(x) }. \end{aligned}$$

From Theorem 1, important properties of the nonparametric latency estimator can be obtained: the first one is the asymptotic expression of the mean squared error (MSE) given in Theorem 2, and the second one is the asymptotic normality, shown in Theorem 3. But first some notation will be introduced. Let us define

$$\begin{aligned} \varPhi (y,t,x)= & {} E\left[ \xi (T,\delta ,t,x)|X=y\right] , \end{aligned}$$
(8)
$$\begin{aligned} \varPhi _{1}(y,t,x)= & {} E\left[ \xi ^{2}(T,\delta ,t,x)|X=y\right] \end{aligned}$$
(9)

and

$$\begin{aligned} \varPhi _{2}(y,t,x)=E\left[ \xi (T,\delta ,t,x)\xi (T,\delta ,\infty ,x)|X=y \right] , \end{aligned}$$

with \(\xi \) in (7). The asymptotic bias and variance of the latency estimator will be expressed in terms of the following functions:

$$\begin{aligned} B_{1}\left( t,x\right)= & {} \frac{S(t|x)}{p(x)m(x)}\left( \varPhi ^{\prime \prime }\left( x,t,x\right) m(x)+2\varPhi ^{\prime }\left( x,t,x\right) m^{\prime }(x)\right) , \end{aligned}$$
(10)
$$\begin{aligned} B_{2}\left( t,x\right)= & {} \frac{(1-p(x))(1-S(t|x))}{p^{2}(x)m(x)} \nonumber \\&\times \left( \varPhi ^{\prime \prime }\left( x,\infty ,x\right) m(x)+2\varPhi ^{\prime }\left( x,\infty ,x\right) m^{\prime }(x)\right) , \end{aligned}$$
(11)

where

$$\begin{aligned} \varPhi \left( y,t,x\right) =\int _{0}^{t}\frac{\mathrm{d}H^{1}\left( v|y\right) }{1-H(v|x)}-\int _{0}^{t}(1-H(v|y))\frac{\mathrm{d}H^{1}(v|x)}{\left( 1-H(v|x)\right) ^{2}}, \end{aligned}$$

and \(\varPhi ^{\prime }\) and \(\varPhi ^{\prime \prime }\) are the derivatives of \(\varPhi (y,t,x)\) with respect to y. Furthermore,

$$\begin{aligned} V_{1}\left( t,x\right)= & {} \left( \frac{S(t|x)}{p(x)}\right) ^{2}\frac{\varPhi _{1}(x,t,x)}{m(x)}, \end{aligned}$$
(12)
$$\begin{aligned} V_{2}\left( t,x\right)= & {} \left( \frac{(1-p(x))(1-S(t|x))}{p^{2}(x)}\right) ^{2}\frac{\varPhi _{1}(x,\infty ,x)}{m(x)}, \end{aligned}$$
(13)
$$\begin{aligned} V_{3}\left( t,x\right)= & {} \frac{(1-p(x))S(t|x)(1-S(t|x))}{p^{3}(x)m\left( x\right) }\varPhi _{2}(x,t,x), \end{aligned}$$
(14)

respectively, where

$$\begin{aligned} \varPhi _{1}(x,t,x)=\varPhi _{2}(x,t,x)=\int _{0}^{t}\frac{\mathrm{d}H^{1}\left( v|x\right) }{\left( 1-H(v|x)\right) ^{2}}. \end{aligned}$$

Note that, except for some constants, \(B_1(t,x)\) in (10) and \(B_2(t,x)\) in (11) are the dominant terms of the asymptotic bias of the estimators \(\hat{S}_h\) and \(1-\hat{p}_h\) in (2) and (3), respectively. Similarly, the terms \(V_1(t,x)\) in (12) and \(V_2(t,x)\) in (13) are the dominant terms of the corresponding asymptotic variances of \(\hat{S}_h\) and \(1-\hat{p}_h\). Finally, \(V_3(t,x)\) in (14) accounts for the covariance of both estimators.

Theorem 2

Under assumptions (A1)–(A10), if \(\frac{\ln n}{nh} \rightarrow 0\) and \(h=O\left( \left( \frac{\ln n}{n}\right) ^{{1}/{5}} \right) \), then the asymptotic mean squared error of the latency estimator is

$$\begin{aligned} \mathrm{AMSE}(\hat{S}_{0,h}(t|x))=\frac{h^{4}}{4}d_{K}B^{2}\left( t,x\right) +\frac{ c_{K}}{nh}V\left( t,x\right) +o(h^{4})+O\left( \frac{1}{n}\right) , \end{aligned}$$
(15)

where \(d_{K}=\int v^{2}K(v)\mathrm{d}v\), \(c_{K}=\int K^{2}(v)\mathrm{d}v\),

$$\begin{aligned} B\left( t,x\right)= & {} B_{1}\left( t,x\right) +B_{2}\left( t,x\right) , \end{aligned}$$
(16)
$$\begin{aligned} V\left( t,x\right)= & {} V_{1}\left( t,x\right) +V_{2}\left( t,x\right) +2V_{3}\left( t,x\right) , \end{aligned}$$
(17)

with \(t \in [a,b]\), \(B_{1}\), \(B_{2}\), \(V_{1}\), \(V_{2}\) and \(V_{3}\) in (10)–(14).

Theorem 3

Under assumptions (A1)–(A10), if \(h \rightarrow 0\) and \(\frac{(\ln n)^3}{nh} \rightarrow 0\), it follows that, for any \(t \in [a,b],\)

(a) If \(nh^{5} \rightarrow 0\), then

$$\begin{aligned} \sqrt{nh} \left( \hat{S}_{0,h}(t|x)-S_{0}(t|x)\right) \xrightarrow {d} N\left( 0,V\left( t,x\right) c_{K} \right) . \end{aligned}$$

(b) If \(n h^5 \rightarrow C^5 > 0\), then

$$\begin{aligned} \sqrt{nh}\left( \hat{S}_{0,h}(t|x)-S_{0}(t|x)\right) \xrightarrow {d}N\left( B\left( t,x\right) C^{5/2} d_{K} ,V\left( t,x\right) c_{K} \right) . \end{aligned}$$

3 Bandwidth selection

From Theorem 2, the asymptotic mean integrated squared error of the latency estimator is:

$$\begin{aligned} \mathrm{AMISE}(\hat{S}_{0,h}(\cdot |x))=\frac{1}{4}d_{K}^{2}h^{4}\int B^{2}\left( t,x\right) \mathrm{d}t+\frac{c_{K}}{nh}\int V\left( t,x\right) \mathrm{d}t+o(h^{4})+O\left( \frac{1}{n}\right) , \end{aligned}$$

where B(tx) and V(tx) are defined in (16) and (17). The bandwidth which minimizes the asymptotic mean integrated squared error is

$$\begin{aligned} h_\mathrm{AMISE}(x)=\left( \frac{c_{K}\int V(t,x)\mathrm{d}t}{d_{K}^{2}\int B^{2}(t,x)\mathrm{d}t} \right) ^{1/5}n^{-1/5}, \end{aligned}$$

which depends on plenty of unknown functions that are very hard to estimate. Consequently, we propose to select the bandwidth using the bootstrap method.

3.1 Bootstrap bandwidth selector

The bootstrap bandwidth selector is the minimizer of the bootstrap version of the mean integrated squared error (MISE), that can be approximated, using Monte Carlo, by:

$$\begin{aligned} \mathrm{MISE}_{x,g}^*(h) \simeq \frac{1}{B} \sum _{j=1}^B \int \left( \hat{S}_{0,h}^{*(j)}(t|x) - \hat{S}_{0,g}(t|x) \right) ^2 w(t) \mathrm{d}t, \end{aligned}$$
(18)

where w is an appropriate weight function, \(\hat{S}_{0,h}^{*(j)}(t|x)\) is the kernel estimator of \(S_0(t|x)\) in (4) using bandwidth h and based on the jth bootstrap resample, and \(\hat{S}_{0,g}(t|x)\) is the same estimator computed with the original sample and pilot bandwidth g.

We consider an unconditional censoring bootstrap resampling, assuming that \(G(t|x)=G(t),\;\forall x,t\):

  1. 1.

    For \(i= 1, 2, \ldots , n\), generate \(C_i^*\) from the product-limit estimator \(\hat{G}^{KM}\).

  2. 2.

    For \(i= 1, 2, \ldots , n\), fix the bootstrap covariates \(X_i^*=X_i\) and generate \(Y_i^*\) from \(\hat{S}_{0,g}(\cdot |X_i^*)\) with probability \( \hat{p}_g(X_i^*)\), and \(Y_i^* = \infty \) otherwise.

  3. 3.

    Finally, define \(T^*_i=\min \{Y^*_i, C^*_i \}\) and \(\delta ^*_i = 1\{Y^*_i \le C^*_i\}\) for \(i= 1, 2, \ldots , n\).

  4. 4.

    Repeat Steps 1–3 above B times to generate bootstrap resamples of the form \(\big \{ ( X_1^{(b)}, T_1^{*(b)}, \delta _1^{*(b)}), \dots , ( X_n^{(b)}, T_n^{*(b)}, \delta _n^{*(b)} ) \big \}\), \(b=1, \dots , B\).

  5. 5.

    For the bth bootstrap resample \((b = 1, 2 \ldots ,B)\), compute \(\hat{S}_{0,h}^{*(b)}(t|x)\) with bandwidth \(h_l \in \{h_1,\ldots ,h_L\}\) .

  6. 6.

    With the original sample and pilot bandwidth g, compute \(\hat{S}_{0,g}(t|x)\).

  7. 7.

    For each bandwidth \(h_l\) in \(\{h_1, \dots , h_L\}\), compute the Monte Carlo approximation of \(\mathrm{MISE}_{x,g}^*(h_l)\) as in (18).

  8. 8.

    Find \(h^*_x=\mathop {{{\mathrm{arg\,min}}}}\nolimits _{h_l \in \{h_1,\ldots ,h_L\}}\mathrm{MISE}^{*}_{x,g}(h_l)\).

4 Simulation study

Good practical behavior of the nonparametric latency estimator has been preliminarily reported by López-Cheda et al. (2017). The purpose of this simulation study is to assess the performance of the bootstrap bandwidth selector for the nonparametric latency estimator. We will work with the same two models considered by López-Cheda et al. (2017). For both models, the censoring times are generated according to an exponential distribution with mean 10 / 3 and the covariate X has a \(U(-20,20)\) distribution.

Model 1 The probability of not being cured is a logistic function and the latency is close to fulfill the proportional hazards model, truncated to guarantee condition (6):

$$\begin{aligned} p(x) = \frac{ \exp ( \beta _0 + \beta _1 x ) }{ 1 + \exp (\beta _0 + \beta _1 x) } \text { and } S_0(t|x) =\dfrac{ \exp (-\lambda (x) t) - \exp (-\lambda (x) \tau _0) }{1 - \exp (-\lambda (x) \tau _0)}1\{t \le \tau _0 \}, \end{aligned}$$

with \(\beta _0 = 0.476\) and \(\beta _1 = 0.358\), \(\tau _0 = 4.605\) and \(\lambda \left( x\right) =\exp \left( (x+20) /40 \right) \). A percentage of 54% of the patients are censored and 47% are cured.

Model 2 The probability of not being cured is

$$\begin{aligned} p(x) = \frac{\exp \left( \beta _0 + \beta _1 x + \beta _2 x^2 + \beta _3 x^3 \right) }{ 1 + \exp \left( \beta _0 + \beta _1 x + \beta _2 x^2 + \beta _3 x^3 \right) }, \end{aligned}$$

with \(\beta _0 = 0.0476\), \(\beta _1 = - 0.2558 \), \(\beta _2 = - 0.0027\) and \(\beta _3 = 0.0020\), and \(S_0 (t|x) = \frac{1}{2} \left( \exp (-\alpha (x) t^5 ) + \exp (-100 t^5) \right) \) with \(\alpha (x) = \frac{1}{5} \exp ((x+20)/40)\). In this case, the percentages of cure and censoring are slightly higher than for Model 1: around 62% of the individuals are censored and 53% are cured.

In order to approximate the bootstrap version of the \(MISE_x\) of the nonparametric latency estimator, \(m=1000\) trials and \(B=200\) bootstrap resamples of sizes \(n=50\), \(n=100\) and \(n=200\) were drawn and used the Epanechnikov kernel. We considered a grid of 35 bandwidths (from 5 to 100) equispaced on a logarithmic scale. Note that, although the covariate \(X\in U(-20,20)\), we only work with \(x \in [-10,20]\). The reason is that \(p(x)\simeq 0\) for \(-20\le x\le -10\). This implies that almost all the subjects are cured, and therefore the estimation of the survival function of the uncured population can not be obtained. Similarly as for the nonparametric incidence estimator (see López-Cheda et al. 2017), the effect of the choice of the pilot bandwidth, g, on the bootstrap bandwidth, \(h^*_x\), is very weak. In this simulation study, we considered the same naive pilot bandwidth selector, \(g = C(X_{[n]} - X_{[1]}) \cdot n^{-1/9}\), as in López-Cheda et al. (2017), with \(C=0.75\), and where \(X_{[n]}\) (\(X_{[1]}\)) is the maximum (minimum) value of the observed values of the covariate X.

In Fig. 1 the density of the bootstrap bandwidths, \(h^*_x\), is compared with the optimal \(h_{\mathrm{MISE},x}\) bandwidth. The MISE values obtained considering these bandwidths are also shown. It is noteworthy that \(\mathrm{MISE}(\hat{S}_{0,h}(\cdot |x))\), and consequently \(\mathrm{MISE}^*_{x,g}(h)\), is almost constant in a very wide interval around its minimizer. This feature implies that very different bandwidths could yield very similar good estimates in terms of MISE. We can appreciate how the bootstrap bandwidth might be larger (smaller) than \(h_\mathrm{MISE}\) in Model 1 (Model 2), for most of the covariate values, reflected in a very little difference in terms of MISE between the estimates with the optimal and the bootstrap bandwidths.

Fig. 1
figure 1

MISE contour plot depending on the bandwidth and on the covariate, for Model 1 (left) and Model 2 (right), with sample sizes \(n=50\) (top), \(n=100\) (center) and \(n=200\) (bottom). The density of the bootstrap bandwidth is displayed in grayscale and the \(h_\mathrm{MISE}\) bandwidth, for each covariate value, is represented with crosses

4.1 Results when using two bandwidths to estimate \(S_0\)

We will present some results for the latency estimator in (5), that is, if two different bandwidths are considered: \(h_1\) for the incidence and \(h_2\) for the improper survival function S. Note that, for the sake of brevity, we only work with Model 1 and sample size \(n=100\), considering \(m=1000\) samples. Figure 2 (left) shows the MISE, approximated by Monte Carlo, of the nonparametric latency estimator \(\hat{S}_{0,h_1,h_2}(t|x)\) in (5) as a function of \((h_1,h_2)\) for the covariate value \(x=5\) (the MISE for other values of x is similar, not shown). We can see that the minimum MISE (dark-grey color) is reached around the diagonal, that is, when \(h_1=h_2\). Figure 2 (right) provides the optimal bandwidths \((h_1,h_2)\) as a function of x. Note that for most of the covariate values both optimal bandwidths are very similar, being even equal for the values of x larger than 5.

Therefore, as pointed out in Sect. 2, little efficiency is lost when considering one only bandwidth \(h_1 = h_2\) to estimate \(S_0\), while this guarantees that the resulting estimator is a proper survival function.

Fig. 2
figure 2

\(\mathrm{MISE}(h_1, h_2)\) of \(\hat{S}_{0,h_1,h_2}\) for \(x=5\) and the grid of bandwidths (equispaced on a logarithmic scale) where \(h_1=h_2\) are represented with black dots (left), and optimal \((h_1,h_2)\) bandwidths, in terms of MISE (right)

5 Application to colorectal cancer data

The proposed method was applied to the dataset used in López-Cheda et al. (2017), composed of 414 colorectal cancer patients from CHUAC (Complejo Hospitalario Universitario de A Coruña), Spain. The variable of interest is the follow-up time, in months, since the diagnosis until death. Two covariates are considered: the stage (from 1 to 4) and the age (from 23 to 103). The percentage of censoring varies from 30% to almost 71%, depending on the stage. In Table 1 we show a summary of the data set.

Table 1 Colorectal cancer patients from CHUAC
Fig. 3
figure 3

Latency estimation for patients in Stages 1–2 (left) and 3–4 (right) with ages 35 (solid line), 50 (dashed line) and 80 (dotted line), computed using the nonparametric estimator, \(\hat{S}_{0,h}(t|x)\), with the bootstrap bandwidth, \(h_x^*\)

Due to the small sample sizes in each stage, the results are presented in two groups: Stages 1–2 and Stages 3–4. Note that \(B=200\) bootstrap resamples are drawn. Similar to the simulation study in Sect. 4, we considered a grid of 35 bandwidths from \(h_1=5\) to \(h_{35}=100\) equispaced on a logarithmic scale.

The latency estimation computed with the bootstrap bandwidth, \(\hat{S}_{0,h^*}(t|x)\), for different ages (35, 50 and 80) is shown in Fig. 3. We can observe that for Stages 1–2 the covariate age does not seem to be determining for the latency estimation, since all the estimated latency functions are very similar for the whole grid of ages. On the contrary, for Stages 3–4 the latency estimation varies considerably depending on the age. For example, the probability that the follow-up time since the diagnosis until death is larger than 4.5 years (54 months) is around 0.2 for patients with ages 35 and 50, whereas for 80- year-old patients, that probability is larger than 0.4.