1 Introduction

Medical research frequently yields multiple event times that correspond to landmarks in disease progression. The events may be non-terminal events, such as disease recurrence, defined as progressive disease, or a terminal event, such as death. Semi-competing risks data (Fine et al. 2001; Peng et al. 2007) are encountered when the terminal event censors the non-terminal event, but the non-terminal event does not prevent subsequent observation of the terminal event. An example of this can be found in a retrospective study of data collected from women diagnosed with stage I or II breast cancer, and maintained as the early stage breast cancer repository (ESBCR) at the University of Texas MD Anderson Cancer Center. For this study, breast cancer relapse is the non-terminal event and death is the terminal event. It is essential to evaluate the survival distribution for women who experience breast cancer relapse following the initial treatment regimen in order to predict the risk of relapse for ongoing patients and make further treatment decisions. However, this endeavor is a challenge because of the informative censoring issue in the semi-competing risks data. Specifically, the occurrence of death is expected to be positively correlated with cancer relapse and cannot be considered as independent censoring. Consequently, the Kaplan–Meier estimator (Kaplan and Meier 1958) will tend to overestimate the true survival probabilities associated with cancer relapse in the presence of death.

One approach to semi-competing risks data is to use crude quantities, such as the cause-specific hazard and cumulative incidence function due to the identifiability issue. The naive method ignores the information available after the non-terminal event by only considering the time and type of the first event. Xu et al. (2010) considers the crude quantity (cause-specific hazard) and developed an extension of the illness-death model with a shared frailty, which can take into account the information after the non-terminal event. Another approach focuses on net quantities, such as the marginal distribution of the non-terminal event time, which is not identifiable without further assumption on the dependence structure between the non-terminal and terminal event times (similar to how competing risks are handled). This approach can fully utilize the observed data and provide useful information on the non-terminal event, but is subject to theoretical and computational challenges due to informative censoring (Fine et al. 2001; Peng and Fine 2007; Hsieh et al. 2008; Ding et al. 2009; Hsieh and Huang 2012; Chen 2012). This paper investigates the semi-competing risks data based on the net quantities.

To handle informative censoring, it is common to assume that the bivariate distribution of non-terminal and terminal event times is a known copula. For example, a Gamma frailty copula (Clayton 1978), which assumes a constant association, has been used for modeling dependence structure (Fine et al. 2001; Jiang et al. 2005). The estimating procedure proposed by Fine et al. (2001) requires a separate estimation for the association parameter using a concordance estimating function at the first stage. These inferences were extended to some other parametric copulas by Wang (2003). Peng and Fine (2007) and Hsieh and Huang (2012) developed copula-based models to account for the time-varying effects of a treatment on the marginal distribution of a non-terminal event. Chen (2012) proposed a nonparametric maximum likelihood estimation approach for the semiparametric transformational models for marginal regressions, where dependence is modeled under an assumed copula. However, in reality, a specific copula model specification does not fit most applications, so its use will probably lead to incorrect statistical inference. Thus, an immediate question is how to choose an appropriate copula model for the dependence structure based on observed bivariate survival data. Chen and Bandeen-Roche (2005) provided a diagnostic method to infer the choice of frailty distribution (copula) family for bivariate survival data, which can be implemented by exploiting the relationship between a local version of Kendall’s tau and the frailty distribution. Specifically, a constant relationship may suggest the Gamma frailty copula, and an increasing one may prefer a positive stable copula (Gumbel 1960).

Instead of restricting to the Gamma frailty copula for modeling dependence between breast cancer relapse and death in the ESBCR data and estimating its association parameter separately at an earlier stage, we first use the exploratory diagnostic method (Chen and Bandeen-Roche 2005) to check the association pattern between cancer relapse and death. Then, we develop a semiparametric method for simultaneous inference on the marginal distribution of breast cancer relapse and the association parameter, based on a chosen copula model.

The remainder of the article is organized as follows. In Sect. 2, we introduce the notation and copula model for semi-competing risks data. In Sect. 3, a suitable copula model is selected by exploring the association pattern in the ESBCR data. Section 4 presents the simultaneous inference on the association parameter and the marginal distribution of the non-terminal event time. In Sect. 5, simulation studies are conducted to assess the performance of the proposed method. An application to the ESBCR data is presented as an illustration in Sect. 6. Some concluding remarks are given in Sect. 7. The proof of the asymptotic properties is provided in the Appendix.

2 Notation and model

Let \(T_{1}\) be the time from the initial event (e.g., initial treatment) to a non-terminal event (e.g., disease relapse), and \(T_{2}\) be the time from the initial event to the terminal event (e.g., death). The random variable \(T_{2}\) may dependently censor \(T_{1}\), but not vice versa. Let C be the independent censoring time of both \(T_{1}\) and \(T_{2}\), such as an administrative loss to follow-up. Define \(Z=\min (T_{1},T_{2})\), \(X=\min (Z,C)\), \(Y=\min (T_{2},C)\), \(\delta _{1}=I(T_{1}\le Y)\), \(\delta _{2}=I(T_{2}\le C)\), and \(\delta _{0}=I(Z\le C)=\delta _{1}+\delta _{2}-\delta _{1}\delta _{2}\). The observed data can be denoted as \(\{(X_{i},\delta _{1i}, Y_{i},\delta _{2i}, Z_i, \delta _{0i}), i=1,\ldots ,n\}\).

With semi-competing risks data, the dependence structure is usually of biological interest and is required to ensure the validity for the estimation of the marginal survival function of \(T_{1}\). As a commonly used model, a copula \(C_{\alpha }\) is a bivariate distribution function with density \(c_{\alpha }\) on \([0,1]^{2}\) for \(\alpha \in \mathfrak {R}^{1}\)(Oakes 1982). Under a copula, the bivariate survival function and density function of \((T_{1},T_{2})\) can be written as

$$\begin{aligned} S(t_{1},t_{2})= & {} \Pr (T_{1}>t_1, T_{2}>t_2)=C_{\alpha } \left\{ S_1(t_{1}),S_2(t_{2})\right\} , \quad 0\le t_{1}\le t_{2},\\ f(t_{1},t_{2})= & {} c_{\alpha }\left\{ S_1(t_{1}),S_2(t_{2})\right\} f_1(t_{1})f_2(t_{2}), \quad 0\le t_{1}\le t_{2}, \end{aligned}$$

where \((S_1,S_2)\) and \((f_1, f_2)\) are the corresponding marginal survival and density functions, respectively. One reason for the popularity of copulas is that the association parameter \(\alpha \) is related to a local dependence measure, the cross ratio (Oakes 1989), which is defined as

$$\begin{aligned} \theta (t_{1},t_{2})=\frac{\lambda (t_{1}|T_{2}=t_{2} )}{\lambda (t_{1}|T_{2}>t_{2})}, \end{aligned}$$
(1)

where \(\lambda (t_{1}|A)\) is a conditional hazard function,

$$\begin{aligned} \lambda (t_{1}|A)=lim_{\epsilon \rightarrow 0}\frac{\Pr (T_{1}<t_{1}+\epsilon |T_{1}\ge t_{1},T_{2}\in A)}{\epsilon }. \end{aligned}$$

The choice of the copula \(C_{\alpha }\) determines the expression of \(\theta (t_{1},t_{2})\). Particularly, under a popular class of copulas, namely the Archimedean copula, the bivariate survival function can be expressed as

$$\begin{aligned} S(t_{1},t_{2})=\phi _{\alpha }\big [\phi _{\alpha }^{-1} \{S_1(t_{1})\}+\phi _{\alpha }^{-1}\{S_2(t_{2})\}\big ], \end{aligned}$$

where \(\phi _{\alpha }\) is a non-increasing convex function defined on (0, 1] with \(\phi _{\alpha }(0)=1\). Examples of Archimedean copulas include the Clayton (Gamma frailty) copula,

$$\begin{aligned} S(t_{1},t_{2})= \left\{ S_1(t_{1})^{1-\alpha } +S_2(t_{2})^{1-\alpha }-1 \right\} ^{\frac{1}{1-\alpha }}, \quad \alpha >1, \end{aligned}$$

the positive stable copula,

$$\begin{aligned} S(t_{1},t_{2})=\exp \left( -\left[ \left\{ -\log S_1(t_{1})\right\} ^{\frac{1}{\alpha }} + \left\{ -\log S_2(t_{2})\right\} ^{\frac{1}{\alpha }} \right] ^{\alpha }\right) , \quad 0<\alpha <1, \end{aligned}$$

and the Frank copula (Frank 1979) as

$$\begin{aligned} S(t_{1},t_{2})=\log _{\alpha } \left\{ 1+\frac{(\alpha ^{S_1(t_{1})}-1)(\alpha ^{S_2(t_{2})}-1)}{\alpha -1}\right\} . \end{aligned}$$

The cross ratio under the Clayton copula is a constant \(\theta (t_{1},t_{2}) = \alpha \), while under the positive stable copula, it is a function of \(S(t_{1},t_{2})\) and decreases with time \(\theta (t_{1},t_{2})=1-(1-\alpha )\{\alpha \log S(t_{1},t_{2})\}^{-1}\). Such a connection can be used to choose an appropriate copula model in applications.

2.1 A motivating study

The ESBCR was created for a retrospective cohort study of (n = 2470) women diagnosed with American Joint Committee on Cancer pathologic stage I or II breast cancer who were treated at The University of Texas MD Anderson Cancer Center between January 1, 1985 and December 31, 2000 (Brewster et al. 2007). The inclusion criteria for the ESBCR were female patients with stage I or II breast cancer, who were residents of the state of Texas, underwent surgical treatment for breast cancer at MD Anderson, had tissue samples available, and for whom a complete assessment of the medical records was available. Among the 2470 ESBCR patients, 785 had died before the end of the study, 1684 were alive, and one record was missing the vital status. There were 2425 patients with complete information of disease relapse status and dates, vital status, and death/last contact date. Among that group, 562 had data regarding disease relapse, which included local lymph node recurrence, local breast cancer recurrence, metastasis to the contralateral breast or chest wall, metastasis to other body sites, and a second primary breast cancer.

3 Model selection by association pattern in ESBCR

Many existing estimation methods for semi-competing risks data require imposition of the Gamma frailty model assumption on \(T_{1}\) and \(T_{2}\) (Fine et al. 2001; Jiang et al. 2005), which implies a constant association and may not always be biologically reasonable in practical applications. To avoid possible misspecification on the copula and the induction of misleading conclusions, we first aim to select an appropriate copula model based on the observed bivariate survival data. Specifically, we apply an easily implemented approach (Chen and Bandeen-Roche 2005) for diagnosing the association pattern and choose the proper copula model for breast cancer relapse and death in the ESBCR data. This diagnostic approach constructs an exploratory display that plots the cross ratio and visually conveys the fit of a given Archimedean copula model.

We base our approach on the fact that the cross ratio defined in Eq. (1) is a ratio of the probabilities that one pair of bivariate failure times is concordant or discordant, conditional on the component-wise minimum failure times (Oakes 1989). For any pair of bivariate times, \(T_{i}=(T_{1i},T_{2i})\) and \(T_{j}=(T_{1j},T_{2j})\), let \((T_{1ij},T_{2ij})\) be the corresponding component-wise minimum of \((T_i, T_j)\), defined by \(T_{1ij}=\min (T_{1i},T_{1j})\) and \(T_{2ij}=\min (T_{2i},T_{2j})\). Define a pair of bivariate times \((T_i, T_j)\) as concordant if \(\left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) >0\), and discordant if \(\left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) <0\). We have

$$\begin{aligned} \theta (t_{1ij},t_{2ij})=\frac{\Pr \left\{ \left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) >0|\left( T_{1ij},T_{2ij}\right) =\left( t_{1ij},t_{2ij}\right) \right\} }{\Pr \left\{ \left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) <0| \left( T_{1ij},T_{2ij}\right) =\left( t_{1ij},t_{2ij}\right) \right\} }, \end{aligned}$$

which provides an estimator of the cross ratio \(\theta \) by using the counts of concordances and discordances. Chen and Bandeen-Roche (2005) suggested using a plot of \(\hat{\theta }(t_{1ij},t_{2ij})\) against \(\hat{S}(t_{1ij},t_{2ij})\) to approximate the functional relationship between the cross ratio and the bivariate survival function. Specifically, a horizontal line may support a Gamma frailty model selection. If \(\hat{\theta }(t_{1ij},t_{2ij})\) increases as \(\hat{S}(t_{1ij},t_{2ij})\) increases, a positive stable model may be appropriate for the dependence structure.

In the ESBCR data, the time from the initial treatment for breast cancer to disease relapse is \(T_{1}\), and the time from the initial treatment to death is \(T_{2}\). To avoid unnecessary computational complexity, we used a subgroup of 350 patients for whom the observed times are available for both cancer relapse and subsequent death in order to count the concordances and discordances of all possible pairs. Based on this subgroup, we estimated the cross ratio within 10 bins and the corresponding bivariate survival function as well. As shown in Fig. 1, the estimated cross ratio increases with an increase in the estimated bivariate survival function, which suggests that a positive stable copula is appropriate for the underlying dependent structure between breast cancer relapse and death.

Note that the exploratory diagnosis is based on a subset of the sample with both non-terminal and terminal events observed, thus its effectiveness needs to be further evaluated. We realize that it is not feasible to estimate the cross ratio without the information on the joint survival function. Nevertheless, the exploratory diagnosis provides a tool for approximating the pattern of the dependent structure. Even though there are potential biases from using the subsample, it can still help us to distinguish the positive stable copula from the Clayton copula, where the latter assumes a constant correlation and has been used in many semi-competing risks data studies. Additional simulation studies have been conducted to show that the diagnostic method can sufficiently detect the association pattern, and are included in Sect. 5.

Fig. 1
figure 1

Relationship between the estimated cross ratio and the estimated bivariate survival function using 10 bins for the ESBCR data

4 Inference on association parameter and marginal distribution

Without loss of generality, we develop simultaneous inference of the association parameter and marginal distribution of the non-terminal event under the positive stable copula, which is chosen by the visual diagnostic display of the ESBCR data. The inference procedure can be easily applied to other copulas. Under the positive stable copula, the bivariate survival function is

$$\begin{aligned} S(t_{1},t_{2}) = \mathrm{Pr}(T_{1}>t_{1}, T_{2}>t_{2}) = \exp \big ( - \big [ \{-\log S_1(t_{1})\}^{\frac{1}{\alpha }} + \{ -\log S_2(t_{2})\}^{\frac{1}{\alpha }} \big ] ^{\alpha } \big ). \end{aligned}$$
(2)

Given the observed data, the likelihood function is

$$\begin{aligned} L(\alpha , S_1, S_2)= & {} \prod _{i=1}^{n} f(X_{i},Y_{i})^{\delta _{1i}\delta _{2i}} \times \Big \{ \frac{\partial S(X_{i},Y_{i}) }{\partial X_{i}} \Big \}^{\delta _{1i} (1-\delta _{2i})} \\&\times \Big \{ \frac{\partial S(X_{i},Y_{i}) }{\partial Y_{i}} \Big \} ^{(1-\delta _{1i})\delta _{2i}} \times S(X_{i},Y_{i})^{(1-\delta _{1i})(1-\delta _{2i})}. \end{aligned}$$

The corresponding log-likelihood function can be expressed as

$$\begin{aligned} l(\alpha , S_1, S_2)= & {} \sum _{i=1}^{n} -\delta _{1i}\log S_1(X_{i}) - \delta _{2i}\log S_2(Y_{i}) +(\alpha ^{-1}-1)\big [\delta _{1i}\log \{-\log S_1(X_{i})\}\nonumber \\&+\,\delta _{2i}\log \{-\log S_2(Y_{i})\}\big ] -\big [\{-\log S_1(X_{i})\}^{\frac{1}{\alpha }}+\{-\log S_2(Y_{i})\}^{\frac{1}{\alpha }}\big ]^{\alpha }\nonumber \\&+\,\{ \alpha (\delta _{1i}+\delta _{2i}-\delta _{1i}\delta _{2i}) -\delta _{1i}-\delta _{2i}\} \log \big [\{-\log S_1(X_{i})\}^{\frac{1}{\alpha }}\nonumber \\&\quad +\{-\log S_2(Y_{i})\}^{\frac{1}{\alpha }}\big ] \nonumber \\&+\,\delta _{1i}\delta _{2i}\log \Big (\big [\{-\log S_1(X_{i})\}^{\frac{1}{\alpha }} +\{-\log S_2(Y_{i})\}^{\frac{1}{\alpha }}\big ]^{\alpha }-1+\alpha ^{-1} \Big ). \end{aligned}$$
(3)

Maximizing a pseudo-likelihood by replacing \(S_1\) and \(S_2\) in (3) with their consistent estimators provides a consistent estimator of \(\alpha \) (Shih and Louis 1995). To this end, one may employ the Kaplan–Meier estimators of \(S_1\) and \(S_2\). However, with semi-competing risks data, the Kaplan–Meier estimator of \(S_1\) does not generally converge to \(S_1\) as \(n\rightarrow \infty \), although \(S_2\) can be consistently estimated by the Kaplan–Meier estimator. In fact, a consistent estimator of \(S_1\) may not exist without an estimator of the association parameter \(\alpha \). Hereafter, we formulate an alternative pseudo-likelihood function of \(\alpha \) using the unique feature of semi-competing risks data.

Under the copula model, the marginal survival function of \(Z=\min (T_1, T_2)\) is

$$\begin{aligned} S_{Z}(t) = \mathrm{Pr}(T_{1}>t, T_{2}>t)= \exp \Big ( - \big [ \{ -\log S_1(t)\}^{\frac{1}{\alpha }} + \{ -\log S_2(t)\}^{\frac{1}{\alpha }} \big ] ^{\alpha }\Big ). \end{aligned}$$
(4)

Using Eq. (4), the marginal survival function of \(T_1\) can be expressed as

$$\begin{aligned} S_1(t) = g\{S_2(t), S_{Z}(t), \alpha \}= \exp \Big ( - \big [ \{ -\log S_{Z}(t)\}^{\frac{1}{\alpha }} - \{ -\log S_2(t)\}^{\frac{1}{\alpha }} \big ] ^{\alpha }\Big ), \end{aligned}$$
(5)

where \(g(u,v, \alpha )= \exp \big [ - \big \{ (-\log v)^{\frac{1}{\alpha }} - (-\log u)^{\frac{1}{\alpha }} \big \} ^{\alpha }\big ]\). With this, the log-likelihood function of \(\alpha \) becomes

$$\begin{aligned} l^g(\alpha , S_2, S_{Z})=l\{\alpha , g(S_2, S_{Z}, \alpha ), S_2\}. \end{aligned}$$
(6)

Note that \(S_2\) and \(S_{Z}\) can be consistently estimated by the Kaplan–Meier estimators \(\hat{S}_2\) and \(\hat{S}_{Z}\), respectively, because \(T_2\) and Z are subject to independent censoring only by C. Let \(U_{\alpha }(\alpha , S_2, S_{Z})\) be the score function of \(\alpha \), which is the derivative of (6) with respect to \(\alpha \). Then, estimator \(\hat{\alpha }\) is the solution to the estimating equation

$$\begin{aligned} U_{\alpha }(\alpha , \hat{S}_2, \hat{S}_{Z})=\frac{\partial l^g(\alpha , \hat{S}_2, \hat{S}_{Z})}{\partial \alpha }=\sum _i \frac{\partial l_0^g\{\alpha , \hat{S}_2(Y_{i}), \hat{S}_{Z}(Z_i)\}}{\partial \alpha }=0. \end{aligned}$$
(7)

We prove that \(\hat{\alpha }\) is consistent and asymptotically normally distributed under the following regularity conditions, C1 and C2.

  1. C1

    The standard regularity conditions for the maximum likelihood estimator hold.

  2. C2

    The functions \(W_{\alpha }\{\alpha , S_2(t_2), S_Z(z)\}\), \(V_{\alpha }\{\alpha , S_2(t_2), S_Z(z)\}\), \(V_{\alpha , 1}\{\alpha , S_2(t_2),S_Z(z)\}\), and \(V_{\alpha , 2}\{\alpha , S_2(t_2), S_Z(z)\}\) are continuous and bounded for \((z,t_2)\in {\mathcal {A}}=[0,t_{02}]\,\times \,[0,z_0]\), where

    $$\begin{aligned} W_{\alpha }(\alpha , u,v)= & {} \frac{\partial l_0^g(\alpha ,u, v)}{\partial \alpha },\quad V_{\alpha }(\alpha , u,v)=\frac{\partial ^2 l_0^g(\alpha ,u, v)}{\partial \alpha ^2},\\ V_{\alpha , 1}(\alpha , u,v)= & {} \frac{\partial ^2 l_0^g(\alpha , u, v)}{\partial \alpha \partial u},\quad V_{\alpha , 2}(\alpha , u,v)=\frac{\partial ^2 l_0^g(\alpha ,u, v)}{\partial \alpha \partial v}, \end{aligned}$$

    \(t_{02}=\sup \{t: \mathrm{Pr}(T_2>t, C>t)>0\}\), \(z_{0}=\sup \{t: \mathrm{Pr}(Z>t, C>t)>0\}\), and we write (uv) for \((S_2, S_Z)\).

Theorem 1 summarizes these asymptotic properties.

Theorem 1

The estimator \(\hat{\alpha }\) is a consistent estimator of the true value \(\alpha _0\), and as \(n\rightarrow \infty \), \(n^{1/2}(\hat{\alpha }-\alpha _0)\) converges weakly to a normal distribution with a mean of zero and variance \(\rho ^2 = (\rho _1^2+\rho _2^2)/\rho _1^4\).

The specification of \({\mathcal {A}}\) is required to ensure that \(\hat{S}_2(t_2)\) is consistent for \(S_2(t_2)\) with \(t_2\in [0,t_{02}]\) and \(\hat{S}_{Z}(z)\) for \(S_{Z}(z)\) with \(z\in [0,z_0]\). The proof of Theorem 1 is similar to that of Theorem 2 from Shih and Louis (1995). The formulas for \(\rho _1^2\) and \(\rho _2^2\) are specified as

$$\begin{aligned} \rho _1^2= & {} E\{-V_{\alpha }(\alpha _0, S_2,S_Z)\}=\int _{\mathcal {A}}-V_{\alpha }\{\alpha _0, S_2(t_2), S_Z(z)\} dJ_{\alpha _0}(t_2,z,\delta _{2},\delta _z),\\ \rho _2^2= & {} E\big [\{I_1(Y_{1},\delta _{21}, \alpha _0)+I_2(Z_1,\delta _{z1}, \alpha _0)\}^2\big ]\\= & {} \int _{\mathcal {A}}\{I_1(t_2, \delta _2, \alpha _0)+I_2(z,\delta _z, \alpha _0)\}^2d J_{\alpha _0}(t_2,z,\delta _2, \delta _z), \end{aligned}$$

where \(J_{\alpha _0}\) is the joint distribution of \((Y_{i},\delta _{2i})\) and \((Z_i, \delta _{zi})\). For \(i=1,\ldots , n\), \(I_1\) and \(I_2\) are defined by

$$\begin{aligned} I_1(T_{2i},\delta _{2i}, \alpha _0)= & {} \int _{\mathcal {A}}V_{\alpha ,1}\{\alpha _0, S_2(t_2), S_Z(z)\}I_1^0(Y_{i},\delta _{2i})(t_2) d J_{\alpha _0}(t_2,z,\delta _2, \delta _z),\\ I_2(Z_{i},\delta _{zi}, \alpha _0)= & {} \int _{\mathcal {A}}V_{\alpha ,2}\{\alpha _0, S_2(t_2), S_Z(z)\}I_2^0(Z_{i},\delta _{zi})(z) d J_{\alpha _0}(t_2,z,\delta _2, \delta _z), \end{aligned}$$

where

$$\begin{aligned} I_1^0(Y_{i},\delta _{2i})(t_2)= & {} -S_2(t_2)\Bigg \{\int _0^{t_2} \frac{dN_{2i}(u)}{p(T_2\ge u, C\ge u)}-\int _0^{t_2}\frac{I(Y_{i}\ge u)d\Lambda _2(u)}{p(T_2\ge u, C\ge u)}\Bigg \},\\ I_2^0(Z_{i},\delta _{zi})(z)= & {} -S_{Z}(z)\Bigg \{\int _0^z \frac{dN_{zi}(u)}{p(Z\ge u, C\ge u)}-\int _0^z\frac{I(Z_i\ge u)d\Lambda _z(u)}{p(Z\ge u, C\ge u)}\Bigg \},\\ N_{2i}(u)= & {} I(Y_{i}\le u, \delta _{2i}=1), \quad N_{zi}(u)=I(Z_{i}\le u, \delta _{zi}=1),\,\,i=1,2,\ldots ,n, \end{aligned}$$

and \(\Lambda _2(u)\) and \(\Lambda _z(u)\) are the cumulative hazard functions for \(T_2\) and Z, respectively. A variance estimator of \(\hat{\alpha }\) may be obtained as \(\hat{\rho }^2 = (\hat{\rho }_1^2+\hat{\rho }_2^2)/\hat{\rho }_1^4\), by replacing \(J_{\alpha _0}\) by its empirical distribution function \(J_n\), and replacing \(S_2\), \(S_Z\), and \(\alpha _0\) by \(\hat{S}_2\), \(\hat{S}_Z\), and \(\hat{\alpha }\), respectively, in \(\rho _1^2\) and \(\rho _2^2\). The consistency of \(J_n\), \(\hat{S}_2\), \(\hat{S}_Z\) and \(\hat{\alpha }\) implies that \(\hat{\rho }^2\) is a consistent estimator of \(\rho ^2\). Given the established weak convergence under the regularity conditions, a simple bootstrap resampling procedure is valid for estimating the asymptotic variance of \(\hat{\alpha }\).

For the marginal survival function \(S_1\), a natural estimator is obtained as \(\hat{S}_1=g\{\hat{S}_2, \hat{S}_{Z}, \hat{\alpha }\}\), where \(\hat{S}_2\) and \(\hat{S}_{Z}\) are the Kaplan–Meier estimators for \(S_2\) and \(S_{Z}\), respectively, and \(\hat{\alpha }\) is the solution to Eq. (7). We have that \(\hat{S}_2(t)\) and \(\hat{S}_{Z}(t)\) are strongly consistent with \(S_2(t)\) and \(S_{Z}(t)\), uniformly for \(t\in [0,\tau ]\) (Flemming and Harrington 2005). Since function g has bounded derivatives, a continuous mapping theorem gives the uniform convergence of \(\hat{S}_1\) to \(S_1\). The asymptotic properties of \(\hat{S}_1\) are summarized in Theorem 2, and the proof is provided in the Appendix.

Theorem 2

The estimator \(\hat{S}_1(t)\) is consistent with the true survival function \(S_1(t)\), and as \(n\rightarrow \infty \), the process \(n^{1/2}\{\hat{S}_1(t)-S_1(t)\}\) converges weakly to a zero-mean Gaussian process with covariance function \(\Big [\frac{\partial g\{S_2(t), S_{Z}(t), \alpha \}}{\partial \alpha }\Big ]^2\rho ^2+\omega ^2(t)\) for \(t\in [0,\tau ]\).

Although the asymptotic variance of \(\hat{S}_1(t)\) has a rather complicated form, the ordinary bootstrap method provides a direct and robust way to estimate the standard error. Thus, we adopt a nonparametric bootstrap resampling method to obtain the standard error of the estimated survival function in practice.

5 Simulation studies

We carried out simulation studies to evaluate the finite sample performances of the proposed estimators. We generate n pairs of \((T_{1}, T_{2})\) from the positive stable copula with a different strength of solicitation by choosing different values of parameter \(\alpha \). The marginal functions \(S_1\) and \(S_2\) are exponentials with parameter \((\lambda _{1}, \lambda _{2})=(1,1)\) or (2, 1). We generate the independent censoring time C from U(0, 4), giving \(25\%\) censoring of \(T_{2}\). For each set of \((\lambda _{1}, \lambda _{2})\), each combination of \(\alpha =0.2, 0.4, 0.6, 0.8\) and \(n=250, 500\), we simulate 500 datasets. To improve the stability of the estimation procedure, we propose the following iteration algorithm.

  1. Step 1:

    We set the initial estimator of the marginal survival function, \(\hat{S}_1^{(0)}\), by the Kaplan–Meier estimator.

  2. Step 2:

    We obtain the Kaplan–Meier estimator \(\hat{S}_2\) of the marginal survival function for the terminal event, and the Kaplan–Meier estimator \(\hat{S}_Z\) of \(S_Z\).

  3. Step 3:

    We estimate the association parameter \(\alpha \) based on the log-likelihood function (3) by replacing \(\hat{S}_1^{(0)}\) and \(\hat{S}_2\), denoted as \(\hat{\alpha }^{(0)}\).

  4. Step 4:

    Based on Eq. (5), we obtain an updated estimator \(\hat{S}_1^{(1)}\) by plugging in \(\hat{S}_{Z}\), \(\hat{S}_2\) and \(\hat{\alpha }^{(0)}\).

  5. Step 5:

    We can obtain an updated estimator \(\hat{\alpha }^{(1)}\) using \(\hat{S}_1^{(1)}\), \(\hat{S}_2\) and the log-likelihood function (3).

The iteration between Steps 4 and 5 is continued until a specified criterion for convergence is achieved.

Table 1 Simulation results of the positive stable copula association parameter \(\alpha \)

A simple nonparametric bootstrap procedure is used to obtain the standard errors for the estimators, based on 200 independent bootstrap samples. Table 1 presents simulation results of the association parameter \(\alpha \) in different simulation scenarios. As expected, the empirical biases are very small and decrease as the sample size increases, and increase with increasing censoring rates. The bootstrap standard errors agree well with the empirical standard errors, implying that the inference procedure performs reasonably well. Figure 2 shows the estimated marginal survival function of the non-terminal event obtained in the simulations when \((\lambda _{1}, \lambda _{2})=(1,1)\). The dashed curve is the initial estimator \(\hat{S}_1^{(0)}\), obtained by the Kaplan–Meier estimator. Seen from the plot, the initial estimator of \(S_1\) departs far from the true curve, suggesting that ignorance of the informative censoring due to the terminal event may seriously overestimate the true curve. After the iterations, the proposed estimator \(\hat{S}_1\) (dotted curve) is close to the true curve (solid curve) with indistinguishable empirical biases.

In addition, we conducted simulation studies to demonstrate that the exploratory diagnostic method using the subset of the sample with both non-terminal and terminal events could effectively discriminate the association patterns between the positive stable copula and the Clayton copula. We generated \(n\,=\,250\) pairs of \((T_1,T_2)\) from the positive stable copula/Clayton copula with different strength of association (\(\alpha \,=\,0.2, 0.4/\alpha \,=\,3, 5\), respectively). The marginal functions \(S_1\) and \(S_2\) were exponential with parameter \((\lambda _1,\lambda _2)\,=\,(1,1)\). We generated the independent censoring time C from U(0,4), and simulated 250 datasets. Figure 3 shows the relationship between the estimated cross ratio curves and the estimated bivariate survival function in scenarios for the positive stable copula (A) and for the Clayton copula (B). The solid lines are for the true cross ratio curve. As expected, the diagnostic method could not accurately estimate the cross ratio, but it can sufficiently detect the association pattern. The estimated cross ratio increases with the increase in the estimated bivariate survival function in the positive stable copula but remains approximately constant in the Clayton copula. This suggests that our diagnostic approach performs well even when using the subsample.

Fig. 2
figure 2

Estimation of marginal survival function of the non-terminal event time in simulations. True margin \(S_1\), the proposed estimator \(\hat{S}_1\) and the naive Kaplan–Meier estimator are plotted

Fig. 3
figure 3

Relationship between the estimated cross ratio and the estimated bivariate survival function using the upper wedge of the simulated semi-competing risks data with the positive stable copula a and the Clayton copula b

6 Application to ESBCR data

As the most common malignant cancer among women in the United States, breast cancer has received tremendous research support, and as a result is associated with an excellent overall five-year survival rate. The majority of cases are diagnosed at the early stage of disease, which is associated with 5-year relative survival rates ranging from 84 to \(99\,\%\) (Siegel et al. 2014). Many of these female patients have the greatest risk of recurrence during the initial years after diagnosis (Brewster et al. 2007). Studying the survival probability of breast cancer relapse will help clinicians to make better treatment decisions and to appropriately manage the patient’s quality of life.

We applied the proposed estimation and inference procedure to the ESBCR data to evaluate the distribution of breast cancer relapse and its association with death. In ESBCR, a cohort of 2425 patients were identified after excluding patients for whom follow-up information was missing (Brewster et al. 2007). The follow-up time ranged from 0.02 to 22.5 years, with a median time of 9.35 years. Among the 2425 patients in the study, 350 experienced both breast cancer relapse and death during follow-up; 212 experienced only cancer relapse and remained alive throughout follow-up; 413 died without experiencing cancer relapse; and 1450 patients did not have either event observed. The censoring rate for breast cancer relapse was \(76.8\,\%\) and that for death was \(68.5\,\%\).

The estimated association parameter \(\hat{\alpha }\) in the positive stable copula model was 0.578, with a bootstrap standard error of 0.018 from 200 resamples. The value of \(\hat{\alpha }\) suggests that there was a moderate positive association between breast cancer relapse and death in the ESBCR data. A further positive association measured by the cross ratio was not constant over time, but decreased with the time to breast cancer relapse and death. Figure 4 presents the estimated survival function \(\hat{S}_1\) (solid curve) of the time to breast cancer relapse with 95 % point-wise confidence intervals (dotted curve), and the naive Kaplan–Meier estimator (dashed curve) considering death as independent censoring. The substantial difference between the solid and dashed curves suggests that methods which ignore the informative censoring will provide misleading results and overestimate the survival function of breast cancer relapse. For example, the estimated disease-free survival rate at 15 years obtained by the proposed joint model is 0.636, with 95 % confidence interval (0.670, 0.705), while that obtained by the naive Kaplan–Meier estimator is 0.711, with 95 % confidence interval (0.688, 0.734).

Fig. 4
figure 4

Estimated marginal survival functions of the time to breast cancer relapse in the ESBCR data: Kaplan–Meier estimator and the proposed estimator \(\hat{S}_1\) with 95 % point-wise confidence intervals

7 Concluding remarks

In this paper, we consider semi-competing risks data for which a non-terminal event may be dependently censored by a terminal event, but not vice versa. We propose simultaneous inference on the marginal survival function of the non-terminal event time and the association between the bivariate event times. To handle the informative censoring, a key component in the modeling strategies is to select a reasonable copula model for the dependence structure, which is implemented by an effective exploratory diagnostic approach. Although we focus on a specific class of positive stable copulas suggested by the model diagnostic, the proposed estimation methods are highly applicable to general Archimedean copula models.

Different to our proposed method, the illness-death model based on crude quantities provides an alternative tool for handling semi-competing risks data. For example, Xu et al. (2010) consider the illness-death model with a shared Gamma frailty. However, the strategy of selecting an appropriate shared frailty within the illness-death model framework for the semi-competing risks data is still under investigation. Following a reviewer’s suggestion, we also consider a Cox proportional hazards model of death by including the relapse status as a time-dependent covariate to analyze the ESBCR data. The estimated hazard ratio is 3.38 (95 % CI 2.45–4.70; P value\(<\)0.001), which suggests that breast cancer relapse significantly increases the risk of death and confirms our conclusion that there is a positive association between cancer relapse and death. However, the simple Cox model analysis could not describe the trend of the association over time. In contrast, our method provides a convenient estimation of the association parameter between cancer relapse and death as well as an estimation of the survival function of the relapse, once the joint survival function is established.

In addition to marginal survival functions and the association between event times, it is of interest in many studies to determine the impact of pretreatment characteristics or treatments on disease progression and survival. This can be addressed by incorporating covariates and formulating their effects on the survival functions of the event times via regression models. An interesting topic for future investigation is to extend the current copula model framework and develop estimation methods for time-varying effect regression models for semi-competing risks data.