Semiparametric model for semi-competing risks data with application to breast cancer study

Zhou, Renke; Zhu, Hong; Bondy, Melissa; Ning, Jing

doi:10.1007/s10985-015-9344-x

Semiparametric model for semi-competing risks data with application to breast cancer study

Published: 05 September 2015

Volume 22, pages 456–471, (2016)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Lifetime Data Analysis Aims and scope Submit manuscript

Semiparametric model for semi-competing risks data with application to breast cancer study

Download PDF

Renke Zhou^1,2,
Hong Zhu³,
Melissa Bondy¹ &
…
Jing Ning⁴

618 Accesses
7 Citations
1 Altmetric
Explore all metrics

Abstract

For many forms of cancer, patients will receive the initial regimen of treatments, then experience cancer progression and eventually die of the disease. Understanding the disease process in patients with cancer is essential in clinical, epidemiological and translational research. One challenge in analyzing such data is that death dependently censors cancer progression (e.g., recurrence), whereas progression does not censor death. We deal with the informative censoring by first selecting a suitable copula model through an exploratory diagnostic approach and then developing an inference procedure to simultaneously estimate the marginal survival function of cancer relapse and an association parameter in the copula model. We show that the proposed estimators possess consistency and weak convergence. We use simulation studies to evaluate the finite sample performance of the proposed method, and illustrate it through an application to data from a study of early stage breast cancer.

Bayesian and Non-Bayesian Estimation for the Bivariate Inverse Weibull Distribution Under Progressive Type-II Censoring

Article 23 November 2020

Efficient Estimation of the Additive Risks Model for Interval-Censored Data

A dependent Dirichlet process model for survival data with competing risks

Article 12 October 2020

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Medical research frequently yields multiple event times that correspond to landmarks in disease progression. The events may be non-terminal events, such as disease recurrence, defined as progressive disease, or a terminal event, such as death. Semi-competing risks data (Fine et al. 2001; Peng et al. 2007) are encountered when the terminal event censors the non-terminal event, but the non-terminal event does not prevent subsequent observation of the terminal event. An example of this can be found in a retrospective study of data collected from women diagnosed with stage I or II breast cancer, and maintained as the early stage breast cancer repository (ESBCR) at the University of Texas MD Anderson Cancer Center. For this study, breast cancer relapse is the non-terminal event and death is the terminal event. It is essential to evaluate the survival distribution for women who experience breast cancer relapse following the initial treatment regimen in order to predict the risk of relapse for ongoing patients and make further treatment decisions. However, this endeavor is a challenge because of the informative censoring issue in the semi-competing risks data. Specifically, the occurrence of death is expected to be positively correlated with cancer relapse and cannot be considered as independent censoring. Consequently, the Kaplan–Meier estimator (Kaplan and Meier 1958) will tend to overestimate the true survival probabilities associated with cancer relapse in the presence of death.

One approach to semi-competing risks data is to use crude quantities, such as the cause-specific hazard and cumulative incidence function due to the identifiability issue. The naive method ignores the information available after the non-terminal event by only considering the time and type of the first event. Xu et al. (2010) considers the crude quantity (cause-specific hazard) and developed an extension of the illness-death model with a shared frailty, which can take into account the information after the non-terminal event. Another approach focuses on net quantities, such as the marginal distribution of the non-terminal event time, which is not identifiable without further assumption on the dependence structure between the non-terminal and terminal event times (similar to how competing risks are handled). This approach can fully utilize the observed data and provide useful information on the non-terminal event, but is subject to theoretical and computational challenges due to informative censoring (Fine et al. 2001; Peng and Fine 2007; Hsieh et al. 2008; Ding et al. 2009; Hsieh and Huang 2012; Chen 2012). This paper investigates the semi-competing risks data based on the net quantities.

To handle informative censoring, it is common to assume that the bivariate distribution of non-terminal and terminal event times is a known copula. For example, a Gamma frailty copula (Clayton 1978), which assumes a constant association, has been used for modeling dependence structure (Fine et al. 2001; Jiang et al. 2005). The estimating procedure proposed by Fine et al. (2001) requires a separate estimation for the association parameter using a concordance estimating function at the first stage. These inferences were extended to some other parametric copulas by Wang (2003). Peng and Fine (2007) and Hsieh and Huang (2012) developed copula-based models to account for the time-varying effects of a treatment on the marginal distribution of a non-terminal event. Chen (2012) proposed a nonparametric maximum likelihood estimation approach for the semiparametric transformational models for marginal regressions, where dependence is modeled under an assumed copula. However, in reality, a specific copula model specification does not fit most applications, so its use will probably lead to incorrect statistical inference. Thus, an immediate question is how to choose an appropriate copula model for the dependence structure based on observed bivariate survival data. Chen and Bandeen-Roche (2005) provided a diagnostic method to infer the choice of frailty distribution (copula) family for bivariate survival data, which can be implemented by exploiting the relationship between a local version of Kendall’s tau and the frailty distribution. Specifically, a constant relationship may suggest the Gamma frailty copula, and an increasing one may prefer a positive stable copula (Gumbel 1960).

Instead of restricting to the Gamma frailty copula for modeling dependence between breast cancer relapse and death in the ESBCR data and estimating its association parameter separately at an earlier stage, we first use the exploratory diagnostic method (Chen and Bandeen-Roche 2005) to check the association pattern between cancer relapse and death. Then, we develop a semiparametric method for simultaneous inference on the marginal distribution of breast cancer relapse and the association parameter, based on a chosen copula model.

The remainder of the article is organized as follows. In Sect. 2, we introduce the notation and copula model for semi-competing risks data. In Sect. 3, a suitable copula model is selected by exploring the association pattern in the ESBCR data. Section 4 presents the simultaneous inference on the association parameter and the marginal distribution of the non-terminal event time. In Sect. 5, simulation studies are conducted to assess the performance of the proposed method. An application to the ESBCR data is presented as an illustration in Sect. 6. Some concluding remarks are given in Sect. 7. The proof of the asymptotic properties is provided in the Appendix.

2 Notation and model

Let $T_{1}$ be the time from the initial event (e.g., initial treatment) to a non-terminal event (e.g., disease relapse), and $T_{2}$ be the time from the initial event to the terminal event (e.g., death). The random variable $T_{2}$ may dependently censor $T_{1}$, but not vice versa. Let C be the independent censoring time of both $T_{1}$ and $T_{2}$, such as an administrative loss to follow-up. Define $Z=\min (T_{1},T_{2})$, $X=\min (Z,C)$, $Y=\min (T_{2},C)$, $\delta _{1}=I(T_{1}\le Y)$, $\delta _{2}=I(T_{2}\le C)$, and $\delta _{0}=I(Z\le C)=\delta _{1}+\delta _{2}-\delta _{1}\delta _{2}$. The observed data can be denoted as $\{(X_{i},\delta _{1i}, Y_{i},\delta _{2i}, Z_i, \delta _{0i}), i=1,\ldots ,n\}$.

With semi-competing risks data, the dependence structure is usually of biological interest and is required to ensure the validity for the estimation of the marginal survival function of $T_{1}$. As a commonly used model, a copula $C_{\alpha }$ is a bivariate distribution function with density $c_{\alpha }$ on $[0,1]^{2}$ for $\alpha \in \mathfrak {R}^{1}$(Oakes 1982). Under a copula, the bivariate survival function and density function of $(T_{1},T_{2})$ can be written as

$$\begin{aligned} S(t_{1},t_{2})= & {} \Pr (T_{1}>t_1, T_{2}>t_2)=C_{\alpha } \left\{ S_1(t_{1}),S_2(t_{2})\right\} , \quad 0\le t_{1}\le t_{2},\\ f(t_{1},t_{2})= & {} c_{\alpha }\left\{ S_1(t_{1}),S_2(t_{2})\right\} f_1(t_{1})f_2(t_{2}), \quad 0\le t_{1}\le t_{2}, \end{aligned}$$

where $(S_1,S_2)$ and $(f_1, f_2)$ are the corresponding marginal survival and density functions, respectively. One reason for the popularity of copulas is that the association parameter $\alpha $ is related to a local dependence measure, the cross ratio (Oakes 1989), which is defined as

$$\begin{aligned} \theta (t_{1},t_{2})=\frac{\lambda (t_{1}|T_{2}=t_{2} )}{\lambda (t_{1}|T_{2}>t_{2})}, \end{aligned}$$

(1)

where $\lambda (t_{1}|A)$ is a conditional hazard function,

$$\begin{aligned} \lambda (t_{1}|A)=lim_{\epsilon \rightarrow 0}\frac{\Pr (T_{1}<t_{1}+\epsilon |T_{1}\ge t_{1},T_{2}\in A)}{\epsilon }. \end{aligned}$$

The choice of the copula $C_{\alpha }$ determines the expression of $\theta (t_{1},t_{2})$. Particularly, under a popular class of copulas, namely the Archimedean copula, the bivariate survival function can be expressed as

$$\begin{aligned} S(t_{1},t_{2})=\phi _{\alpha }\big [\phi _{\alpha }^{-1} \{S_1(t_{1})\}+\phi _{\alpha }^{-1}\{S_2(t_{2})\}\big ], \end{aligned}$$

where $\phi _{\alpha }$ is a non-increasing convex function defined on (0, 1] with $\phi _{\alpha }(0)=1$. Examples of Archimedean copulas include the Clayton (Gamma frailty) copula,

$$\begin{aligned} S(t_{1},t_{2})= \left\{ S_1(t_{1})^{1-\alpha } +S_2(t_{2})^{1-\alpha }-1 \right\} ^{\frac{1}{1-\alpha }}, \quad \alpha >1, \end{aligned}$$

the positive stable copula,

$$\begin{aligned} S(t_{1},t_{2})=\exp \left( -\left[ \left\{ -\log S_1(t_{1})\right\} ^{\frac{1}{\alpha }} + \left\{ -\log S_2(t_{2})\right\} ^{\frac{1}{\alpha }} \right] ^{\alpha }\right) , \quad 0<\alpha <1, \end{aligned}$$

and the Frank copula (Frank 1979) as

$$\begin{aligned} S(t_{1},t_{2})=\log _{\alpha } \left\{ 1+\frac{(\alpha ^{S_1(t_{1})}-1)(\alpha ^{S_2(t_{2})}-1)}{\alpha -1}\right\} . \end{aligned}$$

The cross ratio under the Clayton copula is a constant $\theta (t_{1},t_{2}) = \alpha $, while under the positive stable copula, it is a function of $S(t_{1},t_{2})$ and decreases with time $\theta (t_{1},t_{2})=1-(1-\alpha )\{\alpha \log S(t_{1},t_{2})\}^{-1}$. Such a connection can be used to choose an appropriate copula model in applications.

2.1 A motivating study

The ESBCR was created for a retrospective cohort study of (n = 2470) women diagnosed with American Joint Committee on Cancer pathologic stage I or II breast cancer who were treated at The University of Texas MD Anderson Cancer Center between January 1, 1985 and December 31, 2000 (Brewster et al. 2007). The inclusion criteria for the ESBCR were female patients with stage I or II breast cancer, who were residents of the state of Texas, underwent surgical treatment for breast cancer at MD Anderson, had tissue samples available, and for whom a complete assessment of the medical records was available. Among the 2470 ESBCR patients, 785 had died before the end of the study, 1684 were alive, and one record was missing the vital status. There were 2425 patients with complete information of disease relapse status and dates, vital status, and death/last contact date. Among that group, 562 had data regarding disease relapse, which included local lymph node recurrence, local breast cancer recurrence, metastasis to the contralateral breast or chest wall, metastasis to other body sites, and a second primary breast cancer.

3 Model selection by association pattern in ESBCR

Many existing estimation methods for semi-competing risks data require imposition of the Gamma frailty model assumption on $T_{1}$ and $T_{2}$ (Fine et al. 2001; Jiang et al. 2005), which implies a constant association and may not always be biologically reasonable in practical applications. To avoid possible misspecification on the copula and the induction of misleading conclusions, we first aim to select an appropriate copula model based on the observed bivariate survival data. Specifically, we apply an easily implemented approach (Chen and Bandeen-Roche 2005) for diagnosing the association pattern and choose the proper copula model for breast cancer relapse and death in the ESBCR data. This diagnostic approach constructs an exploratory display that plots the cross ratio and visually conveys the fit of a given Archimedean copula model.

We base our approach on the fact that the cross ratio defined in Eq. (1) is a ratio of the probabilities that one pair of bivariate failure times is concordant or discordant, conditional on the component-wise minimum failure times (Oakes 1989). For any pair of bivariate times, $T_{i}=(T_{1i},T_{2i})$ and $T_{j}=(T_{1j},T_{2j})$, let $(T_{1ij},T_{2ij})$ be the corresponding component-wise minimum of $(T_i, T_j)$, defined by $T_{1ij}=\min (T_{1i},T_{1j})$ and $T_{2ij}=\min (T_{2i},T_{2j})$. Define a pair of bivariate times $(T_i, T_j)$ as concordant if $\left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) >0$, and discordant if $\left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) <0$. We have

$$\begin{aligned} \theta (t_{1ij},t_{2ij})=\frac{\Pr \left\{ \left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) >0|\left( T_{1ij},T_{2ij}\right) =\left( t_{1ij},t_{2ij}\right) \right\} }{\Pr \left\{ \left( T_{1i}-T_{1j}\right) \left( T_{2i}-T_{2j}\right) <0| \left( T_{1ij},T_{2ij}\right) =\left( t_{1ij},t_{2ij}\right) \right\} }, \end{aligned}$$

which provides an estimator of the cross ratio $\theta $ by using the counts of concordances and discordances. Chen and Bandeen-Roche (2005) suggested using a plot of $\hat{\theta }(t_{1ij},t_{2ij})$ against $\hat{S}(t_{1ij},t_{2ij})$ to approximate the functional relationship between the cross ratio and the bivariate survival function. Specifically, a horizontal line may support a Gamma frailty model selection. If $\hat{\theta }(t_{1ij},t_{2ij})$ increases as $\hat{S}(t_{1ij},t_{2ij})$ increases, a positive stable model may be appropriate for the dependence structure.

In the ESBCR data, the time from the initial treatment for breast cancer to disease relapse is $T_{1}$, and the time from the initial treatment to death is $T_{2}$. To avoid unnecessary computational complexity, we used a subgroup of 350 patients for whom the observed times are available for both cancer relapse and subsequent death in order to count the concordances and discordances of all possible pairs. Based on this subgroup, we estimated the cross ratio within 10 bins and the corresponding bivariate survival function as well. As shown in Fig. 1, the estimated cross ratio increases with an increase in the estimated bivariate survival function, which suggests that a positive stable copula is appropriate for the underlying dependent structure between breast cancer relapse and death.

Note that the exploratory diagnosis is based on a subset of the sample with both non-terminal and terminal events observed, thus its effectiveness needs to be further evaluated. We realize that it is not feasible to estimate the cross ratio without the information on the joint survival function. Nevertheless, the exploratory diagnosis provides a tool for approximating the pattern of the dependent structure. Even though there are potential biases from using the subsample, it can still help us to distinguish the positive stable copula from the Clayton copula, where the latter assumes a constant correlation and has been used in many semi-competing risks data studies. Additional simulation studies have been conducted to show that the diagnostic method can sufficiently detect the association pattern, and are included in Sect. 5.

4 Inference on association parameter and marginal distribution

Without loss of generality, we develop simultaneous inference of the association parameter and marginal distribution of the non-terminal event under the positive stable copula, which is chosen by the visual diagnostic display of the ESBCR data. The inference procedure can be easily applied to other copulas. Under the positive stable copula, the bivariate survival function is

$$\begin{aligned} S(t_{1},t_{2}) = \mathrm{Pr}(T_{1}>t_{1}, T_{2}>t_{2}) = \exp \big ( - \big [ \{-\log S_1(t_{1})\}^{\frac{1}{\alpha }} + \{ -\log S_2(t_{2})\}^{\frac{1}{\alpha }} \big ] ^{\alpha } \big ). \end{aligned}$$

(2)

Given the observed data, the likelihood function is

$$\begin{aligned} L(\alpha , S_1, S_2)= & {} \prod _{i=1}^{n} f(X_{i},Y_{i})^{\delta _{1i}\delta _{2i}} \times \Big \{ \frac{\partial S(X_{i},Y_{i}) }{\partial X_{i}} \Big \}^{\delta _{1i} (1-\delta _{2i})} \\&\times \Big \{ \frac{\partial S(X_{i},Y_{i}) }{\partial Y_{i}} \Big \} ^{(1-\delta _{1i})\delta _{2i}} \times S(X_{i},Y_{i})^{(1-\delta _{1i})(1-\delta _{2i})}. \end{aligned}$$

The corresponding log-likelihood function can be expressed as

$$\begin{aligned} l(\alpha , S_1, S_2)= & {} \sum _{i=1}^{n} -\delta _{1i}\log S_1(X_{i}) - \delta _{2i}\log S_2(Y_{i}) +(\alpha ^{-1}-1)\big [\delta _{1i}\log \{-\log S_1(X_{i})\}\nonumber \\&+\,\delta _{2i}\log \{-\log S_2(Y_{i})\}\big ] -\big [\{-\log S_1(X_{i})\}^{\frac{1}{\alpha }}+\{-\log S_2(Y_{i})\}^{\frac{1}{\alpha }}\big ]^{\alpha }\nonumber \\&+\,\{ \alpha (\delta _{1i}+\delta _{2i}-\delta _{1i}\delta _{2i}) -\delta _{1i}-\delta _{2i}\} \log \big [\{-\log S_1(X_{i})\}^{\frac{1}{\alpha }}\nonumber \\&\quad +\{-\log S_2(Y_{i})\}^{\frac{1}{\alpha }}\big ] \nonumber \\&+\,\delta _{1i}\delta _{2i}\log \Big (\big [\{-\log S_1(X_{i})\}^{\frac{1}{\alpha }} +\{-\log S_2(Y_{i})\}^{\frac{1}{\alpha }}\big ]^{\alpha }-1+\alpha ^{-1} \Big ). \end{aligned}$$

(3)

Maximizing a pseudo-likelihood by replacing $S_1$ and $S_2$ in (3) with their consistent estimators provides a consistent estimator of $\alpha $ (Shih and Louis 1995). To this end, one may employ the Kaplan–Meier estimators of $S_1$ and $S_2$. However, with semi-competing risks data, the Kaplan–Meier estimator of $S_1$ does not generally converge to $S_1$ as $n\rightarrow \infty $, although $S_2$ can be consistently estimated by the Kaplan–Meier estimator. In fact, a consistent estimator of $S_1$ may not exist without an estimator of the association parameter $\alpha $. Hereafter, we formulate an alternative pseudo-likelihood function of $\alpha $ using the unique feature of semi-competing risks data.

Under the copula model, the marginal survival function of $Z=\min (T_1, T_2)$ is

$$\begin{aligned} S_{Z}(t) = \mathrm{Pr}(T_{1}>t, T_{2}>t)= \exp \Big ( - \big [ \{ -\log S_1(t)\}^{\frac{1}{\alpha }} + \{ -\log S_2(t)\}^{\frac{1}{\alpha }} \big ] ^{\alpha }\Big ). \end{aligned}$$

(4)

Using Eq. (4), the marginal survival function of $T_1$ can be expressed as

$$\begin{aligned} S_1(t) = g\{S_2(t), S_{Z}(t), \alpha \}= \exp \Big ( - \big [ \{ -\log S_{Z}(t)\}^{\frac{1}{\alpha }} - \{ -\log S_2(t)\}^{\frac{1}{\alpha }} \big ] ^{\alpha }\Big ), \end{aligned}$$

(5)

where $g(u,v, \alpha )= \exp \big [ - \big \{ (-\log v)^{\frac{1}{\alpha }} - (-\log u)^{\frac{1}{\alpha }} \big \} ^{\alpha }\big ]$. With this, the log-likelihood function of $\alpha $ becomes

$$\begin{aligned} l^g(\alpha , S_2, S_{Z})=l\{\alpha , g(S_2, S_{Z}, \alpha ), S_2\}. \end{aligned}$$

(6)

Note that $S_2$ and $S_{Z}$ can be consistently estimated by the Kaplan–Meier estimators $\hat{S}_2$ and $\hat{S}_{Z}$, respectively, because $T_2$ and Z are subject to independent censoring only by C. Let $U_{\alpha }(\alpha , S_2, S_{Z})$ be the score function of $\alpha $, which is the derivative of (6) with respect to $\alpha $. Then, estimator $\hat{\alpha }$ is the solution to the estimating equation

$$\begin{aligned} U_{\alpha }(\alpha , \hat{S}_2, \hat{S}_{Z})=\frac{\partial l^g(\alpha , \hat{S}_2, \hat{S}_{Z})}{\partial \alpha }=\sum _i \frac{\partial l_0^g\{\alpha , \hat{S}_2(Y_{i}), \hat{S}_{Z}(Z_i)\}}{\partial \alpha }=0. \end{aligned}$$

(7)

We prove that $\hat{\alpha }$ is consistent and asymptotically normally distributed under the following regularity conditions, C1 and C2.

C1
The standard regularity conditions for the maximum likelihood estimator hold.
C2
The functions $W_{\alpha }\{\alpha , S_2(t_2), S_Z(z)\}$, $V_{\alpha }\{\alpha , S_2(t_2), S_Z(z)\}$, $V_{\alpha , 1}\{\alpha , S_2(t_2),S_Z(z)\}$, and $V_{\alpha , 2}\{\alpha , S_2(t_2), S_Z(z)\}$ are continuous and bounded for $(z,t_2)\in {\mathcal {A}}=[0,t_{02}]\,\times \,[0,z_0]$, where
$$\begin{aligned} W_{\alpha }(\alpha , u,v)= & {} \frac{\partial l_0^g(\alpha ,u, v)}{\partial \alpha },\quad V_{\alpha }(\alpha , u,v)=\frac{\partial ^2 l_0^g(\alpha ,u, v)}{\partial \alpha ^2},\\ V_{\alpha , 1}(\alpha , u,v)= & {} \frac{\partial ^2 l_0^g(\alpha , u, v)}{\partial \alpha \partial u},\quad V_{\alpha , 2}(\alpha , u,v)=\frac{\partial ^2 l_0^g(\alpha ,u, v)}{\partial \alpha \partial v}, \end{aligned}$$
$t_{02}=\sup \{t: \mathrm{Pr}(T_2>t, C>t)>0\}$, $z_{0}=\sup \{t: \mathrm{Pr}(Z>t, C>t)>0\}$, and we write (u, v) for $(S_2, S_Z)$.

Theorem 1 summarizes these asymptotic properties.

Theorem 1

The estimator $\hat{\alpha }$ is a consistent estimator of the true value $\alpha _0$, and as $n\rightarrow \infty $, $n^{1/2}(\hat{\alpha }-\alpha _0)$ converges weakly to a normal distribution with a mean of zero and variance $\rho ^2 = (\rho _1^2+\rho _2^2)/\rho _1^4$.

The specification of ${\mathcal {A}}$ is required to ensure that $\hat{S}_2(t_2)$ is consistent for $S_2(t_2)$ with $t_2\in [0,t_{02}]$ and $\hat{S}_{Z}(z)$ for $S_{Z}(z)$ with $z\in [0,z_0]$. The proof of Theorem 1 is similar to that of Theorem 2 from Shih and Louis (1995). The formulas for $\rho _1^2$ and $\rho _2^2$ are specified as

$$\begin{aligned} \rho _1^2= & {} E\{-V_{\alpha }(\alpha _0, S_2,S_Z)\}=\int _{\mathcal {A}}-V_{\alpha }\{\alpha _0, S_2(t_2), S_Z(z)\} dJ_{\alpha _0}(t_2,z,\delta _{2},\delta _z),\\ \rho _2^2= & {} E\big [\{I_1(Y_{1},\delta _{21}, \alpha _0)+I_2(Z_1,\delta _{z1}, \alpha _0)\}^2\big ]\\= & {} \int _{\mathcal {A}}\{I_1(t_2, \delta _2, \alpha _0)+I_2(z,\delta _z, \alpha _0)\}^2d J_{\alpha _0}(t_2,z,\delta _2, \delta _z), \end{aligned}$$

where $J_{\alpha _0}$ is the joint distribution of $(Y_{i},\delta _{2i})$ and $(Z_i, \delta _{zi})$. For $i=1,\ldots , n$, $I_1$ and $I_2$ are defined by

$$\begin{aligned} I_1(T_{2i},\delta _{2i}, \alpha _0)= & {} \int _{\mathcal {A}}V_{\alpha ,1}\{\alpha _0, S_2(t_2), S_Z(z)\}I_1^0(Y_{i},\delta _{2i})(t_2) d J_{\alpha _0}(t_2,z,\delta _2, \delta _z),\\ I_2(Z_{i},\delta _{zi}, \alpha _0)= & {} \int _{\mathcal {A}}V_{\alpha ,2}\{\alpha _0, S_2(t_2), S_Z(z)\}I_2^0(Z_{i},\delta _{zi})(z) d J_{\alpha _0}(t_2,z,\delta _2, \delta _z), \end{aligned}$$

where

$$\begin{aligned} I_1^0(Y_{i},\delta _{2i})(t_2)= & {} -S_2(t_2)\Bigg \{\int _0^{t_2} \frac{dN_{2i}(u)}{p(T_2\ge u, C\ge u)}-\int _0^{t_2}\frac{I(Y_{i}\ge u)d\Lambda _2(u)}{p(T_2\ge u, C\ge u)}\Bigg \},\\ I_2^0(Z_{i},\delta _{zi})(z)= & {} -S_{Z}(z)\Bigg \{\int _0^z \frac{dN_{zi}(u)}{p(Z\ge u, C\ge u)}-\int _0^z\frac{I(Z_i\ge u)d\Lambda _z(u)}{p(Z\ge u, C\ge u)}\Bigg \},\\ N_{2i}(u)= & {} I(Y_{i}\le u, \delta _{2i}=1), \quad N_{zi}(u)=I(Z_{i}\le u, \delta _{zi}=1),\,\,i=1,2,\ldots ,n, \end{aligned}$$

and $\Lambda _2(u)$ and $\Lambda _z(u)$ are the cumulative hazard functions for $T_2$ and Z, respectively. A variance estimator of $\hat{\alpha }$ may be obtained as $\hat{\rho }^2 = (\hat{\rho }_1^2+\hat{\rho }_2^2)/\hat{\rho }_1^4$, by replacing $J_{\alpha _0}$ by its empirical distribution function $J_n$, and replacing $S_2$, $S_Z$, and $\alpha _0$ by $\hat{S}_2$, $\hat{S}_Z$, and $\hat{\alpha }$, respectively, in $\rho _1^2$ and $\rho _2^2$. The consistency of $J_n$, $\hat{S}_2$, $\hat{S}_Z$ and $\hat{\alpha }$ implies that $\hat{\rho }^2$ is a consistent estimator of $\rho ^2$. Given the established weak convergence under the regularity conditions, a simple bootstrap resampling procedure is valid for estimating the asymptotic variance of $\hat{\alpha }$.

For the marginal survival function $S_1$, a natural estimator is obtained as $\hat{S}_1=g\{\hat{S}_2, \hat{S}_{Z}, \hat{\alpha }\}$, where $\hat{S}_2$ and $\hat{S}_{Z}$ are the Kaplan–Meier estimators for $S_2$ and $S_{Z}$, respectively, and $\hat{\alpha }$ is the solution to Eq. (7). We have that $\hat{S}_2(t)$ and $\hat{S}_{Z}(t)$ are strongly consistent with $S_2(t)$ and $S_{Z}(t)$, uniformly for $t\in [0,\tau ]$ (Flemming and Harrington 2005). Since function g has bounded derivatives, a continuous mapping theorem gives the uniform convergence of $\hat{S}_1$ to $S_1$. The asymptotic properties of $\hat{S}_1$ are summarized in Theorem 2, and the proof is provided in the Appendix.

Theorem 2

The estimator $\hat{S}_1(t)$ is consistent with the true survival function $S_1(t)$, and as $n\rightarrow \infty $, the process $n^{1/2}\{\hat{S}_1(t)-S_1(t)\}$ converges weakly to a zero-mean Gaussian process with covariance function $\Big [\frac{\partial g\{S_2(t), S_{Z}(t), \alpha \}}{\partial \alpha }\Big ]^2\rho ^2+\omega ^2(t)$ for $t\in [0,\tau ]$.

Although the asymptotic variance of $\hat{S}_1(t)$ has a rather complicated form, the ordinary bootstrap method provides a direct and robust way to estimate the standard error. Thus, we adopt a nonparametric bootstrap resampling method to obtain the standard error of the estimated survival function in practice.

5 Simulation studies

We carried out simulation studies to evaluate the finite sample performances of the proposed estimators. We generate n pairs of $(T_{1}, T_{2})$ from the positive stable copula with a different strength of solicitation by choosing different values of parameter $\alpha $. The marginal functions $S_1$ and $S_2$ are exponentials with parameter $(\lambda _{1}, \lambda _{2})=(1,1)$ or (2, 1). We generate the independent censoring time C from U(0, 4), giving $25\%$ censoring of $T_{2}$. For each set of $(\lambda _{1}, \lambda _{2})$, each combination of $\alpha =0.2, 0.4, 0.6, 0.8$ and $n=250, 500$, we simulate 500 datasets. To improve the stability of the estimation procedure, we propose the following iteration algorithm.

Step 1:
We set the initial estimator of the marginal survival function, $\hat{S}_1^{(0)}$, by the Kaplan–Meier estimator.
Step 2:
We obtain the Kaplan–Meier estimator $\hat{S}_2$ of the marginal survival function for the terminal event, and the Kaplan–Meier estimator $\hat{S}_Z$ of $S_Z$.
Step 3:
We estimate the association parameter $\alpha $ based on the log-likelihood function (3) by replacing $\hat{S}_1^{(0)}$ and $\hat{S}_2$, denoted as $\hat{\alpha }^{(0)}$.
Step 4:
Based on Eq. (5), we obtain an updated estimator $\hat{S}_1^{(1)}$ by plugging in $\hat{S}_{Z}$, $\hat{S}_2$ and $\hat{\alpha }^{(0)}$.
Step 5:
We can obtain an updated estimator $\hat{\alpha }^{(1)}$ using $\hat{S}_1^{(1)}$, $\hat{S}_2$ and the log-likelihood function (3).

The iteration between Steps 4 and 5 is continued until a specified criterion for convergence is achieved.

Table 1 Simulation results of the positive stable copula association parameter $\alpha $

Full size table

A simple nonparametric bootstrap procedure is used to obtain the standard errors for the estimators, based on 200 independent bootstrap samples. Table 1 presents simulation results of the association parameter $\alpha $ in different simulation scenarios. As expected, the empirical biases are very small and decrease as the sample size increases, and increase with increasing censoring rates. The bootstrap standard errors agree well with the empirical standard errors, implying that the inference procedure performs reasonably well. Figure 2 shows the estimated marginal survival function of the non-terminal event obtained in the simulations when $(\lambda _{1}, \lambda _{2})=(1,1)$. The dashed curve is the initial estimator $\hat{S}_1^{(0)}$, obtained by the Kaplan–Meier estimator. Seen from the plot, the initial estimator of $S_1$ departs far from the true curve, suggesting that ignorance of the informative censoring due to the terminal event may seriously overestimate the true curve. After the iterations, the proposed estimator $\hat{S}_1$ (dotted curve) is close to the true curve (solid curve) with indistinguishable empirical biases.

In addition, we conducted simulation studies to demonstrate that the exploratory diagnostic method using the subset of the sample with both non-terminal and terminal events could effectively discriminate the association patterns between the positive stable copula and the Clayton copula. We generated $n\,=\,250$ pairs of $(T_1,T_2)$ from the positive stable copula/Clayton copula with different strength of association ($\alpha \,=\,0.2, 0.4/\alpha \,=\,3, 5$, respectively). The marginal functions $S_1$ and $S_2$ were exponential with parameter $(\lambda _1,\lambda _2)\,=\,(1,1)$. We generated the independent censoring time C from U(0,4), and simulated 250 datasets. Figure 3 shows the relationship between the estimated cross ratio curves and the estimated bivariate survival function in scenarios for the positive stable copula (A) and for the Clayton copula (B). The solid lines are for the true cross ratio curve. As expected, the diagnostic method could not accurately estimate the cross ratio, but it can sufficiently detect the association pattern. The estimated cross ratio increases with the increase in the estimated bivariate survival function in the positive stable copula but remains approximately constant in the Clayton copula. This suggests that our diagnostic approach performs well even when using the subsample.

6 Application to ESBCR data

As the most common malignant cancer among women in the United States, breast cancer has received tremendous research support, and as a result is associated with an excellent overall five-year survival rate. The majority of cases are diagnosed at the early stage of disease, which is associated with 5-year relative survival rates ranging from 84 to $99\,\%$ (Siegel et al. 2014). Many of these female patients have the greatest risk of recurrence during the initial years after diagnosis (Brewster et al. 2007). Studying the survival probability of breast cancer relapse will help clinicians to make better treatment decisions and to appropriately manage the patient’s quality of life.

We applied the proposed estimation and inference procedure to the ESBCR data to evaluate the distribution of breast cancer relapse and its association with death. In ESBCR, a cohort of 2425 patients were identified after excluding patients for whom follow-up information was missing (Brewster et al. 2007). The follow-up time ranged from 0.02 to 22.5 years, with a median time of 9.35 years. Among the 2425 patients in the study, 350 experienced both breast cancer relapse and death during follow-up; 212 experienced only cancer relapse and remained alive throughout follow-up; 413 died without experiencing cancer relapse; and 1450 patients did not have either event observed. The censoring rate for breast cancer relapse was $76.8\,\%$ and that for death was $68.5\,\%$.

The estimated association parameter $\hat{\alpha }$ in the positive stable copula model was 0.578, with a bootstrap standard error of 0.018 from 200 resamples. The value of $\hat{\alpha }$ suggests that there was a moderate positive association between breast cancer relapse and death in the ESBCR data. A further positive association measured by the cross ratio was not constant over time, but decreased with the time to breast cancer relapse and death. Figure 4 presents the estimated survival function $\hat{S}_1$ (solid curve) of the time to breast cancer relapse with 95 % point-wise confidence intervals (dotted curve), and the naive Kaplan–Meier estimator (dashed curve) considering death as independent censoring. The substantial difference between the solid and dashed curves suggests that methods which ignore the informative censoring will provide misleading results and overestimate the survival function of breast cancer relapse. For example, the estimated disease-free survival rate at 15 years obtained by the proposed joint model is 0.636, with 95 % confidence interval (0.670, 0.705), while that obtained by the naive Kaplan–Meier estimator is 0.711, with 95 % confidence interval (0.688, 0.734).

7 Concluding remarks

In this paper, we consider semi-competing risks data for which a non-terminal event may be dependently censored by a terminal event, but not vice versa. We propose simultaneous inference on the marginal survival function of the non-terminal event time and the association between the bivariate event times. To handle the informative censoring, a key component in the modeling strategies is to select a reasonable copula model for the dependence structure, which is implemented by an effective exploratory diagnostic approach. Although we focus on a specific class of positive stable copulas suggested by the model diagnostic, the proposed estimation methods are highly applicable to general Archimedean copula models.

Different to our proposed method, the illness-death model based on crude quantities provides an alternative tool for handling semi-competing risks data. For example, Xu et al. (2010) consider the illness-death model with a shared Gamma frailty. However, the strategy of selecting an appropriate shared frailty within the illness-death model framework for the semi-competing risks data is still under investigation. Following a reviewer’s suggestion, we also consider a Cox proportional hazards model of death by including the relapse status as a time-dependent covariate to analyze the ESBCR data. The estimated hazard ratio is 3.38 (95 % CI 2.45–4.70; P value$<$0.001), which suggests that breast cancer relapse significantly increases the risk of death and confirms our conclusion that there is a positive association between cancer relapse and death. However, the simple Cox model analysis could not describe the trend of the association over time. In contrast, our method provides a convenient estimation of the association parameter between cancer relapse and death as well as an estimation of the survival function of the relapse, once the joint survival function is established.

In addition to marginal survival functions and the association between event times, it is of interest in many studies to determine the impact of pretreatment characteristics or treatments on disease progression and survival. This can be addressed by incorporating covariates and formulating their effects on the survival functions of the event times via regression models. An interesting topic for future investigation is to extend the current copula model framework and develop estimation methods for time-varying effect regression models for semi-competing risks data.

References

Brewster AM, Do KA, Thompson PA, Hahn KM, Sahin AA, Cao Y, Stewart MM, Murray JL, Hortobagyi GN, Bondy ML (2007) Relationship between epidemiologic risk factors and breast cancer recurrence. J Clin Oncol 25:4438–4444
Article Google Scholar
Chen MC, Bandeen-Roche K (2005) A diagnostic for association in bivariate survival models. Lifetime Data Anal 11:245–264
Article MathSciNet MATH Google Scholar
Chen YH (2012) Maximum likelihood analysis of semicompeting risks data with semiparametric regression models. Lifetime Data Anal 18:36–57
Article MathSciNet MATH Google Scholar
Clayton DG (1978) A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65:141–151
Article MathSciNet MATH Google Scholar
Ding AA, Shi G, Wang W, Hsieh JJ (2009) Marginal regression analysis for semi-competing risks data under dependent censoring. Scand J Stat 36:481–500
Article MathSciNet MATH Google Scholar
Fine JP, Jiang H, Chappell RJ (2001) On semi-competing risks data. Biometrika 88:907–919
Article MathSciNet MATH Google Scholar
Fleming TR, Harrington DP (2005) Counting processes and survival analysis. Wiley, Chichester
Book MATH Google Scholar
Frank M (1979) On the simultaneous associativity of f(x, y) and x + y - f(x, y). Aequ Math 19:194–226
Article MathSciNet MATH Google Scholar
Gill RD (1980) Censoring and stochastic integrals. Mathematisch Centrum, Amsterdam
MATH Google Scholar
Gumbel EJ (1960) Bivariate exponential distributions. J Am Stat Assoc 55:698–707
Article MathSciNet MATH Google Scholar
Hsieh JJ, Wang W, Ding AA (2008) Regression analysis based on semicompeting risks data. J R Stat Soc Series B Stat Methodol 70:3–20
MathSciNet MATH Google Scholar
Hsieh JJ, Huang YT (2012) Regression analysis based on conditional likelihood approach under semi-competing risks data. Lifetime Data Anal 103:302–320
Article MathSciNet MATH Google Scholar
Jiang H, Fine JP, Kosorok MR, Chappell RJ (2005) Pseudo self-consistent estimation of a copula model with informative censoring. Scand J Stat 32:1–20
Article MathSciNet MATH Google Scholar
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53:457–481
Article MathSciNet MATH Google Scholar
Oakes D (1982) A model for association in bivariate survival data. J R Stat Soc B 44:414–422
MathSciNet MATH Google Scholar
Oakes D (1989) Bivariate survival models induced by frailties. J Am Stat Assoc 84:487–493
Article MathSciNet MATH Google Scholar
Peng L, Fine JP (2007) Regression modeling of semicompeting risks data. Biometrics 63:96–108
Article MathSciNet MATH Google Scholar
Peng L, Jiang H, Chappell RJ, Fine JP (2007) Statistical advances in the biomedical sciences: clinical trials, epidemiology, survival analysis, and bioinformatics. Wiley, Hoboken
Google Scholar
Shih JH, Louis TA (1995) Inferences on the association parameters in copula models for bivariate survival data. Biometrics 51:1384–1399
Article MathSciNet MATH Google Scholar
Siegel R, Ma J, Zou Z, Jemal A (2014) Cancer statistics, 2014. CA Cancer J Clin 64:9–29
Article Google Scholar
Wang W (2003) Estimating the association parameter for copula models under dependent censoring. J R Stat Soc Series B Stat Methodol 65:257–273
Article MathSciNet MATH Google Scholar
Xu J, Kalbfeisch JD, Tai B (2010) Statistical analysis of illness-death processes and semicompeting risks data. Biometrics 66:716–725
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

The authors thank the editor, the associate editor and two reviewers for their constructive comments that have greatly improved the initial version of this paper. This work was supported in part by Cancer Center Support Grants from the National Institutes of Health (CA142543 to Hong Zhu at UT Southwestern Medical Center and CA016672 to Jing Ning at UT MD Anderson Cancer Center) and by a predoctoral fellowship grant from the Cancer Prevention Research Institute of Texas (RP140103 to Renke Zhou).

Author information

Authors and Affiliations

Duncan Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
Renke Zhou & Melissa Bondy
Division of Biostatistics, The University of Texas School of Public Health, Houston, TX, 77030, USA
Renke Zhou
Division of Biostatistics, Department of Clinical Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
Hong Zhu
Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
Jing Ning

Authors

Renke Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Hong Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Melissa Bondy
View author publications
You can also search for this author in PubMed Google Scholar
Jing Ning
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hong Zhu.

Additional information

Renke Zhou and Hong Zhu have contributed equally to this work.

Appendix

Proof of Theorem 2

Asymptotic results of $\hat{S}_1(t)$ are proved under the following regularity condition. Function $g(u, v, \alpha )$ is continuous and differentiable at u, v, and $\alpha $, respectively, and the parameter $\alpha $ lies in a compact set.

First, we show the consistency of $\hat{S}_1(t)$. We have that $\hat{S}_2(t)$ converges in probability to $S_2(t)$ uniformly for $t\in [0,\tau ]$, and $\hat{S}_Z(t)$ converges in probability to $S_Z(t)$ uniformly for $t\in [0,\tau ]$. By Theorem 1, $\hat{\alpha }$ converges in probability to $\alpha _0$. Since the function $g(u, v, \alpha )$ is a continuous function of u, v and $\alpha $, $g\{\hat{S}_2(t), \hat{S}_{Z}(t), \hat{\alpha }\}$ converges in probability to $g\{S_2(t), S_{Z}(t), \alpha _0\}$ uniformly for $t\in [0,\tau ]$. Therefore, $\hat{S}_1(t)$ is a consistent estimator of $S_1(t)$.

Next, we illustrate the asymptotic distribution of $\hat{S}_1(t)$. Applying the functional delta method to $g\{\hat{S}_2(t), \hat{S}_{Z}(t), \hat{\alpha }\}$ around $S_2(t)$, $S_Z(t)$ and $\alpha _0$, we have

$$\begin{aligned}&n^{1/2}\big [g\{\hat{S}_2(t), \hat{S}_{Z}(t), \hat{\alpha }\}-g\{S_2(t), S_Z(t),\alpha _0\}\big ]\nonumber \\&\quad \cong n^{1/2} \frac{\partial g\{S_2(t), S_Z(t),\alpha _0\}}{\partial u}(\hat{S}_2-S_2)(t)\nonumber \\&\qquad +\,n^{1/2} \frac{\partial g\{S_2(t), S_Z(t),\alpha _0\}}{\partial v} (\hat{S}_Z-S_Z)(t) \nonumber \\&\qquad +\,n^{1/2} \frac{\partial g\{S_2(t), S_Z(t),\alpha _0\}}{\partial \alpha } (\hat{\alpha }-\alpha _0). \end{aligned}$$

(8)

Using martingale representations for $\hat{S}_2$ and $\hat{S}_Z$ (Gill 1980), the sum of the first and second terms in (8) is asymptotically equivalent to

$$\begin{aligned} n^{-\frac{1}{2}}\left[ \sum _{i=1}^n \frac{\partial g\{S_2(t), S_Z(t),\alpha _0\}}{\partial u} I_1^0(Y_{i},\delta _{2i})+\frac{\partial g\{S_2(t), S_Z(t),\alpha _0\}}{\partial v} I_2^0(Z_{i},\delta _{zi})\right] , \end{aligned}$$

(9)

which is a sum of n independent and identically distributed random variables, with $I_1^0(Y_{i},\delta _{2i})$ and $I_2^0(Z_{i},\delta _{zi})$ defined as in the previous section. Also the expectation of each term in (9) is zero. By the central limit theorem, (9) converges weakly to a normal distribution with mean zero and variance $\omega ^2(t)$. By Theorem 1, $n^{1/2}(\hat{\alpha }-\alpha _0)$ converges weakly to a normal distribution with mean zero and variance $\rho ^2$. Therefore, the third term in (8) is asymptotically equivalent to

$$\begin{aligned} n^{-1/2} \frac{\partial g\{S_2(t), S_Z(t),\alpha \}}{\partial \alpha } \rho \sum _{i=1}^n \frac{\partial l_0^g\{\alpha , \hat{S}_2(Y_{i}), \hat{S}_{Z}(Z_i)\}}{\partial \alpha }, \end{aligned}$$

(10)

which is a sum of n independent and identically distributed random variables.

Moreover, we have

$$\begin{aligned}&E \left( \Bigg [\frac{\partial l_0^g\{\alpha , \hat{S}_2, \hat{S}_{Z}\}}{\partial \alpha }\Bigg ]\{I_1^0(Y_{i},\delta _{2i})+I_2^0(Z_{i},\delta _{zi})\}\right) \\&\quad =E\left( \{I_1^0(Y_{i},\delta _{2i})+I_2^0(Z_{i},\delta _{zi})\}E\Bigg [\frac{\partial l_0^g\{\alpha , \hat{S}_2, \hat{S}_{Z}\}}{\partial \alpha }\mid Y_{i}, Z_i,\delta _{2i},\delta _{zi} \Bigg ]\right) =0,\nonumber \end{aligned}$$

(11)

which means that (9) and (10) are asymptotically orthogonal. Therefore, (9), (10) and (11) imply that as $n\rightarrow \infty $, the process $n^{1/2}\{\hat{S}_1(t)-S_1(t)\}$ converges weakly to a zero-mean Gaussian process for $t\in [0,\tau ]$ with covariance function $\Big [\frac{\partial g\{S_2(t), S_{Z}(t), \alpha \}}{\partial \alpha }\Big ]^2\rho ^2+\omega ^2(t)$. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhou, R., Zhu, H., Bondy, M. et al. Semiparametric model for semi-competing risks data with application to breast cancer study. Lifetime Data Anal 22, 456–471 (2016). https://doi.org/10.1007/s10985-015-9344-x

Download citation

Received: 12 September 2014
Accepted: 25 August 2015
Published: 05 September 2015
Issue Date: July 2016
DOI: https://doi.org/10.1007/s10985-015-9344-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Semiparametric model for semi-competing risks data with application to breast cancer study

Abstract

Similar content being viewed by others

Bayesian and Non-Bayesian Estimation for the Bivariate Inverse Weibull Distribution Under Progressive Type-II Censoring

Efficient Estimation of the Additive Risks Model for Interval-Censored Data

A dependent Dirichlet process model for survival data with competing risks

1 Introduction