5.1 Introduction

Recent years have witnessed a rapid increase in the use of genetic covariates to build survival prediction models in biomedical research. Accurate prediction of survival is often possible by incorporating genetic covariates into prediction models, as reported in breast cancer (Jenssen et al. 2002; Sabatier et al. 2011; Zhao et al. 2011), diffuse large-B-cell lymphoma (Lossos et al. 2004; Alizadeh et al. 2011), lung cancer (Beer et al. 2002; Chen et al. 2007; Shedden et al. 2008), ovarian cancer (Popple et al. 2012; Yoshihara et al. 2010, 2012; Waldron et al. 2014), and other cancers. Evaluating the predictive accuracy of survival prediction models has been a challenging area of research owing to the high dimensionality of gene expression data (Michiels et al. 2005; Schumacher et al. 2007; Bøvelstad et al. 2007, 2009; Witten and Tibshirani 2010; Zhao et al. 2014; Emura et al. 2017).

To overcome the difficulty of handling the high-dimensional genetic covariates, one often needs to obtain a small fraction of genes that are predictive of survival. The traditional approach, called univariate selection, is a forward variable selection method based on the univariate association between each gene and survival, where the association is measured through univariate Cox regression. A predictor constructed from the selected genes has been shown to be useful for survival prediction (Beer et al. 2002; Wang et al. 2005; Matsui 2006; Chen et al. 2007; Matsui et al. 2012; Emura et al. 2017).

It is well known that Cox regression relies on the independent censoring assumption. As discussed in Chap. 3, this assumption seems unrealistic in univariate Cox regression, where many covariates are omitted. If the independent censoring assumption is violated, univariate Cox regression may not correctly capture the effect of each gene and thus may fail to select useful genes. Accordingly, the resultant predictor based on the selected genes may have a reduced ability to predict survival.

Emura and Chen (2016) introduced a copula-based method for performing gene selection. With this method, dependence between survival and censoring times is modeled via a copula, thereby relaxing the independent censoring assumption. In the subsequent discussions, we revisit their method by providing more detailed developments than the original paper. We have made the lung cancer data publicly available in the compound.Cox R package (Emura et al. 2018) to enhance reproducibility.

The chapter is organized as follows. Section 5.2 reviews the conventional univariate selection. Sections 5.3–5.5 introduce the copula-based method of Emura and Chen (2016). Section 5.6 includes the analysis of the non-small-cell lung cancer data for illustration. Section 5.7 provides discussions.

5.2 Univariate Selection

Univariate selection is the traditional method for selecting a subset of genes that is predictive of survival. As the initial step, one fits the univariate Cox model for each gene, one-by-one. Then, one selects a subset of genes that are univariately associated with survival. Finally, one builds a multi-gene predictor from the selected subset for the purpose of survival prediction. The predictor is usually a weighted sum of gene expressions whose weights reflect the degree of association.

Let \( {\mathbf{x}} = (\;x_{1} ,\; \ldots ,\;x_{p} \;)^{\prime } \) be a p-dimensional vector of gene expressions, where the dimension p can be large. Let T be survival time having the hazard function \( h(t|{\mathbf{x}}) = \Pr (\;t \le T < t + dt\;|T \ge t,\;{\mathbf{x}}\;)/dt \). It is well known that the multivariate Cox model \( h(t|{\mathbf{x}}) = h_{0} (t)\exp ({\varvec{\upbeta}}^{\prime } {\mathbf{x}}) \) does not yield proper estimates of \( {\varvec{\upbeta}} \) when p is very large (Witten and Tibshirani 2010).

In biomedical research, univariate Cox regression analysis is the traditional strategy to deal with the large number of covariates (e.g., Beer et al. 2002; Chen et al. 2007). Let \( h_{j} (t|x_{j} ) = \Pr (\;t \le T < t + dt\;|T \ge t,\;x_{j} \;)/dt \) be the hazard function given the jth gene. The univariate Cox model is specified as \( h_{j} (t|x_{j} ) = h_{0j} (t)\exp (\beta_{j} x_{j} ) \) for each gene \( j = 1,\; \ldots ,\;p \). The primary objective of using the univariate Cox model is to perform univariate selection as follows: for each \( j = 1,\; \ldots ,\;p \), the null hypothesis \( H_{0} :\beta_{j} = 0 \) is examined by the Wald test (or score test) under the univariate Cox model. Then one selects the subset of genes with low P-values from these tests for further analysis.

After genes are selected, they are used to build a prediction scheme for survival. In medical studies, it is a common practice to re-fit a multivariate Cox regression model based on the selected genes (e.g., Lossos et al. 2004). However, we have reservations about this commonly used strategy due to the poor predictive performance observed in many papers (e.g., Bøvelstad et al. 2007; van Wieringen et al. 2009). Alternatively, we suggest using Tukey's compound covariate predictor (Tukey 1993) that combines the results of univariate analyses without going through a multivariate analysis. The compound covariate has been successfully employed in many medical studies (e.g., Beer et al. 2002; Wang et al. 2005; Chen et al. 2007) and biostatistical studies (Matsui 2006; Matsui et al. 2012; Emura et al. 2012, 2017).
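To make the procedure concrete, the following R sketch performs the conventional univariate selection and builds a compound covariate predictor with the survival package. The function name and the inputs (time, status, a gene expression matrix X) are hypothetical; this is not the implementation of the compound.Cox package.

```r
library(survival)

## Conventional univariate selection and Tukey's compound covariate (a sketch).
## Inputs: time (n), status (n; 1 = death, 0 = censored), X (n x p gene matrix).
univariate.selection <- function(time, status, X, P.cut = 0.05) {
  p <- ncol(X)
  beta <- SE <- P <- numeric(p)
  for (j in 1:p) {
    fit <- coxph(Surv(time, status) ~ X[, j])              # univariate Cox model
    beta[j] <- coef(fit)
    SE[j]   <- sqrt(vcov(fit)[1, 1])
    P[j]    <- summary(fit)$coefficients[1, "Pr(>|z|)"]    # Wald P-value
  }
  selected <- which(P < P.cut)                              # genes with low P-values
  ## compound covariate: weighted sum of the selected genes
  PI <- as.vector(X[, selected, drop = FALSE] %*% beta[selected])
  list(beta = beta, SE = SE, P = P, selected = selected, PI = PI)
}
```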

The two major assumptions of univariate selection are the correctness of the univariate Cox model and the independent censoring assumption. The violation of these assumptions yields bias in estimating the true effect of genes. Emura and Chen (2016) argued that the independence of censoring is a more crucial assumption than the correctness of the univariate Cox model. The bias due to dependent censoring gets large if either the degree of dependence or the percentage of censoring increases (see Sect. 3.5). In the following sections, we shall introduce a copula-based univariate selection method that copes with the problem of dependent censoring.

5.3 Copula-Based Univariate Cox Regression

Let T be survival time, U be censoring time, and \( {\mathbf{x}} = (\;x_{1} ,\; \ldots ,\;x_{p} \;)^{\prime } \) be gene expressions. The joint distribution of T and U can have an arbitrary dependence pattern for any given \( x_{j} \). Sklar's theorem (Sklar 1959; Nelsen 2006) guarantees that the joint survival function is expressed as

$$ \Pr (\;T > t\;,\;U > u|x_{j} \;) = C_{j} \{ \;\Pr (\;T > t\;|x_{j} \;)\;,\;\Pr (\;U > u\;|x_{j} \;)\;\} ,\quad j = 1,\; \ldots ,\;p, $$

where \( C_{j} \) is a copula. The independent censoring assumption corresponds to \( C_{j} (u,\;v) = uv \) for \( j = 1,\; \ldots ,\;p \), namely,

$$ \Pr (\;T > t\;,\;U > u|x_{j} \;) = \Pr (\;T > t\;|x_{j} \;) \times \,\Pr (\;U > u\;|x_{j} \;),\quad j = 1,\; \ldots ,\;p. $$
(5.1)

This is clearly a strong assumption (Chap. 3).

To relax the independent censoring assumption, Emura and Chen (2016) suggested a one-parameter copula model

$$ \Pr (\;T > t\;,\;U > u|x_{j} \,) = C_{\alpha } \{ \;\Pr (\;T > t\;|x_{j} \;),\;\Pr (\;U > u|x_{j} \;)\;\} ,\quad j = 1,\; \ldots ,\;p. $$
(5.2)

Since the same copula \( C_{\alpha } \) is assumed for every j, this assumption may still be strong. Nevertheless, the copula relaxes the independent censoring assumption (5.1) by allowing the dependence parameter α to be flexibly chosen by users. One example is the Clayton copula

$$ C_{\alpha } (\;u,\;v\;) = (\;u^{ - \alpha } + v^{ - \alpha } - 1\;)^{ - 1/\alpha } ,\quad \quad \alpha > 0, $$

where the parameter α is related to Kendall’s tau through \( \tau = \alpha /(\alpha + 2) \). The copula model (5.2) reduces to the independent censoring model (5.1) by letting α → 0.
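As a small illustration, the Clayton copula and the α–τ conversion can be coded in a few lines of R; the function names below are ours and are used only for this sketch.

```r
## Clayton copula and its Kendall's tau (alpha > 0)
C.Clayton      <- function(u, v, alpha) (u^(-alpha) + v^(-alpha) - 1)^(-1/alpha)
tau.Clayton    <- function(alpha) alpha / (alpha + 2)   # Kendall's tau
alpha.from.tau <- function(tau) 2 * tau / (1 - tau)     # inverse map

tau.Clayton(18)                     # 0.9, the value selected in Sect. 5.6
C.Clayton(0.5, 0.5, alpha = 1e-8)   # ~ 0.25 = 0.5 * 0.5, the independence limit
```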

For marginal distributions, Emura and Chen (2016) assumed the Cox models

$$ \Pr (\;T > t\;|x_{j} \;) = \exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{j} }} \;\} ,\quad \,\Pr (\;U > u\;|x_{j} \;) = \exp \{ \; -\Gamma _{0j} (u)e^{{\gamma_{j} x_{j} }} \;\} , $$
(5.3)

where \( \beta_{j} \) and \( \gamma_{j} \) are regression coefficients, and \( \Lambda _{0j} \) and \( \Gamma _{0j} \) are baseline cumulative hazard functions.

For the purpose of gene selection, the target parameter is \( \beta_{j} \), the univariate effect of the jth gene on survival. The other parameters \( (\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} \;) \) are nuisance parameters. Under the independent censoring model (5.1), one can use the partial likelihood to estimate \( \beta_{j} \) while ignoring the nuisance parameters. However, under the copula model (5.2), the partial likelihood estimator gives an inconsistent estimate of \( \beta_{j} \) (Chap. 3).

The full likelihood is necessary to consistently estimate \( (\;\beta_{j} ,\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} \;) \) under the copula model (5.2) and the Cox models (5.3). Define the notations

$$ \begin{aligned} D_{\alpha ,1} (u,\;v) & = \frac{{\partial C_{\alpha } (u,\;v)/\partial u}}{{C_{\alpha } (u,\;v)}} = - \frac{{\partial\Phi _{\alpha } (u,\;v)}}{\partial u}, \\ D_{\alpha ,2} (u,\;v) & = \frac{{\partial C_{\alpha } (u,\;v)/\partial v}}{{C_{\alpha } (u,\;v)}} = - \frac{{\partial\Phi _{\alpha } (u,\;v)}}{\partial v}, \\ \end{aligned} $$

where \( \Phi _{\alpha } (u,v) = - \log C_{\alpha } (u,v) \). The observed data are denoted as \( \{ (t_{i} ,\;\delta_{i} ,\;x_{ij} ),\;i = 1,\; \ldots ,\;n\} \), where \( t_{i} = \min (T_{i} ,\;U_{i} ) \) and \( \delta_{i} = {\mathbf{I}}(T_{i} \le U_{i} ) \), and \( {\mathbf{I}}( \cdot ) \) is the indicator function. As in Chen (2010), we treat \( \Lambda _{0j} \) and \( \Gamma _{0j} \) as increasing step functions with jump sizes \( d\Lambda _{0j} (t_{i} ) =\Lambda _{0j} (t_{i} ) -\Lambda _{0j} (t_{i} - dt) \) for \( \delta_{i} = 1 \) and \( d\Gamma _{0j} (t_{i} ) =\Gamma _{0j} (t_{i} ) -\Gamma _{0j} (t_{i} - dt) \) for \( \delta_{i} = 0 \). For any given α, the log-likelihood is defined as

$$ \begin{aligned} \ell (\beta_{j} ,\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} |\alpha ) & = \sum\limits_{i} {\delta_{i} [\;\beta_{j} x_{ij} + \log \eta_{1ij}^{{}} (t_{i} ;\;\beta_{j} ,\;\gamma_{j} ,\Lambda _{0j} ,\;\Gamma _{0j} |\alpha ) + \log d\Lambda _{0j} (t_{i} )\;]} \\ \quad & + \sum\limits_{i} {(1 - \delta_{i} )[\;\gamma_{j} x_{ij} + \log \eta_{2ij}^{{}} (t_{i} ;\;\beta_{j} ,\;\gamma_{j} ,\Lambda _{0j} ,\;\Gamma _{0j} |\alpha ) + \log d\Gamma _{0j} (t_{i} )\;]} \\ \quad & - \sum\limits_{i} {\Phi _{\alpha } [\;\exp \{ \; -\Lambda _{0j} (t_{i} )e^{{\beta_{j} x_{ij} }} \;\} ,\;\exp \{ \; -\Gamma _{0j} (t_{i} )e^{{\gamma_{j} x_{ij} }} \;\} ]} \;, \\ \end{aligned} $$
(5.4)

where

$$ \begin{aligned} \eta_{1ij} (\;t;\;\beta_{j} ,\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} |\alpha \;) = \exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{ij} }} \;\} D_{\alpha ,1} [\;\exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{ij} }} \;\} ,\;\exp \{ \; -\Gamma _{0j} (t)e^{{\gamma_{j} x_{ij} }} \;\} ], \hfill \\ \eta_{2ij} (\;t;\;\beta_{j} ,\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} |\alpha \;) = \exp \{ \; -\Gamma _{0j} (t)e^{{\gamma_{j} x_{ij} }} \;\} D_{\alpha ,2} [\;\exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{ij} }} \;\} ,\;\exp \{ \; -\Gamma _{0j} (t)e^{{\gamma_{j} x_{ij} }} \;\} ]. \hfill \\ \end{aligned} $$

The maximizer of Eq. (5.4) given \( \alpha \) is denoted as \( (\;{\hat{\beta }}_{j} (\alpha ),\;{\hat{\gamma }}_{j} (\alpha ),\;{\hat{\Lambda }}_{0j} (\alpha ),\;{\hat{\Gamma }}_{0j} (\alpha )\;). \) The standard error \( SE\{ \;\hat{\beta }_{j} (\alpha )\;\} \) is computed from the information matrix (Chen 2010).

The log-likelihood in Eq. (5.4) can be easily computed under the Clayton copula. It can be shown that \( \Phi _{\alpha } (u,v) = \alpha^{ - 1} \log (u^{ - \alpha } + v^{ - \alpha } - 1) \), \( D_{\alpha ,1} (u,v) = u^{ - \alpha - 1} (u^{ - \alpha } + v^{ - \alpha } - 1)^{ - 1} \), and \( D_{\alpha ,2} (u,v) = v^{ - \alpha - 1} (u^{ - \alpha } + v^{ - \alpha } - 1)^{ - 1} \). Hence,

$$ \begin{aligned} \eta_{1ij} (\;t;\;\beta_{j} ,\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} |\alpha \;) & = \frac{{[\exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{ij} }} \;\} ]^{ - \alpha } }}{{[\exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{ij} }} \;\} ]^{ - \alpha } + [\exp \{ \; -\Gamma _{0j} (t)e^{{\gamma_{j} x_{ij} }} \;\} ]^{ - \alpha } - 1}}, \\ \eta_{2ij} (\;t;\;\beta_{j} ,\;\gamma_{j} ,\;\Lambda _{0j} ,\;\Gamma _{0j} |\alpha \;) & = \frac{{[\exp \{ \; -\Gamma _{0j} (t)e^{{\gamma_{j} x_{ij} }} \;\} ]^{ - \alpha } }}{{[\exp \{ \; -\Lambda _{0j} (t)e^{{\beta_{j} x_{ij} }} \;\} ]^{ - \alpha } + [\exp \{ \; -\Gamma _{0j} (t)e^{{\gamma_{j} x_{ij} }} \;\} ]^{ - \alpha } - 1}}. \\ \end{aligned} $$

One can apply these formulas to Eq. (5.4) to calculate the log-likelihood function and maximize it by optimization algorithms.
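The following R sketch illustrates this computation for a single gene under the Clayton copula, maximizing Eq. (5.4) with the nlm function. The function names (dependCox.loglik, dependCox.reg.sketch) are ours, and this is only a minimal illustration assuming no tied event times; the compound.Cox package contains the implementation actually used in this chapter.

```r
## Negative log-likelihood of Eq. (5.4) under the Clayton copula (a sketch).
## par = (beta, gamma, log jump sizes of Lambda_0j, log jump sizes of Gamma_0j).
dependCox.loglik <- function(par, t, d, x, alpha) {
  m1 <- sum(d == 1); m0 <- sum(d == 0)
  beta  <- par[1]
  gamma <- par[2]
  dL <- exp(par[3:(2 + m1)])                  # jump sizes of Lambda_0j at event times
  dG <- exp(par[(3 + m1):(2 + m1 + m0)])      # jump sizes of Gamma_0j at censoring times
  te <- t[d == 1]; tc <- t[d == 0]
  Lam <- sapply(t, function(s) sum(dL[te <= s]))   # Lambda_0j(t_i)
  Gam <- sapply(t, function(s) sum(dG[tc <= s]))   # Gamma_0j(t_i)
  ST <- exp(-Lam * exp(beta  * x))            # Pr(T > t_i | x_i), Eq. (5.3)
  SU <- exp(-Gam * exp(gamma * x))            # Pr(U > t_i | x_i), Eq. (5.3)
  A  <- ST^(-alpha) + SU^(-alpha) - 1         # common Clayton term
  eta1 <- ST^(-alpha) / A                     # eta_{1ij} under the Clayton copula
  eta2 <- SU^(-alpha) / A                     # eta_{2ij} under the Clayton copula
  Phi  <- log(A) / alpha                      # Phi_alpha(ST, SU)
  loglik <- sum(d * (beta * x + log(eta1)) + (1 - d) * (gamma * x + log(eta2))) +
            sum(log(dL)) + sum(log(dG)) - sum(Phi)
  -loglik                                     # nlm() minimizes
}

dependCox.reg.sketch <- function(t, d, x, alpha) {
  n <- length(t); m1 <- sum(d == 1); m0 <- sum(d == 0)
  init <- c(0, 0, rep(log(1/n), m1 + m0))     # beta = gamma = 0, jump sizes = 1/n
  fit  <- nlm(dependCox.loglik, init, t = t, d = d, x = x, alpha = alpha,
              hessian = TRUE)
  beta.hat <- fit$estimate[1]
  SE <- sqrt(solve(fit$hessian)[1, 1])        # SE from the observed information
  c(beta = beta.hat, SE = SE, Z = beta.hat / SE)
}
```

For each gene j, dependCox.reg.sketch(t, d, X[, j], alpha) returns \( \hat{\beta }_{j} (\alpha ) \), its standard error, and the Wald Z-statistic.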

We implemented the computation of \( \hat{\beta }_{j} (\alpha ) \) and \( SE\{ \hat{\beta }_{j} (\alpha )\} \) in the compound.Cox R package (Emura et al. 2018). In the package, the maximization of Eq. (5.4) is performed by the nlm function after log-transforming the jump sizes, i.e., using \( \log \,d\Lambda _{0j} (t_{i} ) \) and \( \log \,d\Gamma _{0j} (t_{i} ) \) as parameters. The package uses the initial values \( \beta_{j} = \gamma_{j} = 0 \) and \( d\Lambda _{0j} (t_{i} ) = d\Gamma _{0j} (t_{i} ) = 1/n \).

Technical remarks: Theoretically, if α ↓ 0, \( \hat{\beta }_{j} (\alpha ) \) approaches the partial likelihood estimate of \( \beta_{j} \). Numerically, however, a value of α too close to zero makes the likelihood optimization unstable. Hence, we set \( \hat{\beta }_{j} (\alpha ) = \hat{\beta }_{j} (0.01) \) for 0 ≤ α < 0.01 in the package. The value of \( \hat{\beta }_{j} (0.01) \) is almost the same as the partial likelihood estimate.

5.4 Copula-Based Univariate Selection

One can use the copula-based method in Sect. 5.3 to perform univariate selection adjusted for the effect of dependent censoring. The P-value for testing the null hypothesis \( H_{0} :\beta_{j} = 0 \) is computed by the Wald test based on the Z-statistic \( \hat{\beta }_{j} (\alpha )/SE\{ \hat{\beta }_{j} (\alpha )\} \). One can select a subset of genes according to the P-values. With α ≈ 0 in the Clayton copula, one has \( C_{\alpha } (u,\;v) \approx uv \). Hence, the resultant test is approximately equal to the Wald test under univariate Cox regression. In this sense, the copula-based test is a generalization of the conventional univariate selection.

For a future subject with a covariate vector \( {\mathbf{x}} = (x_{1} ,\; \ldots ,\;x_{p} )^{\prime } \), survival prediction can be made by the prognostic index (PI) defined as \( {\hat{\varvec{\upbeta}}}(\alpha )^{\prime } {\mathbf{x}}, \) where \( {\hat{\varvec{\upbeta}}}(\alpha )^{\prime } = (\;\hat{\beta }_{1} (\alpha ),\; \cdots ,\;\hat{\beta }_{p} (\alpha )\;). \) The PI is a weighted sum of genes whose weights reflect the degree of univariate association. If \( \alpha = 0, \) one obtains PI = \( {\hat{\varvec{\upbeta}}}(0)^{\prime } {\mathbf{x}} \), which is equal to the compound covariate based on univariate Cox regression under the independent censoring assumption (Matsui 2006; Emura et al. 2012).
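A short R sketch of this selection step, reusing dependCox.reg.sketch() from the sketch in Sect. 5.3, might look as follows; the wrapper name and the default cutoff are ours.

```r
## Copula-based univariate selection at a fixed alpha (a sketch).
copula.uni.selection <- function(t, d, X, alpha, n.top = 16) {
  res <- t(apply(X, 2, function(xj) dependCox.reg.sketch(t, d, xj, alpha)))
  P   <- 2 * pnorm(-abs(res[, "Z"]))            # Wald P-values
  top <- order(P)[1:n.top]                      # genes with the smallest P-values
  PI  <- as.vector(X[, top, drop = FALSE] %*% res[top, "beta"])
  list(beta = res[, "beta"], SE = res[, "SE"], P = P, selected = top, PI = PI)
}
```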

5.5 Choosing the Copula Parameter by the C-Index

Estimation of the copula parameter α is inherently difficult due to the non-identifiability of competing risks data (Tsiatis 1975). An estimator maximizing the profile log-likelihood for α based on Eq. (5.4) typically shows very large sampling variation (Chen 2010). In our experience, the profile likelihood often has a peak at extreme values; for instance, either \( \alpha \approx 0 \) or \( \alpha \approx \infty \) under the Clayton copula. These undesirable properties make the likelihood-based strategy less useful.

Following Emura and Chen (2016), we introduce a prediction-based strategy for choosing α. A widely used predictive measure is the cross-validated partial likelihood (Verweij and van Houwelingen 1993). Unfortunately, the partial likelihood is not a valid likelihood under dependent censoring.

A more plausible predictive measure under dependent censoring is Harrell’s c-index (Harrell et al. 1982). The interpretation of the c-index does not depend on a specific model. We adopt a cross-validated version of the c-index defined as follows.

We calculate the c-index based on K-fold cross-validation. We first divide the n patients into K groups of approximately equal sample sizes. This process can be specified by a function \( \kappa :\left\{ {1, \ldots ,n} \right\} \mapsto \left\{ {1, \ldots ,K} \right\} \) indicating the group to which each patient is allocated (Hastie et al. 2009). For each patient i, define the PI:

$$ {\text{PI}}_{i} (\alpha )= {\hat{\varvec{\upbeta}}}^{\prime }_{ - \kappa (i)} (\alpha ){\mathbf{x}}_{i} = \;\hat{\beta }_{1, - \kappa (i)} (\alpha )x_{i1} + \; \cdots + \hat{\beta }_{p, - \kappa (i)} (\alpha )x_{ip} , $$

where \( \hat{\beta }_{j, - \kappa (i)} (\alpha ) \) is obtained based on Eq. (5.4) with the κ(i)th group of patients removed. In this way, \( {\text{PI}}_{i} (\alpha ) \) is a predictor of the survival outcome \( (t_{i} ,\;\delta_{i} ) \) for patient i. We define the cross-validated c-index:

$$ CV(\alpha ) = \frac{{\sum\limits_{i < j} {\{ \;{\mathbf{I}}(\;t_{i} < t_{j} \;){\mathbf{I}}(\;{\text{PI}}_{i} (\alpha ) > {\text{PI}}_{j} (\alpha )\;)\delta_{i} + {\mathbf{I}}(\;t_{j} < t_{i} \;){\mathbf{I}}(\;{\text{PI}}_{j} (\alpha ) > {\text{PI}}_{i} (\alpha )\;)\delta_{j} \;\} } }}{{\sum\limits_{i < j} {\{ \;{\mathbf{I}}(\;t_{i} < t_{j} \;)\delta_{i} + {\mathbf{I}}(\;t_{j} < t_{i} \;)\delta_{j} \;\} } }}. $$

Finally, we define \( \hat{\alpha } \) as the maximizer of CV(α). We recommend K = 5, which is often used when \( n \) or p is large.

It is computationally demanding to obtain a high-dimensional vector \( {\hat{\varvec{\upbeta}}}_{ - \kappa (i)} (\alpha ) \) for every group κ(i). To reduce the computational cost, we suggest reducing the number p by an initial univariate selection under α = 0, e.g., retaining genes with P-value < 0.2. This technique is applied in the subsequent data analysis.
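A minimal R sketch of CV(α), again reusing dependCox.reg.sketch() and assuming that p has already been reduced by the initial screening just described, is given below; the function name CV.cindex is ours.

```r
## K-fold cross-validated c-index for a given alpha (a sketch).
CV.cindex <- function(t, d, X, alpha, K = 5) {
  n <- length(t)
  kappa <- sample(rep(1:K, length.out = n))        # random group allocation
  PI <- numeric(n)
  for (k in 1:K) {
    beta.k <- apply(X[kappa != k, , drop = FALSE], 2, function(xj)
      dependCox.reg.sketch(t[kappa != k], d[kappa != k], xj, alpha)["beta"])
    PI[kappa == k] <- as.vector(X[kappa == k, , drop = FALSE] %*% beta.k)
  }
  num <- den <- 0                                  # Harrell's c-index on (t, d, PI)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    num <- num + (t[i] < t[j]) * (PI[i] > PI[j]) * d[i] +
                 (t[j] < t[i]) * (PI[j] > PI[i]) * d[j]
    den <- den + (t[i] < t[j]) * d[i] + (t[j] < t[i]) * d[j]
  }
  num / den
}

## Grid search for alpha.hat, e.g., over Kendall's tau = 0.1, ..., 0.9
## (alpha = 0.01 can be added to approximate independence; see Sect. 5.3):
## tau.grid   <- seq(0.1, 0.9, by = 0.1)
## alpha.grid <- 2 * tau.grid / (1 - tau.grid)
## cv.values  <- sapply(alpha.grid, function(a) CV.cindex(t, d, X, a, K = 5))
## alpha.hat  <- alpha.grid[which.max(cv.values)]
```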

A graphical diagnostic plot of \( CV(\alpha ) \) is informative for seeing how the proposed method of choosing \( \hat{\alpha } \) works. We suggest using a grid search to find the approximate value of \( \hat{\alpha } \) and plotting the values of CV(α) against the grid points. Figure 5.1 shows the plots of CV(α) with simulated data under our previously considered setting (Case 2 of Table 2 in Emura and Chen 2016). The figure shows that \( CV(\hat{\alpha }) \) is noticeably larger than \( CV(0) \). This suggests that \( {\text{PI}}_{i} (\hat{\alpha } ) \) has a better ability to predict survival than \( {\text{PI}}_{i} (0 ) \) does.

Fig. 5.1 Six replications of the cross-validated c-index \( CV(\alpha ) \). The maximum of CV(α) is signified by a triangle (in red)

5.6 Lung Cancer Data Analysis

We analyze the survival data on the non-small-cell lung cancer patients of Chen et al. (2007). The data analysis was performed previously by Emura and Chen (2016) using the copula-based methods. Here, we update the analysis based on the data available in the compound.Cox R package, providing more detailed explanations than the original analysis. In addition, this demonstration allows researchers to reproduce all the results easily through R.

In the lung cancer data, the primary endpoint is overall survival, i.e., time-to-death. During the follow-up, 38 patients died and the remaining 87 patients were censored. The 125 patients were split into a training set (63 patients) and a testing set (62 patients) in the same manner as Chen et al. (2007).

The Lung object in the compound.Cox R package contains the censored survival times \( t_{i} \), censoring indicators \( \delta_{i} \), training/testing indicators, and gene expressions \( {\mathbf{x}}_{i} = (x_{i1} ,\; \ldots ,\;x_{ip} )^{\prime } \) for the 125 patients. Available are p = 97 gene expressions that satisfy P-value < 0.20 under the usual univariate selection performed on the training set. All the gene expressions were coded as 1, 2, 3, or 4 according to Chen et al. (2007). In the original analysis of Chen et al. (2007), univariate selection yielded 16 genes with P-value < 0.05. In our analysis, we shall apply the copula-based univariate selection to select 16 genes.
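For readers who wish to reproduce the analysis, the data can be loaded as sketched below. We assume here that the first three columns of the Lung data frame are the survival times (t.vec), censoring indicators (d.vec), and training/testing indicators (train); please verify the column names with the package documentation (?Lung).

```r
library(compound.Cox)
data(Lung)
train   <- Lung$train == TRUE                 # 63 training patients
t.train <- Lung$t.vec[train]                  # censored survival times
d.train <- Lung$d.vec[train]                  # censoring indicators
X.train <- as.matrix(Lung[train, -(1:3)])     # 97 gene expressions coded 1-4
```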

5.6.1 Gene Selection and Prediction

We applied the copula-based univariate Cox regression to the 63 patients (training set) by using the R code available in Appendix B. Here, we used K = 5 cross-validation to examine the diagnostic plot of CV(α). The outputs are shown below:

Here, \( \$ {\text{beta}} = \hat{\beta }_{j} (\hat{\alpha }) \), \( \$ {\text{SE}} = SE\{ \;\hat{\beta }_{j} (\hat{\alpha })\;\} \), \( \$ {\text{Z}} = \hat{\beta }_{j} (\hat{\alpha })/SE\{ \;\hat{\beta }_{j} (\hat{\alpha })\;\} \), and $P is the P-value for each \( j = 1,\; \ldots ,\;97 \). Also, \( \$ {\text{alpha}} = \hat{\alpha } \) and \( \$ {\text{c}}\_{\text{index}} = CV(\hat{\alpha }) \).

Figure 5.2 displays the diagnostic plot of the cross-validated c-index CV(α) calculated on the 63 patients (training set). The c-index is maximized at the copula parameter \( \hat{\alpha } = 18 \) (Kendall’s tau = 0.90). This implies a possible gain in prediction accuracy by using the Clayton copula for dependent censoring.

Fig. 5.2 Plot of CV(α) (the cross-validated c-index) based on the lung cancer data. The value of CV(α) is maximized at α = 18 (Kendall's tau = 0.90)

We selected 16 genes among the 97 genes according to the P-values. The outputs are shown below:

The resultant PI is defined as \( {\text{PI}} = \hat{\beta }_{1} (\hat{\alpha })x_{1} + \cdots + \hat{\beta }_{16} (\hat{\alpha })x_{16} \), where \( (x_{1} ,\; \ldots ,\;x_{16} ) \) are the expressions of the 16 selected genes. Accordingly,

$$ \begin{aligned} {\text{PI}} & = (0.51 \times {\text{MMP16}}) + (0.51 \times {\text{ZNF264}}) + (0.50 \times {\text{HGF}}) + ( - 0.49 \times {\text{HCK}}) + (0.47 \times {\text{NF1}}) \\ & \quad + (0.46 \times {\text{ERBB3}}) + (0.57 \times {\text{NR2F6}}) + (0.77 \times {\text{AXL}}) + (0.51 \times {\text{CDC23}}) + (0.92 \times {\text{DLG2}}) \\ & \quad + ( - 0.34 \times {\text{IGF2}}) + (0.54 \times {\text{RBBP6}}) + (0.51 \times {\text{COX11}}) + (0.40 \times {\text{DUSP6}}) + ( - 0.37 \times {\text{ENG}}) \\ & \quad + ( - 0.41 \times {\text{IHPK1}}). \\ \end{aligned} $$

5.6.2 Assessing Prediction Performance

To validate the ability of the PI to predict overall survival, we separate the 62 testing patients into two groups of equal size: 31 good prognosis patients with low PIs and 31 poor prognosis patients with high PIs. We then calculate a survival curve for each group (Fig. 5.3).

Fig. 5.3 Survival curves for the good and poor prognosis groups. The good (or poor) group is determined by the low (or high) values of the PI. Censored patients are indicated by the mark “+”

The prediction performance of the PI can be measured by the difference between the two survival curves in Fig. 5.3. The two survival curves were calculated by the copula-graphic estimator (Rivest and Wells 2001), which adjusts for the effect of dependent censoring with the Clayton copula at \( \hat{\alpha } = 18 \) (Kendall's tau = 0.90). This approach may be preferable to the conventional log-rank test, which measures the difference between two Kaplan–Meier estimators that are biased under dependent censoring.

Under the Clayton copula model, the copula-graphic (CG) estimator (Chap. 4) is defined as

$$ \hat{S}^{CG} (\;t\;) = \left[ {1 + \sum\limits_{{t_{i} \le t,\;\delta_{i} = 1}} {\left\{ {\left( {\frac{{n_{i} - 1}}{n}} \right)^{{ - \hat{\alpha }}} - \left( {\frac{{n_{i} }}{n}} \right)^{{ - \hat{\alpha }}} } \right\}} } \right]^{{ - 1/\hat{\alpha }}} , $$

where \( n_{i} = \sum\nolimits_{j = 1}^{n} {{\mathbf{I}}(t_{j} \ge t_{i} )} \) is the number at risk at time \( t_{i} \). We computed the CG estimator by using the compound.Cox R package (Emura et al. 2018).
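A from-scratch R sketch of this estimator is shown below; the function name CG.estimator is ours, and the chapter's own computations use the compound.Cox package instead.

```r
## Clayton copula-graphic (CG) estimator evaluated at the time points t.eval (a sketch).
CG.estimator <- function(time, status, alpha, t.eval) {
  n <- length(time)
  sapply(t.eval, function(s) {
    ti <- time[status == 1 & time <= s]              # event times up to s
    if (length(ti) == 0) return(1)
    n.i <- sapply(ti, function(u) sum(time >= u))    # number at risk at each event
    (1 + sum(((n.i - 1) / n)^(-alpha) - (n.i / n)^(-alpha)))^(-1 / alpha)
  })
}

## e.g., survival curve of the good prognosis group in the testing set:
## S.good <- CG.estimator(t.good, d.good, alpha = 18, t.eval = sort(t.good))
```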

The separation of the two curves in Fig. 5.3 is measured by the average vertical difference between the survival curves over the study period. This statistic can be regarded as a scaled version of the area between the two survival curves. It is also equivalent to a special case of the weighted Kaplan–Meier statistics (Pepe and Fleming 1989). When using this statistic, the choice of the study period strongly influences the test results. The common choice is the period where at least one survivor exists in both groups (Chap. 2; Klein and Moeschberger 2003). The study period is depicted in Fig. 5.3.

The P-value for testing the difference between the two groups is obtained using the permutation test (Frankel et al. 2007). In each permutation, a good prognosis group (n = 31) and a poor prognosis group (n = 31) are randomly allocated from the 62 testing samples, and the CG estimator is computed for each group. For each permutation, the study period is determined and the average vertical difference between the two CG estimators is calculated. The P-value is computed as the proportion of 10,000 permuted test statistics exceeding the original test statistic.
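The following R sketch implements this permutation test, reusing CG.estimator() from the previous sketch. The way the study period and the average are discretized here (averaging over the pooled observed times up to the shorter of the two last observed times) is a simplification of the definition above, and the function names are ours.

```r
## Average vertical difference between two CG curves over a common study period.
avg.diff <- function(time1, status1, time2, status2, alpha) {
  grid <- sort(unique(c(time1, time2)))
  grid <- grid[grid <= min(max(time1), max(time2))]   # simplified study period
  mean(abs(CG.estimator(time1, status1, alpha, grid) -
           CG.estimator(time2, status2, alpha, grid)))
}

## Permutation test: group = 1 (good prognosis) or 2 (poor prognosis).
perm.test <- function(time, status, group, alpha, B = 10000) {
  obs <- avg.diff(time[group == 1], status[group == 1],
                  time[group == 2], status[group == 2], alpha)
  perm <- replicate(B, {
    g <- sample(group)                                # random re-allocation
    avg.diff(time[g == 1], status[g == 1], time[g == 2], status[g == 2], alpha)
  })
  list(statistic = obs, P.value = mean(perm >= obs))  # proportion exceeding obs
}
```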

The two curves are significantly separated between the good and poor prognosis groups (average difference = 0.224; P-value = 0.021). This result supports the predictive ability of the PI derived from the copula-based approach.

5.7 Discussions

We have introduced copula-based approaches for selecting genes and making survival predictions in the presence of dependent censoring. The method can be flexibly applied to accommodate different copulas, such as the Clayton, Gumbel, and FGM copulas. Due to its mathematical simplicity, we prefer the Clayton copula to other copulas for modeling the dependence structure between survival and censoring times. However, the effect of dependent censoring on estimates can be remarkably different between different copulas (Chap. 3). Rivest and Wells (2001) theoretically explored the sensitivity of the estimated marginal survival function to the choice of copula.

Due to the inherent problem of the non-identifiability of competing risks data (Tsiatis 1975), it is not easy to identify the degree of dependence (i.e., the true copula parameter) between survival and censoring times. The problem arises because the likelihood function contains little information to identify the true copula parameter. Alternatively, we choose the copula parameter by using a cross-validated c-index, a predictive measure free from the likelihood criterion. This method exhibited sound performance in our numerical analyses. Unfortunately, we do not have a theoretical justification of the method, such as consistency. Recently, Emura and Michimae (2017) proposed a goodness-of-fit procedure to test the assumption of the correct copula under competing risks. According to their simulation results, their approach has a certain ability to identify the correct copula when the sample size is large. However, their approach has not been extended to include covariates.

After relevant genes are selected, researchers often use them to stratify patients into good and poor prognosis groups in validation samples. This is a common strategy to assess the prediction performance of the selected genes. Researchers typically use the log-rank test to see how well the Kaplan–Meier survival curves are separated between the good and poor groups. Note that these commonly used validation strategies may give biased results if dependent censoring exists in the validation samples. Copulas can be used to adjust for this bias by replacing the Kaplan–Meier estimator with the copula-graphic estimator. Since the log-rank test is no longer valid in the presence of dependent censoring, we apply the permutation test based on the average vertical difference between the copula-graphic estimators. For the purpose of constructing survival forests, Moradian et al. (2017) also suggested the copula-graphic estimator to measure the difference between two groups under dependent censoring.

One potential drawback of the proposed gene selection method is that it needs to impose a proportional hazards model on the censoring distribution in Eq. (5.3). On the other hand, the traditional univariate Cox regression does not require any model assumption on the censoring distribution. This elimination of the model assumption is a consequence of the independent censoring assumption. Once the independent censoring assumption is relaxed, certain model specifications for the censoring distribution appear to be mandatory (e.g., Siannis et al. 2005; Chen 2010). If the research interest lies in the effect of genes on both survival time and censoring time, the proportional hazards model for the censoring distribution may provide useful information. For instance, researchers may be interested in selecting genes associated with both disease-specific survival and time-to-death due to other causes, as in the competing risks setting (Escarela and Carrière 2003).